Commit 612e27e

refactor: improve experimental source code pattern analysis of pypi packages (#965)
Include support for using Semgrep for analysis of source code to detect malicious code patterns, specified using Semgrep's YAML files. Signed-off-by: Carl Flottmann <[email protected]>
1 parent 1c65d5f commit 612e27e

35 files changed (+2245, -720 lines changed)
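The commit message says the malicious code patterns are "specified using Semgrep's YAML files". As a rough illustration only (none of this is taken from the commit), the sketch below shows what a minimal Semgrep rule file looks like and how a scan could be invoked from Python. The rule ID, message, pattern, and the helper names `build_semgrep_command` and `scan` are all hypothetical; only the `semgrep scan --config ... --json` command line and the top-level `results` key of the JSON output are standard Semgrep behaviour.

```python
import json
import shutil
import subprocess
from pathlib import Path

# Hypothetical rule, written for illustration only; it is not one of Macaron's rules.
EXAMPLE_RULE = """\
rules:
  - id: example-exec-of-decoded-payload
    languages: [python]
    severity: WARNING
    message: exec() called on base64-decoded data
    pattern: exec(base64.b64decode(...))
"""


def build_semgrep_command(rules_path: Path, target: Path) -> list[str]:
    """Build a Semgrep CLI invocation that prints findings as JSON."""
    return ["semgrep", "scan", "--config", str(rules_path), "--json", str(target)]


def scan(rules_path: Path, target: Path) -> list[dict]:
    """Run Semgrep if it is installed and return its list of findings."""
    if shutil.which("semgrep") is None:
        return []  # Semgrep is not available; nothing to report.
    proc = subprocess.run(
        build_semgrep_command(rules_path, target),
        capture_output=True,
        text=True,
        check=False,
    )
    try:
        return json.loads(proc.stdout).get("results", [])
    except json.JSONDecodeError:
        return []  # Semgrep failed before producing JSON output.
```

Writing `EXAMPLE_RULE` to a `.yaml` file and passing its path as `rules_path` would be enough to try the rule against a directory of Python sources.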

.pre-commit-config.yaml

Lines changed: 18 additions & 0 deletions
@@ -30,6 +30,7 @@ repos:
   - id: isort
     name: Sort import statements
     args: [--settings-path, pyproject.toml]
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*

 # Add Black code formatters.
 - repo: https://github.com/ambv/black
@@ -38,6 +39,7 @@ repos:
   - id: black
     name: Format code
     args: [--config, pyproject.toml]
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
 - repo: https://github.com/asottile/blacken-docs
   rev: 1.19.1
   hooks:
@@ -65,6 +67,7 @@ repos:
     files: ^src/macaron/|^tests/
     types: [text, python]
     additional_dependencies: [flake8-bugbear==22.10.27, flake8-builtins==2.0.1, flake8-comprehensions==3.10.1, flake8-docstrings==1.6.0, flake8-mutable==1.2.0, flake8-noqa==1.4.0, flake8-pytest-style==1.6.0, flake8-rst-docstrings==0.3.0, pep8-naming==0.13.2]
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
     args: [--config, .flake8]

 # Check GitHub Actions workflow files.
@@ -82,6 +85,7 @@ repos:
     entry: pylint
     language: python
     files: ^src/macaron/|^tests/
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
     types: [text, python]
     args: [--rcfile, pyproject.toml]

@@ -94,6 +98,7 @@ repos:
     language: python
     files: ^src/macaron/|^tests/
     types: [text, python]
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
     args: [--show-traceback, --config-file, pyproject.toml]

 # Check for potential security issues.
@@ -106,6 +111,7 @@ repos:
     files: ^src/macaron/|^tests/
     types: [text, python]
     additional_dependencies: ['bandit[toml]']
+    exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*

 # Enable a whole bunch of useful helper hooks, too.
 # See https://pre-commit.com/hooks.html for more hooks.
@@ -197,6 +203,18 @@ repos:
     always_run: true
     pass_filenames: false

+# Checks that tests/malware_analyzer/pypi/resources/sourcecode_samples files do not have executable permissions
+# This is another measure to make sure the files can't be accidentally executed
+- repo: local
+  hooks:
+  - id: sourcecode-sample-permissions
+    name: Sourcecode sample executable permissions checker
+    entry: scripts/dev_scripts/samples_permissions_checker.sh
+    language: system
+    always_run: true
+    pass_filenames: false
+
+
 # A linter for Golang
 - repo: https://github.com/golangci/golangci-lint
   rev: v1.64.6
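The `files` and `exclude` values in the hooks above are Python regular expressions that pre-commit applies to repository-relative paths with `re.search`. The sketch below reproduces that filtering for the two patterns used in this file, to show which paths the linters still receive. The helper `is_checked` and the sample and test file names are hypothetical.

```python
import re

# Patterns copied from the hook definitions above.
FILES = re.compile(r"^src/macaron/|^tests/")
EXCLUDE = re.compile(r"^tests/malware_analyzer/pypi/resources/sourcecode_samples.*")


def is_checked(path: str) -> bool:
    """Return True if a hook with these patterns would be given the path."""
    return bool(FILES.search(path)) and not EXCLUDE.search(path)


# Hypothetical paths, used only to exercise the patterns.
for candidate in (
    "src/macaron/__main__.py",
    "tests/malware_analyzer/pypi/test_example.py",
    "tests/malware_analyzer/pypi/resources/sourcecode_samples/example/sample.py",
):
    print(candidate, is_checked(candidate))
```

Because the exclude pattern has no trailing `/`, it would also match a sibling directory whose name merely starts with `sourcecode_samples`.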

.semgrepignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+# Items added to this file will be ignored by Semgrep.

CONTRIBUTING.md

Lines changed: 4 additions & 0 deletions
@@ -72,6 +72,10 @@ See below for instructions to set up the development environment.
 - PRs should be merged using the `Squash and merge` strategy. In most cases a single commit with
   a detailed commit message body is preferred. Make sure to keep the `Signed-off-by` line in the body.

+### PyPI Malware Detection Contribution
+
+Please see the [README for the malware analyzer](./src/macaron/malware_analyzer/README.md) for information on contributing Heuristics and code patterns.
+
 ## Branching model

 * The `main` branch should be used as the base branch for pull requests. The `release` branch is designated for releases and should only be merged into when creating a new release for Macaron.

docker/Dockerfile.final

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ RUN : \
     && . .venv/bin/activate \
     && pip install --no-compile --no-cache-dir --upgrade pip setuptools \
     && find $HOME/dist -depth \( -type f \( -name "macaron-*.whl" \) \) -exec pip install --no-compile --no-cache-dir '{}' \; \
-    && pip uninstall semgrep \
+    && pip uninstall semgrep -y \
     && find $HOME/dist -depth \( -type f \( -name "semgrep-*.whl" \) \) -exec pip install --no-compile --no-cache-dir '{}' \; \
     && rm -rf $HOME/dist \
     && deactivate

docs/source/pages/developers_guide/apidoc/macaron.malware_analyzer.pypi_heuristics.sourcecode.rst

Lines changed: 8 additions & 0 deletions
@@ -9,6 +9,14 @@ macaron.malware\_analyzer.pypi\_heuristics.sourcecode package
 Submodules
 ----------

+macaron.malware\_analyzer.pypi\_heuristics.sourcecode.pypi\_sourcecode\_analyzer module
+---------------------------------------------------------------------------------------
+
+.. automodule:: macaron.malware_analyzer.pypi_heuristics.sourcecode.pypi_sourcecode_analyzer
+   :members:
+   :undoc-members:
+   :show-inheritance:
+
 macaron.malware\_analyzer.pypi\_heuristics.sourcecode.suspicious\_setup module
 ------------------------------------------------------------------------------

pyproject.toml

Lines changed: 6 additions & 5 deletions
@@ -37,6 +37,7 @@ dependencies = [
     "beautifulsoup4 >= 4.12.0,<5.0.0",
     "problog >= 2.2.6,<3.0.0",
     "cryptography >=44.0.0,<45.0.0",
+    "semgrep == 1.113.0",
 ]
 keywords = []
 # https://pypi.org/classifiers/
@@ -119,12 +120,14 @@ Issues = "https://github.com/oracle/macaron/issues"
 [tool.bandit]
 tests = []
 skips = ["B101"]
-
+exclude_dirs = ['tests/malware_analyzer/pypi/resources/sourcecode_samples']

 # https://github.com/psf/black#configuration
 [tool.black]
 line-length = 120
-
+force-exclude = '''
+tests/malware_analyzer/pypi/resources/sourcecode_samples/
+'''

 # https://github.com/commitizen-tools/commitizen
 # https://commitizen-tools.github.io/commitizen/bump/
@@ -170,7 +173,6 @@ exclude = [
     "SECURITY.md",
 ]

-
 # https://pycqa.github.io/isort/
 [tool.isort]
 profile = "black"
@@ -181,7 +183,6 @@ skip_gitignore = true

 # https://mypy.readthedocs.io/en/stable/config_file.html#using-a-pyproject-toml
 [tool.mypy]
-# exclude=
 show_error_codes = true
 show_column_numbers = true
 check_untyped_defs = true
@@ -209,7 +210,6 @@ module = [
 ]
 ignore_missing_imports = true

-
 # https://pylint.pycqa.org/en/latest/user_guide/configuration/index.html
 [tool.pylint.MASTER]
 fail-under = 10.0
@@ -261,6 +261,7 @@ addopts = """-vv -ra --tb native \
     --doctest-modules --doctest-continue-on-failure --doctest-glob '*.rst' \
     --cov macaron \
     --ignore tests/integration \
+    --ignore tests/malware_analyzer/pypi/resources/sourcecode_samples \
 """ # Consider adding --pdb
 # https://docs.python.org/3/library/doctest.html#option-flags
 doctest_optionflags = "IGNORE_EXCEPTION_DETAIL"

scripts/dev_scripts/samples_permissions_checker.sh

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+#!/usr/bin/env bash
+
+# Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
+# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
+
+#
+# Checks if the files in tests/malware_analyzer/pypi/resources/sourcecode_samples have executable permissions,
+# failing if any do.
+#
+
+# Strict bash options.
+#
+# -e: exit immediately if a command fails (with non-zero return code),
+# or if a function returns non-zero.
+#
+# -u: treat unset variables and parameters as error when performing
+# parameter expansion.
+# In case a variable ${VAR} is unset but we still need to expand,
+# use the syntax ${VAR:-} to expand it to an empty string.
+#
+# -o pipefail: set the return value of a pipeline to the value of the last
+# (rightmost) command to exit with a non-zero status, or zero
+# if all commands in the pipeline exit successfully.
+#
+# Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html.
+set -euo pipefail
+
+MACARON_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && cd ../.. && pwd)"
+SAMPLES_PATH="${MACARON_DIR}/tests/malware_analyzer/pypi/resources/sourcecode_samples"
+
+# any files have any of the executable bits set
+executables=$( ( find "$SAMPLES_PATH" -type f -perm -u+x -o -type f -perm -g+x -o -type f -perm -o+x | sed "s|$MACARON_DIR/||"; git ls-files "$SAMPLES_PATH" --full-name) | sort | uniq -d)
+if [ -n "$executables" ]; then
+    echo "The following files should not have any executable permissions:"
+    echo "$executables"
+    exit 1
+fi
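The core of the script above is the test for "any executable bit set" on regular files. A minimal Python sketch of that same test follows, assuming a POSIX file system. It is not part of the commit, the function name `find_executables` is hypothetical, and unlike the script it does not intersect the result with `git ls-files`, so untracked files are reported too.

```python
import stat
from pathlib import Path

# Owner, group, and other execute bits, matching the three -perm tests in the script.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def find_executables(root: Path) -> list[Path]:
    """Return regular files under root that have at least one executable bit set."""
    return sorted(
        path
        for path in root.rglob("*")
        # Skip symlinks so that only real files are inspected, as 'find -type f' does.
        if path.is_file() and not path.is_symlink() and path.stat().st_mode & EXEC_BITS
    )
```

A caller could fail a check in the same way as the script by exiting with a non-zero status whenever the returned list is not empty.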

src/macaron/__main__.py

Lines changed: 20 additions & 3 deletions
@@ -96,6 +96,10 @@ def analyze_slsa_levels_single(analyzer_single_args: argparse.Namespace) -> None

         global_config.local_maven_repo = user_provided_local_maven_repo

+    if analyzer_single_args.force_analyze_source and not analyzer_single_args.analyze_source:
+        logger.error("'--force-analyze-source' requires '--analyze-source'.")
+        sys.exit(os.EX_USAGE)
+
     analyzer = Analyzer(global_config.output_path, global_config.build_log_path)

     # Initiate reporters.
@@ -172,8 +176,9 @@ def analyze_slsa_levels_single(analyzer_single_args: argparse.Namespace) -> None
         analyzer_single_args.sbom_path,
         deps_depth,
         provenance_payload=prov_payload,
-        validate_malware=analyzer_single_args.validate_malware,
         verify_provenance=analyzer_single_args.verify_provenance,
+        analyze_source=analyzer_single_args.analyze_source,
+        force_analyze_source=analyzer_single_args.force_analyze_source,
     )
     sys.exit(status_code)

@@ -477,10 +482,22 @@ def main(argv: list[str] | None = None) -> None:
     )

     single_analyze_parser.add_argument(
-        "--validate-malware",
+        "--analyze-source",
         required=False,
         action="store_true",
-        help=("Enable malware validation."),
+        help=(
+            "For improved malware detection, analyze the source code of the"
+            + " (PyPI) package using a textual scan and dataflow analysis."
+        ),
+    )
+
+    single_analyze_parser.add_argument(
+        "--force-analyze-source",
+        required=False,
+        action="store_true",
+        help=(
+            "Forces PyPI sourcecode analysis to run regardless of other heuristic results. Requires '--analyze-source'."
+        ),
     )

     single_analyze_parser.add_argument(
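The dependency between the two new flags can be tried out in isolation. The sketch below is a stand-alone approximation, not Macaron's code: the parser and the function `parse_args` are hypothetical, and it reports the problem through `parser.error` (which prints the usage message and exits with status 2) rather than through `logger.error` and `sys.exit(os.EX_USAGE)` as the commit does.

```python
import argparse


def parse_args(argv: list[str]) -> argparse.Namespace:
    """Parse the two source-analysis flags and enforce their dependency."""
    parser = argparse.ArgumentParser(prog="example")
    parser.add_argument("--analyze-source", required=False, action="store_true")
    parser.add_argument("--force-analyze-source", required=False, action="store_true")
    args = parser.parse_args(argv)

    # '--force-analyze-source' only makes sense when source analysis is enabled.
    if args.force_analyze_source and not args.analyze_source:
        parser.error("'--force-analyze-source' requires '--analyze-source'.")
    return args
```

Calling `parse_args(["--force-analyze-source"])` raises `SystemExit`, while passing both flags returns a namespace with both attributes set to `True`.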

src/macaron/config/defaults.ini

Lines changed: 24 additions & 0 deletions
@@ -611,3 +611,27 @@ scaling = 0.15
 cost = 1.0
 # The path to the file that contains the list of popular packages.
 popular_packages_path =
+
+# ==== The following sections are for source code analysis using Semgrep ====
+# rulesets: a reference to a 'ruleset' in this section refers to a Semgrep .yaml file containing one or more rules.
+# rules: a reference to a 'rule' in this section refers to an individual rule ID, specified by the '- id:' field in
+# the Semgrep .yaml file.
+# default rulesets: these are a collection of rulesets provided with Macaron which are run by default with the sourcecode
+# analyzer. These live in src/macaron/resources/pypi_malware_rules.
+# custom rulesets: this is a collection of user-provided rulesets, living inside the path provided to 'custom_semgrep_rules_path'.
+
+# disable default semgrep rulesets here (i.e. all rule IDs in a Semgrep .yaml file) using ruleset names, the name
+# without the .yaml extension. Currently, we disable the exfiltration rulesets by default due to a high false positive rate.
+# This list may not contain duplicated elements. Macaron's default ruleset names are all unique.
+disabled_default_rulesets = exfiltration
+# disable individual rules here (i.e. individual rule IDs inside a Semgrep .yaml file) using rule IDs. You may also
+# provide the IDs of your custom semgrep rules here too, as all Semgrep rule IDs must be unique. This list may not contain
+# duplicated elements.
+disabled_rules =
+# absolute path to a directory where a custom set of semgrep rules for source code analysis are stored. These will be included
+# with Macaron's default rules. The path will be normalised to the OS path type.
+custom_semgrep_rules_path =
+# disable custom semgrep rulesets here (i.e. all rule IDs in a Semgrep .yaml file) using ruleset names, the name without the
+# .yaml extension. Note, this will be ignored if a path to custom semgrep rules is not provided. This list may not contain
+# duplicated elements, meaning that ruleset names must be unique.
+disabled_custom_rulesets =
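To make the ruleset naming convention in the comments above concrete, here is a small sketch of how such options could be read and applied. It only illustrates the described semantics (ruleset name equals the file name without `.yaml`, lists contain no duplicates) and is not Macaron's implementation. The section name `example.section`, the helper names, the whitespace-separated list format, and the file name `obfuscation.yaml` are all assumptions.

```python
import configparser
from pathlib import Path

# Assumption: the section name is a placeholder; the hunk does not show which
# section of defaults.ini these options belong to.
EXAMPLE_INI = """
[example.section]
disabled_default_rulesets = exfiltration
disabled_rules =
custom_semgrep_rules_path =
disabled_custom_rulesets =
"""


def parse_unique_list(raw: str) -> list[str]:
    """Split a whitespace-separated option value and reject duplicated elements."""
    items = raw.split()
    if len(items) != len(set(items)):
        raise ValueError(f"duplicated elements in list: {raw!r}")
    return items


def enabled_rulesets(ruleset_files: list[str], disabled: list[str]) -> list[str]:
    """Keep the ruleset files whose name without the extension is not disabled."""
    return [name for name in ruleset_files if Path(name).stem not in disabled]


config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)
section = config["example.section"]
disabled = parse_unique_list(section.get("disabled_default_rulesets", ""))
# Hypothetical ruleset file names, used only to demonstrate the filtering.
print(enabled_rulesets(["exfiltration.yaml", "obfuscation.yaml"], disabled))
```

With the defaults shown in the diff, only the hypothetical `obfuscation.yaml` ruleset would remain enabled.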

src/macaron/errors.py

Lines changed: 4 additions & 0 deletions
@@ -109,3 +109,7 @@ class HeuristicAnalyzerValueError(MacaronError):

 class LocalArtifactFinderError(MacaronError):
     """Happens when there is an error looking for local artifacts."""
+
+
+class SourceCodeError(MacaronError):
+    """Error for operations on package source code."""
