feat: add a check to analyze malicious Python packages #750

Merged
merged 47 commits into from
Jul 4, 2024
56470f8
chore: Add PyPI heuristics check
Yao-Wen-Chang May 26, 2024
c48712a
chore: change check name
Yao-Wen-Chang May 29, 2024
3973606
chore: add the suspicious result combo
Yao-Wen-Chang May 29, 2024
5d95aab
chore: add detail info to DB table
Yao-Wen-Chang May 30, 2024
08dee0b
chore: add heuristic fail list to table
Yao-Wen-Chang May 30, 2024
c12c681
chore: modify the DB table and remove unnecessary function
Yao-Wen-Chang May 31, 2024
f3125cb
chore: modify temp directory create method
Yao-Wen-Chang May 31, 2024
e73d13a
chore: update bs4 dependency and refine heuristic results and depends…
Yao-Wen-Chang May 31, 2024
84b2d65
chore: remove redundant function
Yao-Wen-Chang May 31, 2024
fd938d6
chore: add tests for EmptyProjectLinkAnalyzer
Yao-Wen-Chang Jun 8, 2024
fef557e
chore: fix depends_on error
Yao-Wen-Chang Jun 8, 2024
9c99c95
chore: add tests for EmptyOneReleaseAnalyzer
Yao-Wen-Chang Jun 8, 2024
b515db9
chore: fix incorrect return project links
Yao-Wen-Chang Jun 9, 2024
ffa2fd4
chore: add tests for UnreachableProjectLinks
Yao-Wen-Chang Jun 9, 2024
25a4cef
chore: handle strptime errors and improve upload_time extraction
Yao-Wen-Chang Jun 11, 2024
e0353a7
chore: update description of empty project link
Yao-Wen-Chang Jun 11, 2024
3e7d8a3
chore: remove unnecessary function and update docstring and type anno…
Yao-Wen-Chang Jun 12, 2024
18d05fd
chore: refine should_skip function and update function name
Yao-Wen-Chang Jun 12, 2024
8d915ce
chore: remove the any type
Yao-Wen-Chang Jun 12, 2024
1ddde49
chore: add depends_on description and refine type annotation
Yao-Wen-Chang Jun 12, 2024
4458197
chore: modify enum name
Yao-Wen-Chang Jun 12, 2024
fc973e1
chore: refine maintainer join date return None
Yao-Wen-Chang Jun 12, 2024
f278feb
chore: refine base analyzer
Yao-Wen-Chang Jun 12, 2024
4e4a903
chore: refine one release analyzer and modify base analyzer name
Yao-Wen-Chang Jun 12, 2024
6f8ff8e
chore: refine unchanged releases heuristic
Yao-Wen-Chang Jun 12, 2024
5f3724d
chore: implement PyPI registry
Yao-Wen-Chang Jun 14, 2024
b9b07ee
chore: add handler for strptime
Yao-Wen-Chang Jun 14, 2024
965a7f9
chore: modify test of empty project link analyzer
Yao-Wen-Chang Jun 14, 2024
c73beea
chore: modify test of one release analyzer and refine empty project l…
Yao-Wen-Chang Jun 14, 2024
98c04d2
chore: refine test and fix bug
Yao-Wen-Chang Jun 14, 2024
a588762
chore: move heuristics check to another folder
Yao-Wen-Chang Jun 15, 2024
23fcc52
chore: change module imported path
Yao-Wen-Chang Jun 16, 2024
cb0ae1a
chore: fix project link test
Yao-Wen-Chang Jun 16, 2024
4b6b25d
chore: refine requests handler for source code download
Yao-Wen-Chang Jun 17, 2024
07c6729
chore: fix source file name extraction
Yao-Wen-Chang Jun 17, 2024
9af8138
chore: modify HEURISTIC to Heuristics
Yao-Wen-Chang Jun 17, 2024
66fe586
chore: update bs4 dep configuration
Yao-Wen-Chang Jun 19, 2024
0ce67b9
chore: implement test for suspicious_setup heuristic
Yao-Wen-Chang Jun 20, 2024
272b4a7
chore: refine docstrings for HeuristicResult enum
Yao-Wen-Chang Jun 21, 2024
de7fc72
chore: handle the package with one release
Yao-Wen-Chang Jun 22, 2024
bc2d921
chore: refine the pypi registry docstring
Yao-Wen-Chang Jun 22, 2024
58243a4
chore: refine heuristics enum
Yao-Wen-Chang Jun 25, 2024
cbd409b
chore: add and refine tests for heuristics analysis
Yao-Wen-Chang Jun 25, 2024
713fc12
chore: refine docstrings and fix confidence logic of the heuristics
Yao-Wen-Chang Jun 25, 2024
8fd511b
chore: add the suspicious combo to detect unreachable links
Yao-Wen-Chang Jun 30, 2024
1a54ed8
chore: add docstrings for suspicious combo and modify confidence
Yao-Wen-Chang Jul 3, 2024
bd1e010
chore: remove suspicious combo
Yao-Wen-Chang Jul 4, 2024
2 changes: 2 additions & 0 deletions pyproject.toml
@@ -34,6 +34,7 @@ dependencies = [
"jsonschema >= 4.22.0,<5.0.0",
"cyclonedx-bom >=4.0.0,<5.0.0",
"cyclonedx-python-lib[validation] >=7.3.4,<8.0.0",
"beautifulsoup4 >= 4.12.0,<5.0.0",
]
keywords = []
# https://pypi.org/classifiers/
@@ -74,6 +75,7 @@ dev = [
"pip-audit >=2.5.6,<3.0.0",
"pylint >=3.0.3,<4.0.0",
"cyclonedx-bom >=4.0.0,<5.0.0",
"types-beautifulsoup4 >= 4.12.0,<5.0.0",
]
docs = [
"sphinx >=7.0.0,<8.0.0",
10 changes: 10 additions & 0 deletions src/macaron/config/defaults.ini
@@ -512,6 +512,10 @@ hostname = registry.npmjs.org
attestation_endpoint = -/npm/v1/attestations
request_timeout = 20

[package_registry.pypi]
request_timeout = 20
hostname = pypi.org

# Configuration options for selecting the checks to run.
# Both the exclude and include are defined as list of strings:
# - The exclude list is used to specify the checks that will not run.
@@ -547,3 +551,9 @@ request_timeout = 20
exclude =
# By default, we run all checks available.
include = *

[heuristic.pypi]
releases_frequency_threshold = 2
# The gap threshold, in days.
# The timedelta is the gap between the date the maintainer registered their PyPI account and the date of the latest release.
timedelta_threshold_of_join_release = 5
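A minimal sketch of how the `timedelta_threshold_of_join_release` option above might be read and applied, using only the standard library. The helper names `load_join_release_threshold` and `is_join_date_suspicious` and the inline config string are illustrative assumptions, not Macaron's actual config loader or analyzer, and the threshold is assumed to be a number of days, as the README in this change describes.

```python
from configparser import ConfigParser
from datetime import datetime, timedelta

# Inline copy of the section added to defaults.ini in this change.
DEFAULTS = """
[heuristic.pypi]
releases_frequency_threshold = 2
timedelta_threshold_of_join_release = 5
"""


def load_join_release_threshold(ini_text: str) -> timedelta:
    """Read the join/release gap threshold (assumed to be expressed in days)."""
    parser = ConfigParser()
    parser.read_string(ini_text)
    days = parser.getint("heuristic.pypi", "timedelta_threshold_of_join_release")
    return timedelta(days=days)


def is_join_date_suspicious(
    joined: datetime, latest_release: datetime, threshold: timedelta
) -> bool:
    """Hypothetical helper: True when the latest release follows registration too closely."""
    return latest_release - joined < threshold


threshold = load_join_release_threshold(DEFAULTS)

# Account registered two days before the latest release: inside the 5-day window.
print(is_join_date_suspicious(datetime(2024, 6, 1), datetime(2024, 6, 3), threshold))
# Account registered months before the latest release: outside the window.
print(is_join_date_suspicious(datetime(2024, 1, 1), datetime(2024, 6, 3), threshold))
```

Keeping the threshold in configuration rather than in code lets it be tuned without touching the analyzer.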
67 changes: 67 additions & 0 deletions src/macaron/malware_analyzer/README.md
@@ -0,0 +1,67 @@
# Implementation of Heuristic Malware Detector

## Check

We schedule the heuristics sequentially:

1. **Empty Project Link**: Checks whether the package declares any project links (e.g., documentation, Git repositories).
If it does, the analyzer then runs the `Unreachable Project Links` heuristic to check whether all of the project links are unreachable.
2. **One Release**: Checks if there is only one release of the package. If the package contains multiple
releases, the checker will further check the release frequency through `High Release Frequency` and
`Unchanged Release` to see if the maintainers release multiple times in a short timeframe (threshold) and
whether the released contents are identical.
3. **Closer Release Join Date**: Considers the date when the maintainer registered their account (if
available). The checker will calculate the gap between the latest release date and the maintainer's account
registration date.
4. **Suspicious Setup**: Checks whether the `setup.py` includes suspicious imports, such as `base64` for
encoding (obfuscating) payloads and `requests` for data exfiltration.
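
The sequential scheduling above can be sketched as an ordered table of heuristics, each optionally gated on the result of an earlier one. This is an illustration under assumptions: the `Result` enum, the `SCHEDULE` table, and `run_heuristics` are hypothetical names, the gating conditions are read off the list above (follow-up heuristics run only when their parent heuristic passes), and the analyzers themselves are stubbed with plain callables.

```python
from enum import Enum
from typing import Callable


class Result(Enum):
    PASS = "pass"  # non-suspicious (TRUE in the README's terms)
    FAIL = "fail"  # suspicious (FALSE in the README's terms)
    SKIP = "skip"  # not run: metadata missing or dependency not satisfied


# (heuristic name, heuristic it depends on, parent result required to run)
# The dependency edges are an assumption based on the list above.
SCHEDULE = [
    ("empty_project_link", None, None),
    ("unreachable_project_links", "empty_project_link", Result.PASS),
    ("one_release", None, None),
    ("high_release_frequency", "one_release", Result.PASS),
    ("unchanged_release", "one_release", Result.PASS),
    ("closer_release_join_date", None, None),
    ("suspicious_setup", None, None),
]


def run_heuristics(analyzers: dict[str, Callable[[], Result]]) -> dict[str, Result]:
    """Run heuristics in order, skipping any whose parent did not give the required result."""
    results: dict[str, Result] = {}
    for name, parent, required in SCHEDULE:
        if parent is not None and results.get(parent) is not required:
            results[name] = Result.SKIP
            continue
        results[name] = analyzers[name]()
    return results


# Stub analyzers for a package with no project links and a single release.
stubs = {name: (lambda: Result.PASS) for name, _, _ in SCHEDULE}
stubs["empty_project_link"] = lambda: Result.FAIL
stubs["one_release"] = lambda: Result.FAIL

results = run_heuristics(stubs)
for name, result in results.items():
    print(f"{name}: {result.name}")
```

Expressing the dependencies as data keeps the run order and the skip logic in one place, so adding a heuristic means adding one row.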

## Supported Ecosystem: PyPI

Seven heuristics are defined. Each returns `FALSE` (suspicious), `TRUE` (non-suspicious), or `SKIP` (required metadata is missing, so the checker skips the heuristic).

1. **Empty Project Link**
- **Description**: Checks whether the package contains any project links (e.g., documentation or Git
repositories). Many malicious packages do not include any project link.
- **Rule**: Return `FALSE` when there are no project links; otherwise, return `TRUE`.

2. **Unreachable Project Links**
- **Description**: Checks the accessibility of the project links. This is considered an auxiliary
heuristic, since no observed cases have triggered it so far.
- **Rule**: Return `FALSE` if all project links are not reachable; otherwise, return `TRUE`.

3. **One Release**
- **Description**: Checks whether the package has only one release.
- **Rule**: Return `FALSE` if the package contains only one release; otherwise, return `TRUE`.

4. **High Release Frequency**
- **Description**: Checks if the package released multiple versions within a short period. We calculate
the release frequency and define a default frequency threshold of 2 days.
- **Rule**: Return `FALSE` if the frequency is higher than the threshold; otherwise, return `TRUE`.

5. **Unchanged Release**
- **Description**: Checks if the content of releases remains unchanged.
- **Rule**: Return `FALSE` if the content of releases is identical; otherwise, return `TRUE`.

6. **Closer Release Join Date**
- **Description**: Checks the gap between the date the maintainer registered their account and the date
of the latest release. A default threshold of 5 days is defined.
- **Rule**: Return `FALSE` if the gap is less than the threshold; otherwise, return `TRUE`.

7. **Suspicious Setup**
- **Description**: Checks the `setup.py` for suspicious imported modules and for suspicious
`install_requires` packages, which are installed during the package installation process. Two suspicious
keywords are defined as the blacklist.
- **Rule**: Return `FALSE` if the name of an imported module or required package contains a suspicious keyword; otherwise, return `TRUE`.
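
As an illustration of the Suspicious Setup heuristic, the sketch below parses a `setup.py` source string with the standard `ast` module, collects the imported module names, and matches them against a keyword list. The function names are hypothetical, the keyword list takes `base64` and `requests` from the examples in the Check section, and `install_requires` inspection is omitted for brevity; this is not the analyzer shipped in this change.

```python
import ast

# Assumed blacklist, taken from the examples given in the Check section.
SUSPICIOUS_KEYWORDS = ("base64", "requests")


def imported_modules(source: str) -> set[str]:
    """Collect module names from `import x` and `from x import y` statements."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


def suspicious_imports(source: str) -> list[str]:
    """Return the imported modules whose names contain a suspicious keyword."""
    return sorted(
        module
        for module in imported_modules(source)
        if any(keyword in module for keyword in SUSPICIOUS_KEYWORDS)
    )


# The source is only parsed, never executed, so setuptools need not be installed.
SETUP_PY = '''
import base64
from setuptools import setup
from requests import post

setup(name="demo", version="0.0.1")
'''

print(suspicious_imports(SETUP_PY))  # ['base64', 'requests']
```

Parsing rather than executing `setup.py` matters here: running an untrusted setup script would trigger exactly the behavior the heuristic is trying to detect.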

## Heuristics-Based Analyzer: Scanning 1167 Packages from Trusted Organizations

| Heuristic Name | Count |
| ------------------ | ----- |
| Lower Release | 102 |
| Empty Link | 45 |
| Links Missing | 24 |
| Frequent Release | 14 |
| Suspicious Setup | 5 |

**The result is used as a reference for the confidence score to lower the false positive rate.**
2 changes: 2 additions & 0 deletions src/macaron/malware_analyzer/__init__.py
@@ -0,0 +1,2 @@
# Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.
2 changes: 2 additions & 0 deletions src/macaron/malware_analyzer/checks/__init__.py
@@ -0,0 +1,2 @@
# Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.