Commit 4f20e9b

docs: include new heuristics in malware analyzer readme (#987)
include details for the anomalous version and wheel absence heuristics in the malware analyzer README, with docstring updates.
1 parent 58b1a8b commit 4f20e9b

7 files changed: +48 -48 lines changed

Lines changed: 32 additions & 32 deletions
@@ -1,60 +1,60 @@
 # Implementation of Heuristic Malware Detector

-## Check
+## PyPI Ecosystem

-We schedule the heuristics sequentially:
+Malware detection is achieved using a combination of metadata and source code heuristics. Certain combinations of the results of these heuristics are indicators of a malicious package.

-1. **Empty Project Link**: If the package contains project links (e.g., documentation, Git Repositories),
-   the analyzer will further operate the heuristic `Unreachable Project Links` to analyze if all the project links are unreachable.
-2. **One Release**: Checks if there is only one release of the package. If the package contains multiple
-   releases, the checker will further check the release frequency through `High Release Frequency` and
-   `Unchanged Release` to see if the maintainers release multiple times in a short timeframe (threshold), and
-   whether the contents of the releases are identical.
-3. **Closer Release Join Date**: Considers the date when the maintainer registered their account (if
-   available). The checker will calculate the gap between the latest release date and the maintainer's account
-   registration date.
-4. **Suspicious Setup**: Checks whether the `setup.py` includes suspicious imports, such as `base64` for
-   encryption and `requests` for data exfiltration.
-
-## Supported Ecosystem: PyPI
-
-Define Seven Heuristics: `False` means suspicious and `True` means benign. `SKIP` means some metadata is missing, and the checker will skip the heuristic.
+When a heuristic fails, with `HeuristicResult.FAIL`, then that is an indicator by that heuristic of suspicious behaviour. When a heuristic passes, with `HeuristicResult.PASS`, then that is an indicator of benign behavior. When a heuristic is skipped, returning `HeuristicResult.SKIP`, then this means that heuristic was not applicable to the package, due to either package details or dependencies on other heuristics. When a heuristic encounters a malformed package, a `HeuristicAnalyzerValueError` is raised. The following heuristics are currently run sequentially to gauge package maliciousness.

 1. **Empty Project Link**
    - **Description**: Checks whether the package contains any project links (e.g., documents or Git
-     Repositories). Many malicious activities do not include any project links.
-   - **Rule**: Return `FALSE` when there is only one project link; otherwise, return `TRUE`.
+     Repositories). Many malicious packages do not include any project links.
+   - **Rule**: Return `HeuristicResult.FAIL` when there are no project links; otherwise, return `HeuristicResult.PASS`.

 2. **Unreachable Project Links**
    - **Description**: Checks the accessibility of the project links. This is considered an auxiliary
      heuristic since no cases have met this heuristic.
-   - **Rule**: Return `FALSE` if all project links are unreachable; otherwise, return `TRUE`.
+   - **Rule**: Return `HeuristicResult.FAIL` if all project links are unreachable; otherwise, return `HeuristicResult.PASS`.
+   - **Dependency**: Will be run if the Empty Project Link heuristic passes.

 3. **One Release**
    - **Description**: Checks whether the package has only one release.
-   - **Rule**: Return `FALSE` if the package contains only one release; otherwise, return `TRUE`.
+   - **Rule**: Return `HeuristicResult.FAIL` if the package contains only one release; otherwise, return `HeuristicResult.PASS`.

 4. **High Release Frequency**
    - **Description**: Checks if the package released multiple versions within a short timeframe. We calculate
      the release frequency and define a default frequency threshold of 2 days.
-   - **Rule**: Return `FALSE` if the frequency is higher than the threshold; otherwise, return `TRUE`.
+   - **Rule**: Return `HeuristicResult.FAIL` if the frequency is higher than the threshold; otherwise, return `HeuristicResult.PASS`.
+   - **Dependency**: Will be run if the One Release heuristic passes.

 5. **Unchanged Release**
-   - **Description**: Checks if the content of releases remains unchanged.
-   - **Rule**: Return `FALSE` if the content of releases is identical; otherwise, return `TRUE`.
+   - **Description**: Checks if the content of releases remains unchanged using the `sha256` digest of the package source.
+   - **Rule**: Return `HeuristicResult.FAIL` if the content of any two releases is identical; otherwise, return `HeuristicResult.PASS`.
+   - **Dependency**: Will be run if the High Release Frequency heuristic fails.

 6. **Closer Release Join Date**
-   - **Description**: Checks the gap between the date the maintainer registered their account and the date
+   - **Description**: Checks the gap between the date the maintainer(s) registered their account and the date
      of the latest release. A default threshold of 5 days is defined.
-   - **Rule**: Return `FALSE` if the gap is less than the threshold; otherwise, return `TRUE`.
+   - **Rule**: Return `HeuristicResult.FAIL` if the gap is less than the threshold for any maintainer; otherwise, return `HeuristicResult.PASS`.

 7. **Suspicious Setup**
-   - **Description**: Checks the `setup.py` to see if there are suspicious imported modules, or
-     `install_requires` packages that are installed during the package installation process. We define two suspicious
-     keywords as the blacklist.
-   - **Rule**: Return `FALSE` if the package name contains suspicious keywords; otherwise, return `TRUE`.
+   - **Description**: Checks `setup.py` to see if there are suspicious imported modules, or
+     `install_requires` packages that are installed during the package installation process. Current blacklisted packages are `base64` and `requests`. This heuristic is skipped if no `setup.py` file can be found in the package.
+   - **Rule**: Return `HeuristicResult.FAIL` if the package name contains suspicious keywords; otherwise, return `HeuristicResult.PASS`.
+   - **Dependency**: Will be run if the Closer Release Join Date heuristic fails.
+
+8. **Wheel Absence**
+   - **Description**: Checks for the presence of a wheel (`.whl`) file distributed with the specified package release.
+   - **Rule**: Return `HeuristicResult.FAIL` if there is no wheel file present with that package release; otherwise, return `HeuristicResult.PASS`.
+
+9. **Anomalous Version**
+   - **Description**: Checks if the version number is abnormally high, checking the epoch and major version against threshold values. This does account for common date-based version number (calendar versioning) patterns.
+   - **Rule**: Return `HeuristicResult.FAIL` if the major or epoch is abnormally high; otherwise, return `HeuristicResult.PASS`.
+   - **Dependency**: Will be run if the One Release heuristic fails.
+
+### Confidence Score Motivation

-## Heuristics-Based Analyzer: Scanning 1167 Packages from Trusted Organizations
+The original seven heuristics which started this work were Empty Project Link, Unreachable Project Links, One Release, High Release Frequency, Unchange Release, Closer Release Join Date, and Suspicious Setup. These heuristics (excluding those with a dependency) were run on 1167 packages from trusted organizations, with the following results:

 | Heuristic Name | Count |
 |------------------| ----- |
@@ -64,4 +64,4 @@ Define Seven Heuristics: `False` means suspicious and `True` means benign. `SKIP
 | Frequent Release | 14 |
 | Suspicious Setup | 5 |

-**The result is used as a reference for the confidence score to lower the false positive rate.**
+These results were used as a reference for the confidence score provided in each suspicious combination.
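
The dependency rules listed in the README diff above (for example, High Release Frequency only runs when One Release passes) can be illustrated with a small scheduler. This is a minimal sketch, not Macaron's implementation: the `SCHEDULE` table, the snake_case heuristic names, and `run_heuristics` are hypothetical, and the local `HeuristicResult` enum only mirrors the three outcomes the README names.

```python
from enum import Enum
from typing import Callable, Optional


class HeuristicResult(Enum):
    """Local stand-in for the three outcomes named in the README."""

    PASS = "PASS"
    FAIL = "FAIL"
    SKIP = "SKIP"


# Hypothetical table in README order. Each entry is (heuristic name, dependency),
# where a dependency is (other heuristic, result that heuristic must have produced).
SCHEDULE: list[tuple[str, Optional[tuple[str, HeuristicResult]]]] = [
    ("empty_project_link", None),
    ("unreachable_project_links", ("empty_project_link", HeuristicResult.PASS)),
    ("one_release", None),
    ("high_release_frequency", ("one_release", HeuristicResult.PASS)),
    ("unchanged_release", ("high_release_frequency", HeuristicResult.FAIL)),
    ("closer_release_join_date", None),
    ("suspicious_setup", ("closer_release_join_date", HeuristicResult.FAIL)),
    ("wheel_absence", None),
    ("anomalous_version", ("one_release", HeuristicResult.FAIL)),
]


def run_heuristics(
    checks: dict[str, Callable[[], HeuristicResult]],
) -> dict[str, HeuristicResult]:
    """Run each check in order, skipping those whose dependency is not met."""
    results: dict[str, HeuristicResult] = {}
    for name, dependency in SCHEDULE:
        if dependency is not None:
            dep_name, required = dependency
            if results.get(dep_name) is not required:
                # The prerequisite produced a different result (or was skipped).
                results[name] = HeuristicResult.SKIP
                continue
        results[name] = checks[name]()
    return results
```

With every check stubbed to pass except `one_release`, the sketch skips `high_release_frequency` and `unchanged_release` but still runs `anomalous_version`, which matches the dependencies as the README states them.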

src/macaron/malware_analyzer/pypi_heuristics/metadata/anomalous_version.py

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ class AnomalousVersionAnalyzer(BaseHeuristicAnalyzer):
     """
     Analyze the version number (if there is only a single release) to detect if it is anomalous.

-    A version number is anomalous if any of its values are greater than the epoch, major, or minor threshold values.
+    A version number is anomalous if any of its values are greater than the epoch or major threshold values.
     If the version does not adhere to PyPI standards (PEP 440, as per the 'packaging' module), this heuristic
     cannot analyze it.
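
The updated docstring describes comparing the epoch and major components against thresholds. The sketch below illustrates that idea using only the standard library. It is an assumption-laden example: the threshold values and the `is_anomalous` helper are hypothetical, the regular expression covers only the `[N!]N(.N)*` prefix of a PEP 440 version rather than using the `packaging` module the docstring mentions, and the calendar-versioning allowance described in the README is deliberately left out.

```python
import re
from typing import Optional

# Hypothetical thresholds chosen for illustration only.
EPOCH_THRESHOLD = 3
MAJOR_THRESHOLD = 20

# Matches an optional "v", an optional "<epoch>!", then the major release number.
_VERSION_PREFIX = re.compile(r"^\s*v?(?:(?P<epoch>\d+)!)?(?P<major>\d+)")


def is_anomalous(version: str) -> Optional[bool]:
    """Return True if the epoch or major exceeds its threshold.

    Returns None when the string does not start like a PEP 440 version,
    mirroring the docstring's note that such versions cannot be analyzed.
    """
    match = _VERSION_PREFIX.match(version)
    if match is None:
        return None
    epoch = int(match.group("epoch") or 0)
    major = int(match.group("major"))
    return epoch > EPOCH_THRESHOLD or major > MAJOR_THRESHOLD
```

A real check would need the calendar-versioning exception before flagging a version such as `2024.5.1`, which this sketch would report as anomalous.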

src/macaron/malware_analyzer/pypi_heuristics/metadata/closer_release_join_date.py

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
+# Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

 """Analyzer checks whether the maintainers' join date closer to latest package's release date."""
@@ -14,7 +14,7 @@


 class CloserReleaseJoinDateAnalyzer(BaseHeuristicAnalyzer):
-    """Analyzer checks the heuristic.
+    """Check whether the maintainers' join date closer to package's latest release date.

     If any maintainer's date duration is larger than threshold, we consider it as "PASS".
     """
@@ -82,7 +82,7 @@ def _get_latest_release_date(self, pypi_package_json: PyPIPackageJsonAsset) -> d
         return parse_datetime(upload_time, datetime_format)

     def analyze(self, pypi_package_json: PyPIPackageJsonAsset) -> tuple[HeuristicResult, dict[str, JsonType]]:
-        """Check whether the maintainers' join date closer to package's latest release date.
+        """Analyze the package.

         Parameters
         ----------
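
For illustration, the gap comparison behind this heuristic can be sketched as follows. The helper name and signature are hypothetical, and the sketch follows the rule as worded in the README (suspicious when the gap is below the 5-day default threshold for any maintainer), which may differ in detail from what `analyze` actually does.

```python
from datetime import datetime, timedelta


def joined_close_to_release(
    join_dates: list[datetime],
    latest_release: datetime,
    threshold_days: int = 5,
) -> bool:
    """Return True when any maintainer joined within the threshold of the latest release."""
    threshold = timedelta(days=threshold_days)
    # Gap between the latest release and each maintainer's registration date.
    return any(latest_release - joined < threshold for joined in join_dates)
```

A `True` result here would correspond to `HeuristicResult.FAIL` under the README's rule.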

src/macaron/malware_analyzer/pypi_heuristics/metadata/empty_project_link.py

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
+# Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

 """Analyzer checks there is no project link of the package."""
@@ -10,13 +10,13 @@


 class EmptyProjectLinkAnalyzer(BaseHeuristicAnalyzer):
-    """Analyzer checks heuristic."""
+    """Check whether the PyPI package has no project links."""

     def __init__(self) -> None:
         super().__init__(name="empty_project_link_analyzer", heuristic=Heuristics.EMPTY_PROJECT_LINK, depends_on=None)

     def analyze(self, pypi_package_json: PyPIPackageJsonAsset) -> tuple[HeuristicResult, dict[str, JsonType]]:
-        """Check whether the PyPI package has no project link.
+        """Analyze the package.

         Parameters
         ----------
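
As a rough illustration of the check this analyzer performs, the sketch below tests for project links on a plain dictionary. It assumes the metadata follows the shape of PyPI's JSON API, where `info.project_urls` is a mapping of labels to URLs or null. The `has_project_links` helper is hypothetical and does not use `PyPIPackageJsonAsset`.

```python
from typing import Any


def has_project_links(package_json: dict[str, Any]) -> bool:
    """Return True when `info.project_urls` holds at least one link."""
    info = package_json.get("info") or {}
    project_urls = info.get("project_urls")  # mapping of label -> URL, or None
    return bool(project_urls)
```

Under the README's rule, a `False` result would map to `HeuristicResult.FAIL`.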

src/macaron/malware_analyzer/pypi_heuristics/metadata/high_release_frequency.py

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
+# Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

 """Analyzer checks the frequent release heuristic."""
@@ -17,7 +17,7 @@


 class HighReleaseFrequencyAnalyzer(BaseHeuristicAnalyzer):
-    """Analyzer checks heuristic."""
+    """Check whether the release frequency is high."""

     def __init__(self) -> None:
         super().__init__(
@@ -36,7 +36,7 @@ def _load_defaults(self) -> int:
         return 2

     def analyze(self, pypi_package_json: PyPIPackageJsonAsset) -> tuple[HeuristicResult, dict[str, JsonType]]:
-        """Check whether the release frequency is high.
+        """Analyze the package.

         Parameters
         ----------
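
One way to read "release frequency" against the 2-day default returned by `_load_defaults` is as the average gap between consecutive releases. The sketch below takes that interpretation as an assumption. Both helper functions are hypothetical, and the analyzer may compute the frequency differently.

```python
from datetime import datetime


def average_gap_days(upload_times: list[datetime]) -> float:
    """Average number of days between consecutive releases."""
    if len(upload_times) < 2:
        raise ValueError("at least two releases are needed to compute a gap")
    ordered = sorted(upload_times)
    gaps = [(later - earlier).total_seconds() / 86400 for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def releases_too_frequent(upload_times: list[datetime], threshold_days: float = 2) -> bool:
    """Return True when releases are, on average, less than the threshold apart."""
    return average_gap_days(upload_times) < threshold_days
```

The single-release case raises an error here because, per the README, this heuristic only runs when One Release passes, so at least two releases are expected.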

src/macaron/malware_analyzer/pypi_heuristics/metadata/one_release.py

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
+# Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.


@@ -11,13 +11,13 @@


 class OneReleaseAnalyzer(BaseHeuristicAnalyzer):
-    """Analyzer checks heuristic."""
+    """Determine if there is only one release of the package."""

     def __init__(self) -> None:
         super().__init__(name="one_release_analyzer", heuristic=Heuristics.ONE_RELEASE, depends_on=None)

     def analyze(self, pypi_package_json: PyPIPackageJsonAsset) -> tuple[HeuristicResult, dict[str, JsonType]]:
-        """Check the releases' total is one.
+        """Analyze the package.

         Parameters
         ----------

src/macaron/malware_analyzer/pypi_heuristics/sourcecode/suspicious_setup.py

Lines changed: 3 additions & 3 deletions
@@ -1,4 +1,4 @@
-# Copyright (c) 2024 - 2024, Oracle and/or its affiliates. All rights reserved.
+# Copyright (c) 2024 - 2025, Oracle and/or its affiliates. All rights reserved.
 # Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

 """This analyzer checks the suspicious pattern within setup.py."""
@@ -23,7 +23,7 @@


 class SuspiciousSetupAnalyzer(BaseHeuristicAnalyzer):
-    """Analyzer checks heuristic."""
+    """Check whether suspicious packages are imported in setup.py."""

     def __init__(self) -> None:
         super().__init__(
@@ -119,7 +119,7 @@ def _get_setup_source_code(self, pypi_package_json: PyPIPackageJsonAsset) -> str
             return file.read()

     def analyze(self, pypi_package_json: PyPIPackageJsonAsset) -> tuple[HeuristicResult, dict[str, JsonType]]:
-        """Analyze suspicious packages are imported in setup.py.
+        """Analyze the package.

         Parameters
         ----------
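
To make the import check concrete, the sketch below scans `setup.py` source text for imports of the two packages the README names as blacklisted (`base64` and `requests`), using the standard-library `ast` module. It is an illustrative assumption rather than the analyzer's code: `suspicious_imports` is a hypothetical helper, and it does not inspect `install_requires`.

```python
import ast

# The two packages the README lists as currently blacklisted.
SUSPICIOUS_PACKAGES = {"base64", "requests"}


def suspicious_imports(setup_source: str) -> set[str]:
    """Return the blacklisted top-level packages imported by the given source."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(setup_source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]  # "requests.adapters" -> "requests"
            if top_level in SUSPICIOUS_PACKAGES:
                found.add(top_level)
    return found
```

A non-empty result would correspond to `HeuristicResult.FAIL`. Parsing with `ast` rather than matching strings avoids false hits on comments or string literals that merely mention a package name.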
