|
| 1 | +# Implementation of Heuristic Malware Detector |
| 2 | + |
| 3 | +## Check |
| 4 | + |
| 5 | +We schedule the heuristics sequentially: |
| 6 | + |
| 7 | +1. **Empty Project Link**: Checks whether the package has any project links (e.g., documentation, Git repositories). |
| 8 | +If it does, the analyzer then runs the `Unreachable Project Links` heuristic to check whether all of the links are unreachable. |
| 9 | +2. **One Release**: Checks whether the package has only one release. If the package has multiple |
| 10 | +releases, the checker then runs `High Release Frequency` and `Unchanged Release` to see whether the |
| 11 | +maintainers published several releases within a short timeframe (the threshold) and |
| 12 | +whether the released contents are identical. |
| 13 | +3. **Closer Release Join Date**: Considers the date when the maintainer registered their account (if |
| 14 | +available). The checker will calculate the gap between the latest release date and the maintainer's account |
| 15 | +registration date. |
| 16 | +4. **Suspicious Setup**: Checks whether `setup.py` includes suspicious imports, such as `base64` for |
| 17 | +encoding (often used to obfuscate payloads) and `requests` for data exfiltration. |
| 18 | + |
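The scheduling order above can be sketched as follows. This is a minimal illustration, not the analyzer's actual code: the function name `run_heuristics`, the dictionary keys, and the choice to record a follow-up heuristic as `None` (standing in for `SKIP`) when its precondition is not met are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional

# True = non-suspicious, False = suspicious, None = SKIP (hypothetical encoding).
Outcome = Optional[bool]
Check = Callable[[], Outcome]


def run_heuristics(checks: Dict[str, Check]) -> Dict[str, Outcome]:
    """Run the heuristics in order, gating follow-ups on their parent result."""
    results: Dict[str, Outcome] = {}

    # 1. Empty Project Link; reachability is only checked when links exist.
    results["empty_project_link"] = checks["empty_project_link"]()
    if results["empty_project_link"] is True:
        results["unreachable_project_links"] = checks["unreachable_project_links"]()
    else:
        results["unreachable_project_links"] = None

    # 2. One Release; the two follow-ups only make sense with several releases.
    results["one_release"] = checks["one_release"]()
    for name in ("high_release_frequency", "unchanged_release"):
        results[name] = checks[name]() if results["one_release"] is True else None

    # 3 and 4 always run.
    for name in ("closer_release_join_date", "suspicious_setup"):
        results[name] = checks[name]()
    return results
```

Each entry in `checks` is a zero-argument callable, so a heuristic that is gated off is never executed, which avoids unnecessary network requests for link checks.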
| 19 | +## Supported Ecosystem: PyPI |
| 20 | + |
| 21 | +Seven heuristics are defined. Each returns `FALSE` (suspicious) or `TRUE` (non-suspicious); `SKIP` means required metadata is missing, so the checker skips the heuristic. |
| 22 | + |
| 23 | +1. **Empty Project Link** |
| 24 | + - **Description**: Checks whether the package contains any project links (e.g., documentation or Git |
| 25 | + repositories). Many malicious packages do not include any project link. |
| 26 | + - **Rule**: Return `FALSE` when there is no project link; otherwise, return `TRUE`. |
| 27 | + |
| 28 | +2. **Unreachable Project Links** |
| 29 | + - **Description**: Checks whether the project links are reachable. This is considered an auxiliary |
| 30 | + heuristic since no observed cases have triggered it. |
| 31 | + - **Rule**: Return `FALSE` if none of the project links is reachable; otherwise, return `TRUE`. |
| 32 | + |
| 33 | +3. **One Release** |
| 34 | + - **Description**: Checks whether the package has only one release. |
| 35 | + - **Rule**: Return `FALSE` if the package contains only one release; otherwise, return `TRUE`. |
| 36 | + |
| 37 | +4. **High Release Frequency** |
| 38 | + - **Description**: Checks whether the package released multiple versions within a short period. We calculate |
| 39 | + the release frequency and compare it against a default threshold of 2 days. |
| 40 | + - **Rule**: Return `FALSE` if releases occur more often than the threshold allows (that is, less than 2 days apart by default); otherwise, return `TRUE`. |
| 41 | + |
| 42 | +5. **Unchanged Release** |
| 43 | + - **Description**: Checks if the content of releases remains unchanged. |
| 44 | + - **Rule**: Return `FALSE` if the content of releases is identical; otherwise, return `TRUE`. |
| 45 | + |
| 46 | +6. **Closer Release Join Date** |
| 47 | + - **Description**: Checks the gap between the date the maintainer registered their account and the date |
| 48 | + of the latest release. A default threshold of 5 days is defined. |
| 49 | + - **Rule**: Return `FALSE` if the gap is less than the threshold; otherwise, return `TRUE`. |
| 50 | + |
| 51 | +7. **Suspicious Setup** |
| 52 | + - **Description**: Checks `setup.py` for suspicious imported modules and for suspicious |
| 53 | + `install_requires` packages, which are installed during the package installation process. We define two suspicious |
| 54 | + keywords as the blacklist. |
| 55 | + - **Rule**: Return `FALSE` if an imported module or `install_requires` package name contains a suspicious keyword; otherwise, return `TRUE`. |
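One possible way to implement the `Suspicious Setup` rule is to parse `setup.py` with Python's `ast` module instead of executing it. The sketch below is an assumption about how such a check could look, not the detector's actual implementation; the helper names and the keyword set (taken from the `base64` and `requests` examples mentioned earlier) are illustrative.

```python
import ast
import re
from typing import Iterable, Set

# Illustrative keyword set; the real blacklist may differ.
SUSPICIOUS_KEYWORDS = frozenset({"base64", "requests"})


def imported_modules(source: str) -> Set[str]:
    """Collect top-level module names from import statements."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            names.add(node.module.split(".")[0])
    return names


def required_packages(source: str) -> Set[str]:
    """Collect package names from literal install_requires lists."""
    names: Set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            if kw.arg == "install_requires" and isinstance(kw.value, (ast.List, ast.Tuple)):
                for elt in kw.value.elts:
                    if isinstance(elt, ast.Constant) and isinstance(elt.value, str):
                        # Drop version specifiers, extras, and markers.
                        name = re.split(r"[<>=!~\[; ]", elt.value, maxsplit=1)[0]
                        names.add(name.strip().lower())
    return names


def suspicious_setup(source: str, keywords: Iterable[str] = SUSPICIOUS_KEYWORDS) -> bool:
    """Return False (suspicious) if any import or requirement matches a keyword."""
    found = imported_modules(source) | required_packages(source)
    return not (found & set(keywords))
```

Static parsing only sees literal `install_requires` lists; requirements built dynamically at install time would not be detected by this sketch.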
| 56 | + |
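The two threshold-based rules (`High Release Frequency` and `Closer Release Join Date`) can be illustrated with a short sketch. The text does not state how the release frequency is computed, so the version below assumes it is the mean gap between consecutive releases; the function names and the use of `None` for `SKIP` are likewise assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def high_release_frequency(
    release_dates: List[datetime],
    threshold: timedelta = timedelta(days=2),
) -> Optional[bool]:
    """False (suspicious) if the mean gap between releases is below the threshold."""
    if len(release_dates) < 2:
        return None  # SKIP: a gap needs at least two releases
    ordered = sorted(release_dates)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps, timedelta()) / len(gaps)
    return mean_gap >= threshold


def closer_release_join_date(
    joined: Optional[datetime],
    latest_release: datetime,
    threshold: timedelta = timedelta(days=5),
) -> Optional[bool]:
    """False (suspicious) if the account was created shortly before the latest release."""
    if joined is None:
        return None  # SKIP: registration date unavailable
    return (latest_release - joined) >= threshold
```

For example, three releases on consecutive days give a mean gap of one day, which is below the 2-day default, so the first function returns `False`.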
| 57 | +## Heuristics-Based Analyzer: Scanning 1167 Packages from Trusted Organizations |
| 58 | + |
| 59 | +| Heuristic Name | Count | |
| 60 | +| ------------------ | ----- | |
| 61 | +| Lower Release | 102 | |
| 62 | +| Empty Link | 45 | |
| 63 | +| Links Missing | 24 | |
| 64 | +| Frequent Release | 14 | |
| 65 | +| Suspicious Setup | 5 | |
| 66 | + |
| 67 | +**The result is used as a reference for the confidence score to lower the false positive rate.** |