feat(query): support imports and packages in python udf scripts #18187

Draft
wants to merge 8 commits into
base: main
from

sundy-li
Member

@sundy-li sundy-li commented Jun 18, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

  1. Support PACKAGES and IMPORTS in Python UDF scripts

IMPORTS makes files from a private stage available in the directory named by sys._xoptions['databend_import_directory'].
PACKAGES lists third-party libraries to install from PyPI.

example:

CREATE OR REPLACE FUNCTION gcd_py (INT, INT) RETURNS BIGINT LANGUAGE python 
IMPORTS = ('@s1/a.zip')
PACKAGES = ('numpy', 'pandas')
HANDLER = 'gcd' AS $$
import numpy as np
import pandas as pd
def gcd(a: int, b: int) -> int:
    while b:
        a, b = b, a % b
    return a
$$;
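
As a rough illustration of how a handler might locate a file delivered through IMPORTS, here is a minimal sketch. It assumes the staged archive keeps its original file name (a.zip) inside the import directory and that it contains importable Python modules; neither is confirmed by this PR. The helper names import_directory and add_zip_to_path are hypothetical.

```python
import sys
import zipfile
from pathlib import Path
from typing import Optional


def import_directory() -> Optional[Path]:
    """Return the import directory if the runtime set it, else None."""
    # sys._xoptions is CPython's dict of -X options; the key below is the one
    # named in the PR description. Outside Databend it will normally be absent.
    value = getattr(sys, "_xoptions", {}).get("databend_import_directory")
    return Path(value) if isinstance(value, str) else None


def add_zip_to_path(name: str) -> bool:
    """Put a staged zip archive on sys.path so its modules can be imported.

    Assumption: the archive is stored under its original name in the import
    directory. Returns False when the directory or the archive is missing.
    """
    base = import_directory()
    if base is None:
        return False
    archive = base / name
    if not zipfile.is_zipfile(archive):
        return False
    # Python can import modules directly from a zip archive on sys.path.
    sys.path.insert(0, str(archive))
    return True
```

Inside a UDF body, add_zip_to_path("a.zip") would be called before importing anything packaged in the archive; when run outside Databend it simply returns False.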
 

  2. Add DATABEND RESTRICTED PYTHON code when executing Python scripts, so that users can only read and write under /tmp/ directories.

It will throw an error when a script reads or writes a forbidden directory:

PermissionError: Access denied: /home/sundy/work/databend/.vscode/query_parquet_1.toml is outside allowed directory.
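
The kind of check that produces such an error can be sketched as follows. This is not the PR's implementation, only a minimal illustration assuming a single allowed root of /tmp on a POSIX system; the function name ensure_allowed and the constant ALLOWED_DIR are hypothetical.

```python
import os

ALLOWED_DIR = "/tmp"  # assumption: one allowed root, as described in the PR


def ensure_allowed(path: str, allowed_dir: str = ALLOWED_DIR) -> str:
    """Return the resolved path, or raise PermissionError if it escapes allowed_dir."""
    # realpath resolves symlinks and ".." segments, so "/tmp/../etc/passwd"
    # is judged by where it actually points.
    resolved = os.path.realpath(path)
    root = os.path.realpath(allowed_dir)
    # commonpath compares whole path components, so "/tmpfoo" does not count
    # as being inside "/tmp".
    if os.path.commonpath([resolved, root]) != root:
        raise PermissionError(
            f"Access denied: {path} is outside allowed directory."
        )
    return resolved
```

A restricted open() wrapper could call ensure_allowed(path) before delegating to the built-in open.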

  3. Support PEP 723 scripts in Python UDF scripts

From https://x.com/charliermarsh/status/1934433431431139342
We can run uv add --script /path/to/script.py to add inline dependencies to a Python file. If the script header doesn't exist already, uv will generate it for you.

For example, we can modify the gcd function to introduce numpy and pandas packages and do some useless work with these packages.

CREATE OR REPLACE FUNCTION gcd_py (INT, INT) RETURNS BIGINT LANGUAGE python HANDLER = 'gcd' AS $$
# /// script
# requires-python = ">=3.12"
# dependencies = ["numpy", "pandas"]
# ///

import numpy as np
import pandas as pd
def gcd(a: int, b: int) -> int:
    x = int(pd.DataFrame(np.random.rand(3, 3)).sum().sum())
    a += x
    b -= x
    a -= x
    b += x
    while b:
        a, b = b, a % b
    return a
$$;

The Python executor runs the script as expected:


🐳 root@default:) select gcd_py(40, 12);
╭─────────────────╮
│  gcd_py(40, 12) │
│ Nullable(Int64) │
├─────────────────┤
│               4 │
╰─────────────────╯

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Jun 18, 2025
@sundy-li
Member Author

After this PR, the Docker image must contain the uv tool if it is built with the python-udf feature. @everpcpc

@everpcpc
Member

everpcpc commented Jun 19, 2025

After this PR, the Docker image must contain the uv tool if it is built with the python-udf feature. @everpcpc

Maybe we could just use python -m venv venv?
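
To make the suggestion concrete, here is a minimal sketch of the standard-library route, using the venv module that backs python -m venv. It is an illustration only, not something this PR implements: the helper names create_env and install_packages are hypothetical, and installing packages this way assumes the environment was created with pip available.

```python
import os
import subprocess
import tempfile
import venv
from pathlib import Path
from typing import Optional, Sequence


def create_env(env_dir: Optional[str] = None, with_pip: bool = False) -> Path:
    """Create a virtual environment and return the path to its interpreter.

    If env_dir is None, a fresh temporary directory is used.
    """
    target = Path(env_dir or tempfile.mkdtemp(prefix="udf-venv-"))
    # Equivalent to `python -m venv <target>`; with_pip=True additionally
    # bootstraps pip through ensurepip, which some distributions omit.
    venv.create(target, with_pip=with_pip, symlinks=(os.name != "nt"))
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    return target / bin_dir / "python"


def install_packages(python: Path, packages: Sequence[str]) -> None:
    """Install packages into the environment (requires pip inside it)."""
    subprocess.run([str(python), "-m", "pip", "install", *packages], check=True)
```

One trade-off between the two approaches: the venv module ships with Python, whereas uv is a separate binary that the image would have to include.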

@sundy-li sundy-li marked this pull request as draft June 20, 2025 01:55
@sundy-li sundy-li changed the title feat(query): support pep723 scripts in python udf scripts feat(query): support imports and packages in python udf scripts Jun 20, 2025
Labels
pr-feature this PR introduces a new feature to the codebase

4 participants