[CT-1105] [Feature] Enable Snowflake imports (stages) in addition to packages for model config #245
Comments
See also: dbt-labs/dbt-core#5741
This just requires us to pass the generated imports when creating the stored procedure (link):

```sql
CREATE OR REPLACE PROCEDURE MYPROC(from_table STRING, to_table STRING, count INT)
RETURNS INT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.8'
PACKAGES = ('snowflake-snowpark-python')
IMPORTS = ('@mystage/my_py_file.py')
HANDLER = 'my_py_file.run';
```
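For reference, the staged file has to define the handler named in the procedure. A minimal sketch of what `my_py_file.py` could look like (only the `run` signature is implied by the DDL above; the body is hypothetical):

```python
# my_py_file.py -- uploaded to @mystage; 'my_py_file.run' is the HANDLER above.
from snowflake.snowpark import Session


def run(session: Session, from_table: str, to_table: str, count: int) -> int:
    # Hypothetical body: copy up to `count` rows from one table to another
    # and return how many rows were written (matches RETURNS INT).
    df = session.table(from_table).limit(count)
    df.write.save_as_table(to_table, mode="overwrite")
    return session.table(to_table).count()
```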
I have worked on a product that did similar things to this and have some concerns.

One way I worked on solving this was to allow (or perhaps enforce) references to code from a git repository:

```yaml
- name: mymodel
  config:
    packages:
      - numpy
      # - pandas
      - scikit-learn
    imports:
      - "git+https://github.com/pandas-dev/pandas@some-important-branch" # normal public repo at a ref
      - "git+https://github.com/lostmygithubaccount/some-project-code&subdirectory=path/to/code" # personal (possibly private) repo at a subdirectory
```

For each import, the code is copied (cloned) to the working directory of the Python process and is available locally. Note the distinction between installing a package and mounting it to the working directory -- i.e. in the above example, is pandas actually installed, or is only its source code available?

But anyway, I think there's a simpler way in dbt that alleviates any concerns I had. Since you're already in a git repository (the dbt project), you can reference files relative to the project:

```yaml
- name: mymodel
  config:
    packages:
      - numpy
      - pandas
      - scikit-learn
    imports:
      - "file1.py"
      - "functions/file2.py"
```

I do think at this point it could be useful to introduce a dedicated directory (by convention) for these files.

This still has the concern of conflicting code (e.g. Chenyu and Cody both use a file with the same name) and of managing that code in object storage -- it does add up over time.

Thoughts? CC: @ChenyuLInx @jtcohen6 @sfc-gh-ejohnson
I'd mostly agree with this. I would keep the directory name as generic as possible, e.g. resources or dependencies. For Python stored procedures in Snowflake, files could be .py files, a Python module (.zip format), a model artifact, or any other resource file. Here is the link that covers the nuances (e.g. auto_compress = false) of uploading these files into a stage: https://docs.snowflake.com/en/sql-reference/stored-procedures-python.html#accessing-other-classes-and-resource-files
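For illustration, a minimal sketch of that upload step using the Snowpark Python API (stage and file paths are placeholders; the same can be done with a SQL `PUT` command):

```python
from snowflake.snowpark import Session

# Placeholder connection parameters -- fill in for your account.
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# auto_compress=False keeps the .py file uncompressed on the stage so it can be
# referenced directly in IMPORTS; overwrite=True replaces any previous upload.
session.file.put(
    "functions/file2.py",      # local path (placeholder)
    "@mystage/dependencies",   # target stage path (placeholder)
    auto_compress=False,
    overwrite=True,
)
```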
So we have a number of concerns about maintaining file sync when using stages.
@lostmygithubaccount We'd like to talk a bit more before we estimate this.
Based on some past experience with similar problems:

1. Block calls that would upload over a certain amount (e.g. 100 MB). Is this a concern for SQL files in a dbt project too? My understanding is they're compiled and then sent via API requests as strings -- is there any upper limit on how large a SQL file can be?
2. This is just user error. It can be avoided by separating dev/prod storage, with prod access granted only to a service account identity (e.g. a service principal in Azure) that goes through a governed git repository and updates things via some CI/CD tool.
3. Not sure. Options may include A) always overwrite, B) use a GUID in the path (i.e. the job ID; see the sketch after this list), or C) a fancy git-ish system with a Merkle tree or something that reduces storage.
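A minimal sketch of option B, treating the job/run ID as nothing more than a unique path segment (all names are placeholders, and the session setup is the same as in the upload sketch earlier):

```python
import uuid

from snowflake.snowpark import Session

connection_parameters = {"account": "<account>", "user": "<user>", "password": "<password>"}
session = Session.builder.configs(connection_parameters).create()

# Namespace every upload by a unique run/job ID so two people (or two runs)
# uploading a file with the same name never overwrite each other.
run_id = uuid.uuid4().hex
stage_path = f"@mystage/runs/{run_id}"
session.file.put("functions/file2.py", stage_path, auto_compress=False)
```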
Let's talk tomorrow :)
Talked with @lostmygithubaccount and we decided the first iteration is going to just allow users to pass in imports in the model config.
Lots of interesting things to think about here with regard to size limits, state checks, versioning these artifacts... For right now, that's going to be all on end users. Adding this is a way of simply unblocking users from pursuing use cases that require imports.
@ChenyuLInx @jtcohen6 just reiterating prior discussions -- our decision for now is to allow this in Snowflake to unblock the numerous scenarios we've seen so far. dbt is not responsible for uploading any code or managing any object data; that is done outside the context of dbt. I'll open a new issue in
Is this your first time submitting a feature request?
Describe the feature
Community slack discussion: https://getdbt.slack.com/archives/C03QUA7DWCW/p1661529180748549
Currently, Python models in dbt using Snowflake allow you to specify `packages` within the config as a list of Python packages to include. Snowflake is limited to this set of packages through this method: https://repo.anaconda.com/pkgs/snowflake/

Custom packages, in addition to other scenarios (storing ML model artifacts, reading arbitrary config files), are enabled via the `IMPORTS` clause on a stored procedure. There is a corresponding `add_import` method in the Python connector: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.html#module-snowflake.snowpark

We should enable this for Snowflake in the config, so:
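For example (hypothetical, since `imports` is the config key being proposed here; the existing `packages` key is shown for contrast):

```python
# models/mymodel.py -- a dbt Python model on Snowflake/Snowpark
def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["numpy", "scikit-learn"],   # resolved from the Snowflake Anaconda channel today
        imports=["@mystage/my_py_file.py"],   # proposed: passed through to IMPORTS on the stored procedure
    )
    df = dbt.ref("upstream_model")            # "upstream_model" is a placeholder
    return df
```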
or in YAML:
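A sketch of the equivalent YAML (again, `imports` is the proposed key; the stage path and model name are placeholders):

```yaml
models:
  - name: mymodel
    config:
      packages:
        - numpy
        - scikit-learn
      imports:
        - "@mystage/my_py_file.py"
```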
The expectation is that these files are available for use from the working directory of the Python process.
Describe alternatives you've considered
Allowing arbitrary object storage for use in the dbt DAG raises tracking/lineage and governance questions. You've now used an ML model binary, uploaded manually to a stage, to make predictions that power a key metric or dashboard, and something goes wrong -- but the model was deleted. Is it traceable?
If you're using Python files and importing them (effectively package management), are they version-controlled with `git`? How do you manage the code?

I think allowing this raises broader questions, but it seems relatively low-effort and high-impact to enable this sooner rather than later as-is. We can then think about standardizing across backends in another iteration.
Who will this benefit?
Snowflake Snowpark users of Python models. This is a critical feature for using Python packages not included in Snowflake's Anaconda channel, and for other scenarios like machine learning.
Are you interested in contributing this feature?
~~maybe ;)~~ no, requires writing SQL and not just Python ;)

Anything else?
No response