
[CT-1105] [Feature] Enable Snowflake imports (stages) in addition to packages for model config #245

Closed
3 tasks done
lostmygithubaccount opened this issue Aug 30, 2022 · 9 comments · Fixed by #263
Labels
enhancement New feature or request python_models

Comments


lostmygithubaccount commented Aug 30, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Community slack discussion: https://getdbt.slack.com/archives/C03QUA7DWCW/p1661529180748549

Currently, Python models in dbt on Snowflake let you specify packages in the config as a list of Python packages to include. With this method, Snowflake is limited to the packages available in its Anaconda channel: https://repo.anaconda.com/pkgs/snowflake/

Custom packages, along with other scenarios (storing ML model artifacts, reading arbitrary config files), are enabled via the IMPORTS clause of a stored procedure. There is a corresponding add_import method in the Snowpark Python API: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.html#module-snowflake.snowpark
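
For illustration, here is a minimal sketch of the Snowpark API this maps onto; the connection values and stage path below are placeholders:

# Minimal sketch, with placeholder connection values and stage path.
# add_import registers a staged (or local) file so it is available to UDFs and
# stored procedures created from this session, mirroring the IMPORTS clause in SQL.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "my_account",    # placeholder
    "user": "my_user",          # placeholder
    "password": "my_password",  # placeholder
}).create()

session.add_import("@mystage/myfile.py")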

We should enable this for Snowflake in the config, so:

    dbt.config(
        materialized = "table",
        packages = ["holidays"],
        imports = ["@mystage/myfile.py"]
    )

or in YAML:

- name: mymodel
  config:
    packages:
      - numpy
      - pandas
      - scikit-learn
    imports:
      - "@mystage/file1"
      - "@myotherstage/file2"

The expectation is that these files are available for use from the working directory of the Python process.

Describe alternatives you've considered

Allowing arbitrary object storage for use in the dbt DAG raises tracking/lineage and governance questions. Say you've used an ML model binary, uploaded manually to a stage, to make predictions that power a key metric or dashboard, and something goes wrong -- but the model was deleted. Is it traceable?

If you're using Python files and importing them (effectively package management), are they version-controlled with git? How do you manage the code?

I think allowing this raises broader questions, but it seems relatively low-effort and high-impact to enable this sooner rather than later as-is. We can then think about standardizing across backends in another iteration.

Who will this benefit?

Snowflake Snowpark users of Python models. This is a critical feature for using non-included Python packages, and other scenarios like machine learning.

Are you interested in contributing this feature?

maybe ;) no, requires writing SQL and not just Python ;)

Anything else?

No response

@lostmygithubaccount
Author

see also: dbt-labs/dbt-core#5741

@ChenyuLInx
Contributor

This just requires us to pass the generated IMPORTS clause when creating the stored procedure (link).
example:

CREATE OR REPLACE PROCEDURE MYPROC(from_table STRING, to_table STRING, count INT)
  RETURNS INT
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.8'
  PACKAGES = ('snowflake-snowpark-python')
  IMPORTS = ('@mystage/my_py_file.py')
  HANDLER = 'my_py_file.run';
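
For illustration, a hypothetical helper (not dbt's actual implementation) showing how an imports list from the model config could be rendered into that clause:

# Hypothetical helper (not dbt's actual code): render a model's imports config
# into the IMPORTS clause of the CREATE PROCEDURE statement.
def render_imports_clause(imports):
    if not imports:
        return ""
    quoted = ", ".join("'{}'".format(path) for path in imports)
    return "IMPORTS = ({})".format(quoted)

print(render_imports_clause(["@mystage/my_py_file.py"]))
# IMPORTS = ('@mystage/my_py_file.py')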

@lostmygithubaccount
Author

I have worked on a product that did similar things to this and have some concerns:

  1. Using arbitrary object storage for code, i.e. managing code outside of git
  2. Related, object conflicts and versioning
  3. A related output artifact discussion

One way I worked on solving this was to allow (or perhaps enforce) references to code from a git repository. pip already has a well-defined scheme for this, namely git+https://<repo>@<ref>#subdirectory=<path/to/code>. Translating that into dbt, it might look like:

- name: mymodel
  config:
    packages:
      - numpy
      # - pandas
      - scikit-learn
    imports:
      - "git+https://github.com/pandas-dev/pandas@some-important-branch" # normal public repo at a ref
      - "git+https://github.com/lostmygithubaccount/some-project-code#subdirectory=path/to/code" # personal (possibly private) repo at a subdirectory

For each import, the code is copied (cloned) to the working directory of the Python process and available locally. Note the distinction between installing a package and mounting it to the working directory -- i.e. in the above example, is pandas actually installed, or is only its source code available?
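
To make that distinction concrete, an illustrative Python snippet (module names are hypothetical stand-ins for the example above):

# Illustrative only: "installed" vs. "mounted" source.
import importlib.util
import sys

# Mounted source: a file copied into the working directory is importable only
# if that directory is on sys.path; nothing is installed into site-packages.
sys.path.insert(0, ".")  # working directory of the Python process
print("file1 importable:", importlib.util.find_spec("file1") is not None)

# Installed package: resolved from site-packages regardless of the working
# directory (e.g. pandas pulled in via packages).
print("pandas importable:", importlib.util.find_spec("pandas") is not None)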

But anyway, I think there's a simpler way in dbt that alleviates any concerns I had. Since you're already in a git repo, and you typically will want your project code there anyway, simply allow imports to refer to local Python files that dbt takes care of uploading to a stage (or corresponding object storage in other platforms) for use in the model's execution environment. That might look like:

- name: mymodel
  config:
    packages:
      - numpy
      - pandas
      - scikit-learn
    imports:
      - "file1.py"
      - "functions/file2.py"

I do think at this point it could be useful to introduce a functions or similar directory, similar to macros for SQL, aimed at enabling code re-use in Python. I think eventually we should allow Python code from outside the project, e.g. with the git scheme above. But for now I think simply pointing to local files and allowing dbt to abstract the details (uploading to a stage) from the user is ideal.

This still has the concern of conflicting code (and managing that code in object storage -- it does add up over time). Conflicting code (e.g. Chenyu and Cody both use file1.py in their projects, working separately) can be avoided by re-using the schema in the path in object storage, e.g. dbt-managed-python-files/dbt_chenyu/file1.py and dbt-managed-python-files/dbt_cody/file1.py. Something like this (exact naming tbd) could work.
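
As a rough sketch of that naming scheme (hypothetical helper, exact naming TBD):

# Hypothetical sketch: namespace uploaded files by target schema so separate
# users' files don't collide in object storage.
def staged_path(schema, filename, prefix="dbt-managed-python-files"):
    return "{}/{}/{}".format(prefix, schema, filename)

print(staged_path("dbt_chenyu", "file1.py"))  # dbt-managed-python-files/dbt_chenyu/file1.py
print(staged_path("dbt_cody", "file1.py"))    # dbt-managed-python-files/dbt_cody/file1.py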

Thoughts? CC: @ChenyuLInx @jtcohen6 @sfc-gh-ejohnson

@sfc-gh-ejohnson

I'd mostly agree with this. I would keep the directory name as generic as possible, e.g. resources or dependencies. For Python stored procedures in Snowflake, the files could be .py files, a Python module (.zip format), a model artifact, or any other resource file. Here is the link that covers the nuances (e.g. auto_compress = false) of uploading these files to a stage: https://docs.snowflake.com/en/sql-reference/stored-procedures-python.html#accessing-other-classes-and-resource-files.
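
For example, a minimal sketch of uploading such a resource file to a stage with the Snowpark API; the connection values, local path, and stage below are placeholders:

# Minimal sketch of uploading a resource file to a stage for use via IMPORTS.
# Note auto_compress=False, per the nuance linked above: the file must stay
# uncompressed so it can be imported at runtime.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "my_account",    # placeholder connection values
    "user": "my_user",
    "password": "my_password",
}).create()

session.file.put(
    "functions/file2.py",       # local file in the dbt project (hypothetical)
    "@mystage/dependencies",    # target stage path (hypothetical)
    auto_compress=False,        # keep the staged file uncompressed
    overwrite=True,
)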

@iknox-fa
Contributor

So we have a number of concerns about maintaining file sync when using stages.

  • What happens when the import is a huge file?
  • What happens when this file is updated from external processes?
  • How do we clean these when they change?

@lostmygithubaccount We'd like to talk a bit more before we estimate this.

@lostmygithubaccount
Author

lostmygithubaccount commented Sep 12, 2022

Based on some past experience with similar problems:

  • What happens when the import is a huge file?

Block calls that would upload more than a certain size (e.g. 100 MB).

Is this a concern for SQL files in a dbt project too? My understanding is that they're compiled and then sent as strings via API requests; is there any upper limit on how large a SQL file can be?

  • What happens when this file is updated from external processes?

This is just user error. It can be avoided with separate dev/prod storage, where prod access is granted only to a service account identity (e.g. a service principal in Azure) that goes through a governed git repository and updates things via some CI/CD tool.

  • How do we clean these when they change?

Not sure. Options may include A) always overwrite, B) use a GUID in the path (e.g. the job ID), or C) a fancy git-ish system with a Merkle tree or something that reduces storage.
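
As a rough sketch of option C (hypothetical, not an existing dbt mechanism):

# Hypothetical: derive the stage path from a content hash so unchanged files
# are reused and changed files land at a new path rather than overwriting.
import hashlib
from pathlib import Path

def content_addressed_path(local_file, prefix="dbt-managed-python-files"):
    digest = hashlib.sha256(Path(local_file).read_bytes()).hexdigest()[:16]
    return "{}/{}/{}".format(prefix, digest, Path(local_file).name)

# e.g. content_addressed_path("functions/file2.py")
# -> "dbt-managed-python-files/3fa9.../file2.py"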

@lostmygithubaccount We'd like to talk a bit more before we estimate this.

Let's talk tomorrow :)

@ChenyuLInx
Contributor

Talked with @lostmygithubaccount and we decided the first iteration is just going to allow users to pass imports in dbt.config, and we will pass that through as an argument when creating the Python stored procedure.

@jtcohen6
Contributor

Lots of interesting things to think about here with regard to size limits, state checks, versioning these artifacts...

For right now, that's going to be all on end users. Adding this is a way of simply unblocking users from pursuing use cases that require imports (of which we anticipate several), without officially endorsing / supporting specific patterns, or providing any helpful abstractions over it. Not saying we shouldn't aim to do those things — I just think they can come later, when we have a better sense of the use cases and requirements.

@lostmygithubaccount
Author

@ChenyuLInx @jtcohen6 just re-iterating prior discussions -- our decision for now is to allow this in Snowflake to unblock the numerous scenarios we've seen so far. dbt is not responsible for uploading any code or managing any object data, that is done outside the context of dbt.

I'll open a new issue in dbt-core for a feature to support this generically across all backends, where we will likely take care of uploading artifacts.
