
[CT-1105] [Feature] Enable Snowflake imports (stages) in addition to packages for model config #245

Closed
3 tasks done
lostmygithubaccount opened this issue Aug 30, 2022 · 9 comments · Fixed by #263
Labels
enhancement New feature or request python_models

Comments


lostmygithubaccount commented Aug 30, 2022

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Community slack discussion: https://getdbt.slack.com/archives/C03QUA7DWCW/p1661529180748549

Currently, Python models in dbt on Snowflake let you specify packages in the config as a list of Python packages to include. With this method, Snowflake is limited to the packages available in its Anaconda channel: https://repo.anaconda.com/pkgs/snowflake/

Custom packages, along with other scenarios (storing ML model artifacts, reading arbitrary config files), are enabled via the IMPORTS clause of a stored procedure. There is a corresponding add_import method in the Snowpark Python API: https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/_autosummary/snowflake.snowpark.html#module-snowflake.snowpark
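
For illustration, here is a minimal sketch of the Snowpark API this maps onto; the connection values and stage path below are placeholders:

# Minimal sketch, with placeholder connection values and stage path.
# add_import registers a staged (or local) file so it is available to UDFs and
# stored procedures created from this session, mirroring the IMPORTS clause in SQL.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "my_account",    # placeholder
    "user": "my_user",          # placeholder
    "password": "my_password",  # placeholder
}).create()

session.add_import("@mystage/myfile.py")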

We should enable this for Snowflake in the config, so:

    dbt.config(
        materialized = "table",
        packages = ["holidays"],
        imports = ["@mystage/myfile.py"]
    )

or in YAML:

- name: mymodel
  config:
    packages:
      - numpy
      - pandas
      - scikit-learn
    imports:
      - "@mystage/file1"
      - "@myotherstage/file2"

The expectation is that these files are available for use from the working directory of the Python process.

Describe alternatives you've considered

Allowing arbitrary object storage for use in the dbt DAG raises tracking/lineage and governance questions. Say you've used an ML model binary, uploaded manually to a stage, to make predictions that power a key metric or dashboard, and something goes wrong -- but the model was deleted. Is it traceable?

If you're using Python files and importing them (effectively package management), are they version-controlled with git? How do you manage the code?

I think allowing this raises broader questions, but it seems relatively low-effort and high-impact to enable this sooner rather than later as-is. We can then think about standardizing across backends in another iteration.

Who will this benefit?

Snowflake Snowpark users of Python models. This is a critical feature for using non-included Python packages, and other scenarios like machine learning.

Are you interested in contributing this feature?

maybe ;) no, requires writing SQL and not just Python ;)

Anything else?

No response

@lostmygithubaccount
Author

see also: dbt-labs/dbt-core#5741

@ChenyuLInx
Contributor

This just requires us to pass the generated IMPORTS clause when creating the stored procedure (link).
example:

CREATE OR REPLACE PROCEDURE MYPROC(from_table STRING, to_table STRING, count INT)
  RETURNS INT
  LANGUAGE PYTHON
  RUNTIME_VERSION = '3.8'
  PACKAGES = ('snowflake-snowpark-python')
  IMPORTS = ('@mystage/my_py_file.py')
  HANDLER = 'my_py_file.run';
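
For illustration, a hypothetical helper (not dbt's actual implementation) showing how an imports list from the model config could be rendered into that clause:

# Hypothetical helper (not dbt's actual code): render a model's imports config
# into the IMPORTS clause of the CREATE PROCEDURE statement.
def render_imports_clause(imports):
    if not imports:
        return ""
    quoted = ", ".join("'{}'".format(path) for path in imports)
    return "IMPORTS = ({})".format(quoted)

print(render_imports_clause(["@mystage/my_py_file.py"]))
# IMPORTS = ('@mystage/my_py_file.py')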

@lostmygithubaccount
Author

I have worked on a product that did similar things to this and have some concerns:

  1. Using arbitrary object storage for code, i.e. managing code outside of git
  2. Related, object conflicts and versioning
  3. A related output artifact discussion

One way I worked on solving this was to allow (or perhaps enforce) references to code from a git repository. pip already has a well-defined scheme for this, namely git+https://<repo>@<ref>#subdirectory=<path/to/code>. Translating that into dbt, it might look like:

- name: mymodel
  config:
    packages:
      - numpy
      # - pandas
      - scikit-learn
    imports:
      - "git+https://github.com/pandas-dev/pandas@some-important-branch" # normal public repo at a ref
      - "git+https://github.com/lostmygithubaccount/some-project-code#subdirectory=path/to/code" # personal (possibly private) repo at a subdirectory

For each import, the code is copied (cloned) to the working directory of the Python process and available locally. Note the distinction between installing a package and mounting it to the working directory -- i.e. in the above example, is pandas actually installed, or is only its source code available?
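
To make that distinction concrete, an illustrative Python snippet (module names are hypothetical stand-ins for the example above):

# Illustrative only: "installed" vs. "mounted" source.
import importlib.util
import sys

# Mounted source: a file copied into the working directory is importable only
# if that directory is on sys.path; nothing is installed into site-packages.
sys.path.insert(0, ".")  # working directory of the Python process
print("file1 importable:", importlib.util.find_spec("file1") is not None)

# Installed package: resolved from site-packages regardless of the working
# directory (e.g. pandas pulled in via packages).
print("pandas importable:", importlib.util.find_spec("pandas") is not None)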

But anyway, I think there's a simpler way in dbt that alleviates any concerns I had. Since you're already in a git repo, and you typically will want your project code there anyway, simply allow imports to refer to local Python files that dbt takes care of uploading to a stage (or corresponding object storage in other platforms) for use in the model's execution environment. That might look like:

- name: mymodel
  config:
    packages:
      - numpy
      - pandas
      - scikit-learn
    imports:
      - "file1.py"
      - "functions/file2.py"

I do think at this point it could be useful to introduce a functions or similar directory, similar to macros for SQL, aimed at enabling code re-use in Python. I think eventually we should allow Python code from outside the project, e.g. with the git scheme above. But for now I think simply pointing to local files and allowing dbt to abstract the details (uploading to a stage) from the user is ideal.

This still has the concern of conflicting code (and managing that code in object storage -- it does add up over time). Conflicting code (e.g. Chenyu and Cody both use file1.py in their projects, working separately) can be avoided by re-using the schema in the path in object storage, e.g. dbt-managed-python-files/dbt_chenyu/file1.py and dbt-managed-python-files/dbt_cody/file1.py. Something like this (exact naming tbd) could work.
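
As a rough sketch of that naming scheme (hypothetical helper, exact naming TBD):

# Hypothetical sketch: namespace uploaded files by target schema so separate
# users' files don't collide in object storage.
def staged_path(schema, filename, prefix="dbt-managed-python-files"):
    return "{}/{}/{}".format(prefix, schema, filename)

print(staged_path("dbt_chenyu", "file1.py"))  # dbt-managed-python-files/dbt_chenyu/file1.py
print(staged_path("dbt_cody", "file1.py"))    # dbt-managed-python-files/dbt_cody/file1.py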

Thoughts? CC: @ChenyuLInx @jtcohen6 @sfc-gh-ejohnson

@sfc-gh-ejohnson

I'd mostly agree with this. I would keep the directory name as generic as possible, e.g. resources or dependencies. For Python stored procedures in Snowflake, the files could be .py files, a Python module (.zip format), a model artifact, or any other resource file. Here is the link that covers the nuances (e.g. auto_compress = false) of uploading these files to a stage: https://docs.snowflake.com/en/sql-reference/stored-procedures-python.html#accessing-other-classes-and-resource-files.
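
For example, a minimal sketch of uploading such a resource file to a stage with the Snowpark API; the connection values, local path, and stage below are placeholders:

# Minimal sketch of uploading a resource file to a stage for use via IMPORTS.
# Note auto_compress=False, per the nuance linked above: the file must stay
# uncompressed so it can be imported at runtime.
from snowflake.snowpark import Session

session = Session.builder.configs({
    "account": "my_account",    # placeholder connection values
    "user": "my_user",
    "password": "my_password",
}).create()

session.file.put(
    "functions/file2.py",       # local file in the dbt project (hypothetical)
    "@mystage/dependencies",    # target stage path (hypothetical)
    auto_compress=False,        # keep the staged file uncompressed
    overwrite=True,
)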

@iknox-fa
Contributor

So we have a number of concerns about maintaining file sync when using stages.

  • What happens when the import is a huge file?
  • What happens when this file is updated from external processes?
  • How do we clean these when they change?

@lostmygithubaccount We'd like to talk a bit more before we estimate this.

@lostmygithubaccount
Author

lostmygithubaccount commented Sep 12, 2022

Based on some past experience with similar problems:

  • What happens when the import is a huge file?

Block calls that would upload more than a certain size (e.g. 100 MB).

Is this a concern for SQL files in a dbt project too? My understanding is that they're compiled and then sent as strings via API requests; is there any upper limit on how large a SQL file can be?

  • What happens when this file is updated from external processes?

This is just user error. It can be avoided with separate dev/prod storage, where prod access is granted only to a service account identity (e.g. a service principal in Azure) that goes through a governed git repository and updates things via some CI/CD tool.

  • How do we clean these when they change?

Not sure. Options may include A) always overwrite, B) use a GUID in the path (e.g. the job ID), or C) a fancy git-ish system with a Merkle tree or something that reduces storage.
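
As a rough sketch of option C (hypothetical, not an existing dbt mechanism):

# Hypothetical: derive the stage path from a content hash so unchanged files
# are reused and changed files land at a new path rather than overwriting.
import hashlib
from pathlib import Path

def content_addressed_path(local_file, prefix="dbt-managed-python-files"):
    digest = hashlib.sha256(Path(local_file).read_bytes()).hexdigest()[:16]
    return "{}/{}/{}".format(prefix, digest, Path(local_file).name)

# e.g. content_addressed_path("functions/file2.py")
# -> "dbt-managed-python-files/3fa9.../file2.py"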

@lostmygithubaccount We'd like to talk a bit more before we estimate this.

Let's talk tomorrow :)

@ChenyuLInx
Contributor

Talked with @lostmygithubaccount and we decided the first iteration is just going to allow users to pass imports in dbt.config, and we will pass that through as an argument when creating the Python stored procedure.

@jtcohen6
Contributor

Lots of interesting things to think about here with regard to size limits, state checks, versioning these artifacts...

For right now, that's going to be all on end users. Adding this is a way of simply unblocking users from pursuing use cases that require imports (of which we anticipate several), without officially endorsing / supporting specific patterns, or providing any helpful abstractions over it. Not saying we shouldn't aim to do those things — I just think they can come later, when we have a better sense of the use cases and requirements.

@lostmygithubaccount
Author

@ChenyuLInx @jtcohen6 just re-iterating prior discussions -- our decision for now is to allow this in Snowflake to unblock the numerous scenarios we've seen so far. dbt is not responsible for uploading any code or managing any object data, that is done outside the context of dbt.

I'll open a new issue in dbt-core for a feature to support this generically across all backends, where we will likely take care of uploading artifacts.
