Python models: package, artifact/object storage, and UDF management in dbt #5741
-
Hi, I see that this conversation is about six months old. Do you know if there's been any progress on package and artifact management in dbt for Python models? I ask because I have some custom-written Python functions that transform data in the ETL methodology, but now that the business I work for is migrating to Snowflake, I would like to move to ELT using dbt. If dbt can simplify this process as much as possible (vs. defining custom functions for each model even though they might be reused across models), it would be appreciated.
-
Hi, wouldn't it be feasible to use a custom Docker image to run the model on BigQuery (Dataproc)? I believe this achieves the goal of having custom libraries used by the model; it would be version controlled and also allows flexibility in how much control the end user wants over libraries. With some extra work within the dbt adapter, it could potentially even allow for a local dev environment.
-
The purpose of this discussion is to share how we're thinking about package and artifact management for dbt Python models, in support of code re-use and other scenarios.
Before jumping into this discussion, first read:
The problem
In dbt, a single file represents a single model. Transformation code lives in that model. Traditionally, code re-use for SQL is achieved through macros, ephemeral models, or UDFs. The last one is not natively supported, but still commonly used and relevant for this discussion.
In Python, users typically leverage other packages via import statements at the top of the file:
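For example (the specific libraries, and the internal package, are purely illustrative):

```python
# Standard pattern: third-party packages imported at the top of the file.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The harder case this discussion is about: a custom, internal package.
import my_company_utils  # hypothetical -- not available on PyPI/Anaconda
```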
Those packages are installed either from a remote index or locally. A remote index can be public -- PyPI or Anaconda -- or private to a given organization. Usually a developer will start with code in the same file, then refactor it into a separate module in another local file, and, if it's re-used across projects, may ship it as a package in a public or private index.
In dbt's Python models today, we allow specifying a list of packages for use in the model's run. However, this misses the second case: custom code. In Snowpark, for instance, there is a limited set of packages available: https://repo.anaconda.com/pkgs/snowflake/. How do we support custom Python packages? In Snowpark, the recommendation is to use stages, effectively object storage. How do we standardize the interfaces for doing so across the backends we support?
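For reference, here is roughly what today's packages config looks like inside a Python model (model and package names are illustrative):

```python
def model(dbt, session):
    # Declare index-installable dependencies for this model's run.
    dbt.config(packages=["scikit-learn==1.1.1"])

    # Reference an upstream model; the returned DataFrame type depends on the backend.
    df = dbt.ref("upstream_model")
    return df
```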
How do we support code re-use in Python? How do we deal with related details of package and artifact management?
For Snowflake, we're starting with this: dbt-labs/dbt-snowflake#245 -- but will that work across all backends? Is it ideal?
Object storage
Relevant community slack discussion: https://getdbt.slack.com/archives/C03QUA7DWCW/p1661529180748549
Object storage in remote systems running Python processes can offer a pretty simple solution -- upload your Python files or an entire package, make it available from the Python process, and you're good to go. This has a couple of issues:
Beyond Python code, object storage is also useful for storing additional inputs (perhaps config files) and outputs (machine learning model artifacts) alongside that code. Both still suffer from a lack of versioning and governance with a simple object storage solution.
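To ground this in one backend: on Snowflake, Snowpark already exposes primitives for the upload-and-import pattern described above. A minimal sketch, where the stage name, paths, and connection parameters are all hypothetical:

```python
from snowflake.snowpark import Session

# Connection parameters are placeholders.
session = Session.builder.configs({
    "account": "...", "user": "...", "password": "...",
}).create()

# Upload a local module to a named stage (i.e. object storage)...
session.file.put("local/path/my_company_utils.py", "@my_python_stage", auto_compress=False)

# ...and register it as an import, so Python running in the warehouse
# (UDFs, stored procedures) can `import my_company_utils`.
session.add_import("@my_python_stage/my_company_utils.py")
```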
Enforcing git as a filesystem
In short, instead of directly allowing arbitrary references to code in object storage, enforce usage through git. This ensures code is still versioned -- even if not in the same repo. See https://pip.pypa.io/en/stable/topics/vcs-support/ for a viable existing format, including specifying subdirectories.
This works for code, but not quite for ML model artifacts.
UDFs
Thanks to @ChenyuLInx for spiking this internally and creating a document that serves as the basis here.
Prior art:
User-defined functions (UDFs) are a typical mechanism in database/data warehouse systems for re-using code. I am admittedly not that familiar with them. Some reasons for using them in Python include:
We may consider adding a canonical functions directory alongside (or within) models, with each function defined in a single file, just like models. So a user may have a layout along the lines of the sketch below.
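As a purely hypothetical sketch of one such file (none of this is an existing dbt feature, and the names are made up):

```python
# functions/clean_email.py -- one function per file, mirroring one model per file.

def clean_email(raw: str) -> str:
    """Normalize an email address so the same logic can be re-used across models,
    e.g. by registering it as a UDF on the warehouse."""
    return raw.strip().lower()
```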
The functions are automatically updated on each dbt run, or we provide some new sub-command for managing them. Some open questions include:
Defining Python environments
Python models bring with them the need to manage environments for Python. Overall these may consist of:
Git references (git+) for project code
Can dbt play a role in providing a standard environment definition across Python backends? They should be defined close to the models they're used for, i.e. tracked in the same git repository at least. We may consider defining these as YAML or something else and, of course, setting smart defaults for each adapter.
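Purely as a strawman -- every key below is an assumption, not an existing dbt feature -- such a definition might capture something like the following, whether expressed as YAML or otherwise:

```python
# Hypothetical per-model Python environment definition, shown here as a Python dict.
python_environment = {
    "python_version": "3.10",
    "packages": ["pandas==1.5.0"],  # from a public or private index
    "git": ["git+https://github.com/my-org/my-utils.git@v0.3.0"],  # project code
}
```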
Comments, questions, concerns?
Let us know below!