Python models: package, artifact/object storage, and UDF management in dbt #5741
-
Hi, I see that this conversation is about six months old. Do you know if there's been any progress on package and artifact management in dbt for Python models? I ask because I have some custom-written Python functions that transform data in the ETL methodology, but now that the business I work for is migrating to Snowflake, I would like to move to ELT using dbt. If dbt can simplify this process as much as possible (vs. defining custom functions for each model even though they might be reused across models), it would be appreciated.
-
Hi, wouldn't it be feasible to use a custom Docker image to run the model on BigQuery (Dataproc)? I believe this achieves the goal of having custom libraries used by the model; it would be version controlled and also allows flexibility in how much control the end user wants over libraries. With some extra work within the dbt adapter, it could potentially even allow for a local dev environment.
-
The purpose of this discussion is to share how we're thinking about package and artifact management for dbt Python models, in support of code re-use and other scenarios.
Before jumping into this discussion, first read:
The problem
In dbt, a single file represents a single model. Transformation code lives in that model. Traditionally, code re-use for SQL is achieved through macros, ephemeral models, or UDFs. The last one is not natively supported, but still commonly used and relevant for this discussion.
In Python, users typically leverage other packages via import statements at the top of the file:
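For example (the specific libraries, and the internal package, are purely illustrative):

```python
# Standard pattern: third-party packages imported at the top of the file.
import pandas as pd
from sklearn.preprocessing import StandardScaler

# The harder case this discussion is about: a custom, internal package.
import my_company_utils  # hypothetical -- not available on PyPI/Anaconda
```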
Those packages are installed either from a remote index or locally. A remote index can be public -- PyPI or Anaconda -- or private to a given organization. Usually a developer will start with code in the same file, then refactor it into a separate module in another local file, and, if it's re-used across projects, may ship it as a package in a public or private index.
In dbt's Python models today, we allow specifying a list of packages for use in the model's run. However, this misses the second case: custom code. In Snowpark, for instance, there is a limited set of packages available: https://repo.anaconda.com/pkgs/snowflake/. How do we support custom Python packages? In Snowpark, the recommendation is to use stages, effectively object storage. How do we standardize the interfaces for doing so across the backends we support?
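For reference, here is roughly what today's packages config looks like inside a Python model (model and package names are illustrative):

```python
def model(dbt, session):
    # Declare index-installable dependencies for this model's run.
    dbt.config(packages=["scikit-learn==1.1.1"])

    # Reference an upstream model; the returned DataFrame type depends on the backend.
    df = dbt.ref("upstream_model")
    return df
```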
How do we support code re-use in Python? How do we deal with related details of package and artifact management?
For Snowflake, we're starting with this: dbt-labs/dbt-snowflake#245 -- but will that work across all backends? Is it ideal?
Object storage
Relevant community slack discussion: https://getdbt.slack.com/archives/C03QUA7DWCW/p1661529180748549
Object storage in remote systems running Python processes can offer a pretty simple solution -- upload your Python files or an entire package, make it available from the Python process, and you're good to go. This has a couple of issues:
Beyond Python code, object storage is also useful for storing additional inputs (perhaps config files) and outputs (machine learning model artifacts) alongside that code. Both still suffer from a lack of versioning and governance with a simple object storage solution.
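To ground this in one backend: on Snowflake, Snowpark already exposes primitives for the upload-and-import pattern described above. A minimal sketch, where the stage name, paths, and connection parameters are all hypothetical:

```python
from snowflake.snowpark import Session

# Connection parameters are placeholders.
session = Session.builder.configs({
    "account": "...", "user": "...", "password": "...",
}).create()

# Upload a local module to a named stage (i.e. object storage)...
session.file.put("local/path/my_company_utils.py", "@my_python_stage", auto_compress=False)

# ...and register it as an import, so Python running in the warehouse
# (UDFs, stored procedures) can `import my_company_utils`.
session.add_import("@my_python_stage/my_company_utils.py")
```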
Enforcing git as a filesystem
In short, instead of directly allowing arbitrary references to code in object storage, enforce usage through git. This ensures code is still versioned -- even if not in the same repo. See https://pip.pypa.io/en/stable/topics/vcs-support/ for a viable existing format, including specifying subdirectories.
This works for code, but not quite for ML model artifacts.
UDFs
Thanks to @ChenyuLInx for spiking this internally and creating a document that serves as the basis here.
Prior art:
User-defined functions (UDFs) are a typical mechanism in database/data warehouse systems for re-using code. I am admittedly not that familiar with them. Some reasons for using them in Python include:
We may consider adding a canonical functions directory alongside (or within) models, with each function defined in a single file, just like models. So a user may have a layout along the lines of the sketch below.
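As a purely hypothetical sketch of one such file (none of this is an existing dbt feature, and the names are made up):

```python
# functions/clean_email.py -- one function per file, mirroring one model per file.

def clean_email(raw: str) -> str:
    """Normalize an email address so the same logic can be re-used across models,
    e.g. by registering it as a UDF on the warehouse."""
    return raw.strip().lower()
```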
The functions are automatically updated on each dbt run, or we provide some new sub-command for managing them. Some open questions include:
Defining Python environments
Python models bring with them the need to manage environments for Python. Overall these may consist of:
Git references (git+) for project code
Can dbt play a role in providing a standard environment definition across Python backends? They should be defined close to the models they're used for, i.e. tracked in the same git repository at least. We may consider defining these as YAML or something else and, of course, setting smart defaults for each adapter.
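Purely as a strawman -- every key below is an assumption, not an existing dbt feature -- such a definition might capture something like the following, whether expressed as YAML or otherwise:

```python
# Hypothetical per-model Python environment definition, shown here as a Python dict.
python_environment = {
    "python_version": "3.10",
    "packages": ["pandas==1.5.0"],  # from a public or private index
    "git": ["git+https://github.com/my-org/my-utils.git@v0.3.0"],  # project code
}
```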
Comments, questions, concerns?
Let us know below!