Run Development Models using Production Data #1612
Comments
Hey @nickymikail - thanks for making this issue! Check out the thread over here: #1603. #1603 describes something different than what you're asking for, but I think there might be a feature that we can pull out of that issue which would address your use case. The big idea is just that (some?) It would be really hard to build the generalized version of this: models can have all sorts of environment-aware logic which changes the destination schema/table names in prod vs. dev, for instance. Just knowing the location of the prod version of a model is going to be really tough in development! Here are some alternative approaches that we recommend:
Curious what you think about all of this!
Hi @drewbanin / @nickymikail: I think the use case I have just discussed with Dylan Baker on Slack is similar in nature. Initial state:
Change:
So, basically, when querying that pageview model in dev, my preferred way of handling this would be for dbt to do something like this auto-magically :) as part of the Jinja parsing:
I could also imagine a new dbt command that @nickymikail suggested: OR a new run flag for parent models like
@bashyroger You've outlined a really compelling use case. We've come a long way on our thinking here over the past several months. If you're on Snowflake, zero-copy cloning is a massive help because there's almost no added cost for cloning more than you need to. You can paint with the broadest possible brush. GitLab uses cloning as part of their CI process today; check out @emilieschario's recent Discourse comment for links. Copying is a much costlier operation, so you wouldn't want to copy any more objects than you absolutely need to. We're laying some significant groundwork to enable more precise approaches in the next release:
Feature
Run Development Models using Production Data
There should be a command that allows data analysts to test models using production data. With a DAG like this:
If an analyst wants to test a modification to model d (or create a new dependent model e), they would need to have models a, b, and c present in their development schema, or render these models at runtime as well. If model a, b, or c is a large dataset, or if any of those models has a significant run time, this could represent nontrivial costs in time and storage. With a new command
dbt develop --models model_d
that parses any refs to models not stated in the --models argument to the command as referencing an up-to-date production dataset while still writing to a development target, these costs could be avoided and data model development could be significantly sped up.
Who will this benefit?
This would benefit larger analytics teams who observe significant storage costs from duplicate data, and teams with more generations in their dependency graphs.
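To make the proposed ref behavior concrete, here is a minimal Python sketch of the resolution rule described above: models named in --models resolve to the development schema, and every other ref resolves to production. This is not dbt's API. The `Targets` class, the `resolve_ref` function, and the schema names `dbt_dev` and `analytics` are all hypothetical, and the sketch assumes every production model lives in one fixed schema, which is exactly the simplification that environment-aware projects would break.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Targets:
    """Hypothetical schema names; real projects may compute these per model."""

    dev_schema: str
    prod_schema: str


def resolve_ref(model: str, selected: set[str], targets: Targets) -> str:
    """Return the relation that a ref to `model` would point at.

    Selected models are built in, and read from, the development schema.
    Every other model is read from the production schema.
    """
    schema = targets.dev_schema if model in selected else targets.prod_schema
    return f"{schema}.{model}"


if __name__ == "__main__":
    # Assumed names for illustration only.
    targets = Targets(dev_schema="dbt_dev", prod_schema="analytics")
    selected = {"model_d"}
    for name in ["model_a", "model_b", "model_c", "model_d"]:
        print(name, "->", resolve_ref(name, selected, targets))
```

With only model_d selected, refs to model_a, model_b, and model_c would read from production while model_d is written to the development schema, so none of the upstream models need to be rebuilt or stored twice.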