Run Development Models using Production Data #1612

nickymikail · 2019-07-16T22:00:25Z

Feature

Run Development Models using Production Data

There should be a command that allows data analysts to test models using production data. With a DAG like this:

model_a \
          - model_c - model_d
model_b /

If an analyst wants to test a modification to model d (or create a new dependent model e), they would need to have models a, b, and c present in their development schema or render these models at runtime as well. If model a, b or c is a large dataset or if any of those models has a significant run time this could represent nontrivial costs in time and storage. With a new command dbt develop --models model_d that parses any refs to models not stated in the --models argument to the command as referencing an up-to-date production dataset while still writing to a development target, these costs could be avoided and data model development could be significantly sped up.

Who will this benefit?

This would benefit larger analytics teams who observe significant storage costs from duplicate data, and teams with more generations in their dependency graphs.

The text was updated successfully, but these errors were encountered:

drewbanin · 2019-07-17T16:52:22Z

Hey @nickymikail - thanks for making this issue!

Check out the thread over here: #1603

#1603 describes something different than what you're asking for, but I think there might be a feature that we can pull out of that issue which would address your use case. The big idea is just that(some?) refs should address models in your development schema, while others should address the models in the production schema.

It's would be really hard to build the generalized version of this: models can have all sorts of environment-aware logic which changes the destination schema/table names in prod vs. dev, for instance. Just knowing the location of the prod version of a model is going to be really tough in development!

Here are some alternative approaches that we recommend:

add filters to models in development, eg:

select * from my_really_big_table
{% if target.name == 'dev' %}
where created_at > getdate() - interval '3 days'
{% endif %}

Implement use-case specific logic in some specific models

{% if target.name == 'dev' %}
  -- just make a view that points to the prod model
  {{ config(materialized='view') }}

  select * from production.table_name

{% else %}
  -- build the model here
  {{ config(materialized='incremental') }}
  select ....

{% endif %}

utilize "zero copy clones" if possible (Snowflake only!)

Curious what you think about all of this!

bashyroger · 2020-07-08T14:28:50Z

Hi @drewbanin / @nickymikail : I think the use case I have just discussed with dylan baker on slack is similar in nature
What I am looking for is the following:

Initial state:

I have 2 years of snowplow event data
On top of that I have an incremental model in prod, aggregating it to the pageview level.

Change:

Now I want to start create a new model OR change an existing one called pageviews_per_month_aggregate on top of the pageview model
Initially, that pageview model does not exist in my dev env
instead of (partially) rebuilding that, I would like that pageview model to be copied (or cloned in snowflake) when the model is not found.

So, basically, when querying that pageview model in dev, my preferred way of handling this would be for dbt to do something like this _auto-magically :) as part of the JINJA parsing:

IF model_used_in_ref NOT FOUND IN dev_target AND model_used_in_ref EXISTS IN prod_target THEN COPY or CLONE model_used_in_ref FROM prod_target END IF
...and then execute the query
Obviously, where DBT would copy / clone the data from should be configurable, I now used prod_target in this example, but it might as well from an obfuscated copy / clone of prod (thinking about PII data)

I could also imagine a new DBT command that @nickymikail suggested:
dbt develop --models model_d
dbt develop --models pageviews_per_month_aggregate

OR a new run flag for parent models like
dbt run --models +model_d --initialize_parents
dbt run --models +pageviews_per_month_aggregate --initialize_parents
The --initialize_parents flag would then take care of cloning / copying any parent models of model_d / pageviews_per_month_aggregate

jtcohen6 · 2020-07-09T16:37:38Z

@bashyroger You've outlined a really compelling use case. We've come a long way on our thinking here over the past several months.

If you're on Snowflake, zero-copy cloning is a massive help because there's almost no added cost for cloning more than you need to. You can paint with the broadest possible brush. GitLab uses cloning as part of their CI process today; check out @emilieschario's recent Discourse comment for links.

Copying is a much costlier operation, so you wouldn't want to copy any more objects than you absolutely need to. We're laying some significant groundwork to enable more precise approaches in the next release:

nth-level parent/child selectors that allow you to select and copy only the first-degree parent of the model you want to run (Direct child model selector #2052, Direct child model selector #2485) (though this may require a custom materialization)
Even better: The ref() function could "defer" to a different namespace by pointing to a different run's artifacts (Defer to prod manifest for partial runs #2527), allowing you to select directly from the prod version of the pageview model and avoid copying/cloning entirely

jtcohen6 · 2020-09-10T18:22:06Z

I'm going to close this issue, since we added dbt run --defer in v0.18.0 (#2656, docs)

drewbanin mentioned this issue Jul 24, 2019

Allow configurable include_policy for snowflake #1603

Closed

drewbanin mentioned this issue Jun 10, 2020

Defer to prod manifest for partial runs #2527

Closed

jtcohen6 closed this as completed Sep 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run Development Models using Production Data #1612

Run Development Models using Production Data #1612

nickymikail commented Jul 16, 2019

drewbanin commented Jul 17, 2019

bashyroger commented Jul 8, 2020 •

edited

Loading

jtcohen6 commented Jul 9, 2020 •

edited

Loading

jtcohen6 commented Sep 10, 2020 •

edited

Loading

Run Development Models using Production Data #1612

Run Development Models using Production Data #1612

Comments

nickymikail commented Jul 16, 2019

Feature