workflow for explain queries? #401

Closed
drewbanin opened this issue Apr 26, 2017 · 20 comments
Labels
stale (Issues that have gone stale) · Team:Adapters (Issues designated for the adapter area of the code)

Comments

@drewbanin
Contributor

Can this work with operations?

@adamhaney

We're continuing to try to find a solution to this problem. We were considering adding a --explain argument to dbt run that would prepend EXPLAIN to the query and then dump the EXPLAIN output to stdout.

It looks like Redshift and Postgres both support simply prepending EXPLAIN to the query itself; I'm not sure about Snowflake, etc. (maybe that command just wouldn't be applicable for those backends?).
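
For illustration, on Postgres or Redshift this would amount to something like the following (the compiled model SQL here is made up):

explain
select
    customer_id,
    count(*) as order_count
from analytics.orders
group by 1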

@drewbanin
Contributor Author

drewbanin commented Jun 14, 2017

@adamhaney are you familiar with the STL_EXPLAIN Redshift system table? Unsure if it can help you here, but it was news to me and I think it's relevant.
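
For reference, a rough sketch of how you might query it (the query id is a placeholder):

select query, nodeid, parentid, plannode, info
from stl_explain
where query = 12345  -- placeholder query id
order by nodeid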

I think your approach is a reasonable one given how dbt works currently. I think that long-term, this probably shouldn't be a CLI arg, but I'm not 100% certain what form it will take. I'm just thinking about other operations like vacuum and analyze... we probably wouldn't want to add a different CLI arg for every conceivable operation like this!

I'm lumping explain in with vacuum and analyze because these are queries that aren't part of the dependency graph and don't result in a materialized view or table. We plan on categorizing these queries as operations, and providing a way to run them from the CLI, e.g. (syntax not final):

$ dbt run operation explain --model my_model
# or
$ dbt run operation vacuum --model my_model

Here, you'd have two operations (essentially macros) called explain and vacuum. These could be distributed through dbt packages.
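
As a purely speculative sketch (not final syntax), an explain operation could be a macro along these lines; it assumes the macro receives the SQL to explain and simply prepends EXPLAIN to it:

{% macro explain(sql_text) %}
    {# speculative sketch: prepend EXPLAIN and log each line of the plan #}
    {% if execute %}
        {% set results = run_query('explain ' ~ sql_text) %}
        {% for row in results.rows %}
            {{ log(row[0], info=True) }}
        {% endfor %}
    {% endif %}
{% endmacro %}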

We're a decent ways out from implementing this, but I'm grateful that you shared your use case with us! I'll keep this thread up-to-date -- let me know if you go ahead and implement this yourself!

@adamhaney

We're actually using Postgres for the majority of our workload (FDW to Redshift for some large tables), but I'll check that out for anything that uses Redshift.

To expand on our use case: we're running dbt in Airflow, and there are times when we want Airflow to execute a query in the same schema that dbt is using, but we don't want to duplicate the dbt config into Airflow in case we ever accidentally change one and not the other. Would it ever make sense to just have a direct

dbt execute $QUERY --target=prod

so we could still use our profiles.yml for connection info but be able to execute queries more flexibly? (If I've strayed too far from the topic of this ticket, I can discuss this with you elsewhere.)

@drewbanin
Contributor Author

Wow! Does fdw mean Foreign Data Wrapper here? I've never played around with that and I'm super curious how it works with Redshift.

I don't think there's a world in which dbt execute $QUERY --target prod will ever exist. I am however very interested in exposing the dbt python module in a way that supports queries against configured profiles. Something like:

> import dbt
> client = dbt.get_profile(target='prod')
> client.execute("select * from ...")

We see this come up a lot in Jupyter Notebooks that query dbt models. Just for kicks, I've also played around with things like:

> import dbt
> model = dbt.models['my_model'] 
> print(model)
{"name": "my_model", "compiled_sql": "select * from ..."}

That's something we're super interested in, but we have a couple of high priority features that will likely take precedence here.

@michael-erasmus

Curious if there has been any further thought or work on this? One more use case that could be really handy would be quickly testing for errors in all models in a dbt project without having to run all the model queries (EXPLAIN would fail if there are any errors in the generated queries).

@drewbanin
Contributor Author

drewbanin commented Mar 23, 2018

@michael-erasmus yeah, explain feels like a great way to "dry run" a model.

One of the challenges here is that models will need to already exist in order for this approach to work. For instance:

-- models/a.sql

{{ config(materialized='table') }}

select 1 as id

and

-- models/b.sql

{{ config(materialized='table') }}

select * from {{ ref('a') }}

If model a doesn't exist in the database, then running an explain plan for model b will fail, even though the code is valid.

So, I think there are two ways to handle this:

  1. Recognize that this approach is best suited for minor updates to existing models. Good for quick sanity checks, but not applicable in every scenario.
  2. Materialize all refs as ephemeral to validate models without knowing anything about the current state of dbt models in the database.

I think approach (2) is pretty slick, but I wonder if Redshift is going to have a hard time planning queries for deeply nested models. Plus, the cost values returned would not be representative of the actual cost to build the model. Maybe this could be an option you turn on?
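
To make (2) concrete: with a treated as ephemeral, b would compile to roughly the following, which could then be dry-run with EXPLAIN (the CTE name is just illustrative of how ephemeral models get inlined):

-- rough compilation of models/b.sql with `a` as ephemeral
explain
with __dbt__cte__a as (
    select 1 as id
)
select * from __dbt__cte__a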

Separately, I know some folks want the ability to run schema tests without materializing models. I think that could use a similar mechanism to (2) above, so it's definitely intriguing! What do you think @michael-erasmus?

@michael-erasmus

Hey @drewbanin, thanks for the quick response!

One of the challenges here is that models will need to already exist in order for this approach to work.

TBH, I didn't even consider this when I wrote my comment. The use case I had in mind came about from a recent small change we made to a pretty central model in our project (renaming a column that's used in a bunch of joins in other models). In this specific case, all the models do exist, but we wanted to make sure we're catching any errors introduced by the change.

I do like the idea of making all the refs ephemeral, though, and that could totally still work for the dry-running-models case, but yeah, the explain output would probably not mean much in that case.

I might be over-complicating things even more, but what if you also had a flag to switch all the models to ephemeral?

@drewbanin
Contributor Author

@michael-erasmus I think that a CLI flag to make models ephemeral, plus a flag for a dry-run/explain, is a pretty good idea. I worry, though, that the interface to dbt will start to become bloated, confusing, and inflexible. Plus, what if you only want some models to be ephemeral...

I think this is something we'll be able to tackle when we expose an API for dbt. In this world, you'd be able to define a dbt job in Python. That would make it possible to configure some models as ephemeral, run explains on other models, etc etc.

This is total pseudocode, but:

def sanity_check():
  models = dbt.get_models()
  models = models.materialize("ephemeral")
  dbt.run(models, explain=True)

dbt.jobs.register("sanity-check", sanity_check)

....

$ dbt run job sanity-check

I'm not thrilled about blocking interesting new features behind our eventual stable API, but I also feel that features like this will be so much more powerful once they're configurable in code! You buy that?

@michael-erasmus

I would love a Python API actually! We're really keen to plug dbt into the way we run some of our other scheduled Python scripts along with some other standard things we include in production (like Bugsnag integration, custom logging, etc). Being able to wrap a dbt job with custom code like this would be perfect, and I like the flexibility you then gain for things like running an explain, etc...

@drewbanin
Contributor Author

Great! Obviously producing a stable API is going to require a lot of thought, but I feel confident this is a good decision based on feedback like this! Thanks! And let me know if you have any other ideas here.

@lukas1994

Any updates here? I'm trying to debug a VERY slow dbt query. Do you have any advice? It'd be super helpful to have a dbt explain tool.

@Olaktal

Olaktal commented Mar 2, 2022

Hello! Any updates on this feature?
We would like to EXPLAIN our queries in order to check that they aren't wrong (missing columns, etc.).
If anyone has an idea of how to do this another way, please share.

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Aug 30, 2022
@tom-juntunen-adsk

tom-juntunen-adsk commented Sep 1, 2022

Any updates on this issue?

I am looking to generate execution plans for queries before they go to production, to see whether partition pruning is taking place and whether we can mitigate issues in advance, such as non-sargable join predicates or WHERE filters containing function calls.

Here is my working macro. I have only figured out how to pass the query text by hand using the --vars CLI option.

Is there an easier way for me to provide compiled SQL text in this macro so that I can generate an explain plan for a dbt test?

{% macro metric_execution_plan (model) %}

    {% set execution_plan_query %}
        select 
            "operation" as operation, 
            sum("partitionsAssigned") as partitions_assigned, 
            sum("partitionsTotal") as partitions_total,
            div0(sum("partitionsAssigned"), sum("partitionsTotal")) as prune_factor
        from table(
          explain_json(
            select SYSTEM$EXPLAIN_PLAN_JSON($$ {{ var('sql') }} $$)
           )
        )
        where operation = 'GlobalStats' 
        group by operation
    {% endset %}

    {% set results = run_query(execution_plan_query) %}

    {% if execute %}
    {{ log("results_show: " ~ results.print_table(), True) }}
    {% set results_list = results.rows %}
    {% else %}
    {% set results_list = [] %}
    {{ log("results_list: " ~ results_list, True) }}
    {% endif %}

    {{ return(results_list) }}

{% endmacro %}

@jtcohen6
Contributor

jtcohen6 commented Sep 1, 2022

@tom-juntunen-adsk No updates on this work from our end, though I know I've heard folks ask about it. I'm happy to remove the stale label for now.

I think the challenge is still around handling upstream refs:

  • either upstream models have to exist in the database
  • or they need to be rewritten as CTEs stacked on CTEs (ephemeral) — leading to a more complex query
  • or they need to be "deferred" to versions already existing in production

It would be possible to define a custom explain materialization using much the same code that you have now. The model's compiled SQL is available in the materialization context as {{ sql }}. The biggest difference from your proposal is that explain would be executed by dbt run, rather than dbt test.

The original proposal in this issue was to use run-operation with a custom explain macro. The problem is that the run-operation task doesn't include a model compilation step, and there isn't a good way today to access a model's compiled_sql, outside the materialization context.
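
For concreteness, here is a minimal, untested sketch of what such an explain materialization could look like, assuming a warehouse that accepts a bare EXPLAIN prefix and using the standard statement / load_result helpers:

{% materialization explain, default %}

    {# `sql` is the model's compiled SQL, available in the materialization context #}
    {% call statement('explain_plan', fetch_result=True) %}
        explain {{ sql }}
    {% endcall %}

    {# log the plan instead of building a relation #}
    {% set plan = load_result('explain_plan') %}
    {{ log(plan['data'], info=True) }}

    {{ return({'relations': []}) }}

{% endmaterialization %}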

@jtcohen6 jtcohen6 added Team:Adapters Issues designated for the adapter area of the code and removed stale Issues that have gone stale labels Sep 1, 2022
@tom-juntunen-adsk

tom-juntunen-adsk commented Sep 2, 2022

@jtcohen6 Thanks for the quick responsiveness on this issue.

Regarding the upstream ref challenges:

  • In our case, upstream models will already exist in the database when running the explain plan macro for a downstream model. For now, we have only figured out how to generate the explain query using the query text from Snowflake, which we have to run dbt run to generate anyway, so this is not an issue for us.
  • It would be helpful not to have to query Snowflake's query history for this purpose, if there is a better way to load in the SQL compiled by dbt directly. You mentioned this here:

The model's compiled SQL is available in the materialization context as {{ sql }}.

How can I modify the above macro to materialize the sql properly for this query?

  • I am having a hard time figuring out whether I need to load the compiled SQL from /target/compiled myself, or whether there is a way to, as you say, materialize the SQL so that it is compiled to a string before the query is sent to Snowflake, since the execution plan command I show requires a string literal as its argument.
  • If that would work, then I would move this macro call from dbt test over to dbt run as a post_hook command so it runs without having to parse the manifest.json twice.
  • In this scenario we would run into the problem of not having upstream tables modeled, but I think we can just set up a process whereby we make sure to call dbt run --target dev <model_name> for the upstream models before we attempt an execution plan analysis from a downstream model. By having the explain plan run as a post_hook command in dbt run, we actually avoid that issue altogether as long as we run dependent models first using +.

If dbt can solve explain plans in a more elegant way, this could extend to other types of performance queries beyond the explain plan:

  • For instance (for Snowflake customers), clustering tests based on SYSTEM$CLUSTERING_INFORMATION can help identify more than just the number of partitions used/scanned, offering two additional benefits (see the sketch below):

    • quickly evaluating multiple clustering configurations to decide how best to add new ones
    • reviewing the effectiveness of clustering beyond partition counts by looking at "constant partitions", i.e. ones where all values for the cluster field are the same on that partition; these partitions can be pruned by Snowflake using just metadata.
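
A minimal sketch of that kind of check (the table name and candidate clustering key are hypothetical):

-- evaluate how well a candidate clustering key fits an existing table
select system$clustering_information('analytics.my_model', '(event_date, account_id)');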

@github-actions
Contributor

github-actions bot commented Mar 2, 2023

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues that have gone stale label Mar 2, 2023
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions github-actions bot closed this as not planned Mar 10, 2023
@mooperd

mooperd commented Sep 23, 2023

+1

@asim-ahmed-zocdoc

Was this issue ever resolved? I'd like to be able to pair this with Slim CI & --empty to enable us to get performance evaluations of a model WITHOUT having to run the data in full in CI.
