workflow for explain queries?
#401
Comments
We're continuing to try to find a solution to this problem. We were considering adding a `--explain` argument to `dbt run` that would prepend EXPLAIN to the query and then dump the EXPLAIN output to stdout. Redshift and Postgres both support simply prepending EXPLAIN to the query itself; I'm not sure about Snowflake and the other adapters (maybe that command just wouldn't be applicable for those backends?).
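For illustration, here's what that prepending would look like against a model's compiled SQL on Postgres or Redshift (the table and column names are invented for this sketch):

```sql
-- EXPLAIN returns the query plan instead of executing the query.
explain
select
    orders.id,
    customers.name
from orders
join customers
    on customers.id = orders.customer_id;
```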
@adamhaney are you familiar with the STL_EXPLAIN Redshift system table? Unsure if it can help you here, but it was news to me and I think it's relevant.

I think your approach is a reasonable one given how dbt works currently. Long-term, though, this probably shouldn't be a CLI arg, but I'm not 100% certain what form it will take. I'm thinking about how this fits alongside dbt's other operations; in that world, you'd have two operations (essentially macros) rather than a flag. We're a decent ways out from implementing this, but I'm grateful that you shared your use case with us! I'll keep this thread up-to-date -- let me know if you go ahead and implement this yourself!
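For reference, STL_EXPLAIN stores the plan rows for queries Redshift has already executed, so plans are queryable after the fact. A minimal sketch (the query id is a made-up value you'd look up in STL_QUERY):

```sql
-- Retrieve the stored plan for a previously executed query.
select query, nodeid, parentid, plannode, info
from stl_explain
where query = 123456  -- hypothetical query id from stl_query
order by query, nodeid;
```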
We're actually using Postgres for the majority of our workload (with an fdw to Redshift for some large tables), but I'll check that out for anything that uses Redshift.

To expand on our use case: we're running dbt in Airflow, and there are times when we want Airflow to execute a query in the same schema that dbt is using, but we don't want to duplicate the dbt config into Airflow in case we ever accidentally change one and not the other. Would it ever make sense to have a direct way to execute queries through dbt, so we could still use our profiles.yml for connection info but more flexibly execute queries? (If I've strayed too far from the topic of this ticket, I can discuss this with you elsewhere.)
Wow! I don't think there's a world in which dbt doesn't eventually grow a programmatic interface like that:

```python
> import dbt
> client = dbt.get_profile(target='prod')
> client.execute("select * from ...")
```

We see this come up a lot in Jupyter Notebooks that query dbt models. Just for kicks, I've also played around with things like:

```python
> import dbt
> model = dbt.models['my_model']
> print(model)
{"name": "my_model", "compiled_sql": "select * from ..."}
```

That's something we're super interested in, but we have a couple of high priority features that will likely take precedence here.
Curious if there has been any further thought or work on this? One more use case where this could be really handy: quickly testing for errors across all the models in a dbt project without having to run all of the model queries (the EXPLAIN would fail if there are any errors in the generated SQL).
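That works because EXPLAIN parses and analyzes the query without executing it, so errors surface immediately. A tiny Postgres illustration (invented names):

```sql
explain select no_such_column from orders;
-- ERROR:  column "no_such_column" does not exist
```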
@michael-erasmus yeah, agreed that this would be useful! One of the challenges here is that upstream models will need to already exist in the warehouse in order for this approach to work. For instance:

```sql
-- models/a.sql
{{ config(materialized='table') }}

select 1 as id
```

and

```sql
-- models/b.sql
{{ config(materialized='table') }}

select * from {{ ref('a') }}
```

If model `a` hasn't been built yet, then EXPLAIN will fail for model `b`, since `ref('a')` points to a relation that doesn't exist. So, I think there are two ways to handle this:

1. Require that upstream models already exist in the warehouse before a model can be explained.
2. Compile refs as if the upstream models were ephemeral, inlining their SQL as CTEs.
I think approach (2) is pretty slick, but I wonder if Redshift is going to have a hard time planning queries for deeply nested models. Plus, the cost values returned would not be representative of the actual cost to build the model. Maybe there's an option to turn this on? Separately, I know some folks want the ability to run schema tests without materializing models. I think that could use a similar mechanism to (2) above, so it's definitely intriguing! What do you think @michael-erasmus?
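To make approach (2) concrete, here's a sketch of what the inlined EXPLAIN for model b above could look like, with ref('a') compiled away into a CTE the way ephemeral models are (the CTE name is illustrative):

```sql
explain
with a as (
    select 1 as id
)
select * from a;
```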
Hey @drewbanin, thanks for the quick response!

TBH, I didn't even consider this when I wrote my comment. The use case I had in mind came about from a recent small change we made to a pretty central model in our project (renaming a column that's used in a bunch of joins in other models). In this specific case, all the models do exist, but we wanted to make sure we were catching any errors introduced by the change.

I do like the idea, though, of compiling all the refs as if they were ephemeral. I might be over-complicating things even more, but what if you also had the option to switch all the models to ephemeral?
@michael-erasmus I think that a CLI flag to make models ephemeral, plus a flag for a dry-run/explain, is a pretty good idea. I worry, though, that the interface to dbt will start to become bloated, confusing, and inflexible. Plus, what if you only want some models to be ephemeral?

I think this is something we'll be able to tackle when we expose an API for dbt. In that world, you'd be able to define a dbt job in Python. That would make it possible to configure some models as ephemeral, run explains on other models, etc. This is total pseudo code, but:

```python
def sanity_check():
    models = dbt.get_models()
    models = models.materialize("ephemeral")
    dbt.run(models, explain=True)

dbt.jobs.register("sanity-check", sanity_check)
```

I'm not thrilled about blocking interesting new features behind our eventual stable API, but I also feel that features like this will be so much more powerful once they're configurable in code! You buy that?
I would love a Python API, actually! We're really keen to plug dbt into the way we run some of our other scheduled Python scripts, along with some other standard things we include in production (like Bugsnag integration, custom logging, etc.). Being able to wrap a dbt job with custom code like this would be perfect, and I like the flexibility you then gain for things like running an explain.
Great! Obviously, producing a stable API is going to require a lot of thought, but I feel confident this is a good decision based on feedback like this. Thanks! And let me know if you have any other ideas here.
Any updates here? I'm trying to debug a VERY slow dbt query. Do you have any advice? It'd be super helpful to have a built-in way to run EXPLAIN on a model here.
Hello!
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.
Any updates on this issue? I am looking to generate execution plans against queries before they go to production, to see whether partition pruning is taking place and whether we can mitigate issues in advance, such as non-sargable join predicates or WHERE filters containing function calls.

I have a working macro, but so far I have only figured out how to pass the query text in by hand using the `--args` flag of `dbt run-operation`. Is there an easier way for me to provide compiled SQL text to this macro, so that I can generate an explain plan for a dbt test?
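A minimal sketch of a macro along those lines -- this is an illustration, not the macro from the comment above, and it assumes the SQL is passed in by hand:

```sql
{% macro explain_query(sql) %}
    {# Prepend EXPLAIN to caller-supplied SQL and log each plan row. #}
    {% set results = run_query('explain ' ~ sql) %}
    {% for row in results.rows %}
        {{ log(row[0], info=True) }}
    {% endfor %}
{% endmacro %}
```

Invoked as something like `dbt run-operation explain_query --args '{sql: "select id from analytics.orders"}'` (the table name here is invented).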
@tom-juntunen-adsk No updates on this work from our end, though I know I've heard folks ask about it. I'm happy to remove the stale label to keep the issue open. I think the challenge is still around handling upstream refs: if a model selects from models that haven't been built yet, there's nothing in the warehouse to plan the query against.

It would be possible to define a custom workaround along these lines today. The original proposal in this issue was to prepend EXPLAIN to each model's compiled query during `dbt run` and dump the plan output to stdout, rather than materializing the model.
@jtcohen6 Thanks for the quick responsiveness on this issue. Regarding the upstream refs: how can I modify the above macro to materialize the SQL properly for this query? If dbt can solve explain plans in a more elegant way, this could parlay into other types of performance queries as well, beyond the explain plan.
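One hypothetical direction for getting at a model's compiled SQL without pasting it by hand -- purely a sketch, assuming a recent dbt version where manifest nodes expose compiled_code, a project named my_project, and that the node has actually been compiled first (`dbt run-operation` does not compile models on its own):

```sql
{% macro explain_model(model_name) %}
    {# Hypothetical: look up the model's node in the project graph, #}
    {# then EXPLAIN its compiled SQL and log the plan rows. #}
    {% set node = graph.nodes['model.my_project.' ~ model_name] %}
    {% set results = run_query('explain ' ~ node.compiled_code) %}
    {% for row in results.rows %}
        {{ log(row[0], info=True) }}
    {% endfor %}
{% endmacro %}
```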
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
+1
Was this issue ever resolved? I'd like to be able to pair this with Slim CI and `--empty` to get performance evaluations of a model WITHOUT having to run the data in full in CI.
Can this work with operations?