Define source tables #814
Comments
I like this a lot. I have a couple of questions that popped into my head from reading this that we'll likely have to answer before completing this feature:
This looks great. The only thing I'm not 100% sure on is sql_table_name. So instead of sql_table_name, we'd have:
```yaml
sources:
  - name: snowplow
    description: "A snowplow dataset"
    tables:
      - name: snowplow_event
        description: An immutable log of events collected by Snowplow
        schema: snowplow
        identifier: event
        columns:
          - name: collector_tstamp
            description: Timestamp for the event recorded by the collector
          - name: domain_userid
            description: User ID set by Snowplow using 1st party cookie
            tests:
              - unique
      - name: web_page
        description: "{{ docs('snowplow_web_page') }}"
        # this used to be one var, which is unfortunate! hard to convert.
        # sql_table_name: "{{ var('snowplow:web_page') }}"
        schema: snowplow
        identifier: "{{ var('snowplow:web_page') }}"
        columns:
          - name: event_id
            description: "{{ docs('snowplow_web_page.event_id') }}"
```
That was what the initial version looked like, but databases are different in terms of how they address namespacing: do you need to specify a database as well, or just a schema? Is it even possible to select from a different database? The verbiage is different too. Is there a benefit to actually defining
Depends on the warehouse type. For Redshift and Postgres, you can't select from a different database. For Snowflake, you could. And I think in BigQuery you have to specify the project.
We made some decisions in the relations rollout to accommodate this, the big one being that we use
This is true today, but a lot of our internal API for inspecting relations depends on this code being structured a certain way -- so, for example, if we want to inspect the schema of a source table in BigQuery, we actually make an API call with the project, schema, and table name as arguments, instead of just running some arbitrary SELECT. Of course we could always convert this string back to a relation; it's just messy and there's room for bugs there.
All that makes sense. I think that we would have to require
I'm definitely leaning more towards constructing the SQL identifier from a database + schema + table name. Generally, I think there are some real questions to answer around implementation. We also need to figure out how these things fit in. I think we should kick this out of the 0.11.0 milestone, but we should definitely continue to shore up our thinking around implementation in this issue.
This is partially blocking on #550 for source selection. Otherwise, I think we may be in good shape to tackle this one soon. Note that there is a significant set of corresponding changes in the documentation site that will be enabled by this feature. |
This is no longer blocking on #550 and can be prioritized when ready |
@drewbanin I've got a few questions! I've worked around most of them while working out the schema and basic parsing, but I think I should address them before I do much more:
Yeah, you're correct to assume that there are no defaults. I think there are a ton of datasets where monitoring freshness is unnecessary (e.g. manual loads) or otherwise not possible (no loaded_at date provided by the ETL tool). I like your instincts here: users must override both the count and the period, so in the schema.yml freshness contract, both should be required.
I can't think of a good use case for decimal values. Instead of .5 hours, you could just specify 30 minutes. My instinct is that we'd probably want to start with ints, just insofar as it will make rendering any UIs a little easier. These should be integers ≥ 1. I don't feel strongly about this, though; if there's a good reason to support floats, that would be ok too.
Great question! Freshness may only be supplied if the user also supplies a loaded_at_field. Not every dataset will have a loaded_at field, and there are merits to defining sources beyond just calculating data freshness. So, we should make loaded_at_field optional, but throw an error if freshness is supplied without a loaded_at_field.
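Putting those answers together, here is a minimal sketch of what a freshness declaration might look like in schema.yml. The key names and nesting of the count/period pair are illustrative assumptions; the thread only establishes that count and period are both required, that the count is an integer, and that freshness requires a loaded_at_field:

```yaml
sources:
  - name: snowplow
    # required whenever freshness is configured
    loaded_at_field: collector_tstamp
    freshness:
      count: 12     # integer >= 1; required
      period: hour  # required; one of minute, hour, day
    tables:
      - name: snowplow_event
```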
I'm not sure I understand the implications of one approach over the other. Can you elaborate?
Yes, 100% correct!
Possibly? Since we don't have much in the way of namespaces, different function names indicate what kind of thing you're referencing! In a typical base model today, you'd write:
With sources, it would be:
I like the idea of preserving the fact that this is a reference to source data, and would hesitate to just use
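The two snippets in that comment were code blocks that did not survive here; the following is a rough, illustrative sketch of the contrast being described, assuming the snowplow source defined earlier in the thread:

```sql
-- a typical base model today: the source table is hard-coded (or pulled from a var())
select * from snowplow.event

-- with sources, the same model references the declared source instead
select * from {{ source('snowplow', 'snowplow_event') }}
```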
This is a really good question! If sources have access to macros, does that mean that macros can't refer to sources? I'd rather keep these simple: they should be mostly static/unchanging configs, and the fewer moving parts the better, IMO.
I think that all of the source-specific fields should be renderable. Here's the rationale: sources can be defined inside of packages, and the package owners might not know anything about how the source tables are being loaded into the warehouse. So, you have
It would be good to allow these to be configured with variables (see below) as we do in most of our packages, like snowplow. Note: I don't love how we scope and supply variables currently, but that can be a future project.
Really good question. I think the following context variables should be supplied:
Let's make them nodes in the manifest. I don't feel strongly about this though, open to discussing.
While you're correct that they would not be referenced with the
Good question! Let's support both. I can imagine tests like "make sure this table isn't empty" or similar.
Ohhhhh boy. Are there complications in supporting both
Let's add a
We should tackle this at some point (the test names are unreservedly terrible). Changing this will impact the docs site, as we actually parse out referential integrity from these convoluted names :upside-down-smiley-face:. We'll need to give that some love in the future, but no need to poke the sleeping bear right this second.
Early binding is a lot easier, pulling in SourceConfig sounds pretty difficult! When I asked this I thought it might make sense to have one file define a source table in another file's sources section, but given your answers elsewhere it makes less sense.
Without some big changes, I'm pretty confident that it has to be one or the other for macros, it's why we build up the partial macro-only manifest during parsing.
I think the different function names are a fine way to expose it to users. My problem with this is a bit larger: I'm not sure that I like what comes out of the obvious way to implement this. We've already got
It sounds like you want it to include the vars defined in the dbt_project.yml, is that accurate? Finally, the current parser for
Sources just seem to exist in this weird half-in, half-out state. I guess the question here is: would archives be referenced via
I don't think so. As long as you don't form cycles I don't think it matters much. I do have to admit, I'm not clear on how exactly you'd use a
Is this true? How is this different from
@cmcarthur oh, good point! I guess sources will work the same way as refs since the macro level does no interpretation.
@drewbanin it would be super useful if we could set the location for source datasets. Please see this Slack thread for a concrete example. Any workarounds (apart from creating a multi-region dataset) would be appreciated.
Hey @cosoare, do you mind opening a new issue and including all the relevant details? We can continue the discussion there. I want to better understand the pros/cons of single-region vs. multi-region datasets.
Just pasting in the thread details from Slack since they disappear on us :)
You can indeed feel free to create a new issue for us to discuss this change, but I don't really think that dbt can help you here! BigQuery disallows querying datasets across regions. If you did provide a location for source datasets in BQ, there isn't really anything dbt could do with that information. The source data already lives in a specific region, and short of copying that data into a differently named dataset in a different region (something we would not want to do), I don't think there's anything dbt could do to make this work! So, please pop open a new issue to discuss in further detail if you'd like. In particular, I'd like to know what you think dbt should be doing with these dataset locations if provided!
Related: #790
Presently, the link between source data and dbt models is pretty tenuous. dbt should provide a mechanism for defining source tables, and users should be able to select from these source tables with a mechanism similar to ref. Further, these source tables should be documented alongside other dbt resources.

Proposed Syntax

Notes

For more information on the description and columns fields, check out the discussion on the new schema.yml syntax. There are two new constructs here:
sql_table_name

This field will be responsible for pointing to a table in the warehouse. This syntax works for:

BigQuery: project-name.dataset.tablename
Snowflake: database.schema.tablename
pg/Redshift: schema.tablename

Rather than trying to construct the table name automatically, users will need to enter the exact identifier for the source table.
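For illustration, a source table defined with this field might look something like the following. This is a sketch only: the sql_table_name value is a made-up Snowflake-style identifier, and the comments earlier in the thread discuss replacing this field with separate schema/identifier keys:

```yaml
sources:
  - name: snowplow
    loader: Stitch
    tables:
      - name: snowplow_event
        # exact, warehouse-specific identifier; dbt does not construct this automatically
        sql_table_name: analytics.snowplow.event
```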
loader
This will be used for documentation. Loader can be something like Airflow, Stitch, Fivetran, etc. User supplied.
loaded_at_field

This field indicates which timestamp field in the source table should be used as the "loaded at" timestamp. This varies by ETL loader, but for loaders which do supply such a field, we can calculate how long ago certain sources were loaded. This field can be specified at the source level or at the table level. If specified at the table level, the value overrides the value from the source if present.

freshness

This field describes the acceptable latency for source tables. This field can be provided at the source or table level. If provided at the source level, it should apply to all of the tables in the source. If provided at the table level, it should override any freshness configuration at the source level. The freshness value for a table can be set to null to override the value specified in the source.

Acceptable formats:
Acceptable periods are minute, hour, and day.
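A sketch of the source-level default and table-level override behavior described above (the key names for the count/period pair are assumptions; only the override semantics come from this proposal):

```yaml
sources:
  - name: snowplow
    loaded_at_field: collector_tstamp
    freshness:          # applies to every table in this source...
      count: 12
      period: hour
    tables:
      - name: snowplow_event
        freshness:      # ...unless a table overrides it
          count: 1
          period: day
      - name: web_page
        freshness: null # opt this table out of the source-level setting
```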
Usage
To select from a source table, users can use the source function, analogous to ref. This function accepts two arguments: a source and a table; a query against a source will render to the fully-qualified table name (see the sketch below).
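The original issue included an example query and its rendered output here; the following is an illustrative reconstruction, assuming the snowplow source definition shown earlier (schema snowplow, identifier event):

```sql
-- model code:
select * from {{ source('snowplow', 'snowplow_event') }}

-- would render to something like:
select * from snowplow.event
```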
When source() is used, an edge will be added to the graph. This will allow model selection like: "run everything that depends on Snowplow data". In the above example, snowplow_event is derived from the name: parameter of the first table definition. Ex:
See #1014 for info about the selection syntax shown above.
See #1240 for info about model freshness