Epic: Database table/non-file dependencies #9945
This can be limited to Databricks rather than a generic connection to Delta Lake. https://docs.databricks.com/en/dev-tools/index.html might be helpful to research.
Also related: #2378.
Edited to add generalized support for callback deps as a p2 item.
Snowflake has `LAST_ALTERED` in `INFORMATION_SCHEMA.TABLES`. For views and external tables, even though they have `LAST_ALTERED`, …

```sql
use schema snowflake_sample_data.tpcds_sf100tcl;

SELECT LAST_ALTERED
FROM snowflake_sample_data.INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'CUSTOMER' and TABLE_SCHEMA = 'TPCDS_SF100TCL';
```

Snowflake also has a time-travel feature where you can query past versions of a table:

```sql
-- show data from 5 minutes ago
SELECT * FROM customer_table AT(OFFSET => -60*5);
```

We can also use `SELECT hash_agg(*) FROM table;`.

The first one is applicable only to Snowflake, and the second one does not seem very useful by default. The …
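As an illustration of how the `hash_agg` idea could be used to detect table changes, here is a minimal sketch using the Snowflake Python connector. The connection parameters and table name are placeholders, and this is not an existing DVC feature:

```python
# Sketch: fingerprint a Snowflake table with HASH_AGG to detect content changes.
import snowflake.connector

# All connection parameters below are placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    database="snowflake_sample_data",
    schema="tpcds_sf100tcl",
)
cur = conn.cursor()
cur.execute("SELECT HASH_AGG(*) FROM customer")  # one aggregate hash over all rows
(fingerprint,) = cur.fetchone()

# If this value differs from a previously recorded one, the table's contents
# have changed and the downstream query/dump should be rerun.
print(fingerprint)
```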
Dolt is a full-fledged MySQL database with some stored procedures for Git-like operations and versioning capabilities. See https://www.dolthub.com/blog/2023-04-19-dolt-architecture-intro/. cc @shcheklein Also, there have been some attempts at integrating Dolt with DVC.
@skshetry
Notes from discussion with @dmpetrov and @shcheklein:

Other tools to research in this area: …

We also discussed starting even simpler and leaving it to the user to decide when to check for updates and run the query again. In the basic use case of "query and dump," someone wants to run an expensive query, often across multiple tables, and dump the results to a file. DVC caches the results to recover later, and the user chooses when to update those results.

A simple example of how to do this in DVC now would be to write a script to execute the query and dump the results:

```python
# dump.py
import sqlite3

import pandas as pd

# Run the query against the database and dump the results to a CSV file.
conn = sqlite3.connect("mydb.db")
df = pd.read_sql_query("SELECT * FROM table WHERE ...", conn)
df.to_csv("dump.csv")
```

Then wrap that in a DVC stage:
Running …

The simplest approach we can take is to document this pattern. Pending looking deeper into the technologies above, I think anything beyond that will require us to write some database-specific functions to either:

…
Additional considerations:

…
How much do users care about materializing as files compared to materializing as a transient table or a view that they use in the next stage? Materializing to a file is not always possible for large data warehouses/databases, and compute/storage is cheap anyway. Although it depends on what databases we are targeting and whether we are focused on end-to-end scenarios or not.
Discussed in the sprint meeting:

…
BigQuery has a `last_modified_time` field exposed through the `__TABLES__` metadata table:

```sql
SELECT
  table_id,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM
  `project_id.dataset.__TABLES__`
```

There are other ways described in this blog post, which also mentions that there is a way using … There is not a …
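The same information is also available through the BigQuery Python client. A minimal sketch (the table reference is a placeholder, and this is not tied to any existing DVC feature):

```python
# Sketch: read a BigQuery table's last-modified timestamp via the Python client.
from google.cloud import bigquery

client = bigquery.Client()

# "project_id.dataset.table" is a placeholder table reference.
table = client.get_table("project_id.dataset.table")

# `modified` is a datetime; comparing it to a previously recorded value would
# tell a tool like DVC whether the table has changed since the last dump.
print(table.modified)
```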
PostgreSQL >= 9.5 has … Also, this post suggests that it's not that reliable:

…
There is no metadata in …
Databricks has … But no such fields exist for views (not even for DDL changes).
MySQL has a … There is also something called …

For …

Since it is a system table, probably not everyone has access to it.
dbt is conceptually analogous to dvc for databases. One approach to avoid having to build too much ourselves or maintain implementations for different databases is to be opinionated. In other words, we could suggest using dbt for database/data engineering/ETL version control (and see if they are also interested in partnering on this), and then only provide an integration with dbt. For example, dbt already has a way to determine the freshness of a table that we could use to determine when data was last updated (when available). If we want to start with simply dumping a query result, we could provide a way to specify a dbt model to dump.

Besides this approach being easier to implement, the reason to take it would be to have a holistic end-to-end workflow that supports versioning and reproducibility for structured data, starting from ETL all the way to model training and deployment (similar to how dvcx may be upstream of dvc for unstructured data workflows).

Edit: I should also mention downsides to this approach. It's opinionated and requires learning another fairly complex tool.
Different databases offer different authentication and configuration mechanisms. For example, Snowflake supports username/password auth, key-pair auth, MFA, SSO, etc. Redshift supports password-based auth and IAM-based authentication. Similarly, BigQuery supports OAuth, service-account-based login, etc. And these are only a few databases out of many that we'd have to support. Also, each database requires a …

Even if we do start with password-based auth, it is likely going to turn into something more complicated soon, so I don't find the complexity worth it for the …

I am trying to look into dbt, to see if we can reuse their "Connection Profiles" to run the database queries. They have the concept of … Similarly, we could reuse …
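For reference, dbt connection profiles live in a `profiles.yml` file. A rough sketch of one, assuming a Snowflake target (the project name, target, and all credential values are placeholders):

```yaml
# profiles.yml — dbt connection profile (all values are placeholders)
my_dbt_project:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: my_account
      user: my_user
      password: my_password
      role: ANALYST
      database: ANALYTICS
      warehouse: COMPUTE_WH
      schema: PUBLIC
      threads: 4
```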
Also might be worth discussing with dbt from the perspective of other integrations.
@skshetry @shcheklein To give a more concrete proposal from the thoughts above about dbt integration, let me start with the query and dump use case and explain how it might work with dbt.

Assumptions:

…
The user can write their query as a dbt model. This looks like a templated select query, but dbt automatically saves it to a table (it's also possible to save it as a view or a few other variations of how to materialize the model). Then they can dump the results to a dvc-tracked file:
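As an illustration, a dbt model is just a SQL file with templated references; a minimal sketch (the model, table, and column names here are made up for the example):

```sql
-- models/customer_orders.sql  (hypothetical model name)
-- dbt materializes the result of this select as a table (or view).
select
    c.customer_id,
    count(o.order_id) as num_orders
from {{ ref('customers') }} as c
left join {{ ref('orders') }} as o
    on o.customer_id = c.customer_id
group by 1
```

A command to dump that model's results into a DVC-tracked file could then look something like the following; the command name and flags are a hypothetical sketch, not an existing interface:

```sh
# Hypothetical interface: import a dbt model's results as a DVC-tracked file.
dvc import-db path/to/dbt-repo customer_orders -o customer_orders.csv
```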
The arguments are a path to the dbt repo and the name of the dbt model to import. This could perform a …

Some benefits of this approach:

…
Questions:

…
There is also dbt-fal, which provides a function to get a pandas dataframe from a dbt model. But it's one more thing on top of dbt that the user would need to know (though it could be a good starting point to take inspiration from).
Relevant Discord comment: https://discord.com/channels/485586884165107732/563406153334128681/1163602577552846998
Discussed with @skshetry to move forward with a proof of concept on this that we can use to:

…
@skshetry Should we consider Delta Lake tables done since Databricks is a supported backend for dbt? I don't think we can prioritize another effort specific to Delta Lake right now.
Update
See updated proposal below.
Original proposal (outdated)
Summary / Background
Tables/datasets from databases and similar systems can be specified as dependencies in dvc pipelines.
Related:
Scope
Assumptions
Open Questions
Blockers / Dependencies
General Approach
Example:
Steps
Must have (p1)
Optional / followup (p2)
Timelines
TBD by the assignee