Support for Database queries/Views as data sources #2945
Comments
Hi @mtsz-thiago! Thanks for the request. Could you please elaborate? Maybe share some thoughts on how you see that working in DVC.
@mtsz-thiago DVC usually works when you have already extracted data from the DB to files and are ready for the ML phase (which usually consumes raw data). However, we are thinking about a tighter integration with DBs (see #1577 and #2378), where you can version and control the ML phase and have some hook with some control over the DB. If your scenario is not ML but more analytical, and you spend 100% of your time in the DB, then dbt (data build tool) might be a better fit for you. If I understand your question correctly, this issue is a duplicate of #1577. Please let me know if it's not.
Data governance on S3 is painful, data access on S3 is slow, and S3 has no built-in compute. Warehouse environments like Snowflake, BigQuery, and (recently) Redshift all offer low-cost storage pricing. In many cases a data science project involves performing data prep on a data warehouse, so why not let a database table be an option for a storage backend? This would greatly reduce the amount of glue code required to make DVC work. dbt is a good tool, but oftentimes it's more overhead than data scientists want. Don't even get me started on Airflow for development. It's becoming increasingly common for companies to implement snapshot strategies for their raw data because storage prices are so cheap on BigQuery, Snowflake, etc., so a snapshot timestamp plus a set of tables is a great abstraction that DVC could leverage. Food for thought.
As it is not always convenient to keep all data sources as text files, CSV, and/or related formats, I think it would be nice to declare views and queries from databases as a project's data sources.
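To make the request a bit more concrete, here is a minimal sketch of what treating a query or view as a data source could mean: run the query, serialize the result deterministically, and fingerprint it so that a change in the underlying rows is detectable. This is not DVC's API. The helper names `materialize_query` and `fingerprint`, the `events` table, and the `labelled` view are all hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import hashlib
import io
import sqlite3


def materialize_query(conn, query):
    """Run `query` and return its result as CSV bytes (header row + data rows).

    Hypothetical helper: the query should include an ORDER BY so that the
    serialized output is deterministic.
    """
    cursor = conn.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor.fetchall())
    return buffer.getvalue().encode("utf-8")


def fingerprint(data):
    """Return a SHA-256 hex digest of the serialized query result."""
    return hashlib.sha256(data).hexdigest()


# Stand-in database with one table and one view (both made up for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, label TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.execute(
    "CREATE VIEW labelled AS SELECT id, label FROM events WHERE label IS NOT NULL"
)

query = "SELECT id, label FROM labelled ORDER BY id"

snapshot = materialize_query(conn, query)
digest_before = fingerprint(snapshot)

# Changing the underlying table changes the fingerprint of the view's contents.
conn.execute("INSERT INTO events VALUES (3, 'c')")
digest_after = fingerprint(materialize_query(conn, query))

print(digest_before != digest_after)
```

The serialized bytes could then be written to a file and tracked like any other data file, with the digest acting as a cheap "did the source change?" check. Hashing a full result set will not scale to large warehouse tables, which is where the snapshot-timestamp idea mentioned in the comments would be a more practical key.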