Support for Database queries/Views as data sources #2945

Closed
mtsz-thiago opened this issue Dec 12, 2019 · 3 comments

Labels
awaiting response we are waiting for your reply, please respond! :) feature request Requesting a new feature

Comments

@mtsz-thiago

As it is not always convenient to keep all data sources as text files, CSV, and/or related formats, I think it would be nice to declare views and queries from databases as a project's data sources.

@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Dec 12, 2019
@efiop
Contributor

efiop commented Dec 12, 2019

Hi @mtsz-thiago! Thanks for the request. Could you please elaborate? Maybe share some thoughts on how you see that working in DVC.

@efiop efiop added awaiting response we are waiting for your reply, please respond! :) feature request Requesting a new feature labels Dec 12, 2019
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Dec 12, 2019
@dmpetrov
Member

@mtsz-thiago DVC usually comes in once you have already extracted data from the DB to files and are ready for the ML phase (which usually consumes raw data).
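The extract-to-file step described above can be sketched roughly as follows. This is a minimal illustration, not part of DVC: the helper `export_query_to_csv`, the `events` table, and the `events_a` view are hypothetical, and an in-memory SQLite database stands in for a real one.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_query_to_csv(conn, query, out_path):
    """Run `query` on `conn` and write the result set, with a header row, to `out_path`."""
    cursor = conn.execute(query)
    header = [col[0] for col in cursor.description]
    rows_written = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in cursor:
            writer.writerow(row)
            rows_written += 1
    return rows_written


# Hypothetical stand-in database: one table and one view over it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, label TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "a")])
conn.execute("CREATE VIEW events_a AS SELECT id, label FROM events WHERE label = 'a'")

# Materialize the view as a plain file that a file-based tool can pick up.
out_file = Path(tempfile.mkdtemp()) / "events_a.csv"
n_rows = export_query_to_csv(conn, "SELECT id, label FROM events_a ORDER BY id", out_file)
```

The resulting CSV is an ordinary file, so it could then be tracked the usual way, for example with `dvc add`.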

However, we are thinking about a tighter integration with DBs (see #1577 and #2378), where you can version and control the ML phase and have a hook that gives you some control over the DB.

If your scenario is not ML but more analytical, and you spend 100% of your time in the DB, then dbt (data build tool) might be a better fit for you.

If I understand your question correctly - this issue is a duplicate of #1577. Please let me know if it's not.

@mike-weinberg

Data governance on S3 is painful, data access on S3 is slow, and S3 has no built-in compute. Warehouse environments like Snowflake, BigQuery, and (recently) Redshift all offer low-cost storage pricing. In many cases a data science project involves performing data prep on a data warehouse, so why not let a database table be an option for a storage backend? This would greatly reduce the amount of glue code required to make DVC work.

dbt is a good tool, but it's often more overhead than data scientists want. Don't even get me started on Airflow for development.

It's becoming increasingly common for companies to implement snapshot strategies for their raw data because storage is so cheap on BQ, Snowflake, etc., so a snapshot timestamp plus a set of tables is a great abstraction that DVC could leverage. Food for thought.
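To make the "snapshot timestamp plus a set of tables" idea concrete, here is a rough sketch of how such a reference might be reduced to a stable identifier. The `WarehouseSnapshot` class and the table names are hypothetical, not an existing DVC API, and the sketch assumes the warehouse guarantees that a table's contents as of a given timestamp never change.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class WarehouseSnapshot:
    """Hypothetical reference to a set of tables as of one point in time."""

    tables: tuple
    taken_at: datetime  # expected to be timezone-aware

    def fingerprint(self):
        # Sort table names and normalize the timestamp to UTC so that the
        # same logical snapshot always hashes to the same value.
        payload = json.dumps(
            {
                "tables": sorted(self.tables),
                "taken_at": self.taken_at.astimezone(timezone.utc).isoformat(),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


ts = datetime(2019, 12, 12, tzinfo=timezone.utc)
snap_a = WarehouseSnapshot(("raw.orders", "raw.users"), ts)
snap_b = WarehouseSnapshot(("raw.users", "raw.orders"), ts)  # same tables, different order
snap_c = WarehouseSnapshot(("raw.orders", "raw.users"), datetime(2019, 12, 13, tzinfo=timezone.utc))
```

A fingerprint like this could play the role that a file checksum plays for file-based data: it changes only when the table set or the snapshot time changes, without copying any rows out of the warehouse.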


4 participants