
Allow extraction of data via passed in SQL query #627

Open
gsantia opened this issue Aug 10, 2021 · 8 comments
gsantia commented Aug 10, 2021

Sometimes users will query the prod kf-postgres database for data using SQL and save the results into a format which the ingest library comprehends, then use these files in an extract config. We could allow .sql files to be input to the extract stage and bypass this manual work.

Not entirely sure how credentials would work here, and what steps may be needed to sanitize SQL statements before running them on production data.

@gsantia gsantia added the feature New functionality label Aug 10, 2021
@chris-s-friedman (Contributor)

So I'm thinking about a few things here. To do this:

  1. The user passes as source_data_url the connection URL for whatever database they want to query, e.g. postgresql://{USER_NAME}:{PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DBNAME}
  2. Add functionality and changes to common.file_retriever to handle database protocols:
    A. Add entries for the different database protocols to common.file_retriever._getters.
    B. Add a getter for database connections. This would try to connect to the database, error out if the connection fails, and otherwise return None.
  3. In common.io, add a function read_sql_df that takes the source_data_url and some SQL and returns the results of pandas.read_sql().
  4. Make common.io.read_df know when to dispatch to read_sql_df.
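Steps 3 and 4 above could be sketched roughly like this. The query argument, the DB_PROTOCOLS list, and the fall-through behavior are my assumptions for illustration, not the library's actual API:

```python
import pandas
import sqlalchemy

# Assumed set of URL schemes that should be treated as databases (step 2A).
DB_PROTOCOLS = ("postgresql://", "mysql://", "sqlite://")


def read_sql_df(source_data_url, query):
    """Run `query` against the database at `source_data_url` (step 3)."""
    engine = sqlalchemy.create_engine(source_data_url)
    try:
        return pandas.read_sql(query, engine)
    finally:
        engine.dispose()


def read_df(source_data_url, query=None, **kwargs):
    """Dispatch database URLs to read_sql_df (step 4)."""
    if any(source_data_url.startswith(p) for p in DB_PROTOCOLS):
        return read_sql_df(source_data_url, query)
    # ... fall through to the existing file-based readers ...
    raise NotImplementedError("file-based readers elided in this sketch")
```

For example, `read_df("postgresql://...", query="SELECT * FROM file_metadata.hashes")` would return the query results as a DataFrame, while plain file URLs would take the existing path.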

@fiendish (Contributor)

Why are we ingesting details that are already in the database?

@chris-s-friedman (Contributor)

This would be for pulling from any database; it could be the D3b warehouse. What's at the top of my mind is pulling from the file_metadata schema in kf-postgres.

@fiendish (Contributor)

What's at the top of my mind is pulling from the file_metadata schema in kf-postgres.

Why?

@chris-s-friedman (Contributor)

Not having to run S3 scrapes manually, since scrapes now land in the file_metadata.aws_scrapes table. Also, assigning hashes from the file_metadata.hashes table.


fiendish commented Aug 14, 2021

So you're putting S3 file metadata into its own table in the same database as the dataservice, but not creating GFs in the dataservice so that they would get registered in indexd? Is there something blocking registering GFs in the dataservice?

@chris-s-friedman (Contributor)

Correct. Not everything in S3 gets registered. That table allows comparing files between buckets and between accounts, e.g. letting data ops see whether every file in a bucket is replicated in a DR bucket in another account and whether the hashes of those files match.

It also allows seeing which files in S3 have not yet been registered.


fiendish commented Aug 14, 2021

Anyway, as for the general problem...

If we add a sql protocol like you're suggesting, it will require a way to pass a second argument to the sql engine for the desired query. You can't currently just use the FileRetriever to create the engine and then expect the read_func to do the query, because the read_func is handed a file name, not an arbitrary object.

You could refactor FileRetriever.py/extract.py to return in-memory bytestreams instead of writing to temporary files, and then apply the reader to the bytestream instead of a file path. Then you could make the file retriever create an engine for you and then have read_func do your query.
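For illustration of why the bytestream refactor would work: pandas readers already accept file-like objects, so a retriever that returned an in-memory stream would plug straight into the existing read functions. Here read_from_stream is a hypothetical helper, not part of the library:

```python
import io
import pandas

def read_from_stream(stream, read_func):
    """Apply a pandas-style read_func to an in-memory bytestream
    instead of a temp-file path (hypothetical helper)."""
    return read_func(stream)

# e.g. bytes that a refactored FileRetriever might hand back:
csv_stream = io.BytesIO(b"kf_id,hash\nGF_1,abc\n")
df = read_from_stream(csv_stream, pandas.read_csv)
```

With that in place, a database "retriever" could hand back an engine (or the query results themselves) instead of a file path.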

Or you could just allow an unset source url as long as do_after_read is defined and then do whatever query you want in do_after_read and return a DataFrame. This is probably easier.
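A sketch of that easier route, assuming the do_after_read hook really can return a DataFrame that replaces the (empty) input. The SQLite URL and query here are stand-ins; in practice this would be a postgresql:// URL and a real query:

```python
import pandas
import sqlalchemy

# Hypothetical do_after_read hook for an extract config with no source URL:
# ignore the empty input and return the query results as the DataFrame.
def do_after_read(df):
    engine = sqlalchemy.create_engine("sqlite://")  # stand-in connection URL
    try:
        return pandas.read_sql("SELECT 'GF_1' AS kf_id", engine)
    finally:
        engine.dispose()
```

Credentials could then stay out of the config entirely, e.g. read from the environment inside the hook.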
