Allow extraction of data via passed-in SQL query #627
So I'm thinking about a few things here to do this.
Why are we ingesting details that are already in the database?
This would be for pulling from any DB; it could be the D3b warehouse. What's at the top of my mind is pulling from the file_metadata schema in KFpostgres.
Why?
Not having to run S3 scrapes manually, since scrapes are now getting into the file_metadata.aws_scrapes table. Also assigning hashes from the file_metadata.hashes table.
So you're putting S3 file metadata into its own table in the same database as the dataservice, but not creating GFs in the dataservice so that they would get registered in indexd? Is there something blocking registering GFs in the dataservice?
Correct. Not everything in S3 gets registered. That table allows comparing files between buckets and between accounts, e.g. letting data ops see whether every file in a bucket is replicated in a DR bucket in another account and whether the hashes of those files match. It also allows seeing which files in S3 have not yet been registered.
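For illustration, a replication check like that could be a single query against the scrape table. This is only a sketch: the file_metadata.aws_scrapes and file_metadata.hashes table names come from this thread, but the column names (bucket, key, hash), bucket names, and connection URL below are all assumptions, not the real schema.

```python
# Hypothetical sketch: list objects in a primary bucket that are missing
# from the DR bucket, or present there with a mismatched hash. Column
# names, bucket names, and the connection URL are placeholders.
import pandas
import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql://user:pass@kf-postgres-host/kfpostgres"  # placeholder
)

QUERY = """
SELECT src.key, src.hash AS src_hash, dr.hash AS dr_hash
FROM file_metadata.aws_scrapes src
LEFT JOIN file_metadata.aws_scrapes dr
    ON dr.key = src.key
    AND dr.bucket = 'my-dr-bucket'
WHERE src.bucket = 'my-primary-bucket'
    AND (dr.key IS NULL OR dr.hash IS DISTINCT FROM src.hash)
"""

# Any rows returned are unreplicated files or mismatched copies.
problems = pandas.read_sql(QUERY, engine)
```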
Anyway, as for the general problem... If we add a sql protocol like you're suggesting, it will require a way to pass a second argument to the sql engine for the desired query. You can't currently just use the FileRetriever to create the engine and then expect the read_func to do the query, because the read_func is handed a file name, not an arbitrary object. You could refactor FileRetriever.py/extract.py to return in-memory bytestreams instead of writing to temporary files, and then apply the reader to the bytestream instead of a file path. Then you could make the file retriever create an engine for you and have read_func do your query. Or you could just allow an unset source url as long as do_after_read is defined, and then do whatever query you want in do_after_read and return a DataFrame. This is probably easier.
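A minimal sketch of that second option, assuming the library were changed to tolerate an unset source URL and that do_after_read's return value replaces the extract DataFrame, as described above; the connection URL and query are placeholders:

```python
# Hypothetical extract config sketch: no source file is fetched; instead
# do_after_read runs the query itself and returns the resulting DataFrame.
# Allowing source_data_url = None is an assumed change that does not
# exist yet; the engine URL and query are placeholders.
import pandas
import sqlalchemy

source_data_url = None  # would require relaxing the current URL check

def do_after_read(df):
    # df would be empty here, since nothing was retrieved or read
    engine = sqlalchemy.create_engine(
        "postgresql://user:pass@kf-postgres-host/kfpostgres"
    )
    return pandas.read_sql("SELECT * FROM file_metadata.aws_scrapes", engine)
```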
Sometimes users will query the prod kf-postgres database for data using SQL, save the results into a format which the ingest library understands, then use these files in an extract config. We could allow .sql files to be input to the extract stage and bypass this manual work. I'm not entirely sure how credentials would work here, or what steps may be needed to sanitize SQL statements before running them on production data.