
Allow extraction of data via passed in SQL query #627

Open
gsantia opened this issue Aug 10, 2021 · 8 comments
gsantia commented Aug 10, 2021

Sometimes users will query the prod kf-postgres database for data using SQL and save the results into a format which the ingest library comprehends, then use these files in an extract config. We could allow .sql files to be input to the extract stage and bypass this manual work.

Not entirely sure how credentials would work here, and what steps may be needed to sanitize SQL statements before running them on production data.

@gsantia gsantia added the feature New functionality label Aug 10, 2021
@chris-s-friedman (Contributor)

So I'm thinking about a few things here. To do this:

  1. The user passes as source_data_url the connection URL for whatever database they want to query, e.g. postgresql://{USER_NAME}:{PASSWORD}@{PG_HOST}:{PG_PORT}/{PG_DBNAME}
  2. Add functionality and changes to common.file_retriever to handle database protocols:
    A. Add entries for the different database protocols to common.file_retriever._getters.
    B. Add a getter for database connections. This would try to connect to the database, error out if the connection fails, and otherwise return None.
  3. In common.io, add a function read_sql_df that takes the source_data_url and some SQL and returns the results of pandas.read_sql().
  4. Make common.io.read_df know when to dispatch to read_sql_df.
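Steps 3 and 4 above could be sketched roughly like this. The query argument, the DB_PROTOCOLS list, and the fall-through behavior are my assumptions for illustration, not the library's actual API:

```python
import pandas
import sqlalchemy

# Assumed set of URL schemes that should be treated as databases (step 2A).
DB_PROTOCOLS = ("postgresql://", "mysql://", "sqlite://")


def read_sql_df(source_data_url, query):
    """Run `query` against the database at `source_data_url` (step 3)."""
    engine = sqlalchemy.create_engine(source_data_url)
    try:
        return pandas.read_sql(query, engine)
    finally:
        engine.dispose()


def read_df(source_data_url, query=None, **kwargs):
    """Dispatch database URLs to read_sql_df (step 4)."""
    if any(source_data_url.startswith(p) for p in DB_PROTOCOLS):
        return read_sql_df(source_data_url, query)
    # ... fall through to the existing file-based readers ...
    raise NotImplementedError("file-based readers elided in this sketch")
```

For example, `read_df("postgresql://...", query="SELECT * FROM file_metadata.hashes")` would return the query results as a DataFrame, while plain file URLs would take the existing path.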

@fiendish (Contributor)

Why are we ingesting details that are already in the database?

@chris-s-friedman (Contributor)

This would be for pulling from any database; it could be the D3b warehouse. What's at the top of my mind is pulling from the file_metadata schema in kf-postgres.

@fiendish (Contributor)

What's at the top of my mind is pulling from the file_metadata schema in kf-postgres.

Why?

@chris-s-friedman (Contributor)

Not having to run S3 scrapes manually, since scrapes now land in the file_metadata.aws_scrapes table. Also, assigning hashes from the file_metadata.hashes table.


fiendish commented Aug 14, 2021

So you're putting S3 file metadata into its own table in the same database as the dataservice, but not creating GFs in the dataservice so that they would get registered in indexd? Is there something blocking registering GFs in the dataservice?

@chris-s-friedman (Contributor)

Correct. Not everything in S3 gets registered. That table allows comparing files between buckets and between accounts, e.g. letting data ops see whether every file in a bucket is replicated in a DR bucket in another account and whether the hashes of those files match.

It also allows seeing which files in S3 have not yet been registered.


fiendish commented Aug 14, 2021

Anyway, as for the general problem...

If we add a sql protocol like you're suggesting, it will require a way to pass a second argument to the sql engine for the desired query. You can't currently just use the FileRetriever to create the engine and then expect the read_func to do the query, because the read_func is handed a file name, not an arbitrary object.

You could refactor FileRetriever.py/extract.py to return in-memory bytestreams instead of writing to temporary files, and then apply the reader to the bytestream instead of a file path. Then you could make the file retriever create an engine for you and then have read_func do your query.
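For illustration of why the bytestream refactor would work: pandas readers already accept file-like objects, so a retriever that returned an in-memory stream would plug straight into the existing read functions. Here read_from_stream is a hypothetical helper, not part of the library:

```python
import io
import pandas

def read_from_stream(stream, read_func):
    """Apply a pandas-style read_func to an in-memory bytestream
    instead of a temp-file path (hypothetical helper)."""
    return read_func(stream)

# e.g. bytes that a refactored FileRetriever might hand back:
csv_stream = io.BytesIO(b"kf_id,hash\nGF_1,abc\n")
df = read_from_stream(csv_stream, pandas.read_csv)
```

With that in place, a database "retriever" could hand back an engine (or the query results themselves) instead of a file path.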

Or you could just allow an unset source url as long as do_after_read is defined and then do whatever query you want in do_after_read and return a DataFrame. This is probably easier.
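A sketch of that easier route, assuming the do_after_read hook really can return a DataFrame that replaces the (empty) input. The SQLite URL and query here are stand-ins; in practice this would be a postgresql:// URL and a real query:

```python
import pandas
import sqlalchemy

# Hypothetical do_after_read hook for an extract config with no source URL:
# ignore the empty input and return the query results as the DataFrame.
def do_after_read(df):
    engine = sqlalchemy.create_engine("sqlite://")  # stand-in connection URL
    try:
        return pandas.read_sql("SELECT 'GF_1' AS kf_id", engine)
    finally:
        engine.dispose()
```

Credentials could then stay out of the config entirely, e.g. read from the environment inside the hook.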
