Replies: 3 comments
-
I second option 2 as well. For sanitizing the raw data, I think Great Expectations might be a suitable tool. It checks properties of the data and also supports BigQuery.
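To make "checking properties of the data" concrete, here is a minimal sketch of that kind of check in plain Python. It is not Great Expectations' API; the function name `find_bad_rows` and the rule it applies (required columns must exist and be non-empty) are assumptions chosen only for illustration.

```python
import csv
import io


def find_bad_rows(csv_text, required_columns):
    """Return (row_number, reason) pairs for rows that fail basic checks.

    Hypothetical helper: the header counts as row 1, data rows start at 2.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    # First check that every required column is present in the header.
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        return [(1, f"missing columns: {missing}")]

    problems = []
    for row_number, row in enumerate(reader, start=2):
        for column in required_columns:
            # DictReader yields None for fields missing from a short row.
            if not (row[column] or "").strip():
                problems.append((row_number, f"empty value in {column}"))
    return problems
```

A dedicated tool adds many more kinds of expectations and reporting on top of this; the sketch only shows the basic idea.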
-
Go for option 2 as well.
-
In the end I adopted option THREE: 1) upload the raw CSV data with this tool (#23); 2) add one more task to the DAG that queries BigQuery and dumps the result back to a CSV in Airflow; 3) rg-cli uses the CSV as usual. Option THREE has the pros of both option 1 (no need to refactor rg-cli much) and option 2 (we use BigQuery by default and do not need GCS). The trade-off is performance, because of one more I/O layer. However, for the foreseeable future this is not an issue for us because there is not that much data. Great Expectations would be an enhancement of #23. For our goal it is currently overkill, so I sanitize the raw CSV data on our own with the help of #23.
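Step 2 of option THREE could look roughly like the sketch below. It is not the code actually used in the DAG: the function names `dump_rows_to_csv` and `export_bigquery_table` are hypothetical, and the second function assumes the `google-cloud-bigquery` package is installed and credentials are configured.

```python
import csv


def dump_rows_to_csv(rows, fieldnames, path):
    """Write an iterable of dict-like rows to a CSV file; return the row count."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            # Look up each field by name so any mapping-like row type works.
            writer.writerow({name: row[name] for name in fieldnames})
            count += 1
    return count


def export_bigquery_table(query, path):
    """Run a query and dump the result to CSV (assumes google-cloud-bigquery)."""
    # Imported lazily so dump_rows_to_csv stays usable with the stdlib alone.
    from google.cloud import bigquery

    client = bigquery.Client()
    result = client.query(query).result()  # waits for the query job to finish
    fieldnames = [field.name for field in result.schema]
    return dump_rows_to_csv(result, fieldnames, path)
```

Something like `export_bigquery_table` could then be the callable of one more task in the DAG (for example via a `PythonOperator`), with rg-cli reading the resulting file in the next task.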
-
In order to complete #5, I want to integrate the command-line tool https://github.com/pycontw/pycontw-postevent-report-generator (which requires CSV files as input) as an Airflow DAG. The raw data are CSV files. Two options come to mind, but I don't yet know which one is better:
My gut feeling is that option 2 is better, but I would like to hear more comments.