-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make bigsampler bq output partition configurable #705
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #705 +/- ##
==========================================
- Coverage 71.09% 70.91% -0.18%
==========================================
Files 44 44
Lines 1816 1822 +6
Branches 292 301 +9
==========================================
+ Hits 1291 1292 +1
- Misses 525 530 +5
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love if you could separate formatting changes from actual changes, but LGTM!
| --sample=<percentage> Percentage of records to take in sample, a decimal between 0.0 and 1.0 | ||
| --input=<path> Input file path or BigQuery table | ||
| --output=<path> Output file path or BigQuery table | ||
| [--fields=<field1,field2,...>] An optional list of fields to include in hashing for sampling cohort selection | ||
| [--seed=<seed>] An optional seed used in hashing for sampling cohort selection | ||
| [--hashAlgorithm=(murmur|farm)] An optional arg to select the hashing algorithm for sampling cohort selection. Defaults to FarmHash for BigQuery compatibility | ||
| [--distribution=(uniform|stratified)] An optional arg to sample for a stratified or uniform distribution. Must provide `distributionFields` | ||
| [--distributionFields=<field1,field2,...>] An optional list of fields to sample for distribution. Must provide `distribution` | ||
| [--exact] An optional arg for higher precision distribution sampling. | ||
| [--byteEncoding=(raw|hex|base64)] An optional arg for how to encode fields of type bytes: raw bytes, hex encoded string, or base64 encoded string. Default is to hash raw bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Is it possible to separate formatting changes into a different PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, since I'm aligning the new whitespace with the field being added in this PR
adds a new arg to BigSampler called
bigqueryPartitioning
, defaults to "DAY", which should maintain the same behavior as before. Users can pass in "DAY|HOUR|MONTH|YEAR", as well as NULL if no table partitioning is desired.Making this change so that Ratatool works better with Spotify's internal Luigi BigQuery tasks, which use table sharding as partitioning, and when ratatool sets the partitioning to ingestion day, it causes problems with retention.
Tested by outputting this table via this workflow:
the
43ea5c916cd5a85623bf0de598da15982c29d8952dbf63a068d10e5b56466e61
docker image is using my local ratatool PR's code viasbt publishM2
the linked table has to partitioning and uses the sharding generated by the BigQueryTarget in Luigi
here is another table that is using the
--bigquery-partitioning
arg to set the partitioning to "MONTH".