make bigsampler bq output partition configurable #705

Merged: 4 commits merged into master on Feb 27, 2024

Contributor

@benkonz benkonz commented Feb 20, 2024

Adds a new argument to BigSampler, bigqueryPartitioning, which defaults to "DAY" and should therefore preserve the previous behavior. Users can pass "DAY", "HOUR", "MONTH", or "YEAR", or NULL if no table partitioning is desired.
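
To make the accepted values concrete, here is a minimal sketch of how such an argument could be validated. It is not the code from this PR: the `PartitioningArg` object and its `parse` method are hypothetical names, and the case-insensitive matching is an assumption.

```scala
import java.util.Locale

// Hypothetical helper for illustration only; not Ratatool's actual implementation.
object PartitioningArg {
  // Partition types named in the PR description.
  private val Allowed = Set("DAY", "HOUR", "MONTH", "YEAR")

  // Right(Some(type)) for a partition type, Right(None) for NULL (no partitioning),
  // Left(message) for anything else. Case-insensitive matching is an assumption.
  def parse(raw: String): Either[String, Option[String]] = {
    val normalized = raw.trim.toUpperCase(Locale.ROOT)
    if (normalized == "NULL") Right(None)
    else if (Allowed.contains(normalized)) Right(Some(normalized))
    else Left(s"Unsupported bigqueryPartitioning value: $raw")
  }
}
```

With this sketch, `PartitioningArg.parse("DAY")` keeps the default day partitioning, while `PartitioningArg.parse("NULL")` yields `None`, meaning the output table would be created without time partitioning.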

I'm making this change so that Ratatool works better with Spotify's internal Luigi BigQuery tasks, which shard tables by date instead of partitioning them. When Ratatool sets the partitioning to ingestion day on those tables, it causes problems with retention.

Tested by outputting this table via this workflow:

apiVersion: workflow.data.spotify.com/v1alpha1
kind: Workflow
metadata:
  name: ratatool-internal-examples-stream-days-bigsampler
  namespace: data-quality-spotify
spec:
  resourceID: ratatool-internal-examples.stream.days.BigSampler
  componentID: ratatool-internal
  scheduling:
    schedule: daily
  serviceAccountRef:
    external: contours-test-pipeline@data-quality-spotify.iam.gserviceaccount.com
  docker:
    args:
      - 'wrap-luigi'
      - '--module'
      - 'luigi_tasks'
      - 'BigSampler'
      - '--uri-prefix'
      - 'gs://benk-playground'
      - '--project'
      - 'data-quality-spotify'
      - '--service-account'
      - 'contours-test-pipeline@data-quality-spotify.iam.gserviceaccount.com'
      - '--input-endpoint'
      - 'spotify-people:groups.groups_%Y%m%d'
      - '--output-endpoint'
      - 'data-quality-spotify:benk_test_eu.benk_test_eu_%Y%m%d'
      - '--sample'
      - '0.01'
      - '--partition'
      - '{}'
    terminationLogging: true
    image: 43ea5c916cd5a85623bf0de598da15982c29d8952dbf63a068d10e5b56466e61
  workflowAlertingDisabled: true

The 43ea5c916cd5a85623bf0de598da15982c29d8952dbf63a068d10e5b56466e61 Docker image uses the code from my local Ratatool PR, published via sbt publishM2.

The linked table has no partitioning and uses the sharding generated by the BigQueryTarget in Luigi.
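
For readers unfamiliar with date sharding, the sketch below shows how a table pattern like the one in the workflow's --output-endpoint could resolve to one table per day. It assumes the %Y%m%d token is expanded strftime-style for the run date; the `ShardName` object is a hypothetical name, not part of Luigi or Ratatool.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical illustration of date sharding; not Luigi's or Ratatool's code.
object ShardName {
  // yyyyMMdd mirrors the strftime-style %Y%m%d token (an assumption about its meaning).
  private val Suffix = DateTimeFormatter.ofPattern("yyyyMMdd")

  // Replaces the %Y%m%d token with the run date, giving one table name per day.
  def resolve(pattern: String, date: LocalDate): String =
    pattern.replace("%Y%m%d", date.format(Suffix))
}
```

Under that assumption, a run for Feb 20, 2024 would write to `data-quality-spotify:benk_test_eu.benk_test_eu_20240220`. Each day lives in its own table, so adding ingestion-day partitioning on top of the shards is redundant.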

Here is another table that uses the --bigquery-partitioning arg to set the partitioning to "MONTH".

codecov bot commented Feb 20, 2024

Codecov Report

Attention: Patch coverage is 25.00000%, with 6 lines in your changes missing coverage. Please review.

Project coverage is 70.91%. Comparing base (f637e88) to head (6ed1bbf).
Report is 8 commits behind head on master.

| Files | Patch % | Lines |
|---|---|---|
| ...ala/com/spotify/ratatool/samplers/BigSampler.scala | 33.33% | 4 Missing ⚠️ |
| ...spotify/ratatool/samplers/BigSamplerBigQuery.scala | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #705      +/-   ##
==========================================
- Coverage   71.09%   70.91%   -0.18%     
==========================================
  Files          44       44              
  Lines        1816     1822       +6     
  Branches      292      301       +9     
==========================================
+ Hits         1291     1292       +1     
- Misses        525      530       +5     

| Flag | Coverage Δ |
|---|---|
| ratatoolCli | 2.90% <0.00%> (-0.02%) ⬇️ |
| ratatoolCommon | 0.00% <ø> (ø) |
| ratatoolDiffy | 32.73% <0.00%> (-0.13%) ⬇️ |
| ratatoolExamples | 17.34% <0.00%> (-0.07%) ⬇️ |
| ratatoolSampling | 62.11% <25.00%> (-0.26%) ⬇️ |
| ratatoolScalacheck | 78.14% <ø> (ø) |
| ratatoolShapeless | 4.18% <0.00%> (-0.02%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

Member

@monzalo14 monzalo14 left a comment

I'd love if you could separate formatting changes from actual changes, but LGTM!

Comment on lines +104 to +113
| --sample=<percentage> Percentage of records to take in sample, a decimal between 0.0 and 1.0
| --input=<path> Input file path or BigQuery table
| --output=<path> Output file path or BigQuery table
| [--fields=<field1,field2,...>] An optional list of fields to include in hashing for sampling cohort selection
| [--seed=<seed>] An optional seed used in hashing for sampling cohort selection
| [--hashAlgorithm=(murmur|farm)] An optional arg to select the hashing algorithm for sampling cohort selection. Defaults to FarmHash for BigQuery compatibility
| [--distribution=(uniform|stratified)] An optional arg to sample for a stratified or uniform distribution. Must provide `distributionFields`
| [--distributionFields=<field1,field2,...>] An optional list of fields to sample for distribution. Must provide `distribution`
| [--exact] An optional arg for higher precision distribution sampling.
| [--byteEncoding=(raw|hex|base64)] An optional arg for how to encode fields of type bytes: raw bytes, hex encoded string, or base64 encoded string. Default is to hash raw bytes.
Member

Nit: Is it possible to separate formatting changes into a different PR?

Contributor Author

Not really, since I'm aligning the existing whitespace with the field being added in this PR.

@benkonz benkonz merged commit 1b6b9c3 into master Feb 27, 2024
1 check passed