cdc: compress ndjson files for cloud sink #43103

Closed
piyush-singh opened this issue Dec 10, 2019 · 5 comments · Fixed by #45326
Labels
A-cdc Change Data Capture good first issue

Comments

@piyush-singh

Is your feature request related to a problem? Please describe.
After setting up CDC on the registration cluster, we noticed that the files sent to the S3 bucket were uncompressed and therefore consumed much more space than the logical size in the cluster. For reference, the stmtstats table was 250 GiB in the admin UI, but it was around 8 TB in S3.

We should compress the CDC output files. cc @ajwerner

ajwerner added the good first issue and A-cdc Change Data Capture labels on Dec 16, 2019
@ajwerner
Contributor

I think that the steps to implement this would require adding:

  1. A new option for the cloudstorage sink to indicate the desire for compression
  2. A new file suffix to indicate that the data is compressed. Assuming we'd use gzip compression, the suffix would be .gz
  3. Actually writing compressed files. I suspect that the compression should occur while buffering data in memory:

The solution should probably store an io.WriteCloser that either points directly to &buf or to a gzip.Writer (https://golang.org/pkg/compress/gzip/#Writer) that wraps the buffer.

This straightforward task adds a lot of value to the cloudstorage sink.

One consideration should be how well consumers of these files tolerate compression. For example, will Snowflake accept compressed files? How painful will the compression make these files to use? @piyush-singh

@piyush-singh
Author

Snowflake supports creating custom file formats for ingestion which includes specifying compression types. The compression types they support are:

COMPRESSION = AUTO | GZIP | BZ2 | BROTLI | ZSTD | DEFLATE | RAW_DEFLATE | NONE

Having run through this process, this is a fairly trivial change. It would just take a one-time setup of a few minutes.

@ajwerner
Contributor

> Snowflake supports creating custom file formats for ingestion which includes specifying compression types.

Awesome! I marked this as good first issue as I think it's easy with high reward. Let's try to get this done for 20.1 one way or another.

@picklerick0496

@ajwerner Can I work on this issue?

@ajwerner
Contributor

> @ajwerner Can I work on this issue?

Yes, you may! Let me know if you run into any issues.
