EXPORT: support option to compress output files using gzip #45579

Closed
chriscasano opened this issue Mar 2, 2020 · 6 comments · Fixed by #45978

@chriscasano

When running EXPORT, one option that could reduce file sizes and the number of writes is compression. This could be a simple option in the EXPORT statement:

export into csv 's3://export.csv' with delimiter = '|', compression = gzip from select * from kv;

This can save users from kicking off another process to compress data and give them smaller files to work with for faster pipeline processing. It seems like we already have something similar for imports: #26796

@dt changed the title from "Compression for exports" to "EXPORT: support option to compress output files using gzip" on Mar 3, 2020
@dt
Member

dt commented Mar 9, 2020

I think we'll want an approach somewhat similar to #45326

We'll need to update the options here: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/export.go#L59

And then the translation into a distsql spec here:
https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/distsql_physical_planner.go#L3214

And then to do the actual compression, I think we'll want to wrap buf here with a compressing writer: https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/exportcsv.go#L100

Finally, there are some tests in exportcsv_test.go that should provide a basis for adding a compression case to them.

@C0rWin
Contributor

C0rWin commented Mar 9, 2020

I would like to confirm my understanding of the expected outcome. As far as I understand, I need to plug into export.go, i.e. ConstructExport, which is responsible for producing the export execution node corresponding to the EXPORT statement. Therefore I need to extract the compression option and extend the CSVOptions message in the protobuf file io-formats.proto to include a compression field.

// CSVOptions describe the format of csv data (delimiter, comment, etc).
message CSVOptions {
  // comma is a delimiter used by the CSV file; defaults to a comma.
  optional int32 comma = 1 [(gogoproto.nullable) = false];
  // comment is a comment rune; zero value means comments not enabled.
  optional int32 comment = 2 [(gogoproto.nullable) = false];
  // null_encoding, if not nil, is the string which identifies a NULL. Can be the empty string.
  optional string null_encoding = 3 [(gogoproto.nullable) = true];
  // skip the first N lines of the input (e.g. to ignore column headers) when reading.
  optional uint32 skip = 4 [(gogoproto.nullable) = false];
  // If strict_quotes is true, a quote may NOT appear in an unquoted field and a
  // non-doubled quote may NOT appear in a quoted field.
  optional bool strict_quotes = 5 [(gogoproto.nullable) = false];
  // compression defines whether the exported CSV should be compressed.
  // It holds the name of the compression codec; currently only
  // gzip is supported.
  optional string compression = 6 [(gogoproto.nullable) = false];
}

Next, I need to add supporting logic to the Run method in exportcsv.go, where I need to gzip the produced CSV file.

So some questions:

  1. Did I understand this task correctly?
  2. Will it result in producing several archives?
  3. How should I test it, and what are the best practices for implementing unit tests for such functionality? Is there a way to implement an integration test for S3?
  4. The example in the issue uses S3. Is that the only type of external storage that needs to support compression, or does this functionality need to be introduced elsewhere as well?
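
The option-extraction step described at the top of this comment could look roughly like the Go sketch below. Everything here is an assumption made for illustration: `parseCompression`, the `map[string]string` option representation, and the case-insensitive matching are hypothetical and do not reflect how export.go actually evaluates options. The sketch only shows the shape of the check: accept gzip, treat a missing option as "no compression", and reject anything else.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical option key and codec name, mirroring the syntax in the issue.
const (
	optCompression = "compression"
	codecGzip      = "gzip"
)

// parseCompression returns the codec requested in opts, or "" when the
// option is absent. Any codec other than gzip is rejected with an error.
func parseCompression(opts map[string]string) (string, error) {
	raw, ok := opts[optCompression]
	if !ok {
		return "", nil // assumption: no option means uncompressed output
	}
	// Assumption: codec names are matched case-insensitively.
	codec := strings.ToLower(strings.TrimSpace(raw))
	if codec != codecGzip {
		return "", fmt.Errorf("unsupported compression codec %q; only %q is supported", raw, codecGzip)
	}
	return codec, nil
}

func main() {
	for _, opts := range []map[string]string{
		{"delimiter": "|", "compression": "gzip"},
		{"delimiter": "|"},
		{"compression": "zstd"},
	} {
		codec, err := parseCompression(opts)
		fmt.Printf("codec=%q err=%v\n", codec, err)
	}
}
```

The returned codec name is what would then be carried in the proposed `compression` field of CSVOptions so that the export processor can read it later.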

@C0rWin
Contributor

C0rWin commented Mar 9, 2020

> And then the translation into a distsql spec here:
> https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/distsql_physical_planner.go#L3214

I'm not really sure why this change is needed, but I might be missing something since I'm not familiar with the code base yet.

Also, given that you only add an option in https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/export.go#L59, don't you also need to update the protobuf so you can read it later in https://github.com/cockroachdb/cockroach/blob/master/pkg/ccl/importccl/exportcsv.go#L100?

@C0rWin
Contributor

C0rWin commented Mar 9, 2020

It seems that if we add the compression option, there is no need to change anything in distsql_physical_planner.go, because option updates seem to be covered here: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/distsql_physical_planner.go#L3225

@C0rWin
Contributor

C0rWin commented Mar 11, 2020

@dt, just to confirm the direction, I opened a small PR, #45978. Please let me know whether it makes sense.

@miretskiy
Contributor

I believe this issue can be closed? @dt ?

C0rWin added a commit to C0rWin/cockroach that referenced this issue Mar 21, 2020
This commit extends EXPORT functionality by enabling compression of the
exported stream as suggested in cockroachdb#45579. Currently only gzip is supported,
and the export clause to use compression looks as follows:

```
export into csv 's3://export.csv' with compression = gzip from select * from foo;
```

Signed-off-by: Artem Barger <[email protected]>

Release note (sql change): support option to compress output files using
gzip

Release justification: none
C0rWin added a commit to C0rWin/cockroach that referenced this issue Apr 2, 2020
craig bot pushed a commit that referenced this issue Apr 2, 2020
45978: importccl: support option to compress output files using gzip r=dt a=C0rWin

Fix #45579

This commit extends EXPORT functionality by enabling compression of the
exported stream as suggested in #45579. Currently only gzip is supported,
and the export clause to use compression looks as follows:

```
export into csv 's3://export.csv' with compression = gzip from select * from foo;
```

Signed-off-by: Artem Barger <[email protected]>

Release note (sql change): support option to compress output files using
gzip

Co-authored-by: Artem Barger <[email protected]>