Parquet wildcard writing #489

Open
alex-zaitsev opened this issue Sep 27, 2024 · 0 comments
alex-zaitsev commented Sep 27, 2024

INSERT INTO s3('s3://<my_bucket>/myfiles*.parquet')

This would automatically split the data into multiple files, using the existing min_insert_block_size_rows / min_insert_block_size_bytes settings. Should close ClickHouse#41537.
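To make the proposal concrete, here is a sketch of how such a query might look, written with the current INSERT INTO FUNCTION form. The syntax for wildcard writing is hypothetical (the feature does not exist yet), and the bucket, path, table name and settings values are placeholders:

-- Hypothetical usage of the proposed wildcard write (not implemented yet):
-- ClickHouse would replace * with a sequential file index each time a
-- block-size threshold is reached, producing myfiles_0.parquet, myfiles_1.parquet, ...
INSERT INTO FUNCTION s3(
    's3://my_bucket/exports/myfiles_*.parquet',  -- placeholder bucket/path
    'Parquet'
)
SELECT *
FROM events
SETTINGS
    min_insert_block_size_rows  = 10000000,      -- existing settings that would
    min_insert_block_size_bytes = 1000000000     -- bound the size of each output file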

For comparison, other systems implement this as follows:

BigQuery

The path must contain exactly one wildcard * anywhere in the leaf directory of the path string, for example, ../aa/*, ../aa/b*c, ../aa/*c, and ../aa/bc*. BigQuery replaces * with 0000..N depending on the number of files exported. BigQuery determines the file count and sizes. If BigQuery decides to export two files, then * in the first file's filename is replaced by 000000000000, and * in the second file's filename is replaced by 000000000001.
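In BigQuery this is exposed through the EXPORT DATA statement. An illustrative sketch, where the bucket, dataset and table names are placeholders:

-- Illustrative BigQuery export with a wildcard URI; BigQuery decides how many
-- files to write and substitutes * with a zero-padded sequence number.
EXPORT DATA
  OPTIONS (
    uri = 'gs://my_bucket/exports/myfiles_*.parquet',  -- exactly one * in the leaf
    format = 'PARQUET',
    overwrite = true
  )
AS
SELECT *
FROM mydataset.mytable;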

Redshift

to 's3://amzn-s3-demo-bucket/unload/venue_pipe_'

By default, UNLOAD writes one or more files per slice. Assuming a two-node cluster with two slices per node, the previous example creates these files in amzn-s3-demo-bucket as follows:
venue_pipe_0000_part_00
venue_pipe_0001_part_00
venue_pipe_0002_part_00
venue_pipe_0003_part_00
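For context, a complete UNLOAD statement in the spirit of the documented example might look like this; the SELECT and the IAM role ARN are placeholders:

-- Each slice writes its own file(s), appending NNNN_part_NN to the given prefix.
UNLOAD ('SELECT * FROM venue')
TO 's3://amzn-s3-demo-bucket/unload/venue_pipe_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';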
