Parquet wildcard writing #489

Open
alex-zaitsev opened this issue Sep 27, 2024 · 0 comments
alex-zaitsev commented Sep 27, 2024

INSERT INTO s3('s3://<my_bucket>/myfiles*.parquet')

This would automatically split the data into multiple files, using the existing min_insert_block_size_rows / min_insert_block_size_bytes settings. Should close ClickHouse#41537.
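To make the proposal concrete, here is a sketch of how such a query might look, written with the current INSERT INTO FUNCTION form. The syntax for wildcard writing is hypothetical (the feature does not exist yet), and the bucket, path, table name and settings values are placeholders:

-- Hypothetical usage of the proposed wildcard write (not implemented yet):
-- ClickHouse would replace * with a sequential file index each time a
-- block-size threshold is reached, producing myfiles_0.parquet, myfiles_1.parquet, ...
INSERT INTO FUNCTION s3(
    's3://my_bucket/exports/myfiles_*.parquet',  -- placeholder bucket/path
    'Parquet'
)
SELECT *
FROM events
SETTINGS
    min_insert_block_size_rows  = 10000000,      -- existing settings that would
    min_insert_block_size_bytes = 1000000000     -- bound the size of each output file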

For comparison, other systems implement this as follows:

BigQuery

The path must contain exactly one wildcard * anywhere in the leaf directory of the path string, for example, ../aa/*, ../aa/b*c, ../aa/*c, and ../aa/bc*. BigQuery replaces * with 0000..N depending on the number of files exported. BigQuery determines the file count and sizes. If BigQuery decides to export two files, then * in the first file's filename is replaced by 000000000000, and * in the second file's filename is replaced by 000000000001.
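In BigQuery this is exposed through the EXPORT DATA statement. An illustrative sketch, where the bucket, dataset and table names are placeholders:

-- Illustrative BigQuery export with a wildcard URI; BigQuery decides how many
-- files to write and substitutes * with a zero-padded sequence number.
EXPORT DATA
  OPTIONS (
    uri = 'gs://my_bucket/exports/myfiles_*.parquet',  -- exactly one * in the leaf
    format = 'PARQUET',
    overwrite = true
  )
AS
SELECT *
FROM mydataset.mytable;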

Redshift

to 's3://amzn-s3-demo-bucket/unload/venue_pipe_'

By default, UNLOAD writes one or more files per slice. Assuming a two-node cluster with two slices per node, the previous example creates these files in amzn-s3-demo-bucket as follows:
venue_pipe_0000_part_00
venue_pipe_0001_part_00
venue_pipe_0002_part_00
venue_pipe_0003_part_00
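For context, a complete UNLOAD statement in the spirit of the documented example might look like this; the SELECT and the IAM role ARN are placeholders:

-- Each slice writes its own file(s), appending NNNN_part_NN to the given prefix.
UNLOAD ('SELECT * FROM venue')
TO 's3://amzn-s3-demo-bucket/unload/venue_pipe_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole';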
