Streaming COPY #7889
Comments
Cc @lichuang
Before this work, I think we should finish: CopyInterpreterV2.list_files can be the same as InterpreterCommon.list_files.
Any further update on this?
Wait for #7892.
#7892 has been closed.
There is other work to finish on the copy side. Let's reopen it.
I will take this; we need to consider more situations. cc @Xuanwo
Partition scanning and data reading are different kinds of workloads that can be executed at the same time, so adding this feature will also improve performance: we can copy files while the listing is still in progress. Given a bucket with 10,000,000 (1kw) files, where every listing request returns 200 objects and the typical list latency is 100 ms, a full sequential scan takes 10,000,000 / 200 × 0.1 s = 5,000 s ≈ 1.4 hours.
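The arithmetic above can be checked with a small sketch (all values come from the comment and are illustrative, not measured):

```rust
// Back-of-envelope estimate for a purely sequential listing scan.
fn scan_hours(total_files: f64, page_size: f64, latency_s: f64) -> f64 {
    let requests = total_files / page_size; // 50_000 list requests
    requests * latency_s / 3600.0           // total wall-clock time in hours
}

fn main() {
    // 10,000,000 files, 200 objects per page, 100 ms per request
    println!("{:.2} hours", scan_hours(10_000_000.0, 200.0, 0.1)); // ~1.39 hours
}
```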
The new COPY design will use StageTable as the table engine and re-use the planner & pipeline distribution.
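The overlap between listing and copying can be sketched with a producer/consumer channel. This is a hypothetical illustration, not the actual Databend implementation; the names `stream_copy`, `pages`, and `page_size` are invented for the example:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical sketch: a lister thread streams pages of file names into a
// channel while the copier consumes them, so copying can start as soon as
// the first page arrives instead of waiting for the full listing.
fn stream_copy(pages: usize, page_size: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<String>>();

    let lister = thread::spawn(move || {
        for page in 0..pages {
            let batch: Vec<String> = (0..page_size)
                .map(|i| format!("file_{page}_{i}.csv"))
                .collect();
            tx.send(batch).unwrap();
        }
        // tx is dropped here, closing the channel and ending the loop below
    });

    let mut copied = 0;
    for batch in rx {
        copied += batch.len(); // "copy" each file as its page arrives
    }
    lister.join().unwrap();
    copied
}

fn main() {
    println!("copied {} files", stream_copy(3, 200));
}
```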
Background
Databend will list all files before copying data:
https://github.com/datafuselabs/databend/blob/46da962d5637b35555e8f3f49885d1b58211ab41/src/query/service/src/interpreters/interpreter_copy_v2.rs#L399-L460
This leads to: the `COPY INTO` command cannot import data from S3 buckets with massive numbers of files or directories (#7862). We need to introduce streaming copy, which copies files simultaneously while the listing is still in progress.
Problems
We need to make these checks streaming-friendly. For example, only check one batch of files at a time.
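A per-batch check could look like the following sketch (hypothetical helper `filter_batch`, assuming the check in question is deduplicating against files already seen):

```rust
use std::collections::HashSet;

// Hypothetical sketch: deduplicate one batch at a time against the set of
// files already seen, instead of materializing the full file list up front.
fn filter_batch(seen: &mut HashSet<String>, batch: Vec<String>) -> Vec<String> {
    batch
        .into_iter()
        .filter(|f| seen.insert(f.clone())) // keep only files not seen before
        .collect()
}

fn main() {
    let mut seen = HashSet::new();
    let first = filter_batch(&mut seen, vec!["a.csv".into(), "b.csv".into()]);
    let second = filter_batch(&mut seen, vec!["b.csv".into(), "c.csv".into()]);
    println!("{} new, then {} new", first.len(), second.len());
}
```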
Tasks