Streaming COPY #7889

Closed
1 of 2 tasks
Tracked by #7823
Xuanwo opened this issue Sep 26, 2022 · 10 comments

@Xuanwo
Member

Xuanwo commented Sep 26, 2022

Background

Databend will list all files before copying data:

https://github.com/datafuselabs/databend/blob/46da962d5637b35555e8f3f49885d1b58211ab41/src/query/service/src/interpreters/interpreter_copy_v2.rs#L399-L460

This will lead to:

We need to introduce streaming copy, which copies data while the listing is still in progress.

Problems

  • GetTableCopiedFileReply will return all files
  • UpsertTableCopiedFileReq requires updating all copied files at once

We need to make them streaming-friendly, for example by checking only one batch of files at a time.
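The batch-at-a-time check could look roughly like the sketch below. It is only an illustration under stated assumptions: `already_copied`, `files_to_copy`, and `flush` are hypothetical names, and an in-memory `HashSet` stands in for whatever the meta service actually stores; none of this is Databend's real API.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a batched meta-service lookup: given one batch
/// of file names, return the ones already recorded as copied.
fn already_copied(copied: &HashSet<String>, batch: &[String]) -> HashSet<String> {
    batch.iter().filter(|f| copied.contains(*f)).cloned().collect()
}

/// Check the current batch, move the not-yet-copied files into `pending`,
/// and leave `batch` empty for reuse.
fn flush(batch: &mut Vec<String>, copied: &HashSet<String>, pending: &mut Vec<String>) {
    if batch.is_empty() {
        return;
    }
    let done = already_copied(copied, batch.as_slice());
    pending.extend(batch.drain(..).filter(|f| !done.contains(f)));
}

/// Walk the listing in fixed-size batches instead of materializing it all,
/// and keep only the files that still need to be copied.
fn files_to_copy<I>(listing: I, copied: &HashSet<String>, batch_size: usize) -> Vec<String>
where
    I: IntoIterator<Item = String>,
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut pending = Vec::new();
    let mut batch = Vec::with_capacity(batch_size);
    for file in listing {
        batch.push(file);
        if batch.len() == batch_size {
            flush(&mut batch, copied, &mut pending);
        }
    }
    // Handle the final, possibly partial batch.
    flush(&mut batch, copied, &mut pending);
    pending
}
```

Each lookup touches at most `batch_size` names, so the check never needs the full list of copied files in a single reply.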

Tasks

@Xuanwo
Member Author

Xuanwo commented Sep 26, 2022

Cc @lichuang

@Xuanwo Xuanwo self-assigned this Sep 26, 2022
@BohuTANG
Member

BohuTANG commented Sep 26, 2022

Before this work, I think we should first make CopyInterpreterV2.list_files the same as InterpreterCommon.list_files.

@ClSlaid
Contributor

ClSlaid commented Oct 8, 2022

Any further update on this?

@Xuanwo
Member Author

Xuanwo commented Oct 8, 2022

Wait for #7892

@lichuang
Contributor

lichuang commented Oct 8, 2022

> Wait for #7892

#7892 will be processed after the share table implementation is finished.

@lichuang
Contributor

#7892 has been closed.

@Xuanwo Xuanwo reopened this Oct 19, 2022
@Xuanwo
Member Author

Xuanwo commented Oct 19, 2022

There is other work to finish on the COPY side. Let's reopen it.

@BohuTANG BohuTANG assigned BohuTANG and unassigned Xuanwo Nov 4, 2022
@BohuTANG
Member

BohuTANG commented Nov 4, 2022

I will take this. We need to consider more situations:
If COPY becomes distributed, streaming reading files in StageTable:read_partitions is hard.
For now, it looks better to follow the current mechanism for copying. If a bucket has many files, the slow file listing is by design for now.

cc @Xuanwo

@Xuanwo
Member Author

Xuanwo commented Nov 4, 2022

> streaming reading files in StageTable:read_partitions is hard.

Partitions scanning and data reading are different kinds of workloads that can be executed at the same time. Adding this feature will also improve the performance of FuseTable.

With this feature, we can copy at the same time as listing:

Given a bucket with 1000_0000 (10 million) files, where every listing request returns 200 objects and the typical list latency is 100 ms, we will need 1000_0000 / 200 * 0.1 s = 5000 s, or about 1.4 hours, to fully scan it.
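As a rough illustration of the idea, the sketch below estimates the listing time and overlaps listing with copying through a bounded channel. It is a minimal sketch under assumptions: `listing_seconds` and `streaming_copy` are hypothetical names, the "listing" just generates file names, and the "copy" only counts files; it does not reflect Databend's actual pipeline.

```rust
use std::sync::mpsc;
use std::thread;

/// Estimated seconds to list `total_files` when each request returns
/// `page_size` objects and takes `latency_secs` (requests run sequentially).
fn listing_seconds(total_files: u64, page_size: u64, latency_secs: f64) -> f64 {
    assert!(page_size > 0, "page_size must be positive");
    let requests = (total_files + page_size - 1) / page_size; // ceiling division
    requests as f64 * latency_secs
}

/// Hypothetical pipeline: a lister thread produces pages of file names while
/// the caller consumes ("copies") each page as soon as it arrives.
/// Returns the number of files consumed.
fn streaming_copy(total_files: u64, page_size: u64) -> u64 {
    assert!(page_size > 0, "page_size must be positive");
    // Bounded channel: the lister can run at most 4 pages ahead of the copier.
    let (tx, rx) = mpsc::sync_channel::<Vec<String>>(4);

    let lister = thread::spawn(move || {
        let mut next = 0u64;
        while next < total_files {
            let end = (next + page_size).min(total_files);
            // Stand-in for one list request returning up to `page_size` objects.
            let page: Vec<String> = (next..end).map(|i| format!("file_{i}.csv")).collect();
            if tx.send(page).is_err() {
                break; // the receiver went away; stop listing
            }
            next = end;
        }
        // `tx` is dropped here, which ends the receiver loop below.
    });

    let mut copied = 0u64;
    for page in rx {
        // Stand-in for copying the files of this page.
        copied += page.len() as u64;
    }
    lister.join().expect("lister thread panicked");
    copied
}
```

With the numbers above, `listing_seconds(10_000_000, 200, 0.1)` gives roughly 5000 seconds; in the streaming variant the copy work proceeds during that time instead of starting after it.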

@BohuTANG
Member

BohuTANG commented Nov 4, 2022

The new COPY design will use StageTable as the table engine and reuse the planner & pipeline distribution:
#8635

@BohuTANG BohuTANG closed this as completed Jun 1, 2024