Streaming COPY #7889

Xuanwo · 2022-09-26T05:07:17Z

Background

Databend will list all files before copying data:

https://github.com/datafuselabs/databend/blob/46da962d5637b35555e8f3f49885d1b58211ab41/src/query/service/src/interpreters/interpreter_copy_v2.rs#L399-L460

This will lead to:

bug: COPY INTO command cannot import data from s3 with massive files or directories #7862

We need to introduce a streaming copy that copies files while the listing is still in progress.
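As a rough illustration of the idea, the sketch below runs listing and copying as a producer and a consumer connected by a bounded channel. It is not Databend code: `streaming_copy`, `pages`, and `copy_file` are hypothetical names, with `pages` standing in for paginated object-store listing results and `copy_file` standing in for the per-file copy work.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical sketch: copies files while the listing is still running.
/// `pages` stands in for paginated listing results and `copy_file` for the
/// real per-file copy work; neither is part of Databend's API.
/// Returns the number of files handed to `copy_file`.
fn streaming_copy<F>(pages: Vec<Vec<String>>, mut copy_file: F) -> usize
where
    F: FnMut(&str),
{
    // Bounded channel: the lister can run at most 4 pages ahead of the copier.
    let (tx, rx) = mpsc::sync_channel::<Vec<String>>(4);

    // Producer: sends each page as soon as it has been "listed".
    let lister = thread::spawn(move || {
        for page in pages {
            if tx.send(page).is_err() {
                break; // the receiver is gone, so stop listing
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    // Consumer: starts copying without waiting for the full listing.
    let mut copied = 0;
    for page in rx {
        for file in &page {
            copy_file(file.as_str());
            copied += 1;
        }
    }

    lister.join().expect("lister thread panicked");
    copied
}
```

The bounded channel keeps memory use flat on very large buckets, because the lister blocks once it is a few pages ahead of the copier.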

Problems

GetTableCopiedFileReply returns all files
UpsertTableCopiedFileReq requires updating all copied files at once

We need to make them streaming-friendly, for example by checking only one batch of files at a time.
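A minimal sketch of what a batch-wise check could look like, assuming the meta service could answer "which of these files were already copied?" for one batch at a time. `filter_uncopied_in_batches` and `lookup_copied` are hypothetical names used for illustration, not the actual request or reply types.

```rust
use std::collections::HashSet;

/// Hypothetical helper: checks files against the copied-file record one batch
/// at a time instead of fetching the whole record up front.
/// `lookup_copied` stands in for a batched meta-service request (an
/// assumption, not Databend's API): given a batch of file names, it returns
/// the names that were already copied.
fn filter_uncopied_in_batches<F>(
    files: &[String],
    batch_size: usize,
    mut lookup_copied: F,
) -> Vec<String>
where
    F: FnMut(&[String]) -> HashSet<String>,
{
    assert!(batch_size > 0, "batch_size must be positive");
    let mut pending = Vec::new();
    for batch in files.chunks(batch_size) {
        // One lookup per batch keeps each request and reply small.
        let copied = lookup_copied(batch);
        pending.extend(batch.iter().filter(|f| !copied.contains(*f)).cloned());
    }
    pending
}
```

Each batch can be handed to the copy stage as soon as its lookup returns, so the check no longer needs the complete file list.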

Tasks

bug: optimize get/upsert copied file info #7892
Copy files in a streaming fashion

Xuanwo · 2022-09-26T05:07:32Z

Cc @lichuang

BohuTANG · 2022-09-26T05:18:54Z

Before this work, I think we should first make CopyInterpreterV2.list_files the same as InterpreterCommon.list_files.

ClSlaid · 2022-10-08T07:18:22Z

Any further update on this?

Xuanwo · 2022-10-08T07:23:31Z

Wait for #7892

lichuang · 2022-10-08T07:31:13Z

Wait for #7892

#7892 will be processed after the share table implementation is finished.

lichuang · 2022-10-19T06:29:06Z

#7892 has been closed.

Xuanwo · 2022-10-19T06:59:56Z

There is other work to finish on the copy side. Let's reopen it.

BohuTANG · 2022-11-04T04:40:36Z

I will take this. We need to consider more situations:
If COPY becomes distributed, streaming file reads in StageTable:read_partitions will be hard.
For now, it looks better to keep the current copy mechanism. If a bucket has many files, the slow file listing is by design.

cc @Xuanwo

Xuanwo · 2022-11-04T04:59:24Z

"streaming reading files in StageTable:read_partitions is hard."

Partitions scanning and data reading are different kinds of workloads that can be executed at the same time. Adding this feature will also improve the performance of FuseTable.

With this feature, we can copy files while the listing is still running:

Given a bucket with 10,000,000 files, where every listing request returns 200 objects and the typical list latency is 100 ms, we need 10,000,000 / 200 × 0.1 s = 5,000 s ≈ 1.4 hours to fully scan it.
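The estimate can be checked with a small helper: 10,000,000 objects at 200 per request is 50,000 sequential requests, and at 0.1 s each that is about 5,000 s. `listing_seconds` is a hypothetical name, and the model assumes requests are issued strictly one after another.

```rust
/// Estimated wall-clock seconds to list a bucket with sequential requests.
/// Illustrative model only: it ignores retries, throttling, and parallel listing.
fn listing_seconds(total_objects: u64, objects_per_request: u64, latency_secs: f64) -> f64 {
    assert!(objects_per_request > 0, "objects_per_request must be positive");
    // Round up: a partially filled last page still costs one request.
    let requests = (total_objects + objects_per_request - 1) / objects_per_request;
    requests as f64 * latency_secs
}
```

With a non-streaming COPY, all of that listing time passes before the first file is copied. With streaming, copying can begin after the first page arrives.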

BohuTANG · 2022-11-04T05:03:03Z

The new COPY design will use StageTable as the table engine and reuse the planner and pipeline distribution:
#8635

Xuanwo self-assigned this Sep 26, 2022

Xuanwo added this to Xuanwo's Work Sep 26, 2022

Xuanwo moved this to 📋 Backlog in Xuanwo's Work Sep 26, 2022

Xuanwo mentioned this issue Sep 26, 2022

feat(parquet): read in parallel. #7903

Merged

BohuTANG mentioned this issue Sep 27, 2022

Tracking: Large dataset insert and read #7823

Closed

50 tasks

This was referenced Oct 11, 2022

tracing file Formats #7732

Open

combine streaming and distributed in copy into #8128

Closed

BohuTANG closed this as completed Oct 19, 2022

Xuanwo reopened this Oct 19, 2022

BohuTANG assigned BohuTANG and unassigned Xuanwo Nov 4, 2022

BohuTANG closed this as completed Jun 1, 2024
