allow executing COPY INTO in a cluster #6395

Closed
Tracked by #7823
flaneur2020 opened this issue Jul 1, 2022 · 9 comments

@flaneur2020
Member

Summary

I'm running a COPY INTO in my cluster with 8 replicas, but it seems that only one replica is used to execute the statement:

[Screenshot: Screen Shot 2022-07-01 at 9 29 16 PM]

It could be up to 8x faster if the COPY INTO statement could use the other instances in the cluster.

@sundy-li
Member

sundy-li commented Jul 1, 2022

Is it a single file?

@PsiACE
Member

PsiACE commented Jul 1, 2022

> Is it a single file?

One hundred files, obtained by splitting the ontime dataset.

@sundy-li
Member

sundy-li commented Jul 1, 2022

OK, it seems parallel copy only works in single-node mode, i.e. when the query node is not part of a cluster.

https://github.com/datafuselabs/databend/blob/f152cbe7edc96fe5850982ef40d9c04c57ecc94e/query/src/interpreters/interpreter_copy.rs

    // Parallel path: taken only when the new processor framework is enabled
    // and the cluster is empty.
    if ctx.get_settings().get_enable_new_processor_framework()? != 0
        && self.ctx.get_cluster().is_empty()
    {
        table.append2(ctx.clone(), &mut pipeline)?;
        pipeline.set_max_threads(settings.get_max_threads()? as usize);

        let async_runtime = ctx.get_storage_runtime();
        let query_need_abort = ctx.query_need_abort();
        let executor =
            PipelineCompleteExecutor::try_create(async_runtime, query_need_abort, pipeline)?;

        executor.execute()?;
        return Ok(ctx.consume_precommit_blocks());
    }
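
The gating condition in the excerpt can be reduced to a small standalone sketch. Everything here is hypothetical and for illustration only: `CopyPath` and `choose_copy_path` are not Databend names, and the sketch assumes that `is_empty()` on the cluster means "no other nodes", which the excerpt itself does not spell out.

```rust
/// Which execution path a COPY INTO takes (hypothetical names, not Databend's).
#[derive(Debug, PartialEq)]
enum CopyPath {
    /// Multi-threaded pipeline on the local node only.
    ParallelLocal,
    /// Whatever path is taken otherwise (not shown in the excerpt).
    Fallback,
}

/// Mirrors the shape of the condition in the excerpt: the parallel path needs
/// the new processor framework setting to be non-zero AND an empty cluster.
fn choose_copy_path(new_processor_framework: u64, cluster_nodes: usize) -> CopyPath {
    if new_processor_framework != 0 && cluster_nodes == 0 {
        CopyPath::ParallelLocal
    } else {
        CopyPath::Fallback
    }
}
```

Under these assumptions, `choose_copy_path(1, 8)` yields `CopyPath::Fallback`, which matches the behaviour reported for the 8-replica cluster.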

@sundy-li sundy-li self-assigned this Jul 1, 2022
@sundy-li sundy-li added the A-query Area: databend query label Jul 1, 2022
@zhang2014
Member

Distributed COPY INTO needs #6253 (exchanging precommit blocks between cluster nodes).
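
As a rough illustration of why such an exchange is needed, here is a toy model in which worker threads stand in for cluster nodes and send their precommit blocks back to a single coordinator. This is not the design of #6253: `PrecommitBlock`, `run_distributed_copy`, and the fixed row count are all assumptions made for the sketch.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a precommit block: metadata about data a node
/// has written but that is not yet committed to the table.
#[derive(Debug, Clone, PartialEq)]
struct PrecommitBlock {
    node_id: usize,
    file: String,
    rows: u64,
}

/// Each simulated node handles its own files and sends the resulting
/// precommit blocks to the coordinator, which gathers all of them so that a
/// single commit can cover the whole statement.
fn run_distributed_copy(files_per_node: Vec<Vec<String>>) -> Vec<PrecommitBlock> {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();

    for (node_id, files) in files_per_node.into_iter().enumerate() {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            for file in files {
                // Assumption: every file yields 100 rows. A real node would
                // parse the file and report the actual count.
                tx.send(PrecommitBlock { node_id, file, rows: 100 }).unwrap();
            }
        }));
    }

    // Drop the coordinator's sender so the channel closes once every worker
    // has finished and dropped its clone.
    drop(tx);
    for handle in handles {
        handle.join().unwrap();
    }

    // Collect everything the nodes produced; sort for a deterministic order.
    let mut blocks: Vec<PrecommitBlock> = rx.into_iter().collect();
    blocks.sort_by(|a, b| a.file.cmp(&b.file));
    blocks
}
```

Without the gathering step, blocks written on other nodes would never reach the node that commits the statement, which is the gap the single-node code path does not have to deal with.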

@sundy-li
Member

sundy-li commented Sep 5, 2022

@RinChanNOWWW, you can take this issue; it's ready to be worked on now.

@BohuTANG
Member

BohuTANG commented Sep 6, 2022

I think we can treat <internal/external stage, remote location> as a special storage engine. Then we can get the file list as the table source and optimize how the files are distributed across the cluster. This would also be the foundation for #7228 and #7211.
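
A minimal sketch of the "file list as table source" idea, assuming a plain round-robin assignment of staged files to nodes. The function name `distribute_files` and the strategy itself are assumptions; nothing in this thread says which partitioning scheme Databend would use.

```rust
/// Assign staged files to cluster nodes round-robin.
/// Hypothetical helper; the real planner may weigh file sizes or locality.
fn distribute_files(files: &[String], node_count: usize) -> Vec<Vec<String>> {
    assert!(node_count > 0, "need at least one node");
    let mut assignments: Vec<Vec<String>> = vec![Vec::new(); node_count];
    for (i, file) in files.iter().enumerate() {
        // File i goes to node i mod node_count.
        assignments[i % node_count].push(file.clone());
    }
    assignments
}
```

With the 100 files and 8 replicas mentioned earlier in the thread, this assignment gives the first four nodes 13 files each and the remaining four 12 each.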

I would ping @dantengsky; he is working on a similar storage engine (pre-sign). If some code needs refactoring, please let us know :)

@RinChanNOWWW
Contributor

> I think we can make the <internal/external-stage, remote location> as a special storage engine, then we can get the file list as table source

Then we can convert `copy into t from @stage` into `insert into t select from @stage` and achieve distributed COPY INTO via #7501.
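
The proposed rewrite can be sketched at the string level as follows. This is only an illustration under stated assumptions: `CopyInto`, `parse`, and `to_insert_select` are hypothetical names, only the simplest `copy into <table> from @<stage>` form is handled, and the `SELECT *` projection is an assumption because the comment above does not give a column list.

```rust
/// Hypothetical, minimal model of `COPY INTO <table> FROM @<stage>`.
struct CopyInto {
    table: String,
    stage: String,
}

impl CopyInto {
    /// Parse only the simplest form: `copy into <table> from @<stage>`.
    /// Options, file formats, and trailing semicolons are not handled.
    fn parse(sql: &str) -> Option<CopyInto> {
        let tokens: Vec<&str> = sql.split_whitespace().collect();
        match tokens.as_slice() {
            [copy, into, table, from, stage]
                if copy.eq_ignore_ascii_case("copy")
                    && into.eq_ignore_ascii_case("into")
                    && from.eq_ignore_ascii_case("from") =>
            {
                // The stage must be written as `@name` with a non-empty name.
                let stage = stage.strip_prefix('@').filter(|s| !s.is_empty())?;
                Some(CopyInto {
                    table: table.to_string(),
                    stage: stage.to_string(),
                })
            }
            _ => None,
        }
    }

    /// Express the copy as an INSERT ... SELECT over the stage.
    /// `SELECT *` is an assumption of this sketch.
    fn to_insert_select(&self) -> String {
        format!("INSERT INTO {} SELECT * FROM @{}", self.table, self.stage)
    }
}
```

Once the statement has this shape, it can go through whatever distributed planning already exists for INSERT ... SELECT, which is the point of the suggestion.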

@BohuTANG
Member

BohuTANG commented Sep 6, 2022

@RinChanNOWWW

Please take a look at #7502.
We are going to make the catalog meet these requirements; the work is in progress by @dantengsky.
If you are interested, you can ping and talk with dantengsky :)

@PsiACE
Member

PsiACE commented Jul 17, 2023

#11943

@PsiACE PsiACE closed this as completed Jul 17, 2023