(cloud-merge) Support shadow tablet to do cumulative compaction in cloud mode #37293
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website.
run buildall
clang-tidy review says "All clean, LGTM! 👍"
clang-tidy made some suggestions
}

Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params) {
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
warning: function '_convert_historical_rowsets' exceeds recommended size/complexity thresholds [readability-function-size]
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
^
Additional context
be/src/cloud/cloud_schema_change_job.cpp:216: 172 lines including whitespace and comments (threshold 80)
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
^
run buildall
clang-tidy made some suggestions
}

Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params) {
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
warning: function '_convert_historical_rowsets' exceeds recommended size/complexity thresholds [readability-function-size]
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
^
Additional context
be/src/cloud/cloud_schema_change_job.cpp:214: 172 lines including whitespace and comments (threshold 80)
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
^
run buildall
TPC-H: Total hot run time: 40097 ms
TPC-DS: Total hot run time: 173080 ms
ClickBench: Total hot run time: 30.24 s
clang-tidy made some suggestions
}

Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params) {
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
warning: function '_convert_historical_rowsets' exceeds recommended size/complexity thresholds [readability-function-size]
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
^
Additional context
be/src/cloud/cloud_schema_change_job.cpp:224: 172 lines including whitespace and comments (threshold 80)
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
^
clang-tidy made some suggestions
}

Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params) {
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
warning: function '_convert_historical_rowsets' exceeds recommended size/complexity thresholds [readability-function-size]
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
^
Additional context
be/src/cloud/cloud_schema_change_job.cpp:222: 173 lines including whitespace and comments (threshold 80)
Status CloudSchemaChangeJob::_convert_historical_rowsets(const SchemaChangeParams& sc_params,
^
fe/fe-core/src/main/java/org/apache/doris/cloud/datasource/CloudInternalCatalog.java
fe/fe-core/src/main/java/org/apache/doris/cloud/datasource/CloudInternalCatalog.java
fe/fe-core/src/main/java/org/apache/doris/alter/CloudSchemaChangeJobV2.java
    LOG.warn("tryTimes:{}, onCancel exception:", tryTimes, e);
}
sleepSeveralSeconds();
tryTimes++;
What if it tries to abort the same tablet job multiple times?
MS (the meta-service) checks the tuple <origin_idx, shadow_idx, origin_tablet, shadow_tablet> carried in the request. If it does not match the recorded job, the request is skipped.
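The reply above describes an idempotent abort: a repeated or stale request is skipped because its tuple no longer matches what is recorded. A minimal sketch of that check follows. `SchemaChangeJobKey`, `AbortResult`, and `abort_job` are hypothetical names introduced for illustration; this is not the meta-service code, only one way the described behavior could look.

```cpp
#include <cstdint>
#include <optional>
#include <tuple>

// Hypothetical key type; the field names follow the tuple in the reply above.
struct SchemaChangeJobKey {
    int64_t origin_idx;
    int64_t shadow_idx;
    int64_t origin_tablet;
    int64_t shadow_tablet;

    bool operator==(const SchemaChangeJobKey& other) const {
        return std::tie(origin_idx, shadow_idx, origin_tablet, shadow_tablet) ==
               std::tie(other.origin_idx, other.shadow_idx, other.origin_tablet,
                        other.shadow_tablet);
    }
};

enum class AbortResult { Removed, Skipped };

// Removes the recorded job only when the request names exactly that job.
// A second abort for the same job finds nothing recorded and is skipped,
// and an abort for a different job leaves the recorded one untouched.
AbortResult abort_job(std::optional<SchemaChangeJobKey>& recorded,
                      const SchemaChangeJobKey& request) {
    if (!recorded.has_value() || !(*recorded == request)) {
        return AbortResult::Skipped;
    }
    recorded.reset();
    return AbortResult::Removed;
}
```

Under these assumptions, retrying the abort in a loop is harmless: only the first matching request changes state.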
        ((CloudInternalCatalog) Env.getCurrentInternalCatalog())
                .removeSchemaChangeJob(dbId, tableId, baseIndexId, partitionId, baseTabletId);
    }
}
log here:
- which table/index has been processed (id, name, etc.)
- how many tablets have been processed here.
Also, what if it tries to abort the same tablet job multiple times?
MS (the meta-service) checks the tuple <origin_idx, rollup_idx, origin_tablet, rollup_tablet> carried in the request. If it does not match the recorded job, the request is skipped.
if (recorded_job.has_schema_change() && request->action() == FinishTabletJobRequest::COMMIT &&
    !check_compaction_input_verions(compaction, recorded_job)) {
    SS << "Check compaction input versions failed in schema change. input_version_start="
       << compaction.input_versions(0) << " input_version_end=" << compaction.input_versions(1)
SS
typo? it should be lowercase
#define SS (ss << &__FILE__[get_file_name_offset(__FILE__)] << ":" << __LINE__ << " ")
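To show why the uppercase `SS` is intentional, here is a small self-contained sketch of a macro of the same shape: it writes a `file:line ` prefix into a stream named `ss` that must already be in scope, then lets the caller keep streaming. `SS_SKETCH`, `file_name_offset`, and `describe_failure` are hypothetical stand-ins, and the use of `std::stringstream` is an assumption; the project's own `get_file_name_offset` may be implemented differently.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: index just past the last '/' in a path, so that
// &path[offset] points at the bare file name.
constexpr std::size_t file_name_offset(const char* path) {
    std::size_t offset = 0;
    for (std::size_t i = 0; path[i] != '\0'; ++i) {
        if (path[i] == '/') {
            offset = i + 1;
        }
    }
    return offset;
}

// Same shape as the SS macro quoted above. It expects a stream variable
// named `ss` in the enclosing scope and evaluates to that stream.
#define SS_SKETCH (ss << &__FILE__[file_name_offset(__FILE__)] << ":" << __LINE__ << " ")

// Builds an error message prefixed with the file name and line number.
std::string describe_failure(int start, int end) {
    std::stringstream ss;
    SS_SKETCH << "input_version_start=" << start << " input_version_end=" << end;
    return ss.str();
}
```

Because the macro expands to an expression on `ss`, a lowercase `ss <<` would compile too but would drop the location prefix.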
59894ff to 130c04e
run buildall
130c04e to 65ae146
run buildall
TPC-H: Total hot run time: 41877 ms
TPC-DS: Total hot run time: 170147 ms
ClickBench: Total hot run time: 30.76 s
run p0
PR approved by anyone and no changes requested.
PR approved by at least one committer and no changes requested.
…on in cloud mode (apache#37293)" This reverts commit b58b9e4.
…on in cloud mode (#37293)" (#38828)
We have to figure out later why it causes a SEGV when running cloud_p0.
```
#0 doris::cloud::TabletCompactionJobPB::_internal_input_versions(int) const /root/doris/cloud/../gensrc/build/gen_cpp/cloud.pb.h:48193:33
#1 doris::cloud::MetaServiceImpl::start_tablet_job(google::protobuf::RpcController*, doris::cloud::StartTabletJobRequest const*, doris::cloud::StartTabletJobResponse*, google::protobuf::Closure*) /root/doris/cloud/src/meta-service/meta_service_job.cpp:436:9
#2 void doris::cloud::MetaServiceProxy::call_impl<doris::cloud::StartTabletJobRequest, doris::cloud::StartTabletJobResponse>(void (doris::cloud::MetaService::*)(google::protobuf::RpcController*, doris::cloud::StartTabletJobRequest const*, doris::cloud::StartTabletJobResponse*, google::protobuf::Closure*), google::protobuf::RpcController*, doris::cloud::StartTabletJobRequest const*, doris::cloud::StartTabletJobResponse*, google::protobuf::Closure*) /root/doris/cloud/src/meta-service/meta_service.h:684:13
#3 doris::cloud::MetaServiceProxy::start_tablet_job(google::protobuf::RpcController*, doris::cloud::StartTabletJobRequest const*, doris::cloud::StartTabletJobResponse*, google::protobuf::Closure*) /root/doris/cloud/src/meta-service/meta_service.h:478:9
#4 doris::cloud::MetaService::CallMethod(google::protobuf::MethodDescriptor const*, google::protobuf::RpcController*, google::protobuf::Message const*, google::protobuf::Message*, google::protobuf::Closure*) /root/doris/gensrc/build/gen_cpp/cloud.pb.cc:0:7
```
This PR also adds some FE logging.
…oud mode (apache#37293)
In cloud mode, during a schema change, the shadow tablet hits error -235 under a large number of loads because it cannot do cumulative compaction. This prevents the user from continuing to load data.
Implementation details:
1. When the schema change starts, record the end version of the rowsets to convert, `alter_version`, in the SchemaChangeJob.
2. The origin tablet can only do base compaction in [0, `alter_version`] and cumulative compaction in (`alter_version`, N]. It cannot do a compaction that crosses `alter_version`, such as a compaction over [a, `alter_version` + n].
3. The shadow tablet cannot do base compaction; it can only do cumulative compaction in (`alter_version`, N].
4. When the schema change fails because of an FE or BE coredump, it is retried. The retry gets `alter_version` from meta_service and continues from there.
5. When the schema change job finishes or is cancelled, the schema change job must be cleared. Before this PR, it was only overwritten by the next schema change.
…mpaction during schema change in cloud mode (#39558)
## Proposed changes
In cloud mode, during a schema change, the shadow tablet hits error -235 under a large number of loads because it cannot do cumulative compaction. This prevents the user from continuing to load data.
Implementation details:
1. When the schema change starts, record the end version of the rowsets to convert, `alter_version`, in the SchemaChangeJob.
2. The origin tablet can only do base compaction in [0, `alter_version`] and cumulative compaction in (`alter_version`, N]. It cannot do a compaction that crosses `alter_version`, such as a compaction over [a, `alter_version` + n].
3. The shadow tablet cannot do base compaction; it can only do cumulative compaction in (`alter_version`, N].
4. When the schema change fails because of an FE or BE coredump, it is retried. The retry gets `alter_version` from meta_service and continues from there.
5. When the schema change job finishes or is cancelled, the schema change job must be cleared. Before this PR, it was only overwritten by the next schema change.
co-author(main author): @Lchangliang
original PR: #37293
---------
Co-authored-by: Lightman <[email protected]>
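In cloud mode, during a schema change, the shadow tablet hits error -235 (too many versions) under a large number of loads because it cannot do cumulative compaction. This prevents the user from continuing to load data.
Implementation details:
1. When the schema change starts, record the end version of the rowsets to convert, `alter_version`, in the SchemaChangeJob.
2. The origin tablet can only do base compaction in [0, `alter_version`] and cumulative compaction in (`alter_version`, N]. It cannot do a compaction that crosses `alter_version`, such as a compaction over [a, `alter_version` + n].
3. The shadow tablet cannot do base compaction; it can only do cumulative compaction in (`alter_version`, N].
4. When the schema change fails because of an FE or BE coredump, it is retried. The retry gets `alter_version` from meta_service and continues from there.
5. When the schema change job finishes or is cancelled, the schema change job must be cleared. Before this PR, it was only overwritten by the next schema change.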
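The compaction ranges in the implementation details can be read as a single predicate over the tablet role, the compaction kind, and the input version range. The sketch below encodes that reading. `TabletRole`, `CompactionKind`, and `compaction_allowed` are hypothetical names, and the code assumes inclusive `[start, end]` version ranges; it illustrates the stated rules and is not the check implemented in the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical enums for illustration; not the actual Doris types.
enum class TabletRole { Origin, Shadow };
enum class CompactionKind { Base, Cumulative };

// Returns true if a compaction over the inclusive version range [start, end]
// is permitted while a schema change with the given alter_version is running.
bool compaction_allowed(TabletRole role, CompactionKind kind, int64_t start, int64_t end,
                        int64_t alter_version) {
    assert(start >= 0 && start <= end);  // precondition: a well-formed range

    const bool at_or_below = end <= alter_version;  // entirely inside [0, alter_version]
    const bool above = start > alter_version;       // entirely inside (alter_version, N]

    if (kind == CompactionKind::Base) {
        // Only the origin tablet may run base compaction, and only up to alter_version.
        return role == TabletRole::Origin && at_or_below;
    }
    // Cumulative compaction must stay strictly above alter_version on both tablets.
    // A range that crosses alter_version satisfies neither condition and is rejected.
    return above;
}
```

For example, with `alter_version` equal to 10, a cumulative compaction over [11, 20] is allowed on the shadow tablet, while a compaction over [8, 12] is rejected on either tablet because it crosses the boundary.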