Full duplication got stuck in DS_PREPARE due to an error after sending the request to create the follower table on the target cluster #2146

Closed
empiredan opened this issue Nov 15, 2024 · 0 comments · Fixed by #2144
Labels
component/duplication cluster duplication type/bug This issue reports a bug.

Comments

@empiredan (Contributor)

First, I created a new duplication. The related logs from the primary meta server of the source cluster are as follows:

D2024-10-22 10:37:21.642 (1729564641642000641 0088)   meta.meta_state0.0102000500000411: meta_duplication_service.cpp:183:add_duplication(): add duplication for app(zgy_test_dup), remote cluster name is target_cluster, remote app name is wetry_db.zgy_test_dup, remote replica count is 3
D2024-10-22 10:37:21.644 (1729564641644158230 0088)   meta.meta_state0.01030000000000dd: meta_duplication_service.cpp:324:operator()(): [a27d1729564641] add duplication successfully [app_name: zgy_test_dup, remote_cluster_name: target_cluster, remote_app_name: wetry_db.zgy_test_dup]
D2024-10-22 10:37:21.644 (1729564641644182306 0088)   meta.meta_state0.01030000000000dd: duplication_info.cpp:172:persist_status(): [a27d1729564641] change duplication status from DS_INIT to DS_PREPARE successfully [app_id: 27]
W2024-10-22 10:37:21.827 (1729564641827424045 0088)   meta.meta_state0.01020003000003dd: meta_duplication_service.cpp:172:all_checkpoint_has_prepared(): replica checkpoint still running: 0/4
D2024-10-22 10:37:21.827 (1729564641827543128 0088)   meta.meta_state0.01020003000003dd: duplication_info.cpp:203:report_progress_if_time_up(): duplication report: {"create_ts":"2024-10-22 10:37:21","dupid":1729564641,"fail_mode":"FAIL_SLOW","progress":{"0":-1,"1":-1,"2":-1,"3":-1},"remote":"target_cluster","remote_app_name":"wetry_db.zgy_test_dup","remote_replica_count":3,"status":"DS_PREPARE"}
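
The "replica checkpoint still running: 0/4" warning above is the gate that holds a duplication in DS_PREPARE: the meta server advances the status only after every partition reports a prepared checkpoint. Below is a minimal sketch of such a check, with illustrative types and field names rather than the actual Pegasus implementation:

```cpp
// Hedged sketch, not the real Pegasus code: how a DS_PREPARE gate like
// all_checkpoint_has_prepared() (meta_duplication_service.cpp:172) may work.
#include <cstdint>
#include <cstdio>
#include <map>

struct duplication_info
{
    // Partition index -> confirmed decree; -1 means that partition's
    // checkpoint is not prepared yet, cf. "progress":{"0":-1,...} above.
    std::map<int32_t, int64_t> progress;
};

// The duplication may leave DS_PREPARE only when this returns true; the
// "0/4" in the warning log reads as prepared/total under such a check.
bool all_checkpoint_has_prepared(const duplication_info &dup)
{
    std::size_t prepared = 0;
    for (const auto &kv : dup.progress) {
        if (kv.second >= 0) {
            ++prepared;
        }
    }
    std::printf("replica checkpoint still running: %zu/%zu\n",
                prepared, dup.progress.size());
    return prepared == dup.progress.size();
}

int main()
{
    duplication_info dup;
    dup.progress = {{0, -1}, {1, -1}, {2, -1}, {3, -1}}; // as in the report
    return all_checkpoint_has_prepared(dup) ? 0 : 1;
}
```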

However, the duplication got stuck in DS_PREPARE and never advanced to the next status. To find the reason, I checked the error logs of the primary meta server of the source cluster:

E2024-10-22 10:38:06.829 (1729564686829455037 0088)   meta.meta_state0.01030000000000e6: meta_duplication_service.cpp:502:operator()(): create follower app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint failed: duplication_status = DS_PREPARE, create_err = ERR_TIMEOUT, update_err = ERR_NO_NEED_OPERATE
......
E2024-10-22 10:38:11.833 (1729564691833345364 0088)   meta.meta_state0.01030000000000e8: meta_duplication_service.cpp:502:operator()(): create follower app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint failed: duplication_status = DS_PREPARE, create_err = ERR_APP_EXIST, update_err = ERR_NO_NEED_OPERATE
......
E2024-10-22 13:23:02.220 (1729574582220478204 0088)   meta.meta_state0.01030000000008a2: meta_duplication_service.cpp:502:operator()(): create follower app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint failed: duplication_status = DS_PREPARE, create_err = ERR_APP_EXIST, update_err = ERR_NO_NEED_OPERATE

An error occurred while creating the table on the follower cluster, so the primary meta server of the source cluster never received a successful response: the first request failed with ERR_TIMEOUT, and every retry after that failed with ERR_APP_EXIST. ERR_APP_EXIST means the table had already been created on the follower cluster, so the source cluster could not create it again by re-sending the same request.
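
To make the failure mode concrete, here is a hedged sketch of the control flow behind the meta_duplication_service.cpp:502 errors above; the enum values echo the logs, but the callback body is a simplified assumption, not the real Pegasus code. Since anything other than ERR_OK keeps the status at DS_PREPARE and re-sends the identical create-table request, the first ERR_TIMEOUT dooms every later retry to ERR_APP_EXIST:

```cpp
// Hedged sketch: why one lost response turns into an endless ERR_APP_EXIST loop.
#include <cstdio>

enum error_code { ERR_OK, ERR_TIMEOUT, ERR_APP_EXIST, ERR_NO_NEED_OPERATE };
enum duplication_status { DS_INIT, DS_PREPARE, DS_APP_CREATED };

struct duplication_info
{
    duplication_status status = DS_PREPARE;
};

// Hypothetical callback for the create-follower-app RPC: only an explicit
// ERR_OK advances the status; every other result logs and retries as-is.
void on_follower_app_created(error_code create_err, duplication_info &dup)
{
    error_code update_err = ERR_NO_NEED_OPERATE;
    if (create_err == ERR_OK) {
        dup.status = DS_APP_CREATED;
        update_err = ERR_OK;
    }
    if (create_err != ERR_OK || update_err != ERR_OK) {
        std::printf("create follower app failed: duplication_status = %d, "
                    "create_err = %d, update_err = %d\n",
                    dup.status, create_err, update_err);
        // The identical request is re-sent; once the table exists on the
        // follower cluster, this can only ever produce ERR_APP_EXIST again.
    }
}

int main()
{
    duplication_info dup;
    on_follower_app_created(ERR_TIMEOUT, dup);   // first attempt: response lost
    on_follower_app_created(ERR_APP_EXIST, dup); // all retries fail like this
    return 0;
}
```

Presumably the remedy is to tolerate ERR_APP_EXIST at this point, since the follower table does exist; see the linked PR #2144 for the actual change.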

On the follower cluster, the ID of the created table is 4672, and the primary meta server of the target cluster sent the proposal to the primary replica of each partition of the table:

D2024-10-22 13:22:56.98 (1729574576098497750 2263d3)   meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.0), ballot = 0, target = 10.2.131.174:8171, node = 10.2.131.174:8171
D2024-10-22 13:22:56.98 (1729574576098581429 2263d3)   meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.1), ballot = 0, target = 10.2.131.173:8171, node = 10.2.131.173:8171
D2024-10-22 13:22:56.98 (1729574576098612769 2263d3)   meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.2), ballot = 0, target = 10.2.131.172:8171, node = 10.2.131.172:8171
D2024-10-22 13:22:56.98 (1729574576098655070 2263d3)   meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.3), ballot = 0, target = 10.2.131.170:8171, node = 10.2.131.170:8171
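
For reference, a toy sketch of what the send_proposal() lines above amount to: the target cluster's meta server asks one node per partition to become the primary replica of the new table. The types and the loop below are simplified assumptions, not the code in server_state.cpp:

```cpp
// Hedged sketch: one CT_ASSIGN_PRIMARY proposal per partition of table 4672.
#include <cstdio>
#include <string>
#include <vector>

struct gpid
{
    int app_id;
    int pidx;
};

// Stand-in for server_state::send_proposal(); it only prints here.
void send_proposal(const gpid &pid, const std::string &node)
{
    std::printf("send proposal CT_ASSIGN_PRIMARY for gpid(%d.%d), ballot = 0, "
                "target = %s, node = %s\n",
                pid.app_id, pid.pidx, node.c_str(), node.c_str());
}

int main()
{
    // The nodes chosen for partitions 0..3 come from the logs above.
    const std::vector<std::string> nodes = {"10.2.131.174:8171",
                                            "10.2.131.173:8171",
                                            "10.2.131.172:8171",
                                            "10.2.131.170:8171"};
    for (int pidx = 0; pidx < static_cast<int>(nodes.size()); ++pidx) {
        send_proposal(gpid{4672, pidx}, nodes[pidx]);
    }
    return 0;
}
```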

The replica server in charge of the primary replica also received the proposal. However, it failed to connect to the replica server of the source cluster to get the checkpoint data (there is actually a security policy on the source cluster that drops packets from other clusters):

D2024-10-22 13:22:56.98 (1729574576098611769 2d5346) replica.replica9.020053150cc06bae: replica_stub.cpp:1028:on_config_proposal(): 4672.1@10.2.131.173:8171: received config proposal replication::config_type::CT_ASSIGN_PRIMARY for 10.2.131.173:8171
D2024-10-22 13:22:56.99 (1729574576099024706 2d5371) replica.rep_long2.02040009060b292b: fs_manager.cpp:273:allocate_dir(): 10.2.131.173:8171: put pid(4672.1) to dir(data0), which has 0 replicas of current app, 10762 replicas totally
D2024-10-22 13:22:56.99 (1729574576099271963 2d5371) replica.rep_long2.02040009060b292b: replica_follower.cpp:83:duplicate_checkpoint(): [4672.1@10.2.131.173:8171] start duplicate master[source_cluster.zgy_test_dup] checkpoint
D2024-10-22 13:22:56.99 (1729574576099314314 2d5323) replica.default6.020600020000babc: replica_follower.cpp:109:async_duplicate_checkpoint_from_master_replica(): [4672.1@10.2.131.173:8171] query master[source_cluster.zgy_test_dup] replica configuration
......
E2024-10-22 13:23:01.99 (1729574581099441055 2d5324) replica.default7.020100060717fc1e: replica_follower.cpp:138:update_master_replica_config(): [4672.1@10.2.131.173:8171] query master[source_cluster.zgy_test_dup] config failed: ERR_TIMEOUT
E2024-10-22 13:23:01.99 (1729574581099538649 2d5371) replica.rep_long2.02040009060b292b: replica_stub.cpp:2320:new_replica(): 4672.1@10.2.131.173:8171: try to duplicate replica checkpoint failed, error(ERR_TRY_AGAIN) and please check previous detail error log
D2024-10-22 13:23:01.99 (1729574581099608162 2d5371) replica.rep_long2.02040009060b292b: replica.cpp:546:close(): [email protected]:8171: replica closed, time_used = 0ms
D2024-10-22 13:23:01.99 (1729574581099614989 2d5371) replica.rep_long2.02040009060b292b: replica.cpp:546:close(): [email protected]:8171: replica closed, time_used = 0ms
D2024-10-22 13:23:01.99 (1729574581099665868 2d5371) replica.rep_long2.02040009060b292b: fs_manager.cpp:295:remove_replica(): 10.2.131.173:8171: remove gpid(4672.1) from dir(data0)
D2024-10-22 13:23:01.99 (1729574581099673613 2d5371) replica.rep_long2.02040009060b292b: replica_stub.cpp:2259:open_replica(): 4672.1@10.2.131.173:8171: open replica failed, erase from opening replicas
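
Putting the follower-side logs together, the sequence is roughly the following. The sketch below uses simplified types; the function names echo replica_follower.cpp and replica_stub.cpp, but the bodies are assumptions:

```cpp
// Hedged sketch: the follower replica's duplicate-checkpoint attempt. The
// config query to the source cluster times out (its security policy drops
// foreign packets), so the attempt ends in ERR_TRY_AGAIN and the half-opened
// replica is closed and its directory removed, as in the logs above.
#include <cstdio>

enum error_code { ERR_OK, ERR_TIMEOUT, ERR_TRY_AGAIN };

// Stand-in for the RPC that queries the master replica's configuration on
// the source cluster; on this deployment it always times out.
error_code query_master_replica_config() { return ERR_TIMEOUT; }

error_code duplicate_checkpoint()
{
    error_code err = query_master_replica_config();
    if (err != ERR_OK) {
        std::printf("query master config failed: %d\n", err);
        // replica_stub then reports ERR_TRY_AGAIN, closes the replica and
        // removes gpid(4672.1) from its data dir (see the logs above).
        return ERR_TRY_AGAIN;
    }
    // Otherwise the checkpoint files would be copied from the master replica.
    return ERR_OK;
}

int main()
{
    if (duplicate_checkpoint() != ERR_OK) {
        std::printf("open replica failed, erase from opening replicas\n");
    }
    return 0;
}
```

So the duplication can never leave DS_PREPARE: the follower table exists, but its replicas can never fetch the checkpoint from the source cluster, while on the source side the create-table retries keep failing with ERR_APP_EXIST.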