Full duplication got stuck in DS_PREPARE due to an error after sending the request to create the follower table on the target cluster #2146

Closed
empiredan opened this issue Nov 15, 2024 · 0 comments · Fixed by #2144
Labels
component/duplication cluster duplication type/bug This issue reports a bug.

Comments

@empiredan (Contributor)

First, I created a new duplication. The related logs from the primary meta server of the source cluster are as follows:

D2024-10-22 10:37:21.642 (1729564641642000641 0088)   meta.meta_state0.0102000500000411: meta_duplication_service.cpp:183:add_duplication(): add duplication for app(zgy_test_dup), remote cluster name is target_cluster, remote app name is wetry_db.zgy_test_dup, remote replica count is 3
D2024-10-22 10:37:21.644 (1729564641644158230 0088)   meta.meta_state0.01030000000000dd: meta_duplication_service.cpp:324:operator()(): [a27d1729564641] add duplication successfully [app_name: zgy_test_dup, remote_cluster_name: target_cluster, remote_app_name: wetry_db.zgy_test_dup]
D2024-10-22 10:37:21.644 (1729564641644182306 0088)   meta.meta_state0.01030000000000dd: duplication_info.cpp:172:persist_status(): [a27d1729564641] change duplication status from DS_INIT to DS_PREPARE successfully [app_id: 27]
W2024-10-22 10:37:21.827 (1729564641827424045 0088)   meta.meta_state0.01020003000003dd: meta_duplication_service.cpp:172:all_checkpoint_has_prepared(): replica checkpoint still running: 0/4
D2024-10-22 10:37:21.827 (1729564641827543128 0088)   meta.meta_state0.01020003000003dd: duplication_info.cpp:203:report_progress_if_time_up(): duplication report: {"create_ts":"2024-10-22 10:37:21","dupid":1729564641,"fail_mode":"FAIL_SLOW","progress":{"0":-1,"1":-1,"2":-1,"3":-1},"remote":"target_cluster","remote_app_name":"wetry_db.zgy_test_dup","remote_replica_count":3,"status":"DS_PREPARE"}
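
The "replica checkpoint still running: 0/4" warning above is the gate that holds a duplication in DS_PREPARE: the meta server advances the status only after every partition reports a prepared checkpoint. Below is a minimal sketch of such a check, with illustrative types and field names rather than the actual Pegasus implementation:

```cpp
// Hedged sketch, not the real Pegasus code: how a DS_PREPARE gate like
// all_checkpoint_has_prepared() (meta_duplication_service.cpp:172) may work.
#include <cstdint>
#include <cstdio>
#include <map>

struct duplication_info
{
    // Partition index -> confirmed decree; -1 means that partition's
    // checkpoint is not prepared yet, cf. "progress":{"0":-1,...} above.
    std::map<int32_t, int64_t> progress;
};

// The duplication may leave DS_PREPARE only when this returns true; the
// "0/4" in the warning log reads as prepared/total under such a check.
bool all_checkpoint_has_prepared(const duplication_info &dup)
{
    std::size_t prepared = 0;
    for (const auto &kv : dup.progress) {
        if (kv.second >= 0) {
            ++prepared;
        }
    }
    std::printf("replica checkpoint still running: %zu/%zu\n",
                prepared, dup.progress.size());
    return prepared == dup.progress.size();
}

int main()
{
    duplication_info dup;
    dup.progress = {{0, -1}, {1, -1}, {2, -1}, {3, -1}}; // as in the report
    return all_checkpoint_has_prepared(dup) ? 0 : 1;
}
```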

However, the duplication got stuck in DS_PREPARE and never advanced to the next status. To find the reason, I checked the error logs of the primary meta server of the source cluster:

E2024-10-22 10:38:06.829 (1729564686829455037 0088)   meta.meta_state0.01030000000000e6: meta_duplication_service.cpp:502:operator()(): create follower app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint failed: duplication_status = DS_PREPARE, create_err = ERR_TIMEOUT, update_err = ERR_NO_NEED_OPERATE
......
E2024-10-22 10:38:11.833 (1729564691833345364 0088)   meta.meta_state0.01030000000000e8: meta_duplication_service.cpp:502:operator()(): create follower app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint failed: duplication_status = DS_PREPARE, create_err = ERR_APP_EXIST, update_err = ERR_NO_NEED_OPERATE
......
E2024-10-22 13:23:02.220 (1729574582220478204 0088)   meta.meta_state0.01030000000008a2: meta_duplication_service.cpp:502:operator()(): create follower app[target_cluster.wetry_db.zgy_test_dup] to trigger duplicate checkpoint failed: duplication_status = DS_PREPARE, create_err = ERR_APP_EXIST, update_err = ERR_NO_NEED_OPERATE

An error occurred while creating the table on the follower cluster, so the primary meta server of the source cluster never received a successful response: the first request failed with ERR_TIMEOUT, and every retry after that failed with ERR_APP_EXIST. ERR_APP_EXIST means the table had already been created on the follower cluster, so the source cluster could not create it again by re-sending the same request.
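
To make the failure mode concrete, here is a hedged sketch of the control flow behind the meta_duplication_service.cpp:502 errors above; the enum values echo the logs, but the callback body is a simplified assumption, not the real Pegasus code. Since anything other than ERR_OK keeps the status at DS_PREPARE and re-sends the identical create-table request, the first ERR_TIMEOUT dooms every later retry to ERR_APP_EXIST:

```cpp
// Hedged sketch: why one lost response turns into an endless ERR_APP_EXIST loop.
#include <cstdio>

enum error_code { ERR_OK, ERR_TIMEOUT, ERR_APP_EXIST, ERR_NO_NEED_OPERATE };
enum duplication_status { DS_INIT, DS_PREPARE, DS_APP_CREATED };

struct duplication_info
{
    duplication_status status = DS_PREPARE;
};

// Hypothetical callback for the create-follower-app RPC: only an explicit
// ERR_OK advances the status; every other result logs and retries as-is.
void on_follower_app_created(error_code create_err, duplication_info &dup)
{
    error_code update_err = ERR_NO_NEED_OPERATE;
    if (create_err == ERR_OK) {
        dup.status = DS_APP_CREATED;
        update_err = ERR_OK;
    }
    if (create_err != ERR_OK || update_err != ERR_OK) {
        std::printf("create follower app failed: duplication_status = %d, "
                    "create_err = %d, update_err = %d\n",
                    dup.status, create_err, update_err);
        // The identical request is re-sent; once the table exists on the
        // follower cluster, this can only ever produce ERR_APP_EXIST again.
    }
}

int main()
{
    duplication_info dup;
    on_follower_app_created(ERR_TIMEOUT, dup);   // first attempt: response lost
    on_follower_app_created(ERR_APP_EXIST, dup); // all retries fail like this
    return 0;
}
```

Presumably the remedy is to tolerate ERR_APP_EXIST at this point, since the follower table does exist; see the linked PR #2144 for the actual change.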

On the follower cluster, the ID of the created table is 4672, and the primary meta server of the target cluster sent the proposal to the primary replica of each partition of the table:

D2024-10-22 13:22:56.98 (1729574576098497750 2263d3)   meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.0), ballot = 0, target = 10.2.131.174:8171, node = 10.2.131.174:8171
D2024-10-22 13:22:56.98 (1729574576098581429 2263d3)   meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.1), ballot = 0, target = 10.2.131.173:8171, node = 10.2.131.173:8171
D2024-10-22 13:22:56.98 (1729574576098612769 2263d3)   meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.2), ballot = 0, target = 10.2.131.172:8171, node = 10.2.131.172:8171
D2024-10-22 13:22:56.98 (1729574576098655070 2263d3)   meta.meta_state0.01010000000009b9: server_state.cpp:1381:send_proposal(): send proposal replication::config_type::CT_ASSIGN_PRIMARY for gpid(4672.3), ballot = 0, target = 10.2.131.170:8171, node = 10.2.131.170:8171
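
For reference, a toy sketch of what the send_proposal() lines above amount to: the target cluster's meta server asks one node per partition to become the primary replica of the new table. The types and the loop below are simplified assumptions, not the code in server_state.cpp:

```cpp
// Hedged sketch: one CT_ASSIGN_PRIMARY proposal per partition of table 4672.
#include <cstdio>
#include <string>
#include <vector>

struct gpid
{
    int app_id;
    int pidx;
};

// Stand-in for server_state::send_proposal(); it only prints here.
void send_proposal(const gpid &pid, const std::string &node)
{
    std::printf("send proposal CT_ASSIGN_PRIMARY for gpid(%d.%d), ballot = 0, "
                "target = %s, node = %s\n",
                pid.app_id, pid.pidx, node.c_str(), node.c_str());
}

int main()
{
    // The nodes chosen for partitions 0..3 come from the logs above.
    const std::vector<std::string> nodes = {"10.2.131.174:8171",
                                            "10.2.131.173:8171",
                                            "10.2.131.172:8171",
                                            "10.2.131.170:8171"};
    for (int pidx = 0; pidx < static_cast<int>(nodes.size()); ++pidx) {
        send_proposal(gpid{4672, pidx}, nodes[pidx]);
    }
    return 0;
}
```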

The replica server in charge of the primary replica also received the proposal. However, it failed to connect to the replica server of the source cluster to get the checkpoint data (there is actually a security policy on the source cluster that drops packets from other clusters):

D2024-10-22 13:22:56.98 (1729574576098611769 2d5346) replica.replica9.020053150cc06bae: replica_stub.cpp:1028:on_config_proposal(): 4672.1@10.2.131.173:8171: received config proposal replication::config_type::CT_ASSIGN_PRIMARY for 10.2.131.173:8171
D2024-10-22 13:22:56.99 (1729574576099024706 2d5371) replica.rep_long2.02040009060b292b: fs_manager.cpp:273:allocate_dir(): 10.2.131.173:8171: put pid(4672.1) to dir(data0), which has 0 replicas of current app, 10762 replicas totally
D2024-10-22 13:22:56.99 (1729574576099271963 2d5371) replica.rep_long2.02040009060b292b: replica_follower.cpp:83:duplicate_checkpoint(): [4672.1@10.2.131.173:8171] start duplicate master[source_cluster.zgy_test_dup] checkpoint
D2024-10-22 13:22:56.99 (1729574576099314314 2d5323) replica.default6.020600020000babc: replica_follower.cpp:109:async_duplicate_checkpoint_from_master_replica(): [4672.1@10.2.131.173:8171] query master[source_cluster.zgy_test_dup] replica configuration
......
E2024-10-22 13:23:01.99 (1729574581099441055 2d5324) replica.default7.020100060717fc1e: replica_follower.cpp:138:update_master_replica_config(): [4672.1@10.2.131.173:8171] query master[source_cluster.zgy_test_dup] config failed: ERR_TIMEOUT
E2024-10-22 13:23:01.99 (1729574581099538649 2d5371) replica.rep_long2.02040009060b292b: replica_stub.cpp:2320:new_replica(): 4672.1@10.2.131.173:8171: try to duplicate replica checkpoint failed, error(ERR_TRY_AGAIN) and please check previous detail error log
D2024-10-22 13:23:01.99 (1729574581099608162 2d5371) replica.rep_long2.02040009060b292b: replica.cpp:546:close(): [email protected]:8171: replica closed, time_used = 0ms
D2024-10-22 13:23:01.99 (1729574581099614989 2d5371) replica.rep_long2.02040009060b292b: replica.cpp:546:close(): [email protected]:8171: replica closed, time_used = 0ms
D2024-10-22 13:23:01.99 (1729574581099665868 2d5371) replica.rep_long2.02040009060b292b: fs_manager.cpp:295:remove_replica(): 10.2.131.173:8171: remove gpid(4672.1) from dir(data0)
D2024-10-22 13:23:01.99 (1729574581099673613 2d5371) replica.rep_long2.02040009060b292b: replica_stub.cpp:2259:open_replica(): 4672.1@10.2.131.173:8171: open replica failed, erase from opening replicas
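
Putting the follower-side logs together, the sequence is roughly the following. The sketch below uses simplified types; the function names echo replica_follower.cpp and replica_stub.cpp, but the bodies are assumptions:

```cpp
// Hedged sketch: the follower replica's duplicate-checkpoint attempt. The
// config query to the source cluster times out (its security policy drops
// foreign packets), so the attempt ends in ERR_TRY_AGAIN and the half-opened
// replica is closed and its directory removed, as in the logs above.
#include <cstdio>

enum error_code { ERR_OK, ERR_TIMEOUT, ERR_TRY_AGAIN };

// Stand-in for the RPC that queries the master replica's configuration on
// the source cluster; on this deployment it always times out.
error_code query_master_replica_config() { return ERR_TIMEOUT; }

error_code duplicate_checkpoint()
{
    error_code err = query_master_replica_config();
    if (err != ERR_OK) {
        std::printf("query master config failed: %d\n", err);
        // replica_stub then reports ERR_TRY_AGAIN, closes the replica and
        // removes gpid(4672.1) from its data dir (see the logs above).
        return ERR_TRY_AGAIN;
    }
    // Otherwise the checkpoint files would be copied from the master replica.
    return ERR_OK;
}

int main()
{
    if (duplicate_checkpoint() != ERR_OK) {
        std::printf("open replica failed, erase from opening replicas\n");
    }
    return 0;
}
```

So the duplication can never leave DS_PREPARE: the follower table exists, but its replicas can never fetch the checkpoint from the source cluster, while on the source side the create-table retries keep failing with ERR_APP_EXIST.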