cluster re-sync after connectivity loss #3748
Unanswered
WorkingClass
asked this question in Q&A
Replies: 2 comments
-
Hey Victor, we ran into the same issue. Have you found any solution other than restarting the node?
-
Hello Aljoscha,
We upgraded to the latest 3.1.x and increased the bandwidth between our two DCs. No magic.
Sorry for the late reply.
Thanks
-
Hello,
We run a CouchDB 3.x (git rev-parse HEAD: e83935c) cluster with 3 nodes.
[cluster]
q=4
n=3
placement = z1:2,z2:1
z1 and z2 are in different datacenters.
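For reference, node reachability from a given node can be checked with the standard _membership and _up endpoints; a minimal sketch (admin credentials are placeholders):
curl -s http://admin:***@couchdb1.site.com:5984/_membership
# => {"all_nodes":[...],"cluster_nodes":[...]}
#    a node listed in cluster_nodes but missing from all_nodes is currently not connected
curl -s http://admin:***@couchdb3.site.com:5984/_up
# => {"status":"ok"} when that node is itself serving requests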
Cause of the issue:
[error] 2021-09-08T03:35:35.775157Z [email protected] <0.27127.173> -------- ** Node '[email protected]' not responding **
** Removing (timedout) connection **
[error] 2021-09-08T03:35:35.775288Z [email protected] <0.30996.2> -------- ** Node '[email protected]' not responding **
** Removing (timedout) connection **
[error] 2021-09-08T03:35:35.781578Z [email protected] emulator -------- Error in process <0.1593.0> on node '[email protected]' with exit value:
{{nocatch,{error,{nodedown,<<"progress not possible">>}}},[{fabric_view_changes,send_changes,6,[{file,"src/fabric_view_changes.erl"},{line,200}]},{fabric_view_changes,keep_sending_changes,8,[{file,"src/fabric_view_changes.erl"},{line,82}]},{fabric_view_changes,go,5,[{file,"src/fabric_view_changes.erl"},{line,43}]}]}
[warning] 2021-09-08T03:35:36.410891Z [email protected] <0.24863.708> -------- 4312 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-09-08T03:35:36.411025Z [email protected] <0.24863.708> -------- 4312 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-09-08T03:35:36.411071Z [email protected] <0.24863.708> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:35:36.981372Z [email protected] <0.469.710> -------- 4312 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-09-08T03:35:36.981488Z [email protected] <0.469.710> -------- 4312 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-09-08T03:35:36.981549Z [email protected] <0.469.710> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:54:36.904890Z [email protected] <0.15445.710> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:54:36.904982Z [email protected] <0.15445.710> -------- 4312 shards in cluster with only 2 copies on nodes that are currently up
[warning] 2021-09-08T03:54:36.905051Z [email protected] <0.15445.710> -------- 4312 shards in cluster with only 2 copies on nodes not in maintenance mode
[error] 2021-09-08T03:54:37.861002Z [email protected] emulator -------- Error in process <0.815.707> on node '[email protected]' with exit value:
{{rexi_EXIT,{{badmatch,{'EXIT',noproc}},[{couch_file,pread_binary,2,[{file,"src/couch_file.erl"},{line,172}]},{couch_file,pread_term,2,[{file,"src/couch_file.erl"},{line,160}]},{couch_btree,get_node,2,[{file,"src/couch_btree.erl"},{line,435}]},{couch_btree,lookup,3,[{file,"src/couch_btree.erl"},{line,286}]},{couch_btree,lookup,2,[{file,"src/couch_btree.erl"},{line,276}]},{couch_bt_engine,open_local_docs,2,[{file,"src/couch_bt_engine.erl"},{line,407}]},{couch_db,open_doc_int,3,[{file,"src/couch_db.erl"},{line,1664}]},{couch_db,open_doc,3,[{file,"src/couch_db.erl"},{line,292}]}]}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,392}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[warning] 2021-09-08T03:54:37.861771Z [email protected] <0.336.0> -------- mem3_sync shards/40000000-7fffffff/account/1e/1f/3b973e1db60829651a007231a52f-202109.1630229760 [email protected] {{rexi_EXIT,{{badmatch,{'EXIT',noproc}},[{couch_file,pread_binary,2,[{file,[115,114,99,47,99,111,117,99,104,95,102,105,108,101,46,101,114,108]},{line,172}]},{couch_file,pread_term,2,[{file,[115,114,99,47,99,111,117,99,104,95,102,105,108,101,46,101,114,108]},{line,160}]},{couch_btree,get_node,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,435}]},{couch_btree,lookup,3,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,286}]},{couch_btree,lookup,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,276}]},{couch_bt_engine,open_local_docs,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,95,101,110,103,105,110,101,46,101,114,108]},{line,407}]},{couch_db,open_doc_int,3,[{file,[115,114,99,47,99,111,117,99,104,95,100,98,46,101,114,108]},{line,1664}]},{couch_db,open_doc,3,[{file,[115,114,99,47,99,111,117,99,104,95,100,98,46,101,114,108]},{line,292}]}]}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,392}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
[warning] 2021-09-08T03:54:38.219543Z [email protected] <0.25714.711> -------- 2 conflicted shards in cluster
[error] 2021-09-08T03:54:39.795469Z [email protected] emulator -------- Error in process <0.2767.711> on node '[email protected]' with exit value:
{function_clause,[{couch_db,incref,[undefined],[{file,"src/couch_db.erl"},{line,190}]},{couch_server,open_int,2,[{file,"src/couch_server.erl"},{line,106}]},{couch_server,open,2,[{file,"src/couch_server.erl"},{line,96}]},{couch_db,open,2,[{file,"src/couch_db.erl"},{line,163}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,107}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
The issue: if the connectivity between the DCs is interrupted, one or two of the nodes keep running at high CPU because they keep hitting the errors below, even after connectivity is re-established:
[error] 2021-09-08T04:44:45.926690Z [email protected] emulator -------- Error in process <0.9532.1> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[error] 2021-09-08T04:44:45.926935Z [email protected] emulator -------- Error in process <0.9531.1> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[warning] 2021-09-08T04:44:45.933007Z [email protected] <0.1364.0> -------- mem3_sync shards/80000000-bfffffff/account/34/05/a4821ee6f1ade789cb8a1b2fd89e.1620895874 [email protected] {{rexi_DOWN,{'[email protected]',noproc}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
The only way I found to correct the issue is to restart the affected node.
Is there some setting to allow the node to "re-sync" after connectivity loss?
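For what it's worth, internal shard synchronization can also be requested per database over HTTP; a minimal sketch, assuming the standard _sync_shards endpoint (database name and credentials are placeholders):
curl -s -X POST http://admin:***@couchdb1.site.com:5984/account/_sync_shards
# => {"ok":true}  -- force-starts internal replication of all shard copies of that database
With thousands of shards this would have to be scripted over every database, so it is at best a workaround rather than the automatic re-sync I am after.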
Here is our replicator & fabric config:
[fabric]
; all_docs_concurrency = 10
; changes_duration =
; shard_timeout_factor = 2
; uuid_prefix_len = 7
request_timeout = 600000
; all_docs_timeout = 10000
attachments_timeout = 600000
; view_timeout = 3600000
; partition_view_timeout = 3600000
[replicator]
; Random jitter applied on replication job startup (milliseconds)
startup_jitter = 5000
; Number of actively running replications
max_jobs = 500
; Scheduling interval in milliseconds. During each reschedule cycle the
; scheduler may start or stop up to max_churn jobs.
interval = 60000
; Maximum number of replications to start and stop during rescheduling.
max_churn = 20
; More worker processes can give higher network throughput but can also
; imply more disk and network IO.
worker_processes = 4
; With lower batch sizes checkpoints are done more frequently. Lower batch sizes
; also reduce the total amount of used RAM memory.
worker_batch_size = 500
; Maximum number of HTTP connections per replication.
http_connections = 20
; HTTP connection timeout per replication.
; Even for very fast/reliable networks it might need to be increased if a remote
; database is too busy.
connection_timeout = 300000
; Request timeout
request_timeout = 600000
; If a request fails, the replicator will retry it up to N times.
retries_per_request = 5
; Use checkpoints
use_checkpoints = true
; Checkpoint interval
checkpoint_interval = 30000
socket_options = [{keepalive, true}, {nodelay, false}]
; Set to true to validate peer certificates.
verify_ssl_certificates = false
; File containing a list of peer trusted certificates (in the PEM format).
;ssl_trusted_certificates_file = /etc/ssl/certs/ca-certificates.crt
; Maximum peer certificate depth (must be set even if certificate validation is off).
ssl_certificate_max_depth = 3
; Maximum document ID length for replication.
;max_document_id_length = infinity
Thanks,
Victor