cluster re-sync after connectivity loss #3748
Unanswered
WorkingClass
asked this question in Q&A
Replies: 2 comments
-
Hey Victor, we ran into the same issue. Have you found any solution other than restarting the node?
-
Hello Aljoscha,
We upgraded to the latest 3.1.x and increased the bandwidth between our two DCs. No magic.
Sorry for the late reply.
Thanks
-
Hello,
We run a CouchDB 3.x (git rev-parse HEAD: e83935c) cluster with 3 nodes.
[cluster]
q=4
n=3
placement = z1:2,z2:1
z1 and z2 are in different datacenters.
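For reference, node reachability from a given node can be checked with the standard _membership and _up endpoints; a minimal sketch (admin credentials are placeholders):
curl -s http://admin:***@couchdb1.site.com:5984/_membership
# => {"all_nodes":[...],"cluster_nodes":[...]}
#    a node listed in cluster_nodes but missing from all_nodes is currently not connected
curl -s http://admin:***@couchdb3.site.com:5984/_up
# => {"status":"ok"} when that node is itself serving requests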
Cause of the issue:
[error] 2021-09-08T03:35:35.775157Z [email protected] <0.27127.173> -------- ** Node '[email protected]' not responding **
** Removing (timedout) connection **
[error] 2021-09-08T03:35:35.775288Z [email protected] <0.30996.2> -------- ** Node '[email protected]' not responding **
** Removing (timedout) connection **
[error] 2021-09-08T03:35:35.781578Z [email protected] emulator -------- Error in process <0.1593.0> on node '[email protected]' with exit value:
{{nocatch,{error,{nodedown,<<"progress not possible">>}}},[{fabric_view_changes,send_changes,6,[{file,"src/fabric_view_changes.erl"},{line,200}]},{fabric_view_changes,keep_sending_changes,8,[{file,"src/fabric_view_changes.erl"},{line,82}]},{fabric_view_changes,go,5,[{file,"src/fabric_view_changes.erl"},{line,43}]}]}
[warning] 2021-09-08T03:35:36.410891Z [email protected] <0.24863.708> -------- 4312 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-09-08T03:35:36.411025Z [email protected] <0.24863.708> -------- 4312 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-09-08T03:35:36.411071Z [email protected] <0.24863.708> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:35:36.981372Z [email protected] <0.469.710> -------- 4312 shards in cluster with only 1 copy on nodes that are currently up
[warning] 2021-09-08T03:35:36.981488Z [email protected] <0.469.710> -------- 4312 shards in cluster with only 1 copy on nodes not in maintenance mode
[warning] 2021-09-08T03:35:36.981549Z [email protected] <0.469.710> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:54:36.904890Z [email protected] <0.15445.710> -------- 2 conflicted shards in cluster
[warning] 2021-09-08T03:54:36.904982Z [email protected] <0.15445.710> -------- 4312 shards in cluster with only 2 copies on nodes that are currently up
[warning] 2021-09-08T03:54:36.905051Z [email protected] <0.15445.710> -------- 4312 shards in cluster with only 2 copies on nodes not in maintenance mode
[error] 2021-09-08T03:54:37.861002Z [email protected] emulator -------- Error in process <0.815.707> on node '[email protected]' with exit value:
{{rexi_EXIT,{{badmatch,{'EXIT',noproc}},[{couch_file,pread_binary,2,[{file,"src/couch_file.erl"},{line,172}]},{couch_file,pread_term,2,[{file,"src/couch_file.erl"},{line,160}]},{couch_btree,get_node,2,[{file,"src/couch_btree.erl"},{line,435}]},{couch_btree,lookup,3,[{file,"src/couch_btree.erl"},{line,286}]},{couch_btree,lookup,2,[{file,"src/couch_btree.erl"},{line,276}]},{couch_bt_engine,open_local_docs,2,[{file,"src/couch_bt_engine.erl"},{line,407}]},{couch_db,open_doc_int,3,[{file,"src/couch_db.erl"},{line,1664}]},{couch_db,open_doc,3,[{file,"src/couch_db.erl"},{line,292}]}]}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,392}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[warning] 2021-09-08T03:54:37.861771Z [email protected] <0.336.0> -------- mem3_sync shards/40000000-7fffffff/account/1e/1f/3b973e1db60829651a007231a52f-202109.1630229760 [email protected] {{rexi_EXIT,{{badmatch,{'EXIT',noproc}},[{couch_file,pread_binary,2,[{file,[115,114,99,47,99,111,117,99,104,95,102,105,108,101,46,101,114,108]},{line,172}]},{couch_file,pread_term,2,[{file,[115,114,99,47,99,111,117,99,104,95,102,105,108,101,46,101,114,108]},{line,160}]},{couch_btree,get_node,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,435}]},{couch_btree,lookup,3,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,286}]},{couch_btree,lookup,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,114,101,101,46,101,114,108]},{line,276}]},{couch_bt_engine,open_local_docs,2,[{file,[115,114,99,47,99,111,117,99,104,95,98,116,95,101,110,103,105,110,101,46,101,114,108]},{line,407}]},{couch_db,open_doc_int,3,[{file,[115,114,99,47,99,111,117,99,104,95,100,98,46,101,114,108]},{line,1664}]},{couch_db,open_doc,3,[{file,[115,114,99,47,99,111,117,99,104,95,100,98,46,101,114,108]},{line,292}]}]}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,392}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
[warning] 2021-09-08T03:54:38.219543Z [email protected] <0.25714.711> -------- 2 conflicted shards in cluster
[error] 2021-09-08T03:54:39.795469Z [email protected] emulator -------- Error in process <0.2767.711> on node '[email protected]' with exit value:
{function_clause,[{couch_db,incref,[undefined],[{file,"src/couch_db.erl"},{line,190}]},{couch_server,open_int,2,[{file,"src/couch_server.erl"},{line,106}]},{couch_server,open,2,[{file,"src/couch_server.erl"},{line,96}]},{couch_db,open,2,[{file,"src/couch_db.erl"},{line,163}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,107}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
The issue: if the connectivity between the DCs is interrupted, one or two of the nodes keep running at high CPU because they keep hitting the errors below, even after connectivity is re-established:
[error] 2021-09-08T04:44:45.926690Z [email protected] emulator -------- Error in process <0.9532.1> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[error] 2021-09-08T04:44:45.926935Z [email protected] emulator -------- Error in process <0.9531.1> on node '[email protected]' with exit value:
{{rexi_DOWN,{'[email protected]',noproc}},[{mem3_rpc,rexi_call,3,[{file,"src/mem3_rpc.erl"},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,"src/mem3_rep.erl"},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,"maps.erl"},{line,232}]},{maps,map,2,[{file,"maps.erl"},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,"src/mem3_rep.erl"},{line,390}]},{mem3_rep,repl,1,[{file,"src/mem3_rep.erl"},{line,292}]},{mem3_rep,go,1,[{file,"src/mem3_rep.erl"},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,"src/mem3_sync.erl"},{line,212}]}]}
[warning] 2021-09-08T04:44:45.933007Z [email protected] <0.1364.0> -------- mem3_sync shards/80000000-bfffffff/account/34/05/a4821ee6f1ade789cb8a1b2fd89e.1620895874 [email protected] {{rexi_DOWN,{'[email protected]',noproc}},[{mem3_rpc,rexi_call,3,[{file,[115,114,99,47,109,101,109,51,95,114,112,99,46,101,114,108]},{line,394}]},{mem3_rep,calculate_start_seq,3,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,402}]},{maps,'-map/2-lc$^0/1-0-',2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{maps,map,2,[{file,[109,97,112,115,46,101,114,108]},{line,232}]},{mem3_rep,calculate_start_seq_multi,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,390}]},{mem3_rep,repl,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,292}]},{mem3_rep,go,1,[{file,[115,114,99,47,109,101,109,51,95,114,101,112,46,101,114,108]},{line,111}]},{mem3_sync,'-start_push_replication/1-fun-0-',2,[{file,[115,114,99,47,109,101,109,51,95,115,121,110,99,46,101,114,108]},{line,212}]}]}
The only way I found to correct the issue is to restart the affected node.
Is there some setting to allow the node to "re-sync" after connectivity loss?
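For what it's worth, internal shard synchronization can also be requested per database over HTTP; a minimal sketch, assuming the standard _sync_shards endpoint (database name and credentials are placeholders):
curl -s -X POST http://admin:***@couchdb1.site.com:5984/account/_sync_shards
# => {"ok":true}  -- force-starts internal replication of all shard copies of that database
With thousands of shards this would have to be scripted over every database, so it is at best a workaround rather than the automatic re-sync I am after.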
Here is our replicator & fabric config:
[fabric]
; all_docs_concurrency = 10
; changes_duration =
; shard_timeout_factor = 2
; uuid_prefix_len = 7
request_timeout = 600000
; all_docs_timeout = 10000
attachments_timeout = 600000
; view_timeout = 3600000
; partition_view_timeout = 3600000
[replicator]
; Random jitter applied on replication job startup (milliseconds)
startup_jitter = 5000
; Number of actively running replications
max_jobs = 500
; Scheduling interval in milliseconds. During each reschedule cycle the
; scheduler may start or stop up to max_churn jobs.
interval = 60000
; Maximum number of replications to start and stop during rescheduling.
max_churn = 20
; More worker processes can give higher network throughput but can also
; imply more disk and network IO.
worker_processes = 4
; With lower batch sizes checkpoints are done more frequently. Lower batch sizes
; also reduce the total amount of used RAM memory.
worker_batch_size = 500
; Maximum number of HTTP connections per replication.
http_connections = 20
; HTTP connection timeout per replication.
; Even for very fast/reliable networks it might need to be increased if a remote
; database is too busy.
connection_timeout = 300000
; Request timeout
request_timeout = 600000
; If a request fails, the replicator will retry it up to N times.
retries_per_request = 5
; Use checkpoints
use_checkpoints = true
; Checkpoint interval
checkpoint_interval = 30000
socket_options = [{keepalive, true}, {nodelay, false}]
; Set to true to validate peer certificates.
verify_ssl_certificates = false
; File containing a list of peer trusted certificates (in the PEM format).
;ssl_trusted_certificates_file = /etc/ssl/certs/ca-certificates.crt
; Maximum peer certificate depth (must be set even if certificate validation is off).
ssl_certificate_max_depth = 3
; Maximum document ID length for replication.
;max_document_id_length = infinity
Thanks,
Victor