Recovery from data centre failure in multi-dc not working #420

mikenorgate · 2020-04-24T13:18:46Z

I am doing some tests to see if antidote would be a suitable solution for us. I am running some tests with a setup based on this docker compose example https://github.com/AntidoteDB/docker-antidote/blob/master/compose-files/dc3n1/docker-compose.yml

I am testing the scenario where a single data centre becomes unavailable and I want to see what the recovery behaviour is. However I have found the upon restarting the failed DC it does not catch up with changes that it has missed.

Here is how I'm doing my test:

docker stop dc2n1
Increment a counter in dc1n1
docker start dc2n1

After restarting dc2n1 the following is output in the logs and the line Buffering txn in {{'antidote@dc1n1', is repeated forever

2020-04-24T13:05:00.728870+00:00 notice:
disk_log: repairing "/antidote-data/data/data_antidote/0--0.LOG" ...

2020-04-24T13:05:00.756855+00:00 notice:
disk_log: repairing "/antidote-data/data/data_antidote/730750818665451459101842416358141509827966271488--730750818665451459101842416358141509827966271488.LOG" ...

2020-04-24T13:05:01.790063+00:00 info:
Starting heartbeat sender timers
2020-04-24T13:05:01.791126+00:00 notice:
Forgetting DC {'antidote@dc3n1',
                  {1587,733399,89049}}
2020-04-24T13:05:01.791356+00:00 notice:
Forgetting DC {'antidote@dc1n1',
                  {1587,733359,875651}}
2020-04-24T13:05:01.791527+00:00 info:
Observing DC {'antidote@dc3n1',
                 {1587,733399,89049}}
2020-04-24T13:05:07.827192+00:00 info:
Observing DC {'antidote@dc1n1',
                 {1587,733359,875651}}
2020-04-24T13:05:10.434336+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434412+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434554+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434580+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434671+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:10.434691+00:00 info:
Waiting for application antidote to start (0 seconds).
2020-04-24T13:05:13.522205+00:00 info:
Connected to DC {'antidote@dc3n1',
                    {1587,733399,89049}}
2020-04-24T13:05:13.553140+00:00 info:
Waiting for DC {'antidote@dc1n1',
                   {1587,733359,875651}}
2020-04-24T13:05:14.473092+00:00 info:
Whoops, lost message. New is 42, last was 27. Asking the remote DC {{'antidote@dc1n1',
                                                                     {1587,
                                                                      733359,
                                                                      875651}},
                                                                    730750818665451459101842416358141509827966271488}
2020-04-24T13:05:14.474678+00:00 critical:
Received unexpected log_reader_resp messages in state normal Message []. State {inter_dc_sub_buf,
                                                                                normal,
                                                                                {{'antidote@dc2n1',
                                                                                  {1587,
                                                                                   733385,
                                                                                   215012}},
                                                                                 730750818665451459101842416358141509827966271488},
                                                                                init,
                                                                                {[],
                                                                                 []},
                                                                                true}
2020-04-24T13:05:14.553899+00:00 info:
Waiting for DC {'antidote@dc1n1',
                   {1587,733359,875651}}
2020-04-24T13:05:15.474089+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:15.554915+00:00 info:
Connected to DC {'antidote@dc1n1',
                    {1587,733359,875651}}
2020-04-24T13:05:15.615894+00:00 info:
Starting heartbeat sender timers
2020-04-24T13:05:15.616401+00:00 info:
    application: antidote
    started_at: 'antidote@dc2n1'
2020-04-24T13:05:15.686080+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686282+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686460+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686635+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686875+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:15.686993+00:00 info:
Wait complete for application antidote (5 seconds)
2020-04-24T13:05:16.474943+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:17.476105+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:18.476976+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:19.477951+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}
2020-04-24T13:05:20.479036+00:00 info:
Buffering txn in {{'antidote@dc1n1',
                      {1587,733359,875651}},
                  730750818665451459101842416358141509827966271488}

The text was updated successfully, but these errors were encountered:

peterzeller · 2020-04-24T16:08:21Z

@dajenet This might be related to what you are working on. Can you check if you can reproduce this problem and whether it is fixed on your branch?

dajenet · 2020-04-27T07:14:20Z

Looks like the problem I encountered. I will take a look at it.

albsch added the bug label Sep 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Recovery from data centre failure in multi-dc not working #420

Recovery from data centre failure in multi-dc not working #420

mikenorgate commented Apr 24, 2020

peterzeller commented Apr 24, 2020

dajenet commented Apr 27, 2020

Recovery from data centre failure in multi-dc not working #420

Recovery from data centre failure in multi-dc not working #420

Comments

mikenorgate commented Apr 24, 2020

peterzeller commented Apr 24, 2020

dajenet commented Apr 27, 2020