Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix when we switch synchronous_standby_names to '*'. #488

Merged
merged 2 commits into from
Nov 2, 2020
Merged

Conversation

DimCitus
Copy link
Collaborator

Because we assign both SECONDARY and PRIMARY in the same monitor
ProceedGroupState call, we might end up with the primary fetching its new
synchronous_standby_names from the monitor before when the secondary has
made it to its new state.

With the previous coding, the primary would then retrieve '', meaning sync
replication is disabled, when what's needed is '*'.

We add some needed test coverage from the situation and fix it in the
monitor by returning '*' as soon as the other node is assigned SECONDARY
rather than only after it has reported the state.

@DimCitus DimCitus added the bug Something isn't working label Oct 30, 2020
@DimCitus DimCitus added this to the pg_auto_failover 1.4.1 milestone Oct 30, 2020
@DimCitus DimCitus requested a review from JelteF October 30, 2020 11:10
@DimCitus DimCitus self-assigned this Oct 30, 2020
tests/test_auth.py Show resolved Hide resolved
Because we assign both SECONDARY and PRIMARY in the same monitor
ProceedGroupState call, we might end up with the primary fetching its new
synchronous_standby_names from the monitor before when the secondary has
made it to its new state.

With the previous coding, the primary would then retrieve '', meaning sync
replication is disabled, when what's needed is '*'.

We add some needed test coverage from the situation and fix it in the
monitor by returning '*' as soon as the other node is assigned SECONDARY
rather than only after it has reported the state.
@jchampio
Copy link
Contributor

jchampio commented Oct 30, 2020

With this PR branch, I still get intermittent failures as in #481:

======================================================================
FAIL: test_multi_standbys.test_014_update_standby_names_when_adding_and_removing_a_standby
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/src/pg_auto_failover/tests/test_multi_standbys.py", line 245, in test_014_update_standby_names_when_adding_and_removing_a_standby
    eq_(node1.get_synchronous_standby_names(),
AssertionError: '' != '*'

I'm not sure that the work here is related, because test_014 is missing calls to wait_until_state().

Copy link
Contributor

@onderkalaci onderkalaci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, and I verified that the race condition is gone. (And, wow it was almost always setting the sync standbys to `` before the patch)

With this PR branch, I still get intermittent failures as in #481:

As we talked in a private chat, we realized that we might need to find the reason for #481 separately as that's multiple standbys are not going through this code-path

src/monitor/node_active_protocol.c Show resolved Hide resolved
@DimCitus DimCitus requested review from JelteF and removed request for JelteF November 2, 2020 14:19
Copy link
Contributor

@JelteF JelteF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed through chat, some test ordering should still be changed. Other than that this looks good to me.

@DimCitus DimCitus merged commit f593086 into master Nov 2, 2020
@DimCitus DimCitus deleted the fix/sync-rep branch November 2, 2020 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants