
Set master scores also on the unavailable members #56

Closed
wants to merge 2 commits

Conversation

@YanChii (Contributor) commented Dec 23, 2016

Hi.
When the master crashes and is kicked out of the cluster, its master score remains set to 1001. This can be dangerous if you shut down the whole cluster and then bring the old master back online first. It is most dangerous with a 2-node cluster.
My proposal is to also set the master score to -1000 on unavailable cluster members, because they obviously cannot become masters until they are resynced.

Simply said, I've replaced crm_node --partition with crm_node --nodes here: https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L548
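
For illustration only, the proposed change boils down to something like the following inside pgsqlms (the variable name is illustrative, not the exact code at that line; note that, as pointed out below, the output of crm_node --nodes is not identical across HA stacks):

    # Current behaviour: only the members of the local partition are listed,
    # so a crashed master keeps its stale master score of 1001.
    #my @nodes = split ' ', qx{ crm_node --partition };

    # Proposed: list every configured cluster member so the master score of
    # unavailable nodes can be reset to -1000 as well.
    my @nodes = split ' ', qx{ crm_node --nodes };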

This was also mentioned in #26.

Jan

@ioguix (Member) commented Dec 24, 2016

Hi @YanChii,

This has already been raised and discussed by @blogh in a previous pull request.

Unfortunately, we cannot use crm_node --list because its output differs depending on the HA stack used. I asked the Pacemaker devs about this last month; there's no obvious solution yet. See: https://www.mail-archive.com/[email protected]/msg00341.html

I was contemplating the idea of using the notify variables to record the cluster nodes in a cluster attribute... I first have to dig a bit into the Pacemaker code to see how it builds the node lists given in the notify vars... no opinion yet on whether it's a good idea or not.

@YanChii (Contributor, Author) commented Dec 25, 2016

Hi @ioguix,

thank you for your response. I see the problem now. After reading the thread with your question, I'm starting to think that a list of nodes as a config parameter would be the best option.

It would also solve another issue that I'm encountering: observer nodes in the cluster. I'm building 3-node PAF clusters with the 3rd node as an observer. The observer is a light node without any datadir (and ideally without any postgres install) with only one purpose: to be a quorum tie breaker. The current PAF implementation expects every node to be a full database member, so it fails probes on this node and also spams the logs with the warning that the node is not connected. So far I've patched it directly in the RA code (to ignore this node).

I was considering implementing an observer_nodes parameter. But a database_nodes parameter (or something like that) would solve both problems at once.

What do you think about that?

Jan

@YanChii (Contributor, Author) commented Dec 25, 2016

Another source of the node list (for both problems) could be the list of (positive) location constraints. But getting this information can be just as problematic as the crm_node --list call. The first problem is how to get the name of the HA (master-slave) resource, as it is not part of the RA environment AFAIK.

@ioguix (Member) commented Dec 25, 2016

Hi @YanChii,

> It will solve also another issue that I'm encountering - the observer nodes in the cluster. I'm building 3-node PAF clusters with the 3rd node as an observer. Observer is a light node without any datadir (and ideally without any postgres install) with only one purpose - to be a quorum tie breaker. Current PAF implementation expects every node to be a full database member and therefore it fails probes on this node and also spams logs with the warning that the node is not connected. So far I've patched it directly in the RA code (to ignore this node).

There are already two ways to solve this issue: ban the resource from the observer node with a location constraint, or simply do not run Pacemaker on it.

> I was considering the implementation of observer_nodes parameter. But the parameter of database_nodes (or something like that) will solve both problems at once.

As described above, observer_node is useless for your second concern. However, a node_list parameter is definitely the easiest solution for your first issue. But I still hope we can avoid adding a new parameter to solve this. I remember @blogh and I were considering reading the CIB itself to discover all the nodes, as an alternative solution.

So we end up with three different solutions to explore:

  • new parameter node_list
  • discover all the nodes through the CIB
  • discover all the nodes thanks to notify vars (OCF_RESKEY_CRM_meta_notify_active_uname and OCF_RESKEY_CRM_meta_notify_inactive_uname) - see the sketch below
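
For reference, a minimal sketch of what reading those notify vars could look like from within the RA's notify action (the handling is illustrative, not actual PAF code):

    # Sketch only: Pacemaker exports these lists during notify actions.
    # Active + inactive members should cover every configured clone node,
    # regardless of which partition each one currently belongs to.
    my @active    = split ' ', ( $ENV{OCF_RESKEY_CRM_meta_notify_active_uname}   // '' );
    my @inactive  = split ' ', ( $ENV{OCF_RESKEY_CRM_meta_notify_inactive_uname} // '' );
    my @all_nodes = ( @active, @inactive );
    # @all_nodes could then be recorded in a cluster attribute for later
    # monitor actions to compare against the current partition.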

@YanChii (Contributor, Author) commented Jan 5, 2017

Hi @ioguix,

thank you for the explanation. It is true that banning the resource from some node or disabling Pacemaker will prevent the cluster from running probes there. But it will not prevent pgsqlms from flooding the system logs with
WARNING: _check_locations: "node3" is not connected to the primary, set score to -1000.
This is because the crm_node --partition command gives you the node list regardless of constraints.

Also, if you want to run two independent Postgres HA resources using one Pacemaker cluster (e.g. 3 Pacemaker nodes with one PG resource on nodes 1+3 and a second PG resource on nodes 2+3), you will get the above "not connected" message even if everything is working correctly.

To be honest, I really hate the idea of a node_list parameter. Unfortunately, it makes sense, because the alternatives are quite tricky and don't solve all the problems. It could be implemented as an optional parameter for more complex setups: you specify node_list, or leave it to autodetection (some sort of crm_node or notify list).

What is your opinion on these things?

Jan

@blogh (Collaborator) commented Jan 5, 2017

Hi @YanChii and @ioguix ,

I didn't like the idea of a node_list parameter either, but it's the simplest and safest way to implement this right now. And it's not like we will permanently add or remove a node every day (at least not in my use case).

I don't like the idea of an optional parameter, because the safety of the data shouldn't be optional.

The WARNING flood can be controlled by moving the log message into the node_score test at the end of _check_locations, like so (at least that's what I did when I tested the node_list solution):

        $node_score = _get_master_score( $node );
        if ( $node_score ne '-1000' ) {
            ocf_log( 'warning', sprintf(
                '_check_locations: "%s" is not connected to the primary, set score to -1000',
                $node ) );
            _set_master_score( '-1000', $node );
        }

I am not sure I follow your example of two clusters on one Pacemaker (I've never tried it, but will ASAP). I think that since the configuration would be per database, we would have one node_list per database, no?

Benoit

EDIT: changed "per cluster" to "per database"; my sentence didn't make sense.

@YanChii (Contributor, Author) commented Jan 5, 2017

Hi @blogh,

the two-DB cluster setup is used in some cases because you do not need two standby nodes (node3 is the standby for both DBs). To explain more, the setup is like this:
node1(DB1 master)-------node3(DB1 slave, DB2 slave)-------node2(DB2 master)

The node_list parameter (like every parameter) is set per HA resource:
DB1 resource: node_list=node1,node3 port=5432 dbdir=...
DB2 resource: node_list=node2,node3 port=5433 dbdir=...
This way, each resource has a consistent view of all its database nodes.

If node_list is implemented as mandatory, there will be no need to alter the WARNING.

Jan

@blogh (Collaborator) commented Jan 5, 2017

OK, I understand your points, we are on the same page. My bad.

Benoit

@ioguix (Member) commented Jan 16, 2017

Hi @YanChii and @blogh,

Trying to sum up this discussion: it seems node_list is actually the best way to go, at least to be able to define the expected standbys to check during the _check_locations and _master_score_exists calls.

Specifically, auto-detecting nodes using the notify vars is not enough to help PAF define which nodes host the PostgreSQL instances taking part in the HA cluster. Defining which PostgreSQL instances take part in the cluster is needed for the scenario described by @YanChii above.

The only idea that comes to mind to answer this problem would be to compare the database system identifier, which is supposed to be common to all instances of a cluster. E.g.:

$ pg_controldata|grep "system id"
Database system identifier: 
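
A minimal sketch of how the RA could fetch this value for comparison (the helper name is hypothetical and pg_controldata is assumed to be in PATH):

    # Hypothetical helper: return the database system identifier of the local
    # instance so it can be compared across nodes.
    sub _get_system_identifier {
        my ( $pgdata ) = @_;
        foreach my $line ( qx{ pg_controldata "$pgdata" } ) {
            return $1 if $line =~ /^Database system identifier:\s+(\S+)/;
        }
        return undef;
    }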

So to compare both solutions:

  • node_list or standby_names:
    • pro: easy to implement, small code footprint
    • cons: adds one more parameter to deal with
  • auto-detect nodes and standbys using notify vars and the db system id:
    • pro: fully transparent, auto-detects new nodes
    • cons: needs some more study, larger code footprint, harder to debug, and how do we detect an administrative standby removal?

I still feel the second solution is more elegant, but it requires much more dev effort. I would be OK to go with the first solution, but it's still not clear to me whether it should be mandatory or not.

Thoughts?

@blogh (Collaborator) commented Jan 16, 2017

Hi guys !

Regarding solution (1): I feel it should be mandatory; the score is not reliable as is (in this corner case), and we have to fix it one way or another.

Regarding solution (2): less configuration is always good :)
How do you see it implemented? Something like this for each monitor (a rough sketch follows the list):

  • The primary adds its own node name if it's not already in the list for the current database system id
  • For each standby connected to the primary, we check whether its node name exists in the node list for the current database system id
  • For each node of the list for the current database system id with no standby connection, we set the score to -1000
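
A very rough sketch of that monitor-side logic, with hypothetical helpers (_get_registered_nodes and _register_node do not exist in PAF; only _set_master_score does):

    # Hypothetical sketch run from monitor on the primary. $sysid is the
    # database system identifier, $primary this node's name, @connected the
    # application_name of each standby found in pg_stat_replication.
    sub _update_node_registry {
        my ( $sysid, $primary, @connected ) = @_;

        my %known = map { $_ => 1 } _get_registered_nodes( $sysid );

        # 1. the primary registers itself if missing
        _register_node( $sysid, $primary ) unless $known{$primary};

        # 2. every connected standby is registered if missing
        _register_node( $sysid, $_ ) for grep { not $known{$_} } @connected;

        # 3. registered nodes with no standby connection get score -1000
        my %streaming = map { $_ => 1 } @connected;
        foreach my $node ( keys %known ) {
            next if $node eq $primary or $streaming{$node};
            _set_master_score( '-1000', $node );
        }
    }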

If we remove a node from the cluster, do we leave it to the user to delete the list (and wait for one monitor to re-detect all the nodes)? It feels hacky :/

Benoit

[EDIT: moderated my not reliable comment]

@blogh (Collaborator) commented Jan 16, 2017

Hmm, it doesn't work if we have connections in pg_stat_replication for reasons other than standby replication (pg_receivexlog, etc.).

We could do the registration:

  • on resource start (but I feel that's not enough)
  • on each monitor, where each node checks whether it is in the list and adds itself if needed
  • for each standby connected to the primary database whose node name is not in the list and whose application_name is a valid node name

It's late... beer time.

Benoit

@ioguix (Member) commented Jan 17, 2017

Hi @blogh and @YanChii ,

The main problem with solution (2) is the administrative node removal. If it requires some admin command to inform the cluster, it sounds easier to edit node_list directly. So I think we should reject solution (2), at least for now.

Now, going back to the original subject of this issue: the very first problem is the master score not being removed (or set to -1000) when a master (or standby) crashes. To be exhaustive, there is another solution that I believe should be rejected: solution (3), which @blogh and I discussed last night. It appeared to me that this master score problem could probably be handled during the post-stop notification on one of the remaining nodes. This sounds quite appealing, but in fact it would break the current Pacemaker transition because of the attribute update, and we don't want that. Postponing this master score update to the next monitor is trickier than node_list and achieves the same result.

So I guess we have a consensus here: we don't like the node_list parameter, but it is the simplest solution discussed so far and probably the cleanest. For reference, here is @blogh's patch implementing a mandatory node_list parameter.

I will work with @blogh's patch to move forward with this.

Regards,

@blogh (Collaborator) commented Jan 18, 2017

Hi @ioguix and @YanChii ,

First of all: I messed up my repo while trying to do a merge (I am very bad at git ;/), so I did a reset and had to use a file I had saved to redo the branch (hence the date of the commits).

I noticed two problems already:

  • If we remove the master from the list, the resource stays on the node (we already talked about this).
  • If a node is on the list and Pacemaker is stopped on that node: if you start Postgres manually, the score moves to 1000, and if you stop it, it goes back to -1000. (I noticed this this morning while doing some failover tests.)

Benoit.

@ioguix (Member) commented Jan 19, 2017

I'm closing this PR as this solution will not be merged, but we can keep discussing here if needed.

@ioguix closed this Jan 19, 2017
@ioguix (Member) commented Jan 19, 2017

I was playing with @blogh's patch, doing some tests. I was able to defeat the whole purpose of this new parameter during the second test: when a node is removed from node_list, it is ignored during the monitor action of the current master and its master score is not changed.

We would have to add some more code complexity to track which node was removed from node_list, etc.

OK, this is a corner case during an administrative task, and we could document it. But in fact, I feel like node_list doesn't solve the issue by itself anyway. You can have a "large" race-condition window before the next monitor action (which is supposed to set the master score to -1000) where anything can happen. This solution "might" save you from the scenario described in this PR or the one in #26... but it's not a silver bullet.

The only clean and safe solution would be to set the master score right when the resource is gone, during the pre- or post-stop notification. But as explained, that would break the Pacemaker transition, preventing us from detecting a recovery or move transition.

Solutions to your problems exist:

  • use quorum
  • use wait_for_all in corosync.conf (a minimal example follows this list)
  • document, document, document: you should never start Pacemaker on a node that had a failure before checking everything on it, including its master score. This is a human error, and I feel like we shouldn't bloat the code to guard against the countless situations where a human can fail.
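
For reference, a minimal corosync.conf quorum section sketch for a two-node setup (values are illustrative; adjust to your cluster):

    quorum {
        provider: corosync_votequorum
        # two_node tuning for 2-node clusters (it enables wait_for_all by default)
        two_node: 1
        # refuse to form the cluster after a full stop until all nodes have
        # been seen, so a stale old master cannot come up alone
        wait_for_all: 1
    }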

The last simple solution that struck me today was to use a private node attribute (outside of the CIB) to set the master score. I will study this tomorrow.

Cheers,

@ioguix (Member) commented Jan 19, 2017

...and we cannot set a master score as a private attribute.
