
On a fresh Riak 2.2 cluster - AAE won't work for most of the first day #1656

martinsumner opened this issue Aug 9, 2017 · 2 comments

martinsumner commented Aug 9, 2017

I've recently been doing some testing of Riak 2.2.3 with active anti-entropy (AAE), so I'm documenting the issue I've found here in case others stumble into the same situation.

The background to this is in: #1473

If a new cluster is started (for example, assume a 5-node cluster with a ring size of 64), with all nodes running Riak 2.2.3 and active anti-entropy, all nodes will initially start using legacy AAE. I'd originally assumed that a fresh 2.2.3 cluster would start using the new AAE hash algorithm, but on reflection the implementation makes sense as-is: at the point a node starts (and may start receiving traffic) it cannot know whether all of the nodes it is to be joined with will be running non-legacy AAE. By defaulting to legacy AAE, only the forward transition needs to be considered.

However, this means that even a new cluster will have to go through an AAE hashtree upgrade.

Once a node has stable membership of a cluster, and the entropy manager can safely ascertain that all cluster nodes support new-wave AAE, a switch is flipped to mark the AAE trees on that node as pending an upgrade to v0 AAE object hashes (i.e. hashes of the vector clock, not of the whole non-canonicalised object). This is all fine.

The trees will all now require an upgrade. At this stage, if this is a cleanly started cluster, the riak_kv_entropy_manager will have a tree_queue (a queue of trees to be poked) containing all of the hash trees. On the next tick, up to ten of them will be poked, and they will all attempt to upgrade. However, there is concurrency management in the entropy_manager which by default allocates a single token for every token_period (default 1 hour), so only one of those pokes will result in an upgrade that hour.
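
To make the throttling concrete, the behaviour described above is roughly equivalent to the sketch below (simplified, with hypothetical names - this is not the actual riak_kv_entropy_manager code):

-module(aae_throttle_sketch).
-export([tick/1]).

%% Hypothetical sketch of the tick/poke/token interaction described above.
%% Each tick, up to 10 queued trees are poked, but only pokes that win a token
%% (by default 1 token per 1 hour token_period) get to do their work; the rest
%% simply requeue themselves.
-record(state, {tree_queue = [], tokens = 1}).

tick(#state{tree_queue = Queue, tokens = Tokens} = State) ->
    Poked = lists:sublist(Queue, 10),
    Rest = lists:nthtail(length(Poked), Queue),
    {Requeued, TokensLeft} = lists:foldl(fun poke/2, {[], Tokens}, Poked),
    State#state{tree_queue = Rest ++ lists:reverse(Requeued), tokens = TokensLeft}.

poke(Tree, {Requeue, 0}) ->
    {[Tree | Requeue], 0};       %% no token left in this token_period: requeue the poke
poke(_Tree, {Requeue, N}) ->
    {Requeue, N - 1}.            %% token won: this tree does its "upgrade" now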

The upgrade is not really an upgrade: it just clears out the old tree and starts a new tree using v0 hashes, accepting new inserts but with a built state of false. So one hashtree will be upgraded, and the other 11 partitions (assuming there are 12 primary vnodes active on the node) will requeue a poke for themselves on the tree_queue. Nothing can happen with regard to upgrading the other hash trees until the token_period elapses. The upgraded hashtree is now rendered unusable due to its built status being false - it cannot be used, as it does not have the full history of changes for that partition. So no exchange depending on that partition will be started, and there will be no AAE for those partition pairs.

When the token_period elapses, the 11 partitions will still be on the tree_queue (as they always requeue themselves when they can't get a token for the upgrade), and another hash tree will win the next race and upgrade (i.e. change to the new version and revert to built=false). The other ten trees will be re-poked back onto the tree_queue to wait again for the (default 1h) token_period to elapse.

The issue at this stage is that each node in the cluster will have two trees which are not built, and there is now a combinatorial effect whereby the number of exchanges which can actually complete begins to collapse. Not only can these trees not exchange, any other tree which wishes to exchange with them cannot exchange either. So the scope of AAE begins to shrink dramatically. What has been observed is that by the third token_period (e.g. after 3 hours of running), no exchanges appear to be taking place any more (i.e. riak-admin aae-status shows ever-increasing values in the “Last (ago)” column).
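
To illustrate the gating behind this collapse, here is a hypothetical sketch (not the real riak_kv exchange-scheduling code) of the rule that both trees in a pair must be built before an exchange between them can start:

-module(aae_exchange_sketch).
-export([can_exchange/2, viable_exchanges/1]).

%% Both trees in a pair must report built = true before an exchange between
%% them can be started, so each unbuilt tree removes every pairing it
%% appears in.
can_exchange(#{built := BuiltA}, #{built := BuiltB}) ->
    BuiltA andalso BuiltB.

%% Given a list of candidate {TreeA, TreeB} exchange pairs, return only those
%% that could actually run.
viable_exchanges(Pairs) ->
    [Pair || {A, B} = Pair <- Pairs, can_exchange(A, B)].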

The issue now is that at each token_period another AAE tree is cleared for upgrade. Only once all the trees have been cleared for upgrade is the tree_queue emptied and refreshed with all trees again, at which point the trees can begin to request a token in order to build.

Once each node has a small number of trees which have been rebuilt, some exchanges will restart. However, if there are P primary partitions per node, AAE will not work for a period of about P * token_period, and will not be fully working for a period of 2 * P * token_period.
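
Putting numbers on this for the example cluster above (5 nodes, ring size 64, so roughly 12-13 primary partitions per node) with the default 1 hour token_period:

  P * token_period     ≈ 12 * 1h ≈ 12 hours before the last tree has been cleared for upgrade
  2 * P * token_period ≈ 24 hours before every tree has also been rebuilt and AAE is fully working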

With the default settings, this means that if you start a new cluster with AAE enabled, AAE will not be operational for most of the first day. This would perhaps be unexpected, and if nothing else it causes difficulty when trying to run tests.

This is made worse by the fact that it is common practice to extend the token_period (e.g. to 1d) due to issues with the cost of AAE tree rebuilds. In that case AAE becomes unusable for days.

There are some other nightmare scenarios. For example, if one were to reduce the tree.expiry time for trees, one could actually cause a situation where AAE will never work, as trees will keep expiring and rebuilding in their unusable state (e.g. with conflicting versions), because rebuild decisions are made before upgrade decisions. In mitigation, reducing the tree.expiry time to less than P * token_period is almost certainly never a sensible configuration anyway.

I have a branch I'm testing which tries to improve this situation. The key changes are:

  • As the upgrade doesn't do any significant work (it merely clears the old tree), the entropy_manager no longer requires a token for upgrade work. To make sure that this small amount of work isn't an issue, the number of pokes per tick is also reduced from 10 to 2. This means that the wait of P * token_period for all upgradable trees to be cleared/upgraded is reduced to P/2 * tick_period (which will normally be 1-2 minutes). There is still a wait of P * token_period for AAE to become fully operational, but it will be partially operational once the initial builds are completed.

  • The check to see if an upgrade is necessary, made by riak_kv_index_hashtree following a poke, is moved to happen before the check to see if a rebuild is required. This stops the situation seen in tests where a tree spends a token rebuilding using the legacy hash, only to use a token in the next period to wipe out what it has just built in order to be ready for the upgrade. A rough sketch of the resulting poke handling follows this list.
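
A rough sketch of the intended poke handling after these two changes (hypothetical names, with a map standing in for the tree state - this is not the actual riak_kv_index_hashtree code):

-module(aae_poke_sketch).
-export([handle_poke/1]).

%% The upgrade check now comes before the rebuild check, and only the rebuild
%% path is throttled by a token.
handle_poke(#{version := legacy} = Tree) ->
    %% Cheap: just clear the legacy tree and mark it as v0/unbuilt; no token needed.
    Tree#{version => 0, built => false};
handle_poke(#{built := false} = Tree) ->
    %% Expensive: a real rebuild, still gated on acquiring a token per token_period.
    request_token_and_rebuild(Tree);
handle_poke(Tree) ->
    Tree.

request_token_and_rebuild(Tree) ->
    %% Placeholder for the token-gated rebuild path.
    Tree#{built => true}.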

IIRC, testing on the develop branch doesn't show problems to this extent, perhaps because concurrency management is better in riak_kv_sweeper. However, riak_kv_sweeper has an uncertain status with respect to production-readiness.

@martinsumner
Contributor Author

This branch has the changes I've been testing:

https://github.com/martinsumner/riak_kv/tree/mas-2.1.7-aaeupdate

It is a branch of mas-217-baseline, which is taken from the 2.1.7 tag of Basho riak_kv.

@martinsumner
Contributor Author

This never got resolved. However, I think the question of why we were sometimes getting legacy rather than v0 hashes was misunderstood.

A customer reported this issue. They were regularly doing mass restarts of nodes, and finding that those mass restarts were triggering AAE tree rebuilds that appeared to be related to object_hash_version changes. Excerpt from the discussion of the problem:

A little bit about how capabilities work:

When a node starts, it runs through a sequence of process starts. Some of those processes will register a capability with riak_core_capability. Every time a capability is registered, riak_core_capability renegotiates with the rest of the cluster to determine that node's perspective of the overall cluster capability (note that the renegotiation is first done by reading the latest riak_core_ring metadata, where capabilities are initially registered, which prompts a re-gossip of the ring). Later, another process may issue a `get` to riak_core_capability to view what the cluster capability is, and it will be returned either the last negotiated capability, or the default value if no such capability has been negotiated.
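
For reference, the register/get pair looks roughly like this (riak_core_capability:register/3 and riak_core_capability:get/2 are the real API calls, but the exact arguments are from memory and may differ slightly from the riak_kv source):

%% In riak_kv_entropy_manager initialisation, something like:
riak_core_capability:register({riak_kv, object_hash_version}, [0, legacy], legacy).

%% Later, when a riak_kv_index_hashtree needs to know which hash version to use:
riak_core_capability:get({riak_kv, object_hash_version}, legacy).
%% Returns the negotiated value (0) once negotiation has happened, otherwise
%% the supplied default (legacy).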

The problem we appear to have is that when we do the mass restart, the kv_index_hashtree processes on some nodes are getting the default capability (legacy), not the expected negotiated capability (v0).

There are three ways in which this might occur (that I'm aware of):

1. The request to get the capability arrived at riak_core_capability before the request to register the capability. The riak_core_capability process will not negotiate a capability that does not exist locally (even if all the other nodes have advertised it).

2. One node is not registering support for 0, only legacy (which is what might happen during a database software migration).

3. At the last point of renegotiation, another node in the cluster had registered some capabilities, but not this capability. If the node was down, or had not registered any capabilities, it is ignored. However, if it is up and has registered some capabilities (but not the {riak_kv, object_hash_version} capability), then during the negotiation there will be an RPC call to that node to ask directly what its {riak_kv, object_hash_version} environment variable is. If the RPC fails, the node will be ignored, but if it succeeds and doesn't return v0, then legacy will be negotiated.

Scenario 1 I can recreate in an artificial way. On my test nodes there is a 600 ms window between the node starting and the node registering the {riak_kv, object_hash_version} capability, during which it will return the default (legacy) to a local capability get request. However, I'm pretty sure that the ordering of the startup processes will prevent this from happening - the kv_index_hashtree processes which do the get cannot be started until the riak_kv_entropy_manager (which does the register in its initialisation function) has completed startup. This is good news if true, as I don't think there is any way of fixing this through configuration if it is occurring (i.e. the proposed changes won't make any difference in this scenario).

I don't believe scenario 2 is happening: everything is at least 2.2.5 and so should never report legacy.

I think scenario 3 might be the problem. If node X starts whilst any other node is in the 600 ms window between startup and registering the {riak_kv, object_hash_version} capability, then the fallback RPC call will be made. As the node is up, the RPC call will succeed - but nothing in startup actually sets {riak_kv, object_hash_version} as an environment variable, so the RPC call will result in the default (legacy) answer.
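
Illustratively, the scenario 3 fallback amounts to something like this (not the exact riak_core_capability code; Node is the peer being queried):

%% Nothing in a stock startup sets the riak_kv application environment
%% variable, so the usual answer here is undefined - and legacy gets negotiated.
case rpc:call(Node, application, get_env, [riak_kv, object_hash_version]) of
    {ok, 0}     -> supports_v0;        %% only if the env var has been set explicitly
    {badrpc, _} -> ignore_node;        %% RPC failed: the node is left out of the negotiation
    _Undefined  -> negotiate_legacy    %% undefined: the node is treated as legacy-only
end.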

The workaround proposed was to add these settings to advanced.config (the intent being that the override forces the capability to 0 locally, and setting the application variable means the scenario 3 RPC fallback also sees 0 rather than defaulting to legacy):

[
  {riak_core,
    [
      ...
    ]
  },

  {riak_repl,
    [
      ....
    ]
  },

  {riak_kv,
    [
      ...
      {override_capability, [{object_hash_version, [{use, 0}]}]},
      {object_hash_version, 0}
    ]
  }
].
