On a fresh Riak 2.2 cluster - AAE won't work for most of the first day #1656
Comments
This branch has the changes I've been testing: https://github.com/martinsumner/riak_kv/tree/mas-2.1.7-aaeupdate. It is a branch of `mas-217-baseline`, which is taken from the 2.1.7 tag of Basho riak_kv.
This never got resolved. However, I think the question of why we were sometimes getting legacy rather than v0 hashes was misunderstood. A customer reported this issue: they were regularly doing mass restarts of nodes, and finding that those mass restarts were triggering AAE tree rebuilds that appeared to be related to `object_hash_version` changes. Excerpt from discussion of the problem:
The workaround proposed was to add these settings to advanced.config:
I’ve recently been doing some testing of Riak 2.2.3 with active anti-entropy, so I'm documenting the issue I've found here in case others stumble into the same situation.
The background to this is in: #1473
If a new cluster is started (for example, assume a 5-node cluster with a ring-size of 64), with all nodes running Riak 2.2.3 and active anti-entropy, all nodes will initially use legacy AAE. I'd originally assumed that a fresh 2.2.3 cluster would start out on the new AAE hash algorithm, but on reflection the implementation makes sense as-is: at the point a node starts (and may start receiving traffic) it cannot know whether all the nodes it will eventually be joined with are running non-legacy AAE. By defaulting to legacy AAE, only the forward transition needs to be considered.
However, this means that even a new cluster will have to go through an AAE hashtree upgrade.
Once a node has stable membership of a cluster, and the entropy manager can safely ascertain that all cluster nodes support new-wave AAE, a switch is flipped to mark the AAE trees on this node as pending an upgrade to v0 AAE object hashes (i.e. hashes of the vector clock, not of the whole non-canonicalised object). This is all fine.
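For anyone unfamiliar with the distinction, here is a minimal sketch of the difference between the legacy hash and the v0 hash. This is illustration only, not the riak_object code; the module and function names below are invented.

```erlang
-module(aae_hash_sketch).
-export([legacy_hash/1, v0_hash/1]).

%% Illustration only: not the riak_object implementation, just the idea.
%% Legacy AAE hashed a term derived from the whole (non-canonicalised)
%% object, so differences in how the object happens to be encoded can
%% change the hash even when the data is logically the same.
legacy_hash(RObjBinary) when is_binary(RObjBinary) ->
    erlang:phash2(RObjBinary).

%% The v0 hash is instead taken over the (sorted) vector clock, so two
%% causally identical copies of an object hash the same regardless of
%% how the object body is encoded on disk.
v0_hash(VClock) when is_list(VClock) ->
    erlang:phash2(lists:sort(VClock)).
```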
The trees will all now require an upgrade. At this stage, if this is a cleanly started cluster, the `riak_kv_entropy_manager` will have a `tree_queue` (a queue of trees to be poked) containing all the hash trees. Next tick, up to ten of them will be poked, and they will all attempt to upgrade. However, there is concurrency management in the entropy manager which by default allocates a single token for every `token_period` (default 1 hour), so only one of those pokes will result in an upgrade that hour.

The upgrade is not really an upgrade: it just clears out the old tree and starts a new tree using v0 hashes, accepting new inserts but with a `built` state of `false`. So one hashtree will be upgraded, and the other 11 partitions (assuming there are 12 primary vnodes active on the node) will requeue a poke for themselves on the `tree_queue`. Nothing can happen with regard to upgrading the other hash trees until the `token_period` elapses. The upgraded hashtree is now rendered unusable: its built status is false, so it cannot be used, as it does not have the full history of changes for that partition. Any exchange depending on that partition will no longer be started, and there will be no AAE for those partition pairs.
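For reference, the concurrency management described above is driven (as far as I can tell) by a couple of riak_kv application variables. A sketch of what the defaults amount to in advanced.config terms, assuming the env names `anti_entropy_build_limit` and `anti_entropy_tick`, and that the build limit is what backs the token described above:

```erlang
%% Sketch of the assumed defaults, expressed as advanced.config terms.
[{riak_kv, [
    %% {Tokens, PeriodMillis}: one build/upgrade token per hour - the
    %% "token_period" referred to in this issue.
    {anti_entropy_build_limit, {1, 3600000}},
    %% Entropy manager tick, after which queued trees are poked
    %% (roughly 15 seconds by default, if I recall correctly).
    {anti_entropy_tick, 15000}
]}].
```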
When the `token_period` elapses, the 11 partitions will still be on the `tree_queue` (as they always requeue themselves when they can't get a token for the upgrade), and another hash tree will win the next race and upgrade (i.e. change to the new version and revert to `built=false`). The other ten trees will be re-poked back onto the `tree_queue` to wait again for the (default 1h) `token_period` to elapse.

The issue at this stage is that each node in the cluster will have two trees which are not built, and there is now a combinatorial effect whereby the number of exchanges which can actually complete begins to collapse. Not only can these trees not exchange, any other tree which wishes to exchange with them cannot exchange either. So the scope of AAE begins to shrink dramatically. What has been observed is that by the third `token_period` (e.g. after 3 hours of running), no exchanges appear to be taking place any more (i.e. `riak-admin aae-status` shows ever-increasing values in the "Last (ago)" column).
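As a very crude way to see the shape of the collapse - a back-of-the-envelope model only, assuming unbuilt trees are spread evenly and that an exchange is blocked only when either of its two trees is unbuilt; in practice the observed collapse is faster than this suggests:

```erlang
-module(aae_collapse_sketch).
-export([runnable_fraction/2]).

%% With Unbuilt of the Total trees in the cluster not built, and an
%% exchange requiring both of its trees to be built, the fraction of
%% exchanges that can still run is roughly the probability that two
%% independently chosen trees are both built.
runnable_fraction(Unbuilt, Total) ->
    BuiltFraction = (Total - Unbuilt) / Total,
    BuiltFraction * BuiltFraction.

%% e.g. after two token_periods on a 5-node, ring-size 64 cluster
%% (roughly 2 unbuilt of ~13 trees per node):
%%   aae_collapse_sketch:runnable_fraction(2, 13). %% ~0.72
```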
The issue now is that at each `token_period` another AAE tree is cleared for upgrade, and it is only once all the trees have been cleared for upgrade that the `tree_queue` empties, is refreshed with all trees again, and the trees can begin to request a token in order to build.

Once each node has a small number of trees which have been rebuilt, some exchanges will restart. However, if there are P primary partitions per node, AAE will not work for a period of about P * `token_period`, and will not be fully working for a period of about 2 * P * `token_period`.

With the default settings, this means that if you start a new cluster with AAE enabled, AAE will not be operational for most of the first day. That would perhaps be unexpected, and if nothing else it causes difficulty when trying to run tests.
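To put numbers on that for the example cluster above (this is just arithmetic from the figures already quoted: ring-size 64 across 5 nodes gives roughly 12-13 primaries per node, and the default token_period is 1 hour):

```erlang
%% Back-of-the-envelope timings in an Erlang shell, using the figures
%% from this issue. The names are just local shell bindings.
RingSize = 64,
Nodes = 5,
TokenPeriodHours = 1,
P = RingSize div Nodes + 1,                %% ~13 primary partitions per node
HoursUntilPartial = P * TokenPeriodHours,  %% ~13 hours with AAE largely inoperative
HoursUntilFull = 2 * P * TokenPeriodHours. %% ~26 hours until AAE is fully operational
%% i.e. with defaults, most of the first day without working AAE.
```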
This is made worse by the fact that it is common practice to extend the `token_period` due to the cost of AAE tree rebuilds (e.g. to 1d). In this case AAE becomes unusable for days.
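As a concrete (hypothetical) illustration of that configuration, again assuming `anti_entropy_build_limit` is the right knob for the token period:

```erlang
%% One token per day instead of one per hour: with ~13 primaries per
%% node this pushes the P * token_period wait out to roughly 13 days.
[{riak_kv, [
    {anti_entropy_build_limit, {1, 86400000}}   %% {1 token, 24h in ms}
]}].
```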
There are some other nightmare scenarios. For example, if one were to reduce the `tree.expiry` time for trees, you could actually create a situation where AAE will never work, as trees will keep expiring and rebuilding in their unusable state (e.g. with conflicting versions), because rebuild decisions are made before upgrade decisions. In mitigation, reducing the `tree.expiry` time to less than P * `token_period` is almost certainly never a sensible configuration anyway.
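For example (hypothetical numbers, and assuming `anti_entropy_expire` is the app env behind the `tree.expiry` setting), the kind of configuration that could trap trees in this expire/rebuild loop:

```erlang
%% A pathological combination: trees expire every 4 hours, but with
%% ~13 primaries per node and one upgrade token per hour it takes
%% ~13 hours to work through the upgrades, so trees can keep expiring
%% and rebuilding before the cluster ever settles.
[{riak_kv, [
    {anti_entropy_build_limit, {1, 3600000}},   %% 1 token per hour
    {anti_entropy_expire, 14400000}             %% 4h tree expiry (< P * token_period)
]}].
```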
I have a branch I'm testing which tries to improve this situation. The key changes are:

- As the upgrade doesn't do any significant work (it merely clears the old tree), the entropy_manager no longer requires a token for upgrade work. To make sure that this small amount of work isn't an issue, the number of pokes per tick is also reduced from 10 to 2. This means that the wait of P * `token_period` for all upgradable trees to be cleared/upgraded is reduced to P/2 * `tick_period` (which will normally be 1-2 minutes). There is still a wait of P * `token_period` for AAE to become fully operational, but it will be partially operational once the initial builds are completed.
- The check to see if an upgrade is necessary, made by `riak_kv_index_hashtree` following a poke, is moved to happen before the check to see if a rebuild is required. This stops the situation, seen in tests, where a tree spends a token to rebuild using a legacy hash, only to use a token in the next period to wipe out what it has just built in order to be ready for the upgrade (a rough sketch of this reordering is below).
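A minimal sketch of the second change. This is not the branch's actual code; the module, function, and field names are invented for illustration, and the real logic in `riak_kv_index_hashtree` is more involved.

```erlang
-module(poke_order_sketch).
-export([handle_poke/1]).

%% Hypothetical illustration only: on a poke, consider the cheap upgrade
%% first, and only then consider an expensive rebuild, so a token is
%% never spent rebuilding a tree that is about to be cleared for an
%% upgrade anyway.
handle_poke(#{upgrade_pending := true} = State) ->
    %% Clearing the legacy tree needs no token: restart it empty, using
    %% v0 hashes, with built = false.
    State#{upgrade_pending => false, version => v0, built => false};
handle_poke(#{rebuild_due := true} = State) ->
    %% The rebuild path would request a token from the entropy manager
    %% here; modelled as simply marking the tree built.
    State#{rebuild_due => false, built => true};
handle_poke(State) ->
    State.
```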
IIRC, testing the `develop` branch doesn't show problems to this extent, perhaps because concurrency management is better in `riak_kv_sweeper`. However, `riak_kv_sweeper` has an uncertain status with respect to production-readiness.