Do we need consistent reads of etcd? #8295
I believe allocations are the only resource that requires two resources to achieve its guarantee. If sharded, the allocation maps have to be per shard anyway. Trying to think of what other resources fall under this constraint.
Allocation is safe because set addition is a CRDT-safe operation. Removal is not safe, and so requires a linearized read-read-write in order to make forward progress.
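A toy sketch of that distinction (not the allocator code; the maps here are made up): concurrent set additions commute and converge, while a removal decided from a stale view can free an address that a newer writer still holds, which is why removal needs the linearized read-read-write.

```go
package main

import "fmt"

func main() {
	// Two writers add different IPs; the union is the same in either
	// order, and re-adding is idempotent, so adds converge (CRDT-style).
	a := map[string]bool{"1.1.1.1": true}
	b := map[string]bool{"2.2.2.2": true}
	merged := map[string]bool{}
	for ip := range a {
		merged[ip] = true
	}
	for ip := range b {
		merged[ip] = true
	}
	fmt.Println(len(merged)) // 2, regardless of interleaving

	// A removal based on a stale view would free 2.2.2.2 even though a
	// newer writer allocated it, hence the read-read-write must be
	// linearized (e.g. guarded by a CAS on the version that was read).
	staleView := map[string]bool{"1.1.1.1": true}
	for ip := range merged {
		if !staleView[ip] {
			fmt.Println("stale removal would free", ip) // 2.2.2.2
		}
	}
}
```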
True, IP allocations can't change - poor example I chose there :-) But:
I get really nervous when we start layering on checks in response to counter-examples; I just worry that there's always another counter-example that hasn't been suggested yet. Another problem: if we assign a portal IP, but then fail to update the service list, then the version of the portal map will be truly ahead of the highest version in the service list. I don't know how we differentiate that case from a stale read of the service list.
The allocator controller reduces the risk of that occurring because it is level-driven. In general, the easier assumption to relax is that service IPs never change, and that users are allowed to request portal IPs. Appreciate the review.
Alternatively, assuming multi-object writes in etcd3, we could combine the write and the allocation. But the timeframe on that is further out.
On Thu, May 14, 2015 at 5:50 PM, Clayton Coleman wrote:
You could change the port, though, unless we have a general rule that
Do we set ResourceVersion for a list to the highest of any member of the list?
It's the current etcd committed term so it's "now" according to the etcd server you're talking to.
Get from server or via kubectl get?
Used to be
Also, we can compare two things: the current etcd index on the read should be greater than the read index of the map, AND at least one service modified index should be greater than the map's modified index. If the latter condition is not true, we don't know whether the service write succeeded (and we didn't observe it), failed, or is delayed arbitrarily. Only in that case must we read again with a wait.
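A rough sketch of those two comparisons, assuming we have captured the server's etcd index at each read plus the ModifiedIndex values of the allocation map and of each service (the names are illustrative, not taken from repair.go):

```go
package sketch

// safeToProceed sketches the two comparisons described above. It assumes
// the caller recorded the server's etcd index when the map was read, the
// map's ModifiedIndex, and the ModifiedIndex of every service observed.
func safeToProceed(currentEtcdIndex, mapReadIndex, mapModifiedIndex uint64, serviceModifiedIndexes []uint64) bool {
	// First comparison: the current read must have been served at a
	// point after the read of the allocation map; otherwise it may have
	// come from a stale member and should be retried.
	if currentEtcdIndex <= mapReadIndex {
		return false
	}
	// Second comparison: at least one service must have been modified
	// after the map. If not, we cannot tell whether a service write
	// succeeded unobserved, failed, or is arbitrarily delayed, and we
	// must read again with a wait.
	for _, idx := range serviceModifiedIndexes {
		if idx > mapModifiedIndex {
			return true
		}
	}
	return false
}
```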
I've opened an issue with etcd to see if they can provide some input. But I think we might be OK, even without consistent reads, if we do the following procedure (which I think is what @smarterclayton suggested):
1. Read the allocation bitmap, recording its version and the etcd index at which it was read.
2. Read the service list, again recording the etcd index of the read.
3. Write the updated bitmap back with a compare-and-swap against the version read in step 1.
The write in step 3 will only succeed if the key has not changed and was up to date. (This is what we have today.) But we also know that we saw a linearizable view of the world. So we only have to deal with concurrent mutations, which the current logic deals with by always writing to the bitmap key whenever any service changes, thus invalidating the CAS. In step 2, if we do multiple reads, I do not know whether we have to restart if we see any out-of-sequence indexes, or whether we can just compare to step 1. My guess is that in general we do, but in this particular case, because we always update the bitmap key, we're OK. In other words, I think we can get linearized reads simply by observing the etcd server index and making sure it is monotonic. But this is what I have asked the etcd folks to confirm.
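For concreteness, here is a rough sketch of that three-step procedure against a hypothetical thin key-value wrapper; the interface, keys, and helper are invented for illustration and are not the repair.go code:

```go
package sketch

import "fmt"

// kv is a hypothetical wrapper: Read returns the value, the key's
// ModifiedIndex, and the server's etcd index; CAS fails unless the key
// is still at prevIndex.
type kv interface {
	Read(key string) (value string, modifiedIndex, etcdIndex uint64, err error)
	CAS(key, newValue string, prevIndex uint64) error
}

// rebuild stands in for recomputing the allocation bitmap from the
// observed services; the real logic lives in repair.go.
func rebuild(oldBitmap, services string) string { return oldBitmap }

func repairOnce(store kv) error {
	// Step 1: read the allocation bitmap, remembering both its
	// ModifiedIndex and the server index at which it was read.
	bitmap, bitmapIndex, readIndex1, err := store.Read("/registry/ranges/serviceips") // key is illustrative
	if err != nil {
		return err
	}

	// Step 2: read the service list. The observed server index must not
	// move backwards; if it does, we were answered by a lagging member
	// and should start over instead of trusting what we saw.
	services, _, readIndex2, err := store.Read("/registry/services") // key is illustrative
	if err != nil {
		return err
	}
	if readIndex2 < readIndex1 {
		return fmt.Errorf("non-monotonic etcd index (%d < %d): retry", readIndex2, readIndex1)
	}

	// Step 3: CAS the bitmap against the index read in step 1. Because
	// every service mutation also rewrites the bitmap key, any concurrent
	// change bumps its ModifiedIndex and this write fails, forcing a retry.
	return store.CAS("/registry/ranges/serviceips", rebuild(bitmap, services), bitmapIndex)
}
```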
Your description matches what I expect from etcd - we have to linearize allocation map -> service list -> CAS on allocation map. At least one service must be at or above the etcd index of the allocation map.
On Fri, May 15, 2015 at 6:01 AM, Clayton Coleman wrote:
kubectl - should there be a difference?
Seems not anymore.
It's a bug; there's special code to unify resource versions across multiple calls.
Bug. The REST API for each individual call returns it (note you have a List, not a PodList).
No, it is set to the global etcd index (which will be >= the highest member of the list). The highest member of the list might be too old to be in the watch window. As for why you don't see it in kubectl output, that sounds like a kubectl bug. It's definitely there if you curl (I just tried it). In fact, basically nothing would work if it weren't being set right.
Closing this issue because the original issue cited has been closed for a long time now.
I am looking and looking at the code for pkg/registry/service/ipallocator/controller/repair.go, and I am having a hard time proving to myself that it is correct.
We have etcd configured to read in 'weakly consistent' mode. As @smarterclayton and @lavalamp previously explained to me (thanks!), that is not a problem, because any writes are done as a CAS with the version. We might see stale data, and derive stale data from that, but we won't ever be able to write stale data.
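As a reference point, here is a minimal sketch of that read-modify-CAS pattern, written from memory of the go-etcd v0.x client (SetConsistency, Get, CompareAndSwap); treat the exact signatures, endpoint, and key as assumptions, and note this is not the actual kubernetes storage code:

```go
package main

import (
	"fmt"

	"github.com/coreos/go-etcd/etcd"
)

// casUpdate reads a key (possibly from a lagging member, since reads are
// weakly consistent), derives a new value, and writes it back conditioned
// on the index it read.
func casUpdate(client *etcd.Client, key string, mutate func(string) string) error {
	// Weakly consistent reads may return stale data.
	client.SetConsistency(etcd.WEAK_CONSISTENCY)

	resp, err := client.Get(key, false, false)
	if err != nil {
		return err
	}

	newValue := mutate(resp.Node.Value)

	// A write derived from a stale read of *this key* fails the CAS, so
	// it can never clobber a newer value. (It says nothing about the
	// staleness of any *other* keys we consulted, which is the crux of
	// this issue.)
	_, err = client.CompareAndSwap(key, newValue, 0, "", resp.Node.ModifiedIndex)
	return err
}

func main() {
	client := etcd.NewClient([]string{"http://127.0.0.1:4001"}) // endpoint is illustrative
	err := casUpdate(client, "/example/key", func(old string) string { return old + "x" })
	fmt.Println(err)
}
```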
However, I am worried that I can construct what appear to be possible etcd-partition states that an inconsistent read won't catch. I'm basing this on etcd-io/etcd#741, in particular that raft/etcd allows a node to believe itself the leader while another leader has been elected.
For example, suppose I have one service, s1, with one IP at t0: 1.1.1.1. Then I change s1's IP at t1 to 2.2.2.2. So our IP allocation map should go from (1.1.1.1) -> (2.2.2.2).
But suppose we had a partition event in etcd just before t1. At t0, node 1 believes itself to be the leader. At the partition event, nodes 2 and 3 elect node 2 to be the leader. Node 1 still believes itself to be the leader at t1 (though its days are numbered), and will answer inconsistent reads. But Node 1 has not seen s1's change from 1.1.1.1 -> 2.2.2.2.
Now, at t2 the repair process does the following:
1. Reads the services from node 1: { s1=(1.1.1.1) }
2. Reads the node map from node 2: { (2.2.2.2) }
3. Builds a new node map: { (1.1.1.1) }
4. Writes the new node map back to node 2: { (1.1.1.1) }
And now (2.2.2.2) can be re-allocated to another service, despite the fact that s1 has it allocated.
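A compressed walk-through of that failure (maps and names invented for illustration): even though the final write is a CAS, it is a CAS against the allocation map key, which nobody else touched, so it succeeds anyway.

```go
package main

import "fmt"

func main() {
	// What the lagging (or partitioned) node 1 still reports at t2: it
	// never saw s1's change from 1.1.1.1 to 2.2.2.2.
	staleServices := map[string]string{"s1": "1.1.1.1"}

	// What the current leader (node 2) holds in the allocation map.
	allocationMap := map[string]bool{"2.2.2.2": true}

	// The repair process rebuilds the map purely from the stale service
	// list it read...
	rebuilt := map[string]bool{}
	for _, ip := range staleServices {
		rebuilt[ip] = true
	}

	// ...and writes it back. The CAS succeeds because the allocation map
	// key itself was not modified concurrently; the staleness was in the
	// *services* read, which the CAS does not cover.
	allocationMap = rebuilt

	fmt.Println(allocationMap) // map[1.1.1.1:true] -- 2.2.2.2 now looks free
}
```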
Moreover, I think that not only can this happen with a double-leader partition, it can also happen simply if node 1 is lagging behind the leader (no partition at all) while go-etcd is doing weakly consistent reads.
In short, I think we need etcd to do consistent reads. And then we also need further checks to detect partitions: I think the partition scenario could be caught because the node map would be newer than the newest service.
Is this in fact possible? Or is there some etcd/versioning mechanism I am missing?