Do we need consistent reads of etcd? #8295

Closed
justinsb opened this issue May 15, 2015 · 16 comments
Labels
area/etcd, priority/backlog, sig/api-machinery

Comments

@justinsb
Member

I am looking and looking at the code for pkg/registry/service/ipallocator/controller/repair.go, and I am having a hard time proving to myself that it is correct.

We have etcd configured to read in 'weakly consistent' mode. As @smarterclayton and @lavalamp previously explained to me (thanks!), that is not a problem, because any writes are done as a CAS with the version. We might see stale data, and derive stale data from that, but we won't ever be able to write stale data.
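
For reference, a minimal sketch of that CAS-guarded write pattern, phrased with today's etcd clientv3 API rather than the go-etcd/v2 API the code used at the time; the key name and value are invented for the illustration, and this is not the code under discussion.

package cassketch

import (
    "context"
    "errors"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// casUpdateBitmap reads the allocation key, remembers the revision it saw,
// and writes back only if the key has not changed since. If the original
// read was stale the compare fails, so stale data never overwrites newer
// data - the caller must re-read and retry instead.
func casUpdateBitmap(ctx context.Context, cli *clientv3.Client, newValue string) error {
    const key = "/registry/ranges/serviceips" // hypothetical key

    get, err := cli.Get(ctx, key)
    if err != nil {
        return err
    }
    if len(get.Kvs) == 0 {
        return errors.New("bitmap key not found")
    }
    seenRev := get.Kvs[0].ModRevision

    txn, err := cli.Txn(ctx).
        If(clientv3.Compare(clientv3.ModRevision(key), "=", seenRev)).
        Then(clientv3.OpPut(key, newValue)).
        Commit()
    if err != nil {
        return err
    }
    if !txn.Succeeded {
        return errors.New("conflict: key changed since our read; re-read and retry")
    }
    return nil
}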

However, I am worried that I can construct what appear to be possible etcd-partition states that an inconsistent read won't catch. I'm basing this on etcd-io/etcd#741, in particular the fact that raft/etcd allows a node to believe itself the leader while another leader has been elected.

For example, suppose I have one service, s1, with one IP at t0: 1.1.1.1. Then I change s1's IP at t1 to 2.2.2.2. So our IP allocation map should go from (1.1.1.1) -> (2.2.2.2).

But suppose we had a partition event in etcd just before t1. At t0, node 1 believes itself to be the leader. At the partition event, nodes 2 and 3 elect node 2 to be the leader. Node 1 still believes itself to be the leader at t1 (though its days are numbered), and will answer inconsistent reads. But Node 1 has not seen s1's change from 1.1.1.1 -> 2.2.2.2.

Now, at t2 the repair process does the following:

Reads the services from node 1: { s1=(1.1.1.1) }
Reads the node map from node 2: { (2.2.2.2) }
Builds a new node map: { (1.1.1.1) }
Writes the new node map back to node 2: { (1.1.1.1) }
And now (2.2.2.2) can be re-allocated to another service, despite the fact that s1 has it allocated.

Moreover, I think this can happen not only with a double-leader partition, but also simply if node 1 is lagging behind the leader (no partition at all) while go-etcd is doing weakly consistent reads.

In short, I think we need etcd to do consistent reads. And then we also need to do further checks to detect partitions: I think the partition scenario could be caught because the node map would be newer than the newest service.

Is this in fact possible? Or is there some etcd/versioning mechanism I am missing?
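
For concreteness, here is roughly what the 'consistent' vs 'weakly consistent' distinction looks like, sketched with the etcd clientv3 API for illustration (in v3, reads are linearizable unless WithSerializable is passed; in 2015 this was go-etcd's consistency setting). The prefix and the already-dialed client are assumptions of the sketch, not the code under discussion.

package readsketch

import (
    "context"
    "log"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// compareReads does one linearizable and one serializable read of the same
// prefix and reports the revision each response was served at.
func compareReads(ctx context.Context, cli *clientv3.Client) {
    const prefix = "/registry/services/specs" // hypothetical prefix

    // Linearizable ("consistent") read: goes through quorum and reflects
    // every write committed before it.
    strong, err := cli.Get(ctx, prefix, clientv3.WithPrefix())
    if err != nil {
        log.Fatal(err)
    }

    // Serializable ("weakly consistent") read: answered locally by whichever
    // member you happen to hit, so it may be arbitrarily stale.
    weak, err := cli.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithSerializable())
    if err != nil {
        log.Fatal(err)
    }

    log.Printf("linearizable read at revision %d, serializable read at revision %d",
        strong.Header.Revision, weak.Header.Revision)
}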

@smarterclayton
Contributor

On May 14, 2015, at 8:23 PM, Justin Santa Barbara [email protected] wrote:

For example, suppose I have one service, s1, with one IP at t0: 1.1.1.1. Then I change s1's IP at t1 to 2.2.2.2. So our IP allocation map should go from (1.1.1.1) -> (2.2.2.2).

You can't change a service's IP?

I believe repair must compare the resource versions in order to guarantee that the service list is newer than the read map (or ask for a consistent read). That's a simple test. I'm not aware of a partition scenario where a new term can be committed in the absence of a restore from backup.

I believe allocations are the only resource that requires two resources to achieve its guarantee. If sharded, the allocation maps have to be per shard anyway. Trying to think of what other resources fall under this constraint.



@smarterclayton
Contributor

Allocation is safe because set addition is a CRDT-safe operation. Removal is not safe, and so requires a linearized read-read-write in order to make forward progress.
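
A toy Go illustration of that point: set addition merges the same way no matter how the writes interleave, while a removal (or a rebuild) computed from a stale snapshot can silently drop a concurrent allocation. Nothing here is Kubernetes code; the types are invented.

package main

import "fmt"

// ipSet is a toy grow-only set of allocated IPs.
type ipSet map[string]bool

// merge is set union: commutative, associative, idempotent - the reason
// concurrent additions converge no matter which order they land in.
func merge(a, b ipSet) ipSet {
    out := ipSet{}
    for ip := range a {
        out[ip] = true
    }
    for ip := range b {
        out[ip] = true
    }
    return out
}

func main() {
    // Two writers start from the same (possibly stale) snapshot and each
    // add one allocation without seeing the other's write.
    base := ipSet{"1.1.1.1": true}
    writerA := merge(base, ipSet{"2.2.2.2": true})
    writerB := merge(base, ipSet{"3.3.3.3": true})

    // Additions merge safely: the result is the same in either order.
    fmt.Println(merge(writerA, writerB)) // map[1.1.1.1:true 2.2.2.2:true 3.3.3.3:true]

    // Removal (or a rebuild) is different: writing back a set rebuilt from
    // the stale snapshot silently drops writer A's concurrent allocation,
    // which is why it needs a linearized read followed by a CAS instead.
    rebuilt := merge(base, ipSet{})
    fmt.Println(rebuilt) // map[1.1.1.1:true] - 2.2.2.2 is gone
}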

@justinsb
Member Author

True, IP allocations can't change - poor example I chose there :-) But:

  1. I think the same applies if we delete service 1=(1.1.1.1) and create service 2=(2.2.2.2)
  2. I think we want an approach that works across all domains - I'm working on publicPorts

I get really nervous when we start layering on checks in response to counter-examples; I just worry that there's always another counter-example that hasn't been suggested yet.

Another problem: if we assign a portal IP, but then fail to update the service list, then the version of the portal map will be truly ahead of the highest version in the service list. I don't know how we differentiate that case from a stale read of the service list.

@smarterclayton
Contributor

On May 14, 2015, at 10:45 PM, Justin Santa Barbara [email protected] wrote:

Another problem: if we assign a portal IP, but then fail to update the service list, then the version of the portal map will be truly ahead of the highest version in the service list. I don't know how we differentiate that case from a stale read of the service list.

We may not have to. We can either wait and retry the read, or require allocations to miss two repair intervals before being reclaimed (or a unit multiple of that). That controls reclamation with an upper bound on the time an allocation can precede a service.

The allocator controller reduces the risk of that occurring because it is level driven. In general, the easier assumptions to relax are that service IPs never change, and that users are allowed to request portal IPs.

Appreciate the review.
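
A minimal sketch of the "miss N repair intervals before reclaiming" idea above. The names and structure are invented for the illustration and are not the actual repair controller.

package main

import "fmt"

// graceIntervals is how many consecutive repair passes must see an IP
// allocated in the bitmap but referenced by no service before we reclaim it.
// This bounds how long an allocation may precede the service that records it.
const graceIntervals = 2

type repairState struct {
    missing map[string]int // ip -> consecutive passes seen unreferenced
}

// reconcile returns the IPs that are safe to release on this pass.
func (r *repairState) reconcile(allocated, referenced map[string]bool) (release []string) {
    for ip := range allocated {
        if referenced[ip] {
            delete(r.missing, ip) // referenced again: reset its counter
            continue
        }
        r.missing[ip]++
        if r.missing[ip] >= graceIntervals {
            release = append(release, ip)
            delete(r.missing, ip)
        }
    }
    return release
}

func main() {
    r := &repairState{missing: map[string]int{}}
    allocated := map[string]bool{"2.2.2.2": true}
    referenced := map[string]bool{} // no service lists 2.2.2.2 (yet?)

    fmt.Println(r.reconcile(allocated, referenced)) // [] - first miss, keep it
    fmt.Println(r.reconcile(allocated, referenced)) // [2.2.2.2] - second miss, reclaim
}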



@smarterclayton
Contributor

Alternatively, assuming multi-object writes in etcd3, we could combine the write and the allocation. But the timeframe on that is further out.
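
For illustration, a sketch of what such a multi-object write could look like with the etcd v3 transaction API: both keys are checked and both puts land atomically, or nothing is written. Key names and payloads are hypothetical, and this is not necessarily the shape Kubernetes ultimately adopted.

package txsketch

import (
    "context"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// writeServiceAndBitmap updates a service object and the IP-allocation bitmap
// in one etcd v3 transaction: both puts land, or neither does, and they land
// only if neither key has moved since it was read at the given revision.
func writeServiceAndBitmap(ctx context.Context, cli *clientv3.Client,
    serviceKey, serviceJSON string, serviceRev int64,
    bitmapKey, bitmap string, bitmapRev int64) (conflict bool, err error) {

    resp, err := cli.Txn(ctx).
        If(
            clientv3.Compare(clientv3.ModRevision(serviceKey), "=", serviceRev),
            clientv3.Compare(clientv3.ModRevision(bitmapKey), "=", bitmapRev),
        ).
        Then(
            clientv3.OpPut(serviceKey, serviceJSON),
            clientv3.OpPut(bitmapKey, bitmap),
        ).
        Commit()
    if err != nil {
        return false, err
    }
    // If either compare failed, nothing was written; the caller should
    // re-read both objects and retry.
    return !resp.Succeeded, nil
}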

@thockin
Member

thockin commented May 15, 2015

On Thu, May 14, 2015 at 5:50 PM, Clayton Coleman
[email protected] wrote:

You can't change a service's IP?

You could change the port, though, unless we have a general rule that any instance of this pattern must be immutable.

I believe repair must compare the resource versions in order to guarantee that the service list is newer than the read map (or ask for a consistent read). That's a simple test.

Do we set ResourceVersion for a list to the highest of any member of the list? Empirically, when I get a list I always see metadata empty - no selfLink, no resourceVersion. Should we be populating it? Or should we get rid of it?

@davidopp
Member

/subscribe

@smarterclayton
Contributor

On May 14, 2015, at 11:51 PM, Tim Hockin [email protected] wrote:

Do we set ResourceVersion for a list to the highest of any member of the list?

It's the current etcd committed term so it's "now" according to the etcd server you're talking to.

Empirically, when I get a list I always see metadata
empty - no selfLink, no resourceVersion.

Get from server or via kubectl get?

Should we be populating it?

Used to be

Or should we get rid of it?


@smarterclayton
Contributor

On May 14, 2015, at 11:51 PM, Tim Hockin [email protected] wrote:

I believe repair must compare the resource versions in order to guarantee that the service list is newer than the read map (or ask for a consistent read). That's a simple test.

Also, we can compare two things. The current etcd index on the read should be greater than the read index of the map, AND at least one service modified index should be greater than the map modified index. If the latter condition is not true, we don't know whether the service change succeeded (and we didn't observe it), failed, or is delayed arbitrarily. Only in that case must we read again with a wait.
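
Those two checks, written out as a small helper for clarity. It is phrased in etcd v3 terms (revisions) purely for illustration; substitute etcd index / modifiedIndex for the v2 API of the time.

package repaircheck

// repairLooksCurrent returns true only if the service list was read at a
// newer point than the allocation map AND at least one service was modified
// after the map was. Otherwise the repair loop cannot tell a stale read from
// a write that never happened, and should re-read (or wait) before reclaiming.
func repairLooksCurrent(listReadRev, mapReadRev, mapModRev int64, serviceModRevs []int64) bool {
    if listReadRev <= mapReadRev {
        return false
    }
    for _, rev := range serviceModRevs {
        if rev > mapModRev {
            return true
        }
    }
    return false
}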


@justinsb
Member Author

I've opened an issue with etcd to see if they can provide some input.

But, I think we might be OK, even without consistent reads, if we do the following procedure (which is I think what @smarterclayton suggested):

  1. We read the key we are going to update, and note the current etcd raft index (the raft index, not the key lastModified)
  2. We read all the "input" keys, and if we ever see a raft index < the raft index in step 1, we abort / retry. We may have out-of-date reads, but we are seeing a consistent sequence in time (a linearizable view).
  3. We then do a CAS write on the first key, comparing against the value/lastModified in step 1

The write in step 3 will only succeed if the key has not changed and was up to date. (This is what we have today)

But we also know that we saw a linearizable view of the world. So we only have to deal with concurrent mutations, which the current logic deals with by always writing to the bitmap key whenever any service changes, thus invalidating the CAS.

In step 2, if we do multiple reads, I do not know whether we have to restart if we see any out-of-sequence indexes, or can just compare to step 1. My guess is that in general we do, but in this particular case because we always update the bitmap key we're OK.

In other words, I think we can get linearized reads simply by observing the etcd server index and making sure it is monotonic. But this is what I have asked the etcd folk to confirm.
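
A sketch of that three-step procedure in Go, using etcd v3 revisions where the discussion says "raft/etcd index". Illustration only: key names and the rebuild step are placeholders, and v3 reads are linearizable by default, which makes the monotonicity check in step 2 trivially pass there.

package repairsketch

import (
    "context"
    "errors"

    clientv3 "go.etcd.io/etcd/client/v3"
)

// ErrStaleView means a later read was served at an older revision than the
// first read: we did not observe a monotonic sequence, so abort and retry.
var ErrStaleView = errors.New("revision went backwards; retry")

// repairBitmap follows the three steps above.
func repairBitmap(ctx context.Context, cli *clientv3.Client,
    bitmapKey, servicePrefix string, rebuild func(*clientv3.GetResponse) string) error {

    // Step 1: read the key we intend to update; note the revision the
    // response was served at and the key's own ModRevision for the CAS.
    bm, err := cli.Get(ctx, bitmapKey)
    if err != nil {
        return err
    }
    if len(bm.Kvs) == 0 {
        return errors.New("bitmap key not found")
    }
    baseRev := bm.Header.Revision
    bitmapModRev := bm.Kvs[0].ModRevision

    // Step 2: read the "input" keys; if the response is older than what we
    // saw in step 1, the view is not monotonic - abort and retry.
    svcs, err := cli.Get(ctx, servicePrefix, clientv3.WithPrefix())
    if err != nil {
        return err
    }
    if svcs.Header.Revision < baseRev {
        return ErrStaleView
    }

    // Step 3: CAS the rebuilt bitmap against the value read in step 1. The
    // write succeeds only if nothing (including the service registry, which
    // touches the bitmap on every service change) wrote in between.
    txn, err := cli.Txn(ctx).
        If(clientv3.Compare(clientv3.ModRevision(bitmapKey), "=", bitmapModRev)).
        Then(clientv3.OpPut(bitmapKey, rebuild(svcs))).
        Commit()
    if err != nil {
        return err
    }
    if !txn.Succeeded {
        return errors.New("bitmap changed since step 1; retry")
    }
    return nil
}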

@smarterclayton
Contributor

Your description matches what I expect from etcd - we have to linearize allocation map -> service list -> CAS on allocation map. At least one service must be at or above the etcd index of the allocation map.


@thockin
Member

thockin commented May 15, 2015

On Fri, May 15, 2015 at 6:01 AM, Clayton Coleman
[email protected] wrote:

Get from server or via kubectl get?

kubectl - should there be a difference?

$ kubectl get pods -o json | head -5
{
    "kind": "List",
    "apiVersion": "v1beta3",
    "metadata": {},
    "items": [

Should we be populating it?

Used to be

Seems not any more.

@smarterclayton
Contributor

On May 15, 2015, at 11:38 AM, Tim Hockin [email protected] wrote:

On Fri, May 15, 2015 at 6:01 AM, Clayton Coleman
[email protected] wrote:

Get from server or via kubectl get?

kubectl - should there be a difference?

It's a bug, there's special code to unify resource versions across multiple calls.

$ kubectl get pods -o json | head -5
{
    "kind": "List",
    "apiVersion": "v1beta3",
    "metadata": {},
    "items": [

Should we be populating it?

Used to be

Seems not any more.

Bug. The REST API for each individual call returns it (note you have a List, not a PodList).



@lavalamp
Member

Do we set ResourceVersion for a list to the highest of any member of the list?

No, it is set to the global etcd index (which will be >= the highest member of the list). The highest member of the list might be too old to be in the watch window.

As for why you don't see it in kubectl output, that sounds like a kubectl bug. It's definitely there if you curl. (just tried it) In fact basically nothing would work if it weren't being set right.
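
For context, the list's resourceVersion is what makes the usual list-then-watch pattern work: the watch starts from "now" as of the list. A sketch with modern client-go for illustration (the 2015 client looked different); the function name and namespace handling are invented.

package watchsketch

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// listThenWatch lists pods, then starts a watch from the resourceVersion the
// list was served at, so the watch sees exactly the changes after the list.
func listThenWatch(ctx context.Context, cs *kubernetes.Clientset, namespace string) error {
    pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    fmt.Println("list served at resourceVersion", pods.ResourceVersion)

    w, err := cs.CoreV1().Pods(namespace).Watch(ctx, metav1.ListOptions{
        ResourceVersion: pods.ResourceVersion, // start from "now" as of the list
    })
    if err != nil {
        return err
    }
    defer w.Stop()

    for ev := range w.ResultChan() {
        fmt.Println("event:", ev.Type)
    }
    return nil
}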

@smarterclayton
Contributor

On May 15, 2015, at 10:55 PM, Daniel Smith [email protected] wrote:

As for why you don't see it in kubectl output, that sounds like a kubectl bug. It's definitely there if you curl. (just tried it) In fact basically nothing would work if it weren't being set right.

It's almost certainly the version merge code being wrong after recent changes to how we dealt with lists. In fact, I kind of remember putting a todo in the code to fix it.


justinsb added a commit to justinsb/kubernetes that referenced this issue May 20, 2015
justinsb added a commit to justinsb/kubernetes that referenced this issue May 22, 2015
@mikedanese mikedanese added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed team/master labels Aug 20, 2015
@timothysc
Member

timothysc commented Dec 9, 2016

Closing this issue because the original etcd issue cited has been closed for a long time now.
