agent: Cache can return errors older than the last non-error result #4480

Closed
banks opened this issue Aug 1, 2018 · 1 comment · Fixed by #5094
Labels: theme/connect (Anything related to Consul Connect, Service Mesh, Side Car Proxies), type/bug (Feature does not function as expected)

Comments

banks (Member) commented Aug 1, 2018

Assume the first call to the Roots endpoint failed. That cache entry looks a little like this (pseudo code):

CacheEntry{
  Value: nil,
  Error: "not bootstrapped yet",
  Index: 0,
  Valid: false,
}
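
For reference, a rough sketch of the entry shape these pseudo snapshots describe; the field names come from this issue, but the exact types in cache.go are assumed:

// Sketch only: fields referenced in this issue, types assumed.
type cacheEntry struct {
	Value  interface{}   // result of the last successful fetch, nil if none yet
	Error  error         // error from the last fetch; the bug is that this is never cleared
	Index  uint64        // raft index of Value, used for blocking queries
	Valid  bool          // true once a real value has been stored
	Waiter chan struct{} // closed when a fetch completes, unblocking waiting Gets
}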

The next blocking call comes in with index=1 since we set that in the leaf client:

// We always want to block on at least an initial value. If this isn't
minIndex := atomic.LoadUint64(&c.caIndex)
if minIndex == 0 {
	minIndex = 1
}

So the get makes another request to roots. Let's assume this request succeeds.

The non-nil value is populated into the new cache entry, and the index updated:

consul/agent/cache/cache.go, lines 346 to 364 in ec755b4:

if result.Value != nil {
	// A new value was given, so we create a brand new entry.
	newEntry.Value = result.Value
	newEntry.Index = result.Index
	if newEntry.Index < 1 {
		// Less than one is invalid unless there was an error and in this case
		// there wasn't since a value was returned. If a badly behaved RPC
		// returns 0 when it has no data, we might get into a busy loop here. We
		// set this to minimum of 1 which is safe because no valid user data can
		// ever be written at raft index 1 due to the bootstrap process for
		// raft. This insure that any subsequent background refresh request will
		// always block, but allows the initial request to return immediately
		// even if there is no data.
		newEntry.Index = 1
	}
	// This is a valid entry with a result
	newEntry.Valid = true
}

The cache entry now looks like this:

CacheEntry{
  Value: <response>,
  Error: "not bootstrapped yet", // nothing cleared this
  Index: 123, // actual index of the response
  Valid: true,
}

Then we close the Waiter chan unblocking any Get calls which loop and hit:

consul/agent/cache/cache.go, lines 211 to 231 in ec755b4:

if ok && entry.Valid {
	if info.MinIndex == 0 || info.MinIndex < entry.Index {
		meta := ResultMeta{}
		if first {
			metrics.IncrCounter([]string{"consul", "cache", t, "hit"}, 1)
			meta.Hit = true
		}

		// Touch the expiration and fix the heap.
		c.entriesLock.Lock()
		entry.Expiry.Reset()
		c.entriesExpiryHeap.Fix(entry.Expiry)
		c.entriesLock.Unlock()

		// We purposely do not return an error here since the cache
		// only works with fetching values that either have a value
		// or have an error, but not both. The Error may be non-nil
		// in the entry because of this to note future fetch errors.
		return entry.Value, meta, nil
	}
}

But since the Index has actually changed, we make it through OK, and the comment there explains why the Error value was left in place and gets ignored in this case.

So far so good.

Then the next blocking cache.Get comes in with the same index 123.

Assume it times out (because the roots didn't change), which means the roots Fetch returns a good response (the same as the one already in the cache) with the same Index.

consul/agent/cache/cache.go, lines 346 to 364 in ec755b4:

if result.Value != nil {
	// A new value was given, so we create a brand new entry.
	newEntry.Value = result.Value
	newEntry.Index = result.Index
	if newEntry.Index < 1 {
		// Less than one is invalid unless there was an error and in this case
		// there wasn't since a value was returned. If a badly behaved RPC
		// returns 0 when it has no data, we might get into a busy loop here. We
		// set this to minimum of 1 which is safe because no valid user data can
		// ever be written at raft index 1 due to the bootstrap process for
		// raft. This insure that any subsequent background refresh request will
		// always block, but allows the initial request to return immediately
		// even if there is no data.
		newEntry.Index = 1
	}
	// This is a valid entry with a result
	newEntry.Valid = true
}

That is a no-op since it sets the same value, index, and valid flag. So the cache entry still looks the same:

CacheEntry{
  Value: <response>,
  Error: "not bootstrapped yet", // nothing cleared this
  Index: 123, // actual index of the response
  Valid: true,
}

BUT we close the Waiter chan, which causes our Get to unblock again.

This time we have the same index as the cached entry, so we fail to enter the happy path here:

consul/agent/cache/cache.go, lines 211 to 212 in ec755b4:

if ok && entry.Valid {
	if info.MinIndex == 0 || info.MinIndex < entry.Index {

We should go back to blocking on another query if timeoutCh hasn't fired yet, but instead we hit:

consul/agent/cache/cache.go, lines 233 to 240 in ec755b4:

// If this isn't our first time through and our last value has an error,
// then we return the error. This has the behavior that we don't sit in
// a retry loop getting the same error for the entire duration of the
// timeout. Instead, we make one effort to fetch a new value, and if
// there was an error, we return.
if !first && entry.Error != nil {
	return entry.Value, ResultMeta{}, entry.Error
}

And so we return (early) the error from before the current valid cached result!
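
To spell out the state at the point the stale error escapes, here is a condensed sketch using only the values from this walkthrough (annotation, not new code in cache.go):

// State when the follow-up Get re-checks the entry (per the walkthrough):
//   info.MinIndex == 123   (the caller already has the value at index 123)
//   entry.Index   == 123   (nothing changed, so the index didn't move)
//   entry.Valid   == true
//   entry.Error   == "not bootstrapped yet"  (left over from the very first fetch)
//
// The happy path needs MinIndex < entry.Index, so it is skipped, and because
// first == false and entry.Error != nil the stale-error branch fires:
if !first && entry.Error != nil {
	// Returns an error that predates the perfectly good cached value.
	return entry.Value, ResultMeta{}, entry.Error
}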

@banks banks added type/bug Feature does not function as expected theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies labels Aug 1, 2018
@banks banks added this to the 1.2.3 milestone Aug 1, 2018
banks (Member Author) commented Aug 1, 2018

For context, we theorized that we must be "caching errors" after seeing behaviour on a local test cluster that blew our minds, until we realised that an agent was consistently returning errors from a request it had made during a wonky cluster startup (before the CA was populated), long after the cluster was up and healthy.

I went through the code and nearly gave up on the first pass when it turned out we ignore the error stored alongside the next good response, but pressed on and saw that the next request after that would hit this case.

I think the fix is either to clear the error field when a valid = true response is stored, or to be more sophisticated about the condition under which we exit with an error. But I've not thought it through yet.
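
A minimal sketch of the first option, assuming the entry-population block quoted above is the place to do it; everything except the final reset is just the code already shown earlier:

if result.Value != nil {
	// A new value was given, so we create a brand new entry.
	newEntry.Value = result.Value
	newEntry.Index = result.Index
	if newEntry.Index < 1 {
		newEntry.Index = 1
	}
	// This is a valid entry with a result.
	newEntry.Valid = true
	// Sketch of option one: drop any error left over from an earlier failed
	// fetch so a later blocking Get can't return it once a good result is cached.
	newEntry.Error = nil
}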

@pearkes pearkes modified the milestones: 1.2.3, 1.3.0 Sep 11, 2018
@mkeeler mkeeler modified the milestones: 1.3.0, 1.4.0 Oct 10, 2018
@banks banks modified the milestones: 1.4.0, Upcoming Oct 15, 2018
@banks banks modified the milestones: Upcoming, 1.4.1 Nov 28, 2018
@banks banks self-assigned this Dec 13, 2018
banks added a commit that referenced this issue Dec 13, 2018
banks added a commit that referenced this issue Jan 8, 2019
* Fixes #4480. Don't leave old errors around in cache that can be hit in specific circumstances.

* Move error reset to cover extreme edge case of nil Value, nil err Fetch