agent: Cache can return errors older than the last non-error result #4480

Closed
banks opened this issue Aug 1, 2018 · 1 comment · Fixed by #5094
Labels: theme/connect (Anything related to Consul Connect, Service Mesh, Side Car Proxies), type/bug (Feature does not function as expected)

Comments

banks (Member) commented Aug 1, 2018

Assume the first call to the Roots endpoint failed. That cache entry looks a little like this (pseudo code):

CacheEntry{
  Value: nil,
  Error: "not bootstrapped yet",
  Index: 0,
  Valid: false,
}
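
For reference, a rough sketch of the entry shape these pseudo snapshots describe; the field names come from this issue, but the exact types in cache.go are assumed:

// Sketch only: fields referenced in this issue, types assumed.
type cacheEntry struct {
	Value  interface{}   // result of the last successful fetch, nil if none yet
	Error  error         // error from the last fetch; the bug is that this is never cleared
	Index  uint64        // raft index of Value, used for blocking queries
	Valid  bool          // true once a real value has been stored
	Waiter chan struct{} // closed when a fetch completes, unblocking waiting Gets
}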

The next blocking call comes in with index=1 since we set that in the leaf client:

// We always want to block on at least an initial value. If this isn't
minIndex := atomic.LoadUint64(&c.caIndex)
if minIndex == 0 {
	minIndex = 1
}

So the get makes another request to roots. Let's assume this request succeeds.

The non-nil value is populated into the new cache entry, and the index updated:

consul/agent/cache/cache.go, lines 346 to 364 in ec755b4:

if result.Value != nil {
	// A new value was given, so we create a brand new entry.
	newEntry.Value = result.Value
	newEntry.Index = result.Index
	if newEntry.Index < 1 {
		// Less than one is invalid unless there was an error and in this case
		// there wasn't since a value was returned. If a badly behaved RPC
		// returns 0 when it has no data, we might get into a busy loop here. We
		// set this to minimum of 1 which is safe because no valid user data can
		// ever be written at raft index 1 due to the bootstrap process for
		// raft. This insure that any subsequent background refresh request will
		// always block, but allows the initial request to return immediately
		// even if there is no data.
		newEntry.Index = 1
	}
	// This is a valid entry with a result
	newEntry.Valid = true
}

The cache entry now looks like this:

CacheEntry{
  Value: <response>,
  Error: "not bootstrapped yet", // nothing cleared this
  Index: 123, // actual index of the response
  Valid: true,
}

Then we close the Waiter chan unblocking any Get calls which loop and hit:

consul/agent/cache/cache.go, lines 211 to 231 in ec755b4:

if ok && entry.Valid {
	if info.MinIndex == 0 || info.MinIndex < entry.Index {
		meta := ResultMeta{}
		if first {
			metrics.IncrCounter([]string{"consul", "cache", t, "hit"}, 1)
			meta.Hit = true
		}

		// Touch the expiration and fix the heap.
		c.entriesLock.Lock()
		entry.Expiry.Reset()
		c.entriesExpiryHeap.Fix(entry.Expiry)
		c.entriesLock.Unlock()

		// We purposely do not return an error here since the cache
		// only works with fetching values that either have a value
		// or have an error, but not both. The Error may be non-nil
		// in the entry because of this to note future fetch errors.
		return entry.Value, meta, nil
	}
}

But since the Index has actually changed, we make it through OK, and the comment there explains why the Error value was left in place and gets ignored in this case.

So far so good.

Then the next blocking cache.Get comes in with the same index 123.

Assume it times out (because the roots didn't change), which means the roots Fetch returns a good response (the same as the one already in the cache) with the same Index.

consul/agent/cache/cache.go, lines 346 to 364 in ec755b4:

if result.Value != nil {
	// A new value was given, so we create a brand new entry.
	newEntry.Value = result.Value
	newEntry.Index = result.Index
	if newEntry.Index < 1 {
		// Less than one is invalid unless there was an error and in this case
		// there wasn't since a value was returned. If a badly behaved RPC
		// returns 0 when it has no data, we might get into a busy loop here. We
		// set this to minimum of 1 which is safe because no valid user data can
		// ever be written at raft index 1 due to the bootstrap process for
		// raft. This insure that any subsequent background refresh request will
		// always block, but allows the initial request to return immediately
		// even if there is no data.
		newEntry.Index = 1
	}
	// This is a valid entry with a result
	newEntry.Valid = true
}

That is a no-op since it sets the same value, index, and valid flag. So the cache entry still looks the same:

CacheEntry{
  Value: <response>,
  Error: "not bootstrapped yet", // nothing cleared this
  Index: 123, // actual index of the response
  Valid: true,
}

BUT we close the Waiter chan, which causes our Get to unblock again.

This time we have the same index as the cached entry, so we fail to enter the happy path here:

consul/agent/cache/cache.go, lines 211 to 212 in ec755b4:

if ok && entry.Valid {
	if info.MinIndex == 0 || info.MinIndex < entry.Index {

We should go back to blocking on another query if timeoutCh hasn't fired yet, but instead we hit:

consul/agent/cache/cache.go, lines 233 to 240 in ec755b4:

// If this isn't our first time through and our last value has an error,
// then we return the error. This has the behavior that we don't sit in
// a retry loop getting the same error for the entire duration of the
// timeout. Instead, we make one effort to fetch a new value, and if
// there was an error, we return.
if !first && entry.Error != nil {
	return entry.Value, ResultMeta{}, entry.Error
}

And so we return (early) the error from before the current valid cached result!
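
To spell out the state at the point the stale error escapes, here is a condensed sketch using only the values from this walkthrough (annotation, not new code in cache.go):

// State when the follow-up Get re-checks the entry (per the walkthrough):
//   info.MinIndex == 123   (the caller already has the value at index 123)
//   entry.Index   == 123   (nothing changed, so the index didn't move)
//   entry.Valid   == true
//   entry.Error   == "not bootstrapped yet"  (left over from the very first fetch)
//
// The happy path needs MinIndex < entry.Index, so it is skipped, and because
// first == false and entry.Error != nil the stale-error branch fires:
if !first && entry.Error != nil {
	// Returns an error that predates the perfectly good cached value.
	return entry.Value, ResultMeta{}, entry.Error
}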

@banks banks added type/bug Feature does not function as expected theme/connect Anything related to Consul Connect, Service Mesh, Side Car Proxies labels Aug 1, 2018
@banks banks added this to the 1.2.3 milestone Aug 1, 2018
banks (Member Author) commented Aug 1, 2018

For context, we theorized that we must be "caching errors" after seeing behaviour on a local test cluster that blew our minds, until we realised that an agent was consistently returning errors from a request it had made during a wonky cluster startup (before the CA was populated), long after the cluster was up and healthy.

I went through the code and nearly gave up on the first pass when it turned out we ignore the error stored alongside the next good response, but pressed on and saw that the next request after that would hit this case.

I think the fix is either to clear the error field when a valid = true response is stored, or to be more sophisticated about the condition under which we exit with an error. But I've not thought it through yet.
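
A minimal sketch of the first option, assuming the entry-population block quoted above is the place to do it; everything except the final reset is just the code already shown earlier:

if result.Value != nil {
	// A new value was given, so we create a brand new entry.
	newEntry.Value = result.Value
	newEntry.Index = result.Index
	if newEntry.Index < 1 {
		newEntry.Index = 1
	}
	// This is a valid entry with a result.
	newEntry.Valid = true
	// Sketch of option one: drop any error left over from an earlier failed
	// fetch so a later blocking Get can't return it once a good result is cached.
	newEntry.Error = nil
}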

@pearkes pearkes modified the milestones: 1.2.3, 1.3.0 Sep 11, 2018
@mkeeler mkeeler modified the milestones: 1.3.0, 1.4.0 Oct 10, 2018
@banks banks modified the milestones: 1.4.0, Upcoming Oct 15, 2018
@banks banks modified the milestones: Upcoming, 1.4.1 Nov 28, 2018
@banks banks self-assigned this Dec 13, 2018
banks added a commit that referenced this issue Dec 13, 2018
banks added a commit that referenced this issue Jan 8, 2019
* Fixes #4480. Don't leave old errors around in cache that can be hit in specific circumstances.

* Move error reset to cover extreme edge case of nil Value, nil err Fetch