Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix null reference exception in simple queue cache. #2829

Merged

Conversation

jason-bragg
Copy link
Contributor

Simple queue cache was not rate limiting correctly, leading to messages being removed outside the expected cache purge behavior. This led to null references in cache cursors under load.

Rate limiting fixed..
Code block which removed messages from cache outside the context of the expected purge behavior was removed.

Simple queue cache was not rate limiting correctly, leading to messages being removed outside the expected cache purge behavior.  This led to null references in cache cursors under load.

Rate limiting fixed..
Code block which removed messages from cache outside the context of the expected purge behavior was removed.
// cache is full. Check how many cursors we have in the oldest bucket.
int numCursorsInLastBucket = cacheCursorHistogram[0].NumCurrentCursors;
return numCursorsInLastBucket > 0;
return cacheCursorHistogram.Count >= NUM_CACHE_HISTOGRAM_BUCKETS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can u please explain why u had to change that function and how it will work now? Why the current code is correct? Not saying it's not, just want u to explain the rationale and how it will work now.

Copy link
Contributor Author

@jason-bragg jason-bragg Mar 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expected behavior of the cache is to report when it is 'under pressure' with the IsUnderPressure call. When the cache is under pressure, it is not ready to receive more messages. When it is not under pressure it reports how many messages it can accept with the GetMaxAddCount call.

In the previous version of the code, the cache was considered 'not under pressure' as long as the message count in the cache was less than the max cache size "Size < maxCacheSize", and GetMaxAddCount always reported 1/10th of the max cache size. This made it possible for the cache to report that it had more capacity than it actually did.

For instance, if it had 199 messages in a cache with size 200, it would report that it was not under pressure (199<200) but report that it had capacity for 20 messages (200/10), when it actually only had capacity for 1 (200-199).

When this happened, more messages would be added to the cache than it could handle, which was not supposed to be the case. This problem was handled by a code block in the "add" call which removed messages from the back of the cache, which is not desirable, as the expected means of removing messages from the cache is the purge logic (TryPurgeFromCache).

This change considers the cache 'under pressure' whenever the cache as all of it's available histogram buckets in use (each holding 1/10th of the cache capacity). This is a coarse check as the most recent bucket may not be entirely full, so the cache will report 'under pressure' when somewhere between 90-100% full.

This properly prohibits more messages from being added to the cache than it has capacity to support.

The solution could be more tight by varying the available capacity reported by the GetMaxAddCount call, but this logic is sufficient and simple. I tend to favor simplicity unless more complex solutions are warranted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the design behind SimpleQueueCache. But the behavior of GetMaxAddCount confuses me.
If it always return 1/10 of the max cache size, does it mean, if there's only 10 messages in a cache whose size is 200, it still return 20 as MaxAddCount, which means only 20 messages can be added each time? Is this behavior for rate limiting? or other concern?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on your description it sounds like there was simply a bug in get max add count. I am in favor of fixing it and not changing the rest of the code, including not changing the is under pressure. It also sounds a trivial fix. We have the max size and the current size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is now a separate open question. You want to simplify or even improve the pressure detection. I need to think about that. Maybe. But the issue here is the bug in max count. The two issues are separate and not directly related. Let's fix the bug and separately think about pressure detection.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change addresses the bug.
It addresses it by modifying IsUnderPressure to report correctly given the GetMaxAddCount.

The back-off logic relies on both of these calls, and requires them to function in tandem. Which call was behaving poorly is a matter of interpretation. I've opted to address the problem by simplifying the more confusing of the two.

You're suggestion is to correct the bug by making GetMaxAddCount more complicated (admittedly trivially so) rather than simplifying the IsUnderPressure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GetMaxAddCount is designed to return 1/10 of the cache size for a reason. otherwise it doesn't make sense why the original engineer would write it that way. Although I don't know about that reason, but making GetMaxAddCount return the 'correct' count may introduce more problems, because it is against its original design pattern.

Jason's change follow its original pattern. So I suggest either go with Jason's fix, or we figure out why GetMaxCount is written that way first and figure out the real fix.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the GetMaxAddCount func
public int GetMaxAddCount()
{
return CACHE_HISTOGRAM_MAX_BUCKET_SIZE;
}

based on the way it is written, maybe SimpleQueueCache only allow to fill in one bucket per add, for some reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I don't have enough background on the code comment on the rightness or wrongness of the chosen fix here, I can tell you that the fix does appear to work. I was the one who originally brought the null ref to Jason's attention and requested this fix. I have it running now on my PPE system and I can confirm that the null ref I was seeing is now gone with this fix.

// cache is full. Check how many cursors we have in the oldest bucket.
int numCursorsInLastBucket = cacheCursorHistogram[0].NumCurrentCursors;
return numCursorsInLastBucket > 0;
return cacheCursorHistogram.Count >= NUM_CACHE_HISTOGRAM_BUCKETS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

//var last = cachedMessages.Last;
cachedMessages.RemoveLast();
var bucket = cacheCursorHistogram[0]; // same as: var bucket = last.Value.CacheBucket;
bucket.UpdateNumItems(-1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: method name UpdateNumItems is confusing. I have to look into the function to know it is doing : NumItemsAdd
also with UpdateNumCursors

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is not part of the review, I will leave it to you to decide to change it to a better name or not

// cache is full. Check how many cursors we have in the oldest bucket.
int numCursorsInLastBucket = cacheCursorHistogram[0].NumCurrentCursors;
return numCursorsInLastBucket > 0;
return cacheCursorHistogram.Count >= NUM_CACHE_HISTOGRAM_BUCKETS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with the design behind SimpleQueueCache. But the behavior of GetMaxAddCount confuses me.
If it always return 1/10 of the max cache size, does it mean, if there's only 10 messages in a cache whose size is 200, it still return 20 as MaxAddCount, which means only 20 messages can be added each time? Is this behavior for rate limiting? or other concern?

// cache is full. Check how many cursors we have in the oldest bucket.
int numCursorsInLastBucket = cacheCursorHistogram[0].NumCurrentCursors;
return numCursorsInLastBucket > 0;
return cacheCursorHistogram.Count >= NUM_CACHE_HISTOGRAM_BUCKETS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think GetMaxAddCount is designed to return 1/10 of the cache size for a reason. otherwise it doesn't make sense why the original engineer would write it that way. Although I don't know about that reason, but making GetMaxAddCount return the 'correct' count may introduce more problems, because it is against its original design pattern.

Jason's change follow its original pattern. So I suggest either go with Jason's fix, or we figure out why GetMaxCount is written that way first and figure out the real fix.

// cache is full. Check how many cursors we have in the oldest bucket.
int numCursorsInLastBucket = cacheCursorHistogram[0].NumCurrentCursors;
return numCursorsInLastBucket > 0;
return cacheCursorHistogram.Count >= NUM_CACHE_HISTOGRAM_BUCKETS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the GetMaxAddCount func
public int GetMaxAddCount()
{
return CACHE_HISTOGRAM_MAX_BUCKET_SIZE;
}

based on the way it is written, maybe SimpleQueueCache only allow to fill in one bucket per add, for some reason.

@sergeybykov
Copy link
Contributor

I'll merge this, so that we can include the fix in 1.4.1.

@sergeybykov sergeybykov merged commit ac6f1d3 into dotnet:master Mar 24, 2017
sergeybykov pushed a commit to sergeybykov/orleans that referenced this pull request Mar 24, 2017
Simple queue cache was not rate limiting correctly, leading to messages being removed outside the expected cache purge behavior.  This led to null references in cache cursors under load.

Rate limiting fixed..
Code block which removed messages from cache outside the context of the expected purge behavior was removed.
jdom pushed a commit that referenced this pull request Mar 24, 2017
Simple queue cache was not rate limiting correctly, leading to messages being removed outside the expected cache purge behavior.  This led to null references in cache cursors under load.

Rate limiting fixed..
Code block which removed messages from cache outside the context of the expected purge behavior was removed.
@github-actions github-actions bot locked and limited conversation to collaborators Dec 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants