Fix null reference exception in simple queue cache. #2829

Merged
15 changes: 15 additions & 0 deletions src/Orleans/Streams/QueueAdapters/DataNotAvailableException.cs
@@ -19,6 +19,21 @@ public DataNotAvailableException(SerializationInfo info, StreamingContext context
             : base(info, context)
         {
         }
 #endif
     }
+
+    [Serializable]
+    public class CacheFullException : OrleansException
+    {
+        public CacheFullException() : this("Queue message cache is full") { }
+        public CacheFullException(string message) : base(message) { }
+        public CacheFullException(string message, Exception inner) : base(message, inner) { }
+
+#if !NETSTANDARD
+        public CacheFullException(SerializationInfo info, StreamingContext context)
+            : base(info, context)
+        {
+        }
+#endif
+    }
 }
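
The new CacheFullException added above is thrown from SimpleQueueCache.Add later in this PR as a "should never happen" guard. Here is a minimal standalone sketch of how a caller could surface such a failure; the DemoCache type, its Add method, and the size values are stand-ins invented for illustration, not the Orleans API.

    using System;

    // Stand-in types so the sketch compiles on its own; not the Orleans API.
    class CacheFullException : Exception
    {
        public CacheFullException() : this("Queue message cache is full") { }
        public CacheFullException(string message) : base(message) { }
    }

    class DemoCache
    {
        private const int MaxSize = 200;
        private int size = MaxSize;                 // pretend the cache is already full

        public void Add(object batch)
        {
            // mirrors the guard added to SimpleQueueCache.Add in this PR
            if (size >= MaxSize) throw new CacheFullException();
            size++;
        }
    }

    static class Demo
    {
        static void Main()
        {
            try
            {
                new DemoCache().Add(new object());
            }
            catch (CacheFullException ex)
            {
                // reaching this handler means the pressure accounting upstream broke down
                Console.WriteLine("Rejected: " + ex.Message);
            }
        }
    }
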
@@ -79,12 +79,7 @@ public SimpleQueueCache(int cacheSize, Logger logger)
         /// </summary>
         public virtual bool IsUnderPressure()
         {
-            if (cachedMessages.Count == 0) return false; // empty cache
-            if (Size < maxCacheSize) return false; // there is still space in cache
-            if (cacheCursorHistogram.Count == 0) return false; // no cursors yet - zero consumers basically yet.
-            // cache is full. Check how many cursors we have in the oldest bucket.
-            int numCursorsInLastBucket = cacheCursorHistogram[0].NumCurrentCursors;
-            return numCursorsInLastBucket > 0;
+            return cacheCursorHistogram.Count >= NUM_CACHE_HISTOGRAM_BUCKETS;

Contributor:
Can you please explain why you had to change that function and how it will work now? Why is the current code correct? Not saying it's not, just want you to explain the rationale and how it will work now.

Contributor Author (@jason-bragg, Mar 8, 2017):
The expected behavior of the cache is to report when it is 'under pressure' with the IsUnderPressure call. When the cache is under pressure, it is not ready to receive more messages. When it is not under pressure it reports how many messages it can accept with the GetMaxAddCount call.

In the previous version of the code, the cache was considered 'not under pressure' as long as the message count in the cache was less than the max cache size "Size < maxCacheSize", and GetMaxAddCount always reported 1/10th of the max cache size. This made it possible for the cache to report that it had more capacity than it actually did.

For instance, if it had 199 messages in a cache with size 200, it would report that it was not under pressure (199<200) but report that it had capacity for 20 messages (200/10), when it actually only had capacity for 1 (200-199).

When this happened, more messages would be added to the cache than it could handle, which was not supposed to be the case. This problem was handled by a code block in the Add call that removed messages from the back of the cache, which is not desirable: the expected means of removing messages from the cache is the purge logic (TryPurgeFromCache).

This change considers the cache 'under pressure' whenever all of its available histogram buckets are in use (each holding 1/10th of the cache capacity). This is a coarse check, as the most recent bucket may not be entirely full, so the cache will report 'under pressure' when it is somewhere between 90% and 100% full.

This properly prohibits more messages from being added to the cache than it has capacity to support.

The solution could be made tighter by varying the available capacity reported by the GetMaxAddCount call, but this logic is sufficient and simple. I tend to favor simplicity unless a more complex solution is warranted.
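
To make the arithmetic in this explanation concrete, here is a small standalone sketch. The 200-message cache, ten histogram buckets, and 199 cached messages come from the scenario described above; the local names are illustrative stand-ins, not the actual SimpleQueueCache fields.

    using System;

    // Standalone arithmetic mirroring the 199-of-200 scenario described above.
    class PressureMath
    {
        static void Main()
        {
            const int maxCacheSize = 200;
            const int numBuckets = 10;                        // NUM_CACHE_HISTOGRAM_BUCKETS
            const int bucketSize = maxCacheSize / numBuckets; // CACHE_HISTOGRAM_MAX_BUCKET_SIZE = 20
            int size = 199;                                   // messages currently cached
            int bucketsInUse = 10;                            // 199 items span all 10 buckets

            // Decisive clause of the old check in this scenario: still "not under pressure"...
            bool oldUnderPressure = size >= maxCacheSize;     // false, since 199 < 200
            // ...yet GetMaxAddCount still promises a whole bucket:
            int promised = bucketSize;                        // 20
            int actualRoom = maxCacheSize - size;             // 1 -> 19 messages of overflow

            // New check: under pressure as soon as every histogram bucket is in use.
            bool newUnderPressure = bucketsInUse >= numBuckets; // true -> the caller backs off

            Console.WriteLine($"old: underPressure={oldUnderPressure}, promised={promised}, actualRoom={actualRoom}");
            Console.WriteLine($"new: underPressure={newUnderPressure}");
        }
    }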

Contributor:
+1

Contributor:
I'm not familiar with the design behind SimpleQueueCache, but the behavior of GetMaxAddCount confuses me.
If it always returns 1/10 of the max cache size, does that mean that if there are only 10 messages in a cache whose size is 200, it still returns 20 as the max add count, so only 20 messages can be added each time? Is this behavior for rate limiting, or some other concern?

Contributor:
Based on your description it sounds like there was simply a bug in GetMaxAddCount. I am in favor of fixing it and not changing the rest of the code, including not changing IsUnderPressure. It also sounds like a trivial fix: we have the max size and the current size.
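
For reference, the "trivial fix" suggested here (clamping GetMaxAddCount to the remaining space) would presumably look something like the sketch below. This is not code from the PR; the stub class and its constant values exist only so the sketch compiles on its own.

    using System;

    // Hypothetical alternative (not the fix taken in this PR): clamp the reported add
    // count to the space actually left in the cache. Field names mirror the discussion;
    // the stub class and values exist only so the sketch compiles standalone.
    class GetMaxAddCountAlternative
    {
        private const int maxCacheSize = 200;
        private const int CACHE_HISTOGRAM_MAX_BUCKET_SIZE = 20;  // 1/10th of the cache
        private int Size = 199;                                   // messages currently cached

        public int GetMaxAddCount()
        {
            return Math.Min(CACHE_HISTOGRAM_MAX_BUCKET_SIZE, Math.Max(maxCacheSize - Size, 0));
        }

        static void Main()
        {
            Console.WriteLine(new GetMaxAddCountAlternative().GetMaxAddCount());  // prints 1, not 20
        }
    }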

Contributor:
So this is now a separate open question. You want to simplify or even improve the pressure detection. I need to think about that. Maybe. But the issue here is the bug in max count. The two issues are separate and not directly related. Let's fix the bug and separately think about pressure detection.

Contributor Author (@jason-bragg):
This change addresses the bug.
It addresses it by modifying IsUnderPressure to report correctly given the behavior of GetMaxAddCount.

The back-off logic relies on both of these calls, and requires them to function in tandem. Which call was behaving poorly is a matter of interpretation. I've opted to address the problem by simplifying the more confusing of the two.

Your suggestion is to correct the bug by making GetMaxAddCount more complicated (admittedly trivially so) rather than by simplifying IsUnderPressure.
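
To illustrate what "function in tandem" means for the back-off logic, here is a rough sketch of the kind of receive loop a cache consumer might run, checking IsUnderPressure first and then pulling at most GetMaxAddCount messages. The loop, interface, and type names below are assumptions made for this sketch; the actual pulling-agent code is not part of this PR.

    using System.Collections.Generic;

    // Minimal stand-ins so the sketch compiles on its own; not the Orleans interfaces.
    interface IPressureAwareCache
    {
        bool IsUnderPressure();
        int GetMaxAddCount();
        void Add(IReadOnlyList<object> batch);
    }

    static class BackoffSketch
    {
        // Hypothetical receive loop: only dequeue as many messages as the cache says it can take.
        public static void Pump(IPressureAwareCache cache, Queue<object> incoming)
        {
            if (cache.IsUnderPressure())
                return;                            // back off; try again on the next timer tick

            int room = cache.GetMaxAddCount();     // with the old IsUnderPressure this could exceed real capacity
            var batch = new List<object>();
            while (batch.Count < room && incoming.Count > 0)
                batch.Add(incoming.Dequeue());

            if (batch.Count > 0)
                cache.Add(batch);
        }
    }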

Contributor:
I think GetMaxAddCount is designed to return 1/10 of the cache size for a reason; otherwise it doesn't make sense that the original engineer would write it that way. I don't know what that reason is, but making GetMaxAddCount return the 'correct' count may introduce more problems, because it goes against the original design.

Jason's change follows the original pattern. So I suggest we either go with Jason's fix, or first figure out why GetMaxAddCount is written that way and then figure out the real fix.

Contributor:
This is the GetMaxAddCount function:

    public int GetMaxAddCount()
    {
        return CACHE_HISTOGRAM_MAX_BUCKET_SIZE;
    }

Based on the way it is written, maybe SimpleQueueCache is only allowed to fill one bucket per Add, for some reason.
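
For context, the bucket size is presumably derived from the overall cache size in the constructor, roughly as sketched below. The real constructor is not shown in this diff, so the ten-bucket split and the exact expression are assumptions based on the "1/10th of the cache capacity" description earlier in this thread.

    using System;

    // Assumed relationship between the constants discussed above; not code from the diff.
    class BucketSizingSketch
    {
        private const int NUM_CACHE_HISTOGRAM_BUCKETS = 10;
        private readonly int maxCacheSize;
        private readonly int CACHE_HISTOGRAM_MAX_BUCKET_SIZE;

        public BucketSizingSketch(int cacheSize)
        {
            maxCacheSize = cacheSize;
            // each histogram bucket holds roughly 1/10th of the cache capacity, which is
            // why GetMaxAddCount returning one bucket equals 1/10th of the cache size
            CACHE_HISTOGRAM_MAX_BUCKET_SIZE = cacheSize / NUM_CACHE_HISTOGRAM_BUCKETS;
        }

        static void Main()
        {
            var sketch = new BucketSizingSketch(200);
            Console.WriteLine(sketch.maxCacheSize);                     // 200
            Console.WriteLine(sketch.CACHE_HISTOGRAM_MAX_BUCKET_SIZE);  // 20
        }
    }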

Contributor:
While I don't have enough background on the code to comment on the rightness or wrongness of the chosen fix here, I can tell you that the fix does appear to work. I was the one who originally brought the null ref to Jason's attention and requested this fix. I have it running now on my PPE system, and I can confirm that the null ref I was seeing is gone with this fix.

}


@@ -291,6 +286,8 @@ internal void UnsetCursor(SimpleQueueCacheCursor cursor, StreamSequenceToken tok
         private void Add(IBatchContainer batch, StreamSequenceToken sequenceToken)
         {
             if (batch == null) throw new ArgumentNullException(nameof(batch));
+            // this should never happen, but just in case
+            if (Size >= maxCacheSize) throw new CacheFullException();

             CacheBucket cacheBucket;
             if (cacheCursorHistogram.Count == 0)
@@ -319,18 +316,6 @@ private void Add(IBatchContainer batch, StreamSequenceToken sequenceToken)

             cachedMessages.AddFirst(new LinkedListNode<SimpleQueueCacheItem>(item));
             cacheBucket.UpdateNumItems(1);
-
-            if (Size > maxCacheSize)
-            {
-                //var last = cachedMessages.Last;
-                cachedMessages.RemoveLast();
-                var bucket = cacheCursorHistogram[0]; // same as: var bucket = last.Value.CacheBucket;
-                bucket.UpdateNumItems(-1);
-                if (bucket.NumCurrentItems == 0)
-                {
-                    cacheCursorHistogram.RemoveAt(0);
-                }
-            }
         }

         internal static void Log(Logger logger, string format, params object[] args)