Priority index locking and logging of long operations #1847
Conversation
maxBuffered:  maxBuffered,
maxDelay:     maxDelay,
flushTrigger: make(chan struct{}, 1),
flushPending: false,
This is a large change in WriteQueue behavior to make maxBuffered more of a heuristic than an absolute max. This was done to take more control of the locking in the flush function.
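Roughly, the shape is something like the sketch below. This is not the actual implementation, just an illustration built around the fields from the hunk above (maxBuffered, maxDelay, flushTrigger, flushPending); Queue, flush, and the loop are assumed for the sake of the example.

```go
package memory

import (
	"sync"
	"time"
)

// Sketch: maxBuffered acts as a flush heuristic rather than a hard cap.
// Queue never flushes inline, it only signals the background loop, so the
// buffer can temporarily exceed maxBuffered while the loop waits for the
// (possibly slow) index write lock.
type WriteQueue struct {
	sync.Mutex
	buffer       []string // stand-in for the real queued entries
	flushPending bool
	maxBuffered  int
	maxDelay     time.Duration
	flushTrigger chan struct{} // capacity 1, so repeated signals coalesce
	shutdown     chan struct{}
}

func NewWriteQueue(maxBuffered int, maxDelay time.Duration) *WriteQueue {
	wq := &WriteQueue{
		maxBuffered:  maxBuffered,
		maxDelay:     maxDelay,
		flushTrigger: make(chan struct{}, 1),
		flushPending: false,
		shutdown:     make(chan struct{}),
	}
	go wq.loop()
	return wq
}

// Queue buffers an entry and nudges the flusher once maxBuffered is reached.
func (wq *WriteQueue) Queue(entry string) {
	wq.Lock()
	wq.buffer = append(wq.buffer, entry)
	trigger := len(wq.buffer) >= wq.maxBuffered && !wq.flushPending
	if trigger {
		wq.flushPending = true
	}
	wq.Unlock()
	if trigger {
		select {
		case wq.flushTrigger <- struct{}{}:
		default:
		}
	}
}

// flush drains whatever has accumulated, however much that is.
func (wq *WriteQueue) flush() {
	wq.Lock()
	batch := wq.buffer
	wq.buffer = nil
	wq.flushPending = false
	wq.Unlock()
	_ = batch // the real queue writes this to the index under its write lock
}

func (wq *WriteQueue) loop() {
	timer := time.NewTimer(wq.maxDelay)
	for {
		select {
		case <-wq.flushTrigger:
			wq.flush()
		case <-timer.C:
			wq.flush()
			timer.Reset(wq.maxDelay)
		case <-wq.shutdown:
			wq.flush()
			if !timer.Stop() {
				<-timer.C
			}
			return
		}
	}
}
```

The point being: Queue only signals the flusher and never flushes inline, so the buffer can grow past maxBuffered by however much arrives while the flusher is waiting for the index write lock.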
idx/memory/write_queue.go
			if !timer.Stop() {
				<-timer.C
			}
			return
		}
	}
}

func (wq *WriteQueue) isFlushPending() bool {
The intent was to use this in testing, or something similar.
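Something like this, presumably (building on the sketch above; whether flushPending is guarded exactly like this is an assumption):

```go
// isFlushPending reports whether a flush has been requested but not yet
// carried out. Useful for tests that want to wait for the queue to settle.
func (wq *WriteQueue) isFlushPending() bool {
	wq.Lock()
	defer wq.Unlock()
	return wq.flushPending
}
```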
// Wait for any pending writes
bc := BlockContext{lock: pm, preLockTime: time.Now()}
pm.lowPrioLock.RLock()
pm.lock.RLock()
I'm not entirely convinced we actually need to lock both of these here, if Lock is guaranteed to acquire both write locks.
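For reference, a rough sketch of the arrangement in question. The field names lowPrioLock and lock come from the diff; the method names and structure here are assumptions, not the actual PriorityRWMutex API.

```go
package memory

import "sync"

// PriorityRWMutex (sketch): two nested RWMutexes. Writers take both, slow
// ("low priority") readers take both read locks, and fast ("high priority")
// readers take only the inner one, so they are never stuck behind a writer
// that is itself stuck behind a slow reader.
type PriorityRWMutex struct {
	lowPrioLock sync.RWMutex // taken first by writers and by slow readers
	lock        sync.RWMutex // the only lock the fast read path touches
}

// Lock acquires both write locks, outer lock first.
func (pm *PriorityRWMutex) Lock() {
	pm.lowPrioLock.Lock()
	pm.lock.Lock()
}

func (pm *PriorityRWMutex) Unlock() {
	pm.lock.Unlock()
	pm.lowPrioLock.Unlock()
}

// RLockLowPrio is the slow-read path, mirroring the hunk above: it waits for
// any pending writer on the outer lock before touching the inner one.
func (pm *PriorityRWMutex) RLockLowPrio() {
	pm.lowPrioLock.RLock()
	pm.lock.RLock()
}

func (pm *PriorityRWMutex) RUnlockLowPrio() {
	pm.lock.RUnlock()
	pm.lowPrioLock.RUnlock()
}

// RLockHighPrio is the fast-read path: a writer that is still blocked on
// lowPrioLock has not taken pm.lock yet, so fast reads keep flowing.
func (pm *PriorityRWMutex) RLockHighPrio() {
	pm.lock.RLock()
}

func (pm *PriorityRWMutex) RUnlockHighPrio() {
	pm.lock.RUnlock()
}
```

If Lock really does hold lowPrioLock for the entire write, then a low-priority reader holding only lowPrioLock.RLock already excludes writers, which is presumably why the second RLock looks questionable.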
At a high level, the PriorityRWMutex seems to make sense. Here's my breakdown of master vs PR 1847.

Master: [1] it wouldn't be so bad if we could squeeze in some fast reads while the original fast read is still executing, but how many do we want to squeeze in? Where do we want to draw the line?

PR 1847 (new): so this tackles precisely the 2nd case.
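To make the second case concrete, here's a small standalone demo (plain sync.RWMutex, not project code) of the behavior on master: once a writer is queued behind a long read, new readers queue behind the writer too.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Demonstrates the failure mode with a single RWMutex: a long read plus a
// waiting writer causes even cheap reads (think ingest-path Update calls)
// to block until the long read finishes and the writer completes.
func main() {
	var idx sync.RWMutex
	start := time.Now()

	// Long index query holding the read lock for 2 seconds.
	idx.RLock()
	go func() {
		time.Sleep(2 * time.Second)
		idx.RUnlock()
	}()

	// A write arrives and waits for the long read to finish. While it waits,
	// Go's RWMutex blocks any new readers to avoid writer starvation.
	go func() {
		idx.Lock()
		fmt.Printf("write acquired after %v\n", time.Since(start))
		idx.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // let the writer start waiting

	// A "fast" read (ingest path) now also has to wait ~2s.
	idx.RLock()
	fmt.Printf("fast read acquired after %v\n", time.Since(start))
	idx.RUnlock()
}
```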
Yes! Excellent breakdown.
So, what difference in behavior do you observe with this PR applied? Were you observing missing data due to the ingestion blocks? Or was it more along the lines of instances marking themselves as unavailable due to too-high priority (kafka lag)? I suspect that your request latencies are unchanged?
Yes, this mainly solves the issue of instances getting marked unavailable due to queries that took a few seconds to run and blocked ingest at the same time a write was waiting.

We did see a very small improvement in long-tail latencies, but I expect that was mostly due to fewer "bursts" of ingest after being blocked for a few seconds. Most of the latency improvement we got stemmed from enabling the write queue, so the write locks were acquired less frequently.
The PriorityRWMutex makes sense to me, though I still need to review the WriteQueue buffering/flushing changes.
High confidence. We've been running it in prod for about 2 months now.
The write queue change makes sense too. Since getting the index lock is slow, we keep ingesting into the write queue until we acquire it, and thus may breach maxBuffered by however many entries come in while we wait for the index lock. Seems like a very sensible tradeoff.
Still needs cleaning up and more testing, but this PR changes the locking behavior in the index so that long read queries can't block ingest (Update and AddOrUpdate calls).

There is a scenario (common in our case) where long index queries (say 5s) can block the acquisition of a write lock (like in the WriteQueue, or add if not using the WriteQueue), which in turn blocks new read locks (like in Update). This means long index read ops can actually block ingest.

This PR alleviates that problem by differentiating the possibly slow operations from the almost certainly fast operations, so that a write lock is still blocked on long index operations, but the fast operations should only get blocked by write locks (which we should make sure are fast).
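As a rough standalone illustration of that split (two plain sync.RWMutexes and some sleeps, not the PR's actual types): the write lock still waits behind the slow read, but a fast read only ever waits on the write lock itself.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Reads are split into two classes: slow reads take lowPrio (and, in the
// real thing, the fast lock as well), writers take lowPrio then fast, and
// fast reads take only fast. A writer stuck behind a slow read has not
// taken the fast lock yet, so ingest-path reads are not blocked by it.
func main() {
	var lowPrio, fast sync.RWMutex
	start := time.Now()

	// Long index query: a low-priority read holding lowPrio for 2 seconds.
	lowPrio.RLock()
	go func() {
		time.Sleep(2 * time.Second)
		lowPrio.RUnlock()
	}()

	// Write: still has to wait for the slow read, same as before.
	go func() {
		lowPrio.Lock()
		fast.Lock()
		fmt.Printf("write acquired after %v\n", time.Since(start))
		fast.Unlock()
		lowPrio.Unlock()
	}()

	time.Sleep(100 * time.Millisecond) // let the writer start waiting

	// Fast read (ingest path): only touches the fast lock, which no writer
	// holds yet, so it returns almost immediately.
	fast.RLock()
	fmt.Printf("fast read acquired after %v\n", time.Since(start))
	fast.RUnlock()

	time.Sleep(2 * time.Second) // keep main alive long enough to see the write land
}
```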
It also adds logging around long lock waits/holds for non-high-priority locking. I decided not to track the timings for the high-priority locks, since we may be doing them many thousands of times per second.
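Presumably along these lines (BlockContext and preLockTime appear in the diff above; the threshold, helper methods, and log wording are made up for illustration):

```go
package memory

import (
	"log"
	"time"
)

// longLockThreshold is an assumed cutoff; anything slower gets logged.
const longLockThreshold = time.Second

// BlockContext (sketch) remembers when we started waiting for a low-priority
// lock, so we can log both how long we waited and how long we held it.
type BlockContext struct {
	lock         *PriorityRWMutex // the mutex being acquired (see sketch above)
	preLockTime  time.Time        // when we started waiting
	postLockTime time.Time        // when we actually got the lock
}

// Acquired is called right after the lock is obtained and logs long waits.
func (bc *BlockContext) Acquired(op string) {
	bc.postLockTime = time.Now()
	if wait := bc.postLockTime.Sub(bc.preLockTime); wait > longLockThreshold {
		log.Printf("memory-idx: %s waited %s for index lock", op, wait)
	}
}

// Released is called right after the lock is released and logs long holds.
func (bc *BlockContext) Released(op string) {
	if held := time.Since(bc.postLockTime); held > longLockThreshold {
		log.Printf("memory-idx: %s held index lock for %s", op, held)
	}
}
```

The high-priority (per-datapoint) path would skip this bookkeeping entirely, which is consistent with not wanting to time something that runs many thousands of times per second.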