Reject requests instead of crashing when overloaded #638
Comments
This is an interesting problem. It would be useful for me (and potentially for other large users) if you could share some details of your storage, the normal load, and the spikes you're seeing (either here, or via email if you prefer not to share this info publicly). Does the client handle the service outage gracefully? And do you have bazel-remote configured to restart and hopefully recover from the overload? The first things to suggest if you want to maintain performance would be:
If the storage is falling behind due to the number of requests, then I doubt that increasing runtime/debug.SetMaxThreads would help for long (though it probably depends on the duration of the spikes). You're welcome to experiment with this, of course (it would be a one-line patch to try a higher limit).

I think it would be reasonable to make bazel-remote attempt to recognise when the storage is falling behind and reject new requests (e.g. with the unofficial HTTP 529 status code, or the gRPC RESOURCE_EXHAUSTED or UNAVAILABLE codes - though I'm not sure if bazel handles those gracefully). Write actions could for example use semaphore#Weighted.TryAcquire on the same semaphore used for file removal. Prometheus metric requests don't touch the filesystem, and should be infrequent enough anyway to be left as-is.
…erloaded We have been using a file removal semaphore with weight 5,000 (half of Go's default 10,000 maximum OS threads, beyond which Go will crash), in an attempt to avoid crashing when the filesystem/storage layer can't keep up with our requirements. This change renames that semaphore to `diskWaitSem` and also uses it for disk-write operations. When the semaphore cannot be acquired for disk-writes, we return HTTP 503 (service unavailable) or gRPC RESOURCE_EXHAUSTED error codes to the client. Relates to buchgr#638
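For illustration, here is a minimal sketch of this rejection pattern, assuming a shared semaphore named `diskWaitSem` with weight 5000 as described above. The handler names and structure are illustrative, not the actual bazel-remote code:

```go
package disk

import (
	"net/http"

	"golang.org/x/sync/semaphore"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// diskWaitSem limits concurrent disk operations (writes and removals) so the
// process stays well below Go's 10000 OS-thread limit. The weight 5000 is an
// assumption taken from the discussion above.
var diskWaitSem = semaphore.NewWeighted(5000)

// putBlobHTTP rejects the upload with HTTP 503 when the disk is overloaded,
// instead of blocking and eventually exhausting OS threads.
func putBlobHTTP(w http.ResponseWriter, r *http.Request) {
	if !diskWaitSem.TryAcquire(1) {
		http.Error(w, "disk overloaded, please retry later", http.StatusServiceUnavailable)
		return
	}
	defer diskWaitSem.Release(1)
	// ... write the blob to disk ...
}

// putBlobGRPC returns RESOURCE_EXHAUSTED in the same situation.
func putBlobGRPC() error {
	if !diskWaitSem.TryAcquire(1) {
		return status.Error(codes.ResourceExhausted, "disk overloaded, please retry later")
	}
	defer diskWaitSem.Release(1)
	// ... write the blob to disk ...
	return nil
}
```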
Thanks @mostynb! We have patched the bazel client with a custom load balancer. It distributes load over several bazel-remote instances. The patch contains logic to ignore a cache instance for a minute if it experiences remote cache issues, and to handle requests during that time as if they were cache misses. That works well for local builds with remote caches, and should be OK with rejects as well. That patch is not yet compatible with remote execution. For remote execution we use bazel's In both cases, the bazel client behaviour allows the remote cache to recover from the overload. Our normal load is about 500 – 1500 write requests/second per bazel-remote instance. We had three crashes. Our monitoring shows the request rate calculated over 20 second intervals, so I don't know exactly how high or short the spikes were:
I guess the numbers above can vary for different kinds of builds and different file sizes. Thanks for #639 and your quick response! I will try to arrange for us to try it! I'm uncertain about what constants to choose, but I'm afraid 5000 could result in unnecessary rejects for very short high spikes, especially if writes of large files cause a huge number of small-file evictions. I'm thinking about trying something like:
Sorry for the late answer, it took time to get hardware and to experiment. When trying #639, bazel-remote rejects requests instead of crashing, which is good. But it rejected requests even under low load, because writing one single large file can result in the eviction of many thousands of small files, and those eviction spikes can be very unpredictable. We tried increasing the semaphoreWeight from 5000 to 50000 and MaxThreads correspondingly, but it did not help, just as you predicted. We observed that the Go runtime never releases the OS threads it has started, so I suspect that even 5000 concurrent file removal syscalls is more than desired.

However, the good news: if the cache is not full and no evictions occur, then there are no crashes, request latency is fine, and the number of threads forked by the Go runtime stays below 2000. In other words, the overload seems to be caused by the evictions. What if we could minimize evictions at peak load to increase max throughput, and improve write latency by avoiding evictions as part of handling incoming write requests?

We experimented with a single goroutine that proactively evicted files slowly, one at a time, in the background when lru.currentSize is above a threshold. That single goroutine could not evict files as fast as they were written during peaks (~2 minute long peaks), but easily caught up between the peaks. lru.currentSize fluctuated only ~400 MB, and the performance was good, similar to when there are no evictions at all. Other scenarios might be different, but I guess there would be plenty of margin if we evict proactively when lru.currentSize is, for example, above 95% of lru.maxSize. Using disk space as a buffer and wasting 5% of disk space to manage peaks would be a good trade-off for us. As extra safety, perhaps incoming disk.Put could still also perform evictions in the rare case that lru.currentSize reaches lru.maxSize. What do you think @mostynb?
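As a hedged sketch of the proactive-eviction experiment described above (the `lruCache` type and its methods are illustrative stand-ins, not the actual bazel-remote LRU implementation):

```go
package disk

import (
	"sync"
	"time"
)

type lruCache struct {
	mu          sync.Mutex
	currentSize int64
	maxSize     int64
}

// evictOldest removes the least recently used blob from the index and from
// disk, and updates currentSize. (Hypothetical helper for this sketch.)
func (c *lruCache) evictOldest() { /* ... */ }

// overThreshold reports whether the cache has grown beyond the given
// fraction of its maximum size, e.g. 0.95 for 95%.
func (c *lruCache) overThreshold(threshold float64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return float64(c.currentSize) > threshold*float64(c.maxSize)
}

// startProactiveEviction runs a single background goroutine that slowly trims
// the cache, so that incoming Put requests rarely need to evict themselves.
func (c *lruCache) startProactiveEviction(threshold float64) {
	go func() {
		for {
			if c.overThreshold(threshold) {
				c.evictOldest()
				continue // keep evicting while above the threshold
			}
			time.Sleep(100 * time.Millisecond) // idle when below the threshold
		}
	}()
}
```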
Thanks for the update, lots of things to consider here. I have been thinking about changing the file removal so that, when under load, we process the removal of small blobs asynchronously at a slower rate. We might temporarily go over the cache size limit, but if we reduce the effective cache size limit, that would be roughly equivalent to proactive eviction when reaching the limit, and it wouldn't require polling the cache size.
Interesting! What do you have in mind for implementing “asynchronously at a slower rate”? Can it avoid creating a goroutine for each blob? Example: if using 5% of a 2 TB disk as a buffer for blobs scheduled for eviction, there can be many such blobs, maybe 100 000 or more. I'm not sure how the Go runtime would handle that many goroutines, but if a crash printed 100 000 stack traces for me to read, it would be cumbersome. 😃 I imagine having one threshold size that triggers evictions and another upper max size where bazel-remote would start rejecting write requests.
golang.org/x/sync/semaphore is very small; we could fork it and add functions which expose the reserved/current size of the semaphore, and that could be used as a heuristic for when we're under high disk load. e.g. if we added a
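The comment above appears to be truncated. As a hedged illustration of the general idea, here is a sketch of a thin wrapper (rather than an actual fork) around semaphore.Weighted that tracks the currently acquired weight. The `trackedSemaphore` type and its `InUse` method are hypothetical, not part of golang.org/x/sync or bazel-remote:

```go
package disk

import (
	"sync/atomic"

	"golang.org/x/sync/semaphore"
)

// trackedSemaphore wraps semaphore.Weighted and keeps a counter of the
// currently acquired weight, which can serve as a rough "disk load" heuristic.
type trackedSemaphore struct {
	sem   *semaphore.Weighted
	inUse atomic.Int64
}

func newTrackedSemaphore(n int64) *trackedSemaphore {
	return &trackedSemaphore{sem: semaphore.NewWeighted(n)}
}

func (s *trackedSemaphore) TryAcquire(n int64) bool {
	if s.sem.TryAcquire(n) {
		s.inUse.Add(n)
		return true
	}
	return false
}

func (s *trackedSemaphore) Release(n int64) {
	s.inUse.Add(-n)
	s.sem.Release(n)
}

// InUse returns the currently acquired weight, e.g. to decide whether the
// server should start rejecting new write requests.
func (s *trackedSemaphore) InUse() int64 {
	return s.inUse.Load()
}
```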
I'm striving to perform evictions with low priority (compared to the handling of incoming requests) rather than removing blobs immediately, in order not only to manage the peak load, but also to reduce latency for incoming requests regardless of load.
@mostynb @ulrfa I was experimenting a bit with a solution to this issue and would love your feedback on this PoC: #695. The idea is to have a fixed number of goroutines that pick eviction tasks from a queue. Both the number of goroutines and the max length of the queue are configurable, so there is more freedom in deciding how many resources to allocate to eviction and when to consider the service overloaded.
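For illustration, a minimal sketch of such a queue-based design, with configurable queue length and worker count. The names are illustrative and may differ from the actual PoC in #695; note that with a worker count of 1 this degenerates to the single-consumer queue discussed later in the thread:

```go
package disk

import (
	"errors"
	"os"
)

// evictionQueue decouples blob removal from request handling: producers
// enqueue paths to delete, and a fixed pool of workers removes them.
// A full queue can be treated as an "overloaded" signal.
type evictionQueue struct {
	tasks chan string
}

var errOverloaded = errors.New("eviction queue full: service overloaded")

func newEvictionQueue(maxQueueLen, workers int) *evictionQueue {
	q := &evictionQueue{tasks: make(chan string, maxQueueLen)}
	for i := 0; i < workers; i++ {
		go q.worker()
	}
	return q
}

// enqueue schedules a file for removal, or reports overload if the queue is full.
func (q *evictionQueue) enqueue(path string) error {
	select {
	case q.tasks <- path:
		return nil
	default:
		return errOverloaded
	}
}

func (q *evictionQueue) worker() {
	for path := range q.tasks {
		_ = os.Remove(path) // errors could be logged/counted in a real implementation
	}
}
```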
Thanks @AlessandroPatti! Status update from me: I did several experiments and surprisingly found that deleting lots of files sequentially with a single goroutine is much faster than starting separate goroutines and removing them in parallel, despite SSDs with high IOPS performance. I implemented a new set of patches based on those findings and we have used them in production for a few months. We are very happy with them, since they not only avoid the crashes on high load, but also improve the performance in scenarios with many cache evictions. I intended to push them upstream earlier, but I did not have time to rebase and write test cases, so it did not happen. But now I have pushed them as is: #696. My set of patches is based on @mostynb's previous commit and adds 3 parts. The "Optimize file removals (part 1)" is based on a queue, just like @AlessandroPatti's PoC. Two people independently arriving at the same conclusion is a good sign! 😃 @AlessandroPatti, have you benchmarked any scenario where having more than one goroutine deleting files from the queue improves the performance? @mostynb, have you been thinking more about this?
@ulrfa Thanks for sharing your findings! I haven't run any "scientific" benchmark on this, but from experimenting, performing eviction in parallel seems generally better. This is what I tried:
By running the above I can see the eviction queue growing when the concurrency is too low, while it's somewhat stable for concurrency ~100. On another note, the solution you've shared seems to throttle PUT requests with a semaphore, but not GET requests, which could also require eviction if the blob is fetched from the backend. Curious whether you've tried using a proxy backend and found that this is not an issue in your experience?
(I'm busier than normal at the moment, but wanted to mention that my goal is to merge some form of #680 this week, then turn my attention back to this issue.)
Most benchmarks I have been running re-send traffic patterns from recorded real sessions. I need to think about whether I can share something from those. But for cache eviction performance I propose the following benchmark:
In order to measure step 2 above, I prepared commits with additional logging of "Duration for loading and evicting files" for the #695 and #696 pull requests, and also for master:
fill_cache.py is doing:
On one of our cache servers (72 CPUs, 2 x SSDs in RAID-0 config, Linux, XFS), I get the following result: PR #696 (queue with single consuming goroutine):
PR #695 (queue with various numbers of consuming goroutines):
Latest master (goroutine per file and semaphore weight 5000):
It would be interesting to know whether you get a similar pattern on your systems. Would you like to try that, @AlessandroPatti?
Do you think it is possible that the eviction queue is more stable for concurrency ~100 not because old files are removed faster, but because those 100 goroutines slow down the other goroutines that are writing new files, since they compete for resources?
I'm not using the proxy backend and don't have much experience with it (I'm using sharding instead). The throttling via the semaphore in #696 is not because a PUT can result in eviction, but because each PUT file write seems to consume one operating system thread. However, I think you are right that some additional handling is needed for such GET requests with a proxy, since they probably can also result in blocking file write syscalls. Would it make sense to throttle with the same semaphore at the end of the disk.get method as well, e.g. after getting a proxy cache hit and going to write it?
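One possible (hedged) way to handle that case, sketched with illustrative types rather than the actual bazel-remote proxy code: only write a proxy hit back to the local cache when the semaphore has capacity, and otherwise skip the local write rather than rejecting the GET.

```go
package disk

import (
	"golang.org/x/sync/semaphore"
)

// Illustrative stand-ins for this sketch; not the actual bazel-remote types.
type proxyBackend interface {
	Get(hash string) ([]byte, error)
}

type diskCache struct {
	proxy       proxyBackend
	diskWaitSem *semaphore.Weighted
}

func (c *diskCache) writeLocal(hash string, data []byte) { /* ... */ }

// getWithProxy serves a proxy cache hit, but only writes it back to the local
// cache when the disk-operation semaphore has capacity. The GET itself is
// never rejected; under overload we just skip the local write (and any
// eviction it might trigger).
func (c *diskCache) getWithProxy(hash string) ([]byte, error) {
	data, err := c.proxy.Get(hash)
	if err != nil {
		return nil, err
	}
	if c.diskWaitSem.TryAcquire(1) {
		c.writeLocal(hash, data)
		c.diskWaitSem.Release(1)
	}
	return data, nil
}
```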
As a reference, with the same pre-filled cache of 10 GB of tiny files:
A naive attempt to use parallelism via xargs only makes it worse:
When deleting complete subdirectories, parallelism seems to help slightly. (Maybe it helps to not delete files in the same directory from different concurrent processes? But I don't think it is worth having separate eviction queues per subdirectory, because there will be other writes of new files going on concurrently anyway.)
Thanks @ulrfa! I added a small benchmark in 420f4ad that seems to somewhat confirm what you've experienced. Increasing the number of goroutines does not make it worse, but it doesn't make it better either.
Thanks for running the benchmark @AlessandroPatti! Both #695 and #696 introduce queues for evicted files, and I think both our benchmarks motivate that approach. I think the next step is to hear what Mostyn thinks about a queue for files to be removed. But I guess Mostyn is still busy. I'm also busy at the moment, which is why I'm responding with a delay; sorry for that.
We experience bazel-remote crashes when the local SSD is overloaded by write operations. It would be preferable if bazel-remote rejected requests instead of crashing.
The crash message contains:
`runtime: program exceeds 10000-thread limit`
and among the listed goroutines there are almost 10000 stack traces containing either diskCache.Put or diskCache.removeFile, doing syscalls. They originate from both HTTP and gRPC requests. What do you think about rejecting incoming requests when reaching a configurable maximum number of concurrent diskCache.Put invocations? Or are there better ways?
We have not experienced overload from read requests; the network interface is probably saturated before that would happen. Therefore, it seems beneficial to still allow read requests and get cache hits even when writes are rejected.
I'm uncertain about the proper relation between the semaphore constant (currently 5000) for diskCache.removeFile, the number of allowed concurrent diskCache.Put calls, and Go's default limit of 10 000 operating system threads. Would it make sense to set one limit for the sum of diskCache.removeFile and diskCache.Put? Should bazel-remote support tuning https://pkg.go.dev/runtime/debug#SetMaxThreads?
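For illustration, a hedged sketch of how a configurable thread limit via runtime/debug.SetMaxThreads could be kept in proportion with the disk-operation semaphore weight. The flag names are hypothetical, not existing bazel-remote options:

```go
package main

import (
	"flag"
	"runtime/debug"
)

// Hypothetical flags for this sketch only.
var (
	maxThreads  = flag.Int("max_threads", 10000, "limit on OS threads the Go runtime may create")
	diskOpLimit = flag.Int("disk_op_limit", 5000, "max concurrent Put/removeFile disk operations")
)

func main() {
	flag.Parse()

	// Raise (or lower) the Go runtime's OS thread limit; the default is 10000.
	debug.SetMaxThreads(*maxThreads)

	// One possible invariant: keep the disk-operation semaphore weight well
	// below the thread limit, leaving headroom for reads, gRPC/HTTP serving
	// and Prometheus scrapes.
	if *diskOpLimit > *maxThreads/2 {
		*diskOpLimit = *maxThreads / 2
	}

	// ... create the semaphore with weight *diskOpLimit and start the server ...
}
```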
It is important to not reject or hinder Prometheus HTTP requests, because the metrics are even more important in overload conditions.