IMPORT: OOM on larger cluster sizes #35773
It would also be nice to have the node itself more resilient to this by limiting the number of expensive requests it processes concurrently -- we've done something similar with
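For illustration only, here is a minimal sketch of that kind of admission limit, using a plain buffered-channel semaphore; the names and the limit of 2 are hypothetical and this is not CockroachDB's actual mechanism:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// expensiveReqSem caps how many expensive requests (e.g. AddSSTable) a node
// evaluates at once. The limit of 2 is purely illustrative.
var expensiveReqSem = make(chan struct{}, 2)

// handleExpensiveRequest acquires a slot before doing the memory-heavy work,
// so a burst of incoming requests queues up instead of all allocating at once.
func handleExpensiveRequest(ctx context.Context, id int) error {
	select {
	case expensiveReqSem <- struct{}{}:
		defer func() { <-expensiveReqSem }()
	case <-ctx.Done():
		return ctx.Err()
	}
	time.Sleep(100 * time.Millisecond) // stand-in for deserializing + evaluating a large command
	fmt.Println("handled request", id)
	return nil
}

func main() {
	ctx := context.Background()
	var wg sync.WaitGroup
	for i := 0; i < 6; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			_ = handleExpensiveRequest(ctx, id)
		}(i)
	}
	wg.Wait()
}
```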
we still need to do more profiling to figure out exactly what's using so much memory when handling an AddSSTable request, and I think we can also trim it, e.g. I think we could trivially avoid one copy by making a leveldb
Here is the panic I see on 9 nodes in a 60 node cluster during import:
4 additional nodes OOMed. Here are the repro steps in this closed issue.
Spent some time today with @yuzefovich looking at the OOMs reported in #34878.
Most of the time we observe an OOM, the idle Go memory is significantly larger than the memory actually in use. My limited understanding is that this implies we recently, briefly used a significant amount of memory that the GC has not yet released back to the OS.
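For context, the "idle vs. in-use" distinction above corresponds to the Go runtime's heap stats; a small sketch of how one might observe it directly (outside of the heap profiles we were actually looking at):

```go
package main

import (
	"fmt"
	"runtime"
)

// printHeapStats prints the numbers behind the "idle vs. in-use" observation:
// HeapIdle is memory the runtime still holds but isn't currently using, e.g.
// spans freed by a recent GC that haven't been returned to the OS yet.
func printHeapStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("heap in-use:    %d MB\n", m.HeapInuse>>20)
	fmt.Printf("heap idle:      %d MB\n", m.HeapIdle>>20)
	fmt.Printf("released to OS: %d MB\n", m.HeapReleased>>20)
}

func main() {
	// Briefly allocate ~512mb, drop the references, and force a GC: the memory
	// typically shows up as idle rather than immediately going back to the OS.
	bufs := make([][]byte, 64)
	for i := range bufs {
		bufs[i] = make([]byte, 8<<20)
	}
	bufs = nil
	runtime.GC()
	printHeapStats()
}
```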
We're not quite sure what used (and then stopped using) so much memory, but looking at a sampling of profiles, handling AddSSTable commands dominates most of them, with the majority of that spent just deserializing these huge (32mb) commands from gRPC. The evaluation of an AddSSTable command also makes some copies of the input data -- I think I see one obvious one for leveldb SSTable iteration, plus some other steps -- so evaluating one 32mb AddSSTable command could use (and then release) a non-trivial amount of memory.
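I haven't confirmed exactly where that copy happens, but as a purely illustrative sketch (not the actual cgo/leveldb reader path), the difference between an iterator that duplicates the payload and one that wraps the bytes we already hold looks roughly like this:

```go
package main

import (
	"bytes"
	"fmt"
)

// iterateCopying models a reader that first duplicates the command payload
// into its own buffer, so the whole payload is briefly resident twice.
func iterateCopying(sst []byte) *bytes.Reader {
	dup := make([]byte, len(sst))
	copy(dup, sst) // the extra, avoidable transient allocation
	return bytes.NewReader(dup)
}

// iterateWrapping models the "avoid one copy" idea: iterate directly over the
// bytes we already hold from the decoded request.
func iterateWrapping(sst []byte) *bytes.Reader {
	return bytes.NewReader(sst)
}

func main() {
	payload := make([]byte, 32<<20) // stand-in for a 32mb AddSSTable payload
	fmt.Println(iterateCopying(payload).Len(), iterateWrapping(payload).Len())
}
```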
In the 60 node cluster that @yuzefovich is testing with, I think it is likely that with 60 producers all making and sending SSTs out, a given node might see a large spike in memory usage if it received e.g. 10 or 20 such SSTs at the same time. If handling each required even just a couple of copies, we could pretty quickly be using a couple of gigs just for this -- e.g. 20 concurrent 32mb SSTs at roughly three transient in-memory copies each is already around 2gb.
The nodes in this cluster have a total of 8gb of RAM, so with cgo/rocks sitting on 2-3gb and various other SQL and server machinery sitting on a gig, it seems easy to get into the danger zone quickly.
Reducing the SST size from 32mb to 16mb appeared to make the job complete successfully, so in the short term we may need to document that in larger clusters you may need to reduce the SST size if individual nodes don't have enough headroom for the spikes that a larger number of processors making and sending SSTs can produce.