tcmalloc bug when it handles non-sequential CPUs #188
Comments
We're using Envoy version 1.24, which uses the tcmalloc version from 2022-08-06.
I believe this is a bug in |
@ckennelly It looks like Abseil ends up calling:
Which is reasonable at first glance, and changing that behavior would probably break a lot of things. Restating the issue: the set of online CPUs is discontinuous, so there are holes in the CPU IDs. tcmalloc/tcmalloc/internal/percpu.cc Lines 177 to 181 in 201cb85
i.e. if we have CPUs [0,2,4,6] online, it seems like we'd only try to initialize 0-3 with our Admittedly I'm just reading the code now, so I could totally be missing another path! We're happy to open an issue on the Abseil side, but I don't think they'll change things there, since they are specifically querying for the online CPU count. What's the best path forward here? |
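To make the online-versus-configured distinction concrete, here is a minimal sketch, assuming Linux with glibc. It uses the standard `sysconf` and `sched_getcpu` calls; `CpuView`, `ReadCpuView`, and `IndexFits` are hypothetical names for illustration and are not taken from tcmalloc or Abseil.

```cpp
#include <sched.h>   // sched_getcpu (glibc; g++ defines _GNU_SOURCE by default)
#include <unistd.h>  // sysconf

// Hypothetical snapshot of what the OS reports about CPUs.
struct CpuView {
  long online;      // number of CPUs currently online
  long configured;  // number of CPUs configured, including offline ones
  int current;      // id of the CPU running this thread, or -1 on error
};

CpuView ReadCpuView() {
  return CpuView{sysconf(_SC_NPROCESSORS_ONLN),
                 sysconf(_SC_NPROCESSORS_CONF),
                 sched_getcpu()};
}

// A table indexed by CPU id is only safe if the id is below the table size.
bool IndexFits(int cpu_id, long table_size) {
  return cpu_id >= 0 && cpu_id < table_size;
}
```

With CPUs [0,2,4,6] online, the online count is 4, so `IndexFits(4, 4)` and `IndexFits(6, 4)` are both false even though those CPUs can run threads. Whether sizing by the configured count is sufficient in every environment is a separate question this sketch does not answer.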
It's worth filing a bug, IMO, if only to signal that it's a real problem affecting real users. (Ping, @jyknight, because reasons.) |
Yeah, I think Maybe they'd be interested in adding a function that returns the other value (e.g. |
Yes, that's my outside perspective; however, each of you knows this code much better than I do. Would writing a test similar to what I have below be a reasonable start?
I've been out of the C++ game for a bit, and suspect someone here knows the desired style better. (I could probably write it, but it would probably be ugly, and we'd probably have code contribution agreement hoops) Once we had a branch with a failing test, it seems like we can probably find a good fix. |
@clundquist-stripe : |
We should have that capability then, we're around here https://packages.ubuntu.com/focal/linux-aws I don't know the tcmalloc source, so I could have easily chased the wrong lead! Our mitigation is I don't think we can post a coredump here, but is there anything else we can help provide to debug this issue? email wise I'm [email protected], if it helps! |
I think the underlying problems are around To separate out that issue, if you replaced |
Yes, that matches my understanding.
I very likely linked/found the wrong snippet of code though.
This would take us a while to test, since we'd have to rebuild Envoy. If it did "fix" it, what would be the end game though? |
Endgame would be to modify the implementation of |
This is impacting Envoy when running inside a container that uses lxcfs to mount What is the CPU count used for in tcmalloc? Our workaround is to let the container read the host's |
TCMalloc is "Thread-Caching Malloc". The CPU count is used to initialize a per-CPU slab to allocate from in user space, for better cache locality.
Are you aware of any potential downsides or drawbacks of allowing TCMalloc within a container to believe it has the CPU count of the host? |
I can only speculate, since I don't know the TCMalloc code too well, but having written similar things for games, the downside would be slightly higher overhead in initializing structures and less efficient heap usage. In practice, it probably won't matter unless you hit some pathological worst case.
If the The CPU count returned by |
@jfernandez I think what @ckennelly is getting at is that it could still be an issue, since the CPU IDs may not line up if there are holes in the CPU list; having more CPUs just reduces the chances. We used something like this: Which seemed to work for us.
@ckennelly any traction here with the Abseil folks? |
PiperOrigin-RevId: 547256513 Change-Id: I44c42b154241bbbba21efa0cce1e8e4bc0f6a625
PiperOrigin-RevId: 550628877 Change-Id: Id6e6bb6fe8148a692a0376aeb0d0172a9dc74038
We opened an issue with envoyproxy, envoyproxy/envoy#27775, about Envoy crashing while validating the bootstrap config.
We found a potential root cause: a tcmalloc bug where it is unable to handle non-sequential online CPUs. The segfaults happen on EC2 instances with Nitro Enclaves enabled, where some CPUs have been hot-unplugged, i.e.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 16
On-line CPU(s) list: 0,2-8,10-15
Off-line CPU(s) list: 1,9
The theory is that tcmalloc uses the CPU's id to index into the per-CPU arrays that hold the per-CPU data structures. If tcmalloc allocates 14 entries because ncpu is 14, but the highest online CPU id is 15, then the array access is out of bounds.
Can you confirm whether that's the actual root cause, and whether it has been fixed by any commit?
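The arithmetic behind that theory can be checked with a small sketch. This is not tcmalloc code: `ParseCpuList` and `OnlineCountTooSmall` are hypothetical helpers that expand a CPU list in the format `lscpu` prints and compare the online count against the highest online id.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: expands a Linux CPU list such as "0,2-8,10-15"
// into the individual CPU ids it names.
std::vector<int> ParseCpuList(const std::string& list) {
  std::vector<int> ids;
  std::stringstream ss(list);
  std::string token;
  while (std::getline(ss, token, ',')) {
    if (token.empty()) continue;
    const std::size_t dash = token.find('-');
    const int first = std::stoi(token.substr(0, dash));
    const int last = dash == std::string::npos
                         ? first
                         : std::stoi(token.substr(dash + 1));
    for (int id = first; id <= last; ++id) ids.push_back(id);
  }
  return ids;
}

// True if an array sized by the number of online CPUs would be too small
// to be indexed by the highest online CPU id.
bool OnlineCountTooSmall(const std::vector<int>& online_ids) {
  if (online_ids.empty()) return false;
  const int max_id = *std::max_element(online_ids.begin(), online_ids.end());
  return max_id >= static_cast<int>(online_ids.size());
}
```

For the list above, `ParseCpuList("0,2-8,10-15")` yields 14 ids with a maximum of 15, so `OnlineCountTooSmall` returns true: ids 14 and 15 would fall outside a 14-entry array. For a contiguous list such as `"0-3"` it returns false.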
Thanks