
Question about per-cpu cache interaction with hot added vCPUs to VMs. #267

Closed
adityajaltade opened this issue Oct 16, 2024 · 4 comments

@adityajaltade

Hello,
This is more of a question, but it could be a bug report if it holds up.
Background:
We recently observed crashes while running Envoyproxy (v1.25.9 + custom filters, tcmalloc: 59400332b9cff9920b6a1da203ac1575272a9f44) in our environment, and the crashes coincided with the addition of new vCPUs to the running VM. The process was running natively on the VM, not in a Docker container.
The crash was 100% reproducible, but the stack traces differed each time; in most of them, the common denominator was one or more tcmalloc functions.
Building with gperftools instead made the crash go away, which narrowed it down to tcmalloc.

We then wrote a small producer/consumer program in which the producer allocates heap memory and multiple consumers write to that memory and then free it. Sure enough, this small test program crashes in a similar way when vCPUs are hot-plugged.
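A minimal C++ sketch of this kind of reproducer follows. It is an illustration, not the program described above: the names `Producer`, `Consumer`, `RunStress`, the mutex-protected queue, and the buffer sizes are all hypothetical. Buffers are allocated on one thread and written and freed on others, so a freed object may be returned to a different CPU's cache than the one that served the allocation.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical work queue shared by one producer and several consumers.
struct WorkQueue {
  std::mutex mu;
  std::condition_variable cv;
  std::queue<void*> items;
  bool done = false;
};

// Allocates `count` buffers of `size` bytes and hands them to consumers.
void Producer(WorkQueue& q, std::size_t count, std::size_t size) {
  for (std::size_t i = 0; i < count; ++i) {
    void* p = std::malloc(size);
    if (p == nullptr) std::abort();
    {
      std::lock_guard<std::mutex> lock(q.mu);
      q.items.push(p);
    }
    q.cv.notify_one();
  }
  {
    std::lock_guard<std::mutex> lock(q.mu);
    q.done = true;
  }
  q.cv.notify_all();
}

// Writes to and frees buffers until the producer is done and the queue is
// drained. Returns how many buffers this consumer freed.
std::size_t Consumer(WorkQueue& q, std::size_t size) {
  std::size_t freed = 0;
  for (;;) {
    void* p;
    {
      std::unique_lock<std::mutex> lock(q.mu);
      q.cv.wait(lock, [&] { return q.done || !q.items.empty(); });
      if (q.items.empty()) return freed;  // done and fully drained
      p = q.items.front();
      q.items.pop();
    }
    std::memset(p, 0xAB, size);  // touch the memory from this thread
    std::free(p);                // cross-thread free
    ++freed;
  }
}

// Runs one round with `consumers` consumer threads; returns total frees.
std::size_t RunStress(std::size_t count, std::size_t size, int consumers) {
  WorkQueue q;
  std::vector<std::size_t> freed(static_cast<std::size_t>(consumers), 0);
  std::vector<std::thread> threads;
  for (int i = 0; i < consumers; ++i) {
    threads.emplace_back([&, i] { freed[i] = Consumer(q, size); });
  }
  Producer(q, count, size);
  for (auto& t : threads) t.join();
  std::size_t total = 0;
  for (std::size_t f : freed) total += f;
  return total;
}
```

Linking a `main` that calls, for example, `RunStress(10000000, 256, 8)` in a loop against tcmalloc, and hot-adding a vCPU while it runs, would approximate the scenario described; whether this particular sketch reproduces the crash on a given build is untested.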

Admittedly, we have not been able to test with any tcmalloc version newer than e33c7bc60415127c104006d3301c96902f98d42a, which is the latest version that Envoyproxy depends on.

Question:
How does tcmalloc's per-CPU cache handle vCPUs being hot-plugged into a running VM? I'm very interested in knowing how this is, or could be, done.
Is this a bug, an unsupported configuration, or something that has been fixed since e33c7bc60415127c104006d3301c96902f98d42a?

Any pointers are much appreciated.

-Aditya

Example traces

0x00005555555c3bd0 in tcmalloc::tcmalloc_internal::central_freelist_internal::StaticForwarder::MapObjectToSpan(void const*) ()
(gdb) bt
#0  0x00005555555c3bd0 in tcmalloc::tcmalloc_internal::central_freelist_internal::StaticForwarder::MapObjectToSpan(void const*) ()
#1  0x00005555555c0019 in tcmalloc::tcmalloc_internal::central_freelist_internal::CentralFreeList<tcmalloc::tcmalloc_internal::central_freelist_internal::StaticForwarder>::InsertRange(absl::Span<void*>) ()
#2  0x00005555555c0657 in tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<tcmalloc::tcmalloc_internal::cpu_cache_internal::StaticForwarder>::Deallocate(void*, unsigned long)::Helper::Overflow(int, unsigned long, void*, void*) ()
#3  0x00005555555b40ba in free ()
#4  0x000055555557f57e in consumer(void*) ()
#5  0x00007ffff78331ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#6  0x00007ffff748e8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

0x00005555555c23bc in tcmalloc::tcmalloc_internal::central_freelist_internal::CentralFreeList<tcmalloc::tcmalloc_internal::central_freelist_internal::StaticForwarder>::RemoveRange(void**, int) ()
(gdb) bt
#0  0x00005555555c23bc in tcmalloc::tcmalloc_internal::central_freelist_internal::CentralFreeList<tcmalloc::tcmalloc_internal::central_freelist_internal::StaticForwarder>::RemoveRange(void**, int) ()
#1  0x00005555555c1c5a in tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<tcmalloc::tcmalloc_internal::cpu_cache_internal::StaticForwarder>::Refill(int, unsigned long) ()
#2  0x00005555555c2f8a in tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<tcmalloc::tcmalloc_internal::cpu_cache_internal::StaticForwarder>::Allocate<&tcmalloc::tcmalloc_internal::TCMallocPolicy<tcmalloc::tcmalloc_internal::MallocOomPolicy, tcmalloc::tcmalloc_internal::AlignAsPolicy, tcmalloc::tcmalloc_internal::AllocationAccessHotPolicy, tcmalloc::tcmalloc_internal::InvokeHooksPolicy, tcmalloc::tcmalloc_internal::LocalNumaPartitionPolicy>::handle_oom>(unsigned long)::Helper::Underflow(int, unsigned long, void*) ()
#3  0x00005555555b4d1a in memalign ()
#4  0x000055555557f42a in producer(void*) ()
#5  0x00007ffff78331ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#6  0x00007ffff748e8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
@ckennelly
Collaborator

I think it'd be important to try to reproduce with a build at or after 5823a86 (from July 2023). e33c7bc is from October 2022.

Prior to that, we used absl::base_internal::NumCPUs() to get the number of CPUs for sizing our per-CPU arrays. If a CPU was offlined, it wasn't included in the count and the array would be too small for the cpu_id values the kernel might give us.

That patch switches us to reading /sys/devices/system/cpu/possible instead. There was prior discussion of offlining CPUs in #188.
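To make the sizing difference concrete, here is a small C++ sketch. It parses the Linux cpulist format used by `/sys/devices/system/cpu/possible` and `/sys/devices/system/cpu/online` (e.g. `0-3,8-11`) and compares two ways of sizing a per-CPU array: by the number of online CPUs (roughly what a count like `NumCPUs()` reflects, per the description above) and by the highest possible CPU id plus one. The helper names are hypothetical and this is not tcmalloc's code; it only assumes that per-CPU arrays are indexed directly by the kernel-reported CPU id.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: expands a cpulist string such as "0-3,8-11\n"
// into the individual CPU ids it names.
std::vector<int> ParseCpuList(const std::string& list) {
  std::vector<int> cpus;
  std::stringstream ss(list);
  std::string range;
  while (std::getline(ss, range, ',')) {
    // Skip tokens with no digits (e.g. a stray trailing newline).
    if (range.find_first_of("0123456789") == std::string::npos) continue;
    std::size_t dash = range.find('-');
    int lo = std::stoi(range.substr(0, dash));
    int hi = (dash == std::string::npos) ? lo
                                         : std::stoi(range.substr(dash + 1));
    for (int c = lo; c <= hi; ++c) cpus.push_back(c);
  }
  return cpus;
}

// Sizing by how many CPUs are listed (e.g. the online set). An id that is
// >= this value has no slot.
std::size_t SlotsFromOnlineCount(const std::string& online) {
  return ParseCpuList(online).size();
}

// Sizing by the highest possible CPU id + 1, so every id the kernel can
// report (including hot-pluggable ones) has a slot.
std::size_t SlotsFromPossible(const std::string& possible) {
  std::size_t slots = 0;
  for (int cpu : ParseCpuList(possible)) {
    std::size_t needed = static_cast<std::size_t>(cpu) + 1;
    if (needed > slots) slots = needed;
  }
  return slots;
}
```

For example, on a VM with CPUs `0-3` online and room to hot-add up to CPU 7, `SlotsFromOnlineCount("0-3")` returns 4, so a newly added CPU 4 would index past the end of the array, while `SlotsFromPossible("0-7")` returns 8 and covers it. Likewise, with CPU 2 offline (`0-1,3`), the online count is 3 but CPU 3 needs a fourth slot. Whether `possible` actually includes hot-pluggable slots depends on how the hypervisor and kernel are configured.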

@adityajaltade
Author

adityajaltade commented Oct 17, 2024

Thanks for the information, @ckennelly.
With a build at 5823a86, the crash no longer happens every time a new vCPU is added. It is now sporadic, occurring in roughly 10% of hot-add events, but it still always coincides with adding a vCPU.
So there is a definite improvement, but the problem still exists.

#0  0x00005555555c226f in tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<tcmalloc::tcmalloc_internal::cpu_cache_internal::StaticForwarder>::Deallocate(void*, unsigned long)::Helper::Overflow(int, unsigned long, void*, void*) ()
#1  0x00005555555b2a13 in free ()
#2  0x0000555555581bde in consumer(void*) ()
#3  0x00007ffff78331ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007ffff748e8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

and

#0  0x00005555555c206f in tcmalloc::tcmalloc_internal::cpu_cache_internal::CpuCache<tcmalloc::tcmalloc_internal::cpu_cache_internal::StaticForwarder>::Deallocate(void*, unsigned long)::Helper::Overflow(int, unsigned long, void*, void*) ()
#1  0x00005555555b2813 in free ()
#2  0x0000555555581a0e in consumer(void*) ()
#3  0x00007ffff78331ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#4  0x00007ffff748e8d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@ckennelly
Collaborator

It might be easier to start with the latest version at head. A lot has changed in the per-CPU cache in the last 12 months.

Is the crash consistently in the deallocation parts of the per-CPU cache, though? If you can compile with file+line debugging information, that might give some indication as to what's being accessed.

@ckennelly ckennelly reopened this Oct 18, 2024
@adityajaltade
Author

adityajaltade commented Oct 21, 2024

I could not reproduce this with a more recent version, at f9f84f7d93a51ad1d70b5dd3a693cbce8c29bd82.

Thanks for the pointers, @ckennelly. While I still don't know what caused this issue or exactly how it got fixed, it no longer appears to occur.
