-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Envoy cpuset size to set the default number or worker threads #5975
Conversation
After (re-)reading the commit guidelines, I created issue #5976 for discussion. |
@ipuustin thanks for working on this. My feeling is that this is too risky to universally change. I'm thinking we can potentially add a new CLI option to enable this behavior? WDYT? Also, I would prefer if we could factor out the Linux specific code into a dedicated CC file without any ifdefs, and then this file can be excluded on OSX/Windows, etc. |
@mattklein123 Thanks for review comments. I understand very well the risk consideration. I'm totally fine with adding a command line switch. Maybe |
This proposal sounds great to me. |
@ipuustin, @mattklein123 There is a potential issue with this with the integration of Envoy and VPP. VPP comes up early and grabs some cores for DPDK and packet graph processing. I don't know how deterministic VPP's core selection is, but it is guaranteed that those cores can be pegged at 100% of CPU under load. I figured we'd be fine as long as we don't try to set core affinities for Envoy's cpuset. Is this the case? |
@sbelair2 This PR doesn't really deal with how Envoy's cpuset might be assigned. It just tries to make the default number of worker threads smaller if Envoy is running on a limited cpuset. If Envoy and VPP are running in the same container, they indeed share the same assigned cpuset. When VPP "grabs some cores" it probably limits itself running on those, but Envoy's cpuset isn't limited unless there's some startup wrapper which makes Envoy run on the remaining of the original cpuset. I think the consideration there should just be making the cpuset to be large enough to accommodate both Envoy and VPP. It shouldn't matter too much if there's one or two Envoy worker threads too many because VPP keeps one or two cores at full load. |
@ipuustin Thanks for the clarification. Btw, VPP will be running outside of any containers, but I see no issue as you state it. |
865395f
to
fbe451a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this is great, some high level comments and then will take a deeper look on the next pass. Thank you!
/wait
docs/root/operations/cli.rst
Outdated
@@ -76,6 +76,12 @@ following are the command line options that Envoy supports. | |||
`connection` component to run at `trace` level, you should pass ``upstream:debug,connection:trace`` to | |||
this flag. See ``ALL_LOGGER_IDS`` in :repo:`/source/common/common/logger.h` for a list of components. | |||
|
|||
.. option:: --cpuset-threads | |||
|
|||
*(optional)* This flag is used to control the number of worker threads if --concurrency option is not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I assume this is only a Linux thing? Can we potentially make that a little more and possible link to some definition / man page / etc. of cpuset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a mention that this is about Linux and a link to the kernel docs.
|
||
namespace Envoy { | ||
|
||
uint32_t OptionsImplPlatform::getCpuCount() { return std::thread::hardware_concurrency(); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we print a warning here that the platform does not support cpuset so Envoy is defaulting to HW concurrency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's a good idea (since it's obviously not working as the user tries to use that). I added the warning.
|
||
namespace Envoy { | ||
|
||
uint32_t OptionsImplPlatform::getCpuCount() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get some test coverage of this code? Potentially a test that only runs in Linux? If you want to test it using mocks you can also consider using OsSysCalls (that might be good also to test the edge cases).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some tests. I chose to refactor the getCpuCount()
function into two to make the testing easier.
return hw_threads; | ||
} | ||
|
||
std::string statusLine; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@oschaaf what are your thoughts on this vs. https://github.com/envoyproxy/envoy-perf/blob/master/nighthawk/source/common/utility.cc#L17?
@ipuustin can you merge master to pick up the |
if (threads > 0 && threads <= hw_threads) { | ||
return threads; | ||
} | ||
return hw_threads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might mean we return 0, which may surprise an unsuspecting caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. I added a check to make sure hw_threads >= 1
.
|
||
uint32_t OptionsImplPlatform::getCpuCount() { | ||
unsigned int hw_threads = std::thread::hardware_concurrency(); | ||
std::string path("/proc/" + std::to_string(getpid()) + "/status"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Would '/proc/self/status' work as well here? Perhaps slightly simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, fixed.
681d16a
to
2f5f912
Compare
So I have been mulling this over for a bit, and one thing that crossed my mind, is that at some point it may make sense to discern between optimizing for throughput and optimizing for latency when considering available hardware resources. Specific to this proposed feature, such priority would make a difference in how to treat logical cpu's resulting from hyperthreading, depending on what should be prioritized. If latency is important, then maybe you want to avoid using logical cores.. if bandwidth is important, then maybe there's a factor we can derive that works best for Envoy (1.5 threads per HT-enabled cpu? 2? I'm not sure). In any case, over at https://github.com/envoyproxy/envoy-perf/blob/master/nighthawk/source/common/utility.cc#L17, However, I also peeked at https://github.com/open-mpi/hwloc/blob/master/hwloc/topology-linux.c#L4941 Also, I noticed that libhwloc does not seem to use
Considering our purpose to have a machine parse it, I wonder, is (ps - As far as I can tell the libhwloc license is pretty permissive: https://github.com/open-mpi/hwloc/blob/master/COPYING) |
@oschaaf Thanks for the insightful comment! Having heuristics which try to estimate whether HT is enabled and estimate the thread number based on that might be very tricky, because I think that the performance difference is depending on the SMT implementation that the processor has. Another way could be providing known-to-be-fast templates for different purposes: "If you want to optimize for XYZ, try starting with this configuration, and benchmark your workload..." For a single system (or a class of systems) it could be possible to find good thread factors. I've heard (but haven't benchmarked this myself) that disabling HT on some cores can be pretty closely "emulated" by this algorithm: find out the HT pairs in the CPU topology ( If we can come up with a suitable "default" thread number factor, that factor can be made to scale with the number of CPUs available. For a example, maybe there could be two worker threads if there was only one CPU, but 25 worker threads for 20 CPU cores? The current factor of 1 is pretty simplistic. But coming back to this PR, what changes are you proposing? Should we add libhwloc dependency and use that? Or use |
I feel adding libhwloc as a dependency for just this may be a bit heavy-weight if we only need a small subset of functionality out of it, but perhaps it could serve as inspiration when pursuing the current approach to implement this feature? For example, it also considers I'm not a lawyer, but I think even just copying some code out of it is OK as long as care is taken to properly attribute. For just counting cpu's with affinity, then imho |
@oschaaf : I added a patch changing the logic to use Do you mean by cgroups that we would access |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this looks great. A few small comments.
/wait
@@ -0,0 +1,24 @@ | |||
#pragma once |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add something like:
#ifndef __LINUX__
#error "Linux platform included on non-Linux"
#endif
(or something like that). Same for the impl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done.
namespace Envoy { | ||
namespace Api { | ||
|
||
class LinuxOsSysCallsImpl : public OsSysCalls, LinuxOsSysCalls { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there value in embedding and calling through? It seems like we could just have 2 syscall impls that are used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed this to work the way you suggested. I was originally thinking that maybe it would make sense that there would be only one *OsSysCalls
version for each platform, which would support all the system calls available for the platform. I'm totally fine with either way though.
// that can be known. | ||
concurrency_ = OptionsImplPlatform::getCpuCount(); | ||
} else { | ||
if (concurrency.isSet() && cpuset_threads_ && cpuset_threads.isSet()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a test for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a test.
linux_os_syscalls.sched_getaffinity(pid, sizeof(cpu_set_t), &mask); | ||
if (result.rc_ == -1) { | ||
// Fall back to number of hardware threads. | ||
return hw_threads; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test for this case? The value in adding linux syscalls is now you can have a mock and unit test this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Add a function for finding the size of the cpuset of the process. In non-Linux platforms, fall back to the number of HW threads in the system. Signed-off-by: Ismo Puustinen <[email protected]>
Add a new configuration option "--cpuset-threads". If that option is set and "--concurrency" option is not set, only start a number of worker threads which equals to the size of the available cpuset. For example, if Envoy is started in a Docker container like this, only four worker threads are created if only "--cpuset-threads" option is set: $ docker run --cpuset-cpus="0-3" ... Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
Also describe the changed server_info structure. Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
…nabled. Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
Do not call-through to the non-Linux-specific system calls using the Linux system call API object. Signed-off-by: Ismo Puustinen <[email protected]>
Signed-off-by: Ismo Puustinen <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you looks awesome!
/retest
🔨 rebuilding |
…voyproxy#5975) Signed-off-by: Ismo Puustinen <[email protected]> Signed-off-by: Fred Douglas <[email protected]>
srcs = ["os_sys_calls_impl.cc"], | ||
hdrs = ["os_sys_calls_impl.h"], | ||
srcs = ["os_sys_calls_impl.cc"] + select({ | ||
"@bazel_tools//src/conditions:linux_x86_64": ["os_sys_calls_impl_linux.cc"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FTR, this breaks the Google import since we don't have @bazel_tools
conditional in this form internally. I'd prefer to stick with //bazel:BUILD
conditionals, I will do an update PR for this.
unsigned int fake_cpuset_size = std::thread::hardware_concurrency(); | ||
unsigned int fake_hw_threads = fake_cpuset_size; | ||
|
||
EXPECT_EQ(OptionsImplPlatformLinux::getCpuAffinityCount(fake_hw_threads), fake_cpuset_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're seeing this break inside sandboxed build environments.. I think we will work around this by just disabling this feature, but if others notice it, might be worth further investigation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a bug in the test design -- if you run the test in an environment where cpuset is reduced, the hardware thread count will not match with the cpuset size. What we can say is that the actual cpuset size <=
hw thread count. I need to think if the test can be fixed in a better way.
* Use cpuset size instead of total hardware threads to determine total worker threads. * Can be overridden by explicitly setting `--concurrency` arg More in https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-cpuset-threads Discussion in envoyproxy/envoy#5975 Relates to envoyproxy#1718 Signed-off-by: Arko Dasgupta <[email protected]>
* Set `--cpuset-threads` in EnvoyProxy cmdline arg * Use cpuset size instead of total hardware threads to determine total worker threads. * Can be overridden by explicitly setting `--concurrency` arg More in https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-cpuset-threads Discussion in envoyproxy/envoy#5975 Relates to #1718 Signed-off-by: Arko Dasgupta <[email protected]> * lint Signed-off-by: Arko Dasgupta <[email protected]> --------- Signed-off-by: Arko Dasgupta <[email protected]>
Description:
If
--concurrency
option is not specified, set the default number of worker threads based on the cpuset of the process. This means that the amount of worker threads is not controlled by the hardware thread count but instead the cpu allocation.For example, if Envoy is started in a Docker container like this, only four worker threads are created if
--concurrency
option is not specified:The original behavior was that the number of CPU threads in the system would be used to determine the number of worker threads. This number could be very large for servers with multiple physical CPUs and hyperthreading enabled.
This change has a potentially large performance impact, since it changes the default number of threads assigned to Envoy. However, since the original intent of the default concurrency value was to have the number of worker threads matching to the HW capabilities, after this change the Envoy behavior is probably closer to what is expected. Also,
std::thread::hardware_concurrency()
docs say that the returned value "should be considered only a hint".There are no tests yet, because I wanted to have some input whether a change like this would be desirable before finalizing the work. Another potential improvement possibility would be monitoring the process cpuset and changing the amount of worker threads dynamically.
Risk Level: Medium
Testing: manual testing with
taskset
anddocker
Docs Changes: Yes, explained the change of behavior
Release Notes: Yes