Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Envoy cpuset size to set the default number or worker threads #5975

Merged
merged 14 commits into from
Mar 4, 2019

Conversation

ipuustin
Copy link
Member

Description:

If --concurrency option is not specified, set the default number of worker threads based on the cpuset of the process. This means that the amount of worker threads is not controlled by the hardware thread count but instead the cpu allocation.

For example, if Envoy is started in a Docker container like this, only four worker threads are created if --concurrency option is not specified:

$ docker run --cpuset-cpus="0-3" ...

The original behavior was that the number of CPU threads in the system would be used to determine the number of worker threads. This number could be very large for servers with multiple physical CPUs and hyperthreading enabled.

This change has a potentially large performance impact, since it changes the default number of threads assigned to Envoy. However, since the original intent of the default concurrency value was to have the number of worker threads matching to the HW capabilities, after this change the Envoy behavior is probably closer to what is expected. Also, std::thread::hardware_concurrency() docs say that the returned value "should be considered only a hint".

There are no tests yet, because I wanted to have some input whether a change like this would be desirable before finalizing the work. Another potential improvement possibility would be monitoring the process cpuset and changing the amount of worker threads dynamically.

Risk Level: Medium
Testing: manual testing with taskset and docker
Docs Changes: Yes, explained the change of behavior
Release Notes: Yes

@ipuustin
Copy link
Member Author

After (re-)reading the commit guidelines, I created issue #5976 for discussion.

@mattklein123
Copy link
Member

@ipuustin thanks for working on this. My feeling is that this is too risky to universally change. I'm thinking we can potentially add a new CLI option to enable this behavior? WDYT?

Also, I would prefer if we could factor out the Linux specific code into a dedicated CC file without any ifdefs, and then this file can be excluded on OSX/Windows, etc.

@mattklein123 mattklein123 self-assigned this Feb 15, 2019
@ipuustin
Copy link
Member Author

@mattklein123 Thanks for review comments. I understand very well the risk consideration. I'm totally fine with adding a command line switch. Maybe --cpuset-threads, which, if enabled, would mean that the default worker threads number would be calculated based on the assigned cpuset? Having the Linux-specific code in a separate file makes also a lot of sense. I'll do the changes once we get the potential design somewhat nailed down.

@mattklein123
Copy link
Member

I'll do the changes once we get the potential design somewhat nailed down.

This proposal sounds great to me.

@sbelair2
Copy link
Contributor

@ipuustin, @mattklein123 There is a potential issue with this with the integration of Envoy and VPP. VPP comes up early and grabs some cores for DPDK and packet graph processing. I don't know how deterministic VPP's core selection is, but it is guaranteed that those cores can be pegged at 100% of CPU under load. I figured we'd be fine as long as we don't try to set core affinities for Envoy's cpuset. Is this the case?

@ipuustin
Copy link
Member Author

@sbelair2 This PR doesn't really deal with how Envoy's cpuset might be assigned. It just tries to make the default number of worker threads smaller if Envoy is running on a limited cpuset.

If Envoy and VPP are running in the same container, they indeed share the same assigned cpuset. When VPP "grabs some cores" it probably limits itself running on those, but Envoy's cpuset isn't limited unless there's some startup wrapper which makes Envoy run on the remaining of the original cpuset. I think the consideration there should just be making the cpuset to be large enough to accommodate both Envoy and VPP. It shouldn't matter too much if there's one or two Envoy worker threads too many because VPP keeps one or two cores at full load.

@sbelair2
Copy link
Contributor

@ipuustin Thanks for the clarification. Btw, VPP will be running outside of any containers, but I see no issue as you state it.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this is great, some high level comments and then will take a deeper look on the next pass. Thank you!

/wait

@@ -76,6 +76,12 @@ following are the command line options that Envoy supports.
`connection` component to run at `trace` level, you should pass ``upstream:debug,connection:trace`` to
this flag. See ``ALL_LOGGER_IDS`` in :repo:`/source/common/common/logger.h` for a list of components.

.. option:: --cpuset-threads

*(optional)* This flag is used to control the number of worker threads if --concurrency option is not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I assume this is only a Linux thing? Can we potentially make that a little more and possible link to some definition / man page / etc. of cpuset?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a mention that this is about Linux and a link to the kernel docs.


namespace Envoy {

uint32_t OptionsImplPlatform::getCpuCount() { return std::thread::hardware_concurrency(); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we print a warning here that the platform does not support cpuset so Envoy is defaulting to HW concurrency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's a good idea (since it's obviously not working as the user tries to use that). I added the warning.


namespace Envoy {

uint32_t OptionsImplPlatform::getCpuCount() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get some test coverage of this code? Potentially a test that only runs in Linux? If you want to test it using mocks you can also consider using OsSysCalls (that might be good also to test the edge cases).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added some tests. I chose to refactor the getCpuCount() function into two to make the testing easier.

return hw_threads;
}

std::string statusLine;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htuch
Copy link
Member

htuch commented Feb 21, 2019

@ipuustin can you merge master to pick up the compile_time_options CI fix?

if (threads > 0 && threads <= hw_threads) {
return threads;
}
return hw_threads;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might mean we return 0, which may surprise an unsuspecting caller?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I added a check to make sure hw_threads >= 1.


uint32_t OptionsImplPlatform::getCpuCount() {
unsigned int hw_threads = std::thread::hardware_concurrency();
std::string path("/proc/" + std::to_string(getpid()) + "/status");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Would '/proc/self/status' work as well here? Perhaps slightly simpler?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, fixed.

@oschaaf
Copy link
Member

oschaaf commented Feb 26, 2019

So I have been mulling this over for a bit, and one thing that crossed my mind, is that at some point it may make sense to discern between optimizing for throughput and optimizing for latency when considering available hardware resources. Specific to this proposed feature, such priority would make a difference in how to treat logical cpu's resulting from hyperthreading, depending on what should be prioritized. If latency is important, then maybe you want to avoid using logical cores.. if bandwidth is important, then maybe there's a factor we can derive that works best for Envoy (1.5 threads per HT-enabled cpu? 2? I'm not sure).

In any case, over at https://github.com/envoyproxy/envoy-perf/blob/master/nighthawk/source/common/utility.cc#L17, determineCpuCoresWithAffinity() relies on pthread_getaffinity_np() and CPU_COUNT. Unfortunately that also does not make a distinction between physical and logical cores, but I feel that relying on api calls to do a straightforward count of cpu's results in an implementation that is easy to grok.

However, I also peeked at libhwloc while implementing determineCpuCoresWithAffinity(). I think libhwloc does what this PR implements (and a lot more). As such, it may be worth taking a peek at their linux implementation of this specific functionality, which is looks similar to the change proposed here?

https://github.com/open-mpi/hwloc/blob/master/hwloc/topology-linux.c#L4941

Also, I noticed that libhwloc does not seem to use /proc/self/status. Not sure if it is related, but the man pages say something potentially interesting about it:

/proc/[pid]/status
              Provides much of the information in /proc/[pid]/stat and
              /proc/[pid]/statm in a format that's easier for humans to
              parse.

Considering our purpose to have a machine parse it, I wonder, is /proc/self/status the best field to use for our purpose?

(ps - As far as I can tell the libhwloc license is pretty permissive: https://github.com/open-mpi/hwloc/blob/master/COPYING)

@ipuustin
Copy link
Member Author

@oschaaf Thanks for the insightful comment!

Having heuristics which try to estimate whether HT is enabled and estimate the thread number based on that might be very tricky, because I think that the performance difference is depending on the SMT implementation that the processor has. Another way could be providing known-to-be-fast templates for different purposes: "If you want to optimize for XYZ, try starting with this configuration, and benchmark your workload..."

For a single system (or a class of systems) it could be possible to find good thread factors. I've heard (but haven't benchmarked this myself) that disabling HT on some cores can be pretty closely "emulated" by this algorithm: find out the HT pairs in the CPU topology (/proc/cpuinfo core ids), disable the other CPU in the pair (echo 0 > /sys/devices/system/cpu/cpuX/online), put the remaining CPU cores into a cpuset, and give that cpuset to the process. I don't know what's the scope of envoy-perf, but maybe it could use this method to find factors that work well for several workloads in both HT-enabled and HT-disabled systems?

If we can come up with a suitable "default" thread number factor, that factor can be made to scale with the number of CPUs available. For a example, maybe there could be two worker threads if there was only one CPU, but 25 worker threads for 20 CPU cores? The current factor of 1 is pretty simplistic.

But coming back to this PR, what changes are you proposing? Should we add libhwloc dependency and use that? Or use sched_getaffinity() / pthread_getaffinity_np() for getting the CPU count? I'm fine with both approaches.

@oschaaf
Copy link
Member

oschaaf commented Feb 27, 2019

I feel adding libhwloc as a dependency for just this may be a bit heavy-weight if we only need a small subset of functionality out of it, but perhaps it could serve as inspiration when pursuing the current approach to implement this feature? For example, it also considers cgroups, which could be nice (and which is mentioned in the linked kernel docs)?

I'm not a lawyer, but I think even just copying some code out of it is OK as long as care is taken to properly attribute.

For just counting cpu's with affinity, then imho sched_getaffinity() / pthread_getaffinity_np() may be easier to grok and test?

@ipuustin
Copy link
Member Author

@oschaaf : I added a patch changing the logic to use sched_getaffinity(). Please take a look if this is more what you had in mind.

Do you mean by cgroups that we would access cpu cgroup and allocate right amount of threads when Envoy is be running on all cores but with a limited cpu quota? I agree that the cpuset size calculation doesn't help with that, but I'm not sure if limiting the amount of worker threads is even the right thing to do in that case.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this looks great. A few small comments.

/wait

@@ -0,0 +1,24 @@
#pragma once
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add something like:

#ifndef __LINUX__
#error "Linux platform included on non-Linux"
#endif 

(or something like that). Same for the impl.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, done.

namespace Envoy {
namespace Api {

class LinuxOsSysCallsImpl : public OsSysCalls, LinuxOsSysCalls {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there value in embedding and calling through? It seems like we could just have 2 syscall impls that are used?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed this to work the way you suggested. I was originally thinking that maybe it would make sense that there would be only one *OsSysCalls version for each platform, which would support all the system calls available for the platform. I'm totally fine with either way though.

// that can be known.
concurrency_ = OptionsImplPlatform::getCpuCount();
} else {
if (concurrency.isSet() && cpuset_threads_ && cpuset_threads.isSet()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a test for this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test.

linux_os_syscalls.sched_getaffinity(pid, sizeof(cpu_set_t), &mask);
if (result.rc_ == -1) {
// Fall back to number of hardware threads.
return hw_threads;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a test for this case? The value in adding linux syscalls is now you can have a mock and unit test this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

ipuustin added 14 commits March 4, 2019 10:58
Add a function for finding the size of the cpuset of the process. In
non-Linux platforms, fall back to the number of HW threads in the
system.

Signed-off-by: Ismo Puustinen <[email protected]>
Add a new configuration option "--cpuset-threads". If that option is set
and "--concurrency" option is not set, only start a number of worker
threads which equals to the size of the available cpuset.

For example, if Envoy is started in a Docker container like this, only
four worker threads are created if only "--cpuset-threads" option is
set:

  $ docker run --cpuset-cpus="0-3" ...

Signed-off-by: Ismo Puustinen <[email protected]>
Also describe the changed server_info structure.

Signed-off-by: Ismo Puustinen <[email protected]>
Do not call-through to the non-Linux-specific system calls using the
Linux system call API object.

Signed-off-by: Ismo Puustinen <[email protected]>
Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you looks awesome!

/retest

@repokitteh-read-only
Copy link

🔨 rebuilding ci/circleci: asan (failed build)

🐱

Caused by: a #5975 (review) was submitted by @mattklein123.

see: more, trace.

@mattklein123 mattklein123 merged commit f59a30e into envoyproxy:master Mar 4, 2019
fredlas pushed a commit to fredlas/envoy that referenced this pull request Mar 5, 2019
@lizan lizan mentioned this pull request Mar 6, 2019
srcs = ["os_sys_calls_impl.cc"],
hdrs = ["os_sys_calls_impl.h"],
srcs = ["os_sys_calls_impl.cc"] + select({
"@bazel_tools//src/conditions:linux_x86_64": ["os_sys_calls_impl_linux.cc"],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR, this breaks the Google import since we don't have @bazel_tools conditional in this form internally. I'd prefer to stick with //bazel:BUILD conditionals, I will do an update PR for this.

unsigned int fake_cpuset_size = std::thread::hardware_concurrency();
unsigned int fake_hw_threads = fake_cpuset_size;

EXPECT_EQ(OptionsImplPlatformLinux::getCpuAffinityCount(fake_hw_threads), fake_cpuset_size);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're seeing this break inside sandboxed build environments.. I think we will work around this by just disabling this feature, but if others notice it, might be worth further investigation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bug in the test design -- if you run the test in an environment where cpuset is reduced, the hardware thread count will not match with the cpuset size. What we can say is that the actual cpuset size <= hw thread count. I need to think if the test can be fixed in a better way.

arkodg added a commit to arkodg/gateway that referenced this pull request Jul 28, 2023
* Use cpuset size instead of total hardware threads to determine
total worker threads.

* Can be overridden by explicitly setting `--concurrency` arg

More in https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-cpuset-threads

Discussion in envoyproxy/envoy#5975

Relates to envoyproxy#1718

Signed-off-by: Arko Dasgupta <[email protected]>
zirain pushed a commit to envoyproxy/gateway that referenced this pull request Jul 31, 2023
* Set `--cpuset-threads` in EnvoyProxy cmdline arg

* Use cpuset size instead of total hardware threads to determine
total worker threads.

* Can be overridden by explicitly setting `--concurrency` arg

More in https://www.envoyproxy.io/docs/envoy/latest/operations/cli#cmdoption-cpuset-threads

Discussion in envoyproxy/envoy#5975

Relates to #1718

Signed-off-by: Arko Dasgupta <[email protected]>

* lint

Signed-off-by: Arko Dasgupta <[email protected]>

---------

Signed-off-by: Arko Dasgupta <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants