Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create an ES client per simulated client instead of per worker. #1516

Merged

Conversation

michaelbaamonde
Copy link
Contributor

This is a prerequisite for #1067 and also indirectly relates to #1350.

As discussed in the context of API keys, runners shouldn't have to concern themselves with ES transport options; they should just be provided ES client objects that already have the right options set. This PR is the first step in that direction.

Previously, an ES client was created per worker and shared among all of the worker's simulated clients. Now, each worker will instead create distinct ES client objects per simulated client that the worker is responsible for. This unique ES client object is what ultimately gets passed to the runner(s) invoked by that simulated client.

In lower-level terms, a worker is basically represented by an AsyncIoAdapter object, while a client basically maps to an AsyncExecutor object. An AsyncIoAdapter will now create an ES client per client_id that it's responsible for, and pass it to the AsyncExecutor that's instantiated for that client_id. The AsyncExecutor ultimately invokes the runner(s) that issue requests via the ES client.

This isn't the only possible implementation, though. As the discussion linked above details, version 8.0+ of the ES Python client offers a new .options() method on clients for overriding transport options on a per-request basis. It doesn't mutate the client, but rather returns a new client object with the provided transport options set.

We originally planned on using this new API, either by upgrading our client dependency (which we've paused for now) or by implementing .options() ourselves in Rally today. However, after getting into the details more, it seems that .options() isn't strictly necessary right now, and may actually make implementation more difficult and risky. Here's why:

  1. While the .options() method is meant for request-level overrides, creating clients with transport options defined in the first place is still the normal way of doing things. "Overriding default transport settings at request time" is one way to conceptualize what we're trying to do here, but it's still perfectly valid for workers to just instantiate ES clients upfront with the transport options they already know they'll need. Now, at a lower level, individual runners may have a genuine need for overriding those options (e.g. setting a request_timeout or ignore_status), but it's not necessary at the level of simulated clients in order to support something like client-specific authentication. The downside could be the overhead of creating potentially many more ES client objects, but...

  2. ....options() will do that, anyway. As mentioned, .options() doesn't mutate the object it's called on, but rather creates a new object of its type. And leads to the real trickiness, which is that Rally's async ES client objects aren't just vanilla instances of AsyncElasticsearch. They also inherit from RequestContextHolder and are therefore stateful. My initial experiments with porting .options() to Rally quickly resulted in benchmarks hanging forever. I have a sense of why this is happening, but in light of (1) above, it doesn't seem worth the time or risk to pursue, at least not right now. We will need to figure this out eventually for the client upgrade, however, since .options() is how individual runners will need to set options like request_timeout, ignore_status, as mentioned.

So, to sum up, implementing .options() ourselves doesn't seem to offer obvious advantages at the moment, and making it work would probably require refactoring the code that Rally relies on for measuring timings and scheduling task execution, which is risky to say the least.

Regarding the potential overhead of creating more clients, I'm curious to hear input on this from the likes of @DJRickyB and @dliappis. If we merge as-is, we'll quickly get some feedback from the nightlies, but I'd think that if this is going to cause problems, it will be with very high client counts. I ran a logging indexing benchmark with 1,000 bulk indexing clients with both master and this branch of Rally and found no evidence of this change causing a client-side bottleneck. It's just one data point, so I'm happy to flesh out a larger number of test cases if we feel that's in order, but wanted to get the team's thoughts before investing a bunch of time in that.

Previously, Rally would instantiate an ES client per worker and scale its
connection pool to match the number of simulated clients associated with that
worker. This worker-level ES client would be shared across all of the
simulated clients spawned by the worker.

This meant that any simulated client and its siblings would have to use exactly
the same ES transport options, as they all shared a single ES client object.
There are use cases, however, where sibling clients may each need to set
unique transport options (e.g. for authentication). This commit facilitates
support for these lower-level overrides by creating a distinct ES client per
simulated client, not per worker.
@michaelbaamonde michaelbaamonde added the :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc. label Jun 13, 2022
@michaelbaamonde michaelbaamonde requested a review from pquentin June 13, 2022 20:47
@pquentin
Copy link
Member

Thanks for the detailed explanation! I like the approach but would like to hear from others too.

@michaelbaamonde
Copy link
Contributor Author

We ran several tests with both this branch and master to determine whether this change introduces any signs of a client-side bottleneck at higher client counts. We did not find any evidence to that effect, but, unfortunately, we discovered that even master has issues scaling beyond ~2k clients on a single load driver due to #1448 in some cases.

There’s no reason in principle to expect this change to significantly affect resource consumption on the load driver in terms of memory usage, file handles, etc. So while we don’t have corroboration at very high client counts, it seems that this change aligns with those expectations and doesn’t do worse than master.

So, we will merge this change and look into the scalability issues separately.

@michaelbaamonde michaelbaamonde merged commit 0ac68fc into elastic:master Jun 21, 2022
@pquentin pquentin added the enhancement Improves the status quo label Jun 23, 2022
@pquentin pquentin added this to the 2.5.0 milestone Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants