[POC] Fleet agent concurrency #70495
Conversation
@@ -282,7 +282,7 @@ export class HttpServer {
      this.log.warn(`registerOnPreAuth called after stop`);
    }

-    this.server.ext('onRequest', adoptToHapiOnPreAuthFormat(fn, this.log));
+    this.server.ext('onPreAuth', adoptToHapiOnPreAuthFormat(fn, this.log));
The `onRequest` lifecycle event occurs prior to route lookup, which prevented the use of `request.route.options.tags` within the pre-auth handler to determine whether or not the route is fleet-agent specific. Changing this from `onRequest` to `onPreAuth` fixes this specific issue, but potentially introduces others.

@restrry this was originally using `onRequest` as part of #36690, are you aware of any reason why this can't be changed to `onPreAuth` instead?
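For illustration, a minimal sketch of the kind of tag check being described, assuming a hypothetical `fleet-agent` route tag and paraphrasing rather than copying the PR's code:

import type { CoreSetup } from 'src/core/server';

export function registerAgentPreAuthCheck(core: CoreSetup) {
  core.http.registerOnPreAuth((request, response, toolkit) => {
    // Route tags are only populated once route lookup has happened, which is why this
    // check works when the interceptor runs on hapi's `onPreAuth` but not on `onRequest`.
    const isAgentRoute = request.route.options.tags.includes('fleet-agent');
    if (isAgentRoute) {
      // ...apply the concurrency check here, before authentication runs...
    }
    return toolkit.next();
  });
}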
`onRequest` was required to support Spaces rewriting the URL:

kibana/x-pack/plugins/spaces/server/lib/request_interceptors/on_request_interceptor.ts (line 47 in 40a456d):
return toolkit.rewriteUrl(newUrl);
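For context, a simplified sketch of why that interceptor has to run before route lookup; this is a paraphrase, not the actual Spaces code, and the regex and function name are illustrative only:

import type { CoreSetup } from 'src/core/server';

export function registerSpaceUrlRewrite(core: CoreSetup) {
  // registerOnPreAuth currently maps to hapi's `onRequest`, i.e. it runs before route
  // lookup. That ordering is what makes the rewrite possible: /s/<space-id>/some/path is
  // turned back into /some/path before the router tries to match it.
  core.http.registerOnPreAuth((request, response, toolkit) => {
    const pathname = request.url.pathname || '';
    const match = pathname.match(/^\/s\/[^/]+(\/.*)?$/);
    if (match) {
      // Simplified: the real interceptor also handles the query string, among other details.
      return toolkit.rewriteUrl(match[1] || '/');
    }
    return toolkit.next();
  });
}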
Gotcha, thanks! Could we theoretically change `http.registerOnPreAuth` to use `onPreAuth`, and introduce a `http.registerOnPreRouting` which uses `onRequest`?
> Gotcha, thanks! Could we theoretically change http.registerOnPreAuth to use onPreAuth, and introduce a http.registerOnPreRouting which uses onRequest?
Yeah, it shouldn't be that hard. I want to make sure it's necessary to extend the API for the current implementation. Let me know, and the platform team can provide an interceptor implementation.
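For illustration, a rough sketch of how that split could look from a consumer's point of view; `registerOnPreRouting` is hypothetical at this point in the discussion, and only `registerOnPreAuth` exists today:

import type { CoreSetup } from 'src/core/server';

export function registerInterceptors(core: CoreSetup) {
  // Hypothetical: would map to hapi's `onRequest`, i.e. before route lookup, so URL
  // rewriting (Spaces) keeps working here.
  core.http.registerOnPreRouting((request, response, toolkit) => {
    return toolkit.next();
  });

  // Would map to hapi's `onPreAuth`, i.e. after route lookup, so request.route.options.tags
  // is available for the fleet-agent check.
  core.http.registerOnPreAuth((request, response, toolkit) => {
    return toolkit.next();
  });
}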
);
core.http.registerOnPreResponse(
  (request: KibanaRequest, preResponse: OnPreResponseInfo, toolkit: OnPreResponseToolkit) => {
    if (isAgentRequest(request) && preResponse.statusCode !== 429) {
This assumes that only the pre-auth handler could cause a response with status code 429, which isn't absolutely certain. It's possible that one of the HTTP route handlers is replying with a 429 itself, or propagating a 429 from Elasticsearch, which would mess up the counter.
The `request` object is different from the pre-auth to the pre-response handler, so I wasn't able to use a `Set` or `WeakSet` to track whether or not this was a 429 that we returned from within the pre-auth handler... Any other ideas?
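To make the deficiency concrete, here is a condensed sketch of the interceptor pair under discussion; `isAgentRequest` and `maxConcurrentRequests` are borrowed from the snippets in this thread, and the rest paraphrases rather than copies the PR:

import type { CoreSetup, KibanaRequest } from 'src/core/server';

export function registerAgentConcurrencyInterceptors(
  core: CoreSetup,
  isAgentRequest: (request: KibanaRequest) => boolean,
  maxConcurrentRequests: number
) {
  let concurrentAgentRequests = 0;

  core.http.registerOnPreAuth((request, response, toolkit) => {
    if (isAgentRequest(request)) {
      if (concurrentAgentRequests >= maxConcurrentRequests) {
        // Rejected before authentication; the pre-response hook must not decrement for this one.
        return response.customError({ statusCode: 429, body: 'Too Many Agents' });
      }
      concurrentAgentRequests += 1;
    }
    return toolkit.next();
  });

  core.http.registerOnPreResponse((request, preResponse, toolkit) => {
    // Deficiency noted above: a 429 produced by a route handler, or propagated from
    // Elasticsearch, is indistinguishable from the pre-auth rejection, so the counter is
    // (incorrectly) not decremented for those responses either.
    if (isAgentRequest(request) && preResponse.statusCode !== 429) {
      concurrentAgentRequests -= 1;
    }
    return toolkit.next();
  });
}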
Can we wrap Ingest Manager-specific route handlers in a function instead of adding global interceptors? https://github.com/elastic/kibana/pull/70495/files#diff-280f58825bd033ffbe7792f8423f6122R88

I'm also not sure that we want to provide access to response data in interceptors.
something along these lines (not tested):

import { RequestHandlerWrapper } from 'src/core/server';

class Lock {
  private counter = 0;

  constructor(private readonly maxConnections: number = 1) {}

  increase() {
    this.counter += 1;
  }

  decrease() {
    this.counter -= 1;
  }

  canHandle() {
    return this.counter < this.maxConnections;
  }
}

const lock = new Lock();

export const concurrencyLimit: RequestHandlerWrapper = (handler) => {
  return async (context, request, response) => {
    if (!lock.canHandle()) {
      return response.customError({ body: 'Too Many Agents', statusCode: 429 });
    }
    try {
      lock.increase();
      // await so the lock is held until the handler resolves, not just until it returns a promise
      return await handler(context, request, response);
    } finally {
      lock.decrease();
    }
  };
};

concurrencyLimit(postAgentEnrollHandler);
concurrencyLimit(postAgentCheckinHandler);
This doesn't let us reject traffic without validating API keys, as described here, right?
Yeah... I hadn't seen the comment. Now I'm surprised that the auth has such a huge penalty.

> As @kobelb mentioned, it's critical that we can reject incoming traffic as cheaply as possible. It translates to saved $$ for our customers, making us able to offer a more competitive solution.

@kobelb, for what use cases should we consider this a critical option? Auth & spaces make a huge number of requests to ES on every page load.

Would server-side sessions help to reduce the number of requests? Does this problem exist for API keys only?
Fleet's usage of Kibana APIs is different from our traditional usage. Elastic Agents will be using Kibana APIs to enroll themselves and retrieve their configuration. As such, we're potentially dealing with thousands of Elastic Agents hitting these APIs... Whether or not this is "critical" is debatable and largely dependent on what the ingest-management team is seeing during their load-testing, but skipping auth reduces the load on Kibana when this circuit breaker is hit.

@roncohen despite the current implementation being imperfect and potentially misinterpreting the 429, can we perform load-testing with the circuit breaking being done before authentication and after authentication to determine what type of impact this has on Fleet's scalability?

> Would server-side sessions help to reduce the number of requests?

I don't think they will; Kibana will still have to make a query to Elasticsearch to retrieve the server-side session document to authenticate the user.

> Does this problem exist for API keys only?

Any call to an Elasticsearch API will incur some performance penalty. However, there are differences in the caching strategy for API keys vs username/password. Regardless of the type of credentials, performing any unnecessary work before the circuit breaker has a chance to short-circuit the operation is inefficient.
> @roncohen despite the current implementation being imperfect and potentially misinterpreting the 429, can we perform load-testing with the circuit breaking being done before authentication and after authentication to determine what type of impact this has on Fleet's scalability?
+1. I'd love to adjust the concurrency value (what should the value be?) and merge this into master ASAP so we can test in cloud as we did with long polling.
@@ -152,6 +157,35 @@ export class IngestManagerPlugin
  }

  public async setup(core: CoreSetup, deps: IngestManagerSetupDeps) {
    const maxConcurrentRequests = 1;
All this logic was put here out of laziness; it should really be placed in another file...

@joshdover any input on the general approach and/or the limitations which I've explicitly noted?
What's the primary reason for executing this logic in the pre-auth stage rather than in the regular request handler? Are we trying to avoid executing some expensive logic, or are we just trying to register some behavior in a single place? If it's the latter, I would recommend implementing this as a wrapper or even a Proxy object around the …
It was intended to address #67987 (comment). Authenticating incurs some overhead, so if we can short-circuit before performing the authentication, we minimize the load on Elasticsearch even more.
Thanks so much for taking a stab at this! I really appreciate the effort here. As @kobelb mentioned, it's critical that we can reject incoming traffic as cheaply as possible. It translates to saved $$ for our customers, making us able to offer a more competitive solution.
Thanks for getting #70775 together so quickly @restrry! @roncohen, is my understanding correct that Fleet's performance tests can't be done against feature branches? While I'd prefer to hold off on adding disparate …
ccing @jfsiii and @scunningham: what do you think about testing this feature branch locally or on a self-hosted GCP instance just to compare the difference this makes? E.g. try to enroll 7k-10k agents over a very short period and see how Kibana/Elasticsearch does in the branch vs. master? A simple "autocannon" test would also be sufficient. It's probably worth trying out a few different …

@kobelb should we do a version/branch that doesn't add the …
As long as it's feasible, that'd be great.
@elasticmachine merge upstream
💔 Build Failed

Failed CI steps / test failures (first 3 of 16):
- Chrome X-Pack UI Functional Tests, x-pack/test/functional/apps/discover/feature_controls/discover_spaces.ts: discover feature controls / spaces / space with no features disabled / shows discover navlink
- Chrome X-Pack UI Functional Tests, x-pack/test/functional/apps/dev_tools/feature_controls/dev_tools_spaces.ts: console feature controls / spaces / space with no features disabled / shows 'Dev Tools' navlink
- Chrome X-Pack UI Functional Tests, x-pack/test/functional/apps/canvas/feature_controls/canvas_spaces.ts: Canvas app / spaces feature controls / space with no features disabled / shows canvas navlink

and 13 more failures, only showing the first 3.

To update your PR or re-run it, just comment with: …
@kobelb @roncohen just wanted to mention that I am working on this locally (#67987 (comment)). I'll push any changes/numbers to my fork and we can decide what to do from there.
I put a draft PR up at #71221 but it's having some issues. I'd love some additional 👀
@jfsiii I overlooked the fact that …
I was wrong about this...
Whether or not we should be using the completion of the …
So we do have test coverage that explicitly verifies that the …
Works by the design of the Observables. @joshdover yes, you can add:

kibana/src/core/server/http/router/request.ts (lines 188 to 191 in 159369b)
I should have clarified. The `aborted` API is working by design, but using the completion as a signal that the request completed would be abusing the API. I agree we should add the `completed` event if it is necessary. @kobelb LMK if you would like a PR for that.
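A minimal sketch of the `aborted$` usage being discussed, assuming a concurrency slot needs releasing both on normal completion and on client abort; the helper name and wiring are hypothetical, and only `request.events.aborted$` comes from the code referenced above:

import type { KibanaRequest } from 'src/core/server';

// Returns a cleanup function for the route handler to call (e.g. in a finally block).
// Completion of aborted$ must not be read as "the request finished"; only an emission
// signals that the client went away.
export function releaseOnDone(request: KibanaRequest, release: () => void) {
  let released = false;
  const releaseOnce = () => {
    if (!released) {
      released = true;
      release();
    }
  };
  const subscription = request.events.aborted$.subscribe(releaseOnce);
  return () => {
    subscription.unsubscribe();
    releaseOnce();
  };
}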
It looks like … In my testing, it seems like … Could you help me verify if this is true, and whether it has recently changed or was always the case?
I reread the discussion. Looks like it's wrong to use `toPromise` here. Sorry for the noise.
This is a proof-of-concept for implementing concurrency controls that cap the maximum number of agents connected to a single instance of Kibana. There are some deficiencies and open questions noted inline via PR comments.

Proof-of-concept for #67987