-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use updated onPreAuth from Platform #71552
Use updated onPreAuth from Platform #71552
Conversation
Pinging @elastic/ingest-management (Team:Ingest Management) |
return tags.includes(LIMITED_CONCURRENCY_ROUTE_TAG); | ||
} | ||
|
||
const LIMITED_CONCURRENCY_MAX_REQUESTS = 250; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should this be configurable? Do we expose as a config flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ to having it as a config option, it fits nicely under xpack.ingestManager.fleet
the default value of 250
seems low to me, but I'm not up to date on our load testing results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This indeed seems low, how did you get to that number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolving this since we've switched to 0 (off) by default
import { KibanaRequest, LifecycleResponseFactory, OnPreAuthToolkit } from 'kibana/server'; | ||
import { LIMITED_CONCURRENCY_ROUTE_TAG } from '../../common'; | ||
|
||
class MaxCounter { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can handle this any number of ways. I just chose something that allowed the logic to be defined outside the route handler
const LIMITED_CONCURRENCY_MAX_REQUESTS = 250; | ||
const counter = new MaxCounter(LIMITED_CONCURRENCY_MAX_REQUESTS); | ||
|
||
export function preAuthHandler( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I test this handler? If so, can someone point me in the right direction :) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for testing, maybe looking at routes/package_config/handlers.test.ts
would be helpful? it has usage of httpServerMock.createKibanaRequest
for creating a request object, that method has options for creating it with tags, etc.
then I think you could mock LIMITED_CONCURRENCY_MAX_REQUESTS
to 1 and mock toolkit.next()
to resolve after waiting for a second. then kicking off multiple calls to the handler inside a Promise.all
should return an array of of responses whose status codes are [200, 503, 503, ...]
return tags.includes(LIMITED_CONCURRENCY_ROUTE_TAG); | ||
} | ||
|
||
const LIMITED_CONCURRENCY_MAX_REQUESTS = 250; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ to having it as a config option, it fits nicely under xpack.ingestManager.fleet
the default value of 250
seems low to me, but I'm not up to date on our load testing results
const LIMITED_CONCURRENCY_MAX_REQUESTS = 250; | ||
const counter = new MaxCounter(LIMITED_CONCURRENCY_MAX_REQUESTS); | ||
|
||
export function preAuthHandler( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be nice to name this something Fleet or agent-related to emphasize that this handler is for those routes
|
||
counter.increase(); | ||
|
||
// requests.events.aborted$ has a bug where it's fired even when the request completes... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how can we decrease the counter if/when the bug is fixed? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was some discussion about this starting in #70495 (comment)
It's a "bug" but there is a test to ensure it works. There's also a possibility Platform will add a request.events.completed$
which we can use instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a "bug" but there is a test to ensure it works. There's also a possibility Platform will add a request.events.completed$ which we can use instead
This is entirely pedantic, it's not really a bug as much as an "implementation detail" per #70495 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const LIMITED_CONCURRENCY_MAX_REQUESTS = 250; | ||
const counter = new MaxCounter(LIMITED_CONCURRENCY_MAX_REQUESTS); | ||
|
||
export function preAuthHandler( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for testing, maybe looking at routes/package_config/handlers.test.ts
would be helpful? it has usage of httpServerMock.createKibanaRequest
for creating a request object, that method has options for creating it with tags, etc.
then I think you could mock LIMITED_CONCURRENCY_MAX_REQUESTS
to 1 and mock toolkit.next()
to resolve after waiting for a second. then kicking off multiple calls to the handler inside a Promise.all
should return an array of of responses whose status codes are [200, 503, 503, ...]
@elasticmachine merge upstream |
@jfsiii Can you clarify how it work?
What is the actual behavior here? does it:
@nchaulet could you take a look? |
@ph I updated the description to include more details |
thanks @jfsiii!
|
thanks, @roncohen
I updated the description. Can you say more about why 429 vs 503? I think a 503 is more accurate as it's a temporary system/server issue, not one with the request. A 429 would be if they were exceeding our API request rate, but we don't have one.
Adding the
I wondered the same thing and figured we could come back to it. Thanks for (3). I'll ping people in the code comments. |
|
||
// requests.events.aborted$ has a bug where it's fired even when the request completes... | ||
// we can take advantage of this bug just for load testing... | ||
request.events.aborted$.toPromise().then(() => counter.decrease()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As the comment says this decrements the counter when a request completes. However, /checkin
is using long-polling so I don't think the request "completes" in the same way.
As @roncohen mentions in (3) in #71552 (comment), I think the counter will increment with each /checkin
but not decrement.
It seems like we could revert to using core.http.registerOnPreResponse
to decrement during a checkin response , but the reason we moved away from that was it missed aborted/failed connections #70495 (comment) and as the first line says requests.events.aborted$ has a bug where it's fired even when the request completes
so that will result in double-decrementing.
Perhaps we could do some checking in both the response and aborted handler to see if theres a 429/503 like the initial POC https://github.com/elastic/kibana/pull/70495/files#diff-f3ee51f84520a4eb03ca041ff4c3d9a2R182
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jfsiii I think we need to do the decrementing inside the check-in route when the actual work of checking in the agent has completed and the request transitions to long polling mode. Could the interceptor augment the context passed to the route such that the route can decrement when it sees fit? And then we need to ensure that a given request can only be decremented once, so it doesn't get decremented again when the response finally gets send back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @roncohen here, I think the behavior is to "protect from multiple connection" until enough agents have transitioned to long polling where their usage become low.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the interceptor augment the context passed to the route such that the route can decrement when it sees fit?
I don't believe that it can at the moment... I think we have two options here
- Don't use the pre-auth handler to reply with a 429 for these long-polling routes. If we're doing all of the rate-limiting within the route handler itself, this becomes possible without any changes from the Kibana platform.
- Make changes to the Kibana platform to allow us to correlate the request seen in the pre-auth handler to the request within the route handler and the other lifecycle methods. This would allow us to only decrement the counter once by using a
WeakSet
. @joshdover, is there some other primitive or concept within the Kibana platform that we could use to fulfill this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. is there some other primitive or concept within the Kibana platform that we could use to fulfill this behavior?
Not that is exposed currently. I am working on adding support for an id
property on KibanaRequest
that would be stable across lifecycle methods (#71019), but that will not make it for 7.9. I'm also not sure it would be safe to use for this use-case because the id
could come from a proxy that is sending an X-Opaque-Id
header which we can't guarantee will be unique? Maybe we need an internal ID that is unique that we can rely on?
We do keep a reference to the original Hapi request on the KibanaRequest
object but it's been very purposefully hidden so that plugins can't abuse it.
The way I see it, 503 means the service is unavailable. In this case, it's not actually unavailable, it's just that there are currently too many requests going on to specific routes and the server is opting to reject some of those requests to make sure it can deal properly with the ones that are ongoing and allow other kinds of request (from the user browsing the UI for example) to proceed. So on the contrary, the service is alive and well because it's protecting itself ;) If there's no connection to Elasticsearch, Kibana will return a 503 AFAIK. There are probably other scenarios where Kibana returns a 503. If we used 503s here, someone looking at HTTP response code graphs for Kibana will not be able to tell whether there's a significant problem like Elasticsearch being unavailable or it's just the protection from this PR being activated. Finally, 429 doesn't have to mean that a particular user/account is making too many requests. The spec says:
|
I didn't know that Kibana could return a 503, in that case going with 429 might be a better solution, and this is the exact behavior with Elasticsearch when too many bulk requests are made to the server. |
@elasticmachine merge upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a smoke test locally enrolling 1 agent, collecting system metrics, and deploying an updated config, everything looks good.
Giving 👍 due to time constraints, but I would like to see test coverage added for limited_concurrency.ts
. At the minimum validate that we can get 429 status codes, and also validate that the counter decreases accordingly.
@jen-huang thanks! I put a test to confirm we don't add the preAuth handler unless we have a config. We'll see what CI holds, but this is what I got locally
I'll try to add some more (like "do we only process the correct routes?") before merging, but I'll create a link for follow tests and link it when I merge this. |
When enrolling and the server currently handle to many concurrent request it will return a 429 status code. The enroll subcommand will retry to enroll with an exponential backoff. (Init 15sec and max 10mins) This also adjust the backoff logic in the ACK. Requires: elastic/kibana#71552
💚 Build SucceededBuild metrics
History
To update your PR or re-run it, just comment with: |
#19918) * [Elastic Agent] Handle 429 response from the server and adjust backoff When enrolling and the server currently handle to many concurrent request it will return a 429 status code. The enroll subcommand will retry to enroll with an exponential backoff. (Init 15sec and max 10mins) This also adjust the backoff logic in the ACK. Requires: elastic/kibana#71552 * changelog * Change values
elastic#19918) * [Elastic Agent] Handle 429 response from the server and adjust backoff When enrolling and the server currently handle to many concurrent request it will return a 429 status code. The enroll subcommand will retry to enroll with an exponential backoff. (Init 15sec and max 10mins) This also adjust the backoff logic in the ACK. Requires: elastic/kibana#71552 * changelog * Change values (cherry picked from commit 2db2152)
#19918) (#19926) * [Elastic Agent] Handle 429 response from the server and adjust backoff When enrolling and the server currently handle to many concurrent request it will return a 429 status code. The enroll subcommand will retry to enroll with an exponential backoff. (Init 15sec and max 10mins) This also adjust the backoff logic in the ACK. Requires: elastic/kibana#71552 * changelog * Change values (cherry picked from commit 2db2152)
* Use updated onPreAuth from Platform * Add config flag. Increase default value. * Set max connections flag default to 0 (disabled) * Don't use limiting logic on checkin route * Confirm preAuth handler only added when max > 0 Co-authored-by: Elastic Machine <[email protected]> Co-authored-by: Elastic Machine <[email protected]>
elastic#19918) * [Elastic Agent] Handle 429 response from the server and adjust backoff When enrolling and the server currently handle to many concurrent request it will return a 429 status code. The enroll subcommand will retry to enroll with an exponential backoff. (Init 15sec and max 10mins) This also adjust the backoff logic in the ACK. Requires: elastic/kibana#71552 * changelog * Change values
Summary
Track the number of open requests to certain routes and reject them with a 429 if we are above that limit.
The limit is controlled via the new config value
xpack.ingestManager.fleet.maxConcurrentConnections
. As a new flag it won't be immediately available in Cloud UI. I'll double check the timeline, but it's possible it won't be available for 7.9.This is implemented with the updated
onPreAuth
platform lifecycle from #70775 and code/ideas from #70495 & https://github.com/jfsiii/kibana/pull/4We add a global interceptor (meaning it's run on every request, not just those for our plugin) which uses a single counter to track total open connections to any route which includes the
LIMITED_CONCURRENCY_ROUTE_TAG
tag. The counter will increment when the request starts and decrement on close. The only routes which have this logic are the Fleet enroll & ack routes; not checkinOpenQuestionsxpack.ingestManager.fleet
?Answers
xpack.ingestManager.fleet.maxConcurrentConnections
Checklist
Documentation was added for features that require explanation or tutorials
Unit or functional tests were updated or added to match the most common scenarios.
added 3 tests to make sure the preAuth handler isn't added if the flag is set to 0 b8288ae
Created [Ingest Manager] Add tests for the limited concurrency routes #71744 for more tests
PR vs now from a PR last week. Will update with up-to-date comparison