[Ingest Manager] Protect Kibana from spikes caused by Elastic Agent enrollment or check-in #67987
Comments
Pinging @elastic/ingest-management (Team:Ingest Management)
We've traditionally recommended that users deploy Kibana in a high-availability configuration and perform this type of logic at the load balancer. However, I totally get why this is inconvenient, specifically for Fleet. Do Agents access Fleet-specific APIs in Kibana, or are they also consuming "general purpose APIs" which have other consumers?
I think we should add a setting, specified in the kibana.yml, that limits the number of concurrent Agent requests Fleet will handle.

While I wish that there was some intelligent way for Kibana to determine when it's being overloaded by requests from Agents and begin replying with a 429, I don't think this is feasible. The number of Agents which an instance of Kibana can support is highly dependent on the size of the Elasticsearch cluster it's connected to, the load on the Elasticsearch cluster, the resources (CPU, memory, file descriptors) available to Kibana itself, and the other requests and background tasks which Kibana is currently running. In short, "it depends...".

I'm not aware of anywhere else in Kibana where we wish to limit the number of concurrent requests to a group of APIs, so it doesn't feel like this is something we should generalize to the Kibana Platform. @joshdover, have you heard of any similar requests, or do you envision us benefitting from a generalized solution to this problem?

A generalized solution potentially becomes even harder if the Agents use long-polling: we should have a limit for the number of Agents which can have long-polling connections established, but we also potentially should be queuing or limiting the requests internal to the long-polling connections to prevent these from overwhelming Kibana.
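For illustration, a limit like the one described above could be surfaced as a plugin config key readable from kibana.yml. This is only a sketch using Kibana's @kbn/config-schema package; the setting name maxConcurrentAgentRequests and its default are hypothetical, not whatever setting was eventually shipped.

```ts
import { schema, TypeOf } from '@kbn/config-schema';

// Hypothetical plugin config: exposes a cap on concurrent enroll/check-in
// requests via kibana.yml. 0 means "no limit".
export const config = {
  schema: schema.object({
    maxConcurrentAgentRequests: schema.number({ defaultValue: 0 }),
  }),
};

export type FleetLimitsConfig = TypeOf<typeof config.schema>;
```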
@kobelb I realize Fleet is using Kibana in a different way than is typical, but I think the stress testing is highlighting issues which affect all/many plugins. The gains @rudolf got from #68221 make things faster for everyone.

One option that I've mentioned while talking with @joshdover and @rudolf is using Hapi's load options. I played with some event loop values and it definitely helps when waves of requests come in. The Stack Monitoring view (or maybe something elsewhere) handled the 503 without any code changes.

I think Platform applying (or providing) some guardrails / circuit-breakers makes sense and would be welcome. A default (max?) event loop delay of 500ms or 1s seems reasonable. I can open or comment more on a separate ticket if you like.
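The comment above appears to refer to Hapi's server-level load limits, which shed load with a 503 once the process is under too much pressure. A minimal sketch, assuming the Hapi 18 API Kibana used around this time; the values are illustrative, not recommendations.

```ts
import Hapi from '@hapi/hapi';

// Hapi samples process load on an interval and, once any limit is exceeded,
// answers new requests with 503 Service Unavailable instead of queuing them.
const server = Hapi.server({
  port: 5601,
  load: {
    sampleInterval: 1000,         // sample event loop delay / memory every second
    maxEventLoopDelay: 500,       // start rejecting once the event loop lags > 500 ms
    maxHeapUsedBytes: 1073741824, // optional heap ceiling (1 GiB here)
  },
});
```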
Exposing some of these settings via equivalent kibana.yml settings is possible. Finding reasonable defaults for these global load settings is potentially challenging, though. For example, it really depends on the usage pattern of Kibana what event loop delay should prevent the HTTP server from accepting any new HTTP requests. If the event loop delay is being caused by background tasks in Kibana, we risk not accepting any new HTTP requests at all. This is potentially problematic because it prevents users from using Kibana itself to determine what is going awry. The user is potentially willing to accept a one-second delay imposed by the event loop as long as they can continue using Kibana. A similar situation applies to the number of concurrent requests, which we potentially don't want to expose because Hapi dropped support for it; it really depends on the resources available to Kibana how many concurrent requests are acceptable.

I also don't think we'd want to solely rely on settings that apply to all of Kibana's HTTP traffic in this situation. Kibana commonly serves HTTP requests for consumers besides Fleet. In this configuration, we'll want to prevent Fleet from starving Kibana of the resources that it requires to continue serving end-user initiated HTTP requests.
I totally understand. I didn't mention it in this comment, but I've said as much when talking to Josh & Rudolf.
This is Node, so we're all in the same event loop together. I'm hoping to avoid new settings (event loop delay, concurrent connections) and new behavior for existing flows (returning 429 or 503) and concentrate on identifying bottlenecks that affect everyone, so we make things fast(er) by default.
I'm not aware of any other plugins with similar requirements.
I think we could experiment with implementing a wrapper for the route handlers that Fleet registers. This would allow us to experiment with these safeguards and to find good defaults without affecting the rest of Kibana. One reason not to apply these safeguards to all of Kibana is that some APIs should not be affected by them at all.
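A hedged sketch of what such a wrapper might look like for the routes Fleet registers; the helper name, generics, and limit handling are assumptions, not the actual implementation.

```ts
import type { RequestHandler } from 'src/core/server';

// Hypothetical wrapper that only Fleet routes would opt into: it tracks
// in-flight requests and answers 429 once a ceiling is reached, leaving the
// rest of Kibana's routes untouched.
export function withConcurrencyLimit<P, Q, B>(
  limit: number,
  handler: RequestHandler<P, Q, B>
): RequestHandler<P, Q, B> {
  let inFlight = 0;
  return async (context, request, response) => {
    if (inFlight >= limit) {
      return response.customError({
        statusCode: 429,
        body: { message: 'Too many concurrent agent requests' },
      });
    }
    inFlight++;
    try {
      return await handler(context, request, response);
    } finally {
      inFlight--; // always release the slot, even if the handler throws
    }
  };
}
```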
The more I've thought about it, the less I'd recommend using the event-loop delay as a signal for when Kibana should start rejecting HTTP requests, even just for Agents. Any number of factors could lead to the event loop being delayed, and there's no certainty that rejecting HTTP requests will help remedy the situation. If Kibana doesn't fall over and is able to eventually reply to the HTTP request, even when Kibana is under heavy load, it creates a better experience for all clients. Are you all seeing the Kibana process crash when there are too many Elastic Agents enrolling or checking in?
@kobelb I don't think we've seen an outright crash, but Kibana becomes overloaded to the point where it's no longer possible to use it.

@jfsiii To start with something simple, could we "just" count how many enrollments are currently ongoing and return a 429 if the number of concurrent enrollment requests goes above X? X could be a Fleet configuration setting in kibana.yml. We could do the same for the "check in" part of the check-in route. After we've handled the check-in request and the Agent is in the long-polling state, we should decrement the counter to allow more agents to check in.

And then we also need to teach Elastic Agent to add a reasonable amount of jitter to its retries. It's not ideal, but maybe it's a good enough start? cc @nchaulet
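On the Agent side, "a reasonable amount of jitter" usually means randomizing the retry delay so a wave of rejected check-ins doesn't come back in lockstep. A small illustrative sketch; the function and bounds are made up, and Elastic Agent itself is written in Go rather than TypeScript.

```ts
// Exponential backoff with "full jitter": each retry waits a random amount of
// time between 0 and an exponentially growing ceiling, capped at maxMs.
function retryDelayMs(attempt: number, baseMs = 1000, maxMs = 5 * 60 * 1000): number {
  const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// attempt 0 -> 0..1s, attempt 3 -> 0..8s, attempt 10+ -> 0..5min
```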
Despite Kibana becoming overloaded to the point where it's no longer possible to use it, I think this is a lot better than Kibana crashing. If Kibana is deployed in a high-availability configuration behind a load balancer that properly does health checks, the load balancer should be able to detect that Kibana has become overloaded and either redirect or reject new HTTP requests if there is no more capacity.
Yes, and that would be a bad experience :) Meanwhile, users cannot use Kibana and the whole system will feel "frozen up". Only if you go and kill your agents or dramatically scale up the number of Kibana instances will you get control back. If you've just rolled out Agent on 20k desktop machines, it's not necessarily a simple task to roll those back.
I agree. There are a few ways in which we can prevent this from happening, and I stand by the mitigation and reasoning I previously expressed in #67987 (comment). Perhaps we're in agreement here, and I'm just misunderstanding.
@kobelb OK. We'd like to do something for 7.9. In theory we can just do something on the Fleet side, but there we have no control over when the API keys are checked. Ideally, we'd reject traffic before Kibana starts checking the API key, so rejection is as cheap as possible. Could you work with us to allow us to hook in at the right place?
That would be before the Fleet/Kibana route code runs? It needs to happen really early in the request execution.
I put up a quick proof-of-concept here. There are some open questions, which I've asked various Kibana Platform team members for feedback on.
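Since the stated goal is to reject traffic before the API key is checked, such a proof-of-concept would presumably hook in before authentication. A rough sketch of that idea using core's registerOnPreAuth; the path prefix, ceiling, and counter wiring are assumptions, and the bookkeeping that increments/decrements the counter is omitted.

```ts
import type { CoreSetup } from 'src/core/server';

const MAX_IN_FLIGHT = 250; // illustrative ceiling; a real value would be configurable
let inFlight = 0;          // incrementing/decrementing as requests start/finish is omitted here

export function setupAgentLoadShedding(core: CoreSetup) {
  core.http.registerOnPreAuth((request, response, toolkit) => {
    // Assumed path prefix for agent enroll/check-in traffic at the time.
    const isAgentCall =
      request.url.pathname?.startsWith('/api/ingest_manager/fleet/agents') ?? false;

    if (isAgentCall && inFlight >= MAX_IN_FLIGHT) {
      // Rejecting here is cheap: authentication (API key validation) has not run yet.
      return response.customError({
        statusCode: 429,
        body: { message: 'Too many concurrent agent requests, retry with backoff' },
      });
    }
    return toolkit.next();
  });
}
```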
@jfsiii Did you do some tests with the above proof-of-concept?
@ph I did, and am looking into some issues I found. My wave-of-connections test eventually got to a place where we were always over the limit. I think it has to do with catching/decrementing on errors, but I didn't get far into it. The example is clear/simple, so it shouldn't be bad. There was also an increase in the event loop lag, but I wanted to get it running correctly before I put too much thought into what I see in Monitoring.
@jfsiii Is it optimistic to say that we can have something simple for 7.9?
@ph I think we can have something running as kobelb intended very shortly (in time for 7.9). However, one key aspect of that PR is a new Platform lifecycle method. I'll do a version which uses the existing platform code, but ideally we'd also have the new lifecycle methods available.
@ph #71221 uses the new/updated lifecycle methods from Platform, but (a) there are some small improvements still pending from Platform and (b) I need to push the code I cleaned up from the POC. To ensure we have some mechanism in place for 7.9, I'll open a new PR today which does the same count/rejection using existing lifecycle methods. That way our team can approve & merge it before feature freeze.
Thanks @jfsiii
This is done. Let's open separate issues for any follow-ups.
When you first enroll a large fleet of Elastic Agents, or when your workforce turns on their personal computers within a short timeframe, we risk a stampede of agents talking to Kibana and negatively impacting its performance.
We have to investigate ways to mitigate the problem. For example, we might be able to add protection at the server level to reject an excessive number of simultaneous incoming connections from Elastic Agent.