[alerting] API key lost when frequently disabling / enabling a rule #106292
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
I wasn't able to narrow down which function is actually throwing the exception that we're not catching - I'm just assuming it is alerting code. So that's not good - seems like we need to find it and add a try/catch. However, I was able to prevent the crash from happening by adding some code here: kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts, lines 460 to 465 in 4e26bfe.
Adding the following code after the try/catch and before the call to …:

```ts
if (apiKey == null) {
  throw new Error('api key is not available');
}
```

At one point, I had gotten a rule into such a state that Kibana would crash whenever it tried to execute the rule (or could it even be the action - not sure!). Adding this additional check and throw seems to prevent Kibana from crashing. It does log the following lines to the console though:
Oddly, I'm also seeing this message multiple times, as if there are multiple task records for this rule running. And occasionally I'm seeing the following, which would also kind of be a clue that the alert is being run multiple times, close together:
Checking the task manager index, it appears there are 9 task records that got created for this one alert, somehow. So it seems like maybe not just a race condition with the api key / enable/disable getting out of sync, but also with task creation. I'm going to start with a clean set of indices and the new …
Well dang it, even with the additional throw, I'm seeing additional task manager docs getting added. I can disable the alert, which will remove one of the docs, but the others remain and will cause the alert to continue to execute, even though it's disabled. Or at least it tries to - it now logs the error about the missing api key which I added. So it seems like this is a bigger problem. I'm worried that we could potentially see other cases of this, not just with the rapid enable/disable (that we'll fix in issue #106276). Feels like we need to figure out how to make all the things that happen in enable/disable atomic, to keep things consistent.
As part of this PR, Kibana security suggested that in the case of decryption failures we strip the encrypted attribute before migrating (vs migrating the original, unencrypted document). Once I did that, I was seeing the same … In that PR I added a special case for when …
I also spent some time trying to figure out where exactly that exception was being thrown but was unsuccessful. It seems to be happening somewhere in the rules client get function, possibly during …
@ymao1 have you tried running Kibana with …?
I have not... I'll give that a shot, thanks!
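As an aside, one generic way to surface the stack behind an unhandled rejection while reproducing this (independent of whatever flag was suggested above) is to register a process-level handler; a minimal sketch:

```ts
// Generic Node.js debugging aid for locating unhandled rejections.
// This is a suggested repro aid, not existing Kibana code.
process.on('unhandledRejection', (reason: unknown, promise: Promise<unknown>) => {
  // reason is usually the thrown Error, so its stack points at the offending call
  console.error('Unhandled rejection at:', promise);
  console.error('Reason:', reason instanceof Error ? reason.stack : reason);
});
```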
It looks like the …
If I wrap that in a try/catch, I no longer see the …

I notice that there is a difference in the request between a rule that has an … With api key field:
no api key field:
I ended up putting a …, which yielded this:
Looking at the place where …: kibana/x-pack/plugins/alerting/server/authorization/alerting_authorization.ts, lines 105 to 107 in e952f9a
so, missing a …
I think what's going on in that stack trace is that we end up calling …
The line for … But what you're saying makes sense.
Yes! @ymao1 is correct that we're moving to use the synchronous …
I should also note that the stack trace in #106292 (comment) may be a red herring. It looks like there were a number of calls to check privileges; here's another stack:
Could probably try to do a little more precise try/catching around the checkPrivileges call to narrow down the exact stack ... And I tend to think of things before/after … Lastly, not sure if this helped; it didn't seem to do anything before I added the explicit …
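One way to do that narrowing, sketched here under the assumption that there is an awaited checkPrivileges(...) call we can wrap (the signatures below are illustrative, not the actual Kibana security API):

```ts
// Illustrative sketch: wrap the suspect call so the failing stack is logged at the
// call site before rethrowing. checkPrivileges, privilegeSet, and logger are placeholders.
async function checkPrivilegesWithTrace(
  checkPrivileges: (privileges: string[]) => Promise<unknown>,
  privilegeSet: string[],
  logger: { error: (msg: string) => void }
): Promise<unknown> {
  try {
    return await checkPrivileges(privilegeSet);
  } catch (err) {
    logger.error(`checkPrivileges failed here: ${(err as Error).stack}`);
    throw err; // preserve the original failure for callers
  }
}
```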
I'm thinking that if we can cut down the number of "things" we are creating / destroying over the various parts of the alerting rule lifecycle, we'll have fewer chances of race conditions. Even if we can't do it over the whole lifecycle, just enable/disable would probably be a win. There's a separate issue open discussing changing how often we regenerate API keys - #106876. For enable/disable, it would mean that disable would not invalidate the API key, but leave it alone, and enable wouldn't then have to generate a new API key. The downside is that we use this disable/re-enable flow today specifically to regenerate a new API key, so we'd need to have users use the "update api key" HTTP endpoint instead, and provide somewhere to do this in the UI.

The next thing that it would be nice to not have to re-create - task manager tasks! I'll have to poke around and see if it's possible to "disable" a task, as opposed to deleting and re-creating it. The situation I had with multiple tasks associated with the same rule was one that we don't really have any way at all to remediate, without explicitly deleting documents from the task manager index.
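A rough sketch of what that reduced-churn enable/disable flow could look like - the helper names here are entirely hypothetical, not the current rules client or task manager APIs:

```ts
// Hypothetical sketch of an enable/disable flow that avoids destroying state.
// setRuleEnabled / setTaskEnabled are imaginary helpers, not real Kibana APIs.
declare function setRuleEnabled(ruleId: string, enabled: boolean): Promise<void>;
declare function setTaskEnabled(ruleId: string, enabled: boolean): Promise<void>;

export async function disableRule(ruleId: string): Promise<void> {
  // keep the existing API key: users who want a fresh key would call a separate
  // "update api key" endpoint instead of disabling and re-enabling
  await setRuleEnabled(ruleId, false);
  // pause the task document instead of deleting it, so there is nothing to
  // re-create (and nothing to get duplicated) on re-enable
  await setTaskEnabled(ruleId, false);
}

export async function enableRule(ruleId: string): Promise<void> {
  await setRuleEnabled(ruleId, true);
  await setTaskEnabled(ruleId, true);
}
```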
I found another way to make the rule run while losing the API key. If you export a running rule and import it with the overwrite option, it will overwrite the rule object with enabled:false and no apiKey. Since it's a simple overwrite, the task in task manager remains, thinking the rule is still running. The next time the task runs, it will run the rule without an apiKey and fail.
Having chatted with @mikecote I now realise these are unrelated problems... so changing the API key behaviour in enable/disable will not help us here as much as we'd need it to.
I think the main problem here is that rules are running when the rule is disabled. From what I recall, rules can either not have an API key and be disabled, or have an API key and be enabled. If we add guardrails to ensure rules don't run when disabled, I believe this problem goes away. We could also consider my scenario above (#106292 (comment)) when adding guardrails, to ensure the task id is the same as the one associated with the rule (…).
I agree with most of this comment. The second comment should be moot, as this recent PR should ensure we don't have multiple tasks for the same rule. We need to check whether the rule is enabled before attempting to execute it, and we can modify this function to also return the … The outstanding question is what we do in this scenario. Do we treat this as an error state that goes into the event log as such? Should we add a new error state for "disabled" and log it as such?
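A minimal sketch of the kind of guardrail being discussed, assuming the task runner has the loaded rule and the current task instance in hand - the types, the eventLogger, and the "disabled" status are hypothetical, not existing alerting behaviour:

```ts
// Hypothetical guardrail sketch; none of these types or the eventLogger shape
// correspond to the actual Kibana task runner implementation.
interface RuleDoc { enabled: boolean; scheduledTaskId?: string }
interface TaskInstance { id: string }
interface EventLogger { logExecution(event: { status: string; reason?: string }): void }

function shouldExecuteRule(rule: RuleDoc, task: TaskInstance, eventLogger: EventLogger): boolean {
  // don't run a rule that is disabled, even if an orphaned task still fires
  if (!rule.enabled) {
    eventLogger.logExecution({ status: 'disabled' }); // or a plain "error" status
    return false;
  }
  // guard against stale / duplicate tasks: only the task the rule points at may run it
  if (rule.scheduledTaskId !== task.id) {
    eventLogger.logExecution({ status: 'error', reason: 'task is not associated with this rule' });
    return false;
  }
  return true;
}
```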
See issue #106276 for the repro. Click the enable / disable button rapidly and repeatedly, and you will likely see the following error in the console (some lines deleted from the original error):
Turning on some debug logging, you can also see the following:
Note the lines

fetching api key for alert

are from some `console.log()`s I added here, which seems to imply the API key is null: kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts, lines 111 to 124 in ca17b2d
I'm not sure about that line

The following attributes of saved object "alert,..." should have been decrypted: apiKey, but found only:

Is that expected for a disabled alert? Do we see that all the time? Or did something get broken somewhere?

I'm not sure if the simple fix is to treat this as a further indication that the rule is disabled. If we got this far into rule execution because we think the rule is enabled, but happen to get a null api key, indicating that it is disabled, then just treat it as disabled at that point.
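If that simple fix were taken, it might look roughly like the following in the task runner - a sketch only, with placeholder types and a made-up "skipped" outcome rather than anything the alerting framework actually defines:

```ts
// Illustrative only: interpret a null apiKey as "this rule is effectively disabled"
// and skip execution, instead of throwing and risking an unhandled rejection.
interface ExecutionOutcome { status: 'ok' | 'skipped'; reason?: string }

function handleMissingApiKey(
  apiKey: string | null,
  ruleId: string,
  logger: { warn: (msg: string) => void }
): ExecutionOutcome | null {
  if (apiKey != null) {
    return null; // API key present: carry on with normal execution
  }
  // treat the missing API key as a further indication that the rule is disabled
  logger.warn(`rule ${ruleId} has no API key; treating it as disabled and skipping this execution`);
  return { status: 'skipped', reason: 'rule disabled (no api key)' };
}
```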
But I'm also wondering if the rule is inconsistent at this point. Maybe the rule is enabled, but the apiKey is null? So we'll keep trying to run it, keep failing because of the missing API key, and ... will the user notice?
I'm also not completely sure who should be trapping that exception that is resulting in the unhandled promise rejection - that certainly needs to be fixed as well.