[alerting] API key lost when frequently disabling / enabling a rule #106292
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
I wasn't able to narrow down which function is actually throwing the exception that we're not catching - I'm just assuming it is alerting code. So that's not good - seems like we need to find it and add a try/catch. However, I was able to prevent the crash from happening by adding some code here: kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts, lines 460 to 465 in 4e26bfe.
Adding the following code after the try/catch and before the call to …:

```ts
if (apiKey == null) {
  throw new Error('api key is not available');
}
```

At one point, I had gotten a rule into such a state that Kibana would crash whenever it tried to execute the rule (or could it even be the action - not sure!). Adding this additional check and throw seems to prevent Kibana from crashing. It does log the following lines to the console though:
Oddly, I'm also seeing this message multiple times, as if there are multiple task records for this rule running. And occasionally I'm seeing the following, which would also kind of be a clue that the alert is being run multiple times, close together:
Checking the task manager index, it appears there are 9 task records that got created for this one alert, somehow. So it seems like maybe not just a race condition with the api key / enable/disable getting out of sync, but also with task creation. I'm going to start with a clean set of indices and the new …
Well dang it, even with the additional throw, I'm seeing additional task manager docs getting added. I can disable the alert, which will remove one of the docs, but the others remain and will cause the alert to continue to execute, even though it's disabled. Or at least it tries to - it now logs the error about the missing api key which I added. So it seems like this is a bigger problem. I'm worried that we could potentially see other cases of this, not just with the rapid enable/disable (that we'll fix in issue #106276). Feels like we need to figure out how to make all the things that happen in enable/disable atomic, to keep things consistent.
As part of this PR, Kibana security suggested that in the case of decryption failures we strip the encrypted attribute before migrating (vs migrating the original, unencrypted document). Once I did that, I was seeing the same … In that PR I added a special case for when …
I also spent some time trying to figure out where exactly that exception was being thrown but was unsuccessful. It seems to be happening somewhere in the rules client get function, possibly during …
@ymao1 have you tried running Kibana with …?
I have not... I'll give that a shot, thanks!
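As an aside, one generic way to surface the stack behind an unhandled rejection while reproducing this (independent of whatever flag was suggested above) is to register a process-level handler; a minimal sketch:

```ts
// Generic Node.js debugging aid for locating unhandled rejections.
// This is a suggested repro aid, not existing Kibana code.
process.on('unhandledRejection', (reason: unknown, promise: Promise<unknown>) => {
  // reason is usually the thrown Error, so its stack points at the offending call
  console.error('Unhandled rejection at:', promise);
  console.error('Reason:', reason instanceof Error ? reason.stack : reason);
});
```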
It looks like the …
If I wrap that in a try/catch, I no longer see the …

I notice that there is a difference in the request between a rule that has an … With api key field:
no api key field:
I ended up putting a …, which yielded this:
Looking at the place where …: kibana/x-pack/plugins/alerting/server/authorization/alerting_authorization.ts, lines 105 to 107 in e952f9a
so, missing a …
I think what's going on in that stack trace is that we end up calling …
The line for … But what you're saying makes sense.
Yes! @ymao1 is correct that we're moving to use the synchronous …
I should also note that the stack trace in #106292 (comment) may be a red herring. It looks like there were a number of calls to check privileges; here's another stack:
Could probably try to do a little more precise try/catching around the checkPrivileges call to narrow down the exact stack ... And I tend to think of things before/after … Lastly, not sure if this helped; it didn't seem to do anything before I added the explicit …
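One way to do that narrowing, sketched here under the assumption that there is an awaited checkPrivileges(...) call we can wrap (the signatures below are illustrative, not the actual Kibana security API):

```ts
// Illustrative sketch: wrap the suspect call so the failing stack is logged at the
// call site before rethrowing. checkPrivileges, privilegeSet, and logger are placeholders.
async function checkPrivilegesWithTrace(
  checkPrivileges: (privileges: string[]) => Promise<unknown>,
  privilegeSet: string[],
  logger: { error: (msg: string) => void }
): Promise<unknown> {
  try {
    return await checkPrivileges(privilegeSet);
  } catch (err) {
    logger.error(`checkPrivileges failed here: ${(err as Error).stack}`);
    throw err; // preserve the original failure for callers
  }
}
```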
I'm thinking that if we can cut down the number of "things" we are creating / destroying over the various parts of the alerting rule lifecycle, we'll have fewer chances of race conditions. Even if we can't do it over the whole lifecycle, just enable/disable would probably be a win. There's a separate issue open discussing changing how often we regenerate API keys - #106876. For enable/disable, it would mean that disable would not invalidate the API key, but leave it alone, and enable wouldn't then have to generate a new API key. The downside is that we use this disable/re-enable flow today specifically to regenerate a new API key, so we'd need to have users use the "update api key" HTTP endpoint instead, and provide somewhere to do this in the UI.

The next thing that it would be nice to not have to re-create - task manager tasks! I'll have to poke around and see if it's possible to "disable" a task, as opposed to deleting and re-creating it. The situation I had with multiple tasks associated with the same rule was one that we don't really have any way at all to remediate, without explicitly deleting documents from the task manager index.
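A rough sketch of what that reduced-churn enable/disable flow could look like - the helper names here are entirely hypothetical, not the current rules client or task manager APIs:

```ts
// Hypothetical sketch of an enable/disable flow that avoids destroying state.
// setRuleEnabled / setTaskEnabled are imaginary helpers, not real Kibana APIs.
declare function setRuleEnabled(ruleId: string, enabled: boolean): Promise<void>;
declare function setTaskEnabled(ruleId: string, enabled: boolean): Promise<void>;

export async function disableRule(ruleId: string): Promise<void> {
  // keep the existing API key: users who want a fresh key would call a separate
  // "update api key" endpoint instead of disabling and re-enabling
  await setRuleEnabled(ruleId, false);
  // pause the task document instead of deleting it, so there is nothing to
  // re-create (and nothing to get duplicated) on re-enable
  await setTaskEnabled(ruleId, false);
}

export async function enableRule(ruleId: string): Promise<void> {
  await setRuleEnabled(ruleId, true);
  await setTaskEnabled(ruleId, true);
}
```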
I found another way to make the rule run while losing the API key. If you export a running rule and import it with the overwrite option, it will overwrite the rule object with enabled:false and no apiKey. Since it's a simple overwrite, the task in task manager remains, thinking the rule is still running. The next time the task runs, it will run the rule without an apiKey and fail.
Having chatted with @mikecote I now realise these are unrelated problems... so changing the API key behaviour in enable/disable will not help us here as much as we'd need it to.
I think the main problem here is that rules are running when the rule is disabled. From what I recall, rules can either not have an API key and be disabled, or have an API key and be enabled. If we add guardrails to ensure rules don't run when disabled, I believe this problem goes away. We could also consider my scenario above (#106292 (comment)) when adding guardrails, to ensure the task id is the same as the one associated with the rule (…).
I agree with most of this comment. The second comment should be moot, as this recent PR should ensure we don't have multiple tasks for the same rule. We need to check whether the rule is enabled before attempting to execute it, and we can modify this function to also return the … The outstanding question is what we do in this scenario. Do we treat this as an error state that goes into the event log as such? Should we add a new error state for "disabled" and log it as such?
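A minimal sketch of the kind of guardrail being discussed, assuming the task runner has the loaded rule and the current task instance in hand - the types, the eventLogger, and the "disabled" status are hypothetical, not existing alerting behaviour:

```ts
// Hypothetical guardrail sketch; none of these types or the eventLogger shape
// correspond to the actual Kibana task runner implementation.
interface RuleDoc { enabled: boolean; scheduledTaskId?: string }
interface TaskInstance { id: string }
interface EventLogger { logExecution(event: { status: string; reason?: string }): void }

function shouldExecuteRule(rule: RuleDoc, task: TaskInstance, eventLogger: EventLogger): boolean {
  // don't run a rule that is disabled, even if an orphaned task still fires
  if (!rule.enabled) {
    eventLogger.logExecution({ status: 'disabled' }); // or a plain "error" status
    return false;
  }
  // guard against stale / duplicate tasks: only the task the rule points at may run it
  if (rule.scheduledTaskId !== task.id) {
    eventLogger.logExecution({ status: 'error', reason: 'task is not associated with this rule' });
    return false;
  }
  return true;
}
```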
See issue #106276 for the repro. Click the enable / disable button rapidly and repeatedly, and you will likely see the following error in the console (some lines deleted from the original error):
Turning on some debug logging, you can also see the following:
Note the lines

fetching api key for alert

are from some `console.log()`s I added here, which seems to imply the API key is null: kibana/x-pack/plugins/alerting/server/task_runner/task_runner.ts, lines 111 to 124 in ca17b2d
I'm not sure about that line

The following attributes of saved object "alert,..." should have been decrypted: apiKey, but found only:

Is that expected for a disabled alert? Do we see that all the time? Or did something get broken somewhere?

I'm not sure if the simple fix is to treat this as a further indication that the rule is disabled. If we got this far into rule execution because we think the rule is enabled, but happen to get a null api key, indicating that it is disabled, then just treat it as disabled at that point.
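If that simple fix were taken, it might look roughly like the following in the task runner - a sketch only, with placeholder types and a made-up "skipped" outcome rather than anything the alerting framework actually defines:

```ts
// Illustrative only: interpret a null apiKey as "this rule is effectively disabled"
// and skip execution, instead of throwing and risking an unhandled rejection.
interface ExecutionOutcome { status: 'ok' | 'skipped'; reason?: string }

function handleMissingApiKey(
  apiKey: string | null,
  ruleId: string,
  logger: { warn: (msg: string) => void }
): ExecutionOutcome | null {
  if (apiKey != null) {
    return null; // API key present: carry on with normal execution
  }
  // treat the missing API key as a further indication that the rule is disabled
  logger.warn(`rule ${ruleId} has no API key; treating it as disabled and skipping this execution`);
  return { status: 'skipped', reason: 'rule disabled (no api key)' };
}
```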
But I'm also wondering if the rule is inconsistent at this point. Maybe the rule is enabled, but the apiKey is null? So we'll keep trying to run it, keep failing because of the missing API key, and ... will the user notice?
I'm also not completely sure who should be trapping that exception that is resulting in the unhandled promise rejection - that certainly needs to be fixed as well.