
Kibana alerting acts strangely when Elasticsearch and/or Kibana clocks are out of sync #87664

Closed
mikecote opened this issue Jan 7, 2021 · 5 comments
Labels: discuss, estimate:needs-research, Feature:Alerting/RulesFramework, Feature:Alerting, Feature:Task Manager, resilience, Team:ResponseOps

Comments

mikecote (Contributor) commented Jan 7, 2021

There is ongoing work to document that alerting requires the clocks of all Elasticsearch and Kibana instances to be in sync (#81532). It would be nice to mitigate this problem, and also to avoid debugging such scenarios without knowing that clock drift is the cause.

@mikecote added the Feature:Alerting, Feature:Task Manager, and Team:ResponseOps labels Jan 7, 2021
elasticmachine (Contributor) commented:
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

mikecote (Contributor, Author) commented Jan 7, 2021

One way to get the date from Elasticsearch would be to do a call like:

GET */_search
{
  "size": 1, 
  "script_fields": {
    "now": {
      "script": "new Date().getTime()"
    }
  }
}

Some brain dump: I was thinking this could be used on task manager startup, checking that the date returned falls between the start and end times of that request to Elasticsearch; otherwise the clocks are not in sync. This approach would only validate the node that responded, and wouldn't catch the case where one of the other ES nodes is out of sync. For that, I was thinking this script / ES date lookup could be part of every task manager claim query, so we can make sure the responding node's clock is in sync with the Kibana instance claiming tasks.
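A minimal sketch of what that startup check could look like, assuming an @elastic/elasticsearch 7.x client; the helper name and tolerance value are illustrative, not from the Kibana code base:

import { Client } from '@elastic/elasticsearch';

// Illustrative only: compare the responding ES node's clock against the
// window of the request made from Kibana.
async function checkClockDrift(client: Client, toleranceMs = 5000): Promise<void> {
  const requestStart = Date.now();
  const { body } = await client.search({
    index: '*',
    body: {
      size: 1,
      script_fields: {
        now: { script: 'new Date().getTime()' },
      },
    },
  });
  const requestEnd = Date.now();

  // script_fields values come back as arrays on each hit.
  const esNow: number | undefined = body.hits.hits[0]?.fields?.now?.[0];
  if (esNow === undefined) {
    return; // no documents matched, so there is nothing to compare against
  }

  // If the ES timestamp falls outside the request window (plus some slack),
  // the responding node's clock and this Kibana's clock are likely out of sync.
  if (esNow < requestStart - toleranceMs || esNow > requestEnd + toleranceMs) {
    console.warn(
      `Possible clock drift: ES reports ${esNow}, Kibana request window was [${requestStart}, ${requestEnd}]`
    );
  }
}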

pmuellr (Member) commented Jan 13, 2021

I think we'd want to do it on every TM claim query - checking only at Kibana startup would miss too many cases. Although every claim cycle is probably too often, especially if it means an additional HTTP request. Or could we bundle this into one of our existing searches somehow, as an aggregation?
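A hedged sketch of what bundling it in might look like: a script-backed max metric aggregation attached to the existing claim query body (assuming the deployment allows inline scripts and that this script is permitted in an aggregation context; the aggregation name is illustrative):

// Illustrative addition to an existing claim query body, not actual Kibana code.
// The aggregation asks the responding node for its current clock, so no extra
// HTTP request is needed; the value comes back under aggregations.es_clock.value.
aggs: {
  es_clock: {
    max: {
      script: {
        source: 'new Date().getTime()',
      },
    },
  },
},

Kibana could then compare aggregations.es_clock.value against the request's start and end times, the same check as in the startup sketch above.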

And any script-based approach won't work if the customer has disabled scripts. I'd prefer to use a Date header in the HTTP response, but apparently ES doesn't add Date headers to responses. Getting it via a header would be one fewer HTTP request to ES. I wonder if we could add some option to requests (perhaps via a header) to tell ES to add a Date header to its responses.

I suspect we are seeing this in alerting because most other parts of Kibana don't really rely on ES's interpretation of now being tightly aligned with Kibana's clock. That will likely change in the future, which makes this more of a system problem, not just an alerting one.

pmuellr (Member) commented Jan 13, 2021

One "simple" way to fix the original issue is not to use now in our queries, instead replacing it with a literal date computed by Kibana (e.g., via Date.now()). It seems like the critical usages are in this module; here's an example:

must: [
  {
    bool: {
      should: [{ term: { 'task.status': 'running' } }, { term: { 'task.status': 'claiming' } }],
    },
  },
  { range: { 'task.retryAt': { lte: 'now' } } },
],
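For comparison, a hedged sketch of the same clause with now replaced by a timestamp computed on the Kibana side (the variable name is illustrative, not from the Kibana source; an ISO-8601 string works in a range query against a date-mapped field):

// Illustrative only: the timestamp is computed by Kibana instead of being
// resolved by Elasticsearch at query time.
const kibanaNow = new Date().toISOString();

must: [
  {
    bool: {
      should: [{ term: { 'task.status': 'running' } }, { term: { 'task.status': 'claiming' } }],
    },
  },
  { range: { 'task.retryAt': { lte: kibanaNow } } },
],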

It's kind of sweeping the dirt under the rug. You would certainly still see weird behaviour in a multi-Kibana deployment where the Kibana clocks are not in sync, but it would likely fix the problem in a single-Kibana deployment.

@gmmorris added the Feature:Alerting/RulesFramework label Jul 1, 2021
@gmmorris added the loe:needs-research and resilience labels Jul 14, 2021
@gmmorris added the estimate:needs-research label Aug 18, 2021
@gmmorris removed the loe:needs-research label Sep 2, 2021
@kobelb added the needs-team label Jan 31, 2022
@botelastic bot removed the needs-team label Jan 31, 2022
mikecote (Contributor, Author) commented:
Closing this issue, as clocks being out of sync would be a core problem rather than an alerting-specific one, and we haven't seen it happen yet.
