
7.11.0/7.12.0 upgrade migrations take very long to complete or timeout due to huge number of saved objects #91869

Closed
rudolf opened this issue Feb 18, 2021 · 7 comments · Fixed by #92188 or #96690
Labels
bug (Fixes for quality problems that affect the customer experience), Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@rudolf
Contributor

rudolf commented Feb 18, 2021

Issue

Some users have had migrations take very long to complete (more than an hour) due to a huge number of fleet-agent-events, action_task_params, or task documents (> 100k documents). To check how many of these documents you have, run the following aggregation:

GET .kibana,.kibana_task_manager/_search?filter_path=aggregations
{
  "aggs": {
    "saved_object_type": {
      "terms": {"field": "type"}
    }
  }
}
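
The response (trimmed down to the aggregation by filter_path) contains one bucket per saved object type. The counts below are made up purely for illustration; in an affected deployment one or two types will dominate:

{
  "aggregations": {
    "saved_object_type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 130,
      "buckets": [
        { "key": "fleet-agent-events", "doc_count": 1250000 },
        { "key": "task", "doc_count": 150000 },
        { "key": "index-pattern", "doc_count": 46 }
      ]
    }
  }
}

If any of these types dominate with very large counts, the cleanup steps below apply.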
  1. Shut down all Kibana instances
  2. If you've had a failed upgrade, perform a rollback to the previously working version, but don't start Kibana up again:
    1. Rollback 7.11: https://www.elastic.co/guide/en/kibana/7.11/upgrade-migrations.html#upgrade-migrations-rolling-back
    2. Rollback 7.12+: https://www.elastic.co/guide/en/kibana/7.12/upgrade-migrations.html#upgrade-migrations-rolling-back
  3. Delete the saved objects (ensure Kibana isn't running first):
    1. fleet-agent-events: These saved objects are no longer used by the Fleet plugin and can safely be deleted.
    POST .kibana/_delete_by_query?conflicts=proceed&wait_for_completion=false
    {
      "query": {
        "bool": {
          "must": {
            "term": {
              "type": "fleet-agent-events"
            }
          }
        }
      }
    }
    # Check for the completion of the task by using the returned task id with GET _tasks/<id> (see the example after this list)
    
    2. action_task_params: Deleting these could cause scheduled tasks that have not yet run to fail once.
    POST .kibana/_delete_by_query?conflicts=proceed&wait_for_completion=false
    {
      "query": {
        "bool": {
          "must": {
            "term": {
              "type": "action_task_params"
            }
          }
        }
      }
    }
    # Check for the completion of the task by using the returned task id with GET _tasks/<id>
    
    3. task: Before deleting failed tasks, it's useful to understand which actions might be triggering the high number of failed tasks by running:
    GET .kibana_task_manager/_search
    {
      "query": {
        "bool": {
          "must": {
            "term": {
              "task.status": "failed"
            }
          }
        }
      },
      "size": 0,
      "aggs": {
        "types": { "terms": { "field": "task.taskType" } }
      }
    }
    
    Ensure that the reason for the failed tasks has been addressed to prevent more failed tasks from building up. Then delete all failed tasks with:
    POST .kibana_task_manager/_delete_by_query?conflicts=proceed&wait_for_completion=false
    {
      "query": {
        "bool": {
          "must": {
            "term": {
              "task.status": "failed"
            }
          }
        }
      }
    }
    # Check for the completion of the task by using the returned task id with GET _tasks/<id>
    
  4. Upgrade Kibana
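
Regarding the "check for completion" notes in step 3: because wait_for_completion=false is set, each _delete_by_query call returns immediately with a task id. A minimal sketch of polling it (the task id below is just a placeholder):

GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345
# The response includes "completed": true once the delete has finished,
# and task.status.deleted shows how many documents have been removed so far.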
@rudolf rudolf added the bug and Team:Fleet labels Feb 18, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@ruflin
Member

ruflin commented Feb 19, 2021

@nchaulet How often are these fleet-agent-events created? What actions trigger these?

@nchaulet
Member

@nchaulet How often are these fleet-agent-events created? What actions trigger these?

Every time an agent checks in with a new status change, so that could be a lot.

@nchaulet
Member

We should try the migration from 7.11 to 7.12 to see whether the v2 migrations fix this or not.

@reighnman

I had a support case open for this (7.10 to 7.11 migration failing) before finding this thread. Even with the batch size set to 1000, it took several hours for the migration to complete. After waiting an hour I assumed it was hanging, so I cleaned up the migration indices and restarted, but just waiting it out turned out to be the answer.

Our Kibana upgrades/migrations had never taken longer than a few minutes, but this is the first upgrade since we added a small Fleet deployment to the environment.

@rudolf
Contributor Author

rudolf commented Feb 19, 2021

@reighnman Can you share the output of:

GET .kibana/_search?filter_path=aggregations
{
  "aggs": {
    "saved_object_type": {
      "terms": {"field": "type"}
    }
  }
}

(if you have already deleted fleet-agent-events and finished the upgrade, you would have to run the aggregation over an older index like .kibana_N)

If you have enough fleet-agent-events you might need an even bigger batchSize to get it to complete in a reasonable amount of time, but the higher the batchSize, the more load is placed on Elasticsearch.
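
For reference, the batch size being discussed here is the migrations.batchSize setting in kibana.yml; a minimal sketch with an illustrative value (tune it to what your cluster can handle):

# kibana.yml
migrations.batchSize: 1000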

@reighnman

reighnman commented Feb 19, 2021

This is with the agent running on 8 servers for about a month using the windows integration (7.10).

$result.aggregations.saved_object_type.buckets

key                     doc_count
---                     ---------
fleet-agent-events       33421552
visualization                 883
application_usage_daily       242
ui-metric                     181
lens-ui-telemetry             147
dashboard                      99
search                         91
index-pattern                  46
fleet-agent-actions            30
alert                          23

Looking at .kibana_5 and the current .kibana_6 post-upgrade, the results are about the same.

@rudolf rudolf changed the title from "7.11.0 saved object migrations take very long to complete due to huge number of fleet-agent-events" to "7.11.0/7.12.0 upgrade migrations take very long to complete or timeout due to huge number of saved objects" Apr 12, 2021