
7.11.0/7.12.0 upgrade migrations take very long to complete or timeout due to huge number of saved objects #91869

Closed
rudolf opened this issue Feb 18, 2021 · 7 comments · Fixed by #92188 or #96690
Labels
bug (Fixes for quality problems that affect the customer experience), Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@rudolf
Contributor

rudolf commented Feb 18, 2021

Issue

Some users have had migrations take very long to complete (more than an hour) due to a huge number of fleet-agent-events, action_task_params, or task documents (> 100k documents). To check how many of these documents you have, run the following aggregation:

GET .kibana,.kibana_task_manager/_search?filter_path=aggregations
{
  "aggs": {
    "saved_object_type": {
      "terms": {"field": "type"}
    }
  }
}
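
The response (trimmed down to the aggregation by filter_path) contains one bucket per saved object type. The counts below are made up purely for illustration; in an affected deployment one or two types will dominate:

{
  "aggregations": {
    "saved_object_type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 130,
      "buckets": [
        { "key": "fleet-agent-events", "doc_count": 1250000 },
        { "key": "task", "doc_count": 150000 },
        { "key": "index-pattern", "doc_count": 46 }
      ]
    }
  }
}

If any of these types dominate with very large counts, the cleanup steps below apply.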
  1. Shut down all Kibana instances
  2. If you've had a failed upgrade, perform a rollback to the previously working version, but don't start Kibana up again:
    1. Rollback 7.11: https://www.elastic.co/guide/en/kibana/7.11/upgrade-migrations.html#upgrade-migrations-rolling-back
    2. Rollback 7.12+: https://www.elastic.co/guide/en/kibana/7.12/upgrade-migrations.html#upgrade-migrations-rolling-back
  3. Delete the saved objects (ensure Kibana isn't running first):
    1. fleet-agent-events: These saved objects are no longer used by the Fleet plugin and can safely be deleted.
    POST .kibana/_delete_by_query?conflicts=proceed&wait_for_completion=false
    {
      "query": {
        "bool": {
          "must": {
            "term": {
              "type": "fleet-agent-events"
            }
          }
        }
      }
    }
    # Check for the completion of the task by using the returned task id with GET _tasks/<id> (see the example after this list)
    
    2. action_task_params: Deleting these could cause scheduled tasks that have not yet run to fail once.
    POST .kibana/_delete_by_query?conflicts=proceed&wait_for_completion=false
    {
      "query": {
        "bool": {
          "must": {
            "term": {
              "type": "action_task_params"
            }
          }
        }
      }
    }
    # Check for the completion of the task by using the returned task id with GET _tasks/<id>
    
    3. task: Before deleting failed tasks, it's useful to understand which actions might be triggering the high number of failed tasks by running:
    GET .kibana_task_manager/_search
    {
      "query": {
        "bool": {
          "must": {
            "term": {
              "task.status": "failed"
            }
          }
        }
      },
      "size": 0,
      "aggs": {
        "types": { "terms": { "field": "task.taskType" } }
      }
    }
    
    Ensure that the reason for the failed tasks has been addressed to prevent more failed tasks from building up. Then delete all failed tasks with:
    POST .kibana_task_manager/_delete_by_query?conflicts=proceed&wait_for_completion=false
    {
      "query": {
        "bool": {
          "must": {
            "term": {
              "task.status": "failed"
            }
          }
        }
      }
    }
    # Check for the completion of the task by using the returned task id with GET _tasks/<id>
    
  4. Upgrade Kibana
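
Regarding the "check for completion" notes in step 3: because wait_for_completion=false is set, each _delete_by_query call returns immediately with a task id. A minimal sketch of polling it (the task id below is just a placeholder):

GET _tasks/oTUltX4IQMOUUVeiohTt8A:12345
# The response includes "completed": true once the delete has finished,
# and task.status.deleted shows how many documents have been removed so far.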
@rudolf rudolf added the bug and Team:Fleet labels Feb 18, 2021
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@ruflin
Member

ruflin commented Feb 19, 2021

@nchaulet How often are these fleet-agent-events created? What actions trigger these?

@nchaulet
Member

@nchaulet How often are these fleet-agent-events created? What actions trigger these?

Every time an agent checks in with a new status change, so that could be a lot.

@nchaulet
Member

We should try the migration from 7.11 to 7.12 to see whether the v2 migrations fix this or not.

@reighnman

I had a support case open for this (7.10 to 7.11 migration failing) before finding this thread. Even with the batch size set to 1000, it took several hours for the migration to complete. After waiting an hour I assumed it was hanging, so I cleaned up the migration indices and restarted, but just waiting it out turned out to be the answer.

Our Kibana upgrades/migrations had never taken longer than a few minutes, but this is the first upgrade since we added a small Fleet deployment to the environment.

@rudolf
Contributor Author

rudolf commented Feb 19, 2021

@reighnman Can you share the output of:

GET .kibana/_search?filter_path=aggregations
{
  "aggs": {
    "saved_object_type": {
      "terms": {"field": "type"}
    }
  }
}

(if you have already deleted fleet-agent-events and finished the upgrade, you would have to run the aggregation over an older index like .kibana_N)

If you have enough fleet-agent-events you might need an even bigger batchSize to get it to complete in a reasonable amount of time, but the higher the batchSize, the more load is placed on Elasticsearch.
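
For reference, the batch size being discussed here is the migrations.batchSize setting in kibana.yml; a minimal sketch with an illustrative value (tune it to what your cluster can handle):

# kibana.yml
migrations.batchSize: 1000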

@reighnman

reighnman commented Feb 19, 2021

This is with the agent running on 8 servers for about a month using the windows integration (7.10).

$result.aggregations.saved_object_type.buckets

key                     doc_count
---                     ---------
fleet-agent-events       33421552
visualization                 883
application_usage_daily       242
ui-metric                     181
lens-ui-telemetry             147
dashboard                      99
search                         91
index-pattern                  46
fleet-agent-actions            30
alert                          23

Looking at .kibana_5 and the current .kibana_6 post-upgrade, the results are about the same.

@rudolf rudolf changed the title from "7.11.0 saved object migrations take very long to complete due to huge number of fleet-agent-events" to "7.11.0/7.12.0 upgrade migrations take very long to complete or timeout due to huge number of saved objects" Apr 12, 2021