The first snapshot stuck in inconsistent state after 6.8 -> 7.12 upgrade #75339

mieciu · 2021-07-14T14:19:26Z

Elasticsearch version (bin/elasticsearch --version): 6.8.16 upgraded to 7.12 (or 7.13; the minor version does not seem to matter here)

Plugins installed: [repository-s3]

Steps to reproduce (Elastic Cloud):

1. Create a minimal 6.8.16 Cloud deployment: a single node with 1GB ES and 1GB Kibana.
2. Upgrade to 7.12 / 7.13 (this is a rolling upgrade, which eventually succeeds).
3. Apply a no-op plan (which triggers a snapshot capture).
4. Watch Elasticsearch struggle to take the snapshot. The Cloud change eventually times out, claiming that Elasticsearch failed to capture a snapshot.

If we examine the _snapshot API, we find an oddity:

GET _snapshot/found-snapshots/cloud-snapshot-2021.06.30-tal3bny0rfq17eumwbhnmg

reports "state": "IN_PROGRESS", whereas:

GET _snapshot/found-snapshots/_status

reports "state": "SUCCESS"

Apparently, this snapshot remains in this "inconsistent" state forever. FWIW, this doesn't prevent the snapshot mechanism from being operational: if we delete the aforementioned snapshot and capture a new one, it works without any issues.
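To spot this kind of discrepancy without reading both responses by hand, the two payloads can be compared programmatically. The following is only a minimal sketch: it assumes both responses have already been fetched and parsed into dictionaries, and it relies only on the `snapshots`, `snapshot`, `uuid` and `state` fields visible in the responses in this report. The function name `find_state_mismatches` is hypothetical and not part of any Elasticsearch client.

```python
from typing import Any, Dict, List, Tuple


def find_state_mismatches(
    get_response: Dict[str, Any],
    status_response: Dict[str, Any],
) -> List[Tuple[str, str, str]]:
    """Return (snapshot name, state from GET, state from _status) for each
    snapshot whose state differs between the two parsed API responses."""
    # Index the _status response by snapshot uuid so names cannot collide.
    status_by_uuid = {
        snap["uuid"]: snap["state"]
        for snap in status_response.get("snapshots", [])
    }
    mismatches = []
    for snap in get_response.get("snapshots", []):
        status_state = status_by_uuid.get(snap["uuid"])
        # Only report snapshots present in both responses with different states.
        if status_state is not None and status_state != snap["state"]:
            mismatches.append((snap["snapshot"], snap["state"], status_state))
    return mismatches
```

Fed with the two responses from this issue, the sketch would flag the snapshot as IN_PROGRESS in one API and SUCCESS in the other.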

Full API responses:

//_snapshot/found-snapshots/*
//_snapshot/found-snapshots/cloud-snapshot-2021.06.30-tal3bny0rfq17eumwbhnmg
{
  "snapshots": [
    {
      "snapshot": "cloud-snapshot-2021.06.30-tal3bny0rfq17eumwbhnmg",
      "uuid": "XqY6xh7ZTbGAC7KOABdYnw",
      "version_id": 7130299,
      "version": "7.13.2",
      "indices": [
        ".kibana_7.13.2_001",
        ".apm-custom-link",
        ".kibana-event-log-7.13.2-000001",
        ".apm-agent-configuration",
        ".kibana_task_manager_pre7.4.0_001",
        ".kibana_security_session_1",
        ".ds-ilm-history-5-2021.06.30-000001",
        ".kibana_task_manager_7.13.2_001",
        ".kibana_1",
        ".security-6",
        ".tasks"
      ],
      "data_streams": [
        "ilm-history-5"
      ],
      "include_global_state": true,
      "metadata": {
        "policy": "cloud-snapshot-policy"
      },
      "state": "IN_PROGRESS",
      "start_time": "2021-06-30T08:11:20.691Z",
      "start_time_in_millis": 1625040680691,
      "end_time": "1970-01-01T00:00:00.000Z",
      "end_time_in_millis": 0,
      "duration_in_millis": 0,
      "failures": [],
      "shards": {
        "total": 0,
        "failed": 0,
        "successful": 0
      },
      "feature_states": [
        {
          "feature_name": "security",
          "indices": [
            ".security-6"
          ]
        },
        {
          "feature_name": "kibana",
          "indices": [
            ".kibana_task_manager_pre7.4.0_001",
            ".kibana_task_manager_7.13.2_001",
            ".kibana_security_session_1",
            ".kibana_7.13.2_001",
            ".kibana_1",
            ".apm-agent-configuration",
            ".apm-custom-link"
          ]
        },
        {
          "feature_name": "tasks",
          "indices": [
            ".tasks"
          ]
        }
      ]
    }
  ]
}

//_snapshot/found-snapshots/_status

{
  "snapshots": [
    {
      "snapshot": "cloud-snapshot-2021.06.30-tal3bny0rfq17eumwbhnmg",
      "repository": "found-snapshots",
      "uuid": "XqY6xh7ZTbGAC7KOABdYnw",
      "state": "SUCCESS",
      "include_global_state": true,
      "shards_stats": {
        "initializing": 0,
        "started": 0,
        "finalizing": 0,
        "done": 11,
        "failed": 0,
        "total": 11
      },
      "stats": {
        "incremental": {
          "file_count": 176,
          "size_in_bytes": 2624566
        },
        "processed": {
          "file_count": 137,
          "size_in_bytes": 2608294
        },
        "total": {
          "file_count": 186,
          "size_in_bytes": 2638788
        },
        "start_time_in_millis": 1625040680691,
        "time_in_millis": 1573599
      },
      "indices": {
        ".kibana_7.13.2_001": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 45,
              "size_in_bytes": 2255550
            },
            "processed": {
              "file_count": 35,
              "size_in_bytes": 2251118
            },
            "total": {
              "file_count": 45,
              "size_in_bytes": 2255550
            },
            "start_time_in_millis": 1625040683194,
            "time_in_millis": 27894
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 45,
                  "size_in_bytes": 2255550
                },
                "processed": {
                  "file_count": 35,
                  "size_in_bytes": 2251118
                },
                "total": {
                  "file_count": 45,
                  "size_in_bytes": 2255550
                },
                "start_time_in_millis": 1625040683194,
                "time_in_millis": 27894
              }
            }
          }
        },
        ".apm-custom-link": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 1,
              "size_in_bytes": 208
            },
            "processed": {
              "file_count": 0,
              "size_in_bytes": 0
            },
            "total": {
              "file_count": 1,
              "size_in_bytes": 208
            },
            "start_time_in_millis": 1625040680895,
            "time_in_millis": 409
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 1,
                  "size_in_bytes": 208
                },
                "processed": {
                  "file_count": 0,
                  "size_in_bytes": 0
                },
                "total": {
                  "file_count": 1,
                  "size_in_bytes": 208
                },
                "start_time_in_millis": 1625040680895,
                "time_in_millis": 409
              }
            }
          }
        },
        ".kibana-event-log-7.13.2-000001": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 4,
              "size_in_bytes": 5748
            },
            "processed": {
              "file_count": 2,
              "size_in_bytes": 5080
            },
            "total": {
              "file_count": 4,
              "size_in_bytes": 5748
            },
            "start_time_in_millis": 1625040682188,
            "time_in_millis": 6100
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 4,
                  "size_in_bytes": 5748
                },
                "processed": {
                  "file_count": 2,
                  "size_in_bytes": 5080
                },
                "total": {
                  "file_count": 4,
                  "size_in_bytes": 5748
                },
                "start_time_in_millis": 1625040682188,
                "time_in_millis": 6100
              }
            }
          }
        },
        ".kibana_task_manager_pre7.4.0_001": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 4,
              "size_in_bytes": 6620
            },
            "processed": {
              "file_count": 2,
              "size_in_bytes": 5952
            },
            "total": {
              "file_count": 4,
              "size_in_bytes": 6620
            },
            "start_time_in_millis": 1625040681988,
            "time_in_millis": 5411
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 4,
                  "size_in_bytes": 6620
                },
                "processed": {
                  "file_count": 2,
                  "size_in_bytes": 5952
                },
                "total": {
                  "file_count": 4,
                  "size_in_bytes": 6620
                },
                "start_time_in_millis": 1625040681988,
                "time_in_millis": 5411
              }
            }
          }
        },
        ".apm-agent-configuration": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 1,
              "size_in_bytes": 208
            },
            "processed": {
              "file_count": 0,
              "size_in_bytes": 0
            },
            "total": {
              "file_count": 1,
              "size_in_bytes": 208
            },
            "start_time_in_millis": 1625040681304,
            "time_in_millis": 283
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 1,
                  "size_in_bytes": 208
                },
                "processed": {
                  "file_count": 0,
                  "size_in_bytes": 0
                },
                "total": {
                  "file_count": 1,
                  "size_in_bytes": 208
                },
                "start_time_in_millis": 1625040681304,
                "time_in_millis": 283
              }
            }
          }
        },
        ".ds-ilm-history-5-2021.06.30-000001": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 10,
              "size_in_bytes": 26486
            },
            "processed": {
              "file_count": 6,
              "size_in_bytes": 24944
            },
            "total": {
              "file_count": 10,
              "size_in_bytes": 26486
            },
            "start_time_in_millis": 1625040682994,
            "time_in_millis": 10394
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 10,
                  "size_in_bytes": 26486
                },
                "processed": {
                  "file_count": 6,
                  "size_in_bytes": 24944
                },
                "total": {
                  "file_count": 10,
                  "size_in_bytes": 26486
                },
                "start_time_in_millis": 1625040682994,
                "time_in_millis": 10394
              }
            }
          }
        },
        ".kibana_security_session_1": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 1,
              "size_in_bytes": 208
            },
            "processed": {
              "file_count": 0,
              "size_in_bytes": 0
            },
            "total": {
              "file_count": 1,
              "size_in_bytes": 208
            },
            "start_time_in_millis": 1625040681587,
            "time_in_millis": 200
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 1,
                  "size_in_bytes": 208
                },
                "processed": {
                  "file_count": 0,
                  "size_in_bytes": 0
                },
                "total": {
                  "file_count": 1,
                  "size_in_bytes": 208
                },
                "start_time_in_millis": 1625040681587,
                "time_in_millis": 200
              }
            }
          }
        },
        ".kibana_task_manager_7.13.2_001": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 53,
              "size_in_bytes": 77870
            },
            "processed": {
              "file_count": 48,
              "size_in_bytes": 75189
            },
            "total": {
              "file_count": 53,
              "size_in_bytes": 77870
            },
            "start_time_in_millis": 1625040683887,
            "time_in_millis": 29801
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 53,
                  "size_in_bytes": 77870
                },
                "processed": {
                  "file_count": 48,
                  "size_in_bytes": 75189
                },
                "total": {
                  "file_count": 53,
                  "size_in_bytes": 77870
                },
                "start_time_in_millis": 1625040683887,
                "time_in_millis": 29801
              }
            }
          }
        },
        ".kibana_1": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 0,
              "size_in_bytes": 0
            },
            "total": {
              "file_count": 10,
              "size_in_bytes": 14222
            },
            "start_time_in_millis": 1625040684089,
            "time_in_millis": 603
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 0,
                  "size_in_bytes": 0
                },
                "total": {
                  "file_count": 10,
                  "size_in_bytes": 14222
                },
                "start_time_in_millis": 1625040684089,
                "time_in_millis": 603
              }
            }
          }
        },
        ".security-6": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 41,
              "size_in_bytes": 216412
            },
            "processed": {
              "file_count": 34,
              "size_in_bytes": 213171
            },
            "total": {
              "file_count": 41,
              "size_in_bytes": 216412
            },
            "start_time_in_millis": 1625040682388,
            "time_in_millis": 27012
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 41,
                  "size_in_bytes": 216412
                },
                "processed": {
                  "file_count": 34,
                  "size_in_bytes": 213171
                },
                "total": {
                  "file_count": 41,
                  "size_in_bytes": 216412
                },
                "start_time_in_millis": 1625040682388,
                "time_in_millis": 27012
              }
            }
          }
        },
        ".tasks": {
          "shards_stats": {
            "initializing": 0,
            "started": 0,
            "finalizing": 0,
            "done": 1,
            "failed": 0,
            "total": 1
          },
          "stats": {
            "incremental": {
              "file_count": 16,
              "size_in_bytes": 35256
            },
            "processed": {
              "file_count": 10,
              "size_in_bytes": 32840
            },
            "total": {
              "file_count": 16,
              "size_in_bytes": 35256
            },
            "start_time_in_millis": 1625040680691,
            "time_in_millis": 15600
          },
          "shards": {
            "0": {
              "stage": "DONE",
              "stats": {
                "incremental": {
                  "file_count": 16,
                  "size_in_bytes": 35256
                },
                "processed": {
                  "file_count": 10,
                  "size_in_bytes": 32840
                },
                "total": {
                  "file_count": 16,
                  "size_in_bytes": 35256
                },
                "start_time_in_millis": 1625040680691,
                "time_in_millis": 15600
              }
            }
          }
        }
      }
    }
  ]
}


elasticmachine · 2021-07-16T19:34:39Z

Pinging @elastic/es-distributed (Team:Distributed)

…#75501)

This refactors the snapshots-in-progress logic to work from `RepositoryShardId` when working out what parts of the repository are in-use by writes for snapshot concurrency safety. This change does not go all the way yet on this topic and there are a number of possible follow-up further improvements to simplify the logic that I'd work through over time. But for now this allows fixing the remaining known issues that snapshot stress testing surfaced when combined with the fix in #75530.

These issues all come from the fact that `ShardId` is not a stable key across multiple snapshots if snapshots are partial. The scenarios that are broken are all roughly this:

* snapshot-1 for index-A with uuid-A runs and is partial
* index-A is deleted and re-created and now has uuid-B
* snapshot-2 for index-A is started and we now have it queued up behind snapshot-1 for the index
* snapshot-1 finishes and the logic tries to start the next snapshot for the same shard-id
* this fails because the shard-id is not the same, we can't compare index uuids, just index name + shard id
* this change fixes all these spots by always taking the round trip via `RepositoryShardId`

planned follow-ups here are:

* dry up logic across cloning and snapshotting more as both now essentially run the same code in many state-machine steps
* serialize snapshots-in-progress efficiently instead of re-computing the index and by-repository-shard-id lookups in the constructor every time
* refactor the logic in snapshots-in-progress away from maps keyed by shard-id in almost all spots to this end, just keep an index name to `Index` map to work out what exactly is being snapshotted
* refactoring snapshots-in-progress to be a map of list of operations keyed by repository shard id instead of a list of maps as it currently is to make the concurrency simpler and more obviously correct

closes #75423

relates (#75339 ... should also fix this, but I have to verify by testing with a backport to 7.x)
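The broken scenario described in the commit message above can be illustrated with a small model. This is a deliberately simplified sketch and not Elasticsearch's actual code: the `ShardId` and `RepoShardKey` dataclasses, the `repo_key` helper, and the dictionary-based queues are all hypothetical stand-ins. It assumes only what the message states, namely that a re-created index gets a new uuid and that the repository-level key is index name plus shard number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardId:
    """Hypothetical cluster-level key: includes the index uuid."""
    index_name: str
    index_uuid: str
    shard: int


@dataclass(frozen=True)
class RepoShardKey:
    """Hypothetical repository-level key: index name + shard number only."""
    index_name: str
    shard: int


def repo_key(shard_id: ShardId) -> RepoShardKey:
    # Drop the uuid so the key survives deleting and re-creating the index.
    return RepoShardKey(shard_id.index_name, shard_id.shard)


# snapshot-1 ran against index-A while it still had uuid-A.
finished = ShardId("index-A", "uuid-A", 0)
# index-A was deleted and re-created (uuid-B); snapshot-2 is queued for it.
queued = ShardId("index-A", "uuid-B", 0)

queue_by_shard_id = {queued: "snapshot-2"}
queue_by_repo_key = {repo_key(queued): "snapshot-2"}

# Looking up the next snapshot by the finished ShardId misses the queued one.
next_by_shard_id = queue_by_shard_id.get(finished)
# Going through the repository-level key finds it.
next_by_repo_key = queue_by_repo_key.get(repo_key(finished))
```

In this model the lookup keyed by `ShardId` returns nothing because the uuids differ, while the lookup keyed by name and shard number finds the queued snapshot, which mirrors the "round trip" through the repository-level key that the message describes.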


mieciu added >bug needs:triage Requires assignment of a team area label labels Jul 14, 2021

mieciu assigned original-brownbear Jul 14, 2021

jtibshirani added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs and removed needs:triage Requires assignment of a team area label labels Jul 16, 2021

elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jul 16, 2021

original-brownbear mentioned this issue Jul 21, 2021

Refactor SnapshotsInProgress to Use RepositoryId for Concurency Logic #75501

Merged

original-brownbear mentioned this issue Aug 15, 2021

Refactor SnapshotsInProgress to Use RepositoryId for Concurency Logic(#75501) #76539

Merged

original-brownbear mentioned this issue Aug 16, 2021

Refactor SnapshotsInProgress to Use RepositoryId for Concurency Logic(#75501) (#76539) #76547

Merged
