Fix CheckTargetShardsCountStep #(48460) (#89176)
**The issue**

The flaky test simulates the following:

- Create a shrink policy with an invalid target shard count.
- Then change the policy to have a valid target shard count.
- Expectation: the `check-target-shards-count` step will return true and the shrink operation will be successful.

What was happening in the background in some cases:

- Create the shrink policy with an invalid target shard count.
- The `check-target-shards-count` task gets created and queued to be executed with the invalid target shard count.
- The task doesn't get enough priority to be executed.
- We change the policy to have a valid target shard count.
- We execute the queued task, which still has the outdated target shard count.

**Proof**

We enriched the code with some extra logging to verify that the above scenario actually occurs:

```
## Adding the check target shards to the executingTasks

[2022-08-08T18:02:52,824][INFO ][o.e.x.i.IndexLifecycleRunner] [javaRestTest-0] #78460: Adding task to queue check if I can shrink to numberOfShards = 5
[2022-08-08T18:02:52,825][TRACE][o.e.x.i.h.ILMHistoryStore] [javaRestTest-0] queueing ILM history item for indexing [ilm-history-5]: [{"index":"index-zmmrkzfhht","policy":"policy-bEmKF","@timestamp":1659970972825,"index_age":12608,"success":true,"state":{"phase":"warm","phase_definition":"{\"policy\":\"policy-bEmKF\",\"phase_definition\":{\"min_age\":\"0ms\",\"actions\":{\"shrink\":{\"number_of_shards\":5}}},\"version\":1,\"modified_date_in_millis\":1659970962968}","action_time":"1659970968847","phase_time":"1659970966014","action":"shrink","step":"check-target-shards-count","creation_date":"1659970960217","step_time":"1659970972076"}}]

## We change the policy before the condition is even evaluated.

[2022-08-08T18:02:52,825][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [javaRestTest-0] updating index lifecycle policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updated policy [policy-bEmKF] contains the same phase step keys and can be refreshed
[2022-08-08T18:02:52,826][TRACE][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] [index-zmmrkzfhht] updating cached phase definition for policy [policy-bEmKF]
[2022-08-08T18:02:52,826][DEBUG][o.e.x.c.i.PhaseCacheManagement] [javaRestTest-0] refreshed policy [policy-bEmKF] phase definition for [1] indices

## We check the condition for the first time but the target shard count is already outdated

[2022-08-08T18:02:53,406][ERROR][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] #78460: Policy has different target number of shards in cluster state 2 vs what will be executed 5.
[2022-08-08T18:02:53,441][DEBUG][o.e.x.c.i.CheckTargetShardsCountStep] [javaRestTest-0] lifecycle action of policy [policy-bEmKF] for index [index-zmmrkzfhht] cannot make progress because the target shards count [5] must be a factor of the source index's shards count [4]
```

**Impact**

We do not expect a large impact on production clusters, because they see many more cluster state updates. However, it might inconvenience users who fixed a policy and do not see the effect as soon as they could have.

**The fix**

Our proposed fix is to not provide the target shard count upon task creation, but to retrieve it from the cluster state when the task runs. This way we ensure the step uses the newest value.

**Future work**

Currently, for every cluster state update, we go through all the indices and check whether any step needs to be executed. This doesn't scale well. We would like to try switching to a more efficient model, potentially with a cluster state observer. An issue will be created soon.

Resolves #78460
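
The difference between the old and the new behaviour can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the Elasticsearch code: `StaleShardCountSketch`, its methods, and the use of an `AtomicInteger` as a stand-in for the policy in the cluster state are all hypothetical names chosen for illustration. It replays the scenario from the logs (source index with 4 shards, target changed from 5 to 2 while the task is queued).

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Toy model of the race; none of these names are Elasticsearch classes. */
final class StaleShardCountSketch {

    /** The condition from the log: the target must be a factor of the source shard count. */
    static boolean isValidTarget(int sourceShards, int targetShards) {
        return targetShards > 0 && sourceShards % targetShards == 0;
    }

    /** Old behaviour: the target shard count is captured when the task is created. */
    static BooleanSupplier capturedAtCreation(int sourceShards, AtomicInteger policyTarget) {
        int snapshot = policyTarget.get(); // may be stale by the time the task runs
        return () -> isValidTarget(sourceShards, snapshot);
    }

    /** New behaviour: the target shard count is looked up when the task runs. */
    static BooleanSupplier readAtExecution(int sourceShards, AtomicInteger policyTarget) {
        return () -> isValidTarget(sourceShards, policyTarget.get());
    }

    /** Replays the scenario: queue the check, fix the policy, then run the check. */
    static boolean replay(boolean readAtExecution) {
        AtomicInteger policyTarget = new AtomicInteger(5); // invalid for 4 source shards
        Queue<BooleanSupplier> queue = new ArrayDeque<>();
        queue.add(readAtExecution
            ? readAtExecution(4, policyTarget)
            : capturedAtCreation(4, policyTarget));
        policyTarget.set(2); // policy is corrected while the task is still queued
        return queue.remove().getAsBoolean();
    }
}
```

In this model, `StaleShardCountSketch.replay(false)` evaluates the check against the outdated value 5 and fails, while `StaleShardCountSketch.replay(true)` sees the updated value 2 and passes.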