[META] Insufficient guardrails leading to disk going full on nodes #5712
Labels
distributed framework
enhancement
Enhancement or improvement to existing feature or request
Meta
Meta issue, not directly linked to a PR
Roadmap:Stability/Availability/Resiliency
Project-wide roadmap label
Is your feature request related to a problem? Please describe.
Currently, we are observing multiple instances where the data volume fills up to 100% on one or more nodes. We have guardrails such as the flood stage watermark in place, which are meant to ensure that OpenSearch applies blocks at the right time and that enough space remains available for OpenSearch to perform internal operations (such as segment merges and cluster state updates). Still, we sometimes observe that the available space on a few or all data nodes of a domain drops to 0. This can cause a node to be removed from the cluster (either by FSHealthService checks or due to cluster state updates), which may ultimately result in a red cluster (if the node holds an active primary shard).
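To make the guardrail concrete, here is a minimal sketch of the kind of check a flood stage watermark implies: compare the used percentage of a data volume against a threshold. It is illustrative only and is not OpenSearch's implementation. The function names are hypothetical, the `>=` comparison is an assumption, and the 95% threshold is assumed to match the documented default of `cluster.routing.allocation.disk.watermark.flood_stage`.

```python
import shutil

# Assumption: mirrors the documented default of
# cluster.routing.allocation.disk.watermark.flood_stage (95%).
FLOOD_STAGE_PERCENT = 95.0


def used_percent(total_bytes: int, free_bytes: int) -> float:
    """Return the percentage of the volume that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - free_bytes) / total_bytes


def exceeds_flood_stage(
    total_bytes: int,
    free_bytes: int,
    flood_stage_percent: float = FLOOD_STAGE_PERCENT,
) -> bool:
    """Hypothetical helper: True when usage has reached the threshold."""
    return used_percent(total_bytes, free_bytes) >= flood_stage_percent


def check_path(path: str = "/") -> bool:
    """Apply the check to a real mount point using the standard library."""
    usage = shutil.disk_usage(path)
    return exceeds_flood_stage(usage.total, usage.free)
```

The problem described in this issue is that a check like this can pass at the moment it runs while writes, merges, or other internal operations continue to consume the remaining headroom before a block takes effect.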
Describe the solution you'd like
OpenSearch should ensure that guardrails such as the flood stage watermark are applied correctly and that enough space remains available for OpenSearch to perform internal operations (such as segment merges and cluster state updates).
OpenSearch Subtasks