Revisit default disk watermark on different data tiers #81406

Closed
Leaf-Lin opened this issue Dec 7, 2021 · 19 comments · Fixed by #88639
Leaf-Lin commented Dec 7, 2021

Today, different data tiers all share the same cluster.routing.allocation.disk.watermark.low|high|flood_stage settings. These settings are relative (percentage-based) values by default, which means that on larger disks the defaults can leave a significant amount of space unused.

Do we want to revisit the default disk watermark values, especially since different data tiers may have different space requirements?
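
To illustrate the wastage, here is a rough sketch (assuming the currently documented defaults of 85% low / 90% high / 95% flood_stage; the numbers and names are only for illustration, not actual Elasticsearch code):

```java
// Illustrative only: space held back by a purely percentage-based watermark as disks grow.
// Assumes the documented defaults of 85% (low), 90% (high) and 95% (flood_stage) used space.
public class WatermarkWastage {
    public static void main(String[] args) {
        long[] diskSizesGb = {100, 1_000, 10_000};  // 100 GB, 1 TB and 10 TB nodes
        double highWatermarkUsed = 0.90;            // default high watermark (fraction of disk used)
        for (long totalGb : diskSizesGb) {
            // Free space that must remain unused for the node to stay below the high watermark.
            long reservedGb = Math.round(totalGb * (1.0 - highWatermarkUsed));
            System.out.println(totalGb + " GB disk -> " + reservedGb
                    + " GB must stay free under the default 90% high watermark");
        }
    }
}
```

So a 10 TB node keeps roughly 1 TB unused purely because the threshold is relative.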

Leaf-Lin added the >enhancement, Team:Data Management, and needs:triage labels on Dec 7, 2021
@elasticmachine

Pinging @elastic/es-data-management (Team:Data Management)

not-napoleon added the :Data Management/Other label and removed the needs:triage label on Dec 7, 2021
dakrone commented Dec 7, 2021

One thing that would make this difficult is that these are dynamic settings, and we currently don't have the Settings functionality to specify a particular value for only a subset of nodes. This means we could have separate defaults, but if the user ever dynamically changed one of the settings, it would change for all tiers.

I'm adding the core/infra label for this, since they'd need to help us with the Settings infrastructure if we wanted to support something like this fully.

dakrone added the :Core/Infra/Settings label on Dec 7, 2021
elasticmachine added the Team:Core/Infra label on Dec 7, 2021
@elasticmachine

Pinging @elastic/es-core-infra (Team:Core/Infra)

Leaf-Lin commented Dec 7, 2021

There was a discussion that this would be an operator-only setting on the cloud once the default settings are sensible.

dakrone commented Dec 7, 2021

There was a discussion that this would be an operator-only setting on the cloud once the default settings are sensible.

That helps a bit. I think it'd still be good to think about where we want to go with tier-specific settings in the future, both for on-prem users and for other settings (which may be non-operator-only) that we want to set to different values per tier (for example, we already do this with a recovery setting: #68480).

Leaf-Lin added the :Distributed Coordination/Allocation label on Jun 15, 2022
elasticmachine added the Team:Distributed (Obsolete) label on Jun 15, 2022
@elasticmachine

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner

I wonder if we really need more control here, as opposed to changing how Elasticsearch computes its disk watermarks to better utilise larger nodes. Today ES (roughly) aims to keep 10% of disk free on each node, but if we, say, changed this to target min(10%,100GB) of free space, then larger nodes would be better utilised automatically without needing to adjust any settings.
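
Roughly, the rule I have in mind would look like this (illustrative sketch only, not actual Elasticsearch code):

```java
// Illustrative sketch of the suggested rule: keep min(10% of the disk, 100 GB) free.
// Names and constants are for illustration only.
public class FreeSpaceTarget {
    private static final double FREE_RATIO = 0.10;              // 10% of total disk
    private static final long MAX_HEADROOM_BYTES = 100L << 30;  // 100 GB absolute cap

    /** Bytes that should remain free on a node with the given total disk size. */
    static long requiredFreeBytes(long totalBytes) {
        long fromRatio = (long) (totalBytes * FREE_RATIO);
        return Math.min(fromRatio, MAX_HEADROOM_BYTES);
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        System.out.println("500 GB node: keep " + requiredFreeBytes(500 * gb) / gb + " GB free");
        System.out.println("10 TB node:  keep " + requiredFreeBytes(10_240 * gb) / gb + " GB free");
    }
}
```

Small nodes keep behaving as today (the 10% term wins), while large nodes are capped at 100GB of reserved space.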

@henningandersen

I agree with David here; doing something similar to what we did for the frozen tier on all tiers would, in my mind, effectively solve this problem.

Making it tier-specific is not enough/ideal during a rolling up-size anyway.

dakrone commented Jun 15, 2022

Supporting a rudimentary syntax of either 0.75, 10%, 100gb, min(10%,100gb), or max(10%,100gb) (and then changing the defaults for our needs) seems like a reasonable approach to me as well.
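
For illustration, one (purely hypothetical, not necessarily the parser we would ship) way such expressions could be resolved against a node's total disk size:

```java
import java.util.Locale;

// Illustrative sketch only: one way a rudimentary min()/max()/percentage/bytes syntax
// could be resolved. This is not the actual Elasticsearch parser.
public class WatermarkExpressionParser {

    /** Resolves an expression like "0.75", "10%", "100gb", "min(10%,100gb)" or
     *  "max(10%,100gb)" to a number of bytes for a disk of the given total size. */
    static long resolveBytes(String expression, long totalBytes) {
        String expr = expression.trim().toLowerCase(Locale.ROOT);
        if (expr.startsWith("min(") && expr.endsWith(")")) {
            String[] parts = expr.substring(4, expr.length() - 1).split(",");
            return Math.min(resolveBytes(parts[0], totalBytes), resolveBytes(parts[1], totalBytes));
        }
        if (expr.startsWith("max(") && expr.endsWith(")")) {
            String[] parts = expr.substring(4, expr.length() - 1).split(",");
            return Math.max(resolveBytes(parts[0], totalBytes), resolveBytes(parts[1], totalBytes));
        }
        if (expr.endsWith("%")) {
            double ratio = Double.parseDouble(expr.substring(0, expr.length() - 1)) / 100.0;
            return (long) (totalBytes * ratio);
        }
        if (expr.endsWith("gb")) {
            return Long.parseLong(expr.substring(0, expr.length() - 2)) * (1L << 30);
        }
        // Bare ratio such as "0.75".
        return (long) (totalBytes * Double.parseDouble(expr));
    }

    public static void main(String[] args) {
        long tenTb = 10L << 40;
        System.out.println(resolveBytes("min(10%,100gb)", tenTb) / (1L << 30) + " GB");
    }
}
```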

kingherc commented Jul 5, 2022

From conversing with @henningandersen, we could use a 20GB headroom for the flood_stage watermark, 100GB for high, and 150GB for low.
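
To make the effect concrete, here is a rough sketch assuming those headroom values, the documented 85%/90%/95% watermark defaults, and the interpretation that a max headroom caps the absolute free space a percentage watermark would otherwise require (illustrative only, not real code):

```java
// Illustrative only: how the proposed max headrooms would cap the free space
// required by the default percentage watermarks on a 4 TiB node.
public class MaxHeadroomExample {
    public static void main(String[] args) {
        long totalBytes = 4L << 40; // 4 TiB node
        printRequirement("low (85% used)",   0.15, 150L << 30, totalBytes);
        printRequirement("high (90% used)",  0.10, 100L << 30, totalBytes);
        printRequirement("flood (95% used)", 0.05,  20L << 30, totalBytes);
    }

    static void printRequirement(String stage, double freeRatio, long maxHeadroomBytes, long totalBytes) {
        long fromPercent = (long) (totalBytes * freeRatio);   // free space implied by the percentage
        long effective = Math.min(fromPercent, maxHeadroomBytes); // capped by the max headroom
        System.out.println(stage + ": " + gb(fromPercent) + " GB free by percentage, capped to "
                + gb(effective) + " GB by the max headroom");
    }

    static long gb(long bytes) {
        return bytes / (1L << 30);
    }
}
```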

kingherc commented Jul 11, 2022

@henningandersen , as we said, we could add a max_headroom setting (similar to the existing cluster.routing.allocation.disk.watermark.flood_stage.frozen.max_headroom) for the low, high and flood_stage watermarks. A couple of notes/questions that come to mind:

  • Looking at the logic of the existing max_headroom, it applies a default value (of 20GB for the frozen flood stage) only if the frozen threshold itself is not set. If the frozen threshold is set, the user needs to explicitly set the max_headroom as well. I am thinking of following the same logic (see the sketch after this list), although it initially felt more natural to me to apply the default value even when the frozen threshold is set.
  • If we introduce the max headrooms, will it be weird that there is still a special watermark and headroom setting just for the frozen tier? Would it make sense to deprecate it and tell people to use the new cluster.routing.allocation.disk.watermark.flood_stage.max_headroom setting (which would apply irrespective of data tier)? We would not need to throw an error, just ignore it and produce a warning (telling users to use the new headroom settings).
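
For the first point above, a minimal sketch of the fallback logic I mean (hypothetical names, not the real implementation):

```java
// Illustrative only: the max headroom default applies only when the corresponding
// watermark threshold has been left at its default; if the user overrides the
// watermark, they must set the max headroom explicitly too.
public class HeadroomDefaultLogic {

    static long effectiveMaxHeadroomBytes(String userWatermark, Long userMaxHeadroomBytes,
                                          long defaultMaxHeadroomBytes) {
        if (userMaxHeadroomBytes != null) {
            return userMaxHeadroomBytes;        // explicit headroom always wins
        }
        if (userWatermark == null) {
            return defaultMaxHeadroomBytes;     // default headroom only if the watermark is unset
        }
        return Long.MAX_VALUE;                  // user set the watermark: no implicit headroom
    }

    public static void main(String[] args) {
        long twentyGb = 20L << 30;
        // Watermark left at default -> the 20 GB default headroom kicks in.
        System.out.println(effectiveMaxHeadroomBytes(null, null, twentyGb));
        // Watermark explicitly set -> no headroom unless also set explicitly.
        System.out.println(effectiveMaxHeadroomBytes("97%", null, twentyGb));
    }
}
```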

@kingherc

Team, as mentioned in my comment above, I plan to start looking into how to implement the max_headroom idea I discussed with @henningandersen. Feel free to send me feedback / questions / objections if you have any, including on the questions I raised above. cc @DaveCTurner

@kingherc

In implementing this, I am thinking of adopting @henningandersen's new RelativeByteSizeValue.java. It was only introduced for the frozen-tier flood stage watermark and headroom, but I believe it would make sense to adopt it in this issue for all disk watermarks and to introduce the max headrooms in a similar way. I believe this will simplify the code (which currently checks separately for thresholds expressed as exact bytes or as percentages) without breaking backwards compatibility. Please speak up if you have any concerns about this.
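
For anyone not familiar with it, the rough idea behind RelativeByteSizeValue is a value that is either a ratio or an absolute byte size and resolves to bytes against a given total. Sketched here with hypothetical names; see the real class in the code base for the actual API:

```java
// Rough conceptual sketch of a "relative byte size": either a ratio of the total
// disk or an absolute number of bytes. Hypothetical names only; the real
// RelativeByteSizeValue in Elasticsearch is the authoritative version.
public final class RelativeOrAbsoluteSize {
    private final Double ratio;       // e.g. 0.90 for "90%"; null if absolute
    private final Long absoluteBytes; // e.g. 100 GiB in bytes; null if ratio-based

    private RelativeOrAbsoluteSize(Double ratio, Long absoluteBytes) {
        this.ratio = ratio;
        this.absoluteBytes = absoluteBytes;
    }

    static RelativeOrAbsoluteSize ofRatio(double ratio) {
        return new RelativeOrAbsoluteSize(ratio, null);
    }

    static RelativeOrAbsoluteSize ofBytes(long bytes) {
        return new RelativeOrAbsoluteSize(null, bytes);
    }

    /** Resolves this value to bytes for a disk of the given total size. */
    long resolveBytes(long totalBytes) {
        return ratio != null ? (long) (totalBytes * ratio) : absoluteBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L << 40;
        System.out.println(ofRatio(0.90).resolveBytes(tenTb));       // 90% of the disk, in bytes
        System.out.println(ofBytes(100L << 30).resolveBytes(tenTb)); // always 100 GiB
    }
}
```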

@kingherc

Also cc'ing @gmarouli, who I saw recently authored "Add disk thresholds in the cluster state" and could provide feedback as well.

@kingherc

Hi! I have been working on this issue for some days now, making several changes across DiskThresholdSettings.java, DiskThresholdMonitor.java, DiskThresholdDecider.java, HealthMetadata.java, and relevant tests. To double-check that I am heading in the right direction, I would like at least one more opinion. Since a couple of people are away, @fcofdez, would you mind taking a look at my comments above and seeing whether I'm on the right track?

In short:

  • I'm introducing some "max_headroom" settings for the low/high/flood disk watermark settings
  • I'm converting the low/high/flood settings to RelativeByteSizeValue.java classes
  • I believe what I'm doing is backwards compatible, but I do need to change some Allocation "reasons" and some log messages every now and then.

Feedback is welcome

fcofdez commented Jul 14, 2022

I'll take a look at this next Monday, @kingherc, if that's OK.

fcofdez commented Jul 18, 2022

I'm introducing some "max_headroom" settings for the low/high/flood disk watermark settings

I think this makes sense.

I'm converting the low/high/flood settings to RelativeByteSizeValue.java classes

Maybe we can leave this for a follow-up PR? Since I think this is just a refactoring to clean up some old code, right?

I believe what I'm doing is backwards compatible, but I do need to change some Allocation "reasons" and some log messages every now and then.

I would expect to keep the old behaviour when the new max_headroom settings are not specified. That way we won't break any existing cluster settings. If the changes are just about the new max headroom, I think you're on the right path.

If we introduce the max headrooms, will it be weird that there will be still a special watermark and headroom setting just for the frozen tier?

Yes, that's confusing. But we need to behave correctly in that case, i.e. use the frozen max_headroom when both the general and the frozen-specific settings are specified. I think we should deprecate that setting, but I'm not sure if this needs to go through the deprecation approval process.
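
A tiny sketch of the precedence I would expect (illustrative names only, not the actual implementation):

```java
// Illustrative only: on frozen-tier nodes, a frozen-specific max headroom, if set,
// should take precedence over the general max headroom.
public class HeadroomPrecedence {
    static Long pickMaxHeadroom(Long generalHeadroomBytes, Long frozenHeadroomBytes, boolean frozenNode) {
        if (frozenNode && frozenHeadroomBytes != null) {
            return frozenHeadroomBytes;   // frozen-specific setting wins on frozen nodes
        }
        return generalHeadroomBytes;      // otherwise fall back to the general setting (may be null)
    }

    public static void main(String[] args) {
        System.out.println(pickMaxHeadroom(100L << 30, 20L << 30, true));  // frozen node -> 20 GiB
        System.out.println(pickMaxHeadroom(100L << 30, 20L << 30, false)); // non-frozen -> 100 GiB
    }
}
```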

cc @kingherc

@kingherc

Awesome, thanks for the pointers @fcofdez! The most important thing was to confirm I'm on the right path, which your answer suggests I am, so I will continue down it.

Maybe we can leave this for a follow-up PR?

I did not go that way because converting them to RelativeByteSizeValue would simplify the code (it was already done for the frozen max_headroom setting, so I could adopt similar code). But it turns out I need to touch a lot of test code to make sure everything works appropriately, so if I see it needs a lot more work, I may then do it in a separate PR.

I also believe I am not introducing backwards-incompatible changes or behaviour: if the new max_headroom settings are not set, the old settings behave as they did before.

For the frozen setting, I found that I can leave it as it is. That way, we do not need to deprecate anything or make something backwards incompatible. However, I do agree that it may be confusing, and I would welcome especially @henningandersen (who introduced the frozen setting that I'm mimicking) to comment on whether we should do anything.

kingherc added a commit to kingherc/elasticsearch that referenced this issue Jul 20, 2022
Introduce max headroom settings for the low, high, and
flood disk watermark stages, similar to the existing
max headroom setting for the flood stage of the frozen
tier. Also, convert the disk watermarks to
RelativeByteSizeValue, similar to the existing setting
for the flood stage of the frozen tier.

Introduce new max headrooms in HealthMetadata and in
ReactiveStorageDeciderService.

Add multiple tests in DiskThresholdDeciderUnitTests,
DiskThresholdDeciderTests and DiskThresholdMonitorTests.

Fixes elastic#81406
henningandersen commented Jul 27, 2022

I think we should deprecate that setting, but I'm not sure if this needs to go through the deprecation approval process.

We cannot deprecate it since it has a lower default value. For frozen there is no merging and we can go much closer to the limit.

For the frozen setting, I found that I can leave it as it is

👍

kingherc added a commit to kingherc/elasticsearch that referenced this issue Oct 10, 2022
kingherc added a commit that referenced this issue Oct 20, 2022
kingherc added a commit to kingherc/elasticsearch that referenced this issue Oct 20, 2022