-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow total memory to be overridden #78750
Conversation
Since elastic#65905 Elasticsearch has determined the Java heap settings from node roles and total system memory. This change allows the total system memory used in that calculation to be overridden with a user-specified value. This is intended to be used when Elasticsearch is running on a machine where some other software that consumes a non-negligible amount of memory is running. For example, a user could tell Elasticsearch to assume it was running on a machine with 3GB of RAM when actually it was running on a machine with 4GB of RAM. The system property is `es.total_memory_bytes`, so, for example, could be specified using `-Des.total_memory_bytes=3221225472`. (It is specified in bytes rather than using a unit, because it needs to be parsed by startup code that does not have access to the utility classes that interpret byte size units.)
Pinging @elastic/es-delivery (Team:Delivery) |
Pinging @elastic/es-core-infra (Team:Core/Infra) |
Pinging @elastic/ml-core (Team:ML) |
Hi @droberts195, I've created a changelog YAML for you. |
Hi @droberts195, I've updated the changelog YAML for you. |
String totalMemoryBytesOption = userDefinedJvmOptions.stream() | ||
.filter(option -> option.startsWith("-Des.total_memory_bytes=")) | ||
.findFirst() | ||
.orElse(null); | ||
if (totalMemoryBytesOption == null) { | ||
return null; | ||
} | ||
try { | ||
return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]); | ||
} catch (NumberFormatException e) { | ||
throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you don't need an intermediate variable if you used Optional.map
?
String totalMemoryBytesOption = userDefinedJvmOptions.stream() | |
.filter(option -> option.startsWith("-Des.total_memory_bytes=")) | |
.findFirst() | |
.orElse(null); | |
if (totalMemoryBytesOption == null) { | |
return null; | |
} | |
try { | |
return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]); | |
} catch (NumberFormatException e) { | |
throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]"); | |
} | |
return userDefinedJvmOptions.stream() | |
.filter(option -> option.startsWith("-Des.total_memory_bytes=")) | |
.findFirst() | |
.map(totalMemoryBytesOption -> { | |
try { | |
return Long.parseLong(totalMemoryBytesOption.split("=", 2)[1]); | |
} catch (NumberFormatException e) { | |
throw new IllegalArgumentException("Unable to parse number of bytes from [" + totalMemoryBytesOption + "]"); | |
} | |
}) | |
.orElse(null); |
`total_override`:: | ||
(<<byte-units,byte value>>) | ||
If the amount of physical memory has been overridden using the `es.total_memory_bytes` | ||
system property on all selected nodes then this reports the sum of the overridden |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the total memory was only overridden on some of the selected nodes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a tricky one. I opted not to report any override at all at the cluster level if some nodes have overrides and others don't. (You can still get all the values from the node stats.) I guess the alternative would be to report the sum of overrides on the nodes that have overrides, plus un-overridden total on the nodes that don't, but only if at least one node has an override. Maybe that's better - I'd be interested to hear if subsequent reviewers have any thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also ties in with #78750 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need to have any distinct concept of override in the stats at all. With available disk space, when limited through cgroups, we do not show what the real disk has available. Why not just show this as “this is the memory available”?
assertThat(OsProbe.getTotalMemoryOverride("123456789123456789"), is(123456789123456789L)); | ||
IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> OsProbe.getTotalMemoryOverride("abc")); | ||
assertThat(e.getMessage(), is("Invalid value for [es.total_memory_bytes]: [abc]")); | ||
// Although numeric, this value overflows long. This won't be a problem if the override is set appropriately on a 64 bit machine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why this won't be a problem on a 64 bit machine - can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A 64 bit pointer can address 16 exabytes of memory. A signed long
can only store half that, but it's still huge. I think as machines start to have memory in the exabyte ranges we'll see 128 bit CPUs and languages will be changed to have 128 bit primitive integer types. At that time all the places where we've stored memory sizes in long
will need to be changed to that new 128 bit primitive type, but that's decades in the future.
Basically, for the foreseeable future there's no justification for anyone to need to tell Elasticsearch to work on the assumption that the total memory on the machine is greater than 8 exabytes, so it doesn't matter that attempting to do so is an error. I'll change the comment to say that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments. I think we should get somebody from core/infra to give this a once over as well. Otherwise 👍 .
...on/tools/launchers/src/main/java/org/elasticsearch/tools/launchers/MachineDependentHeap.java
Outdated
Show resolved
Hide resolved
* is read from the "es.total_memory_bytes" system property. Negative | ||
* values or not set at all mean no override. | ||
*/ | ||
public Long getTotalMemoryOverride() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should call this something less implementation-specific, like "totalUsableMemory"? I think in that case it would also make sense to have this return a long
primative as well, and if there's no override defined, we just return the same value as totalSystemMemory
. This saves downstream callers the trouble of null checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
assertThat(xArgs, hasItems("-Xms304m", "-Xmx304m")); | ||
} | ||
|
||
private List<String> machineDependentHeapTest(final String containerMemory, final List<String> extraJvmOptions) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure we're testing this scenario in our ArchiveTests
and PackageTests
as well to get full coverage across packaging types.
Just summarizing some of the comments that have been raised against different lines but are all very interrelated:
This is not exactly like cgroups. The key difference is that cgroups enforces boundaries but this functionality does not. So with cgroups, if the memory of the machine is 16GB but cgroups says this container can use 4GB then the OS will enforce that the container really can only use 4GB. With this functionality if a container has 4GB but we want Elasticsearch to work on the basis that something else will be using 1GB then there is nothing at the OS level that will stop Elasticsearch using more than 3GB, nor stop the other processes from using more than 1GB. It has been suggested that a better solution to doing all this would be to run the other processes in a separate container, and I guess this issue shows that that is indeed the better solution from the perspective of different processes fighting over memory. But I guess it also creates other problems when the other process is supposed to be monitoring the container where Elasticsearch is running and cannot get all the stats it wants when running outside the Elasticsearch container. This also makes it hard to modify As a result of the answers to points 1 and 2 I think the only way forward is to add a third field in addition to |
@@ -1286,7 +1286,8 @@ public static boolean allTemplatesInstalled(ClusterState clusterState) { | |||
|
|||
/** | |||
* Find the memory size (in bytes) of the machine this node is running on. | |||
* Takes container limits (as used by Docker for example) into account. | |||
* Takes container limits (as used by Docker for example) and forced memory | |||
* size overrides into account. | |||
*/ | |||
static long machineMemoryFromStats(OsStats stats) { | |||
long mem = stats.getMem().getTotal().getBytes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The discussion about cgroups made me think that the 10 lines below (not changed by this PR, so expand the diff to see them) are unnecessary now. They certainly were necessary back in some 6.x release when they were first added.
Is it the case that OperatingSystemMXBean
takes cgroups into account only for Java 14 and above, hence:
Line 36 in b968fcb
throw new SystemMemoryInfoException("The minimum required Java version is 14 to use " + this.getClass().getName()); |
?
If that's the case then this method can be simplified for 8.0+ when we're mandating at least Java 17.
I went with I also removed ML's hand coded application of cgroups limits on the assumption Java 14+ does this automatically. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
static Long getTotalMemoryOverride(String memoryOverrideProperty) { | ||
try { | ||
long memoryOverride = Long.parseLong(memoryOverrideProperty); | ||
return (memoryOverride < 0) ? null : memoryOverride; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why lenience for a negative value, shouldn't that be an error? Instead could we use null
as the sentinel for the property not existing?
assert total >= 0 : "expected total memory to be positive, got: " + total; | ||
assert free >= 0 : "expected free memory to be positive, got: " + total; | ||
assert adjustedTotal >= 0 : "expected adjusted total memory to be positive, got: " + adjustedTotal; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see we are asserting here to ensure it is positive, but since asserts are not normally running in production, I think a hard check when parsing would still be better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reluctant to make it throw exceptions in production here. The assertions date back to #42725, which was fixing #42157. #54415 and #56435 are also related. Even though those issues showed up in CI, I know there were also a load of problems reported by end users related to exceptions caused by negative memory values. This included nasty problems like ECK being unusable on certain operating systems.
This constructor is only called in 2 places in production code, and the code paths that lead to those calls were all changed in #42725 to ensure none of the arguments can be negative. Changing negative values reported by the JVM to zero was the general approach to solving all the related problems.
I'll make the two constructors log an error and change negative values to zero after the assertions. This will add an extra layer of protection that negative values won't corrupt the stats API outputs if a future code change means a negative value gets this far in production. And the assertions will still trip CI, so we should still find out about future JVM bugs or OS bugs as soon as we test with them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, but I was specifically referring to the adjusted total, which does not come from the JVM. CI won’t catch that since it will be passed in externally at runtime? Surely we can enforce our own system property does not have a negative value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surely we can enforce our own system property does not have a negative value?
Yes, sure, that's done in https://github.com/elastic/elasticsearch/pull/78750/files#diff-709e127f78f07be389e23093a3fd58cde32a752de9b80a7f64230e330d744e36R148 now.
So none of the arguments to this constructor should be negative now.
The adjusted value might still come from the JVM if the system property wasn't set at all, because it will have been set to be the same as total
. This class is really just a holder for passing previously calculated values around.
@elasticmachine update branch |
@elasticmachine run elasticsearch-ci/part-1 |
* upstream/master: (109 commits) Migrate custom role providers to licensed feature (elastic#79127) Remove stale AwaitsFix in InternalEngineTests (elastic#79323) Fix errors in RefreshListenersTests (elastic#79324) Reeable BwC Tests after elastic#79318 (elastic#79320) Mute BwC Tests for elastic#79318 (elastic#79319) Reenable BwC Tests after elastic#79308 (elastic#79313) Disable BwC Tests for elastic#79308 (elastic#79310) Adjust BWC for node-level field cap requests (elastic#79301) Allow total memory to be overridden (elastic#78750) Fix SnapshotBasedIndexRecoveryIT#testRecoveryIsCancelledAfterDeletingTheIndex (elastic#79269) Disable BWC tests Mute GeoIpDownloaderCliIT.testStartWithNoDatabases (elastic#79299) Add alias support to fleet search API (elastic#79285) Create a coordinating node level reader for tsdb (elastic#79197) Route documents to the correct shards in tsdb (elastic#77731) Inject migrate action regardless of allocate action (elastic#79090) Migrate to data tiers should always ensure a TIER_PREFERENCE is set (elastic#79100) Skip building of BWC distributions when building release artifacts (elastic#79180) Default ENFORCE_DEFAULT_TIER_PREFERENCE to true (elastic#79275) Deprecation of transient cluster settings (elastic#78794) ... # Conflicts: # server/src/main/java/org/elasticsearch/index/IndexMode.java # server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
* upstream/master: (521 commits) Migrate custom role providers to licensed feature (elastic#79127) Remove stale AwaitsFix in InternalEngineTests (elastic#79323) Fix errors in RefreshListenersTests (elastic#79324) Reeable BwC Tests after elastic#79318 (elastic#79320) Mute BwC Tests for elastic#79318 (elastic#79319) Reenable BwC Tests after elastic#79308 (elastic#79313) Disable BwC Tests for elastic#79308 (elastic#79310) Adjust BWC for node-level field cap requests (elastic#79301) Allow total memory to be overridden (elastic#78750) Fix SnapshotBasedIndexRecoveryIT#testRecoveryIsCancelledAfterDeletingTheIndex (elastic#79269) Disable BWC tests Mute GeoIpDownloaderCliIT.testStartWithNoDatabases (elastic#79299) Add alias support to fleet search API (elastic#79285) Create a coordinating node level reader for tsdb (elastic#79197) Route documents to the correct shards in tsdb (elastic#77731) Inject migrate action regardless of allocate action (elastic#79090) Migrate to data tiers should always ensure a TIER_PREFERENCE is set (elastic#79100) Skip building of BWC distributions when building release artifacts (elastic#79180) Default ENFORCE_DEFAULT_TIER_PREFERENCE to true (elastic#79275) Deprecation of transient cluster settings (elastic#78794) ... # Conflicts: # rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/tsdb/10_settings.yml # server/src/main/java/org/elasticsearch/common/settings/IndexScopedSettings.java # server/src/main/java/org/elasticsearch/common/settings/Setting.java # server/src/main/java/org/elasticsearch/index/IndexMode.java # server/src/test/java/org/elasticsearch/index/TimeSeriesModeTests.java
The current capacity in use in autoscaling would use the full container memory and not the adjusted total memory. ES sometimes responds with `current_capacity` as `required_capacity` and the `current_capacity` therefore need to use the adjusted capacity instead (since the orchestration should add the memory reservation on top). Relates elastic#78750
The current capacity in use in autoscaling would use the full container memory and not the adjusted total memory. ES sometimes responds with `current_capacity` as `required_capacity` and the `current_capacity` therefore need to use the adjusted capacity instead (since the orchestration should add the memory reservation on top). Relates #78750
The current capacity in use in autoscaling would use the full container memory and not the adjusted total memory. ES sometimes responds with `current_capacity` as `required_capacity` and the `current_capacity` therefore need to use the adjusted capacity instead (since the orchestration should add the memory reservation on top). Relates #78750
Since #65905 Elasticsearch has determined the Java heap settings
from node roles and total system memory.
This change allows the total system memory used in that calculation
to be overridden with a user-specified value. This is intended to
be used when Elasticsearch is running on a machine where some other
software that consumes a non-negligible amount of memory is running.
For example, a user could tell Elasticsearch to assume it was
running on a machine with 3GB of RAM when actually it was running
on a machine with 4GB of RAM.
The system property is
es.total_memory_bytes
, so, for example,could be specified using
-Des.total_memory_bytes=3221225472
.(It is specified in bytes rather than using a unit, because it
needs to be parsed by startup code that does not have access to
the utility classes that interpret byte size units.)