No easy way to fix incorrect persistent cluster setting #47038
Comments
Pinging @elastic/es-core-features
Pinging @elastic/es-distributed
In this case setting
Updating steps to reproduce:
tagging @e-mars
@DaveCTurner confirming per Sherry's comment that this did not work - because no cluster node was able to load the state file from disk. Each node was printing the following error regularly:
Calling
Attempting to update these settings returned 400. Calling
The only way we were able to recover the state was to do something like
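(The exact requests used are not quoted above, and as noted the update attempts returned 400. For general context only: on a responsive cluster, the usual way to clear an incorrect persistent setting is to set it to null via the cluster settings API. A minimal sketch, assuming an unsecured node reachable at localhost:9200:)

```sh
# Sketch: clear the bad persistent setting by assigning it null.
# Assumes a reachable node at localhost:9200 with no authentication.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "xpack.monitoring.exporters.cloud_monitoring.host": null
  }
}'
```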
Today we log and swallow exceptions during cluster state application, but such an exception should not occur. This commit adds assertions of this fact, and updates the Javadocs to explain it. Relates elastic#47038
I am struggling to reproduce this from the information given. Here are the steps I'm following. I started up a new empty 3-node 7.3.0 cluster and ran the following command:
I confirmed that all three nodes were stuck in a loop emitting exceptions like this:
I restarted all three nodes and confirmed that they remained stuck in the same loop. I then added
Finally I removed
That's interesting - the advice we received from support (and also my understanding) was that
As such, we didn't attempt to override the persistent settings using elasticsearch.yml. Are you saying that elasticsearch.yml can override persistent settings? In which cases, and where is this documented? I'd also argue that the cluster getting into a persistent crash loop as a result of a change to the monitoring configuration is a bug that needs to be fixed. Really appreciate your time investigating this!
That's correct for each setting in isolation. However, we are not overriding any single setting in a way that contradicts this.
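(For illustration only: as confirmed above, for any single key a persistent cluster setting takes precedence over the same key in elasticsearch.yml, so the workaround is not to override the broken host value itself but to add a related node-level setting. The particular setting below is a hypothetical example, not necessarily the one used in this thread:)

```sh
# Hypothetical sketch: add a related node-level setting to elasticsearch.yml
# rather than trying to override the broken persistent key directly.
# (Whether disabling the exporter is the appropriate related setting here is an assumption.)
echo 'xpack.monitoring.exporters.cloud_monitoring.enabled: false' >> config/elasticsearch.yml
```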
Agreed.
Closed by #50694.
Add validation for the following logfile audit settings:
xpack.security.audit.logfile.events.include
xpack.security.audit.logfile.events.exclude
xpack.security.audit.logfile.events.ignore_filters.*.users
xpack.security.audit.logfile.events.ignore_filters.*.realms
xpack.security.audit.logfile.events.ignore_filters.*.roles
xpack.security.audit.logfile.events.ignore_filters.*.indices
Closes #52357
Relates #47711 #47038
Follows the example from #47246
Elasticsearch version (bin/elasticsearch --version): 7.3.0
Plugins installed: []
JVM version (java -version):

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
While working to consolidate monitoring of the clusters, we applied the following dynamic setting to the cluster persistent settings: xpack.monitoring.exporters.cloud_monitoring.host. Unfortunately, there was a mistake in the host value, as we included a / at the end; for example, https://myhost:9243/.
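(A sketch of the kind of request described, assuming the standard cluster settings API and an unsecured node at localhost:9200; the host and port are placeholders:)

```sh
# Sketch: applying the exporter host with the unintended trailing slash.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "xpack.monitoring.exporters.cloud_monitoring.host": "https://myhost:9243/"
  }
}'
```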
The command was accepted and applied by the cluster. A bit later, the cluster became unresponsive. Upon investigation, we saw the following error message in the log:
Three issues here:
Steps to reproduce:
Set xpack.monitoring.exporters.cloud_monitoring.host to https://myhost:9243/ (please note the / at the end).

Provide logs (if relevant):