No easy way to fix incorrect persistent cluster setting #47038

Closed
sherry-ger opened this issue Sep 24, 2019 · 9 comments
Labels
:Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection), resiliency

Comments

@sherry-ger

Elasticsearch version (bin/elasticsearch --version):
7.3.0

Plugins installed: []

JVM version (java -version):

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:

While working to consolidate monitoring of our clusters, we applied the dynamic setting xpack.monitoring.exporters.cloud_monitoring.host to the cluster persistent settings. Unfortunately, there was a mistake in the host value: we included a trailing /. For example:
https://myhost:9243/
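
A request of roughly this shape would apply such a setting (a sketch only; the exact request body is not reproduced here, and cloud_monitoring is simply the name of our exporter):

PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.exporters.cloud_monitoring.host": "https://myhost:9243/"
  }
}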

The command was accepted and applied by the cluster. A bit later, the cluster became unresponsive. Upon investigation, we saw the following error message in the log:

org.elasticsearch.common.settings.SettingsException: [xpack.monitoring.exporters.cloud_monitoring.host] invalid host: [https://myhost:9243/]

Three issues here:

  1. The setting was accepted by Elasticsearch as valid when it is not
  2. The cluster became unresponsive
  3. There is no easy way to fix the issue. (In the end, we were able to fix it by editing the cluster state file with a hex editor.)

Steps to reproduce:

  1. Update the cluster persistent setting xpack.monitoring.exporters.cloud_monitoring.host
  2. Set the value to https://myhost:9243/ (note the trailing /)
  3. Observe that the setting is accepted as valid
  4. The cluster becomes unresponsive

Provide logs (if relevant):

DaveCTurner added the :Data Management/Monitoring and :Distributed Coordination/Cluster Coordination labels Sep 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

In this case setting xpack.monitoring.enabled: false on each node would, I think, have brought the cluster back to life. But this warrants further investigation, as a simple settings change shouldn't break the cluster in this way.
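
Concretely, that would mean adding the following line to each node's elasticsearch.yml and restarting the node (a sketch of the suggested workaround):

xpack.monitoring.enabled: false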

@sherry-ger
Author

sherry-ger commented Sep 24, 2019

Updating steps to reproduce:

  1. Updated the cluster persistent setting xpack.monitoring.exporters.cloud_monitoring.host
  2. Set the value to https://myhost:9243/ (note the trailing /)
  3. Observed that the setting was accepted as valid
  4. The cluster became unresponsive
  5. Restarted the whole cluster. At this point the cluster would not form, so we could not update any cluster setting; hence the hex editor

tagging @e-mars

@blinken

blinken commented Sep 25, 2019

@DaveCTurner confirming per Sherry's comment that this did not work - because no cluster node was able to load the state file from disk. Each node was printing the following error regularly:

[2019-09-20T15:59:18,969][WARN ][o.e.c.s.ClusterApplierService] [node-1] failed to apply updated cluster state in [0s]:
version [131012], uuid [aaa_aaaaaaaaaaaaa-wxbw], source [becoming candidate: joinLeaderInTerm]
org.elasticsearch.common.settings.SettingsException: [xpack.monitoring.exporters.cloud_monitoring.host] invalid host: [https://hostremoved.found.io:9243/]
        at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.createHosts(HttpExporter.java:400) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.createRestClient(HttpExporter.java:294) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.<init>(HttpExporter.java:219) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.<init>(HttpExporter.java:206) ~[?:?]
        at org.elasticsearch.xpack.monitoring.Monitoring.lambda$createComponents$1(Monitoring.java:134) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.Exporters.initExporters(Exporters.java:136) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.Exporters.setExportersSetting(Exporters.java:71) ~[?:?]
        at org.elasticsearch.common.settings.Setting$2.apply(Setting.java:659) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.settings.Setting$2.apply(Setting.java:632) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:610) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:191) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:460) ~[elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:165) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.3.0.jar:7.3.0]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.3.0.jar:7.3.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_92]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_92]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_92]
Caused by: java.lang.IllegalArgumentException: HttpHosts do not use paths [/]. see setRequestConfigCallback for proxies. value: [https://hostremoved.found.io:9243/]
        at org.elasticsearch.xpack.monitoring.exporter.http.HttpHostBuilder.<init>(HttpHostBuilder.java:157) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.http.HttpHostBuilder.builder(HttpHostBuilder.java:98) ~[?:?]
        at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.createHosts(HttpExporter.java:398) ~[?:?]
        ... 19 more

Calling GET /_cluster/settings?pretty would return the following, though the cluster has many persistent settings during normal operation.

{
  "persistent": { },
  "transient": { }
}

Attempting to update these settings returned 400.
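
For example, an attempt along these lines to clear the exporter settings was rejected (a sketch of the kind of request we tried, not the exact body):

PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.exporters.cloud_monitoring.host": null
  }
}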

Calling GET /_cluster/state?pretty resulted in a response similar to

{
  "cluster_name" : "...",
  "cluster_uuid" : "...",
  "version" : 131084,
  "state_uuid" : "Tlh..-8m8g",
  "master_node" : "8uX...",
  "blocks" : {
    "global" : {
      "1" : {
        "description" : "state not recovered / initialized",
        "retryable" : true,
        "disable_state_persistence" : true,
        "levels" : [
          "read",
          "write",
          "metadata_read",
          "metadata_write"
        ]
      }
    }
<snip>

The only way we were able to recover the state was to do something like

# shut down all cluster nodes
cd nodes/0/_state/   # on a master
sed -i 's/found.io:9243\//found.iooooooo/' global-3333.st
# start the master node and note the error message, similar to:
#   Caused by: org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?) : expected=63e0e171 actual=57268229
# then stop the node
xxd -p global-3333.st > global-3333.st.hex
# edit the last four bytes to s/57268229/63e0e171/
xxd -r -p global-3333.st.hex > global-3333.st
# repeat for three master nodes

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Sep 25, 2019
Today we log and swallow exceptions during cluster state application, but such
an exception should not occur. This commit adds assertions of this fact, and
updates the Javadocs to explain it.

Relates elastic#47038
@DaveCTurner
Contributor

confirming per Sherry's comment that this did not work

I am struggling to reproduce this from the information given. Here are the steps I followed: I started up a new, empty 3-node 7.3.0 cluster and ran the following command:

PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.exporters.cloud_monitoring.type": "http",
    "xpack.monitoring.exporters.cloud_monitoring.host": "https://myhost:9243/"
  }
}

I confirmed that all three nodes were stuck in a loop emitting exceptions like this:

[2019-09-25T09:35:09,587][WARN ][o.e.c.s.ClusterSettings  ] [node-2] failed to apply settings
org.elasticsearch.common.settings.SettingsException: [xpack.monitoring.exporters.cloud_monitoring.host] invalid host: [https://myhost:9243/]
	at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.createHosts(HttpExporter.java:400) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.createRestClient(HttpExporter.java:294) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.<init>(HttpExporter.java:219) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.<init>(HttpExporter.java:206) ~[?:?]
	at org.elasticsearch.xpack.monitoring.Monitoring.lambda$createComponents$1(Monitoring.java:134) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.initExporters(Exporters.java:136) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.Exporters.setExportersSetting(Exporters.java:71) ~[?:?]
	at org.elasticsearch.common.settings.Setting$2.apply(Setting.java:659) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.common.settings.Setting$2.apply(Setting.java:632) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.common.settings.AbstractScopedSettings$SettingUpdater.lambda$updater$0(AbstractScopedSettings.java:610) ~[elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.common.settings.AbstractScopedSettings.applySettings(AbstractScopedSettings.java:191) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:460) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:165) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.3.0.jar:7.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.3.0.jar:7.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:835) [?:?]
Caused by: java.lang.IllegalArgumentException: HttpHosts do not use paths [/]. see setRequestConfigCallback for proxies. value: [https://myhost:9243/]
	at org.elasticsearch.xpack.monitoring.exporter.http.HttpHostBuilder.<init>(HttpHostBuilder.java:157) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.http.HttpHostBuilder.builder(HttpHostBuilder.java:98) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.http.HttpExporter.createHosts(HttpExporter.java:398) ~[?:?]
	... 19 more

I restarted all three nodes and confirmed that they remained stuck in the same loop. I then added xpack.monitoring.enabled: false to all three nodes' elasticsearch.yml files and restarted them and observed that they all started up normally (except with monitoring disabled). This allowed me to remove the problematic setting with this command:

PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.exporters.cloud_monitoring.type": null,
    "xpack.monitoring.exporters.cloud_monitoring.host": null
  }
}

Finally I removed xpack.monitoring.enabled: false from all three nodes' elasticsearch.yml files and restarted them one last time.

@blinken

blinken commented Sep 25, 2019

That's interesting - the advice we received from support (and also my understanding) was that

For the most part, the logic is basically in this order of priority:
- transient settings
- persistent settings
- file defined settings
- defaults

As such, we didn't attempt to override the persistent settings using elasticsearch.yml. Are you saying that elasticsearch.yml can override persistent settings? In which cases - and where is this documented?

I'd also argue that the cluster ending up in a persistent crash loop as a result of a change to the monitoring configuration is a bug that needs to be fixed.

Really appreciate your time to investigate this!

@DaveCTurner
Contributor

the logic is basically in this order of priority

That's correct for each setting in isolation. However, we are not overriding any single setting in a way that contradicts this. xpack.monitoring.enabled cannot be set in the persistent or transient cluster settings so its value always comes from elasticsearch.yml or its default of true. This setting disables the monitoring component in a way that means that all the other xpack.monitoring.* settings have no effect, so the code that throws the exception here never runs. This is common behaviour across many of the optional components like monitoring that can be disabled with a setting.
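
For illustration, a request along these lines is rejected, because xpack.monitoring.enabled is a node-scope setting rather than a dynamic cluster setting (sketch only):

PUT /_cluster/settings
{
  "persistent": {
    "xpack.monitoring.enabled": false
  }
}

The only place it can be set is each node's elasticsearch.yml, which is why the workaround takes effect regardless of what is stored in the persistent cluster settings.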

I'd also argue that the cluster ending up in a persistent crash loop as a result of a change to the monitoring configuration is a bug that needs to be fixed.

Agreed.

DaveCTurner added a commit that referenced this issue Sep 25, 2019
Today we log and swallow exceptions during cluster state application, but such
an exception should not occur. This commit adds assertions of this fact, and
updates the Javadocs to explain it.

Relates #47038
ywelsch removed the v7.3.0 label Dec 2, 2019
@DaveCTurner
Contributor

Closed by #50694.

albertzaharovits added a commit that referenced this issue Feb 24, 2020
Add validation for the following logfile audit settings:

    xpack.security.audit.logfile.events.include
    xpack.security.audit.logfile.events.exclude
    xpack.security.audit.logfile.events.ignore_filters.*.users
    xpack.security.audit.logfile.events.ignore_filters.*.realms
    xpack.security.audit.logfile.events.ignore_filters.*.roles
    xpack.security.audit.logfile.events.ignore_filters.*.indices

Closes #52357
Relates #47711 #47038
Follows the example from #47246