ES_data_should_pass_validations is (once again) flaky #7172

Closed
barkbay opened this issue Sep 19, 2023 · 9 comments · Fixed by #7184
barkbay commented Sep 19, 2023

ES_data_should_pass_validations is failing in several tests:

🐞 TestMultipleOutputConfig ocp
🐞 TestAgentConfigRef kind-1-27-1
🐞 TestMultiOutputRecipe kind-1-27-1
🐞 TestMultiOutputRecipe eks-arm
🐞 TestAgentConfigRef kind-1-26-4
🐞 TestAgentConfigRef aks

{
	"Time": "2023-09-18T23:17:27.990167308Z",
	"Action": "run",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/Elasticsearch_cluster_should_be_created"
}
...
{
	"Time": "2023-09-18T23:18:13.734607877Z",
	"Action": "run",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/ES_endpoint_should_eventually_be_reachable"
}
...
{
	"Time": "2023-09-18T23:18:19.983314382Z",
	"Action": "output",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/Agent_health_should_be_green",
	"Output": "=== RUN   TestAgentConfigRef/Agent_health_should_be_green\n"
}
...

{
	"Time": "2023-09-18T23:33:19.988122377Z",
	"Action": "output",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/ES_data_should_pass_validations",
	"Output": "Retries (15m0s timeout)"
}

{
	"Time": "2023-09-18T23:33:19.988333976Z",
	"Action": "output",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/ES_data_should_pass_validations",
	"Output": "unexpected count 0 of data streams returned when looking for logs-elastic_agent-default\n"
}

I did not investigate all the cases, but in at least two of them logs-elastic_agent-default is reported in the ES diagnostics:

health status index                                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size sth
green  open   logs-elastic_agent-default                  -SfqJpNhR8OhidZDxI66dg   1   1         75            0    763.8kb        453.1kb false
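For reference, the failing validation presumably looks the name up through the data streams API rather than _cat/indices; a rough curl equivalent of that check (endpoint and credentials are placeholders, not what the test literally runs) would be:

# Ask ES for the data stream by name. In the flaky runs no data stream is returned,
# even though a plain index called logs-elastic_agent-default exists.
curl -Ss -k -u elastic:REDACTED 'https://elasticsearch-es-http:9200/_data_stream/logs-elastic_agent-default'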
barkbay commented Sep 19, 2023

Trying to get some interesting logs:

{
	"log.level": "info",
	"@timestamp": "2023-09-18T23:17:40.892Z",
	"log.origin": {
		"file.name": "coordinator/coordinator.go",
		"file.line": 989
	},
	"message": "Unit state changed log-default-logfile-system-835850a7-3f6b-4ae9-8d02-a9a15767cf39 (STARTING->HEALTHY): Healthy",
	"log": {
		"source": "elastic-agent"
	},
	"component": {
		"id": "log-default",
		"state": "HEALTHY"
	},
	"unit": {
		"id": "log-default-logfile-system-835850a7-3f6b-4ae9-8d02-a9a15767cf39",
		"type": "input",
		"state": "HEALTHY",
		"old_state": "STARTING"
	},
	"ecs.version": "1.6.0"
}

Looks like Agent was happy at that point?

barkbay commented Sep 19, 2023

Logs from the ES cluster showing that logs-elastic_agent-default was created around 2023-09-18T23:18:18.696Z:

-- index [[logs-elastic_agent-default/-SfqJpNhR8OhidZDxI66dg]]
----shard_id [logs-elastic_agent-default][0]
--------[logs-elastic_agent-default][0], node[rTUTErajSVqd48raWMmRsw], [R], s[STARTED], a[id=Nh3X6gVnTBO3Ign_S3Q30g], failed_attempts[0]
--------[logs-elastic_agent-default][0], node[UMwxJvkPToSp6A0-Jn277A], [P], s[STARTED], a[id=9Nwd4WiDSUCu8Zf1kleDyg], failed_attempts[0]

barkbay commented Sep 19, 2023

I ran the test locally on Kind; it always passes.
One thing I noted, though, is that there is no logs-elastic_agent-default index in my case; instead I have the following indices:

root@test-agent-configref-rw64-agent-f9pkk:/usr/share/elastic-agent# curl -Ss -k -u elastic:REDACTED 'https://10.96.5.63:9200/_cat/indices' | grep logs
green open .ds-logs-elastic_agent.filebeat-default-2023.09.19-000001         GEP1yt8fTl2b8y7VMepmVw 1 1  108 0   418kb 220.4kb
green open .ds-logs-elastic_agent-default-2023.09.19-000001                  y86uLFw9RdKb0rhoVoBVFQ 1 1   75 0 274.8kb   149kb
green open .ds-logs-elastic_agent.metricbeat-default-2023.09.19-000001       koQUYev1TTiHtcgWhwc2wQ 1 1  198 0 555.2kb 289.1kb

If I run curl -k -u elastic:REDACTED 'https://10.96.5.63:9200/logs-elastic_agent-default/_search?q=!error.message:*', documents are returned from the index .ds-logs-elastic_agent-default-2023.09.19-000001.

The questions then are:

  • Why is the .ds-logs-elastic_agent-default-* index missing in the logs above?
  • Why do we have logs-elastic_agent-default instead? (See the resolve sketch below.)
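One way to answer the second question (a sketch, not part of the test; same local endpoint and placeholder credentials as above) is the resolve API, which reports whether a name maps to a concrete index, an alias, or a data stream:

# Distinguishes a plain index named logs-elastic_agent-default from a data stream of the same name.
curl -Ss -k -u elastic:REDACTED 'https://10.96.5.63:9200/_resolve/index/logs-elastic_agent-default'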

pebrc commented Sep 19, 2023

Could this be an initialisation problem where the data stream is not created in time before the first data arrives, so ES auto-creates a plain index instead? Our e2e test then checks for the data stream and therefore fails?
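If that is what happens, the fallback is easy to reproduce by hand (a hedged sketch with placeholder endpoint, credentials, and document, not the actual Agent request): when no index template with a data_stream definition matches the target name yet, a bulk create auto-creates an ordinary index instead of a data stream.

# With the built-in "logs" template not yet installed, this creates a regular index
# named logs-elastic_agent-default (cause [auto(bulk api)]) rather than the backing
# index .ds-logs-elastic_agent-default-*.
curl -Ss -k -u elastic:REDACTED -H 'Content-Type: application/json' \
  -X POST 'https://elasticsearch-es-http:9200/_bulk' \
  --data-binary $'{"create":{"_index":"logs-elastic_agent-default"}}\n{"@timestamp":"2023-09-18T23:18:18Z","message":"test"}\n'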

pebrc commented Sep 25, 2023

On a successful local run, the index is created with cause [initialize_data_stream], templates [logs], shards [1]/[1]:

pod/elasticsearch-es-default-2/logs.txt:{"@timestamp":"2023-09-25T08:20:03.119Z", "log.level": "INFO", "message":"[.ds-logs-elastic_agent-default-2023.09.25-000001] creating index, cause [initialize_data_stream], templates [logs], shards [1]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-es-default-2][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","trace.id":"022b536dae06f9f65e9f4b8e34777a32","elasticsearch.cluster.uuid":"nLywCk4-RiyKr_9EVHvHUQ","elasticsearch.node.id":"_e_HTk2PSKeTXentVYtaUg","elasticsearch.node.name":"elasticsearch-es-default-2","elasticsearch.cluster.name":"elasticsearch"}

On a failed e2e run, the index is created with cause [auto(bulk api)], templates [], shards [1]/[1]:

pod/elasticsearch-rljd-es-default-1/logs.txt:{"@timestamp":"2023-09-24T23:12:41.985Z", "log.level": "INFO", "message":"[logs-elastic_agent-default] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-rljd-es-default-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"t8qv3ZNJTmu8IKA3n05nlw","elasticsearch.node.id":"MhU2IcceQIiIPXpgbwzHJA","elasticsearch.node.name":"elasticsearch-rljd-es-default-1","elasticsearch.cluster.name":"elasticsearch-rljd"}

Elastic Agent uses the default logs index template, which should be installed by default through x-pack:

{
      "name": "logs-elastic_agent-default",
      "timestamp_field": {
        "name": "@timestamp"
      },
      "indices": [
        {
          "index_name": ".ds-logs-elastic_agent-default-2023.09.25-000001",
          "index_uuid": "DjSbdXgmQ4m7QbVrFgTanA"
        }
      ],
      "generation": 1,
      "_meta": {
        "description": "default logs template installed by x-pack",
        "managed": true
      },
      "status": "RED",
      "template": "logs",
      "ilm_policy": "logs",
      "hidden": false,
      "system": false,
      "allow_custom_routing": false,
      "replicated": false
    },

What I don't understand is why Elasticsearch accepts the bulk request before it has initialised the default index templates. We are still using 8.9.0 for these tests (pending #7166).
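A quick way to check whether the built-in template was in place at the time of the bulk request (a sketch; the template name logs matches the templates [logs] field in the successful log line above):

# Returns the built-in composable template, including its data_stream section, once x-pack
# has installed it; a 404 would mean the bulk request raced ahead of template installation.
curl -Ss -k -u elastic:REDACTED 'https://elasticsearch-es-http:9200/_index_template/logs'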

@DaveCTurner commented:
Copying info from Slack: it's important for the process doing the indexing to include the ?require_alias=true query parameter if it expects there to be a suitable index template and would prefer the indexing to be rejected if the template is not present. It looks like this is not the case here.

Although Elasticsearch could in theory implement some special-case handling for the handful of templates that it installs shortly after startup, it's still important to set ?require_alias=true to ensure the right behaviour with other templates, and also to protect against the situation where a user later on deletes the indices and templates. There are very few situations where setting ?require_alias=true is not appropriate today, but we must unfortunately maintain the default of false for backwards-compatibility reasons.

Relates https://github.com/elastic/cloud/issues/106249.
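For illustration, this is roughly what the suggested safeguard looks like on a bulk request (placeholder endpoint, credentials, and document; and, per the next comment, the flag currently also blocks data stream auto-creation):

# With ?require_alias=true the request is rejected instead of auto-creating a plain index
# when nothing suitable exists for the target name.
curl -Ss -k -u elastic:REDACTED -H 'Content-Type: application/json' \
  -X POST 'https://elasticsearch-es-http:9200/_bulk?require_alias=true' \
  --data-binary $'{"create":{"_index":"logs-elastic_agent-default"}}\n{"@timestamp":"2023-09-18T23:18:18Z","message":"test"}\n'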

@DaveCTurner commented:
Bah, sorry, it turns out that the ?require_alias flag will block auto-creation of data streams even if there's a valid template, so today this will only help if you're using an old-school alias. I opened elastic/elasticsearch#99882 about that. I still think in the long run we should be setting ?require_alias=true on these requests, but we might need to fix this flag before that is viable.

pebrc commented Sep 26, 2023

@DaveCTurner thanks for your help (as always!)

I think in the short term we could look into minimising the chance of this happening in the tests. It is tricky, as we treat the Agent test builder and the ES test builder as completely independent. One idea is to sort the test builders so that the ES one always runs first, and maybe add an additional check to the ES one for the default index templates.

@DaveCTurner commented:
Yeah that's the best idea I can think of right now. The additional check in ES should be something like GET _cluster/health?wait_for_status=green&wait_for_events=languid.
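In curl terms the extra check could look something like this (a sketch with placeholder endpoint, credentials, and timeout), run after ES creation and before the Agent resources are applied:

# Blocks until the cluster is green and the master's pending task queue is empty,
# the intent being that the built-in index templates have been installed by then.
curl -Ss -k -u elastic:REDACTED 'https://elasticsearch-es-http:9200/_cluster/health?wait_for_status=green&wait_for_events=languid&timeout=5m'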

pebrc added a commit that referenced this issue Sep 27, 2023
Add an immediate health check after creation to delay subsequent k8s resource creations to a point when ES is available. The main motivation is test stability in cases where Beats would send data immediately without waiting for proper ES initialisation; see #7172.
Add an opt-out mechanism for tests where we know the health check cannot succeed.
TODO: it is unclear at this point why the remote cluster test fails if we wait for cluster 1 to come up first. I will create a follow-up issue.
pebrc added a commit that referenced this issue Sep 29, 2023