ES_data_should_pass_validations is (once again) flaky #7172

Closed
barkbay opened this issue Sep 19, 2023 · 9 comments · Fixed by #7184
barkbay commented Sep 19, 2023

ES_data_should_pass_validations is failing in several tests:

🐞 TestMultipleOutputConfig ocp
🐞 TestAgentConfigRef kind-1-27-1
🐞 TestMultiOutputRecipe kind-1-27-1
🐞 TestMultiOutputRecipe eks-arm
🐞 TestAgentConfigRef kind-1-26-4
🐞 TestAgentConfigRef aks

{
	"Time": "2023-09-18T23:17:27.990167308Z",
	"Action": "run",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/Elasticsearch_cluster_should_be_created"
}
...
{
	"Time": "2023-09-18T23:18:13.734607877Z",
	"Action": "run",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/ES_endpoint_should_eventually_be_reachable"
}
...
{
	"Time": "2023-09-18T23:18:19.983314382Z",
	"Action": "output",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/Agent_health_should_be_green",
	"Output": "=== RUN   TestAgentConfigRef/Agent_health_should_be_green\n"
}
...

{
	"Time": "2023-09-18T23:33:19.988122377Z",
	"Action": "output",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/ES_data_should_pass_validations",
	"Output": "Retries (15m0s timeout)"
}

{
	"Time": "2023-09-18T23:33:19.988333976Z",
	"Action": "output",
	"Package": "github.com/elastic/cloud-on-k8s/v2/test/e2e/agent",
	"Test": "TestAgentConfigRef/ES_data_should_pass_validations",
	"Output": "unexpected count 0 of data streams returned when looking for logs-elastic_agent-default\n"
}

I did not investigate all the cases, but in at least two of them logs-elastic_agent-default is reported in the ES diagnostics:

health status index                                       uuid                   pri rep docs.count docs.deleted store.size pri.store.size sth
green  open   logs-elastic_agent-default                  -SfqJpNhR8OhidZDxI66dg   1   1         75            0    763.8kb        453.1kb false
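For reference, the failing validation presumably looks the name up through the data streams API rather than _cat/indices; a rough curl equivalent of that check (endpoint and credentials are placeholders, not what the test literally runs) would be:

# Ask ES for the data stream by name. In the flaky runs no data stream is returned,
# even though a plain index called logs-elastic_agent-default exists.
curl -Ss -k -u elastic:REDACTED 'https://elasticsearch-es-http:9200/_data_stream/logs-elastic_agent-default'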
barkbay commented Sep 19, 2023

Trying to get some interesting logs:

{
	"log.level": "info",
	"@timestamp": "2023-09-18T23:17:40.892Z",
	"log.origin": {
		"file.name": "coordinator/coordinator.go",
		"file.line": 989
	},
	"message": "Unit state changed log-default-logfile-system-835850a7-3f6b-4ae9-8d02-a9a15767cf39 (STARTING->HEALTHY): Healthy",
	"log": {
		"source": "elastic-agent"
	},
	"component": {
		"id": "log-default",
		"state": "HEALTHY"
	},
	"unit": {
		"id": "log-default-logfile-system-835850a7-3f6b-4ae9-8d02-a9a15767cf39",
		"type": "input",
		"state": "HEALTHY",
		"old_state": "STARTING"
	},
	"ecs.version": "1.6.0"
}

Looks like Agent was happy at that point?

barkbay commented Sep 19, 2023

Logs from the ES cluster showing that logs-elastic_agent-default was created around 2023-09-18T23:18:18.696Z:

-- index [[logs-elastic_agent-default/-SfqJpNhR8OhidZDxI66dg]]
----shard_id [logs-elastic_agent-default][0]
--------[logs-elastic_agent-default][0], node[rTUTErajSVqd48raWMmRsw], [R], s[STARTED], a[id=Nh3X6gVnTBO3Ign_S3Q30g], failed_attempts[0]
--------[logs-elastic_agent-default][0], node[UMwxJvkPToSp6A0-Jn277A], [P], s[STARTED], a[id=9Nwd4WiDSUCu8Zf1kleDyg], failed_attempts[0]

barkbay commented Sep 19, 2023

I ran the test locally on Kind; it always passes.
One thing I noted, though, is that there is no logs-elastic_agent-default index in my case; instead I have the following indices:

root@test-agent-configref-rw64-agent-f9pkk:/usr/share/elastic-agent# curl -Ss -k -u elastic:REDACTED 'https://10.96.5.63:9200/_cat/indices' | grep logs
green open .ds-logs-elastic_agent.filebeat-default-2023.09.19-000001         GEP1yt8fTl2b8y7VMepmVw 1 1  108 0   418kb 220.4kb
green open .ds-logs-elastic_agent-default-2023.09.19-000001                  y86uLFw9RdKb0rhoVoBVFQ 1 1   75 0 274.8kb   149kb
green open .ds-logs-elastic_agent.metricbeat-default-2023.09.19-000001       koQUYev1TTiHtcgWhwc2wQ 1 1  198 0 555.2kb 289.1kb

If I run curl -k -u elastic:REDACTED 'https://10.96.5.63:9200/logs-elastic_agent-default/_search?q=!error.message:*', documents are returned from the index .ds-logs-elastic_agent-default-2023.09.19-000001.

The questions then are:

  • Why is the .ds-logs-elastic_agent-default-* index missing in the logs above?
  • Why do we have logs-elastic_agent-default instead? (See the resolve sketch below.)
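One way to answer the second question (a sketch, not part of the test; same local endpoint and placeholder credentials as above) is the resolve API, which reports whether a name maps to a concrete index, an alias, or a data stream:

# Distinguishes a plain index named logs-elastic_agent-default from a data stream of the same name.
curl -Ss -k -u elastic:REDACTED 'https://10.96.5.63:9200/_resolve/index/logs-elastic_agent-default'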

pebrc commented Sep 19, 2023

Could this be an initialisation problem where the data stream is not created in time before the first data arrives, so ES auto-creates a plain index instead? Our e2e test then checks for the data stream and therefore fails?
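If that is what happens, the fallback is easy to reproduce by hand (a hedged sketch with placeholder endpoint, credentials, and document, not the actual Agent request): when no index template with a data_stream definition matches the target name yet, a bulk create auto-creates an ordinary index instead of a data stream.

# With the built-in "logs" template not yet installed, this creates a regular index
# named logs-elastic_agent-default (cause [auto(bulk api)]) rather than the backing
# index .ds-logs-elastic_agent-default-*.
curl -Ss -k -u elastic:REDACTED -H 'Content-Type: application/json' \
  -X POST 'https://elasticsearch-es-http:9200/_bulk' \
  --data-binary $'{"create":{"_index":"logs-elastic_agent-default"}}\n{"@timestamp":"2023-09-18T23:18:18Z","message":"test"}\n'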

pebrc commented Sep 25, 2023

On a successful local run, the index is created with cause [initialize_data_stream], templates [logs], shards [1]/[1]:

pod/elasticsearch-es-default-2/logs.txt:{"@timestamp":"2023-09-25T08:20:03.119Z", "log.level": "INFO", "message":"[.ds-logs-elastic_agent-default-2023.09.25-000001] creating index, cause [initialize_data_stream], templates [logs], shards [1]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-es-default-2][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","trace.id":"022b536dae06f9f65e9f4b8e34777a32","elasticsearch.cluster.uuid":"nLywCk4-RiyKr_9EVHvHUQ","elasticsearch.node.id":"_e_HTk2PSKeTXentVYtaUg","elasticsearch.node.name":"elasticsearch-es-default-2","elasticsearch.cluster.name":"elasticsearch"}

On a failed e2e run, the index is created with cause [auto(bulk api)], templates [], shards [1]/[1]:

pod/elasticsearch-rljd-es-default-1/logs.txt:{"@timestamp":"2023-09-24T23:12:41.985Z", "log.level": "INFO", "message":"[logs-elastic_agent-default] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[elasticsearch-rljd-es-default-1][masterService#updateTask][T#1]","log.logger":"org.elasticsearch.cluster.metadata.MetadataCreateIndexService","elasticsearch.cluster.uuid":"t8qv3ZNJTmu8IKA3n05nlw","elasticsearch.node.id":"MhU2IcceQIiIPXpgbwzHJA","elasticsearch.node.name":"elasticsearch-rljd-es-default-1","elasticsearch.cluster.name":"elasticsearch-rljd"}

Elastic Agent uses the default logs index template, which should be installed by default through x-pack:

{
      "name": "logs-elastic_agent-default",
      "timestamp_field": {
        "name": "@timestamp"
      },
      "indices": [
        {
          "index_name": ".ds-logs-elastic_agent-default-2023.09.25-000001",
          "index_uuid": "DjSbdXgmQ4m7QbVrFgTanA"
        }
      ],
      "generation": 1,
      "_meta": {
        "description": "default logs template installed by x-pack",
        "managed": true
      },
      "status": "RED",
      "template": "logs",
      "ilm_policy": "logs",
      "hidden": false,
      "system": false,
      "allow_custom_routing": false,
      "replicated": false
    },

What I don't understand is why Elasticsearch accepts the bulk request before it has initialised the default index templates. We are still using 8.9.0 for these tests (pending #7166).
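A quick way to check whether the built-in template was in place at the time of the bulk request (a sketch; the template name logs matches the templates [logs] field in the successful log line above):

# Returns the built-in composable template, including its data_stream section, once x-pack
# has installed it; a 404 would mean the bulk request raced ahead of template installation.
curl -Ss -k -u elastic:REDACTED 'https://elasticsearch-es-http:9200/_index_template/logs'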

@DaveCTurner commented:
Copying info from Slack: it's important for the process doing the indexing to include the ?require_alias=true query parameter if it expects there to be a suitable index template and would prefer the indexing to be rejected if the template is not present. It looks like this is not the case here.

Although Elasticsearch could in theory implement some special-case handling for the handful of templates that it installs shortly after startup, it's still important to set ?require_alias=true to ensure the right behaviour with other templates, and also to protect against the situation where a user later on deletes the indices and templates. There are very few situations where setting ?require_alias=true is not appropriate today, but we must unfortunately maintain the default of false for backwards-compatibility reasons.

Relates https://github.com/elastic/cloud/issues/106249.
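For illustration, this is roughly what the suggested safeguard looks like on a bulk request (placeholder endpoint, credentials, and document; and, per the next comment, the flag currently also blocks data stream auto-creation):

# With ?require_alias=true the request is rejected instead of auto-creating a plain index
# when nothing suitable exists for the target name.
curl -Ss -k -u elastic:REDACTED -H 'Content-Type: application/json' \
  -X POST 'https://elasticsearch-es-http:9200/_bulk?require_alias=true' \
  --data-binary $'{"create":{"_index":"logs-elastic_agent-default"}}\n{"@timestamp":"2023-09-18T23:18:18Z","message":"test"}\n'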

@DaveCTurner commented:
Bah, sorry, it turns out that the ?require_alias flag will block auto-creation of data streams even if there's a valid template, so today this will only help if you're using an old-school alias. I opened elastic/elasticsearch#99882 about that. I still think in the long run we should be setting ?require_alias=true on these requests, but we might need to fix this flag before that is viable.

pebrc commented Sep 26, 2023

@DaveCTurner thanks for your help (as always!)

I think in the short term we could look into minimising the chance of this happening in the tests. It is tricky, as we treat the Agent test builder and the ES test builder as completely independent. One idea is to sort the test builders so that the ES one always runs first, and maybe add an additional check to the ES one for the default index templates.

@DaveCTurner commented:
Yeah that's the best idea I can think of right now. The additional check in ES should be something like GET _cluster/health?wait_for_status=green&wait_for_events=languid.
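In curl terms the extra check could look something like this (a sketch with placeholder endpoint, credentials, and timeout), run after ES creation and before the Agent resources are applied:

# Blocks until the cluster is green and the master's pending task queue is empty,
# the intent being that the built-in index templates have been installed by then.
curl -Ss -k -u elastic:REDACTED 'https://elasticsearch-es-http:9200/_cluster/health?wait_for_status=green&wait_for_events=languid&timeout=5m'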

pebrc added a commit that referenced this issue Sep 27, 2023
Add an immediate health check after creation to delay subsequent k8s resource creations to a point when ES is available. The main motivation is test stability in cases where Beats would send data immediately without waiting for proper ES initialisation; see #7172.
Add an opt-out mechanism for tests where we know the health check cannot succeed.
TODO: it is unclear at this point why the remote cluster test fails if we wait for cluster 1 to come up first. I will create a follow-up issue.
pebrc added a commit that referenced this issue Sep 29, 2023