When an index in Elasticsearch is configured as read-only (via either `index.blocks.read_only` or `index.blocks.write`) for maintenance purposes, Filebeat logs the following and drops the data. There is no backoff mechanism in place, and the data is lost.
Note that Filebeat raises only a WARN, not even an ERROR, even though data is actually being lost.
On the other hand, if the whole cluster is set to read-only (via `cluster.blocks.read_only`, a cluster-level setting), everything works as expected: Filebeat logs an ERROR and the backoff mechanism kicks in. No data is lost.
```
2019-09-04T07:35:57.549Z ERROR pipeline/output.go:121 Failed to publish events: 403 Forbidden: {"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/6/cluster read-only (api)];"}],"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/6/cluster read-only (api)];"},"status":403}
2019-09-04T07:35:57.596Z INFO elasticsearch/client.go:743 Attempting to connect to Elasticsearch version 7.2.0
2019-09-04T07:35:57.783Z INFO [index-management] idxmgmt/std.go:252 Auto ILM enable success.
2019-09-04T07:36:01.667Z ERROR pipeline/output.go:100 Failed to connect to backoff(elasticsearch(https://a073fd7d4f61450d97cdca4dba0cc4df.eu-west-1.aws.found.io:9243)): Connection marked as failed because the onConnect callback failed: failed to check for policy name 'filebeat-7.3.1': (status=403) {"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/6/cluster read-only (api)];"}],"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/6/cluster read-only (api)];"},"status":403}: 403 Forbidden: {"error":{"root_cause":[{"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/6/cluster read-only (api)];"}],"type":"cluster_block_exception","reason":"blocked by: [FORBIDDEN/6/cluster read-only (api)];"},"status":403}
# Backoff works fine.
2019-09-04T07:39:00.163Z INFO pipeline/output.go:93 Attempting to reconnect to backoff(elasticsearch(https://a073fd7d4f61450d97cdca4dba0cc4df.eu-west-1.aws.found.io:9243)) with 7 reconnect attempt(s)
2019-09-04T07:39:00.163Z INFO [publisher] pipeline/retry.go:189 retryer: send unwait-signal to consumer
2019-09-04T07:39:00.163Z INFO [publisher] pipeline/retry.go:191 done
2019-09-04T07:39:00.163Z INFO [publisher] pipeline/retry.go:166 retryer: send wait signal to consumer
2019-09-04T07:39:00.163Z INFO [publisher] pipeline/retry.go:168 done
```
In the working scenario the message is raised as an ERROR (and there is no data loss). In the non-working scenario, data is lost yet Filebeat logs only a WARN. That does not look accurate.
Version: Verified on 6.8 and 7.3
Operating System: Linux
Steps to Reproduce:
1. Set the destination index to read-only, then revert a few minutes later: observe that data is lost.
2. Set the cluster to read-only, then revert a few minutes later: observe that no data is lost.
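The two toggles above can be driven with the Elasticsearch settings API, roughly as follows (the index name `filebeat-test`, the `localhost:9200` address, and any credentials are placeholders for your environment):

```shell
# Non-working scenario: block writes on a single index (Filebeat drops data).
curl -X PUT 'localhost:9200/filebeat-test/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only": true}'

# Revert a few minutes later.
curl -X PUT 'localhost:9200/filebeat-test/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only": false}'

# Working scenario: block the whole cluster (Filebeat backs off and retries).
curl -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.blocks.read_only": true}}'

# Revert (null removes the transient setting).
curl -X PUT 'localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.blocks.read_only": null}}'
```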
As a final comment: in both described scenarios (working and non-working), Elasticsearch responds with a 403, but with different payloads.
I'm not familiar with this part of the code. Maybe we should add a separate check for status == 403 in this case? Instead of counting these events as nonIndexable, they should be counted as fails. @urso WDYT?
Most probably related: elastic/elasticsearch#49393
If Elasticsearch starts returning 429 instead of 403 in this case (index read-only), the back-off mechanism should work.
HTTP 4xx status codes indicate a client error. Retrying on client errors can leave Beats stuck, as the event might never become indexable. The only 4xx status code that tells the client to retry is 429 (well, there is the non-standard 449 status code, but we don't handle that).
We depend on Elasticsearch returning the correct status code for each situation; there is nothing we can or should do on the Beats side. That is what elastic/elasticsearch#49393 (and other related issues) is about: having Elasticsearch return more sensible status codes.
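The retry policy described above can be sketched as a tiny status classifier. This is a hypothetical illustration, not Filebeat's actual code; the function name `should_retry` is made up:

```shell
#!/bin/sh
# Sketch of the policy described above -- retry only on 429 and on
# server-side 5xx errors; drop events on any other 4xx client error.
should_retry() {
  case "$1" in
    429) echo retry ;;  # Too Many Requests: server explicitly asks to back off
    5??) echo retry ;;  # server-side error: safe to try again later
    *)   echo drop  ;;  # other client errors (incl. 403): event is not retried
  esac
}

for status in 403 409 429 503; do
  echo "HTTP $status -> $(should_retry "$status")"
done
```

Under this policy, the read-only index's 403 falls into the drop bucket, which is exactly why a 429 from Elasticsearch would make the existing back-off machinery do the right thing.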
As there will be no change in Beats, I'm closing this. Please comment/object if you think otherwise.