
[Monitoring] Log the error when we fail to collect memory usage information #6426

Closed
ph opened this issue Feb 20, 2018 · 7 comments

Comments

@ph
Contributor

ph commented Feb 20, 2018

Follow-up from Discuss: https://discuss.elastic.co/t/error-while-getting-memory-usage-error-retrieving-process-stats/120638

We had a report on Discuss that errors were raised when collecting information about memory usage. We currently don't log the actual underlying error when the call fails (see the snippet below).

Additional information: Metricbeat on a Docker Swarm host on AWS using Moby Linux.

2018-02-20T12:12:03.188Z	ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
2018-02-20T12:12:03.188Z	ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats
2018-02-20T12:12:03.188Z	INFO	[monitoring]	log/log.go:124	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"info":{"ephemeral_id":"0bd749e7-87ba-4c6f-ab96-4bd459f6b257","uptime":{"ms":1080058}},"memstats":{"gc_next":9093536,"memory_alloc":8263896,"memory_total":424347336}},"libbeat":{"config":{"module":{"running":4}},"pipeline":{"clients":9,"events":{"active":0}}},"system":{"load":{"1":0.17,"15":0.18,"5":0.08,"norm":{"1":0.085,"15":0.09,"5":0.04}}}}}}
2018-02-20T12:12:03.259Z	ERROR	schema/schema.go:41	Error on field 'count': Missing field: count, Error: Key disk_reads not found
2018-02-20T12:12:03.259Z	ERROR	schema/schema.go:41	Error on field 'count': Missing field: count, Error: Key disk_writes not found
2018-02-20T12:12:04.431Z	WARN	elasticsearch/client.go:502	Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbe9b2204cf35bc38, ext:1080127366249, loc:(*time.Location)(0x3b6b300)}, Meta:common.MapStr(nil), Fields:common.MapStr{"rabbitmq":common.MapStr{"queue":common.MapStr{"messages":common.MapStr{"total":common.MapStr{"count":104}, "ready":common.MapStr{"count":0}, "unacknowledged":common.MapStr{"count":104}, "persistent":common.MapStr{"count":104}}, "arguments":common.MapStr{}, "state":"running", "auto_delete":false, "vhost":"/", "exclusive":false, "node":"rabbit@53b22045affd", "memory":common.MapStr{"bytes":111928}, "disk":common.MapStr{"reads":common.MapStr{}, "writes":common.MapStr{}}, "durable":true, "consumers":common.MapStr{"count":1, "utilisation":common.MapStr{}}, "name":"employee.attendance.message-builder"}}, "tags":[]string{"Test"}, "docker":common.MapStr{"container":common.MapStr{"image":"rabbitmq:3.7.2-management-alpine@sha256:90038af8f3f971de3bb67f91967a775b843b9c789a998db9524ce097c0bcef1c", "labels":map[string]string{"com.docker.swarm.task":"", "com.docker.swarm.task.id":"yvv0bhf4jv2miwmo3c4i2lubt", "com.docker.swarm.task.name":"bridge-apps-test_rabbitmq.1.yvv0bhf4jv2miwmo3c4i2lubt", "com.docker.stack.namespace":"bridge-apps-test", "com.docker.swarm.node.id":"4cex8urr6fuptr4l9koraqqdz", "com.docker.swarm.service.id":"lc7c5soguneg0ttb2p8ibjvlg", "com.docker.swarm.service.name":"bridge-apps-test_rabbitmq"}, "id":"53b22045affd86307e98bb33fdd82d1452b5ac09b9a0ecdabb498775f4eea35f", "name":"bridge-apps-test_rabbitmq.1.yvv0bhf4jv2miwmo3c4i2lubt"}}, "beat":common.MapStr{"name":"59945cb07e2b", "hostname":"59945cb07e2b", "version":"6.2.1"}, "metricset":common.MapStr{"host":"10.0.1.21:15672", "rtt":3892, "name":"queue", "module":"rabbitmq"}}, Private:interface {}(nil)}, Flags:0x0} (status=400): {"type":"illegal_argument_exception","reason":"Can't merge a non object mapping [docker.container.labels.com.docker.swarm.task] with an object mapping [docker.container.labels.com.docker.swarm.task]"}
2018-02-20T12:12:04.432Z	WARN	elasticsearch/client.go:502	Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbe9b2204cf782eba, ext:1080131720937, loc:(*time.Location)(0x3b6b300)}, Meta:common.MapStr(nil), Fields:common.MapStr{"metricset":common.MapStr{"host":"10.0.1.21:15672", "rtt":4458, "name":"node", "module":"rabbitmq"}, "rabbitmq":common.MapStr{"node":common.MapStr{"queue":common.MapStr{"index":common.MapStr{"journal_write":common.MapStr{"count":304}, "read":common.MapStr{"count":0}, "write":common.MapStr{"count":72}}}, "run":common.MapStr{"queue":0}, "type":"disc", "disk":common.MapStr{"free":common.MapStr{"bytes":9871515648, "limit":common.MapStr{"bytes":50000000}}}, "name":"rabbit@53b22045affd", "fd":common.MapStr{"total":1048576, "used":30}, "mnesia":common.MapStr{"disk":common.MapStr{"tx":common.MapStr{"count":21}}, "ram":common.MapStr{"tx":common.MapStr{"count":242}}}, "proc":common.MapStr{"total":1048576, "used":402}, "io":common.MapStr{"file_handle":common.MapStr{"open_attempt":common.MapStr{"avg":common.MapStr{"ms":0}, "count":700}}, "read":common.MapStr{"bytes":1, "count":1, "avg":common.MapStr{"ms":0}}, "reopen":common.MapStr{"count":1}, "seek":common.MapStr{"avg":common.MapStr{"ms":0}, "count":168}, "sync":common.MapStr{"avg":common.MapStr{"ms":2}, "count":218}, "write":common.MapStr{"bytes":218494, "count":218, "avg":common.MapStr{"ms":0}}}, "processors":2, "uptime":949787884, "msg":common.MapStr{"store_read":common.MapStr{"count":0}, "store_write":common.MapStr{"count":0}}, "socket":common.MapStr{"total":943626, "used":3}, "mem":common.MapStr{"limit":common.MapStr{"bytes":1655037952}, "used":common.MapStr{"bytes":94183424}}, "gc":common.MapStr{"num":common.MapStr{"count":8614950}, "reclaimed":common.MapStr{"bytes":111983064400}}}}, "tags":[]string{"Test"}, "docker":common.MapStr{"container":common.MapStr{"id":"53b22045affd86307e98bb33fdd82d1452b5ac09b9a0ecdabb498775f4eea35f", "name":"bridge-apps-test_rabbitmq.1.yvv0bhf4jv2miwmo3c4i2lubt", "image":"rabbitmq:3.7.2-management-alpine@sha256:90038af8f3f971de3bb67f91967a775b843b9c789a998db9524ce097c0bcef1c", "labels":map[string]string{"com.docker.swarm.service.id":"lc7c5soguneg0ttb2p8ibjvlg", "com.docker.swarm.service.name":"bridge-apps-test_rabbitmq", "com.docker.swarm.task":"", "com.docker.swarm.task.id":"yvv0bhf4jv2miwmo3c4i2lubt", "com.docker.swarm.task.name":"bridge-apps-test_rabbitmq.1.yvv0bhf4jv2miwmo3c4i2lubt", "com.docker.stack.namespace":"bridge-apps-test", "com.docker.swarm.node.id":"4cex8urr6fuptr4l9koraqqdz"}}}, "beat":common.MapStr{"name":"59945cb07e2b", "hostname":"59945cb07e2b", "version":"6.2.1"}}, Private:interface {}(nil)}, Flags:0x0} (status=400): {"type":"illegal_argument_exception","reason":"mapper [docker.container.labels.com.docker.swarm.task] of different type, current_type [keyword], merged_type [ObjectMapper]"}
2018-02-20T12:12:33.188Z	ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
2018-02-20T12:12:33.188Z	ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats

We should wrap the original error here:

state, err := beatProcessStats.GetOne(beatPID)
if err != nil {
	return 0, fmt.Errorf("error retrieving process stats")
}
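
A minimal sketch of the suggested change, assuming we simply format the underlying error into the message with fmt.Errorf (this predates the %w verb), so the log line becomes actionable:

state, err := beatProcessStats.GetOne(beatPID)
if err != nil {
	// Include the underlying cause instead of discarding it.
	return 0, fmt.Errorf("error retrieving process stats: %v", err)
}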

The Docker labels error in the trace is a separate issue, caused by an Elasticsearch mapping conflict.

@ph ph added the libbeat label Feb 20, 2018
@ph
Contributor Author

ph commented Feb 20, 2018

We will also need to return the original error here:

newProcs := make(ProcsMap, 1)
p := procStats.getSingleProcess(pid, newProcs)
if p == nil {
	return nil, fmt.Errorf("cannot find matching process for pid=%d", pid)
}
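
One hypothetical way to surface the cause, sketched only to illustrate the idea; it assumes getSingleProcess were changed to return an error alongside the process, which is not its current signature (as noted in the reply below, it logs its errors internally):

newProcs := make(ProcsMap, 1)
// Hypothetical: getSingleProcess returning (process, error) so the caller can wrap the cause.
p, err := procStats.getSingleProcess(pid, newProcs)
if err != nil {
	return nil, fmt.Errorf("cannot find matching process for pid=%d: %v", pid, err)
}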

@kvch
Contributor

kvch commented Feb 21, 2018

@ph getSingleProcess logs the errors, so no message goes unnoticed in that case.

@ph
Contributor Author

ph commented Feb 21, 2018

@kvch Correct, it has Debug- and Err-level log statements, but the current trace shows ERROR statements, and I don't think they come from getSingleProcess:

ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats

There are currently two problems with the error messages:

  1. They are recurring.
  2. error retrieving process stats is not actionable; neither the users nor we can correct the situation.

What do you think, could we improve the log message? That would fix both issues.

@ph
Contributor Author

ph commented Feb 21, 2018

@kvch Never mind, I saw the linked PR afterwards. More ☕️

@ph ph closed this as completed in #6429 Feb 21, 2018
@Constantin07

Constantin07 commented Mar 21, 2018

I'm still getting these errors in the logs with Metricbeat 6.2.2:

2018-03-21T10:10:07.539Z	ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
2018-03-21T10:10:07.539Z	ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats

but there is no clue as to what the problem could be. I'm running the official Metricbeat Docker container and trying to pull stats from the host.

docker run -d --restart=always --name metricbeat \
--net=host \
--uts=host \
-u root \
-v /proc:/hostfs/proc:ro \
-v /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro \
-v /:/hostfs:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
metricbeat:6.2.2 -system.hostfs=/hostfs

Any idea what could be the cause?

@andrewkroh
Member

andrewkroh commented Mar 21, 2018

@Constantin07 I think this deserves a new issue like "Monitoring metrics don't work when containerized". My initial guess is that it cannot find its own metrics in /hostfs/proc because it is PID-namespaced and is therefore looking for something like PID 2, while on the host its PID is very different.

This problem stems from the fact that /hostfs is treated as a global variable, which causes all metrics-collecting code to read from /hostfs/proc. But the self-monitoring metrics should always come from /proc.
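
To illustrate the point, a minimal standalone sketch (not libbeat code) that reads the process's own resident memory from /proc/self/status, deliberately ignoring any configured hostfs prefix, since self-monitoring should always look at the container's own /proc:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// selfRSS reads this process's resident set size from /proc/self/status,
// independent of any -system.hostfs setting.
func selfRSS() (string, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "", fmt.Errorf("error retrieving process stats: %v", err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found in /proc/self/status")
}

func main() {
	rss, err := selfRSS()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("self RSS:", rss)
}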

@Constantin07

Thanks @andrewkroh. Raised #6620.
