
[Monitoring] Log the error when we fail to collect memory usage information #6426

Closed
ph opened this issue Feb 20, 2018 · 7 comments

Comments

@ph
Contributor

ph commented Feb 20, 2018

Follow-up from Discuss: https://discuss.elastic.co/t/error-while-getting-memory-usage-error-retrieving-process-stats/120638

We had a report on Discuss that errors were raised when collecting information about memory usage. We currently don't log the actual underlying error when the call fails (see the snippet below).

Additional information: Metricbeat on a Docker Swarm host on AWS using Moby Linux.

2018-02-20T12:12:03.188Z	ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
2018-02-20T12:12:03.188Z	ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats
2018-02-20T12:12:03.188Z	INFO	[monitoring]	log/log.go:124	Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"info":{"ephemeral_id":"0bd749e7-87ba-4c6f-ab96-4bd459f6b257","uptime":{"ms":1080058}},"memstats":{"gc_next":9093536,"memory_alloc":8263896,"memory_total":424347336}},"libbeat":{"config":{"module":{"running":4}},"pipeline":{"clients":9,"events":{"active":0}}},"system":{"load":{"1":0.17,"15":0.18,"5":0.08,"norm":{"1":0.085,"15":0.09,"5":0.04}}}}}}
2018-02-20T12:12:03.259Z	ERROR	schema/schema.go:41	Error on field 'count': Missing field: count, Error: Key disk_reads not found
2018-02-20T12:12:03.259Z	ERROR	schema/schema.go:41	Error on field 'count': Missing field: count, Error: Key disk_writes not found
2018-02-20T12:12:04.431Z	WARN	elasticsearch/client.go:502	Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbe9b2204cf35bc38, ext:1080127366249, loc:(*time.Location)(0x3b6b300)}, Meta:common.MapStr(nil), Fields:common.MapStr{"rabbitmq":common.MapStr{"queue":common.MapStr{"messages":common.MapStr{"total":common.MapStr{"count":104}, "ready":common.MapStr{"count":0}, "unacknowledged":common.MapStr{"count":104}, "persistent":common.MapStr{"count":104}}, "arguments":common.MapStr{}, "state":"running", "auto_delete":false, "vhost":"/", "exclusive":false, "node":"rabbit@53b22045affd", "memory":common.MapStr{"bytes":111928}, "disk":common.MapStr{"reads":common.MapStr{}, "writes":common.MapStr{}}, "durable":true, "consumers":common.MapStr{"count":1, "utilisation":common.MapStr{}}, "name":"employee.attendance.message-builder"}}, "tags":[]string{"Test"}, "docker":common.MapStr{"container":common.MapStr{"image":"rabbitmq:3.7.2-management-alpine@sha256:90038af8f3f971de3bb67f91967a775b843b9c789a998db9524ce097c0bcef1c", "labels":map[string]string{"com.docker.swarm.task":"", "com.docker.swarm.task.id":"yvv0bhf4jv2miwmo3c4i2lubt", "com.docker.swarm.task.name":"bridge-apps-test_rabbitmq.1.yvv0bhf4jv2miwmo3c4i2lubt", "com.docker.stack.namespace":"bridge-apps-test", "com.docker.swarm.node.id":"4cex8urr6fuptr4l9koraqqdz", "com.docker.swarm.service.id":"lc7c5soguneg0ttb2p8ibjvlg", "com.docker.swarm.service.name":"bridge-apps-test_rabbitmq"}, "id":"53b22045affd86307e98bb33fdd82d1452b5ac09b9a0ecdabb498775f4eea35f", "name":"bridge-apps-test_rabbitmq.1.yvv0bhf4jv2miwmo3c4i2lubt"}}, "beat":common.MapStr{"name":"59945cb07e2b", "hostname":"59945cb07e2b", "version":"6.2.1"}, "metricset":common.MapStr{"host":"10.0.1.21:15672", "rtt":3892, "name":"queue", "module":"rabbitmq"}}, Private:interface {}(nil)}, Flags:0x0} (status=400): {"type":"illegal_argument_exception","reason":"Can't merge a non object mapping [docker.container.labels.com.docker.swarm.task] with an object mapping [docker.container.labels.com.docker.swarm.task]"}
2018-02-20T12:12:04.432Z	WARN	elasticsearch/client.go:502	Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xbe9b2204cf782eba, ext:1080131720937, loc:(*time.Location)(0x3b6b300)}, Meta:common.MapStr(nil), Fields:common.MapStr{"metricset":common.MapStr{"host":"10.0.1.21:15672", "rtt":4458, "name":"node", "module":"rabbitmq"}, "rabbitmq":common.MapStr{"node":common.MapStr{"queue":common.MapStr{"index":common.MapStr{"journal_write":common.MapStr{"count":304}, "read":common.MapStr{"count":0}, "write":common.MapStr{"count":72}}}, "run":common.MapStr{"queue":0}, "type":"disc", "disk":common.MapStr{"free":common.MapStr{"bytes":9871515648, "limit":common.MapStr{"bytes":50000000}}}, "name":"rabbit@53b22045affd", "fd":common.MapStr{"total":1048576, "used":30}, "mnesia":common.MapStr{"disk":common.MapStr{"tx":common.MapStr{"count":21}}, "ram":common.MapStr{"tx":common.MapStr{"count":242}}}, "proc":common.MapStr{"total":1048576, "used":402}, "io":common.MapStr{"file_handle":common.MapStr{"open_attempt":common.MapStr{"avg":common.MapStr{"ms":0}, "count":700}}, "read":common.MapStr{"bytes":1, "count":1, "avg":common.MapStr{"ms":0}}, "reopen":common.MapStr{"count":1}, "seek":common.MapStr{"avg":common.MapStr{"ms":0}, "count":168}, "sync":common.MapStr{"avg":common.MapStr{"ms":2}, "count":218}, "write":common.MapStr{"bytes":218494, "count":218, "avg":common.MapStr{"ms":0}}}, "processors":2, "uptime":949787884, "msg":common.MapStr{"store_read":common.MapStr{"count":0}, "store_write":common.MapStr{"count":0}}, "socket":common.MapStr{"total":943626, "used":3}, "mem":common.MapStr{"limit":common.MapStr{"bytes":1655037952}, "used":common.MapStr{"bytes":94183424}}, "gc":common.MapStr{"num":common.MapStr{"count":8614950}, "reclaimed":common.MapStr{"bytes":111983064400}}}}, "tags":[]string{"Test"}, "docker":common.MapStr{"container":common.MapStr{"id":"53b22045affd86307e98bb33fdd82d1452b5ac09b9a0ecdabb498775f4eea35f", "name":"bridge-apps-test_rabbitmq.1.yvv0bhf4jv2miwmo3c4i2lubt", "image":"rabbitmq:3.7.2-management-alpine@sha256:90038af8f3f971de3bb67f91967a775b843b9c789a998db9524ce097c0bcef1c", "labels":map[string]string{"com.docker.swarm.service.id":"lc7c5soguneg0ttb2p8ibjvlg", "com.docker.swarm.service.name":"bridge-apps-test_rabbitmq", "com.docker.swarm.task":"", "com.docker.swarm.task.id":"yvv0bhf4jv2miwmo3c4i2lubt", "com.docker.swarm.task.name":"bridge-apps-test_rabbitmq.1.yvv0bhf4jv2miwmo3c4i2lubt", "com.docker.stack.namespace":"bridge-apps-test", "com.docker.swarm.node.id":"4cex8urr6fuptr4l9koraqqdz"}}}, "beat":common.MapStr{"name":"59945cb07e2b", "hostname":"59945cb07e2b", "version":"6.2.1"}}, Private:interface {}(nil)}, Flags:0x0} (status=400): {"type":"illegal_argument_exception","reason":"mapper [docker.container.labels.com.docker.swarm.task] of different type, current_type [keyword], merged_type [ObjectMapper]"}
2018-02-20T12:12:33.188Z	ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
2018-02-20T12:12:33.188Z	ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats

We should wrap the original error here:

state, err := beatProcessStats.GetOne(beatPID)
if err != nil {
	return 0, fmt.Errorf("error retrieving process stats")
}
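
A minimal sketch of the suggested change, assuming we simply format the underlying error into the message with fmt.Errorf (this predates the %w verb), so the log line becomes actionable:

state, err := beatProcessStats.GetOne(beatPID)
if err != nil {
	// Include the underlying cause instead of discarding it.
	return 0, fmt.Errorf("error retrieving process stats: %v", err)
}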

The Docker labels error in the trace is a separate issue, caused by an Elasticsearch mapping conflict.

@ph ph added the libbeat label Feb 20, 2018
@ph
Contributor Author

ph commented Feb 20, 2018

We will also need to return the original error here:

newProcs := make(ProcsMap, 1)
p := procStats.getSingleProcess(pid, newProcs)
if p == nil {
	return nil, fmt.Errorf("cannot find matching process for pid=%d", pid)
}
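
One hypothetical way to surface the cause, sketched only to illustrate the idea; it assumes getSingleProcess were changed to return an error alongside the process, which is not its current signature (as noted in the reply below, it logs its errors internally):

newProcs := make(ProcsMap, 1)
// Hypothetical: getSingleProcess returning (process, error) so the caller can wrap the cause.
p, err := procStats.getSingleProcess(pid, newProcs)
if err != nil {
	return nil, fmt.Errorf("cannot find matching process for pid=%d: %v", pid, err)
}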

@kvch
Contributor

kvch commented Feb 21, 2018

@ph getSingleProcess logs the errors, so no message goes unnoticed in that case.

@ph
Contributor Author

ph commented Feb 21, 2018

@kvch Correct, it has Debug- and Err-level log statements, but the current trace shows ERROR statements, and I don't think they come from getSingleProcess:

ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats

There are currently two problems with the error messages:

  1. They are recurring.
  2. error retrieving process stats is not actionable; neither the users nor we can correct the situation.

What do you think, could we improve the log message? That would fix both issues.

@ph
Contributor Author

ph commented Feb 21, 2018

@kvch Never mind, I saw the linked PR afterwards. More ☕️

@ph ph closed this as completed in #6429 Feb 21, 2018
@Constantin07

Constantin07 commented Mar 21, 2018

I'm still getting these errors in the logs with Metricbeat 6.2.2:

2018-03-21T10:10:07.539Z	ERROR	instance/metrics.go:69	Error while getting memory usage: error retrieving process stats
2018-03-21T10:10:07.539Z	ERROR	instance/metrics.go:113	Error retrieving CPU percentages: error retrieving process stats

but there is no clue as to what the problem could be. I'm running the official Metricbeat Docker container and trying to pull stats from the host.

docker run -d --restart=always --name metricbeat \
--net=host \
--uts=host \
-u root \
-v /proc:/hostfs/proc:ro \
-v /sys/fs/cgroup:/hostfs/sys/fs/cgroup:ro \
-v /:/hostfs:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
metricbeat:6.2.2 -system.hostfs=/hostfs

Any idea what could be the cause?

@andrewkroh
Member

andrewkroh commented Mar 21, 2018

@Constantin07 I think this deserves a new issue like "Monitoring metrics don't work when containerized". My initial guess is that it cannot find its own metrics in /hostfs/proc because it is PID-namespaced and is therefore looking for something like PID 2, while on the host its PID is very different.

This problem stems from the fact that /hostfs is treated as a global variable, which causes all metrics-collecting code to read from /hostfs/proc. But the self-monitoring metrics should always come from /proc.
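
To illustrate the point, a minimal standalone sketch (not libbeat code) that reads the process's own resident memory from /proc/self/status, deliberately ignoring any configured hostfs prefix, since self-monitoring should always look at the container's own /proc:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// selfRSS reads this process's resident set size from /proc/self/status,
// independent of any -system.hostfs setting.
func selfRSS() (string, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return "", fmt.Errorf("error retrieving process stats: %v", err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VmRSS:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "VmRSS:")), nil
		}
	}
	return "", fmt.Errorf("VmRSS not found in /proc/self/status")
}

func main() {
	rss, err := selfRSS()
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("self RSS:", rss)
}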

@Constantin07

Thanks @andrewkroh. Raised #6620.
