metricbeat: error calling MarshalJSON for type common.Float #3580

Closed
spalger opened this issue Feb 13, 2017 · 31 comments
Labels
bug, Metricbeat, module, Team:Integrations

Comments

@spalger

spalger commented Feb 13, 2017

  • Version: Snapshot from 15b32e4
  • Operating System: ubuntu
  • Steps to Reproduce: unsure

What I think happened:

  • metricbeat was running fine
  • switched from system module to docker module
  • exit and restart, still running fine
  • adjust config to remove some metricsets from docker module config
  • attempted to stop ./metricbeat but it wasn't responding to ctrl+c
  • killed the process, tried again, then kill -9
  • started it back up with different metricsets config
  • Running command produces no metrics, constantly logs ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value

I've tried to fix this by wiping out the metricbeat data directory and restarting the box, but I'm unable to get it to work anymore.

@spalger
Author

spalger commented Feb 13, 2017

Looks like the issue was that some of the containers I had created were busted and were constantly restarting. This must have caused an error when metricbeat tried to read something.

When I updated the containers to remove the busted ones I got the following in my logs:

2017/02/13 19:46:48.065754 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065750 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065795 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.065910 docker.go:125: WARN Container stopped when recovering stats: 476573cb42145e260f1130c1d05308440a9227666e5c9a32a9c33b3ca7dc3623
2017/02/13 19:46:48.067729 log.go:144: ERR recovered from panic while fetching 'docker/diskio' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.067922 log.go:145: ERR Stacktrace: goroutine 76 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420584f80)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc4203ef380, 0x53)
        /go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420585368, 0xc420a78090)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/diskio.(*BLkioService).getBlkioStats(0xc4204f00c8, 0xc420585368, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/helper.go:52 +0x14f
github.com/elastic/beats/metricbeat/module/docker/diskio.(*BLkioService).getBlkioStatsList(0xc4204f00c8, 0xc42058d000, 0xd, 0xd, 0x0, 0x0, 0xc4284ecd42)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/helper.go:40 +0xe2
github.com/elastic/beats/metricbeat/module/docker/diskio.(*MetricSet).Fetch(0xc4204801b0, 0x284ecd42, 0x11594c0, 0x40d84f, 0x115d5c8, 0x7f3c9df5ca08)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/diskio/diskio.go:54 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420232e40, 0x7f3c9df5ca08, 0xc4204801b0, 0xc420105c70, 0x1, 0xc4203ef380, 0x53, 0xc420105ea4)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420232e40, 0xc420496720, 0xc420496840, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420232e40, 0xc420496720, 0xc420496840)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420232e40)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:48.067742 log.go:144: ERR recovered from panic while fetching 'docker/memory' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.068297 log.go:145: ERR Stacktrace: goroutine 79 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420250c78)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc42034a2a0, 0x53)
        /go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0xc4207300d1, 0x5)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420250ee8, 0xc42001a5d0)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/memory.(*MemoryService).GetMemoryStats(0x1176288, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/memory/helper.go:34 +0x60
github.com/elastic/beats/metricbeat/module/docker/memory.(*MemoryService).getMemoryStatsList(0x1176288, 0xc420766000, 0xd, 0xd, 0x0, 0x0, 0xc4284cb1cf)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/memory/helper.go:25 +0x104
github.com/elastic/beats/metricbeat/module/docker/memory.(*MetricSet).Fetch(0xc420480360, 0x284cb1cf, 0x11594c0, 0x40d84f, 0x115db50, 0x7f3c9df5cbb8)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/memory/memory.go:52 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420233b40, 0x7f3c9df5cbb8, 0xc420480360, 0xc420073c70, 0x1, 0xc42034a2a0, 0x53, 0xc420073ea4)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420233b40, 0xc420496720, 0xc420496840, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420233b40, 0xc420496720, 0xc420496840)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420233b40)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:48.068707 log.go:144: ERR recovered from panic while fetching 'docker/cpu' for host '/var/run/docker.sock'. Recovering, but please report this: runtime error: index out of range.
2017/02/13 19:46:48.068851 log.go:145: ERR Stacktrace: goroutine 75 [running]:
runtime/debug.Stack(0xbd423c, 0x2b, 0xc420588fe0)
        /usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/elastic/beats/libbeat/logp.Recover(0xc42048e6e0, 0x50)
        /go/src/github.com/elastic/beats/libbeat/logp/log.go:145 +0x138
panic(0xad8920, 0xc420010130)
        /usr/local/go/src/runtime/panic.go:458 +0x243
github.com/elastic/beats/metricbeat/module/docker.ExtractContainerName(0x0, 0x0, 0x0, 0xbb241d, 0x5)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:38 +0x16b
github.com/elastic/beats/metricbeat/module/docker.NewContainer(0xc420589368, 0xc4207d0930)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/helper.go:32 +0x4e
github.com/elastic/beats/metricbeat/module/docker/cpu.(*CPUService).getCpuStats(0x1176288, 0xc420589368, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/helper.go:52 +0x64
github.com/elastic/beats/metricbeat/module/docker/cpu.(*CPUService).getCPUStatsList(0x1176288, 0xc420769000, 0xd, 0xd, 0x0, 0x0, 0xc4284d1261)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/helper.go:42 +0xeb
github.com/elastic/beats/metricbeat/module/docker/cpu.(*MetricSet).Fetch(0xc420480120, 0x284d1261, 0x11594c0, 0x40d84f, 0x115e8a0, 0x7f3c9df5c978)
        /go/src/github.com/elastic/beats/metricbeat/module/docker/cpu/cpu.go:52 +0x7c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).multiEventFetch(0xc420232d00, 0x7f3c9df5c978, 0xc420480120, 0xc42041dc70, 0x1, 0xc42048e6e0, 0x50, 0xc42041dea4)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:267 +0x6c
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).fetch(0xc420232d00, 0xc420496720, 0xc420496840, 0x0, 0x0)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:229 +0x331
github.com/elastic/beats/metricbeat/mb/module.(*metricSetWrapper).startFetching(0xc420232d00, 0xc420496720, 0xc420496840)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:203 +0x2fe
github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start.func1(0xc420229220, 0xc420496720, 0xc420496840, 0xc420232d00)
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:139 +0x67
created by github.com/elastic/beats/metricbeat/mb/module.(*Wrapper).Start
        /go/src/github.com/elastic/beats/metricbeat/mb/module/wrapper.go:140 +0x1f9
2017/02/13 19:46:49.674812 client.go:291: ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value
2017/02/13 19:46:57.655048 metrics.go:39: INFO Non-zero metrics in the last 30s: fetches.docker-container.events=39 fetches.docker-container.success=3 fetches.docker-cpu.events=12 fetches.docker-cpu.success=1 fetches.docker-diskio.events=12 fetches.docker-diskio.success=1 fetches.docker-healthcheck.success=3 fetches.docker-info.events=3 fetches.docker-info.success=3 fetches.docker-memory.events=12 fetches.docker-memory.success=1 fetches.docker-network.success=2 libbeat.es.call_count.PublishEvents=4 libbeat.es.publish.read_bytes=17358 libbeat.es.publish.write_bytes=85118 libbeat.es.published_and_acked_events=77 libbeat.publisher.messages_in_worker_queues=78 libbeat.publisher.published_events=78

But now things seem to be reporting normally.

I'm going to leave this open since the error reporting and the inability to exit metricbeat probably characterize this as a bug, but I'll wait for feedback.

@tsg
Contributor

tsg commented Feb 13, 2017

The stack trace is coming from here, where we don't check that names has at least one entry before accessing names[0]. That one should be easy to fix. Cc @ruflin
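For illustration, a minimal standalone sketch of that guard (the function here is a simplified stand-in, not the actual beats helper):

package main

import (
    "fmt"
    "strings"
)

// extractContainerName picks a primary name from the list the Docker API
// returns for a container. Checking len(names) first avoids the
// "index out of range" panic shown in the stack traces above when a
// container has no names.
func extractContainerName(names []string) string {
    if len(names) == 0 {
        return ""
    }
    // Docker prefixes names with "/", e.g. "/my_app".
    return strings.TrimPrefix(names[0], "/")
}

func main() {
    fmt.Println(extractContainerName(nil))                 // "" instead of a panic
    fmt.Println(extractContainerName([]string{"/my_app"})) // "my_app"
}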

I'm not sure how the first JSON encoding error can happen. Perhaps @urso has an idea?

@tsg tsg added the bug and Metricbeat labels Feb 13, 2017
@ruflin
Member

ruflin commented Feb 14, 2017

Interesting that it can happen that a container doesn't have a name. @spalger Which docker version are you using?

For the long shutdown: if you have lots of containers, it can take quite a bit of time to shut down :-( The problem is that at the moment we have to make an HTTP request for each container, so these can queue up :-( How many containers did you have?

@spalger
Author

spalger commented Feb 14, 2017

Docker version 1.13.1, build 092cba3, and I had 11 containers at that time I think. I couldn't help but think that the metrics module was frozen or something. It wasn't using any resources and the log kept reporting No non-zero metrics in the last 30s or something like that after the kill was sent.

@spalger
Author

spalger commented Feb 14, 2017

Looks like metricbeat has frozen up again, but this time just because a container crashed and was not instructed to restart on crash. Here is the series of events as I observed them:

  • restart containers, but turn off "auto restart" behavior
  • watch as the memory usage of some containers rose and eventually caused the containers to be terminated
  • memory reports from metricbeat stop arriving
  • check the metricbeat log file and notice that new non-zero metrics are reported, but no errors
  • ctrl-c to stop metricbeat, but it does not close
  • run docker-compose up to rebuild my containers
  • once all docker containers are restarted and running again metricbeat closes
  • restart with -e flag so I can produce log output attached below
  • wait for cycle to restart

console output: https://gist.github.com/spalger/a179d18dfc13b2a006a529634a6404e7

@spalger
Author

spalger commented Feb 14, 2017

If anyone would like to take a look at this machine, just ping me offline with your public key

@ruflin
Member

ruflin commented Feb 16, 2017

@spalger Thanks for all the details. It seems metricbeat is hanging in a request to the docker stats API (my assumption). I need to find a good way to crash a container to reproduce this.

@tsg
Contributor

tsg commented Feb 16, 2017

Btw, you can get a stacktrace from Metricbeat if you send it SIGQUIT. That might be helpful if it happens again.

@spalger
Author

spalger commented Feb 16, 2017

@ruflin Looks like the container doesn't need to crash; a short-lived container that exits with status 0 seems to have the same effect.

@ruflin
Member

ruflin commented Feb 17, 2017

@spalger I managed to reproduce this. I opened #3610, since this issue also contains a second problem; that makes it easier to track and find.

@ruflin
Member

ruflin commented Feb 17, 2017

@spalger I think I managed to find the bug and provided a fix here: #3612. This should also fix the stack trace report you saw above.

@untergeek
Member

Not sure if FreeBSD jails count as containers to metricbeat, but I see this error a lot on my FreeNAS box:

ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value

In fact, it's the only error in any of the logs.

@ruflin
Member

ruflin commented Feb 24, 2017

@untergeek Does it still report any data? It seems the container returns an interesting "JSON" structure. It would be interesting to get hold of the JSON data that is received, to have a closer look. I'm wondering whether this could be an encoding issue or whether the server does not return JSON even though it should.

@jess-lawrence-axomic

I'm also having this issue:

ERR Failed to encode event: json: error calling MarshalJSON for type common.Float: invalid character 'N' looking for beginning of value

Versions:

docker version: 1.26
metricbeat version: metricbeat version 5.2.2 (amd64), libbeat 5.2.2

Metricbeat is set to pull over metricsets: ['cpu', 'info', 'memory', 'network', 'diskio', 'container']

It's pulling in info, diskio, network, container and cpu, but failing to pull in memory (which I believe is part of this error).

Is there any way to get at the struct docker returns for inspection?

@ruflin
Member

ruflin commented Mar 27, 2017

@jess-lawrence-axomic You should be able to access the endpoint through GET /containers/(id or name)/stats. See https://docs.docker.com/engine/api/v1.24/. It would be great if you could paste the output here so we can use it for some testing.
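If it helps, here is a minimal Go sketch that queries that endpoint over the Docker unix socket (the container name and socket path are placeholders; stream=false asks the API for a single snapshot instead of a stream):

package main

import (
    "context"
    "fmt"
    "io/ioutil"
    "net"
    "net/http"
)

func main() {
    // Route HTTP requests over the Docker daemon's unix socket.
    client := &http.Client{
        Transport: &http.Transport{
            DialContext: func(ctx context.Context, network, addr string) (net.Conn, error) {
                return net.Dial("unix", "/var/run/docker.sock")
            },
        },
    }

    // Replace "my_container" with your container ID or name.
    resp, err := client.Get("http://unix/containers/my_container/stats?stream=false")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        panic(err)
    }
    fmt.Println(string(body))
}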

@jess-lawrence-axomic

jess-lawrence-axomic commented Mar 27, 2017

Hey @ruflin , here it is:

{
    "blkio_stats": {
        "io_merged_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 0
            }
        ],
        "io_queue_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 0
            }
        ],
        "io_service_bytes_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 389120
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 389120
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 389120
            }
        ],
        "io_service_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 24988598
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 24988598
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 24988598
            }
        ],
        "io_serviced_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 44
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 44
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 44
            }
        ],
        "io_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "",
                "value": 431
            }
        ],
        "io_wait_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 5653454862
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 5653454862
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 5653454862
            }
        ],
        "sectors_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "",
                "value": 760
            }
        ]
    },
    "cpu_stats": {
        "cpu_usage": {
            "percpu_usage": [
                375581793188,
                203238317936
            ],
            "total_usage": 578820111124,
            "usage_in_kernelmode": 3860000000,
            "usage_in_usermode": 10070000000
        },
        "system_cpu_usage": 642398270000000,
        "throttling_data": {
            "periods": 0,
            "throttled_periods": 0,
            "throttled_time": 0
        }
    },
    "id": "0e921a961e2b1e671dd925440b2f8f072e772bc0e969734b00f00918984fe968",
    "memory_stats": {},
    "name": "/gallant_fermi",
    "networks": {
        "eth0": {
            "rx_bytes": 202406,
            "rx_dropped": 0,
            "rx_errors": 0,
            "rx_packets": 1253,
            "tx_bytes": 119135,
            "tx_dropped": 0,
            "tx_errors": 0,
            "tx_packets": 1313
        }
    },
    "num_procs": 0,
    "pids_stats": {},
    "precpu_stats": {
        "cpu_usage": {
            "total_usage": 0,
            "usage_in_kernelmode": 0,
            "usage_in_usermode": 0
        },
        "throttling_data": {
            "periods": 0,
            "throttled_periods": 0,
            "throttled_time": 0
        }
    },
    "preread": "0001-01-01T00:00:00Z",
    "read": "2017-03-27T10:09:05.342431137Z",
    "storage_stats": {}
}

It seems to have omitted memory stats altogether. This node runs on an EC2 instance (HVM) so I suspect that could be causing issues?

@ruflin
Member

ruflin commented Mar 28, 2017

Thanks for sharing the output. Very interesting. I'm curious what it means when the memory stats are empty. Perhaps @exekias or @douaejeouit know more?

@exekias
Contributor

exekias commented Mar 28, 2017

That probably means the container is not running at the moment, so no memory info is reported. I'm afraid we need some more tracing to detect what happened there, as I don't see any value starting with 'N' (from invalid character 'N' looking for beginning of value).

@douaejeouit
Contributor

douaejeouit commented Mar 29, 2017

Thanks for sharing. According to the given logs, the problem occurs when the container stops while stats are being fetched. When this happens, there are no stats to report. I think we should handle this case instead of just logging a warning (docker.go:130):

else if err == nil && stats == nil {
    logp.Warn("Container stopped when recovering stats: %v", container.ID)
}
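For illustration, a minimal standalone sketch of that idea (types and names are hypothetical, not the beats structures): drop entries whose stats came back nil before building any events from them.

package main

import "fmt"

// statsBlob stands in for the decoded stats payload of one container.
type statsBlob struct {
    MemoryLimit uint64
    // ... other decoded stats fields
}

// containerStats pairs a container ID with the stats the daemon returned;
// Stats is nil when the container stopped while we were fetching.
type containerStats struct {
    ID    string
    Stats *statsBlob
}

// filterRunning drops containers without stats so downstream code never
// builds events from empty data.
func filterRunning(in []containerStats) []containerStats {
    out := make([]containerStats, 0, len(in))
    for _, c := range in {
        if c.Stats == nil {
            continue // stopped mid-fetch: nothing to report
        }
        out = append(out, c)
    }
    return out
}

func main() {
    list := []containerStats{
        {ID: "476573cb4214"},                 // example: stopped, stats came back nil
        {ID: "abc123", Stats: &statsBlob{}},  // example: still running
    }
    fmt.Println(len(filterRunning(list))) // 1
}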

Concerning the omitted memory stats, it's probably not a bug because, logically, if it were, you wouldn't have had any output at all; the command would probably crash or report an error. But in this case you're getting your container statistics except for the memory ones. That's weird! It looks more like the memory stats and the storage stats were deliberately omitted, or at least that's the impression the output gives.
I see that your docker is running on an EC2 instance; we should probably observe the stats command's behaviour there if possible. @jess-lawrence-axomic Could you please try it again with another container and share the output? Thank you in advance.

@jess-lawrence-axomic

Good morning @douaejeouit @exekias

The container was running when I retrieved the stats from the docker API, here is another container on a new node/host:

admin@ip-10-0-16-176:~$ docker ps
CONTAINER ID        IMAGE                                                               COMMAND             CREATED             STATUS              PORTS                  NAMES
bf8be0a4026c        866167950344.dkr.ecr.us-east-1.amazonaws.com/openasset-prd:latest   "./start.sh"        23 hours ago        Up 23 hours         0.0.0.0:8392->80/tcp   festive_lalande

And the stats:

{
    "blkio_stats": {
        "io_merged_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 0
            }
        ],
        "io_queue_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 0
            }
        ],
        "io_service_bytes_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 8192
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 8192
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 8192
            }
        ],
        "io_service_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 537011
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 537011
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 537011
            }
        ],
        "io_serviced_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 1
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 1
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 1
            }
        ],
        "io_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "",
                "value": 1
            }
        ],
        "io_wait_time_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "Read",
                "value": 181040
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Write",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Sync",
                "value": 0
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Async",
                "value": 181040
            },
            {
                "major": 202,
                "minor": 0,
                "op": "Total",
                "value": 181040
            }
        ],
        "sectors_recursive": [
            {
                "major": 202,
                "minor": 0,
                "op": "",
                "value": 16
            }
        ]
    },
    "cpu_stats": {
        "cpu_usage": {
            "percpu_usage": [
                70936811668,
                48850403781
            ],
            "total_usage": 119787215449,
            "usage_in_kernelmode": 1140000000,
            "usage_in_usermode": 4780000000
        },
        "system_cpu_usage": 171407860000000,
        "throttling_data": {
            "periods": 0,
            "throttled_periods": 0,
            "throttled_time": 0
        }
    },
    "id": "bf8be0a4026cbd5cf86748da7d27cc3e21f54976dbbd6f46d3bd03e6bf536d25",
    "memory_stats": {},
    "name": "/festive_lalande",
    "networks": {
        "eth0": {
            "rx_bytes": 145851,
            "rx_dropped": 0,
            "rx_errors": 0,
            "rx_packets": 876,
            "tx_bytes": 94039,
            "tx_dropped": 0,
            "tx_errors": 0,
            "tx_packets": 910
        }
    },
    "num_procs": 0,
    "pids_stats": {},
    "precpu_stats": {
        "cpu_usage": {
            "total_usage": 0,
            "usage_in_kernelmode": 0,
            "usage_in_usermode": 0
        },
        "throttling_data": {
            "periods": 0,
            "throttled_periods": 0,
            "throttled_time": 0
        }
    },
    "preread": "0001-01-01T00:00:00Z",
    "read": "2017-03-29T08:43:44.292542038Z",
    "storage_stats": {}
}

However, I'm thinking this is more of a node config issue, as the memory cgroup isn't present:

admin@ip-10-0-16-176:~$ ls -l /sys/fs/cgroup/
total 0
dr-xr-xr-x 3 root root  0 Mar 28 08:58 blkio
lrwxrwxrwx 1 root root 11 Mar 28 08:58 cpu -> cpu,cpuacct
lrwxrwxrwx 1 root root 11 Mar 28 08:58 cpuacct -> cpu,cpuacct
dr-xr-xr-x 3 root root  0 Mar 28 08:58 cpu,cpuacct
dr-xr-xr-x 3 root root  0 Mar 28 08:58 cpuset
dr-xr-xr-x 3 root root  0 Mar 28 08:58 devices
dr-xr-xr-x 3 root root  0 Mar 28 08:58 freezer
lrwxrwxrwx 1 root root 16 Mar 28 08:58 net_cls -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Mar 28 08:58 net_cls,net_prio
lrwxrwxrwx 1 root root 16 Mar 28 08:58 net_prio -> net_cls,net_prio
dr-xr-xr-x 3 root root  0 Mar 28 08:58 perf_event
dr-xr-xr-x 5 root root  0 Mar 28 08:58 systemd

@douaejeouit
Contributor

douaejeouit commented Mar 29, 2017

Thank you for the quick reply. Yes, I would suspect that! Unfortunately, I don't have a clear idea of how the docker instance is managed on EC2 :(
We'll keep you updated as soon as we have some news.

@ruflin
Member

ruflin commented Mar 29, 2017

I think a first improvement we could make in the docker metricsets is to check whether the JSON is empty before we start processing it, and return a more meaningful error instead of getting to the JSON Marshal error. I'm not sure how tricky that will be with the docker library we use.
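For illustration, a minimal standalone sketch of that check (the names and struct shape are assumptions, not the actual metricset code): refuse to build a memory metric when the decoded memory_stats block is all zeros, and say why.

package main

import (
    "errors"
    "fmt"
)

// memoryRaw mirrors the shape of the memory_stats block; an empty
// "memory_stats": {} from the daemon decodes to an all-zero value.
type memoryRaw struct {
    Usage    uint64
    MaxUsage uint64
    Limit    uint64
}

var errNoMemoryStats = errors.New("docker returned empty memory stats (memory cgroup disabled?)")

// memoryUsagePct returns a descriptive error for empty stats instead of
// silently dividing by zero and producing NaN.
func memoryUsagePct(m memoryRaw) (float64, error) {
    if m.Limit == 0 {
        return 0, errNoMemoryStats
    }
    return float64(m.Usage) / float64(m.Limit), nil
}

func main() {
    if _, err := memoryUsagePct(memoryRaw{}); err != nil {
        fmt.Println(err) // docker returned empty memory stats (memory cgroup disabled?)
    }
}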

@jess-lawrence-axomic

Just an FYI: it turns out Debian disables the memory cgroup by default; this solves it:

sed -i.bkup 's/\(GRUB_CMDLINE_LINUX_DEFAULT=.*\)"$/\1 cgroup_enable=memory"/g' /etc/default/grub
update-grub

@douaejeouit
Contributor

@jess-lawrence-axomic that sounds good! Thank you for sharing the information.
@ruflin, +1. Yes, we should do it. I'm not sure how tricky it will be either, but I'm thinking about it.

@douaejeouit
Contributor

douaejeouit commented Mar 30, 2017

The trick is to make the fields pointers here, to take advantage of the "omitempty" option declared for basically all the fields. If a field's tag has "omitempty" and the field holds an empty value, it is omitted from the encoding. Empty values are false, 0, any nil pointer or interface value, and any empty array, slice, map, or string.

I ran some tests where I declare an empty stats structure, and I get the following output, which I think represents the default values of the declared fields.
Output without pointers:

returned: Not valid json: json: unsupported value: NaN

Output with pointer fields:

{"_module":{"container":{"id":"containerID","labels":{"label1":"val1","label2":"val2"},"name":"name1"}},"fail":{"count":0},"limit":0,"rss":{"pct":null,"total":0},"usage":{"max":0,"pct":null,"total":0}}

@ruflin what do you think about it?

@ruflin
Member

ruflin commented Mar 30, 2017

That looks very promising. Instead of reporting the default values, I would suggest reporting an error, something like "memory values are not available". Then in the case of @jess-lawrence-axomic, he will see in the logs that the memory values are not available and can turn off the metricset. Just providing default values would lead to the perception that it is working when it isn't.

@douaejeouit
Contributor

Update: Actually, the

Not valid json: json: unsupported value: NaN

happens simply because of the value stored here. An "empty" stats structure is basically zeroed allocated memory (it's a pointer to a newly allocated zero value of the structure), i.e. the fields are equal to 0 at creation. This is what the new function does!
So if there are any statistics to read, the values of the associated fields are overwritten; otherwise they stay 0. In this case we're applying a division where the denominator is equal to 0, which naturally produces the NaN value.
I think the issue comes from there and not from the Marshal function, because 0 values of numbers are considered empty values, so they are omitted.
Please correct me if I'm wrong, guys!
Here are my thoughts:

  • We need to test the denominator value before the calculation (see the sketch after this list).
  • We need more logs. I'm not sure how we could tell whether the 0 values are the real reported ones or just due to missing metrics.
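A small standalone sketch to back this up (the helper name is hypothetical): encoding/json rejects NaN outright, and guarding the zero denominator avoids producing it in the first place.

package main

import (
    "encoding/json"
    "fmt"
    "math"
)

// usagePct guards the denominator so a zeroed stats struct yields 0
// instead of NaN.
func usagePct(usage, limit float64) float64 {
    if limit == 0 {
        return 0
    }
    return usage / limit
}

func main() {
    // encoding/json refuses to marshal NaN, which is what surfaces as the
    // MarshalJSON / "unsupported value: NaN" errors in this thread.
    _, err := json.Marshal(math.NaN())
    fmt.Println(err) // json: unsupported value: NaN

    fmt.Println(usagePct(100, 0))   // 0 instead of NaN
    fmt.Println(usagePct(100, 400)) // 0.25
}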

@ruflin
Member

ruflin commented Mar 31, 2017

If a value is set (even to 0) it will be reported. If we don't want to report it, we need to remove it from the event, as otherwise it will be hard to tell whether it is a real 0 or accidentally 0 (as you pointed out). So if we don't get any JSON back, I would think we should return an error and not any values.

xificurC pushed a commit to xificurC/beats that referenced this issue Jun 15, 2017
@xificurC
Contributor

Based on @douaejeouit's last comment I tried to create a quick fix that just sends 0 where there would be a NaN (see the referenced commit above). The issue is still present though:

2017-06-15T12:35:21+02:00 ERR Failed to publish events caused by: json: 
error calling MarshalJSON for type common.Float: invalid character 'N' 
looking for beginning of value
2017-06-15T12:35:21+02:00 INFO Error publishing events (retrying): json: 
error calling MarshalJSON for type common.Float: invalid character 'N' 
looking for beginning of value

Either I failed to fix the code (these are my first lines of Go) or the NaN is coming from elsewhere. Is there a way I could find out what is logging that error?

I also wondered whether MarshalJSON could serialize NaN into something sensible, but based on the discussions here it seems that's not possible. So I guess any NaN would have to be handled on your side?

@ruflin
Member

ruflin commented Jun 16, 2017

As the error here does not have a stack trace, it's tricky to figure out where it comes from. What appears before and after it in the log could be interesting, especially if you have debug enabled.

@ruflin ruflin added the module label Feb 26, 2018
@ruflin ruflin added the Team:Integrations label Nov 21, 2018
@fearful-symmetry
Contributor

This looks like the issue that was fixed by #11676 and backported to 6.7 and 7.0.
I ran into the invalid character 'N' log line while reproducing this; it is indeed a NaN from division by zero. I'm gonna close this.
