Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metricbeat: Improvements in docker diskio metricset #6701

Merged
merged 4 commits into from
Apr 13, 2018

Conversation

jsoriano
Copy link
Member

@jsoriano jsoriano commented Mar 29, 2018

This PR continues with #6608 in the aim to provide docker metrics more consistent with docker stats. Currently we report for disk I/O the number of operations per second.

This PR adds accumulated I/O stats per number of operations and per data volume since the container started, as docker stats does. I think this information is more valuable, as it permits to have further aggregations when visualizing it. It also addresses some other issues in the metricset.

Fields in events have been reorganized and old ones have been marked as deprecated.

We could also add some extra functionality, as obtaining metrics also on Windows hosts (as is this module now it would only work with docker daemons running on Linux and not sure about Mac). In Linux we could also collect metrics per block device.

  • Add accumulated I/O stats to diskio metricset
  • Fix diskio metricset when docker reads from multiple devices
  • Some naming and documentation fixes
  • Fix memory leak in blkio service (last seen stats were being stored, but never removed)
  • More tests

To follow-up in other issues:

@jsoriano jsoriano added in progress Pull request is currently in progress. module feedback needed Metricbeat Metricbeat labels Mar 29, 2018
BlkioStatsPerContainer map[string]BlkioRaw
}

func NewBlkioService() *BlkioService {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported function NewBlkioService should have comment or be unexported

@@ -29,14 +26,20 @@ type BlkioRaw struct {
totals uint64
}

type BLkioService struct {
BlkioSTatsPerContainer map[string]BlkioRaw
type BlkioService struct {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exported type BlkioService should have comment or be unexported

@@ -21,6 +21,20 @@ func eventMapping(stats *BlkioStats) common.MapStr {
"reads": stats.reads,
"writes": stats.writes,
"total": stats.totals,
"serviced": common.MapStr{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I follow this structure for backwards compatibility, but current metrics should probably be at the same level as these ones.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only conflict we would have would be for total as the other ones are plural. We could use summary instead of total? Not the nice thing :-( Other ideas on how we could work around this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm, this could be an approach, yes, I'll update the PR with this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a new commit with this approach 2a0306b

@@ -7,12 +7,56 @@
- name: reads
type: scaled_float
description: >
Number of reads.
Number of current reads per second
Copy link
Member Author

@jsoriano jsoriano Mar 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that calculating the operations or bytes/second should be done at visualization time, given that we now will have the accumulated values, but removing this would be a breaking change. Opinions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it would be a breaking change. We should have a "deprecated" field in fields.yml so we don't forget to remove such fields and it's also marked in the docs.

In general I prefer counters over gauges exactly because it allows to do the calculations over different time frames on the ES side. This case is a bit special as we summarise it and not return what we get from the API. I would hope the API gives us counters which is not the case here and does some precalculations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the deprecated field, I think we need something like this.

The API exports it as gauges, but I'd say it doesn't do any calculation for block I/O. The problem I see with counters and this kind of metrics is that you have to be very careful about how and when you collect them.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call it just deprecated: true in the fields.yml?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@@ -45,56 +48,62 @@ func (io *BLkioService) getBlkioStatsList(rawStats []docker.Stat, dedot bool) []
return formattedStats
}

func (io *BLkioService) getBlkioStats(myRawStat *docker.Stat, dedot bool) BlkioStats {
// TODO: User StorageStats instead of BlkioStats on Windows
Copy link
Member Author

@jsoriano jsoriano Mar 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to docker stats API documentation, BlkioStats is only populated in Linux, for Windows there is another structure under StorageStats.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we open a follow up github issue for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest issue about Windows at the moment is testing it, maybe better to leave the TODO here and open a follow-up issue, yes.

stats := BlkioRaw{
Time: time,
reads: 0,
writes: 0,
totals: 0,
}

for _, myEntry := range blkioEntry {
Copy link
Member Author

@jsoriano jsoriano Mar 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Entries are per operation type, but also per Major/Minor (per device), so they should be summed by type. We could also decide to send metrics per device. Opinions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we send metrics per device, would that be an additional metricset? I would assume that each device is on event? Or how would the output look like?

What do you mean by Major/Minor?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These blkio entries are defined as:

// BlkioStatEntry is one small entity to store a piece of Blkio stats
// Not used on Windows.
type BlkioStatEntry struct {
    Major uint64 `json:"major"`
    Minor uint64 `json:"minor"`
    Op    string `json:"op"`
    Value uint64 `json:"value"`
}

I mean Major/Minor numbers as the identifier of a device in a posix system, in this case they identify the block device related to these stats.

So we have information per device, and we could send that, though the mapping between Major/Minor and device files, mountpoints or something useful would be tricky, and we couldn't provide the same functionality for Windows.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this is only partially related to this PR so I suggest to merge this one and follow up the discussion in an other thread?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in any case I'm not sure about how to do it in a useful way with proper mapping between device ids and device files or mountpoints. We can leave it till we see someone needing it.

@@ -45,56 +48,62 @@ func (io *BLkioService) getBlkioStatsList(rawStats []docker.Stat, dedot bool) []
return formattedStats
}

func (io *BLkioService) getBlkioStats(myRawStat *docker.Stat, dedot bool) BlkioStats {
// TODO: User StorageStats instead of BlkioStats on Windows
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we open a follow up github issue for that?

@@ -21,6 +21,20 @@ func eventMapping(stats *BlkioStats) common.MapStr {
"reads": stats.reads,
"writes": stats.writes,
"total": stats.totals,
"serviced": common.MapStr{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only conflict we would have would be for total as the other ones are plural. We could use summary instead of total? Not the nice thing :-( Other ideas on how we could work around this?

stats := BlkioRaw{
Time: time,
reads: 0,
writes: 0,
totals: 0,
}

for _, myEntry := range blkioEntry {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we send metrics per device, would that be an additional metricset? I would assume that each device is on event? Or how would the output look like?

What do you mean by Major/Minor?

@@ -7,12 +7,56 @@
- name: reads
type: scaled_float
description: >
Number of reads.
Number of current reads per second
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it would be a breaking change. We should have a "deprecated" field in fields.yml so we don't forget to remove such fields and it's also marked in the docs.

In general I prefer counters over gauges exactly because it allows to do the calculations over different time frames on the ES side. This case is a bit special as we summarise it and not return what we get from the API. I would hope the API gives us counters which is not the case here and does some precalculations.

for _, myEntry := range blkioEntry {
if myEntry.Op == "Write" {
stats.writes = myEntry.Value
stats.writes += myEntry.Value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wonder if the counter we calculate here is actually correct. To we get the sum over the last 2 seconds here or do we get the value since our last request?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only get the value of the last request. The sum is to sum the same operation on multiple devices.

@jsoriano jsoriano force-pushed the docker-blockio branch 3 times, most recently from ea10321 to fdf7fbf Compare April 9, 2018 17:44
if exist {
myBlkioStats.reads = io.getReadPs(&oldBlkioStats, &newBlkioStats)
myBlkioStats.writes = io.getWritePs(&oldBlkioStats, &newBlkioStats)
myBlkioStats.totals = io.getTotalPs(&oldBlkioStats, &newBlkioStats)
}

io.BlkioSTatsPerContainer[myRawStat.Container.ID] = newBlkioStats
// FIXME: Memory leak
io.BlkioStatsPerContainer[myRawStat.Container.ID] = newBlkioStats
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have a memory leak here, entries are never removed, I'll take a look to this too.

@jsoriano jsoriano added review and removed in progress Pull request is currently in progress. labels Apr 10, 2018
@jsoriano
Copy link
Member Author

I have added a new test, to cover all calculations done on raw stats, there are some other tests that could be revisited, but I left them by now. Regarding Windows, I have created a follow-up issue #6815.

This would be ready for review.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the data.json file should be generated to have the new values in.

description: >
Number of reads during the life of the container
- name: bytes
type: long
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add here format: bytes to kibana can make sense of it. Same is true for other bytes values.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@jsoriano
Copy link
Member Author

data.json updated and bytes fileds formatted as bytes

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting for getting back our CI builds. Seems there are some issues with Kafka.

* Add accumulated I/O stats to diskio metricset
* Restructure fields in events, mark old fields as deprecated
* Fix diskio metricset when docker reads from multiple devices
* Some naming fixes
The metricset stores last seen stats to calculate rates per second, but
they were never being removed for removed containers.
@jsoriano
Copy link
Member Author

Failing tests not related to these changes.

@ruflin ruflin merged commit aaae0b7 into elastic:master Apr 13, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants