-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metricbeat: Improvements in docker diskio metricset #6701
Conversation
BlkioStatsPerContainer map[string]BlkioRaw | ||
} | ||
|
||
func NewBlkioService() *BlkioService { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported function NewBlkioService should have comment or be unexported
@@ -29,14 +26,20 @@ type BlkioRaw struct { | |||
totals uint64 | |||
} | |||
|
|||
type BLkioService struct { | |||
BlkioSTatsPerContainer map[string]BlkioRaw | |||
type BlkioService struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exported type BlkioService should have comment or be unexported
@@ -21,6 +21,20 @@ func eventMapping(stats *BlkioStats) common.MapStr { | |||
"reads": stats.reads, | |||
"writes": stats.writes, | |||
"total": stats.totals, | |||
"serviced": common.MapStr{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I follow this structure for backwards compatibility, but current metrics should probably be at the same level as these ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only conflict we would have would be for total
as the other ones are plural. We could use summary
instead of total
? Not the nice thing :-( Other ideas on how we could work around this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Umm, this could be an approach, yes, I'll update the PR with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a new commit with this approach 2a0306b
@@ -7,12 +7,56 @@ | |||
- name: reads | |||
type: scaled_float | |||
description: > | |||
Number of reads. | |||
Number of current reads per second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that calculating the operations or bytes/second should be done at visualization time, given that we now will have the accumulated values, but removing this would be a breaking change. Opinions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing it would be a breaking change. We should have a "deprecated" field in fields.yml so we don't forget to remove such fields and it's also marked in the docs.
In general I prefer counters over gauges exactly because it allows to do the calculations over different time frames on the ES side. This case is a bit special as we summarise it and not return what we get from the API. I would hope the API gives us counters which is not the case here and does some precalculations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to the deprecated field, I think we need something like this.
The API exports it as gauges, but I'd say it doesn't do any calculation for block I/O. The problem I see with counters and this kind of metrics is that you have to be very careful about how and when you collect them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we call it just deprecated: true
in the fields.yml?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
@@ -45,56 +48,62 @@ func (io *BLkioService) getBlkioStatsList(rawStats []docker.Stat, dedot bool) [] | |||
return formattedStats | |||
} | |||
|
|||
func (io *BLkioService) getBlkioStats(myRawStat *docker.Stat, dedot bool) BlkioStats { | |||
// TODO: User StorageStats instead of BlkioStats on Windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to docker stats API documentation, BlkioStats
is only populated in Linux, for Windows there is another structure under StorageStats
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we open a follow up github issue for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The biggest issue about Windows at the moment is testing it, maybe better to leave the TODO here and open a follow-up issue, yes.
stats := BlkioRaw{ | ||
Time: time, | ||
reads: 0, | ||
writes: 0, | ||
totals: 0, | ||
} | ||
|
||
for _, myEntry := range blkioEntry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Entries are per operation type, but also per Major/Minor (per device), so they should be summed by type. We could also decide to send metrics per device. Opinions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we send metrics per device, would that be an additional metricset? I would assume that each device is on event? Or how would the output look like?
What do you mean by Major/Minor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These blkio entries are defined as:
// BlkioStatEntry is one small entity to store a piece of Blkio stats
// Not used on Windows.
type BlkioStatEntry struct {
Major uint64 `json:"major"`
Minor uint64 `json:"minor"`
Op string `json:"op"`
Value uint64 `json:"value"`
}
I mean Major/Minor numbers as the identifier of a device in a posix system, in this case they identify the block device related to these stats.
So we have information per device, and we could send that, though the mapping between Major/Minor and device files, mountpoints or something useful would be tricky, and we couldn't provide the same functionality for Windows.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this is only partially related to this PR so I suggest to merge this one and follow up the discussion in an other thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in any case I'm not sure about how to do it in a useful way with proper mapping between device ids and device files or mountpoints. We can leave it till we see someone needing it.
@@ -45,56 +48,62 @@ func (io *BLkioService) getBlkioStatsList(rawStats []docker.Stat, dedot bool) [] | |||
return formattedStats | |||
} | |||
|
|||
func (io *BLkioService) getBlkioStats(myRawStat *docker.Stat, dedot bool) BlkioStats { | |||
// TODO: User StorageStats instead of BlkioStats on Windows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we open a follow up github issue for that?
@@ -21,6 +21,20 @@ func eventMapping(stats *BlkioStats) common.MapStr { | |||
"reads": stats.reads, | |||
"writes": stats.writes, | |||
"total": stats.totals, | |||
"serviced": common.MapStr{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the only conflict we would have would be for total
as the other ones are plural. We could use summary
instead of total
? Not the nice thing :-( Other ideas on how we could work around this?
stats := BlkioRaw{ | ||
Time: time, | ||
reads: 0, | ||
writes: 0, | ||
totals: 0, | ||
} | ||
|
||
for _, myEntry := range blkioEntry { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we send metrics per device, would that be an additional metricset? I would assume that each device is on event? Or how would the output look like?
What do you mean by Major/Minor
?
@@ -7,12 +7,56 @@ | |||
- name: reads | |||
type: scaled_float | |||
description: > | |||
Number of reads. | |||
Number of current reads per second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing it would be a breaking change. We should have a "deprecated" field in fields.yml so we don't forget to remove such fields and it's also marked in the docs.
In general I prefer counters over gauges exactly because it allows to do the calculations over different time frames on the ES side. This case is a bit special as we summarise it and not return what we get from the API. I would hope the API gives us counters which is not the case here and does some precalculations.
for _, myEntry := range blkioEntry { | ||
if myEntry.Op == "Write" { | ||
stats.writes = myEntry.Value | ||
stats.writes += myEntry.Value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wonder if the counter we calculate here is actually correct. To we get the sum over the last 2 seconds here or do we get the value since our last request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only get the value of the last request. The sum is to sum the same operation on multiple devices.
ea10321
to
fdf7fbf
Compare
if exist { | ||
myBlkioStats.reads = io.getReadPs(&oldBlkioStats, &newBlkioStats) | ||
myBlkioStats.writes = io.getWritePs(&oldBlkioStats, &newBlkioStats) | ||
myBlkioStats.totals = io.getTotalPs(&oldBlkioStats, &newBlkioStats) | ||
} | ||
|
||
io.BlkioSTatsPerContainer[myRawStat.Container.ID] = newBlkioStats | ||
// FIXME: Memory leak | ||
io.BlkioStatsPerContainer[myRawStat.Container.ID] = newBlkioStats |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have a memory leak here, entries are never removed, I'll take a look to this too.
2d24fd9
to
2e3b94a
Compare
I have added a new test, to cover all calculations done on raw stats, there are some other tests that could be revisited, but I left them by now. Regarding Windows, I have created a follow-up issue #6815. This would be ready for review. |
2e3b94a
to
48a2f7c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the data.json
file should be generated to have the new values in.
description: > | ||
Number of reads during the life of the container | ||
- name: bytes | ||
type: long |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add here format: bytes
to kibana can make sense of it. Same is true for other bytes values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
48a2f7c
to
ca64792
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting for getting back our CI builds. Seems there are some issues with Kafka.
* Add accumulated I/O stats to diskio metricset * Restructure fields in events, mark old fields as deprecated * Fix diskio metricset when docker reads from multiple devices * Some naming fixes
The metricset stores last seen stats to calculate rates per second, but they were never being removed for removed containers.
ca64792
to
c7d798c
Compare
Failing tests not related to these changes. |
This PR continues with #6608 in the aim to provide docker metrics more consistent with
docker stats
. Currently we report for disk I/O the number of operations per second.This PR adds accumulated I/O stats per number of operations and per data volume since the container started, as
docker stats
does. I think this information is more valuable, as it permits to have further aggregations when visualizing it. It also addresses some other issues in the metricset.Fields in events have been reorganized and old ones have been marked as deprecated.
We could also add some extra functionality, as obtaining metrics also on Windows hosts (as is this module now it would only work with docker daemons running on Linux and not sure about Mac). In Linux we could also collect metrics per block device.
To follow-up in other issues:
Group by device in Linux?To be done if requested in the future