Overhaul all metrics #81
Conversation
- Fix names to comply with the [official guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming) and to better mirror the names of similar timeseries from the much-more-popular cAdvisor, when reasonable. And don't use the word "svc" to refer to tasks, as it is just not correct.
- Improve `help`s.
- Stop reporting per-CPU usage metrics. They're empirically only available in Fargate, but the current collector implementation assumes they're available everywhere. (They were previously available in EC2, but that stopped being the case when ecs-agent was upgraded to use cgroups v2.) Given that it's not clear why per-CPU numbers are useful in general, remove them everywhere instead of exposing disjoint metrics for Fargate and EC2. This will also prevent Fargate from potentially spontaneously breaking in the same way EC2 did.
- Fix the task-level memory limit to actually be in bytes (it previously said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases - the stats `limit` is nonsense if, as in Fargate, there is no container-level limit configured in the task definition. While the right data for all cases is hiding in the stats response somewhere, I have instead opted to cut out the stats middleman and use the task metadata directly to drive this metric. I think it's substantially less likely that ECS fails to effect the configured limits upon cgroups correctly than it is that we fail to interrogate cgroups output correctly: the latter empirically happens with some frequency :^).
- Add metrics concerning Fargate ephemeral storage, and one for task image pull duration.
- Add more labels for task- and container-level metrics (see the sketch below). While we should always be cautious when adding common labels to timeseries, I think the existing ones were insufficient for doing basic aggregations (e.g. "average memory usage for a given task family grouped by revision"). These labels are in proportion to those used by cAdvisor for its own container timeseries.

I have tested these changes both in Fargate and EC2 and they look correct to me.

Closes #70 (by way of obsoleting it).
Closes #74.
Closes #69.
Closes #35.
Closes #16 (as far as I can tell).

Signed-off-by: Ian Kerins <[email protected]>
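To make the label discussion concrete, here is a minimal sketch (not the code in this PR) of how a task-level gauge with family/revision-style labels could be declared with prometheus/client_golang. The metric and label names are illustrative assumptions, not necessarily what the PR ships.

```go
// Minimal sketch, assuming prometheus/client_golang; names are illustrative.
package metrics

import "github.com/prometheus/client_golang/prometheus"

// taskMemLimitBytes is a task-level gauge. The extra labels (family, revision,
// launch_type) are the kind of dimensions that make aggregations such as
// "average memory usage for a task family, grouped by revision" possible.
var taskMemLimitBytes = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "ecs_task_memory_limit_bytes",
		Help: "Configured memory limit of the task, in bytes.",
	},
	[]string{"task_arn", "family", "revision", "launch_type"},
)

func init() {
	prometheus.MustRegister(taskMemLimitBytes)
}

// setTaskMemoryLimit converts the MiB figure from the task metadata endpoint
// to bytes before exporting it, in the spirit of the "actually be in bytes" fix.
func setTaskMemoryLimit(taskARN, family, revision, launchType string, limitMiB float64) {
	taskMemLimitBytes.WithLabelValues(taskARN, family, revision, launchType).Set(limitMiB * 1024 * 1024)
}
```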
available for several network metrics.

## Example output

TODO update
I'll defer doing this until the rest of the PR is approved, as the contents of this example are sensitive to even minor changes in the PR. (I am actually deploying ecs_exporter to AWS to do my testing, and will do so again to produce this. But doing such deployments is a bit tedious, so.)
container.Name,
container.Image,
Do we need the container image name to establish uniqueness? I think this would be better included in an "info" metric.
We do not. My reasoning is that it’s easier for someone consuming monitoring data to understand what a container is by its image name as opposed to whatever someone named it in the task definition, but of the labels I’m adding here I do think this one is probably the least justified.
The concern is vaguely that someone can use the same container name in different task definitions differently, making it a misleading dimension for aggregation. But the same can be said for an image name. You could only be certain about this if you instead used labels that identify the task definition (family+revision).
Anyway, curious what your thoughts are on this.
Identification of different instances of the same container is up to the service discovery.
The main thing we need to make sure of is that we have a fully unique key or keys as we range `metadata.Containers`.
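As a rough illustration of that invariant (using hypothetical local types rather than the exporter's real metadata structs), whatever labels we attach, their combination must be unique for each container we visit while ranging the containers, or the collector would emit duplicate series:

```go
// Hypothetical types for illustration only; not the exporter's metadata structs.
package metrics

import "github.com/prometheus/client_golang/prometheus"

type containerMeta struct {
	Name  string // unique within a single task: enough to be the uniqueness key
	Image string // descriptive only: not needed for uniqueness
}

type taskMeta struct {
	Containers []containerMeta
}

// exportContainerMemory emits one series per container. Name keys the series;
// Image rides along purely as a descriptive label.
func exportContainerMemory(metadata taskMeta, usageBytes map[string]float64, gauge *prometheus.GaugeVec) {
	for _, c := range metadata.Containers {
		gauge.WithLabelValues(c.Name, c.Image).Set(usageBytes[c.Name])
	}
}
```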
> Identification of different instances of the same container is up to the service discovery.
It makes sense that service discovery is responsible for generating distinct label sets for distinct instances, but is it also solely responsible for generating label sets that are useful for doing useful aggregations across the resulting timeseries?
If I want to, for example, compare container memory usage across two versions of an ECS service during a deployment (i.e. the current and previous version of the service), one way to distinguish between those versions would be with this `image` label, or by a combination of the ECS task family and revision labels. `container_name` alone can't help me do this, unless I was doing something strange like changing the name in each task definition revision.
Are you saying we should always leave the addition of these other potentially useful labels to service discovery? My concern is that it might be much harder for service discovery to generate them than it is for us, here in the exporter. If you’re using Cloud Map DNS service discovery, all you’re going to get is an IP address. You’d have to implement http_sd using a Cloud Map API client to have a chance to do this kind of thing, I think.
> Are you saying we should always leave the addition of these other potentially useful labels to service discovery?
No, I'm saying that we should probably follow the "info metric" pattern.
For example, in node_exporter we have `node_network_info`, which has a number of labels for things that are useful, but not really necessary all the time on every other `node_network_...` metric.

There are cases where we do include some "not required for uniqueness" labels on all related metrics. The `fstype` label on `node_filesystem_...` is a good example of this. It's not required for uniqueness, but it's very common to filter by `fstype`. Much more common than, say, all of the other mountpoint flags.
This is one of those "naming is hard" problems. I just want to make sure we think about why we're including labels.
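For reference, here is a minimal sketch of the info-metric pattern being suggested, loosely modelled on node_exporter's `node_network_info`: a gauge pinned at 1 that exists only to carry descriptive labels (such as the image), so the other per-container series can keep a smaller label set. The metric and label names below are illustrative assumptions, not part of this PR.

```go
// Sketch of the "info metric" pattern; metric and label names are illustrative.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var containerInfo = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "ecs_container_info",
		Help: "Descriptive information about the container. Value is always 1.",
	},
	[]string{"container_name", "image", "image_id"},
)

// recordContainerInfo publishes the descriptive labels once per container,
// leaving the usage/limit series free to carry only container_name.
func recordContainerInfo(name, image, imageID string) {
	containerInfo.WithLabelValues(name, image, imageID).Set(1)
}
```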
How can you use such info metrics though? How do you take labels on one metric and apply them to another? Prometheus doesn’t support join-type operations across timeseries as far as I know. Can you achieve something like that during relabeling, or?