Overhaul all metrics #81

Open · wants to merge 1 commit into main

Conversation

isker (Contributor) commented Oct 13, 2024

- Fix metric names to comply with the [official guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming) and, where reasonable, to better mirror the names of similar timeseries from the much-more-popular cAdvisor. Also stop using the word "svc" to refer to tasks, as it is simply not correct.
- Improve `help` strings.
- Stop reporting per-CPU usage metrics. They are empirically only available in Fargate, but the current collector implementation assumes they are available everywhere. (They were previously available in EC2, but that stopped being the case when ecs-agent was upgraded to use cgroups v2.) Given that it is not clear why per-CPU numbers are useful in general, remove them everywhere instead of exposing disjoint metrics for Fargate and EC2. This will also prevent Fargate from potentially breaking spontaneously in the same way EC2 did.
- Fix the task-level memory limit to actually be in bytes (it previously said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases: the stats `limit` is nonsense if, as in Fargate, there is no container-level limit configured in the task definition. While the right data for all cases is hiding somewhere in the stats response, I have instead opted to cut out the stats middleman and use the task metadata directly to drive this metric (a sketch of this approach follows below). I think it is substantially less likely that ECS fails to apply the configured limits to cgroups correctly than it is that we fail to interrogate cgroups output correctly: the latter empirically happens with some frequency :^).
- Add metrics for Fargate ephemeral storage, and one for task image pull duration.
- Add more labels to task- and container-level metrics. While we should always be cautious about adding common labels to timeseries, I think the existing ones were insufficient for basic aggregations (e.g. "average memory usage for a given task family grouped by revision"). These labels are modeled on those used by cAdvisor for its own container timeseries.

I have tested these changes in both Fargate and EC2, and they look correct to me.

Closes #70 (by way of obsoleting it).
Closes #74.
Closes #69.
Closes #35.
Closes #16 (as far as I can tell).
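
For the container-level memory limit item above, here is a minimal, self-contained sketch of the task-metadata-driven approach. It is not the exporter's actual code: the endpoint and the `Limits.Memory` field follow the documented ECS task metadata v4 format (memory in MiB), and the metric name is purely illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// containerMetadata mirrors the subset of the ECS task metadata v4 response
// needed here; Limits.Memory is documented to be in MiB.
type containerMetadata struct {
	Name   string `json:"Name"`
	Limits struct {
		Memory *int64 `json:"Memory"` // nil when no container-level limit is set
	} `json:"Limits"`
}

type taskMetadata struct {
	Containers []containerMetadata `json:"Containers"`
}

func main() {
	// The ECS agent injects ECS_CONTAINER_METADATA_URI_V4 into every container.
	resp, err := http.Get(os.Getenv("ECS_CONTAINER_METADATA_URI_V4") + "/task")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var task taskMetadata
	if err := json.NewDecoder(resp.Body).Decode(&task); err != nil {
		panic(err)
	}

	for _, c := range task.Containers {
		if c.Limits.Memory == nil {
			// No container-level limit configured (common on Fargate):
			// skip the metric rather than exporting a bogus value.
			continue
		}
		limitBytes := *c.Limits.Memory * 1024 * 1024 // MiB -> bytes
		// Illustrative metric name, not necessarily what the exporter emits.
		fmt.Printf("ecs_container_memory_limit_bytes{container=%q} %d\n", c.Name, limitBytes)
	}
}
```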

Review comment on the hunk:

> … available for several network metrics.
>
> ## Example output
> TODO update

isker (Contributor, Author) commented Oct 13, 2024

I'll defer doing this until the rest of the PR is approved, as the contents of this example are sensitive to even minor changes in the PR. (I am actually deploying ecs_exporter to AWS to do my testing, and will do so again to produce this; but doing such deployments is a bit tedious.)

Review thread on ecscollector/collector.go:

    container.Name,
    container.Image,

Contributor commented

Do we need the container image name to establish uniqueness? I think this would be better included in an "info" metric.

isker (Contributor, Author) commented

We do not. My reasoning is that it's easier for someone consuming monitoring data to understand what a container is by its image name than by whatever someone named it in the task definition, but of the labels I'm adding here, I do think this one is probably the least justified.

isker (Contributor, Author) commented

The concern, roughly, is that someone can use the same container name differently across task definitions, making it a misleading dimension for aggregation. But the same can be said for an image name. You could only be certain about this if you instead used labels that identify the task definition (family + revision).

Anyway, I'm curious what your thoughts are on this.

Contributor commented

Identification of different instances of the same container is up to service discovery.

The main thing we need to make sure of is that we have a fully unique key or keys as we range over metadata.Containers.

isker (Contributor, Author) commented

> Identification of different instances of the same container is up to service discovery.

It makes sense that service discovery is responsible for generating distinct label sets for distinct instances, but is it also solely responsible for generating label sets that are useful for aggregations across the resulting timeseries?

If I want to, for example, compare container memory usage across two versions of an ECS service during a deployment (i.e. the current and previous version of the service), one way to distinguish between those versions would be this image label, or a combination of the ECS task family and revision labels. container_name alone can't help me do this, unless I were doing something strange like changing the name in each task definition revision.

Are you saying we should always leave the addition of these other potentially useful labels to service discovery? My concern is that it might be much harder for service discovery to generate them than it is for us, here in the exporter. If you're using Cloud Map DNS service discovery, all you get is an IP address. You'd have to implement http_sd using a Cloud Map API client to have a chance of doing this kind of thing, I think.
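
As an illustration of the kind of aggregation described above, a hedged PromQL sketch; the metric and label names (`ecs_container_memory_usage_bytes`, `task_family`, `task_revision`) are hypothetical stand-ins rather than what this PR necessarily exports:

```promql
# Average container memory usage for one task family, grouped by revision,
# so the current and previous versions of a deployment can be compared.
avg by (task_revision) (
  ecs_container_memory_usage_bytes{task_family="my-service"}
)
```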

Contributor commented

> Are you saying we should always leave the addition of these other potentially useful labels to service discovery?

No, I'm saying that we should probably follow the "info metric" pattern.

For example, in node_exporter we have node_network_info, which has a number of labels for things that are useful, but not really necessary all the time on every other node_network_... metric.

There are cases where we do include some "not required for uniqueness" labels on all related metrics. The fstype label on node_filesystem_... is a good example: it's not required for uniqueness, but it's very common to filter by fstype, much more common than, say, all of the other mountpoint flags.

This is one of those "naming is hard" problems. I just want to make sure we think about why we're including labels.
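
For context, a minimal sketch of the info-metric pattern being suggested, with hypothetical metric and label names: the `_info` metric always has the value 1 and carries the descriptive labels, and a `group_left` join copies those labels onto other series when they are needed.

```promql
# Attach the image label from a hypothetical ecs_container_info metric onto a
# usage metric; multiplying by 1 leaves the values unchanged.
ecs_container_memory_usage_bytes
  * on (container_name) group_left (image)
    ecs_container_info
```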

isker (Contributor, Author) commented

How can you use such info metrics, though? How do you take labels on one metric and apply them to another? Prometheus doesn't support join-type operations across timeseries as far as I know. Can you achieve something like that during relabeling, or?
