Overhaul all metrics #81

Open · wants to merge 1 commit into main

Conversation

isker (Contributor) commented Oct 13, 2024

- Fix metric names to comply with the [official guidelines](https://prometheus.io/docs/practices/naming/#metric-and-label-naming) and, where reasonable, to better mirror the names of similar timeseries from the much-more-popular cAdvisor. Also stop using the word "svc" to refer to tasks, as it is simply not correct.
- Improve `help` strings.
- Stop reporting per-CPU usage metrics. They are empirically only available in Fargate, but the current collector implementation assumes they are available everywhere. (They were previously available in EC2, but that stopped being the case when ecs-agent was upgraded to use cgroups v2.) Given that it is not clear why per-CPU numbers are useful in general, remove them everywhere instead of exposing disjoint metrics for Fargate and EC2. This will also prevent Fargate from potentially breaking spontaneously in the same way EC2 did.
- Fix the task-level memory limit to actually be in bytes (it previously said "bytes" but was in fact MiB).
- Correctly report container-level memory limits in all cases: the stats `limit` is nonsense if, as in Fargate, there is no container-level limit configured in the task definition. While the right data for all cases is hiding somewhere in the stats response, I have instead opted to cut out the stats middleman and use the task metadata directly to drive this metric (a sketch of this approach follows below). I think it is substantially less likely that ECS fails to apply the configured limits to cgroups correctly than it is that we fail to interrogate cgroups output correctly: the latter empirically happens with some frequency :^).
- Add metrics for Fargate ephemeral storage, and one for task image pull duration.
- Add more labels to task- and container-level metrics. While we should always be cautious about adding common labels to timeseries, I think the existing ones were insufficient for basic aggregations (e.g. "average memory usage for a given task family grouped by revision"). These labels are modeled on those used by cAdvisor for its own container timeseries.

I have tested these changes in both Fargate and EC2, and they look correct to me.

Closes #70 (by way of obsoleting it).
Closes #74.
Closes #69.
Closes #35.
Closes #16 (as far as I can tell).
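
For the container-level memory limit item above, here is a minimal, self-contained sketch of the task-metadata-driven approach. It is not the exporter's actual code: the endpoint and the `Limits.Memory` field follow the documented ECS task metadata v4 format (memory in MiB), and the metric name is purely illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// containerMetadata mirrors the subset of the ECS task metadata v4 response
// needed here; Limits.Memory is documented to be in MiB.
type containerMetadata struct {
	Name   string `json:"Name"`
	Limits struct {
		Memory *int64 `json:"Memory"` // nil when no container-level limit is set
	} `json:"Limits"`
}

type taskMetadata struct {
	Containers []containerMetadata `json:"Containers"`
}

func main() {
	// The ECS agent injects ECS_CONTAINER_METADATA_URI_V4 into every container.
	resp, err := http.Get(os.Getenv("ECS_CONTAINER_METADATA_URI_V4") + "/task")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var task taskMetadata
	if err := json.NewDecoder(resp.Body).Decode(&task); err != nil {
		panic(err)
	}

	for _, c := range task.Containers {
		if c.Limits.Memory == nil {
			// No container-level limit configured (common on Fargate):
			// skip the metric rather than exporting a bogus value.
			continue
		}
		limitBytes := *c.Limits.Memory * 1024 * 1024 // MiB -> bytes
		// Illustrative metric name, not necessarily what the exporter emits.
		fmt.Printf("ecs_container_memory_limit_bytes{container=%q} %d\n", c.Name, limitBytes)
	}
}
```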

Review comment on the hunk:

> … available for several network metrics.
>
> ## Example output
> TODO update

isker (Contributor, Author) commented Oct 13, 2024

I'll defer doing this until the rest of the PR is approved, as the contents of this example are sensitive to even minor changes in the PR. (I am actually deploying ecs_exporter to AWS to do my testing, and will do so again to produce this; but doing such deployments is a bit tedious.)

Review thread on ecscollector/collector.go:

    container.Name,
    container.Image,

Contributor commented

Do we need the container image name to establish uniqueness? I think this would be better included in an "info" metric.

isker (Contributor, Author) commented

We do not. My reasoning is that it's easier for someone consuming monitoring data to understand what a container is by its image name than by whatever someone named it in the task definition, but of the labels I'm adding here, I do think this one is probably the least justified.

isker (Contributor, Author) commented

The concern, roughly, is that someone can use the same container name differently across task definitions, making it a misleading dimension for aggregation. But the same can be said for an image name. You could only be certain about this if you instead used labels that identify the task definition (family + revision).

Anyway, I'm curious what your thoughts are on this.

Contributor commented

Identification of different instances of the same container is up to service discovery.

The main thing we need to make sure of is that we have a fully unique key or keys as we range over metadata.Containers.

isker (Contributor, Author) commented

> Identification of different instances of the same container is up to service discovery.

It makes sense that service discovery is responsible for generating distinct label sets for distinct instances, but is it also solely responsible for generating label sets that are useful for aggregations across the resulting timeseries?

If I want to, for example, compare container memory usage across two versions of an ECS service during a deployment (i.e. the current and previous version of the service), one way to distinguish between those versions would be this image label, or a combination of the ECS task family and revision labels. container_name alone can't help me do this, unless I were doing something strange like changing the name in each task definition revision.

Are you saying we should always leave the addition of these other potentially useful labels to service discovery? My concern is that it might be much harder for service discovery to generate them than it is for us, here in the exporter. If you're using Cloud Map DNS service discovery, all you get is an IP address. You'd have to implement http_sd using a Cloud Map API client to have a chance of doing this kind of thing, I think.
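
As an illustration of the kind of aggregation described above, a hedged PromQL sketch; the metric and label names (`ecs_container_memory_usage_bytes`, `task_family`, `task_revision`) are hypothetical stand-ins rather than what this PR necessarily exports:

```promql
# Average container memory usage for one task family, grouped by revision,
# so the current and previous versions of a deployment can be compared.
avg by (task_revision) (
  ecs_container_memory_usage_bytes{task_family="my-service"}
)
```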

Contributor commented

> Are you saying we should always leave the addition of these other potentially useful labels to service discovery?

No, I'm saying that we should probably follow the "info metric" pattern.

For example, in node_exporter we have node_network_info, which has a number of labels for things that are useful, but not really necessary all the time on every other node_network_... metric.

There are cases where we do include some "not required for uniqueness" labels on all related metrics. The fstype label on node_filesystem_... is a good example: it's not required for uniqueness, but it's very common to filter by fstype, much more common than, say, all of the other mountpoint flags.

This is one of those "naming is hard" problems. I just want to make sure we think about why we're including labels.
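
For context, a minimal sketch of the info-metric pattern being suggested, with hypothetical metric and label names: the `_info` metric always has the value 1 and carries the descriptive labels, and a `group_left` join copies those labels onto other series when they are needed.

```promql
# Attach the image label from a hypothetical ecs_container_info metric onto a
# usage metric; multiplying by 1 leaves the values unchanged.
ecs_container_memory_usage_bytes
  * on (container_name) group_left (image)
    ecs_container_info
```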

isker (Contributor, Author) commented

How can you use such info metrics, though? How do you take labels on one metric and apply them to another? Prometheus doesn't support join-type operations across timeseries as far as I know. Can you achieve something like that during relabeling, or?
