Agones controller metrics becomes a huge amount of data over time #2424

Closed
yoshd opened this issue Jan 11, 2022 · 6 comments
Labels
kind/feature New features for Agones

@yoshd
Contributor

yoshd commented Jan 11, 2022

Is your feature request related to a problem? Please describe.

The Agones controller metric agones_gameserver_allocations_duration_seconds grows to a huge amount of data over time.
agones_gameserver_allocations_duration_seconds has a node_name label, and in environments such as GKE, where nodes are created and removed dynamically, the cardinality of node_name keeps increasing over time.
As a result, the /metrics API response of the Agones controller ends up with a huge number of rows for agones_gameserver_allocations_duration_seconds.

e.g.
If 10 new nodes are created every day, there will be more than 12000 rows after 100 days.
( 10[instances]*100[days]*12[distribution] = 12000[rows] )
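The arithmetic above can be written as a small sketch. The function name and parameters are hypothetical; the default of 12 series per node is taken from the estimate in this issue, not measured from a cluster.

```python
def estimated_rows(new_nodes_per_day: int, days: int, series_per_node: int = 12) -> int:
    """Estimate the rows one histogram contributes to /metrics when every
    node that ever existed keeps its own set of series (nothing is cleared)."""
    return new_nodes_per_day * days * series_per_node


# The example from this issue: 10 new nodes per day for 100 days.
rows = estimated_rows(10, 100)
print(rows)
```

Because old node names are never dropped, the estimate grows linearly with the number of days the controller has been running.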

agones_gameserver_allocations_duration_seconds_bucket{cluster_name="none",fleet_name="game-server-fleet",is_multicluster="false",node_name="gke-abcdefg-9d8bc6da-gxd1",scheduling_strategy="Packed",status="Allocated",le="0.01"} 0
.
.
.
agones_gameserver_allocations_duration_seconds_bucket{cluster_name="none",fleet_name="game-server-fleet",is_multicluster="false",node_name="gke-abcdefg-9d8bc6da-i11x",scheduling_strategy="Packed",status="Allocated",le="0.01"} 0
.
.
.
# Repeated as many times as the number of `node_name`
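To check which label drives the growth, one option is to count the distinct values of each label in the /metrics response. The following is a minimal sketch that parses the text output directly; the helper name `label_cardinality` is hypothetical, and the sample input is just the two lines shown above.

```python
import re

# Matches name="value" pairs inside the {...} part of a sample line.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def label_cardinality(metrics_text, metric_prefix):
    """Return {label_name: number of distinct values} for one metric family."""
    values = {}
    for line in metrics_text.splitlines():
        # Skip comments (# HELP / # TYPE) and other metric families.
        if line.startswith("#") or not line.startswith(metric_prefix):
            continue
        start, end = line.find("{"), line.rfind("}")
        if start == -1 or end == -1:
            continue
        for name, value in LABEL_RE.findall(line[start + 1:end]):
            values.setdefault(name, set()).add(value)
    return {name: len(vals) for name, vals in values.items()}


sample = "\n".join([
    'agones_gameserver_allocations_duration_seconds_bucket{cluster_name="none",fleet_name="game-server-fleet",is_multicluster="false",node_name="gke-abcdefg-9d8bc6da-gxd1",scheduling_strategy="Packed",status="Allocated",le="0.01"} 0',
    'agones_gameserver_allocations_duration_seconds_bucket{cluster_name="none",fleet_name="game-server-fleet",is_multicluster="false",node_name="gke-abcdefg-9d8bc6da-i11x",scheduling_strategy="Packed",status="Allocated",le="0.01"} 0',
])

cardinality = label_cardinality(sample, "agones_gameserver_allocations_duration_seconds")
print(cardinality)
```

On a real response, a label whose count keeps rising between scrapes (here node_name) is the one inflating the output.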

If this continues, sending the metrics to Google Cloud's Cloud Monitoring or similar services will cost a huge amount of money.

Describe the solution you'd like

The Agones controller clears old agones_gameserver_allocations_duration_seconds data after a certain amount of time has passed.
It would be nice to be able to specify this duration in the Helm values.
Maybe there is a better way, but I don't have any ideas.

@yoshd yoshd added the kind/feature New features for Agones label Jan 11, 2022
@markmandel
Member

markmandel commented Jan 11, 2022

This is absolutely a cardinality bomb for sure!

I think the best solution would be to actually remove the node name label. I'm not sure it adds any value to the allocation metrics anyway.
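To illustrate why dropping the label bounds the output, here is a rough sketch that merges samples once node_name is ignored. It only simulates the effect on already-collected samples; it is not the change made in Agones, and `drop_label` and the sample values are made up for the example.

```python
from collections import defaultdict


def drop_label(samples, label):
    """Merge samples that differ only in `label`.

    `samples` is a list of (labels_dict, value) pairs. Histogram bucket
    counts are cumulative counters, so summing them across nodes is valid.
    """
    merged = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        merged[key] += value
    return dict(merged)


# Two hypothetical bucket samples that differ only in node_name.
samples = [
    ({"fleet_name": "game-server-fleet", "node_name": "node-a", "le": "0.01"}, 3.0),
    ({"fleet_name": "game-server-fleet", "node_name": "node-b", "le": "0.01"}, 4.0),
]

merged = drop_label(samples, "node_name")
print(len(merged), list(merged.values()))
```

Without node_name, the number of rows depends only on the remaining labels (fleet, strategy, status, bucket), which do not grow as nodes come and go.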

@yoshd
Contributor Author

yoshd commented Jan 11, 2022

I am in favor of removing the node name label.
I don't need this label.

@yoshd
Contributor Author

yoshd commented Jan 17, 2022

@markmandel
If there is no problem with removing the node_name label from the allocation metrics, I can create a pull request.
How about that?

@markmandel
Member

That sounds great! The only caveat is that I would like to double-check whether it breaks anything in the Grafana dashboards before closing out this issue (but you don't have to hold the PR for that; it can be checked separately).

Also, if you find any other cardinality bombs in the metrics, please file separate issues for them. A lot of the metrics were written a long time ago and might need a review.

@markmandel
Member

Any objection to closing this issue, since we solved it in #2433?

@yoshd
Contributor Author

yoshd commented Feb 17, 2022

No objection, there is no problem.
Thank you.
