ROX-19013 Add gitops to fleetmanager #1233
Conversation
Good job, thanks for splitting the PRs!
}

var (
	errorCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
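For reference, a minimal sketch of how such a counter is typically declared, registered, and incremented with client_golang. The metric name follows the alert example discussed further down; the package, label, and help text are assumptions, not necessarily the exact code in this PR.

package gitops // assumed package

import "github.com/prometheus/client_golang/prometheus"

var errorCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "gitops_error_total", // assumed name, matching the alert example below
	Help: "Total number of errors encountered while handling the GitOps configuration",
}, []string{"reason"}) // label is an assumption

func init() {
	// register with the default registry so the metric is exposed on /metrics
	prometheus.MustRegister(errorCounter)
}

// on a failed config load or validation:
// errorCounter.WithLabelValues("validation").Inc()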
Is the error counter reset automatically when I re-apply a valid config?
No, it would just stay at the same level. But the prom query would be something like increase(counter_total[5m])
...
In that case the counter would provide wrong information when the counter increases continuously, reporting 100 errors while there are 0 errors. It creates an implicit requirement on the writer of a prom query to get usable data.
Is it possible to decrease the counter when the error is resolved, basically having a boolean value?
If in doubt, we could consult SRE on how to handle this metric.
Caveat: I am not very familiar with Prometheus metrics; if this is considered best practice, can you share an article/doc with me?
That's how counters are supposed to work. It's not a problem when querying it with PromQL.
https://prometheus.io/docs/concepts/metric_types/
https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Counter
Could it be implemented with a gauge in this case? 🤔
https://prometheus.io/docs/concepts/metric_types/#gauge
You could, but that's not standard. Check the usages of counters in the repository: they are canonically used to represent the number of errors, the number of requests, etc., e.g. ReconcilerFailureCount. Then we can craft a query like increase to watch the rate at which this number increases. https://prometheus.io/docs/prometheus/latest/querying/functions/#increase "calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for."
Hm, it still delegates the complexity of the implementation to the writer of the prom query.
We need an indication that an error happened in the GitOps configuration as feedback for the engineer who merged a pull request; we are kind of abusing the alerting system a bit here.
This would make the gauge the better alternative, resulting in a prom query that is simple and idiomatic to use.
A user of the metric does not know how to work with this data without additional documentation on how to interpret it.
Using a counter increases the system's complexity in this use case overall, making the gauge the simpler and therefore better alternative.
That's incorrect. Gauges are used for metrics that can decrease, such as memory usage, CPU usage, or the number of connected sensors. Counters are used to measure increasing values, such as the number of errors thrown or the number of requests handled.
The only valid way to use a gauge in this context would be to calculate the "number of consecutive gitops errors" - as opposed to the total number of gitops errors. That has little value in this context.
Using counters to report errors through prometheus is a standard use of this type of metric.
See
- https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats
- https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/network_filters/tcp_proxy_filter
- https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats
- https://kubernetes.io/docs/reference/instrumentation/metrics/
To address the complexity issue: I don't think that the use of counters would increase the complexity of the queries. On the contrary, using a Gauge for a type of value that is usually represented as a Counter would increase complexity.
In that case the counter would provide wrong information when the counter increases continuously, reporting 100 errors while there are 0 errors
This is common behavior in PromQL queries. For example, an http_errors_total counter will always report the cumulative total of errors. But it is common practice to obtain the rate of change of that metric and base alerts on that. Usually (it is an overwhelmingly common pattern) counters are wrapped with a sum(rate(my_metric[5m])) by (my_label) expression.
We need an indication that an error happened in the GitOps configuration as feedback for the engineer who has merged a pull-request
Yes, this would be handled by a counter. If we measure the rate of change of the counter, we can alert ourselves that something went wrong. For example, if we have more than 2 gitops errors per 5 minutes, we might want to throw an alert.
I think I see your point with a gauge, where you would like to "reset" the value to 0 if there are no errors. But it is usually not the metric type that is used to carry the meaning of "events occurring", such as errors or requests. It should rather carry the meaning of a measurable float/integer value, such as the CPU usage, number of active clients, etc.
How counters / rate / increase work
There is a widely used function, rate, that allows us to calculate the "per-second average rate of increase" of counters. Formulated as

    rate(errors_total[5m]) ≈ (last value in the 5m window - first value in the 5m window) / 300s

where increase is a helper function that returns rate multiplied by the number of seconds under the specified time window, or

    increase(errors_total[5m]) = rate(errors_total[5m]) * 300

In our case, we are interested in occasions where the errors are increasing. So when

    increase(gitops_error_total[15m]) > 0

we know that at least one error occurred during the last 15 minutes.
The alert would be pretty straightforward to write with PromQL (and in all honesty, much simpler than most Prometheus alerts that we have in the rhacs-observability repository).
example:
alert: FleetManagerGitOpsError
expr: increase(gitops_error_total[15m]) > 0
for: 5m
@ludydoo Thanks for the great explanation, it makes sense to me now. I agree, let's keep the counter and align with best practices as you suggested.
Summary: The ConfigProvider implementation needs a bit of refactoring to be less abstract, and the Prometheus counter looks like it provides wrong data.
🎉
LGTM 🚢
if tc.readerWillFail {
	p.reader = &failingReader
} else {
	p.reader = &successfulReader
}

if tc.validationWillFail {
	p.validationFn = failingValidationFn
} else {
	p.validationFn = successfulValidationFn
}
nit: the validationFns and readers could be part of the configured test structs directly, which would remove the need for the if-statements (see the sketch below).
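A hedged sketch of that refactor. The test function name, the provider struct literal, and the Reader/validation types are assumptions based on the snippet above, not the exact code in this package:

package gitops

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// Each test case carries its collaborators directly, so the test body
// no longer needs if-statements to pick a reader or validation function.
func TestProvider_Get(t *testing.T) {
	tests := []struct {
		name         string
		reader       Reader             // assumed interface implemented by failingReader / successfulReader
		validationFn func(Config) error // assumed signature
		wantErr      bool
	}{
		{name: "reader fails", reader: &failingReader, validationFn: successfulValidationFn, wantErr: true},
		{name: "validation fails", reader: &successfulReader, validationFn: failingValidationFn, wantErr: true},
		{name: "happy path", reader: &successfulReader, validationFn: successfulValidationFn},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			p := provider{reader: tc.reader, validationFn: tc.validationFn} // assumed struct literal
			_, err := p.Get()
			if tc.wantErr {
				require.Error(t, err)
				return
			}
			require.NoError(t, err)
		})
	}
}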
require.NoError(t, err)
provider := NewFileReader(tmpFilePath)
_, err = provider.Read()
assert.Error(t, err)
nit: require.NoError
This one expects an error (TestFileReader_Get_FailsIfFileIsInvalidYAML) - the YAML is invalid, so Read() should fail.
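If the error path should stop the test immediately on an unexpected pass, the final assertion could use require as well (a sketch; whether to switch from assert is only a nit):

_, err = provider.Read()
require.Error(t, err) // the temp file contains invalid YAML, so Read() is expected to fail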
New changes are detected. LGTM label has been removed.
Force-pushed b2b6115 to 621c8ec.
Force-pushed f20a5b7 to e05e5c1.
Force-pushed 74ddfc3 to bfd05a7.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kurlov, ludydoo, SimonBaeumer. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Description
This PR applies the gitops configuration to centrals on the fleetmanager side.
gitops.ConfigProvider
The gitops.ConfigProvider is an interface that returns a gitops config. Currently, it is designed to use an empty config. There are a few decorators, such as a fallbackToLastWorkingConfig (caches the last working config) and a providerWithMetrics (increments a prometheus counter on errors). See acs-fleet-manager/internal/dinosaur/pkg/gitops/provider.go
Line 21 in fcc3aca
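The decorator layering can be pictured roughly like this (a sketch: method names, signatures, and the counter label value are assumptions rather than the exact code in provider.go):

// ConfigProvider returns the current GitOps configuration.
type ConfigProvider interface {
	Get() (Config, error)
}

// fallbackToLastWorkingConfig serves the last successfully loaded config
// when the wrapped provider fails.
type fallbackToLastWorkingConfig struct {
	delegate ConfigProvider
	last     *Config
}

func (p *fallbackToLastWorkingConfig) Get() (Config, error) {
	cfg, err := p.delegate.Get()
	if err != nil {
		if p.last != nil {
			return *p.last, nil // fall back to the cached config
		}
		return Config{}, err
	}
	p.last = &cfg
	return cfg, nil
}

// providerWithMetrics increments the error counter whenever the wrapped provider fails.
type providerWithMetrics struct {
	delegate ConfigProvider
}

func (p *providerWithMetrics) Get() (Config, error) {
	cfg, err := p.delegate.Get()
	if err != nil {
		errorCounter.WithLabelValues("get").Inc() // label value assumed
	}
	return cfg, err
}

Wrapped together (metrics around fallback around the concrete provider), the counter sees every failure while consumers still receive the last known-good config.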
gitops.Service
The gitops.Service is an interface that is able to return a v1alpha1.Central by applying overrides (from the gitops.Config) on top of the default, hardcoded central. See acs-fleet-manager/internal/dinosaur/pkg/gitops/service.go
Line 29 in fcc3aca
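Conceptually, the service applies the configured overrides on top of a default CR, roughly like this (a sketch with assumed names, helpers, and config structure; the real definitions are in service.go):

// Service resolves the Central CR for a tenant, with GitOps overrides applied
// on top of the hardcoded default.
type Service interface {
	GetCentral(centralID string) (v1alpha1.Central, error)
}

type service struct {
	configProvider ConfigProvider
}

func (s *service) GetCentral(centralID string) (v1alpha1.Central, error) {
	central := defaultCentral() // assumed helper returning the hardcoded default CR

	cfg, err := s.configProvider.Get()
	if err != nil {
		return v1alpha1.Central{}, err
	}
	for _, override := range cfg.Centrals.Overrides { // config structure assumed
		if !overrideMatches(override, centralID) { // assumed matching helper
			continue
		}
		if err := applyOverride(&central, override); err != nil { // assumed patch helper
			return v1alpha1.Central{}, err
		}
	}
	return central, nil
}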
presenter
The gitops.Service is used by the presenters.ManagedCentralsPresenter to convert a dbapi.CentralRequest into a private.ManagedCentral. It populates the private.ManagedCentral CentralYAML property. See https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-102fa59115d9a7ed481c807cf7788df59014cc77128f29cf2b21dd4021f6eabe
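On the presenter side, the wiring looks roughly like this (a sketch; the method names, the yaml package, and the exact CentralYAML field path are assumptions, see the linked diff for the actual code):

func (p *ManagedCentralsPresenter) PresentManagedCentral(from *dbapi.CentralRequest) (private.ManagedCentral, error) {
	// resolve the Central CR for this tenant via the gitops service
	central, err := p.gitopsService.GetCentral(from.ID) // assumed method
	if err != nil {
		return private.ManagedCentral{}, err
	}
	centralYAML, err := yaml.Marshal(central) // assumed yaml package
	if err != nil {
		return private.ManagedCentral{}, err
	}

	managedCentral := p.presentManagedCentral(from) // assumed existing conversion step
	managedCentral.Spec.CentralYAML = string(centralYAML) // field path is an assumption
	return managedCentral, nil
}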
Move dinosaurService.ListByClusterID to dataPlaneCentralService.ListByClusterID
The dinosaurService.ListByClusterID is better suited to be attached to the dataPlaneCentralService, since it is only used by fleetshard to poll the centrals.
Added documentation
See https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-b2893d8e9f6459a31df49afcfde1154a622094c48c96520b8d0e903ff89d5cbc and https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-414c9d6677df2275fa8c5d398c779cc38dc56e957d0028c03eabd65da5c34f50
Checklist (Definition of Done)
Test manual
Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
ROX-12345: ...
Add secret to app-interface Vault or Secrets Manager if necessary
RDS changes were e2e tested manually
Check AWS limits are reasonable for changes provisioning new resources