ROX-19013 Add gitops to fleetmanager #1233
Conversation
Good job, thanks for splitting the PRs!
}

var (
	errorCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
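For reference, a minimal sketch of how such a counter is typically declared, registered, and incremented with client_golang. The metric name follows the alert example discussed further down; the package, label, and help text are assumptions, not necessarily the exact code in this PR.

package gitops // assumed package

import "github.com/prometheus/client_golang/prometheus"

var errorCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "gitops_error_total", // assumed name, matching the alert example below
	Help: "Total number of errors encountered while handling the GitOps configuration",
}, []string{"reason"}) // label is an assumption

func init() {
	// register with the default registry so the metric is exposed on /metrics
	prometheus.MustRegister(errorCounter)
}

// on a failed config load or validation:
// errorCounter.WithLabelValues("validation").Inc()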
Is the error counter reset automatically when I re-apply a valid config?
No, it would just stay at the same level. But the prom query would be something like increase(counter_total[5m])
...
In that case the counter would provide wrong information when the counter increases continuously, reporting 100 errors while there are 0 errors. It creates an implicit requirement on the writer of a prom query to get usable data.
Is it possible to decrease the counter when the error is resolved, basically having a boolean value?
If in doubt, we could consult SRE on how to handle this metric.
Caveat: I am not very familiar with Prometheus metrics; if this is considered best practice, can you share an article/doc with me?
That's how counters are supposed to work. It's not a problem when querying it with PromQL.
https://prometheus.io/docs/concepts/metric_types/
https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Counter
Could it be implemented with a gauge in this case? 🤔
https://prometheus.io/docs/concepts/metric_types/#gauge
You could, but that's not standard. Check the usages of counters in the repository: they are canonically used to represent the number of errors, the number of requests, etc., e.g. ReconcilerFailureCount. Then we can craft a query like increase to watch the rate at which this number increases. https://prometheus.io/docs/prometheus/latest/querying/functions/#increase "calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for."
Hm, it still delegates the complexity of the implementation to the writer of the prom query.
We need an indication that an error happened in the GitOps configuration as feedback for the engineer who merged a pull request; we are kind of abusing the alerting system a bit here.
This would make the gauge the better alternative, resulting in a prom query that is simple and idiomatic to use.
A user of the metric does not know how to work with this data without additional documentation on how to interpret it.
Using a counter increases the system's complexity in this use case overall, making the gauge the simpler and therefore better alternative.
That's incorrect. Gauges are used for metrics that can decrease, such as memory usage, CPU usage, or the number of connected sensors. Counters are used to measure increasing values, such as the number of errors thrown or the number of requests handled.
The only valid way to use a gauge in this context would be to calculate the "number of consecutive gitops errors" - as opposed to the total number of gitops errors. That has little value in this context.
Using counters to report errors through prometheus is a standard use of this type of metric.
See
- https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_conn_man/stats
- https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/network_filters/tcp_proxy_filter
- https://www.envoyproxy.io/docs/envoy/latest/configuration/upstream/cluster_manager/cluster_stats
- https://kubernetes.io/docs/reference/instrumentation/metrics/
To address the complexity issue: I don't think that the use of counters would increase the complexity of the queries. On the contrary, using a Gauge for a type of value that is usually represented as a Counter would increase complexity.
In that case the counter would provide wrong information when the counter increases continuously, reporting 100 errors while there are 0 errors
This is common behavior in PromQL queries. For example, an http_errors_total counter will always report the cumulative total of errors. But it is common practice to obtain the rate of change of that metric and base alerts on that. Usually (it is an overwhelmingly common pattern) counters are wrapped with a sum(rate(my_metric[5m])) by (my_label) expression.
We need an indication that an error happened in the GitOps configuration as feedback for the engineer who has merged a pull-request
Yes, this would be handled by a counter. If we measure the rate of change of the counter, we can alert ourselves that something went wrong. For example, if we have more than 2 gitops errors per 5 minutes, we might want to throw an alert.
I think I see your point with a gauge, where you would like to "reset" the value to 0 if there are no errors. But it is usually not the metric type that is used to carry the meaning of "events occurring", such as errors or requests. It should rather carry the meaning of a measurable float/integer value, such as the CPU usage, number of active clients, etc.
How counters / rate / increase work
There is a widely used function, rate, that allows us to calculate the "per-second average rate of increase" of counters. Formulated as

    rate(errors_total[5m]) ≈ (last value in the 5m window - first value in the 5m window) / 300s

where increase is a helper function that returns rate multiplied by the number of seconds under the specified time window, or

    increase(errors_total[5m]) = rate(errors_total[5m]) * 300

In our case, we are interested in occasions where the errors are increasing. So when

    increase(gitops_error_total[15m]) > 0

we know that at least one error occurred during the last 15 minutes.
The alert would be pretty straightforward to write with PromQL (and in all honesty, much simpler than most Prometheus alerts that we have in the rhacs-observability repository).
example:
alert: FleetManagerGitOpsError
expr: increase(gitops_error_total[15m]) > 0
for: 5m
@ludydoo Thanks for the great explanation, it makes sense to me now. I agree, let's keep the counter and align with best practices as you suggested.
Summary: The ConfigProvider implementation needs a bit of refactoring to be less abstract, and the Prometheus counter looks like it provides wrong data.
🎉
LGTM 🚢
if tc.readerWillFail {
	p.reader = &failingReader
} else {
	p.reader = &successfulReader
}

if tc.validationWillFail {
	p.validationFn = failingValidationFn
} else {
	p.validationFn = successfulValidationFn
}
nit: the validationFns and readers could be part of the configured test structs directly, which would remove the need for the if-statements (see the sketch below).
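A hedged sketch of that refactor. The test function name, the provider struct literal, and the Reader/validation types are assumptions based on the snippet above, not the exact code in this package:

package gitops

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// Each test case carries its collaborators directly, so the test body
// no longer needs if-statements to pick a reader or validation function.
func TestProvider_Get(t *testing.T) {
	tests := []struct {
		name         string
		reader       Reader             // assumed interface implemented by failingReader / successfulReader
		validationFn func(Config) error // assumed signature
		wantErr      bool
	}{
		{name: "reader fails", reader: &failingReader, validationFn: successfulValidationFn, wantErr: true},
		{name: "validation fails", reader: &successfulReader, validationFn: failingValidationFn, wantErr: true},
		{name: "happy path", reader: &successfulReader, validationFn: successfulValidationFn},
	}
	for _, tc := range tests {
		t.Run(tc.name, func(t *testing.T) {
			p := provider{reader: tc.reader, validationFn: tc.validationFn} // assumed struct literal
			_, err := p.Get()
			if tc.wantErr {
				require.Error(t, err)
				return
			}
			require.NoError(t, err)
		})
	}
}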
require.NoError(t, err)
provider := NewFileReader(tmpFilePath)
_, err = provider.Read()
assert.Error(t, err)
nit: require.NoError
This one expects an error (TestFileReader_Get_FailsIfFileIsInvalidYAML) - the YAML is invalid, so Read() should fail.
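If the error path should stop the test immediately on an unexpected pass, the final assertion could use require as well (a sketch; whether to switch from assert is only a nit):

_, err = provider.Read()
require.Error(t, err) // the temp file contains invalid YAML, so Read() is expected to fail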
New changes are detected. LGTM label has been removed.
Force-pushed b2b6115 to 621c8ec.
Force-pushed f20a5b7 to e05e5c1.
Force-pushed 74ddfc3 to bfd05a7.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: kurlov, ludydoo, SimonBaeumer. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Description
This PR applies the gitops configuration to centrals on the fleetmanager side.
gitops.ConfigProvider
The gitops.ConfigProvider is an interface that returns a gitops config. Currently, it is designed to use an empty config. There are a few decorators, such as a fallbackToLastWorkingConfig (caches the last working config) and a providerWithMetrics (increments a prometheus counter on errors). See acs-fleet-manager/internal/dinosaur/pkg/gitops/provider.go
Line 21 in fcc3aca
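The decorator layering can be pictured roughly like this (a sketch: method names, signatures, and the counter label value are assumptions rather than the exact code in provider.go):

// ConfigProvider returns the current GitOps configuration.
type ConfigProvider interface {
	Get() (Config, error)
}

// fallbackToLastWorkingConfig serves the last successfully loaded config
// when the wrapped provider fails.
type fallbackToLastWorkingConfig struct {
	delegate ConfigProvider
	last     *Config
}

func (p *fallbackToLastWorkingConfig) Get() (Config, error) {
	cfg, err := p.delegate.Get()
	if err != nil {
		if p.last != nil {
			return *p.last, nil // fall back to the cached config
		}
		return Config{}, err
	}
	p.last = &cfg
	return cfg, nil
}

// providerWithMetrics increments the error counter whenever the wrapped provider fails.
type providerWithMetrics struct {
	delegate ConfigProvider
}

func (p *providerWithMetrics) Get() (Config, error) {
	cfg, err := p.delegate.Get()
	if err != nil {
		errorCounter.WithLabelValues("get").Inc() // label value assumed
	}
	return cfg, err
}

Wrapped together (metrics around fallback around the concrete provider), the counter sees every failure while consumers still receive the last known-good config.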
gitops.Service
The gitops.Service is an interface that is able to return a v1alpha1.Central by applying overrides (from the gitops.Config) on top of the default, hardcoded central. See acs-fleet-manager/internal/dinosaur/pkg/gitops/service.go
Line 29 in fcc3aca
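Conceptually, the service applies the configured overrides on top of a default CR, roughly like this (a sketch with assumed names, helpers, and config structure; the real definitions are in service.go):

// Service resolves the Central CR for a tenant, with GitOps overrides applied
// on top of the hardcoded default.
type Service interface {
	GetCentral(centralID string) (v1alpha1.Central, error)
}

type service struct {
	configProvider ConfigProvider
}

func (s *service) GetCentral(centralID string) (v1alpha1.Central, error) {
	central := defaultCentral() // assumed helper returning the hardcoded default CR

	cfg, err := s.configProvider.Get()
	if err != nil {
		return v1alpha1.Central{}, err
	}
	for _, override := range cfg.Centrals.Overrides { // config structure assumed
		if !overrideMatches(override, centralID) { // assumed matching helper
			continue
		}
		if err := applyOverride(&central, override); err != nil { // assumed patch helper
			return v1alpha1.Central{}, err
		}
	}
	return central, nil
}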
presenter
The gitops.Service is used by the presenters.ManagedCentralsPresenter to convert a dbapi.CentralRequest into a private.ManagedCentral. It populates the private.ManagedCentral CentralYAML property. See https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-102fa59115d9a7ed481c807cf7788df59014cc77128f29cf2b21dd4021f6eabe
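On the presenter side, the wiring looks roughly like this (a sketch; the method names, the yaml package, and the exact CentralYAML field path are assumptions, see the linked diff for the actual code):

func (p *ManagedCentralsPresenter) PresentManagedCentral(from *dbapi.CentralRequest) (private.ManagedCentral, error) {
	// resolve the Central CR for this tenant via the gitops service
	central, err := p.gitopsService.GetCentral(from.ID) // assumed method
	if err != nil {
		return private.ManagedCentral{}, err
	}
	centralYAML, err := yaml.Marshal(central) // assumed yaml package
	if err != nil {
		return private.ManagedCentral{}, err
	}

	managedCentral := p.presentManagedCentral(from) // assumed existing conversion step
	managedCentral.Spec.CentralYAML = string(centralYAML) // field path is an assumption
	return managedCentral, nil
}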
Move dinosaurService.ListByClusterID to dataPlaneCentralService.ListByClusterID
The dinosaurService.ListByClusterID is better suited to be attached to the dataPlaneCentralService, since it is only used by fleetshard to poll the centrals.
Added documentation
See https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-b2893d8e9f6459a31df49afcfde1154a622094c48c96520b8d0e903ff89d5cbc and https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-414c9d6677df2275fa8c5d398c779cc38dc56e957d0028c03eabd65da5c34f50
Checklist (Definition of Done)
Test manual
Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
ROX-12345: ...
Add secret to app-interface Vault or Secrets Manager if necessary
RDS changes were e2e tested manually
Check AWS limits are reasonable for changes provisioning new resources