
ROX-19013 Add gitops to fleetmanager #1233

Merged: 3 commits from ROX-19013-fleetmanager-service into main on Sep 14, 2023
Conversation

@ludydoo (Collaborator) commented Sep 4, 2023

Description

This PR applies the gitops configuration to centrals on the fleetmanager side.

gitops.ConfigProvider

The gitops.ConfigProvider is an interface that returns a gitops config. Currently, it returns an empty config. There are a few decorators, such as fallbackToLastWorkingConfig (caches the last working config) and providerWithMetrics (increments a Prometheus counter on errors). See

func NewDefaultConfigProvider() ConfigProvider {

At some point, it will read from a file; we decided to do this a bit later.
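For illustration, a minimal sketch of how such a decorator chain could be put together; the Get method name, the empty Config struct, and the field names are assumptions for this sketch, not necessarily the code in the PR:

package gitops

import "github.com/prometheus/client_golang/prometheus"

// Config is the gitops configuration (fields omitted in this sketch).
type Config struct{}

// ConfigProvider returns the current gitops configuration.
// The Get method name is an assumption for this sketch.
type ConfigProvider interface {
	Get() (Config, error)
}

// fallbackToLastWorkingConfig caches the last config returned successfully
// by the delegate and serves it again if the delegate starts failing.
type fallbackToLastWorkingConfig struct {
	delegate    ConfigProvider
	lastWorking *Config
}

func (p *fallbackToLastWorkingConfig) Get() (Config, error) {
	cfg, err := p.delegate.Get()
	if err != nil {
		if p.lastWorking != nil {
			return *p.lastWorking, nil // fall back to the cached config
		}
		return Config{}, err
	}
	p.lastWorking = &cfg
	return cfg, nil
}

// providerWithMetrics increments a Prometheus counter whenever the delegate
// fails, then passes the result through unchanged.
type providerWithMetrics struct {
	delegate ConfigProvider
	errors   prometheus.Counter
}

func (p *providerWithMetrics) Get() (Config, error) {
	cfg, err := p.delegate.Get()
	if err != nil {
		p.errors.Inc()
	}
	return cfg, err
}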

gitops.Service

The gitops.Service is an interface that returns a v1alpha1.Central by applying overrides (from the gitops.Config) on top of the default, hardcoded central. See

func (s *service) GetCentral(ctx CentralParams) (v1alpha1.Central, error) {
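A minimal sketch of what "applying overrides on top of the default Central" could look like, assuming a JSON merge-patch approach; the function name, the override shape, and the v1alpha1 import path are illustrative assumptions, not necessarily what the PR implements:

package gitops

import (
	"encoding/json"

	jsonpatch "github.com/evanphx/json-patch"
	// Import path for the Central CRD API is assumed for this sketch.
	"github.com/stackrox/rox/operator/apis/platform/v1alpha1"
	"sigs.k8s.io/yaml"
)

// applyCentralOverride layers a YAML override (as it might appear in the
// gitops config) on top of an already rendered default Central by converting
// both to JSON and applying an RFC 7386 merge patch.
func applyCentralOverride(central v1alpha1.Central, overrideYAML string) (v1alpha1.Central, error) {
	originalJSON, err := json.Marshal(central)
	if err != nil {
		return v1alpha1.Central{}, err
	}
	patchJSON, err := yaml.YAMLToJSON([]byte(overrideYAML))
	if err != nil {
		return v1alpha1.Central{}, err
	}
	mergedJSON, err := jsonpatch.MergePatch(originalJSON, patchJSON)
	if err != nil {
		return v1alpha1.Central{}, err
	}
	var merged v1alpha1.Central
	if err := json.Unmarshal(mergedJSON, &merged); err != nil {
		return v1alpha1.Central{}, err
	}
	return merged, nil
}

GetCentral would then render the hardcoded default Central and run each matching override from the config through a function like this.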

presenter

The gitops.Service is used by the presenters.ManagedCentralsPresenter to convert a dbapi.CentralRequest into a private.ManagedCentral. It populates the private.ManagedCentral CentralYAML property. See https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-102fa59115d9a7ed481c807cf7788df59014cc77128f29cf2b21dd4021f6eabe
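A rough sketch of the presenter-side usage; the function name, the CentralParams fields, and the YAML marshalling library are assumptions here, only GetCentral and the CentralYAML property come from the PR description, and the import path is abbreviated from the repository layout:

package presenters

import (
	"github.com/stackrox/acs-fleet-manager/internal/dinosaur/pkg/gitops"
	"sigs.k8s.io/yaml"
)

// centralYAMLFor renders the Central for a request through the gitops
// service and returns it as YAML, ready to be set on the CentralYAML
// property of private.ManagedCentral.
func centralYAMLFor(svc gitops.Service, requestID, requestName string) (string, error) {
	central, err := svc.GetCentral(gitops.CentralParams{ID: requestID, Name: requestName})
	if err != nil {
		return "", err
	}
	out, err := yaml.Marshal(central)
	if err != nil {
		return "", err
	}
	return string(out), nil
}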

Move dinosaurService.ListByClusterID to dataPlaneCentralService.ListByClusterID.

dinosaurService.ListByClusterID belongs more naturally on the dataPlaneCentralService, since it is only used by fleetshard to poll the centrals.

Added documentation

See https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-b2893d8e9f6459a31df49afcfde1154a622094c48c96520b8d0e903ff89d5cbc and https://github.com/stackrox/acs-fleet-manager/pull/1233/files#diff-414c9d6677df2275fa8c5d398c779cc38dc56e957d0028c03eabd65da5c34f50

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security and business related topics privately. Will move any security and business related topics that arise to private communication channel.
  • Add secret to app-interface Vault or Secrets Manager if necessary
  • RDS changes were e2e tested manually
  • Check AWS limits are reasonable for changes provisioning new resources

@SimonBaeumer (Member) left a comment

Good job, thanks for splitting the PRs!

internal/dinosaur/pkg/gitops/provider.go (resolved)
internal/dinosaur/pkg/gitops/provider.go (outdated, resolved)
internal/dinosaur/pkg/gitops/provider.go (outdated, resolved)
Review comment on the following snippet:

var (
	errorCounter = prometheus.NewCounterVec(prometheus.CounterOpts{
Member:

Is the error counter reset automatically when I re-apply a valid config?

Collaborator Author:

No, it would just stay at the same level. But the prom query would be something like increase(counter_total[5m]) ...

Member:

In that case the counter would provide wrong information when it increases continuously, reporting 100 errors while there are currently 0 errors. It creates an implicit requirement on the writer of a prom query to get usable data.
Is it possible to decrease the counter when the error is resolved, basically having a boolean value?
If in doubt, we could consult SRE on how to handle this metric.
Caveat: I am not very familiar with Prometheus metrics; if this is considered best practice, can you share an article/doc with me?

Collaborator Author:

That's how counters are supposed to work. It's not a problem to query with PromQL.

https://prometheus.io/docs/concepts/metric_types/

https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#Counter

@SimonBaeumer (Member) commented Sep 11, 2023

Could it be implemented with a gauge in this case? 🤔
https://prometheus.io/docs/concepts/metric_types/#gauge

Collaborator Author:

You could, but that's not standard. Check the usages of counters in the repository: they are canonically used to represent the number of errors, the number of requests, etc., e.g. ReconcilerFailureCount.

Then, we can craft a query like increase to watch the rate at which this number increases. https://prometheus.io/docs/prometheus/latest/querying/functions/#increase

> calculates the increase in the time series in the range vector. Breaks in monotonicity (such as counter resets due to target restarts) are automatically adjusted for

Member:

Hm, it still delegates the complexity of the implementation to the writer of the prom query.
We need an indication that an error happened in the GitOps configuration as feedback for the engineer who merged a pull request; we are kind of abusing the alerting system a bit here.
This would make the gauge the better alternative, resulting in a prom query which is simple and idiomatic to use.
A user of the metric does not know how to work with this data without additional documentation on how to interpret it.

Using a counter overall increases the system's complexity in this use case, making the gauge the simpler and therefore better alternative.

Collaborator Author:

That's incorrect. Gauges are used for metrics that can decrease, such as memory usage, CPU usage, or the number of connected sensors. Counters are used to measure monotonically increasing values, such as the number of errors thrown or the number of requests handled.

The only valid way to use a gauge in this context would be to calculate the "number of successive gitops errors" - as opposed to the total number of gitops errors. This has little value in this context.

Using counters to report errors through Prometheus is a standard use of this type of metric.

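To make the distinction concrete, a small client_golang sketch; the metric names follow the examples used in this thread, not necessarily the actual code:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Counter: a monotonically increasing count of events, e.g. gitops config
// errors. It only goes up; PromQL's rate()/increase() handle resets.
var gitopsErrors = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "gitops_error_total",
	Help: "Total number of errors encountered while loading the gitops configuration.",
})

// Gauge: a value that can go up and down, e.g. the number of currently
// connected sensors. This is what gauges are for, not counting error events.
var connectedSensors = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "connected_sensors",
	Help: "Number of currently connected sensors.",
})

func onGitopsConfigError() { gitopsErrors.Inc() }

func onSensorConnected()    { connectedSensors.Inc() }
func onSensorDisconnected() { connectedSensors.Dec() }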

Collaborator Author:

To address the complexity issue: I don't think the use of counters increases the complexity of the queries. On the contrary, using a Gauge for a type of value that is usually represented as a Counter would increase complexity.

> In that case the counter would provide wrong information when it increases continuously, reporting 100 errors while there are currently 0 errors

This is common behavior for PromQL queries. For example, an http_errors_total counter will always report the cumulative total of errors. But it is common practice to obtain the rate of change of that metric and base alerts on that. Usually (it is an overwhelmingly common pattern) counters are wrapped in a sum(rate(my_metric[5m])) by (my_label) expression.

> We need an indication that an error happened in the GitOps configuration as feedback for the engineer who merged a pull request

Yes, this would be handled by a counter. If we measure the rate of change of the counter, we can alert ourselves that something went wrong. For example, if we have more than 2 gitops errors per 5 minutes, we might want to throw an alert.

I think I see your point with a gauge, where you would like to "reset" the value to 0 if there are no errors. But a gauge is usually not the metric type used to represent "events occurring", such as errors or requests. It should rather carry a measurable float/integer value, such as CPU usage, the number of active clients, etc.

How counters / rate / increase work

There is a widely used function, rate, that allows us to calculate the "per-second average rate of increase" of counters. Formulated as

$$r(t) = \frac{f(t)-f(t-\Delta t)}{\Delta t} = \frac{\Delta y}{\Delta t}$$

where $f(t)$ is the value of the metric at time $t$ and $\Delta t$ is the time range. Note that there will be some extrapolation if there is no sample exactly at time $t$.

$r(t)$ is never negative, because counter resets (e.g. due to restarts) are adjusted for. So the codomain of $r(t)$ is $[0,\infty)$.

increase is a helper function that returns rate multiplied by the number of seconds in the specified time window, i.e.

$$ \begin{split} i(t) & = r(t) * \Delta t \\ & = \frac{f(t)-f(t-\Delta t)}{\Delta t} * \Delta t \\ & = f(t)-f(t-\Delta t) \end{split} $$

In our case, we are interested in occasions where the errors are increasing, i.e. when $f(t)-f(t-\Delta t) > 0$.
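As a concrete, made-up illustration: if the error counter reads 100 at $t - \Delta t$ and 103 at $t$ over a 15-minute window, then

$$i(t) = f(t) - f(t - \Delta t) = 103 - 100 = 3 > 0$$

so the alert condition in the example below would fire.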

The alert would be pretty straightforward to write with PromQL (and in all honesty, much simpler than most Prometheus alerts that we have in the rhacs-observability repository).

example:

alert: FleetManagerGitOpsError
expr:  increase(gitops_error_total[15m]) > 0
for:   5m

Member:

@ludydoo Thanks for the great explanation, it makes sense to me now. I agree, let's keep the counter and align with best practices as you suggested.

internal/dinosaur/pkg/gitops/service.go (outdated, resolved)
internal/dinosaur/pkg/gitops/service_test.go (outdated, resolved)
@ludydoo ludydoo requested a review from SimonBaeumer September 5, 2023 08:10
@SimonBaeumer (Member) left a comment

Summary: The ConfigProvider implementation needs a bit of refactoring to be less abstract, and the Prometheus counter looks like it provides wrong data.

internal/dinosaur/pkg/gitops/README.md (resolved)
internal/dinosaur/pkg/gitops/service.go (outdated, resolved)
internal/dinosaur/pkg/gitops/service.go (resolved)
internal/dinosaur/pkg/gitops/service.go (outdated, resolved)
@kurlov (Member) left a comment

🎉

@SimonBaeumer (Member) left a comment

LGTM 🚢

Comment on lines 87 to 97:

if tc.readerWillFail {
	p.reader = &failingReader
} else {
	p.reader = &successfulReader
}

if tc.validationWillFail {
	p.validationFn = failingValidationFn
} else {
	p.validationFn = successfulValidationFn
}
@SimonBaeumer (Member) commented Sep 11, 2023

nit: the validationFns and readers can be part of the configured test structs directly, which would remove the if-statements.
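A sketch of that suggestion, reusing the test doubles already present in the file (failingReader, successfulReader, failingValidationFn, successfulValidationFn); the field type names, the provider struct literal, the Get method name, and the imports (testing, testify's assert) are assumptions about the surrounding test code:

func TestConfigProviderGet(t *testing.T) {
	cases := []struct {
		name         string
		reader       reader       // field type name assumed
		validationFn validationFn // field type name assumed
		wantErr      bool
	}{
		{name: "reader fails", reader: &failingReader, validationFn: successfulValidationFn, wantErr: true},
		{name: "validation fails", reader: &successfulReader, validationFn: failingValidationFn, wantErr: true},
		{name: "happy path", reader: &successfulReader, validationFn: successfulValidationFn},
	}

	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			// Each case carries its own reader and validation function,
			// so the setup no longer branches on boolean flags.
			p := provider{reader: tc.reader, validationFn: tc.validationFn} // struct literal assumed
			_, err := p.Get()                                               // method name assumed
			if tc.wantErr {
				assert.Error(t, err)
			} else {
				assert.NoError(t, err)
			}
		})
	}
}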

require.NoError(t, err)
provider := NewFileReader(tmpFilePath)
_, err = provider.Read()
assert.Error(t, err)
Member:

nit: require.NoError

Collaborator Author:

This one expects an error (TestFileReader_Get_FailsIfFileIsInvalidYAML), since the YAML is invalid.
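For context, a sketch of how the whole test likely reads, which is why the final assertion is assert.Error rather than require.NoError; the temp-file setup and package name are assumptions, while NewFileReader, provider.Read, and the test name come from the thread:

package gitops

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestFileReader_Get_FailsIfFileIsInvalidYAML(t *testing.T) {
	// Setup: write a deliberately invalid YAML file; this step must succeed.
	tmpFilePath := filepath.Join(t.TempDir(), "config.yaml")
	err := os.WriteFile(tmpFilePath, []byte("{not: valid: yaml"), 0o600)
	require.NoError(t, err)

	// Under test: reading the invalid YAML is expected to fail.
	provider := NewFileReader(tmpFilePath)
	_, err = provider.Read()
	assert.Error(t, err)
}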

@openshift-ci openshift-ci bot removed the lgtm label Sep 11, 2023
openshift-ci bot (Contributor) commented Sep 11, 2023

New changes are detected. LGTM label has been removed.

@ludydoo ludydoo force-pushed the ROX-19013-add-central-yaml-to-api branch from b2b6115 to 621c8ec Compare September 11, 2023 14:38
@ludydoo ludydoo force-pushed the ROX-19013-fleetmanager-service branch from f20a5b7 to e05e5c1 Compare September 11, 2023 14:54
Base automatically changed from ROX-19013-add-central-yaml-to-api to main September 11, 2023 15:21
@ludydoo ludydoo force-pushed the ROX-19013-fleetmanager-service branch from 74ddfc3 to bfd05a7 Compare September 14, 2023 13:52
openshift-ci bot (Contributor) commented Sep 14, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kurlov, ludydoo, SimonBaeumer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [SimonBaeumer,kurlov]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ludydoo ludydoo merged commit e5d198a into main Sep 14, 2023
5 checks passed
@ludydoo ludydoo deleted the ROX-19013-fleetmanager-service branch September 14, 2023 14:41