Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/util/log: emit counter metrics that track failed fluent server connections #102753

Closed
abarganier opened this issue May 3, 2023 · 1 comment
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) db-cy-23

Comments

@abarganier
Copy link
Contributor

abarganier commented May 3, 2023

Is your feature request related to a problem? Please describe.
Currently, if a fluent-server log sink in CRDB fails to connect to the fluent-server target, a log message is emitted indicating the failed connection attempt.

It would be nice to also have this tracked in the form of a counter metric, which can help us more easily correlate this failure with other metrics, such as those emitted by fluent-bit processes.

Describe the solution you'd like
Add a counter metric that's incremented each time a fluent-server log sink fails to connect to the target. Make sure that fluent-server log sinks with buffering enabled (wrapped in a bufferSink) still increment these metrics on connection failure.

Jira issue: CRDB-27635

Epic CC-9681

@abarganier abarganier added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels May 3, 2023
@abarganier abarganier self-assigned this May 3, 2023
craig bot pushed a commit that referenced this issue Jul 24, 2023
106552: tests: support float approximation in roachtest query comparison utils r=rharding6373 a=rharding6373

tests, logictest, floatcmp: refactor comparison test util functions
    
This commit moves some float comparison test util functions from
logictest into the floatcmp package. It also moves a query result
comparison function from the tlp file to query_comparison_util in the
tests package.
    
This commit also marks roachtests as testonly targets.
    
Epic: none
    
Release note: None


tests: support float approximation in roachtest query comparison utils
    
Before this change unoptimized query oracle tests would compare results
using simple string comparison. However, due to floating point precision
limitations, it's possible for results with floating point to diverge
during the course of normal computation. This results in test failures
that are difficult to reproduce or determine whether they are expected
behavior.
    
This change utilizes existing floating point comparison functions used
by logic tests to match float values only to a specific precision. Like
the logic tests, we also have special handling for floats and decimals
under the s390x architecture (see #63244). In order to avoid costly
comparisons, we only check floating point precision if the naiive string
comparison approach fails and there are float or decimal types in the
result.
    
Epic: None
Fixes: #95665
    
Release note: None

106607: pkg/util/log: introduce `fluent.sink.conn.errors` metric to log package r=knz a=abarganier

Logging is a critical subsystem within CRDB, but as things
are today, we have very little observability into logging
itself. For starters, we have no metrics in the logging
package at all!

This makes it difficult to observe things within the log
package. For example, if a fluent-server log sink fails
to connect to FluentBit, how can we tell? We get some STDOUT
message, but that's about it.

It's time to get some metrics into the log package.

Doing so is a bit of a tricky dance, because pretty much every
package in CRDB imports the logging package, meaning you
almost always get a circular dependency when trying to make
use of any library within pkg/util/log. Therefore, this PR 
injects metrics functionality into the logging package.
It does so via a new interface called `LogMetrics` with
functions that enable its users to modify metrics.

The implementation of the interface can live elsewhere,
such as the metrics package itself, whe circular dependencies
aren't such a pain. We can then inject the implementation
into the log package.

With all that plumbing done, the PR also introduces a new metric,
`fluent.sink.conn.errors` representing fluentbit connection errors 
whenever a fluent-server log sink fails to establish a connection.

We can see the metric represented below in a test, where no
fluent-server was running for a moment, before starting it.
Note that the connection errors level off once the server was 
started:
<img width="1018" alt="Screenshot 2023-07-11 at 1 32 57 PM" src="https://github.com/cockroachdb/cockroach/assets/8194877/ccacf84e-e585-4a76-98af-ed145629f9ef">

Release note (ops change): This patch introduces the counter
metric `fluent.sink.conn.errors` to the CockroachDB tsdb,
which is exposed to `/_status/vars` clients as
`fluent_sink_conn_errors`. The metric is incremented whenever
a `fluent-server` log sink fails to establish a connection to
the log sink pointed to by the `address` for the sink in the
provided log config.

Epic: CC-9681

Addresses: #102753

107453: Revert "github-pull-request-make: temporary workaround" r=postamar a=postamar

This reverts commit 2bd61c0.

Relates to #106920.

Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
@abarganier
Copy link
Contributor Author

Resolved via #106607

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) db-cy-23
Projects
None yet
Development

No branches or pull requests

1 participant