Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log redaction: reduce overly conservative log redaction policies #86316

Closed
abarganier opened this issue Aug 17, 2022 · 2 comments
Closed

log redaction: reduce overly conservative log redaction policies #86316

abarganier opened this issue Aug 17, 2022 · 2 comments
Assignees
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@abarganier
Copy link
Contributor

abarganier commented Aug 17, 2022

Is your feature request related to a problem? Please describe.
Support staff have long relied on debug zip bundles to help folks running CockroachDB debug & diagnose issues. A key piece of these debug zip bundles are the dumps of all the logs available on each node. These logs are a core piece of observability data for operators and support staff alike.

However, as CockroachDB matures, so do the compliance policies that it must meet. With compliance comes a need for us to expand our redaction to all sources of data that flow out of CockroachDB, especially those used by folks who don't actually own the running cluster (e.g. support staff).

debug zip has recently been updated to support a --redact flag, which has taken over from the now deprecated --redact-logs flag. A redacted debug zip will redact all of its logs - a feature that has existed for some time now, but has not been utilized much as our compliance policies didn't mandate it.

Today, our compliance policies are expanding, and in the future a redacted debug zip bundle may be all that our support staff are able to gain access to.

Unfortunately, support has made it very clear that redacted logs as they stand today are insufficient for debugging & diagnosing support escalations. Redacted logs feel "overly redacted" to the point where operational information that is unrelated to customer data (e.g. NodeIDs, operation priorities, ClusterID, queries that already have constants stripped, etc).

Describe the solution you'd like
If in the future we are going to enforce that our support staff is only able to access logs in redacted form, we need to make sure that those logs are redacting data thoughtfully, as opposed to the overly-conservative redaction of today.

We need to define what "sensitive data" means in the context of log redaction, and draw clear lines around what's redacted & what's left unredacted. We can then make targeted improvements to log redaction by implementing SafeFormatter where appropriate, or explicitly marking objects interpolated into logs as "safe".

Additional context
In other redaction work aimed at meeting compliance requirements, all customer data has been identified as "sensitive". The exception here is key data, which may contain row-level data but has been deemed necessary to support CockroachDB. Can the same definitions be applied to log redaction?

Jira issue: CRDB-18694

Epic CRDB-12732

@abarganier abarganier added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels Aug 17, 2022
@abarganier
Copy link
Contributor Author

NB: Our support staff has graciously provided me with a list of specific log lines where unnecessary redaction is taking place. This list is well-scoped & actionable, so I will begin working through the list as a first iteration of improvements.

@abarganier abarganier self-assigned this Aug 17, 2022
craig bot pushed a commit that referenced this issue Sep 8, 2022
86475: cli: support `COCKROACH_REDACTION_POLICY_MANAGED` env var r=knz a=abarganier

Currently, log redaction policies have no way to discern their own
runtime environment. Logged objects that may be considered sensitive
and unsafe in on-prem deployments of CockroachDB might be otherwise
safe when we're running within a managed service such as Cockroach
Cloud. For example, CLI argument lists included as part of the
`cockroach start` command are already known to those operating the
managed service, so there's no reason we should be redacting this
information from logs in this case.

This patch adds the `--managed` flag to the start commands. This
flag is plumbed through to the global logging config object where
the log package has access to it.

We also introduce `log.SafeManaged(s interface{})`, which conditionally
marks an object with `redact.Safe()` depending on whether or not we
are running as a managed service. This is inspired by the original
`log.SafeOperational(s interface{})` function.

I believe that this new `--managed` flag should not be advertised in
our public documentation, as its intended use is for those running
Cockroach Cloud.

Release justification: low-risk, high benefit changes to existing
functionality. The new CLI flag has a minimal impact on DB
operations and provides high value reduction of log redaction,
which will be necessary for support staff with our latest compliance
requirements.

Release note (cli change): `cockroach start` commands now have an
additional `--managed` flag that can be used to indicate whether
or not the node is running as part of a managed service (e.g.
Cockroach Cloud). Perhaps this shouldn't be advertised in our
public facing docs, as its only intended for use by those running
Cockroach Cloud and not for on-prem deployments.

Addresses #86316

86774: sql/schemachanger: version gate element creation r=Xiang-Gu a=ajwerner

Commit 1: fix minSupportedVersion of `ADD COLUMN` in new schema changer
from v22.1 to v22.2
Commit 2: We cannot create elements the old version of the code does not know about.

Release justification: fixed mixed version incompatibility
Release note: None

87317: sql: improve and clean up tracing a bit r=yuzefovich a=yuzefovich

**tracing: omit distsql ids from SHOW TRACE**

This commit removes the custom handling of tracing tags with
`cockroach.` prefix when populating the output of SHOW TRACE.
Previously, all tags with this prefix would be included into the "start
span" message, possibly taking up multiple lines in the SHOW TRACE
output. However, there is only one user of those tags - ids of different
components of DistSQL infrastructure, and I don't think it's helpful to
have those ids in the output at all, so this commit removes this ability
and makes the "start span" message nicer.

This special handling was introduced four years ago in
60978aa and at that time there might
have been a reason to have some special handling of these tags (so that
they become visible when viewing the jaeger trace), but that is not
necessary anymore (I believe because we now always propagate all tags
across nodes).

Release justification: low-risk cleanup.

Release note: None

**execinfra: clean up ProcessorBase a bit**

This commit performs the following cleanup:
- it removes the redundant `InternalClose` implementations. At some
point last year an "extended" version was introduced to take in
a closure to be called when the processor is being closed. There is only
one user for that, and it can itself do the necessary cleanup before
calling `InternalClose`
- it removes the update to `rowIdx` of `ProcOutputHelper` (which tracks
how many rows the helper has emitted) when the processor is closed. The
idea behind this was to protect from the future calls to `Next` method
so that the helper doesn't emit more rows once it is closed, but it is
not allowed by the interface anyway - once the processor is closed, no
new calls to `Next` are allowed, so this protection was meaningless.
However, what prompted me to look into this was the fact that the
`rowIdx` field was being set to `MaxInt64` which would trip up the stats
collection change in the following commit.

Release justification: low-risk cleanup.

Release note: None

**sql: improve tracing of some things**

This commit makes it so that we create a tracing span for all
processors. Previously, out of performance considerations, we elided the
spans for the columnarizer, materializer, planNodeToRowSource, and
flowCoordinator, but given the improvements to tracing in the last year
or so it doesn't seem necessary to do that anymore. In particular so
given that we don't create tracing spans by default any way, only when
the tracing is enabled for the statement.

Additionally, this commit adds a couple of tags to the tracing span of
the vectorized outbox (similar to what we have in the row-by-row
engine).

Release justification: low-risk improvement.

Release note: None

87468: clusterversion: require env var to do poison dev upgrades r=dt a=dt

Previously the offsetting of all in-development versions ensured that upgrading to one of these would mark the cluster as untrusted, dev-version-only, however the fact we did not offset already released versions meant that one could perform such an upgrade easily, by simply starting a dev binary in a stable release data directory, as upgrades happen by default automatically. This could lead to an inadvertent and irreversible conversion of a cluster to dev versions.

This changes the behavior to default to offsetting _all_ versions, not just the the new ones, which has the effect of also offset the version _from which_ a binary is willing to upgrade. This significantly reduces the risk of inadvertently upgrading a cluster to a dev version, as by default, the dev version will refuse to start in a release-version's data directory.

In some cases however it is useful to start a custom or development build in an existing data directory, e.g. a snapshot collected from production. For these cases, the env var COCKROACH_UPGRADE_TO_DEV_VERSION can be used to only offset the second defined version and above, meaning that the first version, which is typically the minBinaryVersion, is left alone, and that binary thus considers itself backwards compatible with that older release version and will thus be willing to start in / join that existing cluster.

Release note: none.

Release justification: bug fix in new functionality.

87474: ci: pass custom timeout to testrace in CI r=rickystewart a=healthy-pod

In #86363, we added a timeout to tests at the test binary level. Tests running with `--config=race` however use a custom timeout, different from the original default values set by bazel based on the test size.

This patch propagates those custom values to testrace in CI.

Release justification: Non-production code changes
Release note: None

Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: healthy-pod <[email protected]>
craig bot pushed a commit that referenced this issue Sep 8, 2022
86813: pkg/keys: implement SafeFormatter for roachpb.Key r=andreimatei a=abarganier

Currently, when a key is logged, the entirety of the pretty-printed
key is redacted, hindering observability when parsing through redacted
logs (something that will become more common with upcoming compliance
requirements).

For example, prior to this patch, a pretty-printed key would appear
in the following way for the unredacted/redacted cases, respectively:

	- unredacted: ‹/Table/42/1222/"index key"›
	- redacted:   ‹x›

This patch addresses this by implementing the SafeFormatter interface
for `roachpb.Key` and `roachpb.RKey`, yielding the following result
when looking at the same example above:

	- unredacted: /Table/42/1222/‹"index key"›
	- redacted:   /Table/42/1222/‹x›

While the index key itself remains redacted, the ability to see the
specific table, index, and in the case of tenant tables, the tenant
itself, provides much better observability into which table & index
a log line is referring to than before.

Note that this implementation is only partial. It currently only
supports keys that fall in the `/Table` keyspace for application
tenants and system tenants. Keyspaces such as Meta1, Meta2, Local,
etc. are not yet supported, but can be added with much more ease
in the future now that the example has been set.

Finally, we remove the `maxLen` and related truncation logic from
`StringWithDirs`, as this is no longer used. Furthermore, the
truncation was invalid as it could have truncated a utf-8
sequence in the wrong place, making the result invalid utf-8.

This PR is a continuation of the work originally done by `@kzh` in
#67065. See the original PR for some initial discussions. 

Release note (security update): redacted logs will now reveal
pretty-printed keys, except for the index key values themselves.
For example `/Table/42/1222/‹x›` will be shown instead of `‹x›`
(which was shown previously). This improved redaction is available
for the `/Table` keyspace for both system and application tenants.
Other keyspaces such as `/Meta1`, `/Meta2`, `/Local`, etc. are not
yet supported.

Release justification: low risk, high benefit observability changes

Addresses #86316

87494: authors: add leon.fattakhov to authors r=coolcom200 a=coolcom200

Release note: None
Release justification: non-production code change.

87516: gc: end to end integration test r=tbg a=aliher1911

Currently tests are covering various components, but there's no end to end flow test that would verify that GC will issue GC requests that would successfully remove all relevant data.
This commit adds a TestCluster based test to verify GC behaviour.

Release justification: test only change
Release note: None
Fixes #84988

87527: sql: add query level execution stats to sampled query log r=THardy98 a=THardy98

Partially addresses: #84729

This change adds:
- network bytes sent
- maximum memory usage
- maximum disk usage
- KV bytes read
- KV rows read
- network messages

fields to the `SampledQuery` telemetry log.

Release justification: low risk, high benefit changes to existing functionality

Release note (sql change): This change adds::
- network bytes sent
- maximum memory usage
- maximum disk usage
- KV bytes read
- KV rows read
- network messages fields to the `SampledQuery` telemetry log.

87617: workload/schemachange: propagate worker ID to log r=ajwerner a=ajwerner

The schemachange workload can run with concurrency. It logs what workers do. It was hard to understand what different workers are doing because we never populated the worker ID.

Release note: None

Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Leon Fattakhov <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
@abarganier
Copy link
Contributor Author

Closing for now as we've hit a baseline of improvements. There is still work to be done, which will be handled in other issues (to be created, will link here once that happens).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
Projects
None yet
Development

No branches or pull requests

1 participant