-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
log redaction: reduce overly conservative log redaction policies #86316
Comments
NB: Our support staff has graciously provided me with a list of specific log lines where unnecessary redaction is taking place. This list is well-scoped & actionable, so I will begin working through the list as a first iteration of improvements. |
86475: cli: support `COCKROACH_REDACTION_POLICY_MANAGED` env var r=knz a=abarganier Currently, log redaction policies have no way to discern their own runtime environment. Logged objects that may be considered sensitive and unsafe in on-prem deployments of CockroachDB might be otherwise safe when we're running within a managed service such as Cockroach Cloud. For example, CLI argument lists included as part of the `cockroach start` command are already known to those operating the managed service, so there's no reason we should be redacting this information from logs in this case. This patch adds the `--managed` flag to the start commands. This flag is plumbed through to the global logging config object where the log package has access to it. We also introduce `log.SafeManaged(s interface{})`, which conditionally marks an object with `redact.Safe()` depending on whether or not we are running as a managed service. This is inspired by the original `log.SafeOperational(s interface{})` function. I believe that this new `--managed` flag should not be advertised in our public documentation, as its intended use is for those running Cockroach Cloud. Release justification: low-risk, high benefit changes to existing functionality. The new CLI flag has a minimal impact on DB operations and provides high value reduction of log redaction, which will be necessary for support staff with our latest compliance requirements. Release note (cli change): `cockroach start` commands now have an additional `--managed` flag that can be used to indicate whether or not the node is running as part of a managed service (e.g. Cockroach Cloud). Perhaps this shouldn't be advertised in our public facing docs, as its only intended for use by those running Cockroach Cloud and not for on-prem deployments. Addresses #86316 86774: sql/schemachanger: version gate element creation r=Xiang-Gu a=ajwerner Commit 1: fix minSupportedVersion of `ADD COLUMN` in new schema changer from v22.1 to v22.2 Commit 2: We cannot create elements the old version of the code does not know about. Release justification: fixed mixed version incompatibility Release note: None 87317: sql: improve and clean up tracing a bit r=yuzefovich a=yuzefovich **tracing: omit distsql ids from SHOW TRACE** This commit removes the custom handling of tracing tags with `cockroach.` prefix when populating the output of SHOW TRACE. Previously, all tags with this prefix would be included into the "start span" message, possibly taking up multiple lines in the SHOW TRACE output. However, there is only one user of those tags - ids of different components of DistSQL infrastructure, and I don't think it's helpful to have those ids in the output at all, so this commit removes this ability and makes the "start span" message nicer. This special handling was introduced four years ago in 60978aa and at that time there might have been a reason to have some special handling of these tags (so that they become visible when viewing the jaeger trace), but that is not necessary anymore (I believe because we now always propagate all tags across nodes). Release justification: low-risk cleanup. Release note: None **execinfra: clean up ProcessorBase a bit** This commit performs the following cleanup: - it removes the redundant `InternalClose` implementations. At some point last year an "extended" version was introduced to take in a closure to be called when the processor is being closed. There is only one user for that, and it can itself do the necessary cleanup before calling `InternalClose` - it removes the update to `rowIdx` of `ProcOutputHelper` (which tracks how many rows the helper has emitted) when the processor is closed. The idea behind this was to protect from the future calls to `Next` method so that the helper doesn't emit more rows once it is closed, but it is not allowed by the interface anyway - once the processor is closed, no new calls to `Next` are allowed, so this protection was meaningless. However, what prompted me to look into this was the fact that the `rowIdx` field was being set to `MaxInt64` which would trip up the stats collection change in the following commit. Release justification: low-risk cleanup. Release note: None **sql: improve tracing of some things** This commit makes it so that we create a tracing span for all processors. Previously, out of performance considerations, we elided the spans for the columnarizer, materializer, planNodeToRowSource, and flowCoordinator, but given the improvements to tracing in the last year or so it doesn't seem necessary to do that anymore. In particular so given that we don't create tracing spans by default any way, only when the tracing is enabled for the statement. Additionally, this commit adds a couple of tags to the tracing span of the vectorized outbox (similar to what we have in the row-by-row engine). Release justification: low-risk improvement. Release note: None 87468: clusterversion: require env var to do poison dev upgrades r=dt a=dt Previously the offsetting of all in-development versions ensured that upgrading to one of these would mark the cluster as untrusted, dev-version-only, however the fact we did not offset already released versions meant that one could perform such an upgrade easily, by simply starting a dev binary in a stable release data directory, as upgrades happen by default automatically. This could lead to an inadvertent and irreversible conversion of a cluster to dev versions. This changes the behavior to default to offsetting _all_ versions, not just the the new ones, which has the effect of also offset the version _from which_ a binary is willing to upgrade. This significantly reduces the risk of inadvertently upgrading a cluster to a dev version, as by default, the dev version will refuse to start in a release-version's data directory. In some cases however it is useful to start a custom or development build in an existing data directory, e.g. a snapshot collected from production. For these cases, the env var COCKROACH_UPGRADE_TO_DEV_VERSION can be used to only offset the second defined version and above, meaning that the first version, which is typically the minBinaryVersion, is left alone, and that binary thus considers itself backwards compatible with that older release version and will thus be willing to start in / join that existing cluster. Release note: none. Release justification: bug fix in new functionality. 87474: ci: pass custom timeout to testrace in CI r=rickystewart a=healthy-pod In #86363, we added a timeout to tests at the test binary level. Tests running with `--config=race` however use a custom timeout, different from the original default values set by bazel based on the test size. This patch propagates those custom values to testrace in CI. Release justification: Non-production code changes Release note: None Co-authored-by: Alex Barganier <[email protected]> Co-authored-by: Xiang Gu <[email protected]> Co-authored-by: Andrew Werner <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: David Taylor <[email protected]> Co-authored-by: healthy-pod <[email protected]>
86813: pkg/keys: implement SafeFormatter for roachpb.Key r=andreimatei a=abarganier Currently, when a key is logged, the entirety of the pretty-printed key is redacted, hindering observability when parsing through redacted logs (something that will become more common with upcoming compliance requirements). For example, prior to this patch, a pretty-printed key would appear in the following way for the unredacted/redacted cases, respectively: - unredacted: ‹/Table/42/1222/"index key"› - redacted: ‹x› This patch addresses this by implementing the SafeFormatter interface for `roachpb.Key` and `roachpb.RKey`, yielding the following result when looking at the same example above: - unredacted: /Table/42/1222/‹"index key"› - redacted: /Table/42/1222/‹x› While the index key itself remains redacted, the ability to see the specific table, index, and in the case of tenant tables, the tenant itself, provides much better observability into which table & index a log line is referring to than before. Note that this implementation is only partial. It currently only supports keys that fall in the `/Table` keyspace for application tenants and system tenants. Keyspaces such as Meta1, Meta2, Local, etc. are not yet supported, but can be added with much more ease in the future now that the example has been set. Finally, we remove the `maxLen` and related truncation logic from `StringWithDirs`, as this is no longer used. Furthermore, the truncation was invalid as it could have truncated a utf-8 sequence in the wrong place, making the result invalid utf-8. This PR is a continuation of the work originally done by `@kzh` in #67065. See the original PR for some initial discussions. Release note (security update): redacted logs will now reveal pretty-printed keys, except for the index key values themselves. For example `/Table/42/1222/‹x›` will be shown instead of `‹x›` (which was shown previously). This improved redaction is available for the `/Table` keyspace for both system and application tenants. Other keyspaces such as `/Meta1`, `/Meta2`, `/Local`, etc. are not yet supported. Release justification: low risk, high benefit observability changes Addresses #86316 87494: authors: add leon.fattakhov to authors r=coolcom200 a=coolcom200 Release note: None Release justification: non-production code change. 87516: gc: end to end integration test r=tbg a=aliher1911 Currently tests are covering various components, but there's no end to end flow test that would verify that GC will issue GC requests that would successfully remove all relevant data. This commit adds a TestCluster based test to verify GC behaviour. Release justification: test only change Release note: None Fixes #84988 87527: sql: add query level execution stats to sampled query log r=THardy98 a=THardy98 Partially addresses: #84729 This change adds: - network bytes sent - maximum memory usage - maximum disk usage - KV bytes read - KV rows read - network messages fields to the `SampledQuery` telemetry log. Release justification: low risk, high benefit changes to existing functionality Release note (sql change): This change adds:: - network bytes sent - maximum memory usage - maximum disk usage - KV bytes read - KV rows read - network messages fields to the `SampledQuery` telemetry log. 87617: workload/schemachange: propagate worker ID to log r=ajwerner a=ajwerner The schemachange workload can run with concurrency. It logs what workers do. It was hard to understand what different workers are doing because we never populated the worker ID. Release note: None Co-authored-by: Alex Barganier <[email protected]> Co-authored-by: Leon Fattakhov <[email protected]> Co-authored-by: Oleg Afanasyev <[email protected]> Co-authored-by: Thomas Hardy <[email protected]> Co-authored-by: Andrew Werner <[email protected]>
Closing for now as we've hit a baseline of improvements. There is still work to be done, which will be handled in other issues (to be created, will link here once that happens). |
Is your feature request related to a problem? Please describe.
Support staff have long relied on
debug zip
bundles to help folks running CockroachDB debug & diagnose issues. A key piece of these debug zip bundles are the dumps of all the logs available on each node. These logs are a core piece of observability data for operators and support staff alike.However, as CockroachDB matures, so do the compliance policies that it must meet. With compliance comes a need for us to expand our redaction to all sources of data that flow out of CockroachDB, especially those used by folks who don't actually own the running cluster (e.g. support staff).
debug zip
has recently been updated to support a--redact
flag, which has taken over from the now deprecated--redact-logs
flag. A redacted debug zip will redact all of its logs - a feature that has existed for some time now, but has not been utilized much as our compliance policies didn't mandate it.Today, our compliance policies are expanding, and in the future a redacted
debug zip
bundle may be all that our support staff are able to gain access to.Unfortunately, support has made it very clear that redacted logs as they stand today are insufficient for debugging & diagnosing support escalations. Redacted logs feel "overly redacted" to the point where operational information that is unrelated to customer data (e.g. NodeIDs, operation priorities, ClusterID, queries that already have constants stripped, etc).
Describe the solution you'd like
If in the future we are going to enforce that our support staff is only able to access logs in redacted form, we need to make sure that those logs are redacting data thoughtfully, as opposed to the overly-conservative redaction of today.
We need to define what "sensitive data" means in the context of log redaction, and draw clear lines around what's redacted & what's left unredacted. We can then make targeted improvements to log redaction by implementing SafeFormatter where appropriate, or explicitly marking objects interpolated into logs as "safe".
Additional context
In other redaction work aimed at meeting compliance requirements, all customer data has been identified as "sensitive". The exception here is key data, which may contain row-level data but has been deemed necessary to support CockroachDB. Can the same definitions be applied to log redaction?
Jira issue: CRDB-18694
Epic CRDB-12732
The text was updated successfully, but these errors were encountered: