kvfollowerreadsccl: excessive overhead in canSendToFollower #62447
Comments
Assigning this to @nvanbenschoten for now, since he submitted the original PR, but might pick this up myself when I have time.
Nice find! I think what happened here is that we lifted the call to
62377: changefeedccl: allow topic_name parameter for changefeed kafka sinks r=[stevendanna,miretskiy] a=HonoreDB

Previously, changes for a table went to a Kafka topic named for that table, with users only able to specify a prefix. Some users, however, need changes to go to a specific topic, including sometimes the same one for more than one table, distinguishing messages using metadata.

This patch allows the `?topic_name=foo` parameter to be added to Kafka sink URIs. This will override the per-table topic generation, so that changes for every table will all go to the specified topic. It may be used in conjunction with `topic_prefix`, although the distinction is not meaningful.

Release note (enterprise change): Kafka sink URIs now accept the "topic_name" parameter to override per-table topic names.

Closes #59300
Closes #58302

62414: workload/schemachange: add SURVIVE syntax r=ajwerner a=otan

See individual commits for details.

62420: cliccl/load: fix TestLoadShowIncremental typo r=pbardea a=Elliebababa

This patch fixes a typo in TestLoadShowIncremental.

Resolves: #62416

Release note: none

62465: kvccl: re-order enterprise check in canSendToFollower r=nvanbenschoten a=nvanbenschoten

Fixes #62447.

In #62447, Erik found that #59571 had re-ordered the call to `utilccl.CheckEnterpriseEnabled` to occur before checking the batch in `canSendToFollower`, instead of after. This added an error allocation into the hot path of all reads, which showed up in CPU profiles and caused an 8% performance regression on `kv95`.

This commit fixes this by moving the enterprise check back out of the hot path for all non-stale read-only batches.

A follow-up to this PR would be to make `utilccl.CheckEnterpriseEnabled` cheaper by avoiding the error allocation for callers that don't need an error. This work is not done in this commit.

62473: kvserver: unskip TestEagerReplication r=lunevalex a=lunevalex

PR #61847 fixed the flake in TestEagerReplication but was not rebased with master, so the skipped tag was not properly removed. This PR actually unskips TestEagerReplication.

Release note: None

Co-authored-by: Aaron Zinger
Co-authored-by: Oliver Tan
Co-authored-by: elliebababa
Co-authored-by: Nathan VanBenschoten
Co-authored-by: Alex Lunev
`utilccl.CheckEnterpriseEnabled()` is used to check whether a valid enterprise license exists for a given feature. If no valid license is found, it returns an error with specific details. However, `kvccl` used this function in follower read hot paths, and instantiating an error when follower reads are unavailable could have significant overhead -- see e.g. cockroachdb#62447. This patch adds `IsEnterpriseEnabled()`, which has the same behavior as `CheckEnterpriseEnabled()` but returns a boolean instead. This is significantly faster since we can avoid instantiating a custom error each time. `kvccl` is also updated to use this in hot paths. Release note: None
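To make the difference between the two call shapes concrete, here is a minimal sketch. It is not the `utilccl` code: the `license` type and both lowercase functions are hypothetical stand-ins, and the error text is made up for illustration. The point is only that the boolean variant never has to build an error value, while the error-returning variant can be layered on top of it for callers that need the details.

```go
package main

import "fmt"

// license is a hypothetical stand-in for a decoded enterprise license.
type license struct {
	valid bool
}

// isEnterpriseEnabled answers the yes/no question without building an error,
// which is what a hot path wants.
func isEnterpriseEnabled(l *license) bool {
	return l != nil && l.valid
}

// checkEnterpriseEnabled reuses the boolean check and only constructs an
// error on the failure path, for callers that report details to a user.
func checkEnterpriseEnabled(l *license, feature string) error {
	if isEnterpriseEnabled(l) {
		return nil
	}
	return fmt.Errorf("use of %s requires an enterprise license", feature)
}

func main() {
	var none *license
	fmt.Println(isEnterpriseEnabled(none))
	fmt.Println(checkEnterpriseEnabled(none, "follower reads"))
	fmt.Println(isEnterpriseEnabled(&license{valid: true}))
}
```

A hot path would call the boolean function on every request and fall back to the error-returning one only when it actually needs to surface a message.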
Previously, the `utilccl` package would base64-decode the license from the settings representation every time it was needed, which was sufficient for its uses. However, recently there's been a need to check whether enterprise features are enabled in hot paths (e.g. with follower reads as seen in cockroachdb#62447), making the decoding cost too great. This patch adds `decodeCached()` which caches the decoded license, and uses it where appropriate. Release note: None
Previously, the `utilccl` package would decode the license from the base64-encoded Protobuf representation in settings every time it was needed, which was sufficient for its uses. However, recently there's been a need to check whether enterprise features are enabled in hot paths (e.g. with follower reads as seen in cockroachdb#62447), making the decoding cost too great. This patch adds `cluster.Settings.Cache` as a shared cache, and uses it to cache decoded licenses with a private key type. Release note: None
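The caching idea can be sketched roughly as follows. This is an illustration under assumptions, not the actual patch: `settingsCache`, `licenseCacheKey`, `decode`, and `decodedLicense` are hypothetical names, the cache is a plain `sync.Map`, and the "license" is just base64-encoded bytes rather than a Protobuf message. The private key type plays the role the commit message describes, keeping this entry from colliding with entries stored by other packages in a shared cache.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"sync"
)

// settingsCache is a hypothetical stand-in for a shared cache hanging off a
// settings object.
type settingsCache struct {
	m sync.Map
}

// licenseCacheKey is an unexported key type, so a plain string key stored by
// another package can never collide with a cached license.
type licenseCacheKey string

// decodeCalls counts real decodes. It is only for demonstration and is not
// safe for concurrent use.
var decodeCalls int

// decode does the expensive work that the cache is meant to avoid repeating.
func decode(encoded string) ([]byte, error) {
	decodeCalls++
	return base64.StdEncoding.DecodeString(encoded)
}

// decodedLicense returns the cached result for encoded, decoding only on a miss.
func (c *settingsCache) decodedLicense(encoded string) ([]byte, error) {
	key := licenseCacheKey(encoded)
	if v, ok := c.m.Load(key); ok {
		return v.([]byte), nil
	}
	raw, err := decode(encoded)
	if err != nil {
		return nil, err
	}
	c.m.Store(key, raw)
	return raw, nil
}

func main() {
	cache := &settingsCache{}
	encoded := base64.StdEncoding.EncodeToString([]byte("demo-license"))
	for i := 0; i < 3; i++ {
		if _, err := cache.decodedLicense(encoded); err != nil {
			fmt.Println("decode failed:", err)
			return
		}
	}
	fmt.Println("decode calls:", decodeCalls)
}
```

Keying on the encoded string means a changed license setting simply misses the cache and is decoded once; this sketch never evicts old entries, which a real implementation would need to consider.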
Fixes cockroachdb#62447. In cockroachdb#62447, Erik found that cockroachdb#59571 had re-ordered the call to `utilccl.CheckEnterpriseEnabled` to occur before checking the batch in `canSendToFollower`, instead of after. This added an error allocation into the hot path of all reads, which showed up in CPU profiles and caused an 8% performance regression on `kv95`. This commit fixes this by moving the enterprise check back out of the hot-path for all non-stale read-only batches. A follow up to this PR would be to make `utilccl.CheckEnterpriseEnabled` cheaper by avoiding the error allocation for callers that don't need an error. This work is not done in this commit.
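The effect of the ordering can be illustrated with a small sketch. Everything here is a simplified assumption rather than the real `kvccl` code: `batch`, `checkEnterpriseEnabled`, and the two `canSendToFollower...` variants are hypothetical, and `errorAllocs` is a counter added only to make the cost visible. When the license check runs first, every fresh read on an unlicensed cluster pays for an error; when the cheap batch check runs first, only stale read-only batches ever reach the license check.

```go
package main

import (
	"errors"
	"fmt"
)

// batch is a hypothetical stand-in for a KV batch request.
type batch struct {
	readOnly bool
	stale    bool // true if the read is old enough to be served by a follower
}

// errorAllocs counts how many errors checkEnterpriseEnabled has built.
var errorAllocs int

// checkEnterpriseEnabled mimics a license check that builds an error whenever
// no license is present.
func checkEnterpriseEnabled(licensed bool) error {
	if licensed {
		return nil
	}
	errorAllocs++
	return errors.New("use of follower reads requires an enterprise license")
}

// canSendToFollowerLicenseFirst runs the license check for every batch.
func canSendToFollowerLicenseFirst(licensed bool, b batch) bool {
	if checkEnterpriseEnabled(licensed) != nil {
		return false
	}
	return b.readOnly && b.stale
}

// canSendToFollowerBatchFirst rejects ineligible batches before touching the
// license check, keeping it off the path of ordinary reads.
func canSendToFollowerBatchFirst(licensed bool, b batch) bool {
	if !b.readOnly || !b.stale {
		return false
	}
	return checkEnterpriseEnabled(licensed) == nil
}

func main() {
	fresh := batch{readOnly: true, stale: false}

	errorAllocs = 0
	for i := 0; i < 1000; i++ {
		canSendToFollowerLicenseFirst(false, fresh)
	}
	fmt.Println("license check first, errors built:", errorAllocs)

	errorAllocs = 0
	for i := 0; i < 1000; i++ {
		canSendToFollowerBatchFirst(false, fresh)
	}
	fmt.Println("batch check first, errors built:", errorAllocs)
}
```

Both variants return the same answer for every input; only the amount of work done for ineligible batches differs.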
62498: utilccl,kvccl: improve performance when checking enterprise features r=tbg a=erikgrinaker

**utilccl: cache license decoding**

Previously, the `utilccl` package would decode the license from the base64-encoded Protobuf representation in settings every time it was needed, which was sufficient for its uses. However, recently there's been a need to check whether enterprise features are enabled in hot paths (e.g. with follower reads as seen in #62447), making the decoding cost too great.

This patch adds `cluster.Settings.Cache` as a shared cache, and uses it to cache decoded licenses with a private key type.

**utilccl,kvccl: add IsEnterpriseEnabled for faster license checks**

`utilccl.CheckEnterpriseEnabled()` is used to check whether a valid enterprise license exists for a given feature. If no valid license is found, it returns an error with specific details.

However, `kvccl` used this function in follower read hot paths, and instantiating an error when follower reads are unavailable could have significant overhead -- see e.g. #62447.

This patch adds `IsEnterpriseEnabled()`, which has the same behavior as `CheckEnterpriseEnabled()` but returns a boolean instead. This is significantly faster since we can avoid instantiating a custom error each time. `kvccl` is also updated to use this in hot paths.

Resolves #62489.

Release note: None

62642: colserde: fix the edge case with nulls handling r=yuzefovich a=yuzefovich

When serializing the data of Bool, Bytes, Int, and Float types when they don't have any nulls in the vector, we don't explicitly specify the null bitmap. Previously, when deserializing such vectors with no nulls we would simply call `UnsetNulls` on the `coldata.Nulls` object that is currently present. However, it is possible that the already present nulls object cannot support the desired batch length. This could lead to index out of bounds accesses. Note that in the vast majority of cases this likely doesn't happen in practice because we check `MaybeHasNulls`, and that would return `false`, making us omit the null checking code.

Fixes: #62636.

Release note (bug fix): Previously, CockroachDB could encounter an internal error in rare circumstances when executing queries via the vectorized engine that operate on columns of BOOL, BYTES, INT, and FLOAT types that have a mix of NULL and non-NULL values.

62740: workload: add idle-conns flag for adding idle connections to tpcc r=rafiss a=RichardJCai

workload: add idle-conns flag for adding idle connections to tpcc

Release note: None

#62526

62814: tenantrate: add "test" that reports IOPS estimations r=RaduBerinde a=RaduBerinde

This change adds a "test" facility which takes the description of a uniform workload (read percentage, read size, write size) and prints out an estimation of the sustained IOPS and burst IO. This will allow a better understanding of how changes to the settings or the mechanism translate into IOPS changes.

Release note: None

62833: kvserver: deflake TestFollowerReadsWithStaleDescriptor r=aayushshah15 a=aayushshah15

A preceding change (#62696) introduced a flaky update to this test. Prior to that change, this test was using 2 voting replicas, but that change tried to make it use 1 voter and 1 non-voter instead (as a litmus test for the new syntax added in #62696).

The test rebalances a replica away from a node and ensures that a historical read sent immediately afterwards gets re-routed to the leaseholder replica, since the receiving store had its replica destroyed. However, when we're using a non-voter in this test, that non-voter may not have learned about this replication change by the time it receives this historical query, and that fails the assertion.

This commit re-organizes the test and fixes the flake.

Release note: None

62862: testutils: add skip.UnderBazelWithIssue r=rickystewart a=stevendanna

This is to skip individual tests under bazel. This seems a bit more fine-grained than the broken_in_bazel tag in the bazel configuration but also allows us to avoid skipping tests that work outside of bazel in our main test suite.

Release note: None

62877: Added CACHE to SEQUENCE syntax diagrams r=ericharmeling a=ericharmeling

Follow-up of #56954.

Release justification: non-production code changes

Release note: None

62889: colexecerror: catch panics from packages in sql/sem folder r=yuzefovich a=yuzefovich

Previously, we would only catch panics from the `sql/sem/tree` package. Recently sqlsmith encountered a crash because of a panic in the `sql/sem/builtins` package, and I believe it is reasonable to catch panics from that package as well as from `sql/sem/transform`, so we will now be catching based on the `sql/sem` prefix.

Addresses: #62846.

Release note: None

62898: build: install essential build tools in teamcity build agents r=jlinder a=rickystewart

In #62815, we migrated from an alternative way of installing golang, the `longsleep/golang-backports` deb repo, to the currently recommended install method found at https://golang.org/doc/install -- namely, we download a tarball and then just unzip it in the right spot. This works perfectly, *except* that the deb package had a dependency on build tools like `gcc` and `make`, and certain build configurations had come to depend on their global installation (namely, all those that don't use `builder.sh` to run a build). This resulted in a couple of failures being reported:

* https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_ExampleORMs/2834741
* https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Acceptance/2834732

We just install [`build-essential`](https://packages.ubuntu.com/xenial/build-essential) here, which is the easiest way to get all of that stuff.

Release note: None

Co-authored-by: Erik Grinaker
Co-authored-by: Yahor Yuzefovich
Co-authored-by: richardjcai
Co-authored-by: Radu Berinde
Co-authored-by: Aayush Shah
Co-authored-by: Steven Danna
Co-authored-by: Eric Harmeling
Co-authored-by: Ricky Stewart
As seen in #62359, `kvfollowerreadsccl.canSendToFollower` started using a significant amount of resources in the week before Feb 19. The following flame graph shows that this was via `DistSender.sendToReplicas`, with the majority of the time spent in `pgerror.Newf` via `checkEnterpriseEnabled`.

The offending PR was #59571; `kv95` benchmarks show a drop from 60712 ops/s to 55565 ops/s (median over 5 runs) for this commit.