Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvfollowerreadsccl: excessive overhead in canSendToFollower #62447

Closed
erikgrinaker opened this issue Mar 23, 2021 · 2 comments · Fixed by #62465
Closed

kvfollowerreadsccl: excessive overhead in canSendToFollower #62447

erikgrinaker opened this issue Mar 23, 2021 · 2 comments · Fixed by #62465
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@erikgrinaker
Copy link
Contributor

As seen in #62359, kvfollowerreadsccl.canSendToFollower started using a significant amount of resources in the week before Feb 19. The following flame graph shows that this was via DistSender.sendToReplicas, with the majority of the time spent in pgerror.Newf via checkEnterpriseEnabled.

Screenshot 2021-03-23 at 13 45 06

The offending PR was #59571, kv95 benchmarks show a drop from 60712 ops/s to 55565 ops/s (median over 5 runs) for this commit.

@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-distribution Relating to rebalancing and leasing. labels Mar 23, 2021
@erikgrinaker
Copy link
Contributor Author

Assigning this to @nvanbenschoten for now, since he submitted the original PR, but might pick this up myself when I have time.

@nvanbenschoten
Copy link
Member

Nice find! I think what happened here is that we lifted the call to checkEnterpriseEnabled from after checking for a sufficient staleness to before, so we now hit it on all requests. My instinct is that we should both a) re-order the call to checkEnterpriseEnabled and b) make checkEnterpriseEnabled cheaper by avoiding heap allocations.

craig bot pushed a commit that referenced this issue Mar 23, 2021
62377: changefeedccl: allow topic_name parameter for changefeed kafka sinks r=[stevendanna,miretskiy] a=HonoreDB

Previously, changes for a table went to a Kafka topic
named for that table, with users only able to specify a prefix.
Some users, however, need changes to go to a specific topic,
including sometimes the same one for more than one table,
distinguishing messages using metadata.
This patch allows the `?topic_name=foo` parameter to be added
to Kafka sink URIs. This will override the per-table topic
generation, so that changes for every table will all go to
the specified topic. It may be used in conjunction with
`topic_prefix`, although the distinction is not meaningful.

Release note (enterprise change): Kafka sink URIs now accept
the "topic_name" parameter to override per-table topic names.

Closes #59300
Closes #58302


62414: workload/schemachange: add SURVIVE syntax r=ajwerner a=otan

see individual commits for details

62420: cliccl/load: fix TestLoadShowIncremental typo r=pbardea a=Elliebababa

This patch fixs typo of TestLoadShowIncremental.

Resolves: #62416

Release note: none

62465: kvccl: re-order enterprise check in canSendToFollower r=nvanbenschoten a=nvanbenschoten

Fixes #62447.

In #62447, Erik found that #59571 had re-ordered the call to
`utilccl.CheckEnterpriseEnabled` to occur before checking the batch in
`canSendToFollower`, instead of after. This added an error allocation
into the hot path of all reads, which showed up in CPU profiles and
caused an 8% performance regression on `kv95`. This commit fixes this by
moving the enterprise check back out of the hot-path for all non-stale
read-only batches.

A follow up to this PR would be to make `utilccl.CheckEnterpriseEnabled`
cheaper by avoiding the error allocation for callers that don't need an
error. This work is not done in this commit.

62473: kvserver: unskip TestEagerReplication r=lunevalex a=lunevalex

PR #61847 fixed the flake in TestEagerReplication but was not
rebased with master, so the skipped tag was not properly removed.
This PR actually unskips TestEagerReplication.

Release note: None

Co-authored-by: Aaron Zinger <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: elliebababa <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Alex Lunev <[email protected]>
@craig craig bot closed this as completed in b7d2f2e Mar 23, 2021
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this issue Mar 23, 2021
`utilccl.CheckEnterpriseEnabled()` is used to check whether a valid
enterprise license exists for a given feature. If no valid license is
found, it returns an error with specific details.

However, `kvccl` used this function in follower read hot paths, and
instantiating an error when follower reads are unavailable could have
significant overhead -- see e.g. cockroachdb#62447.

This patch adds `IsEnterpriseEnabled()`, which has the same behavior as
`CheckEnterpriseEnabled()` but returns a boolean instead. This is
significantly faster since we can avoid instantiating a custom error
each time. `kvccl` is also updated to use this in hot paths.

Release note: None
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this issue Mar 23, 2021
Previously, the `utilccl` package would base64-decode the license from
the settings representation every time it was needed, which was
sufficient for its uses. However, recently there's been a need to check
whether enterprise features are enabled in hot paths (e.g. with follower
reads as seen in cockroachdb#62447), making the decoding cost too great.

This patch adds `decodeCached()` which caches the decoded license, and
uses it where appropriate.

Release note: None
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this issue Mar 24, 2021
Previously, the `utilccl` package would decode the license from the the
base64-encoded Protobuf representation in settings every time it was
needed, which was sufficient for its uses. However, recently there's
been a need to check whether enterprise features are enabled in hot
paths (e.g. with follower reads as seen in cockroachdb#62447), making the decoding
cost too great.

This patch adds `cluster.Settings.Cache` as a shared cache, and uses it
to cache decoded licenses with a private key type.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 25, 2021
Fixes cockroachdb#62447.

In cockroachdb#62447, Erik found that cockroachdb#59571 had re-ordered the call to
`utilccl.CheckEnterpriseEnabled` to occur before checking the batch in
`canSendToFollower`, instead of after. This added an error allocation
into the hot path of all reads, which showed up in CPU profiles and
caused an 8% performance regression on `kv95`. This commit fixes this by
moving the enterprise check back out of the hot-path for all non-stale
read-only batches.

A follow up to this PR would be to make `utilccl.CheckEnterpriseEnabled`
cheaper by avoiding the error allocation for callers that don't need an
error. This work is not done in this commit.
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this issue Mar 31, 2021
`utilccl.CheckEnterpriseEnabled()` is used to check whether a valid
enterprise license exists for a given feature. If no valid license is
found, it returns an error with specific details.

However, `kvccl` used this function in follower read hot paths, and
instantiating an error when follower reads are unavailable could have
significant overhead -- see e.g. cockroachdb#62447.

This patch adds `IsEnterpriseEnabled()`, which has the same behavior as
`CheckEnterpriseEnabled()` but returns a boolean instead. This is
significantly faster since we can avoid instantiating a custom error
each time. `kvccl` is also updated to use this in hot paths.

Release note: None
erikgrinaker added a commit to erikgrinaker/cockroach that referenced this issue Mar 31, 2021
Previously, the `utilccl` package would decode the license from the the
base64-encoded Protobuf representation in settings every time it was
needed, which was sufficient for its uses. However, recently there's
been a need to check whether enterprise features are enabled in hot
paths (e.g. with follower reads as seen in cockroachdb#62447), making the decoding
cost too great.

This patch adds `cluster.Settings.Cache` as a shared cache, and uses it
to cache decoded licenses with a private key type.

Release note: None
craig bot pushed a commit that referenced this issue Mar 31, 2021
62498: utilccl,kvccl: improve performance when checking enterprise features r=tbg a=erikgrinaker

**utilccl: cache license decoding**

Previously, the `utilccl` package would decode the license from the the
base64-encoded Protobuf representation in settings every time it was
needed, which was sufficient for its uses. However, recently there's
been a need to check whether enterprise features are enabled in hot
paths (e.g. with follower reads as seen in #62447), making the decoding
cost too great.

This patch adds `cluster.Settings.Cache` as a shared cache, and uses it
to cache decoded licenses with a private key type.

**utilccl,kvccl: add IsEnterpriseEnabled for faster license checks**

`utilccl.CheckEnterpriseEnabled()` is used to check whether a valid
enterprise license exists for a given feature. If no valid license is
found, it returns an error with specific details.

However, `kvccl` used this function in follower read hot paths, and
instantiating an error when follower reads are unavailable could have
significant overhead -- see e.g. #62447.

This patch adds `IsEnterpriseEnabled()`, which has the same behavior as
`CheckEnterpriseEnabled()` but returns a boolean instead. This is
significantly faster since we can avoid instantiating a custom error
each time. `kvccl` is also updated to use this in hot paths.

Resolves #62489.

Release note: None

62642: colserde: fix the edge case with nulls handling r=yuzefovich a=yuzefovich

When serializing the data of Bool, Bytes, Int, and Float types when they
don't have any nulls in the vector, we don't explicit specify the null
bitmap. Previously, when deserializing such vectors with no nulls we
would simply call `UnsetNulls` on the `coldata.Nulls` object that is
currently present. However, it is possible that already present nulls
object cannot support the desired batch length. This could lead to index
out of bounds accesses. Note that in the vast majority of cases this
likely doesn't happen in practice because we check `MaybeHasNulls`, and
that would return `false` making us omit the null checking code.

Fixes: #62636.

Release note (bug fix): Previously, CockroachDB could encounter an
internal error in rare circumstances when executing queries via the
vectorized engine that operate on columns of BOOL, BYTES, INT, and FLOAT
types that have a mix of NULL and non-NULL values.

62740: workload: add idle-conns flag for adding idle connections to tpcc r=rafiss a=RichardJCai

workload: add idle-conns flag for adding idle connections to tpcc

Release note: None

#62526

62814: tenantrate: add "test" that reports IOPS estimations r=RaduBerinde a=RaduBerinde

This change adds a "test" facility which takes the description of a
uniform workload (read percentage, read size, write size) and prints
out an estimation of the sustained IOPS and burst IO. This will allow
a better understanding of how changes to the settings or the mechanism
translate into IOPS changes.

Release note: None

62833: kvserver: deflake TestFollowerReadsWithStaleDescriptor r=aayushshah15 a=aayushshah15

A preceding change (#62696) introduced a flakey update to this test.
Prior to that change, this test was using 2 voting replicas but that
change tried to make it use 1 voter and 1 non-voter instead (as a litmus
test for the new syntax added in #62696).

The test rebalances a replica away from a node and ensures that a
historical read sent immediately afterwards gets re-routed to the
leaseholder replica, since the receiving store had its replica
destroyed. However, when we're using a non-voter in this test, that
non-voter may not have learned about this replication change by the time
it receives this historical query and that fails the assertion.

This commit re-organizes the test and fixes the flake.

Release note: None

62862: testutils: add skip.UnderBazelWithIssue r=rickystewart a=stevendanna

This is to skip individual tests under bazel. This seems a bit more
fine-grained than the broken_in_bazel tag in the bazel configuration
but also allows us to avoid skipping tests that work outside of bazel
in our main test suite.

Release note: None

62877: Added CACHE to SEQUENCE syntax diagrams r=ericharmeling a=ericharmeling

Follow-up of #56954.

Release justification: non-production code changes

Release note: None

62889: colexecerror: catch panics from packages in sql/sem folder r=yuzefovich a=yuzefovich

Previously, we would only catch panics from `sql/sem/tree` package.
Recently sqlsmith encountered a crash because of a panic in
`sql/sem/builtins` package, and I believe it is reasonable to catch
panics from that package as well as from `sql/sem/transform`, so we will
now be catching based on `sql/sem` prefix.

Addresses: #62846.

Release note: None

62898: build: install essential build tools in teamcity build agents r=jlinder a=rickystewart

In #62815, we migrated from an alternative way of installing golang, the
`longsleep/golang-backports` deb repo, to the currently recommended
install method found at https://golang.org/doc/install -- namely, we
download a tarball and then just unzip it in the right spot. This
works perfectly, *except* that the deb package had a dependency on build
tools like `gcc` and `make`, and certain build configurations had come
to depend on their global installation (namely, all those that don't use
`builder.sh` to run a build). This resulted in a couple of failures
being reported:

* https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_ExampleORMs/2834741
* https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Acceptance/2834732

We just install [`build-essential`](https://packages.ubuntu.com/xenial/build-essential)
here, which is the easiest way to get all of that stuff.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: richardjcai <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Aayush Shah <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants