Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql stats: extend timeout for tests for combinedstmts endpoints #109184

Closed
gtr opened this issue Aug 21, 2023 · 0 comments · Fixed by #109381
Closed

sql stats: extend timeout for tests for combinedstmts endpoints #109184

gtr opened this issue Aug 21, 2023 · 0 comments · Fixed by #109381
Assignees
Labels
A-cluster-observability Related to cluster observability C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test

Comments

@gtr
Copy link
Contributor

gtr commented Aug 21, 2023

Describe the problem

Currently, tests which hit the combinedstmts endpoint will fail under stress. The current http client for the GetJSONProtoWithAdminOption test function is set to 10s, which isn't enough for mosts tests. The http client should have its timeout extended or provide an option to extend to a sufficient duration.

Jira issue: CRDB-30814

@gtr gtr added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cluster-observability A-cluster-observability Related to cluster observability labels Aug 21, 2023
gtr added a commit to gtr/cockroach that referenced this issue Aug 22, 2023
under stress

This commit skips tests which hit the `combinedStmts` or `statements`
endpoints which will sometimes timeout under stress as a result of
recent backend changes. The test investigation is tracked by cockroachdb#109184.

Release note: None
@gtr gtr self-assigned this Aug 23, 2023
craig bot pushed a commit that referenced this issue Aug 23, 2023
…109356

108485: github: code coverage workflows r=RaduBerinde a=RaduBerinde

This change adds two GitHub Action workflows which run on each PR. One generates unit test code coverage data, and one publishes that data to a GCS bucket from where Reviewable can access it.

We generate coverage data using `bazel coverage`, but we restrict it to only test the packages that have been modified by the PR.

Two workflows are required for security (the first workflow runs potentially malicious code from a fork); for more details, see https://securitylab.github.com/research/github-actions-preventing-pwn-requests/

Epic: none
Release note: None

109036: roachprod: better determination if scp -R flag can be used r=RaduBerinde a=RaduBerinde

When uploading a file to a cluster, we use the "tree dist" algorithm by default. This uploads the file to a single node, then we copy the file from that node to the other nodes (up to 10).

This only makes sense if the remote-to-remote transfers can happen directly, which only happens if we pass the `-R -A` flags to `scp`. Unfortunately older versions don't support these flags. Currently the flags are only passed if the OS is `darwin`.

This commits improves the determination - we run `ssh -V` (once) and check if the `SSL` major version is three. For reference, some examples of what `ssh -V` returns:
 - recent MacOSX: `OpenSSH_9.0p1, LibreSSL 3.3.6`
 - Ubuntu 22.04: `OpenSSH_8.9p1 Ubuntu-3ubuntu0.3, OpenSSL 3.0.2 15 Mar 2022`

In addition, if the version is not 3, we disable the use of "tree dist".

Epic: none
Release note: None

109260: sql stats: skip tests hitting combinedstmts and statements endpoints r=gtr a=gtr

Part of #109184.

This commit skips tests which hit the `combinedStmts` or `statements` endpoints which will sometimes timeout under stress as a result of recent backend changes. The test investigation is tracked by #109184.

Release note: None

109288: dev: error when trying to `dev test` a bazel tested target r=rickystewart a=liamgillies

Running `dev test` on these integration tests will always fail, so this PR adds a error when running the command on those files.

Fixes: #107813
Release note: None

109292: sql: fix expected batch count for edge case in copy test r=rharding6373 a=rharding6373

In TestLargeDynamicRows we test that 4 rows of data can fit in a batch size of at least 4 rows given default memory sizes. However, when we set the batch row size to the minimum value of 4, the test hook that counts batches counts an extra empty batch. This PR changes adjusts the minimum row size to 5 for the purposes of this test.

Epic: None
Fixes: #109134

Release note: None

109324: build: update bazel builder build docs r=rickystewart a=rail

Previously, the documentation described a manual build of the `bazelbuilder` docker image. The current approach is to use CI to build the image.

This PR updates the documentation to reflect the current process, including the FIPS image steps.

Epic: none
Release note: None

109340: changefeedccl: move node drain handling logic out of kvfeed r=miretskiy a=jayshrivastava

Previously, the kvfeed was responsible for monitoring for
node drains using a goroutine. This change moves this logic
into the change aggregator and removes the goroutine.
Overall, this change makes the code more organized and performant.

This change was inspired by work being done for #109167. The
work in that PR requires being able to restart the kvfeed.
Having drain logic intermingled with the kvfeed makes
restarts much more complex, hard to review, prone to bugs, etc.

Informs: #96953
Release note: None
Epic: None

109349: kv: wait on latches on each key in reverse acquisition order r=arulajmani,kvoli a=nvanbenschoten

This commit allocates latch IDs from the top of the uint64 space and in reverse order. This is done to order latches in the tree on a same key in reverse order of acquisition. Doing so ensures that when we iterate over the tree and see a key with many conflicting latches, we visit the latches on that key in the reverse order that they will be released. In doing so, we minimize the number of open channels that we wait on (calls to `waitForSignal`) and minimize the number of goroutine scheduling points. This is important to avoid spikes in runnable goroutine after each request completes, which can negatively affect node health.

See experiments below.

Epic: None
Release note (performance improvement): The impact of high concurrency blind writes to the same key on goroutine scheduling latency was reduced.

109356: build: explicitly set SKIP_LABEL_TEST_FAILURE in compose.sh r=rickystewart a=chrisseto

Previously, `SKIP_LABEL_TEST_FAILURE` was being set via a teamcity configuration. This change was quite opaque as the majority of CI configuration for Cockroach is stored as shell scripts within its repo. This commit follows that pattern by explicitly setting `SKIP_LABEL_TEST_FAILURE` in the script that runs `TestComposeCompare`.

Epic: None
Release note: None

Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: gtr <[email protected]>
Co-authored-by: Liam Gillies <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Chris Seto <[email protected]>
gtr added a commit to gtr/cockroach that referenced this issue Aug 24, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and additonally introduces a new
function `GetStatusJSONProtoWithAdminAndTimeoutOption` which provides a
parameter to add additional time to the http client's timeout. The
default timeout is 10 seconds. 20 seconds is added for tests executed
under stress, which should be enough time for these tests to pass.

Release note: None
gtr added a commit to gtr/cockroach that referenced this issue Aug 25, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and additonally introduces a new
function `GetStatusJSONProtoWithAdminAndTimeoutOption` which provides a
parameter to add additional time to the http client's timeout. The
default timeout is 10 seconds. 20 seconds is added for tests executed
under stress, which should be enough time for these tests to pass.

Release note: None
gtr added a commit to gtr/cockroach that referenced this issue Aug 28, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 15s.

Release note: None
gtr added a commit to gtr/cockroach that referenced this issue Aug 28, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 15s.

Release note: None
gtr added a commit to gtr/cockroach that referenced this issue Aug 28, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 30s.

Release note: None
gtr added a commit to gtr/cockroach that referenced this issue Aug 28, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 30s.

Release note: None
gtr added a commit to gtr/cockroach that referenced this issue Aug 28, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 30s.

Release note: None
gtr added a commit to gtr/cockroach that referenced this issue Aug 29, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 30s.

Release note (bug fix): unskip tests related to a fix to a bug reported
by a customer
craig bot pushed a commit that referenced this issue Aug 29, 2023
109316: asim: add assertion, mutation-assertion events r=kvoli a=wenyihu6

Previously, the simulator relied on test code to create functions to be called
for delayed events. As we want to introduce more dynamic events through the
randomized testing framework, this approach becomes increasingly complex and
error-prone. In addition, we want to introduce new event types, such as
assertion events and mutation-assertion events to validate the simulation’s
state and verify the effects of prior mutation events.

To achieve these goals, the following changes were made in this patch:
1. Instead of directly adding functions to be invoked to simulator’s
delayEventList, events are now organized into different struct
(`SetSpanConfigEvent`, `AddNodeEvent`, `SetNodeLivenessEvent`, and
`SetCapacityOverrideEvent`). Each struct is equipped to generate its
corresponding function which can be called externally for event execution. These
events can be scheduled with `StaticEvents` using `ScheduleEvent` method with
just the required parameters.
2. `AssertionEvent` struct can now be used to schedule assertion events as part
of the simulation using `ScheduleEvent`.
3. `MutationWithAssertionEvent` can also be used now to schedule mutation events
coupled with subsequent assertion events using the
`ScheduleMutationWithAssertionEvent` method.

Under the hood, these events all implement the event interface. This interface
outlines
1. Func() returns a function that can be called externally.
2. String() returns a formatted output string. To accommodate varying function
types returned by `Func()`, we introduced another interface for different
function types to be used as an event func.

When event executor executes these events at their scheduled time, it retrieves
the functions calling Func() and calls them with the simulator's current state
and history.

Part of: #106192
Release note: none

109381: sqlstats: unskip tests hitting combinedstmts and stmts endpoints r=gtr a=gtr

Fixes: #109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it.

Release note: None

109627: opt: fix bugs in plan gist decoding r=mgartner a=mgartner

#### opt: fix plan gist decoding of inverted filters

Details about inverted filter nodes are not encoded in plan gists. The
plan gist decoder incorrectly assumed there were some details encoded,
and would raise an internal error whenever decoding a plan gist with an
inverted filter. This commit fixes the incorrect assumption to prevent
the internal error.

Fixes #108979

There is no release not because plan gists are an undocumented feature.

Release note: None

#### opt: fix plan gist decoding internal error

This commit fixes some cases where `crdb_internal.decode_plan_gist`
could raise internal index-out-of-bound errors when given incorrectly
formed input.

Fixes #109560

Release note: None


Co-authored-by: wenyihu6 <[email protected]>
Co-authored-by: gtr <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
@craig craig bot closed this as completed in 27ed761 Aug 29, 2023
gtr added a commit to gtr/cockroach that referenced this issue Sep 8, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 30s.

Release note: None
gtr added a commit to gtr/cockroach that referenced this issue Sep 8, 2023
Fixes: cockroachdb#109184.

Previously, tests which hit the `combinedstmts` and `statements`
endpoints were skipped under stress because they would occaisonally
fail. This commit unskips these tests and instead of unmarshalling the
metadata JSON blob, the select query directly extracts the values needed
from it. The http client's timeout is also increased by 30s.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cluster-observability Related to cluster observability C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants