Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logictest: improve observability of TeamCity timeouts #78185

Closed
mgartner opened this issue Mar 21, 2022 · 3 comments · Fixed by #86363
Closed

logictest: improve observability of TeamCity timeouts #78185

mgartner opened this issue Mar 21, 2022 · 3 comments · Fixed by #86363
Assignees
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-quick-win Likely to be a quick win for someone experienced. T-sql-queries SQL Queries Team

Comments

@mgartner
Copy link
Collaborator

mgartner commented Mar 21, 2022

It's possible for logic tests to run longer than the time allotted for them in TC, in which case TC abruptly kills the tests and prints a timeout message to the log:

-- Test timed out at 2022-03-17 20:31:33 UTC --

Example Log

From the logs, it's unclear which test(s) were running when the timeout occurred. We should ensure that go test timeouts first, so we get an informative stack trace that will help track down the issue.

Jira issue: CRDB-14009
Epic CRDB-8349

@mgartner mgartner added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Mar 21, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Mar 21, 2022
@mgartner mgartner added the E-quick-win Likely to be a quick win for someone experienced. label Mar 21, 2022
@mgartner
Copy link
Collaborator Author

Until this is fixed, you can use awk and a database to try to find the running tests at the time of the timeout:

cat test.log | awk '/RUN|done/ { if ($3 == "done:") { n = split($4, a, "/"); split($7, c, ":"); print "DONE" "," c[1] "/" a[n] } else { split($3, b, "/"); print $2 "," b[2] "/" b[3] } }' > running_tests.csv

Import running_tests.csv into a table like:

CREATE TABLE running_tests (status TEXT, name TEXT)

And run the query to find tests that started but did not finish:

SELECT name
FROM running_tests t
WHERE status = 'RUN' AND NOT EXISTS (
  SELECT * FROM running_tests s
  WHERE status = 'DONE' AND t.name = s.name
)

@tbg
Copy link
Member

tbg commented Mar 21, 2022

I mislead you a bit - in the above failure, bazel actually timed out the test. So maybe one ask is for bazel to send a signal before killing, and then the signal can dump stacks and print auxiliary information.

@ajwerner
Copy link
Contributor

ajwerner commented Aug 3, 2022

It'd be amazing to get stack traces.

@healthy-pod healthy-pod self-assigned this Aug 17, 2022
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this issue Aug 18, 2022
Previously, bazel will kill the test process if it goes
beyond the timeout duration set for its size. This prevented
us from knowing which tests timed out and also prevented us
from getting their stack traces.

This patch causes the `go test` process to timeout before
bazel kills it to allow us to know which test timed out and
get its stack trace.

Closes cockroachdb#78185

Release justification: Non-production code changes
Release note: None
healthy-pod pushed a commit to healthy-pod/cockroach that referenced this issue Aug 18, 2022
Previously, bazel will kill the test process if it goes
beyond the timeout duration set for its size. This prevented
us from knowing which tests timed out and also prevented us
from getting their stack traces.

This patch causes the `go test` process to timeout before
bazel kills it to allow us to know which test timed out and
get its stack trace.

Closes cockroachdb#78185

Release justification: Non-production code changes
Release note: None
craig bot pushed a commit that referenced this issue Aug 31, 2022
86363: bazel: make `go test` process timeout before bazel kills it r=rickystewart a=healthy-pod

Previously, bazel will kill the test process if it goes
beyond the timeout duration set for its size. This prevented
us from knowing which tests timed out and also prevented us
from getting their stack traces.

This patch causes the `go test` process to timeout before
bazel kills it to allow us to know which test timed out and
get its stack trace.

Closes #78185

Release justification: Non-production code changes
Release note: None

87067: descs,catkv: rewrite Collection backend r=postamar a=postamar

This rewrite was motivated by the upcoming need to validate descriptors
up to different levels depending on why they were being read. Doing so
requires tracking the validation level in a more granular fashion within
the Collection's caches. Trying to make this work within the existing
scheme where the uncommitted and stored descriptors are all in the same
layer turned out to be painful and hard to reason about.

This commit does several things:
1. The uncommitted descriptors are spun off into their own distinct layer
again.
2. The stored descriptors now mirror what's in storage and so that whole
layer could effectively be moved to catkv.
3. Execution flow through the layers in the Collection is now
straightforwardly implemented and easy to reason about. Validation and
hydration now take place in unique, well-defined steps at the very end
of a lookup, be it by name or by ID.
3. Instances of this new stored descriptors layer are now also used by
the lease manager, the Direct() interface provided by the Collection,
etc.
4. As a result the logic in catkv could be significantly cleaned up.
In particular, lookups for descriptor validation are significantly
less convoluted now.
5. The SystemNamespaceCache object was also moved to catkv and now
supports caching descriptors, and so was renamed to
SystemCatalogCache.

Despite significant rewriting of the inner workings of the descriptors
collection, this commit should not functionally change anything.

Informs #85263.

Release justification: low-risk, high-benefit change.
Release note: None


87103: storage: fix mvcc stats on gc of range tombstone over del r=erikgrinaker a=aliher1911

Previously if range tombstone was placed over delete, GC would
not correctly update GC bytes age incorrectly using range tombstone ts
as age of underlying tombstone.

Release justification: Bugfix
Release note: None

Fixing: #87042

87212: ci: set appropriate test tmpdir for compose test r=srosenberg a=rickystewart

Otherwise the tmpdir defaults to `/artifacts` which is non-existent
outside of a Docker container.

Release justification: Non-production code changes
Release note: None

87229: storage: fix leaked iterators in unit tests r=erikgrinaker a=jbowens

Fix a couple instances of leaked iterators in unit tests in the storage
package.

This should also resolve the goroutine leak observed in #86256.

Informs #71481.

Release justification: Non-production code changes
Release note: None

Co-authored-by: healthy-pod <[email protected]>
Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: Andrew Werner <[email protected]>
Co-authored-by: Oleg Afanasyev <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
@craig craig bot closed this as completed in 26192d2 Aug 31, 2022
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) E-quick-win Likely to be a quick win for someone experienced. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants