Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/opt/exec/execbuilder: TestExecBuild failed #75497

Closed
cockroach-teamcity opened this issue Jan 25, 2022 · 3 comments · Fixed by #75515
Closed

sql/opt/exec/execbuilder: TestExecBuild failed #75497

cockroach-teamcity opened this issue Jan 25, 2022 · 3 comments · Fixed by #75515
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-queries SQL Queries Team

Comments

@cockroach-teamcity
Copy link
Member

sql/opt/exec/execbuilder.TestExecBuild failed with artifacts on master @ 236797939ac41bb759feee9bb676a5d6ca381237:

            │
            └── • filter
                │ filter: st_coveredby('01040000000200000001010000009A999999999901409A99999999990140010100000000000000000008400000000000000840', geom)
                │
                └── • index join
                    │ table: geo_table@geo_table_pkey
                    │
                    └── • inverted filter
                        │ inverted column: geom_inverted_key
                        │ num spans: 31
                        │
                        └── • scan
                              missing stats
                              table: geo_table@geom_index
                              spans: 31 spans
            ·
            Diagram: https://cockroachdb.github.io/distsqlplan/decode.html#eJyUlF9v2jwUxu_fT2Gdm4JeF2wndKuv6J90y0ShC0xbNaMqxWddVIgz21RUFd99Cuna0opQfGE4x_6dk-eR7Qdwf6YgIfpx0TuK-6RxGg9Hw6-9JhlGvehkRG7JWTI4Jzdornx6PUXy_XOURMT5q4m5Q4v6-r6xd_6tN4ovBnF_1GiIliCiJZqUNIIWI0GLNZt7Un6KBufRKLmkZalZkwyS0yghx5fkFijkRmM_naED-RM4UBAwplBYM0HnjC3TD6tNsV6AZBSyvJj7Mj2mMDEWQT6Az_wUQULf7Jui3QEKGn2aTVfblhTM3D9Dzqc3CPJgSV8U5vWFR6X-BFONts3WysOTPd1S3VWWa1wAhRMznc9yJ8ltJRsoDIu0TLQVHCu1-KWVWnCm1IJtm2B_V4YrIGmuScCI8b_ROthkA9_Fhji_Q-tRn2VTjxZtm6978W89WhSWmJx0uSSuVE2cT62XKxXBh45STDClGNs2AcFc74qV4l-ppzCYe0m6fKMPYhcfvpgsfzwNYtNpKGw2S-39c2vaFRu7B7t0f3I_WO9d5eWrC8o4C1k1xOMvZ7z6c3h0-DQYD9_EzzvXxsfwTbwnX97zrmi-w_JwF9FDYz3adrguucv_31i-s1Z-y8uRoCtM7vBdTwdbjimgvsHqdXJmbid4Yc1k1aYKBytuldDofLV6UAVxXi2VH_gS5rWwqIdFLRzUw0EtHNbDYS3cqYc7tTB7BY-X__0NAAD__xVlF_8=
            
        but found (query options: "") :
            distribution: local
            vectorized: true
            ·
            • sort
            │ order: +k
            │
            └── • filter
                │ filter: st_coveredby('01040000000200000001010000009A999999999901409A99999999990140010100000000000000000008400000000000000840', geom)
                │
                └── • index join
                    │ table: geo_table@geo_table_pkey
                    │
                    └── • inverted filter
                        │ inverted column: geom_inverted_key
                        │ num spans: 31
                        │
                        └── • scan
                              missing stats
                              table: geo_table@geom_index
                              spans: 31 spans
            ·
            Diagram: https://cockroachdb.github.io/distsqlplan/decode.html#eJyUlN9P2zAQx9_3V1j3Qqt5re0Ebfip_AhbpkJZ2mlDc4VCfWMRbdzZLgpC_d-nJAwoqIH4we3d-XN335OdO3B_5yAh-nk23I9PSecoHk_G34ZdMo6G0eGEXJPjZHRCrtBc-PRyjuTHlyiJiPMXM3ODFvXlbWfn5PtwEp-N4tNJpyN6goie6FLSCXqMBD3W7e5I-TkanUST5JyWqRZdMkqOooQcnJNroJAbjafpAh3IX8BhSmFpzQydM7Z03VUHYl2AZBSyfLnypXtKYWYsgrwDn_k5goRJ2WKCqUbbZ0BBo0-zeZX2QcGgbOAiyzUWQOHQzFeL3ElyXXcGFMbLtHT0FRwoVfzWShWcKVWw1zb40JbhCkiaaxIwYvwftA6mawpm5R8lOp9eIUi-pm8fQ5zfoPWoj7O5R4u2zzdn8T8eFUtLTE4GXBJXqibOp9bLSkXwcVcpJphSjL22AcFct8VK8c_UUxitvCQDvnUOos0cvposv78NYtttWNpskdrbx9J0ILZWD9pUf5h-sFm79stnb4hxFrJ6iftfznj9Z29_72ExHr6wH09urE_hC3tHPn2KA9F9w8jDNqLHxnq0_XBT8oC_35p-t036BN3S5A430m_LzNZTCqivsP6IOLOyMzyzZlaVqc1RxVUOjc7XUV4bcV6HygafwrwRFs2waISDZjhohMNmOGyEd5_B0_W7fwEAAP__bbfl1A==
--- done: /go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/testdata/inverted_filter_geospatial_dist with config 5node: 14 tests, 1 failures
    logic.go:3379: 
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/testdata/inverted_filter_geospatial_dist:238: error while processing
    logic.go:3379: /go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/testdata/inverted_filter_geospatial_dist:238: too many errors encountered, skipping the rest of the input
--- done: /go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/testdata/lookup_join with config 5node: 98 tests, 0 failures
--- done: /go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/testdata/inverted_filter_json_array_dist with config 5node: 30 tests, 0 failures
        --- FAIL: TestExecBuild/5node/inverted_filter_geospatial_dist (3.81s)
=== RUN   TestExecBuild/5node
    --- FAIL: TestExecBuild/5node (0.00s)
Help

See also: How To Investigate a Go Test Failure (internal)
Parameters in this failure:

  • GOFLAGS=-parallel=4

/cc @cockroachdb/sql-queries

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 25, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Jan 25, 2022
@rytaft
Copy link
Collaborator

rytaft commented Jan 25, 2022

Similar to #74933, but the build includes #75304. cc @irfansharif @cucaroach

@irfansharif
Copy link
Contributor

Coming from

# Filtering is at gateway node since the filter is not distributable.
#
# TODO(treilly): What the text above claims does not square with the figure
# generated below.
query T
EXPLAIN (DISTSQL)
SELECT k FROM geo_table WHERE ST_CoveredBy('MULTIPOINT((2.2 2.2), (3.0 3.0))'::geometry, geom) ORDER BY k

I didn't add a retry there in #75304 because the comments confused me (see the TODO added on Tommy's behalf). I'm not really sure what we want to do for that test, but "de-flaking" it is a matter of similarly adding the "retry" directive.

@cucaroach
Copy link
Contributor

I'm just gonna add the retry for now and try to get help from @yuzefovich and/or @sumeerbhola on what these tests are doing and why the comments are confused.

craig bot pushed a commit that referenced this issue Jan 26, 2022
74803: spanconfigsqltranslator: populate protected_timestamps in SpanConfig r=irfansharif,arulajmani a=adityamaru

This change teaches the SQLTranslator to hydrate the SpanConfigs
for a table with protected timestamps that apply to that table.

Concretely, this change initializes a spanconfig.ProtectedTimestampStateReader
in the txn in which the translation is taking place, thereby
providing a transactional view of the system.protected_ts_records
table. After generating the span configurations based on the zone
configurations that apply to the table, we hydrate the newly
introduced protected_timestamps field on each span configuration
with all the protected timestamps that apply to this table. This
includes protected timestamp records that directly target this
table, as well as records targetting the table's parent database.
This information is obtained from the ProtectedTimestampStateReader
mentioned above.

Additionally, this change modifies StartTenant to allow secondary
tenants to interact with the protected timestamp subsystem using a
"real" protectedts.Provider provided the migration
EnableProtectedTimestampsForTenant has run.

For testing purposes, this change teaches the data driven framework
of two additional commands protect and release.

Informs: #73727

Release note: None

75458: ui: move loading and error pages to below page settings r=xinhaoz a=xinhaoz

Fixes: #75247

Previously, the loading and error pages in the statement and txn
pages took up the entire page, including the settings header
used to issue new stats requests. This resulted in the user
being unable to re-issue a query after a query results in
a timeout, as we save the query parameters for re-use on
each request.

Release note (ui change): Loading and error pages are now below
page config in txns and stmts pages.

---------------
https://www.loom.com/share/e903ec74441140be98969c02ebbc7923

75475: distsender: fix ResumeSpan for Gets r=RaduBerinde a=RaduBerinde

The distsender is not fully implementing the ResumeSpan contract for
Get; the promise is that a nil span is set when the operation has
successfully completed.

A Get that reaches a kvserver but that is not executed has the
ResumeSpan set on the server side. But if the requests span multiple
ranges, a subset of the requests will not make it to any kvserver.

This change fills in the missing case and improves the
TestMultiRangeScanWithPagination test to allow running mixed
operations.

Fixes #74736.

Release note (bug fix): In particular cases, some queries that involve
a scan which returns many results and which includes lookups for
individual keys were not returning all results from the table.

75515: sql: de-flake inverted_filter_geospatial_dist r=cucaroach a=cucaroach

Fixes: #75497

Add a retry to a test flaking due to distsender span config cache issues.

Release note: None

75518: tracing: fix span use after release when parent and child finish concurrently r=andreimatei a=andreimatei

Before this patch, it was possible for a span to be released to the
sync.Pool and reused while the tracing code was still holding a
reference to it. This was because a span could maintain unaccounted
references to its children even after those children were released. And
those reference could be inadvertently used, causing a data race and
trace corruption.

This race had two parts:
1) s.mu.openChildren could still hold references to children after
   s.finish() terminated. Those references were unaccounted for, so
   those children could be reused if they finish. This would be fine (or
   at least benign) by itself (since a finished span should generally
   not be accessed), if it wasn't for 2).
2) s' recording can be collected after s.Finish() ran, and that would
   call into s' child.

In order to trigger the bad behavior, I believe you need a child and a
parent to finish() concurrently, and a chain of 4 spans in parent-child
relationships. Lets call the spans ggp, gp, p, c (g=grand, p=parent,
c=child). The race scenario goes like this:

1. gp.Finish() is called; gp.mu.finished is set early.
2. p.Finish() is called; p.mu.finished is set early.
3. c.Finish() is called. c calls into p (p.childFinished(c)), asking p
   to remove c from c.mu.openChildren, and to collect its recording if
   it wants to. This call is a no-op because p.mu.finished is already set.
4. p.Finish() terminates. c remains part of c.mu.openChildren, but the
   corresponding reference count on c is decremented. Similarly, p
   remains part of gp.mu.openChildren.
5. c.Finish() terminates. c's reference count drops to zero (because p
   no longer counts a reference to it). c is release to the pool, and
   reused.
6. gp.Finish() resumes and calls into ggp (ggp.childFinished(gp)). ggp
   collects gp's recording, and recursively that calls into p, and then
   c. This is a data race, since c was already reused.

To fix this, we have p.childFinished(c) no longer automatically be a
no-op when p is already finished. Instead, if c is still part of
c.mu.openChildren, it will be removed. This way, when c.finish()
completes, we know that c is not reference any more by the parent.  We
also wipe p.mu.openChildren at the end of p.Finish(). This way, if a
child c calls p.childFinish(c) afterwards, that call will be
short-circuited and c's recording will not be needlessly collected.

Release note: None

Fixes #75495
Fixes #75414
Fixes #75334

Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Radu Berinde <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in 900eb99 Jan 26, 2022
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants