Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distsender: fix ResumeSpan for Gets #75475

Merged
merged 1 commit into from
Jan 26, 2022

Conversation

RaduBerinde
Copy link
Member

The distsender is not fully implementing the ResumeSpan contract for
Get; the promise is that a nil span is set when the operation has
successfully completed.

A Get that reaches a kvserver but that is not executed has the
ResumeSpan set on the server side. But if the requests span multiple
ranges, a subset of the requests will not make it to any kvserver.

This change fills in the missing case and improves the
TestMultiRangeScanWithPagination test to allow running mixed
operations.

Fixes #74736.

Release note (bug fix): In particular cases, some queries that involve
a scan which returns many results and which includes lookups for
individual keys were not returning all results from the table.

@RaduBerinde RaduBerinde requested a review from a team as a code owner January 25, 2022 00:36
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@RaduBerinde
Copy link
Member Author

Note, in versions before 21.2, we were never issuing Gets when scanning tables, so this was not causing any problems.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! :lgtm:

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rharding6373)

@RaduBerinde
Copy link
Member Author

TFTR! I will spend a bit of time seeing if I can make the SQL-level repro fast enough for a logictest.

@RaduBerinde
Copy link
Member Author

Added a logic test.

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

cc @nvanbenschoten

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @RaduBerinde and @rharding6373)


pkg/sql/logictest/testdata/logic_test/distsql_agg, line 673 at r2 (raw file):

SELECT every(col2) FROM table58683_1 JOIN table58683_2 ON col1 = (table58683_2.tableoid)::INT8 GROUP BY col2 HAVING bool_and(col2);

# Regression test for #745736 - missing Get results when:

nit: I'm not sure when we'll have issue numbers in 700k range :)


pkg/sql/logictest/testdata/logic_test/distsql_agg, line 681 at r2 (raw file):

ALTER TABLE table74736 SPLIT AT VALUES (1000000);
ALTER TABLE table74736 EXPERIMENTAL_RELOCATE VALUES (ARRAY[1], 0), (ARRAY[2], 1000000);
INSERT INTO table74736 SELECT x * 10000, repeat('a', 500000) FROM generate_series(1, 130) AS g(x);

nit: I think we could lower the size of the blob to 200000 to speed up the test. Target bytes limit is 10MiB, so we'll be able to fit 50 out of 90 rows for the query into the first BatchResponse.

Copy link
Member Author

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rharding6373 and @yuzefovich)


pkg/sql/logictest/testdata/logic_test/distsql_agg, line 673 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I'm not sure when we'll have issue numbers in 700k range :)

Give it a couple of years :)


pkg/sql/logictest/testdata/logic_test/distsql_agg, line 681 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think we could lower the size of the blob to 200000 to speed up the test. Target bytes limit is 10MiB, so we'll be able to fit 50 out of 90 rows for the query into the first BatchResponse.

Done. Confirmed it still repros. It was under 1s anyway.

The distsender is not fully implementing the ResumeSpan contract for
Get; the promise is that a nil span is set when the operation has
successfully completed.

A Get that reaches a kvserver but that is not executed has the
ResumeSpan set on the server side. But if the requests span multiple
ranges, a subset of the requests will not make it to any kvserver. Any
Gets that weren't executed and don't have a ResumeSpan will not be
executed again (it looks to the upper layer as if the Get was executed
and found no key). The end result is that scans can miss rows.

This change fills in the missing case and improves the
TestMultiRangeScanWithPagination test to allow running mixed
operations.

Fixes cockroachdb#74736.

Release note (bug fix): In particular cases, some queries that involve
a scan which returns many results and which includes lookups for
individual keys were not returning all results from the table.
Copy link
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! :lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)

@RaduBerinde
Copy link
Member Author

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 25, 2022

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Contributor

craig bot commented Jan 25, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jan 26, 2022

Build succeeded:

@craig craig bot merged commit 8f15635 into cockroachdb:master Jan 26, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jan 26, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from bc98764 to blathers/backport-release-21.2-75475: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@RaduBerinde RaduBerinde deleted the fix-get-issue branch January 27, 2022 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Database does not return all rows on select query
5 participants