Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: crdb_internal.jobs cannot be safely joined with other tables #62415

Closed
ajwerner opened this issue Mar 23, 2021 · 12 comments · Fixed by #62581
Closed

sql: crdb_internal.jobs cannot be safely joined with other tables #62415

ajwerner opened this issue Mar 23, 2021 · 12 comments · Fixed by #62581
Assignees
Labels
A-jobs A-sql-vtables Virtual tables - pg_catalog, information_schema etc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker

Comments

@ajwerner
Copy link
Contributor

ajwerner commented Mar 23, 2021

Describe the problem

This seems like a major regression. This works on 20.2.6. I don't understand the source of concurrency. This query does some crazy stuff down here:

if n.Block {
sqlStmt = fmt.Sprintf(
`SELECT * FROM [%s]
WHERE
IF(finished IS NULL,
IF(pg_sleep(1), crdb_internal.force_retry('24h'), 0),
0
) = 0`, sqlStmt)
}
.

The root of the problem is that we have this streaming iterator thing which we kick off underneath a virtual plan node. That thing, however, concurrently uses the same transaction as we use elsewhere. This leads to problems. The below query is just one example. There are plenty of other examples.

[email protected]:39361/movr> SHOW JOBS WHEN COMPLETE SELECT job_id FROM [SHOW JOBS] WHERE job_type = 'SCHEMA CHANGE GC' AND  description like '%idx1%' ;
ERROR: internal error: crdb-internal-jobs-table: concurrent txn use detected. ba: Get [/Table/3/1/15/2/1,/Min), [txn: fcfa4280], [can-forward-ts]
SQLSTATE: XX000
DETAIL: stack trace:
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:73: SendLocked()
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46: SendLocked()
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:126: SendLocked()
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:267: sendLockedWithRefreshAttempts()
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:202: SendLocked()
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:253: SendLocked()
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105: SendLocked()
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:171: SendLocked()
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:503: Send()
github.com/cockroachdb/cockroach/pkg/kv/db.go:808: sendUsingSender()
github.com/cockroachdb/cockroach/pkg/kv/txn.go:936: Send()
github.com/cockroachdb/cockroach/pkg/kv/db.go:719: sendAndFill()
github.com/cockroachdb/cockroach/pkg/kv/txn.go:607: Run()
github.com/cockroachdb/cockroach/pkg/kv/txn.go:374: Get()
github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:414: getDescriptorByID()
github.com/cockroachdb/cockroach/pkg/sql/catalog/catalogkv/catalogkv.go:519: GetMutableDescriptorByID()
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:370: getDescriptorFromStore()
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:553: getObjectByName()
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:602: getTableByName()
github.com/cockroachdb/cockroach/pkg/sql/catalog/descs/collection.go:586: GetImmutableTableByName()
github.com/cockroachdb/cockroach/pkg/sql/catalog/accessors/logical_schema_accessors.go:177: GetObjectDesc()
github.com/cockroachdb/cockroach/pkg/sql/resolver.go:197: LookupObject()
github.com/cockroachdb/cockroach/pkg/sql/sem/tree/name_resolution.go:334: ResolveExisting()
github.com/cockroachdb/cockroach/pkg/sql/catalog/resolver/resolver.go:156: ResolveExistingObject()
github.com/cockroachdb/cockroach/pkg/sql/catalog/resolver/resolver.go:93: ResolveExistingTableObject()
github.com/cockroachdb/cockroach/pkg/sql/opt_catalog.go:205: ResolveDataSource()
github.com/cockroachdb/cockroach/pkg/sql/opt/metadata.go:284: CheckDependencies()
github.com/cockroachdb/cockroach/pkg/sql/opt/memo/memo.go:296: IsStale()
github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:477: buildExecMemo()
github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:194: makeOptimizerPlan()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:937: makeExecPlan()
github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:818: dispatchToExecutionEngine()

HINT: You have encountered an unexpected error.

Please check the public issue tracker to check whether this problem is
already tracked. If you cannot find it there, please report the error
with details by creating a new issue.

If you would rather not post publicly, please contact us directly
using the support form.
@ajwerner ajwerner added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-jobs GA-blocker branch-release-21.1 labels Mar 23, 2021
@ajwerner
Copy link
Contributor Author

@fqazi to bisect

@fqazi
Copy link
Collaborator

fqazi commented Mar 23, 2021

After bisecting it came down to:
dbc8676 is the first bad commit
commit dbc8676
Author: Yahor Yuzefovich [email protected]
Date: Sun Jan 24 14:20:51 2021 -0800

sql: audit all usages of QueryEx to use iterator pattern

This commit audits the usage of `QueryEx` method of the internal
executor in the following manner:
- if the caller only needed to execute the statement, `ExecEx` is now
used
- if the query can return at most one row, then `QueryRowEx` is now used
- if the caller can be refactored to use the iterator pattern, it is
done so.

As a result, almost all usages have been refactored (most notably the
virtual `crdb_internal.jobs` table now uses the iterator pattern, thus
aleviating OOM concerns - the ad-hoc memory accounting logic has been
removed).

`QueryEx` has been renamed to `QueryBufferedEx` to highlight that the
full buffering occurs, and it was added to `sqlutil.InternalExecutor`
interface. The method is now used only in three places.

Release justification: bug fix.

Release note (bug fix): `crdb_internal.jobs` virtual table is now
populated in a paginated fashion, thus, alleviating memory related
concerns when previously we could encounter OOM crash.

@fqazi
Copy link
Collaborator

fqazi commented Mar 23, 2021

@ajwerner Do you have any thoughts about the right thing here? This is an issue between crdb_internal.force_retry (for the WHEN COMPLETE clause) and QueryBufferedEx of some sort. I don't know if creating a new transaction in the internal table would be correct either.

@ajwerner
Copy link
Contributor Author

Fascinating. Maybe we can rework the query. Let me chew on this a bit. Thanks for the bisect.

@ajwerner
Copy link
Contributor Author

I'm afraid we're violating some principles with some of the new streaming stuff inside of virtual tables. Namely, cockroach transactions don't support parallelism. To deal with this, we have this cool device called a leaf transaction which isn't too heavy in terms of resources but comes with some rules for how to deal with its errors and its state propagation. I don't think we're using these leaf transactions when we kick off these concurrent scans. @yuzefovich let's sync up on what to do about this.

@ajwerner
Copy link
Contributor Author

I'm worried that it's even worse than this. I'm worried we can't even get a leaf to work properly here.

@ajwerner
Copy link
Contributor Author

SELECT a.job_id, b.job_id FROM [SHOW JOBS] a, [SHOW JOBS] b; as a similar sort of problem

@ajwerner ajwerner changed the title sql: SHOW JOBS WHEN COMPLETE (SELECT job_id FROM [SHOW JOBS]) broken sql: crdb_internal.jobs cannot be safely joined with other tables Mar 23, 2021
@ajwerner ajwerner added the A-sql-vtables Virtual tables - pg_catalog, information_schema etc label Mar 23, 2021
@yuzefovich
Copy link
Member

I briefly looked into this, and I agree that it is pretty bad. I'm not sure what to do as I'm not very familiar with internals of how SQL interfacts with KV txns, so definitely let's chat, Andrew.

I also audited the usage of the streaming internal executor API and found the following possibly concerning additional callsites:

  • pkg/jobs/job_scheduler.go:286
  • pkg/sql/sem/builtins/generator_builtins.go:1596.

@yuzefovich
Copy link
Member

yuzefovich commented Mar 24, 2021

I've just talked with Andrew, and we believe we see a good path forward that doesn't require very invasive changes and should be easily backportable without sacrificing the gains that streaming internal executor work provided.

The idea is that we'll require synchronization between the rowsIterator (reader) goroutine and the connExecutor (writer, query runner) goroutine so that only one of the two runs at the same time. This can be achieved by blocking the writer in addRow until the reader is ready to receive the row, and then blocking the reader until the writer is blocked again. I'll look into fixing this today. Some care will need to be taken to block the return of the new rowsIterator until the writer is blocked in addRow for the first time (ensuring that the query planning has been fully performed).

@knz
Copy link
Contributor

knz commented Mar 29, 2021

FWIW/FYI There's a tech note in docs/tech-notes that explains leaf and root txns and what's allowed and what not.

@ajwerner
Copy link
Contributor Author

ajwerner commented Apr 5, 2021

@yuzefovich I'm re-opening this until we get the backports we want merged into 21.1.

@ajwerner ajwerner reopened this Apr 5, 2021
@yuzefovich
Copy link
Member

#62923 is now merged which fixes this issue - hopefully - for good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-jobs A-sql-vtables Virtual tables - pg_catalog, information_schema etc C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants