importccl: write stub table statistics during import #67106

michae2 · 2021-07-01T06:52:57Z

Addresses: #36273

At the end of an import job we notify the StatsRefresher that it should
collect statistics for the tables we imported into, but it can take a
while for those statistics to actually be collected. Meanwhile, if the
table is new, the optimizer will be operating totally blind. To make
this period slightly less painful, write stub statistics at the end of
import consisting of only row counts.

Release note (performance improvement): collect basic table statistics
during import, to help the optimizer until full statistics collection
completes.

cockroach-teamcity · 2021-07-01T06:53:04Z

This change is

ajwerner

Reviewed 1 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @rytaft)

pkg/ccl/importccl/import_stmt.go, line 2090 at r1 (raw file):

	if err := r.job.SetDetails(ctx, nil /* txn */, details); err != nil {
		return err

Is there much value in writing this to the jobs table here? I don't remember the lifecycle of the import job and what happens if it crashes here. It feels to me like either we should update the counts as we ingest if we're already checkpointing and can restart or we should just let the work that happens in publishTables read the in-memory state set here.

rytaft

Very nice! @dt and @ajwerner would be the experts here, but

Reviewed 6 of 6 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt and @michae2)

pkg/ccl/importccl/import_stmt.go, line 1559 at r1 (raw file):

		for _, desc := range res {
			tableDesc := tabledesc.NewUnsafeImmutable(desc)
			stats, err := sql.StubTableStats(p.ExecCfg().Settings, tableDesc, "__import__")

nit: I'd assign "import" to a const var. You could put it here:

cockroach/pkg/jobs/jobspb/wrap.go

Line 64 in 59756c4

const AutoStatsName = "__auto__"

pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):

			for _, stat := range details.Tables[i].Statistics {
				stat.RowCount += uint64(count)
				stat.DistinctCount += uint64(count)

in the optimizer we use an "unknown distinct count ratio" of 0.1 and "unknown null count ratio" of 0.01:

cockroach/pkg/sql/opt/memo/statistics_builder.go

Line 2752 in 74a99f8

unknownDistinctCountRatio = 0.1

.

Perhaps we should do that here too?

rytaft · 2021-07-01T12:55:03Z

(This does need some tests, though)

michae2

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @michae2, and @rytaft)

pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

in the optimizer we use an "unknown distinct count ratio" of 0.1 and "unknown null count ratio" of 0.01:

cockroach/pkg/sql/opt/memo/statistics_builder.go

Line 2752 in 74a99f8

unknownDistinctCountRatio = 0.1

.

Perhaps we should do that here too?

Ah, thank you, I was wondering about this. I will use those ratios!

pkg/ccl/importccl/import_stmt.go, line 2090 at r1 (raw file):

Previously, ajwerner wrote…

	if err := r.job.SetDetails(ctx, nil /* txn */, details); err != nil {
		return err
Is there much value in writing this to the jobs table here? I don't remember the lifecycle of the import job and what happens if it crashes here. It feels to me like either we should update the counts as we ingest if we're already checkpointing and can restart or we should just let the work that happens in publishTables read the in-memory state set here.

Great question. I'm not sure why, but without this write the counts are lost before publishTables, even if I do not pause the job. But yes it seems like a kludge. You are right, the counts should be written in some kind of checkpoint so that pausing and restarting work.

I'm still trying to understand why this write is necessary, and how pausing and restarting work, which is the main reason this PR is still a draft! 😅

dt · 2021-07-01T13:30:06Z

pkg/ccl/importccl/import_stmt.go

 	}
 	r.res.DataSize = res.DataSize
 	for id, count := range res.EntryCounts {
-		if _, ok := pkIDs[id]; ok {
+		if i, ok := pkIDs[id]; ok {
 			r.res.Rows += count


So unfortunately r.res (and res) can't really be trusted.

It is accumulated and returned from the distsql flow that does the data ingestion, and is indeed reported to the user as what was imported, but importantly the user is only shown "results" if the job runs straight through in one execution. If it is paused or hits an error, the user-run statement that started it gets an error -- if the job is then resumed/retried, it will pick up from where it left off using persisted progress information, starting a new flow that will only ingest whatever is left to do, and that flow will return a res that only includes what it ingested, not the sum of what was ingested by all flows run by this job.

We don't currently worry about this fact, that res isn't really the actual full IMPORT results, for reporting the user-facing results because of the fact that if a job is resumed, there is no user session attached to it that is expecting to see the results (the one that started it got a paused or lost-lease error already, and RESUME JOB just says it was resumed, it doesn't show job-specific results).

So while we currently get away with res not being correct for resumed jobs just for user-result reporting, it seems like it could be very bad to rely on it in table stats. e.g. if I pause a giant import at 99%, then resume and complete it, using res as it stands today would mean we'd record table stats with a count that is only 1% of the actual table count. That seems like it could mislead the optimizer pretty badly, right?

I think we'll need to persist partial totals as we go, in the job progress field. We were vaguely planning to do this anyway, so that you could see what was imported after the fact by an IMPORT job, particularly a detached import job, but I guess we might need to do it sooner rather than later if you want to use those counts here.

It could definitely mislead the optimizer, although odds are it's probably still better than our current default of 1000 rows. Is there any way to tell whether the job has been paused? If so, it seems like we could use the following logic:

if job was never paused { // use row count from res } else { // use row count from res if > 1000 }

Obviously it would be ideal to persist the partial totals, but maybe this could be a a good stopgap in the mean time?

I might take a stab at persisting partial results next week and then just make res correct on resumes too.

In the meantime, if you wanted to do that, I think you could do something like this, before calling ingest:

var wasResumed bool initialProgress := r.job.Progress() for _, pos := range initialProgress.GetImport().ResumePos { wasResumed = wasResumed || pos > 0 }

In any case, I don't think we need to persist the stats stubs in the job details -- we can just derive it at the end. Also, if/when we do persist something that changes during the execution of the job, we'd want to do so in progress, not details, to utilize the separate column family.

Thank you, this is really helpful. I suspected this was the case, but wasn't totally sure. I can take a look at persisting row counts in the progress field as part of this PR, but if you get to it first that is fine, too.

I'm generating stats stubs during preparation, rather than during table publishing in anticipation of adding distinct and null counts. Unlike row counts, distinct and null counts are specific to the column(s) counted, and so we would need to know the exact column sets being counted before performing the import.

@michae2 I think there might be an existing bug here already -- caused by the addition of a retry loop around ingest -- that points to the fact we need to persist res anyway. If you want to just wait for us to do that, then rebase, I think you can use it once it is fixed without needing to persist anything related to stats in the job progress.

Sounds good!

@michae2 I think there might be an existing bug here already -- caused by the addition of a retry loop around ingest -- that points to the fact we need to persist res anyway. If you want to just wait for us to do that, then rebase, I think you can use it once it is fixed without needing to persist anything related to stats in the job progress.

@dt Just wondering, has anyone had a chance to work on this? (Looks like not yet, based on revision history?)

Sorry, I don't think so -- I've been OOO and I don't think anyone else picked it up yet.

That's fine, it's not urgent.

@michae2, just wanted to mention that I've filed #67987 to track this.

dt · 2021-07-01T19:38:11Z

TFTRs!

bors r+

craig · 2021-07-01T19:38:13Z

👎 Rejected by label

dt · 2021-07-01T19:38:54Z

whoops, commented in the wrong tab, sorry!

bors r-

michae2 · 2021-09-15T05:35:16Z

Ok, I think this is RFAL.

michae2

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @dt, and @rytaft)

pkg/ccl/importccl/import_stmt.go, line 1559 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: I'd assign "import" to a const var. You could put it here:

cockroach/pkg/jobs/jobspb/wrap.go

Line 64 in 59756c4

const AutoStatsName = "__auto__"

Done.

pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Ah, thank you, I was wondering about this. I will use those ratios!

Done.

rytaft

Nice! One think that would be good to check is how this impacts our TPC-C workload. We inject some carefully calculated stats for TPC-C, and I want to make sure that this won't overwrite those:

cockroach/pkg/ccl/workloadccl/cliccl/fixtures.go

Line 106 in 7ef8ac1

var fixturesImportInjectStats = fixturesImportCmd.PersistentFlags().Bool(

Reviewed 9 of 9 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt and @michae2)

pkg/ccl/importccl/import_stmt.go, line 2392 at r2 (raw file):

				id := roachpb.BulkOpSummaryID(uint64(desc.GetID()), uint64(desc.GetPrimaryIndexID()))
				rowCount := uint64(res.EntryCounts[id])
				distinctCount := uint64(float64(rowCount) * memo.UnknownDistinctCountRatio)

nit: add a TODO explaining that we'd like to collect the real counts eventually

pkg/sql/create_stats.go, line 317 at r2 (raw file):

// plus any other boolean or enum columns (where the "histogram" is tiny).
func createStatsDefaultColumns(
	desc catalog.TableDescriptor, multiColEnabled bool,

Until we can get the real distinct counts, I think it would be preferable to not inject multi-column stats. Using the unknown distinct/null ratios for the multi-col stats will make it look like all columns are perfectly correlated, which would probably do more harm than good. So I would just leave this parameter as-is for now and pass in false.

michae2

One think that would be good to check is how this impacts our TPC-C workload. We inject some carefully calculated stats for TPC-C, and I want to make sure that this won't overwrite those

I checked, and these stub stats do not interfere with the injected stats for TPC-C and other workloads. (cockroach workload fixtures import first runs IMPORT which creates the stub stats, then runs ALTER TABLE INJECT STATISTICS which deletes the stub stats and writes the injected stats.)

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt and @rytaft)

pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):

Previously, pbardea (Paul Bardea) wrote…

@michae2, just wanted to mention that I've filed #67987 to track this.

@pbardea thanks again for working on that. Is res now reliable, even across pauses and resumes?

pkg/ccl/importccl/import_stmt.go, line 2392 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: add a TODO explaining that we'd like to collect the real counts eventually

Done.

pkg/sql/create_stats.go, line 317 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Until we can get the real distinct counts, I think it would be preferable to not inject multi-column stats. Using the unknown distinct/null ratios for the multi-col stats will make it look like all columns are perfectly correlated, which would probably do more harm than good. So I would just leave this parameter as-is for now and pass in false.

Done.

rytaft

Thanks for checking!

Reviewed 3 of 3 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt)

pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

@pbardea thanks again for working on that. Is res now reliable, even across pauses and resumes?

(cc @dt in case @pbardea doesn't see this 😭)

michae2

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt)

pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

(cc @dt in case @pbardea doesn't see this 😭)

Oh! 😢

Addresses: cockroachdb#36273 At the end of an import job we notify the StatsRefresher that it should collect statistics for the tables we imported into, but it can take a while for those statistics to actually be collected. Meanwhile, if the table is new, the optimizer will be operating totally blind. To make this period slightly less painful, write stub statistics at the end of import consisting of only row counts. Release note (performance improvement): collect basic table statistics during import, to help the optimizer until full statistics collection completes.

michae2 · 2021-09-16T00:01:34Z

TFTRs!

bors r=rytaft,dt

craig · 2021-09-16T01:35:38Z

Build succeeded:

GitHub CI (Cockroach)

michae2 requested review from a team and dt and removed request for a team July 1, 2021 06:52

michae2 added the do-not-merge bors won't merge a PR with this label. label Jul 1, 2021

michae2 marked this pull request as draft July 1, 2021 06:53

michae2 requested review from rytaft and a team July 1, 2021 06:54

ajwerner reviewed Jul 1, 2021

View reviewed changes

rytaft approved these changes Jul 1, 2021

View reviewed changes

michae2 commented Jul 1, 2021

View reviewed changes

dt reviewed Jul 1, 2021

View reviewed changes

pbardea mentioned this pull request Jul 26, 2021

import: job summary is sometimes inaccurate #67987

Closed

michae2 mentioned this pull request Aug 17, 2021

importccl: persist bulk op summary during execution #68218

Merged

jordanlewis mentioned this pull request Aug 31, 2021

opt: inject stats during import #36273

Open

michae2 force-pushed the import_stats2 branch from 49d2166 to 4741230 Compare September 15, 2021 05:32

michae2 removed the do-not-merge bors won't merge a PR with this label. label Sep 15, 2021

michae2 changed the title ~~WIP: importccl: write stub table statistics during import~~ importccl: write stub table statistics during import Sep 15, 2021

michae2 marked this pull request as ready for review September 15, 2021 05:32

michae2 requested a review from a team as a code owner September 15, 2021 05:32

michae2 removed the request for review from a team September 15, 2021 05:33

michae2 commented Sep 15, 2021

View reviewed changes

rytaft reviewed Sep 15, 2021

View reviewed changes

michae2 force-pushed the import_stats2 branch from 4741230 to 0cd1e2a Compare September 15, 2021 21:58

michae2 force-pushed the import_stats2 branch from 0cd1e2a to 6fc090a Compare September 15, 2021 22:02

michae2 commented Sep 15, 2021

View reviewed changes

rytaft approved these changes Sep 15, 2021

View reviewed changes

michae2 commented Sep 15, 2021

View reviewed changes

dt approved these changes Sep 15, 2021

View reviewed changes

michae2 force-pushed the import_stats2 branch from 6fc090a to ac0976c Compare September 15, 2021 23:59

craig bot merged commit a454599 into cockroachdb:master Sep 16, 2021

cockroach-teamcity mentioned this pull request Sep 16, 2021

importccl: write stub table statistics during import cockroachdb/docs#11693

Open

michae2 deleted the import_stats2 branch September 16, 2021 01:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

importccl: write stub table statistics during import #67106

importccl: write stub table statistics during import #67106

michae2 commented Jul 1, 2021 •

edited

Loading

cockroach-teamcity commented Jul 1, 2021

ajwerner left a comment

rytaft left a comment

rytaft commented Jul 1, 2021

michae2 left a comment

dt Jul 1, 2021

rytaft Jul 1, 2021

dt Jul 1, 2021

michae2 Jul 1, 2021 •

edited

Loading

dt Jul 1, 2021

rytaft Jul 1, 2021

michae2 Jul 20, 2021

dt Jul 21, 2021

michae2 Jul 22, 2021

pbardea Jul 26, 2021

dt commented Jul 1, 2021

craig bot commented Jul 1, 2021

dt commented Jul 1, 2021 •

edited

Loading

michae2 commented Sep 15, 2021

michae2 left a comment

rytaft left a comment

michae2 left a comment

rytaft left a comment

michae2 left a comment

michae2 commented Sep 16, 2021

craig bot commented Sep 16, 2021

importccl: write stub table statistics during import #67106

importccl: write stub table statistics during import #67106

Conversation

michae2 commented Jul 1, 2021 • edited Loading

cockroach-teamcity commented Jul 1, 2021

ajwerner left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

rytaft commented Jul 1, 2021

michae2 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michae2 Jul 1, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dt commented Jul 1, 2021

craig bot commented Jul 1, 2021

dt commented Jul 1, 2021 • edited Loading

michae2 commented Sep 15, 2021

michae2 left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

rytaft left a comment

Choose a reason for hiding this comment

michae2 left a comment

Choose a reason for hiding this comment

michae2 commented Sep 16, 2021

craig bot commented Sep 16, 2021

michae2 commented Jul 1, 2021 •

edited

Loading

michae2 Jul 1, 2021 •

edited

Loading

dt commented Jul 1, 2021 •

edited

Loading