Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

importccl: write stub table statistics during import #67106

Merged
merged 1 commit into from
Sep 16, 2021

Conversation

michae2
Copy link
Collaborator

@michae2 michae2 commented Jul 1, 2021

Addresses: #36273

At the end of an import job we notify the StatsRefresher that it should
collect statistics for the tables we imported into, but it can take a
while for those statistics to actually be collected. Meanwhile, if the
table is new, the optimizer will be operating totally blind. To make
this period slightly less painful, write stub statistics at the end of
import consisting of only row counts.

Release note (performance improvement): collect basic table statistics
during import, to help the optimizer until full statistics collection
completes.

@michae2 michae2 requested review from a team and dt and removed request for a team July 1, 2021 06:52
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@michae2 michae2 added the do-not-merge bors won't merge a PR with this label. label Jul 1, 2021
@michae2 michae2 marked this pull request as draft July 1, 2021 06:53
@michae2 michae2 requested review from rytaft and a team July 1, 2021 06:54
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @rytaft)


pkg/ccl/importccl/import_stmt.go, line 2090 at r1 (raw file):

	if err := r.job.SetDetails(ctx, nil /* txn */, details); err != nil {
		return err

Is there much value in writing this to the jobs table here? I don't remember the lifecycle of the import job and what happens if it crashes here. It feels to me like either we should update the counts as we ingest if we're already checkpointing and can restart or we should just let the work that happens in publishTables read the in-memory state set here.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! @dt and @ajwerner would be the experts here, but :lgtm:

Reviewed 6 of 6 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt and @michae2)


pkg/ccl/importccl/import_stmt.go, line 1559 at r1 (raw file):

		for _, desc := range res {
			tableDesc := tabledesc.NewUnsafeImmutable(desc)
			stats, err := sql.StubTableStats(p.ExecCfg().Settings, tableDesc, "__import__")

nit: I'd assign "import" to a const var. You could put it here:

const AutoStatsName = "__auto__"


pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):

			for _, stat := range details.Tables[i].Statistics {
				stat.RowCount += uint64(count)
				stat.DistinctCount += uint64(count)

in the optimizer we use an "unknown distinct count ratio" of 0.1 and "unknown null count ratio" of 0.01:

unknownDistinctCountRatio = 0.1
.

Perhaps we should do that here too?

@rytaft
Copy link
Collaborator

rytaft commented Jul 1, 2021

(This does need some tests, though)

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @michae2, and @rytaft)


pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

in the optimizer we use an "unknown distinct count ratio" of 0.1 and "unknown null count ratio" of 0.01:

unknownDistinctCountRatio = 0.1
.

Perhaps we should do that here too?

Ah, thank you, I was wondering about this. I will use those ratios!


pkg/ccl/importccl/import_stmt.go, line 2090 at r1 (raw file):

Previously, ajwerner wrote…
	if err := r.job.SetDetails(ctx, nil /* txn */, details); err != nil {
		return err

Is there much value in writing this to the jobs table here? I don't remember the lifecycle of the import job and what happens if it crashes here. It feels to me like either we should update the counts as we ingest if we're already checkpointing and can restart or we should just let the work that happens in publishTables read the in-memory state set here.

Great question. I'm not sure why, but without this write the counts are lost before publishTables, even if I do not pause the job. But yes it seems like a kludge. You are right, the counts should be written in some kind of checkpoint so that pausing and restarting work.

I'm still trying to understand why this write is necessary, and how pausing and restarting work, which is the main reason this PR is still a draft! 😅

}
r.res.DataSize = res.DataSize
for id, count := range res.EntryCounts {
if _, ok := pkIDs[id]; ok {
if i, ok := pkIDs[id]; ok {
r.res.Rows += count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So unfortunately r.res (and res) can't really be trusted.

It is accumulated and returned from the distsql flow that does the data ingestion, and is indeed reported to the user as what was imported, but importantly the user is only shown "results" if the job runs straight through in one execution. If it is paused or hits an error, the user-run statement that started it gets an error -- if the job is then resumed/retried, it will pick up from where it left off using persisted progress information, starting a new flow that will only ingest whatever is left to do, and that flow will return a res that only includes what it ingested, not the sum of what was ingested by all flows run by this job.

We don't currently worry about this fact, that res isn't really the actual full IMPORT results, for reporting the user-facing results because of the fact that if a job is resumed, there is no user session attached to it that is expecting to see the results (the one that started it got a paused or lost-lease error already, and RESUME JOB just says it was resumed, it doesn't show job-specific results).

So while we currently get away with res not being correct for resumed jobs just for user-result reporting, it seems like it could be very bad to rely on it in table stats. e.g. if I pause a giant import at 99%, then resume and complete it, using res as it stands today would mean we'd record table stats with a count that is only 1% of the actual table count. That seems like it could mislead the optimizer pretty badly, right?

I think we'll need to persist partial totals as we go, in the job progress field. We were vaguely planning to do this anyway, so that you could see what was imported after the fact by an IMPORT job, particularly a detached import job, but I guess we might need to do it sooner rather than later if you want to use those counts here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could definitely mislead the optimizer, although odds are it's probably still better than our current default of 1000 rows. Is there any way to tell whether the job has been paused? If so, it seems like we could use the following logic:

if job was never paused {
  // use row count from res 
} else {
  // use row count from res if > 1000
}

Obviously it would be ideal to persist the partial totals, but maybe this could be a a good stopgap in the mean time?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might take a stab at persisting partial results next week and then just make res correct on resumes too.

In the meantime, if you wanted to do that, I think you could do something like this, before calling ingest:

	var wasResumed bool
	initialProgress := r.job.Progress()
	for _, pos := range initialProgress.GetImport().ResumePos {
		wasResumed = wasResumed || pos > 0
	}

In any case, I don't think we need to persist the stats stubs in the job details -- we can just derive it at the end. Also, if/when we do persist something that changes during the execution of the job, we'd want to do so in progress, not details, to utilize the separate column family.

Copy link
Collaborator Author

@michae2 michae2 Jul 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is really helpful. I suspected this was the case, but wasn't totally sure. I can take a look at persisting row counts in the progress field as part of this PR, but if you get to it first that is fine, too.

I'm generating stats stubs during preparation, rather than during table publishing in anticipation of adding distinct and null counts. Unlike row counts, distinct and null counts are specific to the column(s) counted, and so we would need to know the exact column sets being counted before performing the import.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michae2 I think there might be an existing bug here already -- caused by the addition of a retry loop around ingest -- that points to the fact we need to persist res anyway. If you want to just wait for us to do that, then rebase, I think you can use it once it is fixed without needing to persist anything related to stats in the job progress.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michae2 I think there might be an existing bug here already -- caused by the addition of a retry loop around ingest -- that points to the fact we need to persist res anyway. If you want to just wait for us to do that, then rebase, I think you can use it once it is fixed without needing to persist anything related to stats in the job progress.

@dt Just wondering, has anyone had a chance to work on this? (Looks like not yet, based on revision history?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't think so -- I've been OOO and I don't think anyone else picked it up yet.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine, it's not urgent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michae2, just wanted to mention that I've filed #67987 to track this.

@dt
Copy link
Member

dt commented Jul 1, 2021

TFTRs!

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 1, 2021

👎 Rejected by label

@dt
Copy link
Member

dt commented Jul 1, 2021

whoops, commented in the wrong tab, sorry!

bors r-

@michae2 michae2 removed the do-not-merge bors won't merge a PR with this label. label Sep 15, 2021
@michae2 michae2 changed the title WIP: importccl: write stub table statistics during import importccl: write stub table statistics during import Sep 15, 2021
@michae2 michae2 marked this pull request as ready for review September 15, 2021 05:32
@michae2 michae2 requested a review from a team as a code owner September 15, 2021 05:32
@michae2 michae2 removed the request for review from a team September 15, 2021 05:33
@michae2
Copy link
Collaborator Author

michae2 commented Sep 15, 2021

Ok, I think this is RFAL.

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @dt, and @rytaft)


pkg/ccl/importccl/import_stmt.go, line 1559 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: I'd assign "import" to a const var. You could put it here:

const AutoStatsName = "__auto__"

Done.


pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

Ah, thank you, I was wondering about this. I will use those ratios!

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! One think that would be good to check is how this impacts our TPC-C workload. We inject some carefully calculated stats for TPC-C, and I want to make sure that this won't overwrite those:

var fixturesImportInjectStats = fixturesImportCmd.PersistentFlags().Bool(

Reviewed 9 of 9 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt and @michae2)


pkg/ccl/importccl/import_stmt.go, line 2392 at r2 (raw file):

				id := roachpb.BulkOpSummaryID(uint64(desc.GetID()), uint64(desc.GetPrimaryIndexID()))
				rowCount := uint64(res.EntryCounts[id])
				distinctCount := uint64(float64(rowCount) * memo.UnknownDistinctCountRatio)

nit: add a TODO explaining that we'd like to collect the real counts eventually


pkg/sql/create_stats.go, line 317 at r2 (raw file):

// plus any other boolean or enum columns (where the "histogram" is tiny).
func createStatsDefaultColumns(
	desc catalog.TableDescriptor, multiColEnabled bool,

Until we can get the real distinct counts, I think it would be preferable to not inject multi-column stats. Using the unknown distinct/null ratios for the multi-col stats will make it look like all columns are perfectly correlated, which would probably do more harm than good. So I would just leave this parameter as-is for now and pass in false.

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One think that would be good to check is how this impacts our TPC-C workload. We inject some carefully calculated stats for TPC-C, and I want to make sure that this won't overwrite those

I checked, and these stub stats do not interfere with the injected stats for TPC-C and other workloads. (cockroach workload fixtures import first runs IMPORT which creates the stub stats, then runs ALTER TABLE INJECT STATISTICS which deletes the stub stats and writes the injected stats.)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt and @rytaft)


pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):

Previously, pbardea (Paul Bardea) wrote…

@michae2, just wanted to mention that I've filed #67987 to track this.

@pbardea thanks again for working on that. Is res now reliable, even across pauses and resumes?


pkg/ccl/importccl/import_stmt.go, line 2392 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

nit: add a TODO explaining that we'd like to collect the real counts eventually

Done.


pkg/sql/create_stats.go, line 317 at r2 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

Until we can get the real distinct counts, I think it would be preferable to not inject multi-column stats. Using the unknown distinct/null ratios for the multi-col stats will make it look like all columns are perfectly correlated, which would probably do more harm than good. So I would just leave this parameter as-is for now and pass in false.

Done.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking!

:lgtm:

Reviewed 3 of 3 files at r3.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt)


pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):

Previously, michae2 (Michael Erickson) wrote…

@pbardea thanks again for working on that. Is res now reliable, even across pauses and resumes?

(cc @dt in case @pbardea doesn't see this 😭)

Copy link
Collaborator Author

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @dt)


pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

(cc @dt in case @pbardea doesn't see this 😭)

Oh! 😢

Addresses: cockroachdb#36273

At the end of an import job we notify the StatsRefresher that it should
collect statistics for the tables we imported into, but it can take a
while for those statistics to actually be collected. Meanwhile, if the
table is new, the optimizer will be operating totally blind. To make
this period slightly less painful, write stub statistics at the end of
import consisting of only row counts.

Release note (performance improvement): collect basic table statistics
during import, to help the optimizer until full statistics collection
completes.
@michae2
Copy link
Collaborator Author

michae2 commented Sep 16, 2021

TFTRs!

bors r=rytaft,dt

@craig
Copy link
Contributor

craig bot commented Sep 16, 2021

Build succeeded:

@craig craig bot merged commit a454599 into cockroachdb:master Sep 16, 2021
@michae2 michae2 deleted the import_stats2 branch September 16, 2021 01:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants