importccl: write stub table statistics during import #67106
Conversation
Reviewed 1 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dt, @michae2, and @rytaft)
pkg/ccl/importccl/import_stmt.go, line 2090 at r1 (raw file):
if err := r.job.SetDetails(ctx, nil /* txn */, details); err != nil {
    return err
}
Is there much value in writing this to the jobs table here? I don't remember the lifecycle of the import job and what happens if it crashes here. It feels to me like either we should update the counts as we ingest, if we're already checkpointing and can restart, or we should just let the work that happens in publishTables read the in-memory state set here.
Very nice! @dt and @ajwerner would be the experts here, but
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt and @michae2)
pkg/ccl/importccl/import_stmt.go, line 1559 at r1 (raw file):
for _, desc := range res {
    tableDesc := tabledesc.NewUnsafeImmutable(desc)
    stats, err := sql.StubTableStats(p.ExecCfg().Settings, tableDesc, "__import__")
nit: I'd assign "import" to a const var. You could put it here:
cockroach/pkg/jobs/jobspb/wrap.go
Line 64 in 59756c4
const AutoStatsName = "__auto__"
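For illustration, a minimal sketch of that suggestion (ImportStatsName is a hypothetical name for the new constant; only AutoStatsName comes from the permalink above):

const (
    // AutoStatsName is the name used for automatically collected statistics.
    AutoStatsName = "__auto__"
    // ImportStatsName would name the stub statistics written by IMPORT.
    ImportStatsName = "__import__" // hypothetical; name not confirmed
)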
pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):
for _, stat := range details.Tables[i].Statistics {
    stat.RowCount += uint64(count)
    stat.DistinctCount += uint64(count)
in the optimizer we use an "unknown distinct count ratio" of 0.1 and "unknown null count ratio" of 0.01:
cockroach/pkg/sql/opt/memo/statistics_builder.go
Line 2752 in 74a99f8
unknownDistinctCountRatio = 0.1
Perhaps we should do that here too?
(This does need some tests, though)
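As a rough illustration, applying those ratios to a stub whose only measured value is the row count might look like this (the exported constant names are assumptions; only the 0.1 and 0.01 values come from the comment above):

// rowCount is the ingested row count; the other counts are guesses derived
// from the optimizer's "unknown" ratios.
stat.RowCount = rowCount
stat.DistinctCount = uint64(float64(rowCount) * memo.UnknownDistinctCountRatio) // 0.1
stat.NullCount = uint64(float64(rowCount) * memo.UnknownNullCountRatio)         // 0.01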
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @dt, @michae2, and @rytaft)
pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
in the optimizer we use an "unknown distinct count ratio" of 0.1 and "unknown null count ratio" of 0.01:
cockroach/pkg/sql/opt/memo/statistics_builder.go
Line 2752 in 74a99f8
unknownDistinctCountRatio = 0.1
Perhaps we should do that here too?
Ah, thank you, I was wondering about this. I will use those ratios!
pkg/ccl/importccl/import_stmt.go, line 2090 at r1 (raw file):
Previously, ajwerner wrote…
if err := r.job.SetDetails(ctx, nil /* txn */, details); err != nil {
    return err
}
Is there much value in writing this to the jobs table here? I don't remember the lifecycle of the import job and what happens if it crashes here. It feels to me like either we should update the counts as we ingest, if we're already checkpointing and can restart, or we should just let the work that happens in publishTables read the in-memory state set here.
Great question. I'm not sure why, but without this write the counts are lost before publishTables, even if I do not pause the job. But yes it seems like a kludge. You are right, the counts should be written in some kind of checkpoint so that pausing and restarting work.
I'm still trying to understand why this write is necessary, and how pausing and restarting work, which is the main reason this PR is still a draft! 😅
pkg/ccl/importccl/import_stmt.go
Outdated
 }
 r.res.DataSize = res.DataSize
 for id, count := range res.EntryCounts {
-    if _, ok := pkIDs[id]; ok {
+    if i, ok := pkIDs[id]; ok {
         r.res.Rows += count
So unfortunately r.res (and res) can't really be trusted.

It is accumulated and returned from the distsql flow that does the data ingestion, and is indeed reported to the user as what was imported, but importantly the user is only shown "results" if the job runs straight through in one execution. If it is paused or hits an error, the user-run statement that started it gets an error -- if the job is then resumed/retried, it will pick up from where it left off using persisted progress information, starting a new flow that will only ingest whatever is left to do, and that flow will return a res that only includes what it ingested, not the sum of what was ingested by all flows run by this job.

We don't currently worry about the fact that res isn't really the actual full IMPORT results when reporting the user-facing results, because if a job is resumed, there is no user session attached to it that is expecting to see the results (the one that started it got a paused or lost-lease error already, and RESUME JOB just says it was resumed, it doesn't show job-specific results).

So while we currently get away with res not being correct for resumed jobs just for user-result reporting, it seems like it could be very bad to rely on it in table stats. E.g. if I pause a giant import at 99%, then resume and complete it, using res as it stands today would mean we'd record table stats with a count that is only 1% of the actual table count. That seems like it could mislead the optimizer pretty badly, right?

I think we'll need to persist partial totals as we go, in the job progress field. We were vaguely planning to do this anyway, so that you could see what was imported after the fact by an IMPORT job, particularly a detached import job, but I guess we might need to do it sooner rather than later if you want to use those counts here.
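To make the "persist partial totals" idea concrete, here is a rough sketch (hypothetical bookkeeping, not the actual mechanism): each flow's summary is folded into running totals kept alongside the job's progress, so a resumed job sees the sum of all flows rather than only the last flow's res.

// totals holds running counts loaded from the job's progress, keyed by
// BulkOpSummaryID just like res.EntryCounts (hypothetical bookkeeping).
if totals == nil {
    totals = make(map[uint64]int64)
}
for id, count := range res.EntryCounts {
    totals[id] += count // fold this flow's counts into the running totals
}
// totals would then be written back with the job's progress before this flow's
// results are discarded.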
It could definitely mislead the optimizer, although odds are it's probably still better than our current default of 1000 rows. Is there any way to tell whether the job has been paused? If so, it seems like we could use the following logic:
if job was never paused {
    // use row count from res
} else {
    // use row count from res if > 1000
}
Obviously it would be ideal to persist the partial totals, but maybe this could be a good stopgap in the meantime?
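For concreteness, a hedged Go rendering of that stopgap (wasResumed and defaultRowCountEstimate are assumed names; one way to compute wasResumed is sketched in the next comment):

// Trust the count from res unless the job was resumed and the count no longer
// beats the optimizer's default estimate of 1000 rows.
const defaultRowCountEstimate = 1000
rowCount := uint64(res.EntryCounts[id])
useResCount := !wasResumed || rowCount > defaultRowCountEstimate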
I might take a stab at persisting partial results next week and then just make res correct on resumes too.
In the meantime, if you wanted to do that, I think you could do something like this, before calling ingest:
var wasResumed bool
initialProgress := r.job.Progress()
for _, pos := range initialProgress.GetImport().ResumePos {
    wasResumed = wasResumed || pos > 0
}
In any case, I don't think we need to persist the stats stubs in the job details -- we can just derive it at the end. Also, if/when we do persist something that changes during the execution of the job, we'd want to do so in progress, not details, to utilize the separate column family.
Thank you, this is really helpful. I suspected this was the case, but wasn't totally sure. I can take a look at persisting row counts in the progress field as part of this PR, but if you get to it first that is fine, too.
I'm generating stats stubs during preparation, rather than during table publishing in anticipation of adding distinct and null counts. Unlike row counts, distinct and null counts are specific to the column(s) counted, and so we would need to know the exact column sets being counted before performing the import.
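Roughly, the two-phase shape would look like this (a sketch using names from the snippets above, not the actual code):

// Phase 1, during preparation: create per-column-set stubs before ingesting data.
stats, err := sql.StubTableStats(p.ExecCfg().Settings, tableDesc, "__import__")
if err != nil {
    return err
}
// Phase 2, after ingestion: fill in the counts once the table's row count is
// known (distinct/null counts derived via the unknown ratios discussed above).
for i := range stats {
    stats[i].RowCount = rowCount
}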
@michae2 I think there might be an existing bug here already -- caused by the addition of a retry loop around ingest -- that points to the fact that we need to persist res anyway. If you want to just wait for us to do that, then rebase, I think you can use it once it is fixed without needing to persist anything related to stats in the job progress.
Sounds good!
Previously, dt wrote…
@michae2 I think there might be an existing bug here already -- caused by the addition of a retry loop around ingest -- that points to the fact that we need to persist res anyway. If you want to just wait for us to do that, then rebase, I think you can use it once it is fixed without needing to persist anything related to stats in the job progress.
@dt Just wondering, has anyone had a chance to work on this? (Looks like not yet, based on revision history?)
Sorry, I don't think so -- I've been OOO and I don't think anyone else picked it up yet.
That's fine, it's not urgent.
TFTRs! bors r+
👎 Rejected by label
whoops, commented in the wrong tab, sorry! bors r-
Force-pushed from 49d2166 to 4741230.
Ok, I think this is RFAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner, @dt, and @rytaft)
pkg/ccl/importccl/import_stmt.go, line 1559 at r1 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: I'd assign "import" to a const var. You could put it here:
cockroach/pkg/jobs/jobspb/wrap.go
Line 64 in 59756c4
const AutoStatsName = "__auto__"
Done.
pkg/ccl/importccl/import_stmt.go, line 2082 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
Ah, thank you, I was wondering about this. I will use those ratios!
Done.
Nice! One thing that would be good to check is how this impacts our TPC-C workload. We inject some carefully calculated stats for TPC-C, and I want to make sure that this won't overwrite those:
var fixturesImportInjectStats = fixturesImportCmd.PersistentFlags().Bool(
Reviewed 9 of 9 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt and @michae2)
pkg/ccl/importccl/import_stmt.go, line 2392 at r2 (raw file):
id := roachpb.BulkOpSummaryID(uint64(desc.GetID()), uint64(desc.GetPrimaryIndexID()))
rowCount := uint64(res.EntryCounts[id])
distinctCount := uint64(float64(rowCount) * memo.UnknownDistinctCountRatio)
nit: add a TODO explaining that we'd like to collect the real counts eventually
pkg/sql/create_stats.go, line 317 at r2 (raw file):
// plus any other boolean or enum columns (where the "histogram" is tiny).
func createStatsDefaultColumns(
    desc catalog.TableDescriptor, multiColEnabled bool,
Until we can get the real distinct counts, I think it would be preferable to not inject multi-column stats. Using the unknown distinct/null ratios for the multi-col stats will make it look like all columns are perfectly correlated, which would probably do more harm than good. So I would just leave this parameter as-is for now and pass in false.
Force-pushed from 4741230 to 0cd1e2a.
Force-pushed from 0cd1e2a to 6fc090a.
One thing that would be good to check is how this impacts our TPC-C workload. We inject some carefully calculated stats for TPC-C, and I want to make sure that this won't overwrite those
I checked, and these stub stats do not interfere with the injected stats for TPC-C and other workloads. (cockroach workload fixtures import first runs IMPORT, which creates the stub stats, then runs ALTER TABLE INJECT STATISTICS, which deletes the stub stats and writes the injected stats.)
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dt and @rytaft)
pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):
Previously, pbardea (Paul Bardea) wrote…
@michae2, just wanted to mention that I've filed #67987 to track this.
@pbardea thanks again for working on that. Is res now reliable, even across pauses and resumes?
pkg/ccl/importccl/import_stmt.go, line 2392 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
nit: add a TODO explaining that we'd like to collect the real counts eventually
Done.
pkg/sql/create_stats.go, line 317 at r2 (raw file):
Previously, rytaft (Rebecca Taft) wrote…
Until we can get the real distinct counts, I think it would be preferable to not inject multi-column stats. Using the unknown distinct/null ratios for the multi-col stats will make it look like all columns are perfectly correlated, which would probably do more harm than good. So I would just leave this parameter as-is for now and pass in false.
Done.
Thanks for checking!
Reviewed 3 of 3 files at r3.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt)
pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):
Previously, michae2 (Michael Erickson) wrote…
@pbardea thanks again for working on that. Is res now reliable, even across pauses and resumes?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @dt)
pkg/ccl/importccl/import_stmt.go, line 2079 at r1 (raw file):
Oh! 😢
Force-pushed from 6fc090a to ac0976c.
TFTRs! bors r=rytaft,dt
Build succeeded.
Addresses: #36273
At the end of an import job we notify the StatsRefresher that it should
collect statistics for the tables we imported into, but it can take a
while for those statistics to actually be collected. Meanwhile, if the
table is new, the optimizer will be operating totally blind. To make
this period slightly less painful, write stub statistics at the end of
import consisting of only row counts.
Release note (performance improvement): collect basic table statistics
during import, to help the optimizer until full statistics collection
completes.