sqlccl: merge IMPORT and RESTORE jobs #21490
Conversation
@tschottdorf Adding you to review my usage of AddSSTable to make sure it's correct since Dan is out.
Benchmark: 4-node geo-distributed cluster doing an 8-way split SF-10 TPCH IMPORT
Review status: 0 of 3 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful. pkg/ccl/sqlccl/csv.go, line 1543 at r1 (raw file):
The restore code has a thing that computes stats after this. Do we need that here? I'm not familiar enough with this to know. Comments from Reviewable
Reviewed 1 of 3 files at r1, 1 of 2 files at r2. pkg/ccl/sqlccl/csv.go, line 1543 at r1 (raw file): Previously, mjibson (Matt Jibson) wrote…
As far as correctness, I think I can certainly see wanting to know how many rows I IMPORTed, and wouldn't mind the other numbers on index entries and byte size either. We could probably accumulate these during the aggregation / SST creation step above using a BulkOpSummary? Anyway, don't think it needs to block this change. pkg/ccl/sqlccl/csv.go, line 1540 at r2 (raw file):
Comment is stale here (i.e. not in RESTORE); could maybe just remove it. We could mention, as a future optimization, trying to split and scatter on the coordinator so that the distsql collector can be scheduled on the actual ingesting node instead of adding another network trip for the AddSSTable, but that's for later either way.
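To make the BulkOpSummary idea from the thread above concrete, here is a minimal Go sketch of accumulating row/index-entry/byte counters while KVs are emitted during SST creation, instead of re-scanning afterward. The struct and field names are illustrative stand-ins, not the actual proto:

```go
package main

import "fmt"

// opSummary mirrors the idea of a BulkOpSummary: counters accumulated
// while KVs are generated. Field names here are illustrative.
type opSummary struct {
	DataSize     int64
	Rows         int64
	IndexEntries int64
}

// add records one emitted KV; primary-index KVs count as rows,
// everything else as index entries.
func (s *opSummary) add(isPrimary bool, keyLen, valLen int) {
	s.DataSize += int64(keyLen + valLen)
	if isPrimary {
		s.Rows++
	} else {
		s.IndexEntries++
	}
}

func main() {
	var s opSummary
	// Pretend we emitted two primary-index KVs and one secondary-index KV.
	s.add(true, 10, 20)
	s.add(true, 12, 18)
	s.add(false, 8, 0)
	fmt.Printf("rows=%d indexEntries=%d bytes=%d\n", s.Rows, s.IndexEntries, s.DataSize)
}
```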
Review status: 2 of 3 files reviewed at latest revision, 2 unresolved discussions, all commit checks successful. pkg/ccl/sqlccl/csv.go, line 1543 at r1 (raw file): Previously, dt (David Taylor) wrote…
Yes, I'd like to do this in a followup commit. pkg/ccl/sqlccl/csv.go, line 1540 at r2 (raw file): Previously, dt (David Taylor) wrote…
I've changed RESTORE to IMPORT, but I've otherwise kept the comment the same because afaik it's still true, or at least untested. Yes, we could pre-split everything in the coordinator, but that would then take up 10s of minutes just splitting, because we'd have to do it all up front. Your proposal would also require the distsql plan to know which nodes are in the correct zone to be able to store the table, whereas the current code allows the scatter implementation to determine that.
Review status: 2 of 3 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.
Running an SF-100 import hit an error: pq: ADDSSTable [/Table/51/1/10190243/2/0, /Table/51/1/10462464/1/0/NULL]: result is ambiguous (removing replica)
Current error: pq: addsstable [/Table/51/1/4938598/4/0,/Table/51/1/6192549/6/0/NULL): command is too large: 75242286 bytes (max: 67108864)
Ok, all bugs fixed I think. Changed some stuff around AddSSTable to retry if there was a split, and to split up commands larger than the max raft size. Tested with a 100GB SF IMPORT and it worked correctly over 8 hours. RFAL
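A rough Go sketch of the "split up commands larger than max raft size" part described above, assuming greedy batching over sorted KVs. The actual change splits the SSTables themselves and also retries when a range split moves a boundary; the names and the shrunken size limit here are illustrative only:

```go
package main

import "fmt"

// maxRaftCommandSize stands in for the 64 MiB limit behind the
// "command is too large" error above; shrunk here for the demo.
const maxRaftCommandSize = 64

type kv struct {
	key  string
	size int
}

// splitBatches greedily splits a sorted run of KVs into batches whose
// total size stays under max, so each AddSSTable-style command fits in
// a single raft proposal.
func splitBatches(kvs []kv, max int) [][]kv {
	var out [][]kv
	var cur []kv
	curSize := 0
	for _, e := range kvs {
		if curSize+e.size > max && len(cur) > 0 {
			out = append(out, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, e)
		curSize += e.size
	}
	if len(cur) > 0 {
		out = append(out, cur)
	}
	return out
}

func main() {
	kvs := []kv{{"a", 30}, {"b", 30}, {"c", 30}, {"d", 10}}
	batches := splitBatches(kvs, maxRaftCommandSize)
	fmt.Println(len(batches)) // → 2: [a b] fits under 64, then [c d]
}
```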
Sorry to let this sit, Matt. I'll take a look first thing tomorrow.
LGTM on the AddSSTable part, mod a comment. Reviewed 1 of 2 files at r2, 4 of 4 files at r3. pkg/ccl/sqlccl/csv.go, line 1559 at r3 (raw file):
Heads up that I added a use of |
Previously the IMPORT statement would create both an IMPORT and a RESTORE job (unless transform_only was specified). This created some problems: it forced a temp directory to be specified and prevented IMPORT from being a real job due to how the job code currently works.

Change IMPORT to instead directly ingest sstables as soon as they are created, thus removing the need for the RESTORE job. This new functionality is only available in distributed mode. Make distributed the default, since it is now well tested enough that we have high enough confidence in it to flip that switch. Add a local option to use local import instead. (If a user uses local, and thus must also use transform, they will need to use a normal RESTORE job to import the produced backups, which requires a CCL license.)

This change allows for some optimizations during sst generation, like preallocating the table ID and generating data with it, thus removing the need for any key rewriting, which has historically been the limiting speed factor during restores.

Remove transform_only and temp, merging them into a new transform option that performs only the transform.

This commit does not change IMPORT jobs to be handled by the registry, which means that a failure could leave orphaned data.

While here, fix up subtests in TestImportStmt to use correct indexes for jobs, and database and directory names. While in the vicinity, fix a needless function call in a loop in import.go.

Release note (enterprise): IMPORT CSV has had its required `temp` parameter removed. In addition, the `transform_only` option has been renamed to `transform` and now takes an argument specifying where to store transformed CSV data (but keeps its behavior that, if specified, only the transform part of the IMPORT will be performed). Finally, IMPORT no longer creates a RESTORE job, but instead directly restores data itself.

For most uses of IMPORT, simply removing the `temp` option from the query will achieve the same result as before.
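To illustrate why preallocating the table ID removes the key-rewriting step mentioned above: if the final table ID is known when the SSTs are built, keys can be encoded with the real prefix up front and ingested verbatim. This is a toy Go sketch; the varint encoding is only a stand-in for the actual key codec:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeRowKey builds a key with the table ID as its prefix. With the
// ID reserved ahead of time, the SST writer emits final keys directly
// and no rewrite pass is needed at ingestion time. Illustrative only.
func encodeRowKey(tableID uint64, rowID uint64) []byte {
	buf := make([]byte, 0, 16)
	buf = binary.AppendUvarint(buf, tableID)
	buf = binary.AppendUvarint(buf, rowID)
	return buf
}

func main() {
	// Table ID 51 was preallocated before SST generation began.
	k := encodeRowKey(51, 7)
	fmt.Printf("%v\n", k) // small IDs encode as single varint bytes
}
```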
Rebased for #21373 fix. Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful. pkg/ccl/sqlccl/csv.go, line 1559 at r3 (raw file): Previously, tschottdorf (Tobias Schottdorf) wrote…
@danhhz @tschottdorf Do I need to worry about MVCC stats after calling addsstable? I have no idea what those are and saw some code in restore dealing with them. Just want to make sure this is safe to merge if the tests pass.
AddSSTable handles updating them for you, so there shouldn't be anything to worry about at the RESTORE/IMPORT level.
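For readers who, like the question above, aren't familiar with MVCC stats: they are per-range bookkeeping (key/value bytes, key counts, and so on) that the AddSSTable command computes and applies server-side, which is why the IMPORT/RESTORE layer doesn't touch them. A toy Go sketch of that kind of delta, with illustrative field names:

```go
package main

import "fmt"

// mvccStatsDelta is a toy stand-in for the stats delta AddSSTable
// computes over the SST's contents and applies with the command.
type mvccStatsDelta struct {
	KeyBytes, ValBytes, KeyCount int64
}

// statsForSST tallies the delta for a batch of (keyLen, valLen) pairs.
func statsForSST(keys [][2]int) mvccStatsDelta {
	var d mvccStatsDelta
	for _, kv := range keys {
		d.KeyBytes += int64(kv[0])
		d.ValBytes += int64(kv[1])
		d.KeyCount++
	}
	return d
}

func main() {
	d := statsForSST([][2]int{{10, 100}, {12, 80}})
	fmt.Println(d.KeyCount, d.KeyBytes, d.ValBytes)
}
```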