bulkio/import: performance difference between IMPORT and IMPORT INTO? #50731

nvanbenschoten · 2020-06-27T17:41:26Z

More of a question than an issue. I thought that only IMPORT INTO would need to DisallowShadowing, which has implications on mvcc stats handling, but looking at the code, that doesn't appear to be true.

If this is true though, should we consider a fast-path for IMPORT INTO into empty tables? Given the ergonomics of IMPORT, which requires the table schema and import arguments to be combined into a single statement, it is often easier to use IMPORT INTO, even when initializing a fresh dataset. This is especially true when initialization is performed programmatically because IMPORT INTO allows you to break it up into a two-step process: 1) create schema for all tables, 2) IMPORT INTO all tables.

So if there is a performance difference between the two, I think we should consider optimizing the empty IMPORT INTO case. It seems easy enough to check if the table is completely empty after taking it offline.

If there's not a difference between the two, feel free to close.

The text was updated successfully, but these errors were encountered:

dt · 2020-06-27T20:50:48Z

INTO and regular IMPORT both need it to catch uniqueness violations within the the data being imported -- the table being empty initially isn't enough to skip the check since it is non-empty once we flush the first SST to it, so we don't really have a choice.

Also, as you point out, if we don't disallow collisions, we can't trust stats are actually additive, so we'd pay for it in stat computation.

Note: we don't really need an empty-initial-table special case since there's already a fast-path around the individual ingest's check if its target span is empty (which it should be for an empty table and sorted input): https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/batcheval/cmd_add_sstable.go#L235

nvanbenschoten · 2020-06-29T18:48:04Z

Great, thanks for the answer. Any reason to keep this open then?

dt · 2020-06-29T21:39:31Z

There is a performance difference between them on failure -- both need to return to the pre-IMPORT state but for INTO, that means finding and rolling back the keys it changed which is O(table size), while for regular import we can just drop the table which is much much cheaper.

We could keep track of if the initial table was truly empty so that INTO's revert could have a special-case of TRUNCATE in that case but it is a pretty extreme edge case. I'd rather just say that if your import of a branch new table is big enough that reverting it is expensive, you should use regular import and not INTO.

But this is all just on failure, so hopefully an edge case of an edge case. There is indeed no differences in the part you identified.

nvanbenschoten · 2020-08-13T15:11:37Z

We could keep track of if the initial table was truly empty so that INTO's revert could have a special-case of TRUNCATE in that case but it is a pretty extreme edge case.

Is this what you ended up doing in #52754?

dt · 2020-08-13T15:14:20Z

Yeah, that's being tracked in #51512

nvanbenschoten · 2020-08-13T15:23:23Z

Nice, looking forward to this change! It would have saved me a lot of time in some of the earlier TPC-E testing.

Also, you might want to reference #51512 somewhere on the PR.

dt · 2020-08-13T15:32:55Z

Yeah, I'll add it once it is good to go, i just hate putting it in the commit early since it spams the issue every time you push

nvanbenschoten added A-disaster-recovery C-question A question rather than an issue. No code/spec/doc change needed. labels Jun 27, 2020

nvanbenschoten assigned dt Jun 27, 2020

nvanbenschoten mentioned this issue Jun 27, 2020

bulkio/import: bug when mixing existing and new tables? #50733

Closed

dt removed their assignment Jun 29, 2020

dt closed this as completed Jun 29, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bulkio/import: performance difference between IMPORT and IMPORT INTO? #50731

bulkio/import: performance difference between IMPORT and IMPORT INTO? #50731

nvanbenschoten commented Jun 27, 2020

dt commented Jun 27, 2020

nvanbenschoten commented Jun 29, 2020

dt commented Jun 29, 2020 •

edited

Loading

nvanbenschoten commented Aug 13, 2020

dt commented Aug 13, 2020

nvanbenschoten commented Aug 13, 2020

dt commented Aug 13, 2020

bulkio/import: performance difference between IMPORT and IMPORT INTO? #50731

bulkio/import: performance difference between IMPORT and IMPORT INTO? #50731

Comments

nvanbenschoten commented Jun 27, 2020

dt commented Jun 27, 2020

nvanbenschoten commented Jun 29, 2020

dt commented Jun 29, 2020 • edited Loading

nvanbenschoten commented Aug 13, 2020

dt commented Aug 13, 2020

nvanbenschoten commented Aug 13, 2020

dt commented Aug 13, 2020

dt commented Jun 29, 2020 •

edited

Loading