Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bulkio/import: performance difference between IMPORT and IMPORT INTO? #50731

Closed
nvanbenschoten opened this issue Jun 27, 2020 · 7 comments
Closed
Labels
A-disaster-recovery C-question A question rather than an issue. No code/spec/doc change needed.

Comments

@nvanbenschoten
Copy link
Member

More of a question than an issue. I thought that only IMPORT INTO would need to DisallowShadowing, which has implications on mvcc stats handling, but looking at the code, that doesn't appear to be true.

If this is true though, should we consider a fast-path for IMPORT INTO into empty tables? Given the ergonomics of IMPORT, which requires the table schema and import arguments to be combined into a single statement, it is often easier to use IMPORT INTO, even when initializing a fresh dataset. This is especially true when initialization is performed programmatically because IMPORT INTO allows you to break it up into a two-step process: 1) create schema for all tables, 2) IMPORT INTO all tables.

So if there is a performance difference between the two, I think we should consider optimizing the empty IMPORT INTO case. It seems easy enough to check if the table is completely empty after taking it offline.

If there's not a difference between the two, feel free to close.

@dt
Copy link
Member

dt commented Jun 27, 2020

INTO and regular IMPORT both need it to catch uniqueness violations within the the data being imported -- the table being empty initially isn't enough to skip the check since it is non-empty once we flush the first SST to it, so we don't really have a choice.

Also, as you point out, if we don't disallow collisions, we can't trust stats are actually additive, so we'd pay for it in stat computation.

Note: we don't really need an empty-initial-table special case since there's already a fast-path around the individual ingest's check if its target span is empty (which it should be for an empty table and sorted input): https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/batcheval/cmd_add_sstable.go#L235

@dt dt removed their assignment Jun 29, 2020
@nvanbenschoten
Copy link
Member Author

Great, thanks for the answer. Any reason to keep this open then?

@dt
Copy link
Member

dt commented Jun 29, 2020

There is a performance difference between them on failure -- both need to return to the pre-IMPORT state but for INTO, that means finding and rolling back the keys it changed which is O(table size), while for regular import we can just drop the table which is much much cheaper.

We could keep track of if the initial table was truly empty so that INTO's revert could have a special-case of TRUNCATE in that case but it is a pretty extreme edge case. I'd rather just say that if your import of a branch new table is big enough that reverting it is expensive, you should use regular import and not INTO.

But this is all just on failure, so hopefully an edge case of an edge case. There is indeed no differences in the part you identified.

@dt dt closed this as completed Jun 29, 2020
@nvanbenschoten
Copy link
Member Author

We could keep track of if the initial table was truly empty so that INTO's revert could have a special-case of TRUNCATE in that case but it is a pretty extreme edge case.

Is this what you ended up doing in #52754?

@dt
Copy link
Member

dt commented Aug 13, 2020

Yeah, that's being tracked in #51512

@nvanbenschoten
Copy link
Member Author

Nice, looking forward to this change! It would have saved me a lot of time in some of the earlier TPC-E testing.

Also, you might want to reference #51512 somewhere on the PR.

@dt
Copy link
Member

dt commented Aug 13, 2020

Yeah, I'll add it once it is good to go, i just hate putting it in the commit early since it spams the issue every time you push

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-disaster-recovery C-question A question rather than an issue. No code/spec/doc change needed.
Projects
None yet
Development

No branches or pull requests

2 participants