-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bulkio/import: performance difference between IMPORT and IMPORT INTO? #50731
Comments
INTO and regular IMPORT both need it to catch uniqueness violations within the the data being imported -- the table being empty initially isn't enough to skip the check since it is non-empty once we flush the first SST to it, so we don't really have a choice. Also, as you point out, if we don't disallow collisions, we can't trust stats are actually additive, so we'd pay for it in stat computation. Note: we don't really need an empty-initial-table special case since there's already a fast-path around the individual ingest's check if its target span is empty (which it should be for an empty table and sorted input): https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/batcheval/cmd_add_sstable.go#L235 |
Great, thanks for the answer. Any reason to keep this open then? |
There is a performance difference between them on failure -- both need to return to the pre-IMPORT state but for INTO, that means finding and rolling back the keys it changed which is O(table size), while for regular import we can just drop the table which is much much cheaper. We could keep track of if the initial table was truly empty so that INTO's revert could have a special-case of TRUNCATE in that case but it is a pretty extreme edge case. I'd rather just say that if your import of a branch new table is big enough that reverting it is expensive, you should use regular import and not INTO. But this is all just on failure, so hopefully an edge case of an edge case. There is indeed no differences in the part you identified. |
Is this what you ended up doing in #52754? |
Yeah, that's being tracked in #51512 |
Nice, looking forward to this change! It would have saved me a lot of time in some of the earlier TPC-E testing. Also, you might want to reference #51512 somewhere on the PR. |
Yeah, I'll add it once it is good to go, i just hate putting it in the commit early since it spams the issue every time you push |
More of a question than an issue. I thought that only IMPORT INTO would need to
DisallowShadowing
, which has implications on mvcc stats handling, but looking at the code, that doesn't appear to be true.If this is true though, should we consider a fast-path for IMPORT INTO into empty tables? Given the ergonomics of IMPORT, which requires the table schema and import arguments to be combined into a single statement, it is often easier to use IMPORT INTO, even when initializing a fresh dataset. This is especially true when initialization is performed programmatically because IMPORT INTO allows you to break it up into a two-step process: 1) create schema for all tables, 2) IMPORT INTO all tables.
So if there is a performance difference between the two, I think we should consider optimizing the empty IMPORT INTO case. It seems easy enough to check if the table is completely empty after taking it offline.
If there's not a difference between the two, feel free to close.
The text was updated successfully, but these errors were encountered: