
importccl: prototype support for IMPORT INTO #37451

Merged (1 commit) May 21, 2019

Conversation

@dt (Member) commented May 10, 2019

This adds a prototype of incremental IMPORT, allowing importing CSV data
into an existing table, as opposed to only into a new table as with the
current IMPORT.

Unlike traditional IMPORT, which takes a specification of the table to
create, this takes a reference to an existing table into which it will
import data. Initially only CSV data, importing into a single table, is
supported (SQL dumpfiles are typically dumps of an entire table, so it
seems less likely that we need to support them here for now).

Since the actual bulk ingestion is done via non-transactional AddSSTable
commands, the table must be taken offline during ingestion. The IMPORT
job begins by schema-changing the table to an offline 'IMPORTING' state
that should prevent leasing it, and moves it back to public when it
finishes (on both success and failure, unlike a newly created table,
which is usually rolled back via a drop on failure).

This is a prototype, and as such it has many unfinished pieces (some of
which are captured as TODOs). Given the number of unresolved UX questions
around this feature, we wanted to start playing with something concrete
to help guide the feature specification and further development.

A few of the more significant areas still to be resolved: how the hook
uses transactions for table creation, resolution, and alteration versus
job creation; resolving, checking, and plumbing specified subsets of the
columns to IMPORT into; handling rollbacks of partial ingestion; updating
job status messages; and a careful audit of everywhere that table
descriptors are acquired to ensure the IMPORTING state is handled
correctly. That said, these are mostly separable issues that make better
standalone follow-up changes (and can be done in parallel).

Release note: none.

@dt dt requested review from maddyblue, thoszhang, vivekmenezes and a team May 10, 2019 15:13

@maddyblue (Contributor) left a comment


What's here looks good, but I'm a bit confused by how partial it is. I don't recall the importStmt.Into field existing. Is that new in this work or did that get added a while ago and I forget? Should there be tests here? If you are going to do this in small pieces where it won't actually work until the end then LGTM and merge away.

importing.Version++
if err := p.ExecCfg().DB.Txn(ctx, func(ctx context.Context, txn *client.Txn) error {
return errors.Wrap(
txn.CPut(ctx, sqlbase.MakeDescMetadataKey(found.TableDescriptor.ID),
@maddyblue (Contributor):

how does this interact with other schema changes running on the table?

@dt (Member Author):

the cput of the whole desc from before to after should avoid any concurrent writes, but after that we're not yet doing anything to actively lock-out other schema changes. We'll need to.
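The compare-and-put pattern dt describes can be illustrated with a toy in-memory store; this is not CockroachDB's client.Txn API, just the semantics: a descriptor write succeeds only if the stored value still matches what was read, so a concurrent schema change makes the write fail rather than be silently clobbered.

```go
// Sketch of conditional-put (CPut) semantics over a descriptor key.
package main

import "fmt"

type store struct{ m map[string]string }

// cput writes newVal at key only if the current value equals expVal.
func (s *store) cput(key, expVal, newVal string) error {
	if cur := s.m[key]; cur != expVal {
		return fmt.Errorf("value changed: have %q, expected %q", cur, expVal)
	}
	s.m[key] = newVal
	return nil
}

func main() {
	s := &store{m: map[string]string{"desc/52": "v1:public"}}

	// The IMPORT job reads the descriptor; then a concurrent schema
	// change commits first.
	read := s.m["desc/52"]
	_ = s.cput("desc/52", read, "v2:adding-column")

	// The job's attempt to move the table to IMPORTING now fails
	// instead of overwriting the concurrent change.
	err := s.cput("desc/52", read, "v2:importing")
	fmt.Println(err != nil) // prints: true
}
```

As dt notes, this protects against a racing write at the moment of the transition, but it does not by itself lock out schema changes that start after the table is offline; that still needs separate work.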

if err != nil {
// Take the table offline for import.
importing := found.TableDescriptor
importing.State = sqlbase.TableDescriptor_IMPORTING
@maddyblue (Contributor):

What does this do to the table? Users cannot read or write it during this time?

@dt (Member Author):

yeah, similar to ADD state: can't lease it in SQL code so no reads or writes.
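Why the offline state blocks reads and writes can be sketched as a gate at lease acquisition: SQL must lease a descriptor before using it, and the lease path can simply refuse any descriptor that is not public. Hypothetical names below; CockroachDB's real lease manager is far more involved.

```go
// Sketch: lease acquisition refusing non-public table descriptors, so
// queries against an IMPORTING (or ADD-state) table fail rather than
// observing partially ingested data.
package main

import "fmt"

type state string

const (
	public    state = "PUBLIC"
	importing state = "IMPORTING"
)

type desc struct {
	name  string
	state state
}

// acquireLease grants a lease only on public descriptors.
func acquireLease(d desc) error {
	if d.state != public {
		return fmt.Errorf("table %q is offline: %s", d.name, d.state)
	}
	return nil
}

func main() {
	fmt.Println(acquireLease(desc{"orders", importing}) != nil) // prints: true
	fmt.Println(acquireLease(desc{"orders", public}))           // prints: <nil>
}
```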

@dt dt force-pushed the import-into branch 2 times, most recently from 6cb1ca6 to 8e0eef7 Compare May 15, 2019 13:12
@dt (Member Author) commented May 15, 2019

@mjibson certainly partial: for now, trying to get a functioning skeleton in and then we'll fill in the gaps in standalone changes, but we want to get a prototype into the hands of product or design partners in time to collect feedback early in the development.

@dt (Member Author) commented May 21, 2019

bors r+

craig bot pushed a commit that referenced this pull request May 21, 2019
37451:  importccl: prototype support for IMPORT INTO r=dt a=dt


Co-authored-by: David Taylor <[email protected]>
craig bot commented May 21, 2019

Build succeeded

4 participants