importccl: support default expressions #48253
Expanding a bit on this here, I think in particular we want to support the following. Given: […]
I think to start we should at least implement: […]
I think it would also be reasonable to expect the following to work: […]
I wonder if replacing […] vs the following (where […])
Yep - good point. Default values can indeed be nullable, and are only populated when the column is not included in an insert statement.
Reopened again: the linked PR #50295 only handles constant default expressions; the next step will be to support other, non-constant expressions.
importccl: support default expressions like now(), localtimestamp, transaction_timestamp, current_date

This PR follows up from cockroachdb#50295 to add support for functions concerning current timestamps, which include now(), localtimestamp(), transaction_timestamp(), and current_date(). This is achieved by injecting the walltime recorded at `importCtx` into the evalCtx required for evaluating these functions. Partially addresses cockroachdb#48253. Release note (general change): timestamp functions are now supported by IMPORT INTO.
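As a rough illustration of that mechanism (the type and field names below are hypothetical stand-ins, not the actual importccl types), pinning the evaluation context to a single recorded walltime is what makes timestamp defaults deterministic across rows, processors, and resumptions:

```go
package main

import (
	"fmt"
	"time"
)

// evalCtx is a stand-in for the evaluation context that IMPORT hands to
// default-expression evaluation; the real types live in pkg/ccl/importccl
// and pkg/sql/sem/tree.
type evalCtx struct {
	walltime time.Time // recorded once when the import is planned
}

// now evaluates now()/current_timestamp against the pinned walltime rather
// than the wall clock, so every row (and every resumption) sees one value.
func (c *evalCtx) now() time.Time { return c.walltime }

func main() {
	ctx := evalCtx{walltime: time.Unix(1600000000, 0).UTC()}
	for row := 0; row < 3; row++ {
		fmt.Printf("row %d: %v\n", row, ctx.now()) // same timestamp each time
	}
}
```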
Strategy for `nextval`:

The last point, however, is difficult (if not impractical) to implement. Strategy: from a table descriptor, find the columns using any sequence and the corresponding sequences. Notice that this is a modification of the previous strategy, which updated the sequence only at the end of an IMPORT; that strategy would use 2931 as the next value for the sequence. Overall, this strategy should be safe to adopt as it maintains the uniqueness property; a drawback is that it could leave a large gap in the sequence (say we allocate 1000 values but only 10 rows are imported: that's a gap of 990 values).
Separate from this: once this is supported, it's not obvious which sequence value a row should get as its default expression; then it's not even clear what the value should be.
This sounds about right, but I don't know if we need the main proc / coordinator in the picture at all -- when we run out of IDs in a given processor, we'll need to reserve more, and, as you say, that reservation needs to be done atomically from the point of view of any other traffic to the sequence unrelated to the IMPORT, so we'll use a txn. If we have a txn, we don't really need to go back to the coordinator to coordinate: if multiple processors need a chunk at the same time, the txn should serialize them on its own, and that avoids a lot of mess trying to do two-way data flow / RPCs between processors and the coordinator. The other thing I'd do is start with a smaller chunk size -- e.g. reserve 10 initially, then on subsequent reservations go back for an order of magnitude more each time until we hit, say, 100k. That way small tables of 5 values don't blow a 100k hole in the sequence, but big tables don't spend all their time contending on sequence reservations.
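A minimal sketch of that sizing policy, assuming a starting chunk of 10 and a cap of 100k (the exact constants an implementation picks may differ):

```go
package main

import "fmt"

const (
	initialChunkSize = 10     // small, so tiny imports don't waste IDs
	maxChunkSize     = 100000 // large, so big imports rarely contend
)

// nextChunkSize grows each subsequent reservation by an order of magnitude,
// starting at initialChunkSize and capping at maxChunkSize.
func nextChunkSize(prev int64) int64 {
	if prev == 0 {
		return initialChunkSize
	}
	if next := prev * 10; next < maxChunkSize {
		return next
	}
	return maxChunkSize
}

func main() {
	var size int64
	for i := 0; i < 6; i++ {
		size = nextChunkSize(size)
		fmt.Println(size) // 10, 100, 1000, 10000, 100000, 100000
	}
}
```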
If you have a txn during an IMPORT, won't that block any INSERTs from happening at the same time? Is that OK? Also, unless it's changed, sequence increments happen in their own internal txn such that they are atomic but not transactional (with respect to any user-created transaction). That is, an aborted transaction that had an INSERT would still result in an incremented sequence.
Oh, it's for the above reason (atomic but not transactional sequence increments) that I'm not sure why you need that property during IMPORT. Users already can't rely on transactional sequences, and INSERT IDs can be created out of order (for example, multiple INSERT transactions can finish in one order and have their sequence IDs in a different order). Why do IMPORTs need that property?
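The atomic-but-not-transactional behavior described here is easy to demonstrate. A minimal sketch, assuming a locally running CockroachDB node, the lib/pq driver, and a throwaway sequence `s` (the connection string and names are illustrative assumptions):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed driver for this sketch
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec(`CREATE SEQUENCE IF NOT EXISTS s`); err != nil {
		log.Fatal(err)
	}

	// Call nextval inside a transaction, then abort the transaction.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	var v1 int64
	if err := tx.QueryRow(`SELECT nextval('s')`).Scan(&v1); err != nil {
		log.Fatal(err)
	}
	_ = tx.Rollback() // abort: the increment is NOT rolled back

	// The next caller still sees the sequence advanced past v1.
	var v2 int64
	if err := db.QueryRow(`SELECT nextval('s')`).Scan(&v2); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("aborted txn used %d; next caller got %d\n", v1, v2) // v2 > v1
}
```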
@mjibson I'm not sure I follow. What property? The plan is that when an IMPORT processor hits a row that needs a nextval, it'll run a txn to atomically a) update the sequence to take a chunk of IDs and b) record the file and line number at which that chunk was taken in the job progress. Then on resume, if we previously reserved a chunk for a given line, we'll use it. That is an important property of IMPORT, in that it ensures every time you re-run the same IMPORT (i.e. resume the same job), you produce the same KVs, so if you resume after a pause or crash, your uncheckpointed prior work is perfectly shadowed by your re-done version. If you abort the IMPORT, the "holes" are still in the sequence keyspace -- we updated it in a short-lived txn that committed on the first resume, regardless of what happens later.
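A self-contained sketch of that reserve-and-record flow, using in-memory stand-ins for the sequence counter and the job progress (every name here is hypothetical; the real implementation persists both in one txn):

```go
package main

import (
	"fmt"
	"sync"
)

// chunkKey identifies the row that triggered a reservation.
type chunkKey struct {
	seq     string
	fileIdx int32
	rowNum  int64
}

type chunk struct{ start, end int64 } // half-open range [start, end)

// importProgress is an in-memory stand-in for the job progress record.
type importProgress struct {
	mu     sync.Mutex
	seqVal map[string]int64   // stand-in for the sequence counters
	chunks map[chunkKey]chunk // (seq, file, row) -> reserved chunk
}

// reserveChunk returns the chunk for (seq, fileIdx, rowNum), reusing a prior
// reservation on resume and otherwise atomically claiming a fresh range.
func (p *importProgress) reserveChunk(seq string, fileIdx int32, rowNum, size int64) chunk {
	p.mu.Lock()
	defer p.mu.Unlock()
	key := chunkKey{seq, fileIdx, rowNum}
	if c, ok := p.chunks[key]; ok {
		return c // resumed job: hand back the identical values
	}
	start := p.seqVal[seq] + 1
	p.seqVal[seq] = start + size - 1 // "bump" the sequence by size
	c := chunk{start: start, end: start + size}
	p.chunks[key] = c
	return c
}

func main() {
	p := &importProgress{seqVal: map[string]int64{}, chunks: map[chunkKey]chunk{}}
	first := p.reserveChunk("s", 0, 42, 10)
	resumed := p.reserveChunk("s", 0, 42, 10) // same row after a resume
	fmt.Println(first, resumed)               // identical chunks
}
```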
I was referring to point 3 in the previous comment.
Ah, we sort of want that property, since if used in keys, we'd like adjacent input rows to be adjacent KVs for ingestion performance. And simply incrementing within our reserved chunk is also just the easiest way to know when we've exhausted the chunk and need to get another one. But we don't make any effort to ensure sequential chunks of the same file have sequential IDs.
Following the consideration of "determinism" (that is, on resume we produce the same KVs), an additional thing we'll need to do is to include both the value to be used next and the chunk limit (together with the resume position) when saving progress.
Here's one tricky scenario I can think of: imagine we saved at row 30 (say), with last used value 1024 and chunk limit 1100.
Separate from this, I concur with the idea of starting with a smaller chunk size and increasing it as we progress.
We shouldn't need to remember which processor requested which chunk. All we need to know is which file and line number corresponds to which chunk. These (file num, row num, reserved-chunk) entries should be encoded in the import job details in the same txn that reserves the chunk. On resume, the processor should look up whether a chunk was already reserved by the previous attempt.
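As a rough illustration, one such entry might look like the following (field names are hypothetical; the actual persisted proto differs):

```go
package main

import "fmt"

// SequenceChunkRecord is a hypothetical shape for one entry in the import
// job details: each reservation is keyed by where it was taken.
type SequenceChunkRecord struct {
	SeqID    int64 // descriptor ID of the sequence the chunk came from
	FileIdx  int32 // input file that triggered the reservation
	RowNum   int64 // row within that file at which the chunk was taken
	ChunkMin int64 // first reserved value (inclusive)
	ChunkMax int64 // last reserved value (inclusive)
}

func main() {
	// e.g. file 0, row 42 reserved values [1, 10] from sequence 53.
	rec := SequenceChunkRecord{SeqID: 53, FileIdx: 0, RowNum: 42, ChunkMin: 1, ChunkMax: 10}
	fmt.Printf("%+v\n", rec)
}
```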
Closing this. We didn't get to ...
56473: importccl: add `nextval` support for IMPORT INTO CSV r=miretskiy,pbardea a=adityamaru

Previously, `nextval` was not supported as a default expression for a non-targeted IMPORT INTO column. This change adds that functionality for CSV imports. There is a lot of great discussion about the approach to this problem at #48253 (comment).

At a high level, on encountering a nextval(seqname) for the first time, IMPORT reserves a chunk of values for this sequence and ties those values to the (fileIdx, rowNum), which is a unique reference to a particular row in a distributed import. The size of this chunk grows exponentially with how many times a single processor encounters a nextval call for that particular sequence. The reservation of the chunk piggybacks on existing methods that provide atomic, non-transactional guarantees when increasing the value of a sequence.

Information about the reserved chunks is stored in the import job progress details to ensure the following property: if the import job is paused and then resumed, and not all of the imported rows were checkpointed, the nextval value for a previously processed (fileIdx, rowNum) must be identical to the value computed in the first run of the job. This property is necessary to prevent duplicate entries with the same key but different values. We use the job progress details to check whether a previously reserved chunk of sequence values can be used for the current (fileIdx, rowNum).

Informs: #54797

Release note (sql change): IMPORT INTO for CSV now supports nextval as a default expression of a non-targeted column.

Co-authored-by: Aditya Maru <[email protected]>
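For reference, a minimal end-to-end use of the merged feature might look like this; the connection string, table schema, and CSV location are all illustrative assumptions, not taken from the PR:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed driver for this sketch
)

func main() {
	// Assumes a local cluster and a CSV with a single column of names.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	stmts := []string{
		`CREATE SEQUENCE IF NOT EXISTS id_seq`,
		`CREATE TABLE IF NOT EXISTS people (
			id   INT DEFAULT nextval('id_seq'),
			name STRING
		)`,
		// name is the only targeted column; id is a non-targeted column
		// whose default, nextval('id_seq'), is now evaluated by IMPORT.
		`IMPORT INTO people (name) CSV DATA ('nodelocal://1/people.csv')`,
	}
	for _, s := range stmts {
		if _, err := db.Exec(s); err != nil {
			log.Fatal(err)
		}
	}
}
```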
It would be nice if IMPORT could support default column expressions. We probably want to plumb down an eval ctx and populate the defaults during row production.