-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changefeedccl: add option to further shard cloudstorage folders #70207
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @samiskin)
-- commits, line 17 at r1:
Lets expands the comment to note that the ability to shard by date does not imply
that all events for that day have been emitted into that folder.
pkg/ccl/changefeedccl/sink_cloudstorage.go, line 355 at r1 (raw file):
if partitionFormat, ok := opts[changefeedbase.OptPartitionFormat]; ok && partitionFormat != "" { s.partitionFormat = partitionFormat }
We briefly chatted about this... namely, what happens when partition is invalid. And, I'm kinda liking
the silent failure less and less. I know go is rather... strange, as it comes to time parsing, but it is what it is.
If you look at format.go, you'll see some mnemonic time formats (s.a time.RFC3339
).
What if we adopted a similar approach? What if we had something like this:
shardFormats := map[string][string] {
"" /* that's the default */: "2006-01-02",
"@daily": "2006-01-02", /* same as default */
"@hourly": "2006-01-02-15",
}
We can even add few time formats to the list (e.g. time.RFC339
, etc).
The way that I see this might work would be to lookup shard format based on the option (if nothing was specified, we'll lookup ""
-- i.e. the default). If shard format doesn't exist (e.g. something entirely custom), we would just use
whatever value provided. I just don't think anybody would want to go beyond "@hourly".. But if so, then we can even include documentation with some examples -- here is how to get finer grained sharding.
Once we have shard format, we time.Format()
to get the string; we then time.Parse(formatTimeString) back into time.Time. Parse returns an error if it can't parse provided string.
It would be very nice to do this type of validation when we start the changefeed because if the format is invalid,
we will keep retrying forever.
WDYT?
@miretskiy I like that better too, like I mentioned in Slack I don't really like the customization as much if its solely for sharding rather than organization. When going with this there's also the question of would we want date/hour or date-hour. On the issue it is requested to be day/hour, and I like that more since its more readable. For the string constants you did "@hourly" rather than "hourly". Is adding an "@" a normal pattern when it comes to specific string token parameters rather than a freeform string? |
@@ -55,6 +56,17 @@ func cloudStorageFormatTime(ts hlc.Timestamp) string { | |||
return fmt.Sprintf(`%s%09d%010d`, t.Format(f), t.Nanosecond(), ts.Logical) | |||
} | |||
|
|||
func generateDataFilePartition(format string, timestamp time.Time) string { | |||
folderPath := format | |||
folderPath = strings.ReplaceAll(folderPath, "%d", timestamp.Format("2006-01-02")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there would be any benefit to aligning this with strftime? I believe that would mean this changes to %F
and the rest can remain the same.
Nah.. I was just brainstorming. It's somewhat similar to crontab format. But you could choose something else... |
20b6e82
to
02d2869
Compare
Previously we always bucketed cloudstorage paths into 'folders' delineated by date which described the earliest possible timestamp that could exist in that folder. Ex: topic-name/2021-09-13/file.ndjson A customer with a ton of files wanted the folders to be separated hourly rather than just into dates. Ex: topic-name/2021-0-13/04/file.ndjson This patch adds a `partition_format` query param which allows the user to elect to use the default behaviour with the "daily" value, split it by date/hour/ with the "hourly" value, or not partition at all with the "flat" value So to solve the customer's issue they would add the query parameter partition_format="hourly" Release justification: low risk new ux sink query param Release note (api change): CREATE CHANGEFEED on a CloudStorage sink now allows a new query parameter to specify how the file paths are partitioned, for example partition_format="daily" represents the default behavior of splitting into dates (2021-05-01/), while partition_format="hourly" will further partition them them by hour (2021-05-01/05/), and partition_format="flat" will not partition at all
02d2869
to
3432ec6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @samiskin)
pkg/ccl/changefeedccl/sink_cloudstorage.go, line 351 at r3 (raw file):
dateFormat, ok := partitionDateFormats[partitionFormat] if !ok { return nil, errors.Errorf("invalid partition_format of %s", partitionFormat)
I see you've decided to omit the possibility of overriding w/ something entirely custom. I'm cool with that actually.
pkg/ccl/changefeedccl/sink_cloudstorage_test.go, line 598 at r3 (raw file):
require.Equal(t, []string{"v0\n", "v1\n", "v2\n", "v3\n", "v4\n"}, slurpDir(t, dir)) }) }
nice test.
bors r+ |
Build succeeded: |
Informs #70920 |
I initially wanted this to be a sinkURI query param, but that didn't allow me to have the "%" characters without them being escaped.
Resolves #63470