Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: add option to further shard cloudstorage folders #70207

Merged
merged 1 commit into from
Sep 20, 2021

Conversation

samiskin
Copy link
Contributor

@samiskin samiskin commented Sep 14, 2021

I initially wanted this to be a sinkURI query param, but that didn't allow me to have the "%" characters without them being escaped.

Resolves #63470


changefeedccl: add query param to shard cloudstorage folders

Previously we always bucketed cloudstorage paths into 'folders' delineated
by date which described the earliest possible timestamp that could exist
in that folder.  Ex: topic-name/2021-09-13/file.ndjson

A customer with a ton of files wanted the folders to be separated hourly
rather than just into dates.  Ex: topic-name/2021-0-13/04/file.ndjson

This patch adds a `partition_format` query param which allows the
user to elect to use the default behaviour with the "daily" value, split
it by date/hour/ with the "hourly" value, or not partition at all with
the "flat" value

So to solve the customer's issue they would add the query parameter
partition_format="hourly"

Release justification: low risk new ux sink query param

Release note (api change): CREATE CHANGEFEED on a CloudStorage sink now
allows a new query parameter to specify how the file paths are
partitioned, for example partition_format="daily" represents the default
behavior of splitting into dates (2021-05-01/), while
partition_format="hourly" will further partition them them by hour
(2021-05-01/05/), and partition_format="flat" will not partition at all

@samiskin samiskin requested a review from miretskiy September 14, 2021 17:09
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @samiskin)


-- commits, line 17 at r1:
Lets expands the comment to note that the ability to shard by date does not imply
that all events for that day have been emitted into that folder.


pkg/ccl/changefeedccl/sink_cloudstorage.go, line 355 at r1 (raw file):

	if partitionFormat, ok := opts[changefeedbase.OptPartitionFormat]; ok && partitionFormat != "" {
		s.partitionFormat = partitionFormat
	}

We briefly chatted about this... namely, what happens when partition is invalid. And, I'm kinda liking
the silent failure less and less. I know go is rather... strange, as it comes to time parsing, but it is what it is.
If you look at format.go, you'll see some mnemonic time formats (s.a time.RFC3339).
What if we adopted a similar approach? What if we had something like this:

shardFormats := map[string][string] {
   "" /* that's the default */: "2006-01-02",
  "@daily": "2006-01-02", /* same as default */
  "@hourly": "2006-01-02-15",  
}

We can even add few time formats to the list (e.g. time.RFC339, etc).

The way that I see this might work would be to lookup shard format based on the option (if nothing was specified, we'll lookup "" -- i.e. the default). If shard format doesn't exist (e.g. something entirely custom), we would just use
whatever value provided. I just don't think anybody would want to go beyond "@hourly".. But if so, then we can even include documentation with some examples -- here is how to get finer grained sharding.
Once we have shard format, we time.Format() to get the string; we then time.Parse(formatTimeString) back into time.Time. Parse returns an error if it can't parse provided string.

It would be very nice to do this type of validation when we start the changefeed because if the format is invalid,
we will keep retrying forever.

WDYT?

@samiskin
Copy link
Contributor Author

@miretskiy I like that better too, like I mentioned in Slack I don't really like the customization as much if its solely for sharding rather than organization.

When going with this there's also the question of would we want date/hour or date-hour. On the issue it is requested to be day/hour, and I like that more since its more readable.

For the string constants you did "@hourly" rather than "hourly". Is adding an "@" a normal pattern when it comes to specific string token parameters rather than a freeform string?

@@ -55,6 +56,17 @@ func cloudStorageFormatTime(ts hlc.Timestamp) string {
return fmt.Sprintf(`%s%09d%010d`, t.Format(f), t.Nanosecond(), ts.Logical)
}

func generateDataFilePartition(format string, timestamp time.Time) string {
folderPath := format
folderPath = strings.ReplaceAll(folderPath, "%d", timestamp.Format("2006-01-02"))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there would be any benefit to aligning this with strftime? I believe that would mean this changes to %F and the rest can remain the same.

https://man7.org/linux/man-pages/man3/strftime.3.html

@miretskiy
Copy link
Contributor

Is adding an "@" a normal pattern when it comes to specific string token parameters rather than a freeform string?

Nah.. I was just brainstorming. It's somewhat similar to crontab format. But you could choose something else...
Feel free to drop @...

@samiskin samiskin force-pushed the granular-s3-cdc-63470 branch from 20b6e82 to 02d2869 Compare September 15, 2021 21:42
Previously we always bucketed cloudstorage paths into 'folders' delineated
by date which described the earliest possible timestamp that could exist
in that folder.  Ex: topic-name/2021-09-13/file.ndjson

A customer with a ton of files wanted the folders to be separated hourly
rather than just into dates.  Ex: topic-name/2021-0-13/04/file.ndjson

This patch adds a `partition_format` query param which allows the
user to elect to use the default behaviour with the "daily" value, split
it by date/hour/ with the "hourly" value, or not partition at all with
the "flat" value

So to solve the customer's issue they would add the query parameter
partition_format="hourly"

Release justification: low risk new ux sink query param

Release note (api change): CREATE CHANGEFEED on a CloudStorage sink now
allows a new query parameter to specify how the file paths are
partitioned, for example partition_format="daily" represents the default
behavior of splitting into dates (2021-05-01/), while
partition_format="hourly" will further partition them them by hour
(2021-05-01/05/), and partition_format="flat" will not partition at all
@samiskin samiskin force-pushed the granular-s3-cdc-63470 branch from 02d2869 to 3432ec6 Compare September 16, 2021 13:11
@samiskin samiskin requested a review from miretskiy September 16, 2021 13:12
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @samiskin)


pkg/ccl/changefeedccl/sink_cloudstorage.go, line 351 at r3 (raw file):

		dateFormat, ok := partitionDateFormats[partitionFormat]
		if !ok {
			return nil, errors.Errorf("invalid partition_format of %s", partitionFormat)

I see you've decided to omit the possibility of overriding w/ something entirely custom. I'm cool with that actually.


pkg/ccl/changefeedccl/sink_cloudstorage_test.go, line 598 at r3 (raw file):

				require.Equal(t, []string{"v0\n", "v1\n", "v2\n", "v3\n", "v4\n"}, slurpDir(t, dir))
			})
		}

nice test.

@miretskiy miretskiy self-requested a review September 16, 2021 17:05
@samiskin
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Sep 20, 2021

Build succeeded:

@miretskiy
Copy link
Contributor

Informs #70920

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More granular S3 bucketing
4 participants