-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lightning: support compress for lightning, add compress unit and integration tests #39153
Conversation
[REVIEW NOTIFICATION] This pull request has been approved by:
To complete the pull request process, please ask the reviewers in the list to review by filling The full list of commands accepted by this bot can be found here. Reviewer can indicate their review by submitting an approval review. |
…into lightningCompress
…into lightningCompress
/run-integration-br-test |
@@ -34,6 +34,8 @@ const ( | |||
tableRegionSizeWarningThreshold int64 = 1024 * 1024 * 1024 | |||
// the increment ratio of large CSV file size threshold by `region-split-size` | |||
largeCSVLowerThresholdRation = 10 | |||
// TableFileSizeINF for compressed size, for lightning 10TB is a relatively big value and will strongly affect efficiency | |||
TableFileSizeINF = 10 * 1024 * tableRegionSizeWarningThreshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does INF
mean? infinite?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, INF
means infinite. It's used to make sure compressed files can be read until the EOF error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set fileSize to INF to make sure compressed files can be read until EOF. Because we can't get the exact size of the compressed files.
I would expect to put its usage in the comment (as well as why this works, maybe)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in 46ddf53
@@ -324,6 +324,9 @@ func (p regexRouterParser) Parse(r *config.FileRouteRule, logger log.Logger) (*R | |||
if err != nil { | |||
return err | |||
} | |||
if result.Type == SourceTypeParquet && compression != CompressionNone { | |||
return errors.Errorf("can't support whole compressed parquet file, should compress parquet files by choosing correct parquet compress writer, path: %s", r.Path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this workaround mean here? I remember we support compressed parquet files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We support compressed binary data in parquet files, but we don't support compressed parquet files like parquet.gz
.
if err != nil { | ||
break | ||
} | ||
reader, err = storage.WithCompression(p.srcStorage, compressType).Open(ctx, sampleFile.Path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Duplicate with reader.go
too. Maybe we should encapsulate this with a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in 4c7eaf9
/component lightning |
@buchuitoudegou @dsdashun PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest lgtm
br/pkg/lightning/mydump/region.go
Outdated
// set fileSize to INF to make sure compressed files can be read until EOF. Because we can't get the exact size of the compressed files. | ||
// TODO: update progress bar calculation for compressed files. | ||
if fi.FileMeta.Compression != CompressionNone { | ||
rowIDMax = fileSize * 100 / divisor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we should do some tests on it. It seems this estimate can easily allocate too many row ids and make lightning fail to insert rows that need auto id even when the number of rows is far smaller than the type limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave a fixme in 46ddf53
I will do some tests and maybe fix this problem in the next PR.
@@ -34,6 +34,8 @@ const ( | |||
tableRegionSizeWarningThreshold int64 = 1024 * 1024 * 1024 | |||
// the increment ratio of large CSV file size threshold by `region-split-size` | |||
largeCSVLowerThresholdRation = 10 | |||
// TableFileSizeINF for compressed size, for lightning 10TB is a relatively big value and will strongly affect efficiency | |||
TableFileSizeINF = 10 * 1024 * tableRegionSizeWarningThreshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set fileSize to INF to make sure compressed files can be read until EOF. Because we can't get the exact size of the compressed files.
I would expect to put its usage in the comment (as well as why this works, maybe)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest LGTM
br/pkg/lightning/mydump/region.go
Outdated
}, | ||
} | ||
|
||
if tableRegion.Size() > tableRegionSizeWarningThreshold { | ||
if fi.FileMeta.Compression == CompressionNone && tableRegion.Size() > tableRegionSizeWarningThreshold { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a compressed file's size exceeds the threshold, the actual size is even bigger. So the threshold check is suitable for both compressed and uncompressed file ? Otherwise, the compressed file won't receive this warning message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed in a3ce953
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/merge |
This pull request has been accepted and is ready to merge. Commit hash: a91ad73
|
TiDB MergeCI notify🔴 Bad News! New failing [1] after this pr merged.
|
What problem does this PR solve?
Issue Number: ref #38514
Problem Summary:
What is changed and how it works?
Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.