Fix crash on parquet column type mismatch #68131
Conversation
This is an automated comment for commit 496a4c7 with a description of existing statuses. It's updated for the latest CI run. ✅ Successful checks
Looks good
* [GLUTEN-1632][CH] Daily Update Clickhouse Version (20240817)
* Add ut for ClickHouse/ClickHouse#68131
> Refer ClickHouse/ClickHouse#68494 (cherry picked from commit 9ec7071c3a0b00038989bdad0891842472d13098)
Co-authored-by: kyligence-git <[email protected]>
Co-authored-by: Chang Chen <[email protected]>
/// may mean one of two things:
/// * The byte array is the 16 bytes of Int128, little-endian.
/// * The byte array is an ASCII string containing the Int128 formatted in base 10.
/// There's no reliable way to distinguish these cases. We just guess: if the
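The ambiguity the comment describes can be shown with a short sketch (Python for brevity; the decode helpers and the sample payload are illustrations, not ClickHouse's actual code):

```python
# Hypothetical illustration: the same byte array admits two readings when the
# requested column type is Int128.
def decode_le_int128(buf: bytes) -> int:
    """Reading 1: 16 bytes of a little-endian signed Int128."""
    assert len(buf) == 16
    return int.from_bytes(buf, "little", signed=True)

def decode_ascii_base10(buf: bytes) -> int:
    """Reading 2: an ASCII string containing the number in base 10."""
    return int(buf.decode("ascii"))

# A payload of exactly 16 ASCII digits is valid under both readings, so no
# heuristic (e.g. "all bytes are digits") can tell them apart reliably:
payload = b"1234567890123456"
assert decode_ascii_base10(payload) == 1234567890123456
assert decode_le_int128(payload) != 1234567890123456  # a very different number
```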
Forgive the drive-by comment, but don't Parquet logical types provide the ability to make this distinction?
They don't support integers wider than 64 bits. From parquet.thrift:
/**
* Integer logical type annotation
*
* bitWidth must be 8, 16, 32, or 64.
*
* Allowed for physical types: INT32, INT64
*/
struct IntType {
1: required i8 bitWidth
2: required bool isSigned
}
[...]
/**
* Decimal logical type annotation
*
* To maintain forward-compatibility in v1, implementations using this logical
* type must also set scale and precision on the annotated SchemaElement.
*
* Allowed for physical types: INT32, INT64, FIXED, and BINARY
*/
struct DecimalType {
1: required i32 scale
2: required i32 precision
}
[...]
union LogicalType {
1: StringType STRING // use ConvertedType UTF8
2: MapType MAP // use ConvertedType MAP
3: ListType LIST // use ConvertedType LIST
4: EnumType ENUM // use ConvertedType ENUM
5: DecimalType DECIMAL // use ConvertedType DECIMAL + SchemaElement.{scale, precision}
6: DateType DATE // use ConvertedType DATE
// use ConvertedType TIME_MICROS for TIME(isAdjustedToUTC = *, unit = MICROS)
// use ConvertedType TIME_MILLIS for TIME(isAdjustedToUTC = *, unit = MILLIS)
7: TimeType TIME
// use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC = *, unit = MICROS)
// use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC = *, unit = MILLIS)
8: TimestampType TIMESTAMP
// 9: reserved for INTERVAL
10: IntType INTEGER // use ConvertedType INT_* or UINT_*
11: NullType UNKNOWN // no compatible ConvertedType
12: JsonType JSON // use ConvertedType JSON
13: BsonType BSON // use ConvertedType BSON
14: UUIDType UUID // no compatible ConvertedType
}
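As a concrete sketch of the DECIMAL route discussed below: per the Parquet format specification, a decimal stored in BINARY or FIXED_LEN_BYTE_ARRAY carries the unscaled value as a big-endian two's-complement integer (the helper names here are illustrative, not from any real codebase):

```python
def encode_decimal_unscaled(unscaled: int, width: int = 16) -> bytes:
    # Big-endian two's complement, as the Parquet spec prescribes for
    # DECIMAL on FIXED_LEN_BYTE_ARRAY / BINARY.
    return unscaled.to_bytes(width, "big", signed=True)

def decode_decimal(buf: bytes, scale: int):
    # Interpreted value = unscaled * 10**-scale.
    return int.from_bytes(buf, "big", signed=True), scale

# An Int128-sized value with scale = 0 round-trips through 16 bytes:
buf = encode_decimal_unscaled(-123456, 16)
assert decode_decimal(buf, 0) == (-123456, 0)
```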
(Maybe we could write them as DecimalType with scale = 0, since Decimal supports arbitrary length. The precision parameter would be awkward, though. E.g. UInt256 can hold some but not all 78-digit numbers. We could use precision = 77 but allow the value to be greater than 10^77, but then other parquet decoders may consider this an error. We could use precision = 78 and a 33-byte-long value, but then our decoder can't know in advance whether all values will fit in UInt256. The latter seems ok though; maybe that would be the best way to do this. Also, signedness would probably be awkward in some way too.)
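The arithmetic behind those numbers checks out (a quick verification, not part of the patch):

```python
import math

uint256_max = 2**256 - 1
assert len(str(uint256_max)) == 78   # UInt256 values need up to 78 digits...
assert uint256_max < 10**78 - 1      # ...but not every 78-digit number fits.

# An unscaled decimal with precision = 78 needs 33 bytes: 10**78 - 1 occupies
# 260 bits, plus a sign bit, rounded up to whole bytes.
nbytes = math.ceil(((10**78 - 1).bit_length() + 1) / 8)
assert nbytes == 33
```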
Backport #68131 to 24.6: Fix crash on parquet column type mismatch
Backport #68131 to 24.7: Fix crash on parquet column type mismatch
Backport #68131 to 24.8: Fix crash on parquet column type mismatch
Backport #68131 to 24.5: Fix crash on parquet column type mismatch
…68844) Co-authored-by: robot-clickhouse <[email protected]>
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fixed crash in Parquet filtering when data types in the file substantially differ from the requested types (e.g. ... FROM file('a.parquet', Parquet, 'x String'), but the file has x Int64). Without this fix, use input_format_parquet_filter_push_down = 0 as a workaround.

ParquetBlockInputFormat didn't cast Field-s to the requested types because I thought KeyCondition allows any types. I don't remember why I thought that. It was crashing in MergeTreeSetIndex::checkInRange() when FieldValue::update() tried to insert a Field of the wrong type into a column.

Fixes #68403