Strip '@' symbols when merging pull requests. #1239
Closed
Conversation
Merged build triggered.

Merged build finished.

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16175/

Yesss, thank you, great idea
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request on Sep 4, 2014

Strip '@' symbols when merging pull requests.

Currently all of the commits with '@x' in them cause person X to receive e-mails every time someone makes a public fork of Spark.

marmbrus, who requested this.

Author: Patrick Wendell <[email protected]>

Closes apache#1239 from pwendell/strip and squashes the following commits:

22e5a97 [Patrick Wendell] Strip '@' symbols when merging pull requests.
sunchao added a commit to sunchao/spark that referenced this pull request on Dec 8, 2021
…reader (apache#1239) (apache#1260)

### What changes were proposed in this pull request?

This PR adds support for complex types (e.g., list, map, array) to Spark's vectorized Parquet reader. In particular, it introduces the following changes:

1. Added a new class `ParquetType`, which binds a Spark type with its corresponding Parquet definition & repetition levels. This is used when Spark assembles a vector of a complex type from Parquet.
2. Changed `ParquetSchemaConverter` and added a new method `convertTypeInfo`, which converts a Parquet `MessageType` to a `ParquetType` as above. The existing conversion logic in the class remains the same, but it now operates on `ParquetType` instead of `DataType` and annotates the former with extra information such as definition & repetition levels, column path, and column descriptor.
3. Added a new class `ParquetColumn`, which encapsulates all the information needed to read a Parquet column: the `ParquetType` for the column, the repetition & definition levels (allocated only for a leaf node of a complex type), and the reader for the column. It also contains the logic for assembling nested columnar batches by interpreting Parquet repetition & definition levels.
4. Changed `VectorizedParquetRecordReader` to initialize a list of `ParquetColumn` for the columns being read.
5. `VectorizedColumnReader` now also creates a reader for the repetition column. Depending on whether the maximum repetition level is 0, batch reading is split into two code paths: `readBatch` versus `readBatchNested`.
6. Added logic to handle complex types in `VectorizedRleValuesReader`. Data types involving only structs or primitive types still go through the old `readBatch` method, which now also saves definition levels into a vector for later assembly. Data types involving arrays or maps go through a separate code path, `readBatchNested`, which handles repetition levels.
7. Added a new config, `spark.sql.parquet.enableNestedColumnVectorizedReader`, to turn the feature on or off. It defaults to true.
8. Modified `WritableColumnVector` to better support null structs. Previously, a null struct required populating null entries in all child vectors, which wastes space and does not work well with Parquet scans. This change adds an extra field, `structOffsets`, which records the mapping from a row ID to the row's position in the child vector, so that child vectors only need to store real null elements.

To test this, the PR introduces an interface `ParquetRowGroupReader` in `SpecificParquetRecordReaderBase` to abstract the Parquet file reading logic. The bulk of the tests are in `ParquetVectorizedSuite`, which covers different batch sizes & page sizes, column index, first row index, nulls, etc. `DataSourceReadBenchmark` is extended with two more cases: reading struct fields of primitive types, and reading an array of struct & map fields.

### Why are the changes needed?

Whenever the read schema contains complex types, Spark currently falls back to the row-based reader in parquet-mr, which is much slower. As the benchmark shows, adding this support to the vectorized reader yields roughly a 15x average speedup when reading struct fields, and roughly 1.5x when reading an array of struct and map.

Micro benchmark of reading primitive fields from a struct, over 400m rows:

```
================================================================================================
SQL Single Numeric Column Scan in Struct
================================================================================================

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single TINYINT Column Scan in Struct:         Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               77684         78174        692        5.4        185.2      1.0X
SQL ORC Vectorized (Enabled Nested Column)                 4137          4226        126      101.4          9.9     18.8X
SQL Parquet Vectorized (Disabled Nested Column)           42095         42193        138       10.0        100.4      1.8X
SQL Parquet Vectorized (Enabled Nested Column)             3317          4147       1174      126.4          7.9     23.4X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single SMALLINT Column Scan in Struct:        Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               82438         82443          7        5.1        196.5      1.0X
SQL ORC Vectorized (Enabled Nested Column)                 4746          5022        391       88.4         11.3     17.4X
SQL Parquet Vectorized (Disabled Nested Column)           43689         43761        102        9.6        104.2      1.9X
SQL Parquet Vectorized (Enabled Nested Column)             2894          2986        130      144.9          6.9     28.5X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single INT Column Scan in Struct:             Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               82749         82774         34        5.1        197.3      1.0X
SQL ORC Vectorized (Enabled Nested Column)                 4848          4869         30       86.5         11.6     17.1X
SQL Parquet Vectorized (Disabled Nested Column)           47718         47957        338        8.8        113.8      1.7X
SQL Parquet Vectorized (Enabled Nested Column)             3055          3056          2      137.3          7.3     27.1X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single BIGINT Column Scan in Struct:          Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               82398         82416         25        5.1        196.5      1.0X
SQL ORC Vectorized (Enabled Nested Column)                 6562          7010        634       63.9         15.6     12.6X
SQL Parquet Vectorized (Disabled Nested Column)           51007         51032         35        8.2        121.6      1.6X
SQL Parquet Vectorized (Enabled Nested Column)             4300          4358         82       97.6         10.3     19.2X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single FLOAT Column Scan in Struct:           Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               85791         86323        753        4.9        204.5      1.0X
SQL ORC Vectorized (Enabled Nested Column)                 7231          7246         21       58.0         17.2     11.9X
SQL Parquet Vectorized (Disabled Nested Column)           48381         48476        134        8.7        115.3      1.8X
SQL Parquet Vectorized (Enabled Nested Column)             2770          2791         29      151.4          6.6     31.0X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single DOUBLE Column Scan in Struct:          Best Time(ms)  Avg Time(ms)  Stdev(ms)  Rate(M/s)  Per Row(ns)  Relative
--------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               85566         85598         45        4.9        204.0      1.0X
SQL ORC Vectorized (Enabled Nested Column)                 8579          8591         17       48.9         20.5     10.0X
SQL Parquet Vectorized (Disabled Nested Column)           56052         56106         77        7.5        133.6      1.5X
SQL Parquet Vectorized (Enabled Nested Column)             4135          4185         70      101.4          9.9     20.7X
```

### Does this PR introduce _any_ user-facing change?

With this PR, Spark supports reading complex types in its vectorized Parquet reader. A new config, `spark.sql.parquet.enableNestedColumnVectorizedReader`, is introduced to turn the feature on or off.

### How was this patch tested?

Added new unit tests.
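The PR description above talks about assembling values by "interpreting Parquet repetition & definition levels" (items 3 and 6). As a rough illustration of what definition levels mean in the simplest case, here is a minimal, hypothetical Python sketch: `assemble_nullable` and its inputs are invented for illustration and are unrelated to the actual Scala/Java implementation in the PR.

```python
from typing import List, Optional

def assemble_nullable(def_levels: List[int], values: List[int],
                      max_def_level: int = 1) -> List[Optional[int]]:
    """Rebuild a nullable column from Parquet-style definition levels.

    In Parquet, non-null values are stored densely; a definition level
    equal to the column's maximum means a value is present at that slot,
    and anything lower means null at some nesting depth. This toy version
    handles only a flat optional column (max definition level 1).
    """
    out: List[Optional[int]] = []
    it = iter(values)
    for d in def_levels:
        # Consume the next dense value only when the slot is defined.
        out.append(next(it) if d == max_def_level else None)
    return out

print(assemble_nullable([1, 0, 1, 1, 0], [7, 9, 4]))
# [7, None, 9, 4, None]
```

For nested arrays and maps, repetition levels additionally mark where a new list begins, which is why the PR splits those cases into the separate `readBatchNested` code path.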
Currently all of the commits with '@x' in them cause person X to
receive e-mails every time someone makes a public fork of Spark.
@marmbrus who requested this.
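The fix described above can be sketched in a few lines: remove the '@' from @-mentions so the merged commit message no longer pings the mentioned users. This is a hypothetical Python illustration of the idea, not the actual change to Spark's merge script; `strip_mentions` is an invented name.

```python
import re

def strip_mentions(commit_msg: str) -> str:
    """Drop the '@' from GitHub @-mentions like '@marmbrus'.

    '@name' becomes 'name', so commits carried into forks no longer
    trigger notification e-mails for the mentioned user.
    """
    # Strip '@' only when it directly precedes a word character,
    # i.e. when it looks like a mention.
    return re.sub(r"@(\w)", r"\1", commit_msg)

msg = "Thanks @marmbrus for the review, cc @pwendell"
print(strip_mentions(msg))
# Thanks marmbrus for the review, cc pwendell
```

Note that a pattern this simple would also alter e-mail addresses embedded in the message; a production version would want to guard against that.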