Strip '@' symbols when merging pull requests. #1239

pwendell · 2014-06-27T00:04:08Z

Currently all of the commits with '@x' in them cause person X to
receive e-mails every time someone makes a public fork of Spark.

@marmbrus who requested this.
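
The idea in the description can be sketched in a few lines of Python. This is only an illustration, not the actual change made to the merge script: the function name `strip_mentions` and the sample text are hypothetical, and the sketch assumes that simply deleting every '@' is acceptable for the text it is applied to.

```python
def strip_mentions(message: str) -> str:
    """Return `message` with every '@' removed.

    Hypothetical helper: without the '@', GitHub no longer treats
    '@name' in a commit message as a user mention.
    """
    return message.replace("@", "")


# Sample pull request description (illustrative text only).
pr_body = (
    "Currently all of the commits with '@x' in them cause person X to\n"
    "receive e-mails every time someone makes a public fork of Spark.\n"
    "\n"
    "@marmbrus who requested this."
)

print(strip_mentions(pr_body))
```

A blanket replacement like this also removes the '@' from e-mail addresses, so it is better suited to the free-form description than to lines such as the author field.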

AmplabJenkins · 2014-06-27T00:05:25Z

Merged build triggered.

AmplabJenkins · 2014-06-27T00:15:39Z

Merged build finished.

AmplabJenkins · 2014-06-27T00:15:39Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16175/

sryza · 2014-06-27T17:47:07Z

Yesss, thank you, great idea

Currently all of the commits with 'X' in them cause person X to
receive e-mails every time someone makes a public fork of Spark.

marmbrus who requested this.

Author: Patrick Wendell <[email protected]>

Closes apache#1239 from pwendell/strip and squashes the following commits:

22e5a97 [Patrick Wendell] Strip '@' symbols when merging pull requests.

…reader (apache#1239) (apache#1260)

### What changes were proposed in this pull request?

This PR adds support for complex types (e.g., list, map, array) for Spark's vectorized Parquet reader. In particular, this introduces the following changes:

1. Added a new class `ParquetType` which binds a Spark type with its corresponding Parquet definition & repetition level. This is used when Spark assembles a vector of complex type for Parquet.
2. Changed `ParquetSchemaConverter` and added a new method `convertTypeInfo` which converts a Parquet `MessageType` to a `ParquetType` above. The existing conversion logic in the class remains the same but now operates with `ParquetType` instead of `DataType`, and annotates the former with extra information such as definition & repetition level, column path, column descriptor, etc.
3. Added a new class `ParquetColumn` which encapsulates all the necessary information needed when reading a Parquet column, including the `ParquetType` for the column, the repetition & definition levels (only allocated for a leaf-node of a complex type), as well as the reader for the column. In addition, it also contains logic for assembling nested columnar batches, via interpreting Parquet repetition & definition levels.
4. Changes are made in `VectorizedParquetRecordReader` to initialize a list of `ParquetColumn` for the columns read.
5. `VectorizedColumnReader` now also creates a reader for the repetition column. Depending on whether the maximum repetition level is 0, the batch read is now split into two code paths, e.g., `readBatch` versus `readBatchNested`.
6. Added logic to handle complex types in `VectorizedRleValuesReader`. For data types involving only struct or primitive types, it still goes with the old `readBatch` method, which now also saves definition levels into a vector for later assembly. Otherwise, for data types involving array or map, a separate code path `readBatchNested` is introduced to handle repetition levels.
7. Added a new config `spark.sql.parquet.enableNestedColumnVectorizedReader` to turn on or turn off the feature. By default it is true.
8. Modified `WritableColumnVector` to better support null structs. Currently it requires populating null entries to all child vectors when there is a null struct; however, this wastes space and also doesn't work well with Parquet scan. This adds an extra field `structOffsets` which records the mapping from a row ID to the position of the row in the child vector, so that child vectors will only need to store real null elements.

To test this, the PR introduced an interface `ParquetRowGroupReader` in `SpecificParquetRecordReaderBase` to abstract the Parquet file reading logic. The bulk of the tests are in `ParquetVectorizedSuite`, which covers different batch size & page size, column index, first row index, nulls, etc.

The `DataSourceReadBenchmark` is extended with two more cases: reading struct fields of primitive types and reading array of struct & map field.

### Why are the changes needed?

Whenever the read schema contains complex types, Spark currently falls back to the row-based reader in parquet-mr, which is much slower. As the benchmark shows, by adding support into the vectorized reader, we get a ~15x speedup on average when reading struct fields, and ~1.5x when reading array of struct and map.

Micro benchmark of reading primitive fields from a struct, over 400m rows:

```
================================================================================================
SQL Single Numeric Column Scan in Struct
================================================================================================

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single TINYINT Column Scan in Struct:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               77684          78174         692          5.4         185.2       1.0X
SQL ORC Vectorized (Enabled Nested Column)                 4137           4226         126        101.4           9.9      18.8X
SQL Parquet Vectorized (Disabled Nested Column)           42095          42193         138         10.0         100.4       1.8X
SQL Parquet Vectorized (Enabled Nested Column)             3317           4147        1174        126.4           7.9      23.4X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single SMALLINT Column Scan in Struct:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               82438          82443           7          5.1         196.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                 4746           5022         391         88.4          11.3      17.4X
SQL Parquet Vectorized (Disabled Nested Column)           43689          43761         102          9.6         104.2       1.9X
SQL Parquet Vectorized (Enabled Nested Column)             2894           2986         130        144.9           6.9      28.5X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single INT Column Scan in Struct:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               82749          82774          34          5.1         197.3       1.0X
SQL ORC Vectorized (Enabled Nested Column)                 4848           4869          30         86.5          11.6      17.1X
SQL Parquet Vectorized (Disabled Nested Column)           47718          47957         338          8.8         113.8       1.7X
SQL Parquet Vectorized (Enabled Nested Column)             3055           3056           2        137.3           7.3      27.1X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single BIGINT Column Scan in Struct:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               82398          82416          25          5.1         196.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                 6562           7010         634         63.9          15.6      12.6X
SQL Parquet Vectorized (Disabled Nested Column)           51007          51032          35          8.2         121.6       1.6X
SQL Parquet Vectorized (Enabled Nested Column)             4300           4358          82         97.6          10.3      19.2X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single FLOAT Column Scan in Struct:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               85791          86323         753          4.9         204.5       1.0X
SQL ORC Vectorized (Enabled Nested Column)                 7231           7246          21         58.0          17.2      11.9X
SQL Parquet Vectorized (Disabled Nested Column)           48381          48476         134          8.7         115.3       1.8X
SQL Parquet Vectorized (Enabled Nested Column)             2770           2791          29        151.4           6.6      31.0X

OpenJDK 64-Bit Server VM 11.0.10+9-LTS on Mac OS X 10.16
Intel(R) Core(TM) i9-10910 CPU @ 3.60GHz
SQL Single DOUBLE Column Scan in Struct:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------
SQL ORC Vectorized (Disabled Nested Column)               85566          85598          45          4.9         204.0       1.0X
SQL ORC Vectorized (Enabled Nested Column)                 8579           8591          17         48.9          20.5      10.0X
SQL Parquet Vectorized (Disabled Nested Column)           56052          56106          77          7.5         133.6       1.5X
SQL Parquet Vectorized (Enabled Nested Column)             4135           4185          70        101.4           9.9      20.7X
```

### Does this PR introduce _any_ user-facing change?

With this PR, Spark should now support reading complex types in its vectorized Parquet reader. A new config `spark.sql.parquet.enableNestedColumnVectorizedReader` is introduced to turn the feature on or off.

### How was this patch tested?

Added new unit tests.
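
The commit message does not say how its "~15x on average" figure was derived. One plausible reading, sketched below, is that it compares the Parquet vectorized reader with the nested-column option enabled against the same reader with it disabled, using the Best Time(ms) column. That interpretation is an assumption; the numbers themselves are copied from the benchmark output above, and the variable names are illustrative.

```python
# Best Time(ms) for "SQL Parquet Vectorized", copied from the benchmark output:
# (nested column reader disabled, nested column reader enabled)
parquet_best_ms = {
    "TINYINT": (42095, 3317),
    "SMALLINT": (43689, 2894),
    "INT": (47718, 3055),
    "BIGINT": (51007, 4300),
    "FLOAT": (48381, 2770),
    "DOUBLE": (56052, 4135),
}

# Speedup per column type = time with the feature disabled / time with it enabled.
speedups = {
    col_type: disabled / enabled
    for col_type, (disabled, enabled) in parquet_best_ms.items()
}
mean_speedup = sum(speedups.values()) / len(speedups)

for col_type, speedup in speedups.items():
    print(f"{col_type:<8} {speedup:5.1f}x")
print(f"{'mean':<8} {mean_speedup:5.1f}x")
```

Under this reading the per-type speedups range from about 12x to about 17x and their mean is a little above 14x, which is in line with the rounded figure quoted in the message.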

Strip '@' symbols when merging pull requests.

22e5a97

asfgit closed this in f1f7385 Jun 27, 2014

schlosna mentioned this pull request Feb 8, 2018

Bump codahale metrics palantir/spark#309

Closed
