
[SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations #6225

Closed
wants to merge 6 commits

Conversation

liancheng
Contributor

This PR introduces several performance optimizations to HadoopFsRelation and ParquetRelation2:

1. Moving FileStatus listing from DataSourceStrategy into a cache within HadoopFsRelation.

   This new cache generalizes and replaces the one used in ParquetRelation2.

   This also introduces an interface change: to reuse cached FileStatus objects, HadoopFsRelation.buildScan methods now receive Array[FileStatus] instead of Array[String] (see the sketch after this list).

2. When Parquet task side metadata reading is enabled, skip reading row group information when reading Parquet footers.

   This is basically what PR #5334 does. Also, we now use ParquetFileReader.readAllFootersInParallel to read footers in parallel.
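
As a rough illustration of the interface change in item 1 (ExampleFsRelation and its method are hypothetical, not the actual HadoopFsRelation source):

```scala
// Hypothetical trait for illustration only; not the actual HadoopFsRelation API.
import org.apache.hadoop.fs.FileStatus
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

trait ExampleFsRelation {
  // Old shape (illustrative): input files were passed as path strings.
  // def buildScan(requiredColumns: Array[String], inputPaths: Array[String]): RDD[Row]

  // New shape (illustrative): cached FileStatus objects are passed directly, so the
  // data source can reuse listing results (size, block info) without re-listing paths.
  def buildScan(requiredColumns: Array[String], inputFiles: Array[FileStatus]): RDD[Row]
}
```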

Another optimization under consideration: instead of asking HadoopFsRelation.buildScan to return an RDD[Row] for each selected partition and then unioning them all, we would ask it to return a single RDD[Row] covering all selected partitions. This is motivated by the fact that the Hadoop configuration broadcasting done in NewHadoopRDD takes 34% of the time in the following microbenchmark. However, it complicates data source user code, because user code must merge partition values manually.
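
A rough sketch contrasting the two strategies (scanOne and scanAll stand in for data source callbacks and are not real Spark APIs):

```scala
// Hypothetical helpers for illustration only.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Current approach: one RDD per selected partition, then a union. Every
// per-partition Hadoop RDD broadcasts its own copy of the Hadoop configuration.
def scanPerPartition(
    sc: SparkContext,
    partitionDirs: Seq[String],
    scanOne: String => RDD[Row]): RDD[Row] =
  sc.union(partitionDirs.map(scanOne))

// Proposed approach: one RDD over all selected partitions, so the broadcast cost
// is paid once, but the data source must merge partition values into rows itself.
def scanAllAtOnce(
    partitionDirs: Seq[String],
    scanAll: Seq[String] => RDD[Row]): RDD[Row] =
  scanAll(partitionDirs)
```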

To check the cost of broadcasting in NewHadoopRDD, I also ran the microbenchmark after removing the broadcast call in NewHadoopRDD. All results are shown below.

Microbenchmark

Preparation code

Generating a partitioned table with 50k partitions, 1k rows per partition:

import sqlContext._
import sqlContext.implicits._

for (n <- 0 until 500) {
  val data = for {
    p <- (n * 10) until ((n + 1) * 10)
    i <- 0 until 1000
  } yield (i, f"val_$i%04d", f"$p%04d")

  data.
    toDF("a", "b", "p").
    write.
    partitionBy("p").
    mode("append").
    parquet(path)
}

Benchmarking code

import sqlContext._
import sqlContext.implicits._

import org.apache.spark.sql.types._
import com.google.common.base.Stopwatch

val path = "hdfs://localhost:9000/user/lian/5k"

def benchmark(n: Int)(f: => Unit) {
  val stopwatch = new Stopwatch()

  def run() = {
    stopwatch.reset()
    stopwatch.start()
    f
    stopwatch.stop()
    stopwatch.elapsedMillis()
  }

  val records = (0 until n).map(_ => run())

  (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms"))
  println(s"Average: ${records.sum / n.toDouble} ms")
}

benchmark(3) { read.parquet(path).explain(extended = true) }

Results

Before:

Round 0: 72528 ms
Round 1: 68938 ms
Round 2: 65372 ms
Average: 68946.0 ms

After:

Round 0: 59499 ms
Round 1: 53645 ms
Round 2: 53844 ms
Round 3: 49093 ms
Round 4: 50555 ms
Average: 53327.2 ms

After also removing Hadoop configuration broadcasting:

(Note that I was testing on a local laptop, so the network cost is pretty low.)

Round 0: 15806 ms
Round 1: 14394 ms
Round 2: 14699 ms
Round 3: 15334 ms
Round 4: 14123 ms
Average: 14871.2 ms

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32959 has started for PR 6225 at commit 6a08b02.

@SparkQA

SparkQA commented May 18, 2015

Test build #32959 has finished for PR 6225 at commit 6a08b02.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32959/
Test FAILed.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32960 has started for PR 6225 at commit b84612a.

@SparkQA

SparkQA commented May 18, 2015

Test build #32960 has finished for PR 6225 at commit b84612a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32960/
Test FAILed.

@liancheng liancheng changed the title [SPARK-7673] [SQL] WIP: Moves file status cache into HadoopFSRelation [SPARK-7673] [SQL] WIP: Moves file status cache into HadoopFsRelation May 18, 2015
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32962 has started for PR 6225 at commit 3d278f7.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32964 has started for PR 6225 at commit ba41250.

@SparkQA

SparkQA commented May 18, 2015

Test build #32962 has finished for PR 6225 at commit 3d278f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32962/
Test PASSed.

@SparkQA

SparkQA commented May 18, 2015

Test build #32964 has finished for PR 6225 at commit ba41250.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32964/
Test PASSed.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32996 has started for PR 6225 at commit 7aa3748.

@liancheng liancheng changed the title [SPARK-7673] [SQL] WIP: Moves file status cache into HadoopFsRelation [SPARK-7673] [SQL] WIP: HadoopFsRelation performance optimizations May 18, 2015
@liancheng liancheng changed the title [SPARK-7673] [SQL] WIP: HadoopFsRelation performance optimizations [SPARK-7673] [SQL] WIP: HadoopFsRelation and ParquetRelation2 performance optimizations May 18, 2015
@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented May 18, 2015

Test build #32999 has started for PR 6225 at commit 2d58a2b.

@SparkQA

SparkQA commented May 18, 2015

Test build #32996 has finished for PR 6225 at commit 7aa3748.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32996/
Test PASSed.

@SparkQA

SparkQA commented May 18, 2015

Test build #32999 has finished for PR 6225 at commit 2d58a2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Merged build finished. Test PASSed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32999/
Test PASSed.

@asfgit asfgit closed this in 9dadf01 May 18, 2015
asfgit pushed a commit that referenced this pull request May 18, 2015
…ance optimizations

Author: Cheng Lian <[email protected]>

Closes #6225 from liancheng/spark-7673 and squashes the following commits:

2d58a2b [Cheng Lian] Skips reading row group information when using task side metadata reading
7aa3748 [Cheng Lian] Optimizes FileStatusCache by introducing a map from parent directories to child files
ba41250 [Cheng Lian] Reuses HadoopFsRelation FileStatusCache in ParquetRelation2
3d278f7 [Cheng Lian] Fixes a bug when reading a single Parquet data file
b84612a [Cheng Lian] Fixes Scala style issue
6a08b02 [Cheng Lian] WIP: Moves file status cache into HadoopFSRelation

(cherry picked from commit 9dadf01)
Signed-off-by: Yin Huai <[email protected]>
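
As a rough illustration of the directory-keyed cache mentioned in commit 7aa3748 above (ExampleFileStatusCache and its members are assumptions, not the actual Spark implementation):

```scala
// Hypothetical cache for illustration only; the real FileStatusCache may differ.
// It memoizes listStatus results keyed by parent directory.
import scala.collection.mutable
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

class ExampleFileStatusCache(fs: FileSystem) {
  // Parent directory -> statuses of its child files, so repeated query planning
  // can reuse a listing instead of hitting the file system again.
  private val dirToChildFiles = mutable.Map.empty[Path, Array[FileStatus]]

  def listChildFiles(dir: Path): Array[FileStatus] =
    dirToChildFiles.getOrElseUpdate(dir, fs.listStatus(dir).filter(_.isFile))
}
```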
@yhuai
Contributor

yhuai commented May 18, 2015

I have merged it to master and branch-1.4. I also tested manually, and it did fix the performance issue of calling list status. The WIP in the title was for the work of using a broadcast Hadoop conf and for making sure we do not have a regression compared with 1.3 (broadcasting a conf for every partition's Hadoop RDD is pretty expensive). Since this is a separate issue, I am going to create another PR to address it.
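
A minimal sketch of that follow-up idea, assuming a hypothetical helper rather than the eventual patch: broadcast the Hadoop configuration once per scan and let all partition reads share it.

```scala
// Hypothetical helper, not the actual follow-up PR: broadcast the Hadoop
// configuration once so every partition's scan can reuse the same copy.
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

def broadcastConfOnce(
    sc: SparkContext,
    hadoopConf: Configuration): Broadcast[SerializableWritable[Configuration]] = {
  // Configuration is not java.io.Serializable, so it is wrapped before broadcasting;
  // executors would call .value.value on the broadcast to get the Configuration back.
  sc.broadcast(new SerializableWritable(hadoopConf))
}
```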

asfgit pushed a commit that referenced this pull request May 18, 2015
Fix break caused by merging #6225 and #6194.

Author: Michael Armbrust <[email protected]>

Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break
asfgit pushed a commit that referenced this pull request May 18, 2015
Fix break caused by merging #6225 and #6194.

Author: Michael Armbrust <[email protected]>

Closes #6244 from marmbrus/fixOrcBuildBreak and squashes the following commits:

b10e47b [Michael Armbrust] [HOTFIX] Fix ORC Build break

(cherry picked from commit fcf90b7)
Signed-off-by: Andrew Or <[email protected]>
@liancheng liancheng deleted the spark-7673 branch May 19, 2015 02:25
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
…ance optimizations

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
Fix break caused by merging apache#6225 and apache#6194.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
…ance optimizations

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
Fix break caused by merging apache#6225 and apache#6194.

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
…ance optimizations

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015
Fix break caused by merging apache#6225 and apache#6194.
