[SPARK-12355][SQL] Implement unhandledFilter interface for Parquet #10502
Conversation
Test build #48396 has finished for PR 10502 at commit
The test failed because of wrong results from Parquet.
As can be seen,
I see.
I think I should disable
Test build #48403 has finished for PR 10502 at commit
Test build #48402 has finished for PR 10502 at commit
Test build #48405 has finished for PR 10502 at commit
cc @yhuai @liancheng
@HyukjinKwon Thank you for the PR. Can you post some benchmarking results (with your testing code)? It will be good to have these numbers to help others understand whether it provides a benefit.
@yhuai Sure. I will try!
Benchmark (Removed Spark-side Filter)

Motivation

This PR simplifies the query plans for Parquet files by stripping the duplicated Spark-side filtering, from:

to:

However, it is unknown whether this brings a performance benefit, so this benchmark was performed.

Environment

Method

Dataset

Raw Data

Create Target Parquet File
// Schema of the lineitem table used for the benchmark.
case class Lineitem(l_orderkey: Int,
    l_partkey: Int,
    l_suppkey: Int,
    l_linenumber: Int,
    l_quantity: Float,
    l_extendedprice: Float,
    l_discount: Float,
    l_tax: Float,
    l_returnflag: String,
    l_linestatus: String,
    l_shipdate: String,
    l_commitdate: String,
    l_receiptdate: String,
    l_shipinstruct: String,
    l_shipmode: String,
    l_comment: String)

val conf = new SparkConf()
conf.setAppName("Test").setMaster("local")
conf.set("spark.sql.parquet.compression.codec", "uncompressed")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// Parse the pipe-delimited dump into a DataFrame.
val df = sc.textFile("lineitem.tbl").map(_.split('|')).map { v =>
  Lineitem(
    v(0).trim.toInt,
    v(1).trim.toInt,
    v(2).trim.toInt,
    v(3).trim.toInt,
    v(4).trim.toFloat,
    v(5).trim.toFloat,
    v(6).trim.toFloat,
    v(7).trim.toFloat,
    v(8),
    v(9),
    v(10),
    v(11),
    v(12),
    v(13),
    v(14),
    v(15))
}.toDF()

// Write out the data as an uncompressed Parquet file.
df.save("lineitem", "parquet")

Parquet file
Test Code
// Simple wall-clock timer for a block of code.
def time[A](f: => A) = {
  val s = System.nanoTime
  val ret = f
  println("time: " + (System.nanoTime - s) / 1e6 + "ms")
  ret
}

val conf = new SparkConf()
conf.setAppName("Test").setMaster("local")
conf.set("spark.sql.parquet.enableUnsafeRowRecordReader", "false")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Each block below re-reads the Parquet file and times a scan with a single
// pushed-down predicate on l_orderkey.
val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey IS NULL").select("l_orderkey")
time(df.collect())

val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey IS NOT NULL").select("l_orderkey")
time(df.collect())

val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey = 1").select("l_orderkey")
time(df.collect())

val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey != 1").select("l_orderkey")
time(df.collect())

val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey <=> 1").select("l_orderkey")
time(df.collect())

val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey < 3000000").select("l_orderkey")
time(df.collect())

val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey > 3000000").select("l_orderkey")
time(df.collect())

val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey <= 3000000").select("l_orderkey")
time(df.collect())

val source = sqlContext.read.parquet("lineitem")
val df = source.filter("l_orderkey >= 3000000").select("l_orderkey")
time(df.collect())

Results
Basically, in a simple view, the difference is as follows. The original code works like this (with Spark-side filtering):

data
  // Parquet-side filtering
  .filter(pushedFilter)
  // Spark-side filtering
  .filter(pushedFilter)

This PR changes it into the following (without Spark-side filtering):

data
  // Parquet-side filtering
  .filter(pushedFilter)

Although both have the same O(n) time complexity, the former evaluates the predicate roughly 2n times while the latter evaluates it n times, so a performance benefit is expected. One notable thing is that there were still considerable performance differences for

So, in conclusion, although we cannot rely only on this benchmark, it seems there is a performance benefit of approximately 1% to 4% in elapsed time for basic queries with pushed filters.
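As a rough way to see this difference, one can print the physical plan with explain(); the following is a minimal sketch assuming the sqlContext and the "lineitem" Parquet directory from the test code above (plannedDf is just an illustrative name):

// Sketch only: print the physical plan for one of the benchmark queries.
val plannedDf = sqlContext.read.parquet("lineitem")
  .filter("l_orderkey = 1")
  .select("l_orderkey")
// Without unhandledFilter, the plan shows a Spark-side Filter on top of the Parquet scan
// even though the same predicate is pushed down; with the duplicated filtering removed,
// that extra Filter node should not appear for predicates Parquet can fully handle.
plannedDf.explain()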
@yhuai @liancheng @rxin Would you look through this please?
@@ -208,11 +210,30 @@ private[sql] object ParquetFilters {
  }

  /**
   * Return referenced columns in [[sources.Filter]].
   */
  def referencedColumns(schema: StructType, predicate: sources.Filter): Array[String] = {
Better to add private[parquet]?
Oh, yes, it looks so. I think I might also have to change createFilter() in the same way, because I followed the style of createFilter() for this function; both createFilter() and referencedColumns() are called in the same places.
Agreed.
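For illustration, the suggested visibility could look roughly like the sketch below (hypothetical object name, elided body, and an assumed package matching ParquetFilters; this is not the actual diff):

package org.apache.spark.sql.execution.datasources.parquet

import org.apache.spark.sql.sources
import org.apache.spark.sql.types.StructType

private[sql] object ParquetFiltersVisibilitySketch {
  // private[parquet] hides the helper from the rest of org.apache.spark.sql while
  // keeping it callable from the Parquet data source code in this package.
  private[parquet] def referencedColumns(
      schema: StructType, predicate: sources.Filter): Array[String] = {
    Array.empty // body elided in this sketch
  }
}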
Jenkins, retest this please.
Test build #49115 has finished for PR 10502 at commit
Test build #49122 has finished for PR 10502 at commit
I will resolve this conflict on Thursday.
Test build #51075 has finished for PR 10502 at commit
@liancheng @yhuai Would you look through this please?
Test build #52014 has finished for PR 10502 at commit
@HyukjinKwon Thank you for working on it. We have been actively improving the efficiency of code-gen and the unsafe Parquet reader. For the long term, letting Parquet evaluate filters like
@yhuai No problem. Then, please let me know later when there is something else I should do.
Yea, will do. Thank you.
@yhuai Let me close this for now. Please let me know even though this is closed; I will reopen it when I start to work on this again.
Hi @yhuai! Would it be okay if I gave this one a try again?
What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-12355

This is similar to #10427. As discussed in #10221, this PR implements unhandledFilter to remove the duplicated Spark-side filtering.

In the case of Parquet, the columns referenced in pushed-down filters should be given to org.apache.spark.sql.parquet.row.requested_schema, whereas general data sources such as JDBC do not require these columns. However, DataSourceStrategy.pruneFilterProjectRaw() prunes the columns that are only referenced in pushed-down filters. Therefore, this PR resolves the problem by manually generating the columns referenced in the pushed-down filters (a rough sketch of the idea follows below).
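The idea of collecting the columns referenced by a pushed-down filter can be sketched as below (a hypothetical helper named referencedColumnNames, not the code in this PR, using the public org.apache.spark.sql.sources filter classes):

import org.apache.spark.sql.sources
import org.apache.spark.sql.sources.Filter

// Collect the column names a source Filter refers to, so they can be appended to the
// requested Parquet schema even when they do not appear in the projection.
def referencedColumnNames(filter: Filter): Set[String] = filter match {
  case sources.EqualTo(attr, _) => Set(attr)
  case sources.EqualNullSafe(attr, _) => Set(attr)
  case sources.GreaterThan(attr, _) => Set(attr)
  case sources.GreaterThanOrEqual(attr, _) => Set(attr)
  case sources.LessThan(attr, _) => Set(attr)
  case sources.LessThanOrEqual(attr, _) => Set(attr)
  case sources.In(attr, _) => Set(attr)
  case sources.IsNull(attr) => Set(attr)
  case sources.IsNotNull(attr) => Set(attr)
  case sources.StringStartsWith(attr, _) => Set(attr)
  case sources.StringEndsWith(attr, _) => Set(attr)
  case sources.StringContains(attr, _) => Set(attr)
  case sources.And(left, right) => referencedColumnNames(left) ++ referencedColumnNames(right)
  case sources.Or(left, right) => referencedColumnNames(left) ++ referencedColumnNames(right)
  case sources.Not(child) => referencedColumnNames(child)
  case _ => Set.empty
}

The referencedColumns() helper in the diff above also takes the file's StructType, presumably to resolve these names against the actual Parquet schema.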
How was this patch tested?

This was tested with unit tests and with dev/run_tests for coding style.