[SPARK-21056][SQL] Use at most one spark job to list files in InMemoryFileIndex #18269
Conversation
Can one of the admins verify this patch?
```scala
Array.empty[FileStatus]
val statuses = paths.flatMap { path =>
  try {
    val fs = path.getFileSystem(hadoopConf)
```
Would it be safe to use the same instance of `fs` for all the paths in an `InMemoryFileIndex`? If this is the case, I can move this back to where it was before.
Yes, it's the same instance and should be reused
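For background on why that holds: Hadoop's `FileSystem.get` (which `Path.getFileSystem` delegates to) caches instances keyed by scheme, authority, and user, so all paths under the same filesystem resolve to one shared object. A minimal sketch illustrating this assumption (the HDFS URI is hypothetical):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object FsCacheCheck extends App {
  val hadoopConf = new Configuration()
  // Two paths on the same (hypothetical) HDFS namenode.
  val p1 = new Path("hdfs://namenode:8020/tbl/a=1")
  val p2 = new Path("hdfs://namenode:8020/tbl/a=2")
  // FileSystem.get caches by (scheme, authority, user), so both calls are
  // expected to return the same instance, unless caching is disabled via
  // fs.hdfs.impl.disable.cache.
  println(p1.getFileSystem(hadoopConf) eq p2.getFileSystem(hadoopConf))
}
```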
Thanks! Fixed.
Could you please update the PR description by copying the contents from the JIRA? Any performance numbers you can share?
I ran a synthetic scenario to show what changes, since deploying this branch would be more involved. I created two very simple relations on a small HDFS cluster (4 datanodes), running Spark with a local master. Setup:
Using master branch before my commits:
Using this PR:
Is there something more specific that I should look into?
ping @gatorsmile @srowen and possibly @cloud-fan: I would like to hear your thoughts on this.
```scala
val fs = path.getFileSystem(hadoopConf)
sessionOpt: Option[SparkSession]): Seq[(Path, Seq[FileStatus])] = {
logTrace(s"Listing ${paths.mkString(", ")}")
val fs = paths.headOption.map(_.getFileSystem(hadoopConf))
```
how about
```scala
if (paths.isEmpty) {
  Nil
} else {
  val fs = paths.head.getFileSystem(hadoopConf)
  ......
}
```
```scala
val filteredStatuses = statuses.filterNot(status => shouldFilterOut(status.getPath.getName))
val filteredStatuses = statuses.flatMap { case (path, fStatuses) =>
```
can we merge these `flatMap`s?
something like
```scala
paths.flatMap { path =>
  try {
    val statuses = fs.get.listStatus(path)
    val filteredStatuses = statuses.filterNot(status => shouldFilterOut(status.getPath.getName))
    ....
  } catch ...
}
```
then we can still keep the previous code structure
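Putting the suggestions together, a sketch of how the merged listing might look (the signature, the extracted parameters, and the `FileNotFoundException` handling are assumptions for illustration, not the actual patch):

```scala
import java.io.FileNotFoundException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// List every path and filter in a single flatMap, pairing each surviving
// status with the path it was listed under.
def listAndFilter(
    paths: Seq[Path],
    hadoopConf: Configuration,
    shouldFilterOut: String => Boolean): Seq[(Path, FileStatus)] = {
  if (paths.isEmpty) {
    Nil
  } else {
    val fs = paths.head.getFileSystem(hadoopConf)
    paths.flatMap { path =>
      try {
        fs.listStatus(path)
          .filterNot(status => shouldFilterOut(status.getPath.getName))
          .map(status => path -> status)
      } catch {
        // Assumed to mirror the existing behaviour: a vanished path
        // contributes no entries instead of failing the whole listing.
        case _: FileNotFoundException => Seq.empty[(Path, FileStatus)]
      }
    }
  }
}
```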
```scala
val filteredStatuses = statuses.filterNot(status => shouldFilterOut(status.getPath.getName))
val filteredStatuses = statuses.flatMap { case (path, fStatuses) =>
  val filtered = fStatuses.filterNot(status => shouldFilterOut(status.getPath.getName))
  if (filtered.isEmpty) {
```
nit: style issue. This if-then-else should be moved left by 2 spaces.
@bbossy I've built and deployed a branch of Spark 2.2 with your patch and compared its behavior to the same branch of Spark 2.2 without your patch. I'm seeing different behavior, but not what I expected. My test table has three partition columns, I use
@mallman I'm not sure where this difference in behaviour is coming from. The following test in
Does it match your scenario? I'll dig around a bit later to see if I can come up with an explanation.
@cloud-fan Could you take another look, please?
```scala
try {
  val fStatuses = fs.listStatus(path)
  val filtered = fStatuses.filterNot(status => shouldFilterOut(status.getPath.getName))
  if (filtered.nonEmpty) {
```
nit: `filtered.map(path -> _)`, so that we don't need the if-else here, and the `flatMap` there.
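In other words, a small before/after sketch of the suggestion (the helper functions are invented for illustration):

```scala
import org.apache.hadoop.fs.{FileStatus, Path}

def pairBefore(path: Path, filtered: Seq[FileStatus]): Seq[(Path, FileStatus)] =
  // The empty branch exists only to be flattened away by an outer flatMap.
  if (filtered.nonEmpty) filtered.map(f => path -> f) else Seq.empty

def pairAfter(path: Path, filtered: Seq[FileStatus]): Seq[(Path, FileStatus)] =
  // map on an empty collection already yields an empty result, so neither
  // the if-else nor the outer flatMap is needed.
  filtered.map(path -> _)
```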
```scala
fStatuses.map { f => path -> f }
}.partition { case (_, fStatus) => fStatus.isDirectory }
val pathsToList = dirs.map { case (_, fStatus) => fStatus.getPath }
val nestedFiles = if (pathsToList.nonEmpty) {
```
do we need this `if` check?
let's wait for @mallman's response to make sure this patch does fix the problem
Hi @bbossy
It does not match my scenario. I'm reading files from HDFS. In your test, you're reading files from the local filesystem. Can you try a test using files stored in HDFS? Also, I'm not testing with
gentle ping @bbossy, I just want to be sure it is still in progress.
@HyukjinKwon I'll try to address the outstanding review comments in the next day or two.
What changes were proposed in this pull request?
This PR changes `InMemoryFileIndex.listLeafFiles` behaviour to launch at most one Spark job to list leaf files.

JIRA:
https://issues.apache.org/jira/browse/SPARK-21056
Given a partitioned file relation (e.g. parquet):

`root/a=../b=../c=..`
`InMemoryFileIndex.listLeafFiles` runs numberOfPartitions(a) times numberOfPartitions(b) Spark jobs sequentially to list leaf files, if both numberOfPartitions(a) and numberOfPartitions(b) are below `spark.sql.sources.parallelPartitionDiscovery.threshold` and numberOfPartitions(c) is above `spark.sql.sources.parallelPartitionDiscovery.threshold`. For example, 100 partitions of `a` and 100 partitions of `b` mean 10,000 sequential jobs.
Since the jobs are run sequentially, the overhead of the jobs dominates and the file listing operation can become significantly slower than listing the files from the driver.
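For reference, the threshold mentioned above is an ordinary session configuration (default 32), so the point at which listing switches from the driver to a distributed job can be tuned; a hedged example:

```scala
// Raise the threshold so that levels with up to 64 paths are listed on
// the driver instead of triggering a distributed listing job.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")
```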
I propose that `InMemoryFileIndex.listLeafFiles` should launch at most one Spark job for listing leaf files.
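A minimal sketch of the idea, under stated assumptions (this is illustrative, not the patch itself: the name `listLeafFilesSketch` is invented, executors rebuild a default `Configuration` for brevity, and leaf files are returned as path strings to sidestep `FileStatus` serialization):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

// Walk directories breadth-first on the driver; once the frontier of
// still-unlisted directories exceeds the threshold, list everything that
// remains in ONE Spark job instead of one job per parent directory.
def listLeafFilesSketch(
    roots: Seq[Path],
    hadoopConf: Configuration,
    sc: SparkContext,
    threshold: Int): Seq[String] = {
  var leaves = Seq.empty[String]
  var frontier = roots
  while (frontier.nonEmpty && frontier.size <= threshold) {
    val fs = frontier.head.getFileSystem(hadoopConf)
    val (dirs, files) = frontier.flatMap(fs.listStatus(_)).partition(_.isDirectory)
    leaves ++= files.map(_.getPath.toString)
    frontier = dirs.map(_.getPath)
  }
  if (frontier.nonEmpty) {
    // The single job: each remaining directory is listed recursively on
    // an executor, regardless of how many partition levels remain.
    leaves ++= sc
      .parallelize(frontier.map(_.toString), frontier.size)
      .flatMap { p =>
        def recurse(dir: Path): Seq[String] = {
          // Simplification: the real code would ship the driver's Hadoop
          // configuration to executors rather than rebuilding a default one.
          val fs = dir.getFileSystem(new Configuration())
          val (dirs, files) = fs.listStatus(dir).partition(_.isDirectory)
          files.map(_.getPath.toString) ++ dirs.flatMap(d => recurse(d.getPath))
        }
        recurse(new Path(p))
      }
      .collect()
  }
  leaves
}
```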
How was this patch tested?
Adapted existing tests to match expected behaviour.