-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-8813][SQL]Combine splits by size #9097
Conversation
Test build #43641 has finished for PR 9097 at commit
|
retest this please |
out.writeUTF(location); | ||
} | ||
out.writeInt(splits.length); | ||
out.writeUTF(splits[0].getClass().getCanonicalName()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment says, we only process combination within a single table partition, so all of the class name of the splits should be exactly the identical.
Nit: Writing the split class name in the very beginning? Instead of after all of the location info.
It looks good in general, and can you also attach the benchmark result? |
434c599
to
5793af1
Compare
@chenghao-intel Just tested with data which have 15w small files and 1000 partitions.
|
Test build #45090 has finished for PR 9097 at commit
|
retest this please. |
cc/ @scwf @Sephiroth-Lin @zhichao-li has posted the benchmark result that we've done, but it's based on the fake data, I know you guys have requirement on this improvement, too, can you please test it with some real world cases? |
Test build #45104 has finished for PR 9097 at commit
|
@zhichao-li Can this patch support all of formats(Text/ORC/Parquet)? |
@watermen , Yes. It should support all formats in theory, since it combine on |
+public class CombineSplit implements InputSplit {
+ private InputSplit[] splits;
+ private long totalLen;
+ private String[] locations; VS public class CombineFileSplit extends InputSplit implements Writable {
private Path[] paths;
private long[] startoffset;
private long[] lengths;
private String[] locations;
private long totLength; |
import org.apache.hadoop.io.WritableFactories; | ||
import org.apache.hadoop.mapred.InputSplit; | ||
|
||
public class CombineSplit implements InputSplit { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment here to point out which version of Hive/Hadoop this implementation is based on.
useless import refactor
5793af1
to
701700b
Compare
retest this please. |
Test build #51839 has finished for PR 9097 at commit
|
Test build #51845 has finished for PR 9097 at commit
|
retest this please. seems like it's not related to this pr: |
retest this please |
Test build #51929 has finished for PR 9097 at commit
|
I believe this has been fixed in Spark SQL in 2.0.0. Going to close this. Thanks! |
This issue was marked as fixed in spark 2.0.0, but "spark.sql.mapper.splitCombineSize" doesn't show up in the list of the SQL configuration when I run command "spark.sql("SET -v").show(numRows = 200, truncate = false)" in spark-sql session. Do I make something wrong? |
The idea is simple and it try to solve this problem by combining splits by size which has been generated by the underlying inputformat, so it would support all of the inputformat in theory.
The combining size can be specified by
spark.sql.mapper.splitCombineSize
, the default value is: -1 meaning turn off the combining logic.i.e partition -> splits-> [combineSplit, combineSplit,...]-> RDD