
[SPARK-8813][SQL]Combine splits by size #9097

Closed
wants to merge 7 commits

Conversation

zhichao-li
Contributor

The idea is simple: try to solve this problem by combining, by size, the splits that have been generated by the underlying InputFormat, so in theory it supports every InputFormat. The combine size can be specified by spark.sql.mapper.splitCombineSize; the default value is -1, which turns the combining logic off.
i.e. partition -> splits -> [combineSplit, combineSplit, ...] -> RDD
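
For context, a minimal sketch of the combining idea in Java (not the patch itself): greedily pack the splits returned by the underlying InputFormat into groups until each group reaches the configured combine size. The class and method names below are illustrative assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapred.InputSplit;

// Illustrative sketch only: greedy packing of raw splits by size.
public class SplitCombiner {

  // Packs raw splits into groups of roughly combineSize bytes.
  // combineSize <= 0 (the patch's default of -1) disables combining.
  public static List<List<InputSplit>> combineBySize(InputSplit[] splits, long combineSize)
      throws IOException {
    List<List<InputSplit>> combined = new ArrayList<>();
    List<InputSplit> current = new ArrayList<>();
    long currentLen = 0;
    for (InputSplit split : splits) {
      current.add(split);
      if (combineSize > 0) {
        currentLen += split.getLength();
      }
      // With combining disabled, every split becomes its own group.
      if (combineSize <= 0 || currentLen >= combineSize) {
        combined.add(current);
        current = new ArrayList<>();
        currentLen = 0;
      }
    }
    if (!current.isEmpty()) {
      combined.add(current);
    }
    return combined;
  }
}

A HadoopCombineRDD-style wrapper (one of the classes this patch adds) would then create one task per group instead of one per raw split.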

@zhichao-li zhichao-li changed the title [WIP]Combine splits by size [SPARK-8813][SQL][WIP]Combine splits by size Oct 13, 2015
@SparkQA

SparkQA commented Oct 13, 2015

Test build #43641 has finished for PR 9097 at commit f9392c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class CombineSplit implements InputSplit
    • public class CombineSplitInputFormat<K, V> implements InputFormat<K, V>
    • public class CombineSplitRecordReader<K, V> implements RecordReader<K, V>
    • class HadoopCombineRDD[K, V](

@zhichao-li
Contributor Author

cc @chenghao-intel

@zhichao-li
Contributor Author

retest this please

out.writeUTF(location);
}
out.writeInt(splits.length);
out.writeUTF(splits[0].getClass().getCanonicalName());
Contributor


Can you add a comment saying that we only combine splits within a single table partition, so the class names of all the splits should be exactly identical?

Nit: write the split class name at the very beginning, instead of after all of the location info (sketched below).
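
For illustration, a hedged sketch of that suggested ordering, reusing the splits and locations fields from the snippet above; the enclosing CombineSplit class and its imports are assumed.

// Inside CombineSplit: write the shared split class name first, so that
// readFields can instantiate the splits before reading anything else.
public void write(DataOutput out) throws IOException {
  // All splits come from a single table partition, so they share one class.
  out.writeUTF(splits[0].getClass().getCanonicalName());
  out.writeInt(locations.length);
  for (String location : locations) {
    out.writeUTF(location);
  }
  out.writeInt(splits.length);
  for (InputSplit split : splits) {
    split.write(out);
  }
}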

@chenghao-intel
Contributor

It looks good in general. Can you also attach the benchmark results?

@zhichao-li
Contributor Author

@chenghao-intel Just tested with data consisting of 150,000 small files across 1000 partitions.

  1. SQL (select count(*) from test): only a small improvement. I guess task scheduling is not the bottleneck here, so reducing the number of tasks does not have much effect.
  2. SQL (select count(*) from test group by a): performance improves by about 3x; reducing the number of tasks largely improves shuffle performance (see the sketch below).
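
For reproducibility, a hedged sketch of how such a benchmark could be driven with the Spark 1.x Java API; note that spark.sql.mapper.splitCombineSize is the config proposed by this patch and does not exist in stock Spark, and the table/column names are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class CombineBenchmark {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("split-combine-benchmark");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    HiveContext sqlContext = new HiveContext(jsc.sc());

    // 128 MB target per combined split; -1 (the default) disables combining.
    sqlContext.setConf("spark.sql.mapper.splitCombineSize",
        String.valueOf(128L * 1024 * 1024));

    // Scan-only query: mostly I/O bound, so fewer tasks help only a little.
    sqlContext.sql("SELECT COUNT(*) FROM test").show();
    // Group-by query: fewer map tasks mean far fewer shuffle files.
    sqlContext.sql("SELECT COUNT(*) FROM test GROUP BY a").show();
  }
}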

@SparkQA

SparkQA commented Nov 5, 2015

Test build #45090 has finished for PR 9097 at commit 5793af1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class CombineSplit implements InputSplit
    • public class CombineSplitInputFormat<K, V> implements InputFormat<K, V>
    • public class CombineSplitRecordReader<K, V> implements RecordReader<K, V>
    • class HadoopCombineRDD[K, V](

@zhichao-li
Contributor Author

retest this please.

@chenghao-intel
Contributor

cc/ @scwf @Sephiroth-Lin

@zhichao-li has posted the benchmark results we've done, but they're based on fake data. I know you guys have requirements for this improvement too; can you please test it with some real-world cases?

@SparkQA

SparkQA commented Nov 5, 2015

Test build #45104 has finished for PR 9097 at commit 5793af1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class CombineSplit implements InputSplit
    • public class CombineSplitInputFormat<K, V> implements InputFormat<K, V>
    • public class CombineSplitRecordReader<K, V> implements RecordReader<K, V>
    • class HadoopCombineRDD[K, V](

@watermen
Contributor

watermen commented Nov 9, 2015

@zhichao-li Can this patch support all formats (Text/ORC/Parquet)?

@zhichao-li
Contributor Author

@watermen Yes, it should support all formats in theory, since it combines at the InputSplit level, i.e. on the result of inputformat.getSplits; in other words, the combining is transparent to the InputFormat. I've tested it with SequenceFile, ORC and LZO, but this patch may not always be suitable for Parquet, since Parquet does not always go through TableReader.

@zhichao-li
Contributor Author

CombineHiveInputFormat and CombineFileInputFormat have the restriction that they always assume the combined input format is a subclass of FileInputFormat; that restriction goes away if we combine at the InputSplit level.

public class CombineSplit implements InputSplit {
  private InputSplit[] splits;
  private long totalLen;
  private String[] locations;

VS

public class CombineFileSplit extends InputSplit implements Writable {

 private Path[] paths;
 private long[] startoffset;
 private long[] lengths;
 private String[] locations;
 private long totLength;
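
To illustrate why the InputSplit-level approach avoids that restriction, a hedged sketch: the wrapper below only needs the generic InputFormat contract and never touches a Path or file offset. The class name AnyFormatCombiner is hypothetical, not the patch's CombineSplitInputFormat.

import java.io.IOException;

import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;

public class AnyFormatCombiner<K, V> {
  private final InputFormat<K, V> underlying; // any InputFormat, not only FileInputFormat

  public AnyFormatCombiner(InputFormat<K, V> underlying) {
    this.underlying = underlying;
  }

  // Returns the raw splits, ready to be packed by size (see the earlier sketch).
  public InputSplit[] rawSplits(JobConf job, int numSplits) throws IOException {
    // CombineFileInputFormat would instead inspect file paths and blocks here;
    // operating on opaque InputSplits avoids any dependency on FileInputFormat.
    return underlying.getSplits(job, numSplits);
  }
}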

@zhichao-li zhichao-li changed the title [SPARK-8813][SQL][WIP]Combine splits by size [SPARK-8813][SQL]Combine splits by size Nov 9, 2015
import org.apache.hadoop.io.WritableFactories;
import org.apache.hadoop.mapred.InputSplit;

public class CombineSplit implements InputSplit {
Contributor


Please add a comment here to point out which version of Hive/Hadoop this implementation is based on.

@zhichao-li
Contributor Author

retest this please.

@SparkQA

SparkQA commented Feb 24, 2016

Test build #51839 has finished for PR 9097 at commit 701700b.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 24, 2016

Test build #51845 has finished for PR 9097 at commit 085ce5f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhichao-li
Contributor Author

retest this please.

Seems like it's not related to this PR: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.JoinedRow cannot be cast to org.apache.spark.sql.catalyst.expressions.UnsafeRow

@zhichao-li
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 25, 2016

Test build #51929 has finished for PR 9097 at commit 085ce5f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Jun 15, 2016

I believe this has been fixed in Spark SQL in 2.0.0. Going to close this. Thanks!

@asfgit asfgit closed this in 1a33f2e Jun 15, 2016
@KevinZwx
Contributor

This issue was marked as fixed in Spark 2.0.0, but "spark.sql.mapper.splitCombineSize" doesn't show up in the list of SQL configurations when I run spark.sql("SET -v").show(numRows = 200, truncate = false) in a spark-sql session. Am I doing something wrong?
