[SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit #17751

sameeragarwal · 2017-04-24T21:10:42Z

What changes were proposed in this pull request?

In randomSplit, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping splits.

To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that MapTypes cannot be sorted, this patch explicitly prunes them out of the sort order. Additionally, if the resulting sort order is empty, this patch materializes the dataset to guarantee determinism.
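To make the failure mode concrete, here is a minimal sketch in plain Scala, not Spark code. It assumes that each split is sampled by drawing one seeded random number per row position and keeping the rows whose draw falls in that split's range; the names `RandomSplitSketch`, `sample`, and `split` are hypothetical. If the same partition arrives in a different order for the second split, a row can receive a different draw and land in both splits or in neither. Sorting the partition first gives every row the same position each time.

```scala
import scala.util.Random

// Hypothetical model of per-partition sampling; not Spark's implementation.
object RandomSplitSketch {
  // Keep the rows whose positional random draw falls in [lo, hi).
  // The draw depends on the row's position in the partition, not on its value.
  def sample(partition: Seq[Int], seed: Long, lo: Double, hi: Double): Seq[Int] = {
    val rng = new Random(seed)
    partition.toVector.filter { _ =>
      val x = rng.nextDouble()
      x >= lo && x < hi
    }
  }

  // Two splits, each computed from its own materialization of the partition.
  def split(first: Seq[Int], second: Seq[Int], seed: Long): (Seq[Int], Seq[Int]) =
    (sample(first, seed, 0.0, 0.5), sample(second, seed, 0.5, 1.0))

  def main(args: Array[String]): Unit = {
    val rows = (1 to 20).toVector
    val reordered = rows.reverse // same rows, different arrival order

    // Without sorting, the two splits may overlap or drop rows.
    val (a, b) = split(rows, reordered, seed = 42L)
    println(s"unsorted: overlap = ${a.intersect(b)}")

    // Sorting each materialization first makes the splits disjoint and complete.
    val (c, d) = split(rows.sorted, reordered.sorted, seed = 42L)
    println(s"sorted: overlap = ${c.intersect(d)}")
  }
}
```

In the sorted case, every position gets exactly one draw, and that draw falls in exactly one of the two ranges, so the splits partition the rows. This also shows why a column that cannot be sorted has to be left out of the sort order rather than passed to the sort.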

How was this patch tested?

Extended randomSplit on reordered partitions in DataFrameStatSuite to also test for dataframes with mapTypes and nested mapTypes.

sameeragarwal · 2017-04-24T21:14:10Z

cc @cloud-fan @gatorsmile

gatorsmile · 2017-04-24T23:05:07Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+    // ordering deterministic. Note that MapTypes cannot be sorted and are explicitly pruned out
+    // from the sort order.
+    val sortOrder = logicalPlan.output
+      .filterNot(_.dataType.existsRecursively(dt => dt.isInstanceOf[MapType]))


How about calling RowOrdering.isOrderable?

UDT with underlying MapType is also not sortable.

nice, thanks!
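The difference between the two checks in this thread can be sketched with a toy type model. Everything below is hypothetical (`DType`, `UdtT`, `containsMap`, and this `isOrderable` are illustrative stand-ins, not Spark's classes), and it assumes, as the review comment describes, that a recursive "contains a map" check treats a user-defined type as opaque while an orderability check looks at the type underneath it.

```scala
// Hypothetical miniature type model; not Spark's DataType hierarchy.
object OrderableSketch {
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  final case class ArrayT(element: DType) extends DType
  final case class StructT(fields: List[DType]) extends DType
  final case class MapT(key: DType, value: DType) extends DType
  final case class UdtT(underlying: DType) extends DType

  // Blacklist approach: look for a map anywhere in arrays and structs.
  // A user-defined type is treated as opaque, so a map hidden inside it is missed.
  def containsMap(dt: DType): Boolean = dt match {
    case MapT(_, _)  => true
    case ArrayT(e)   => containsMap(e)
    case StructT(fs) => fs.exists(containsMap)
    case _           => false
  }

  // Whitelist approach: a type is orderable only if every part of it is,
  // including the type underlying a user-defined type.
  def isOrderable(dt: DType): Boolean = dt match {
    case IntT | StringT => true
    case ArrayT(e)      => isOrderable(e)
    case StructT(fs)    => fs.forall(isOrderable)
    case UdtT(u)        => isOrderable(u)
    case MapT(_, _)     => false
  }
}
```

With this model, `UdtT(MapT(IntT, StringT))` passes the map filter (it would wrongly stay in the sort order) but is rejected by `isOrderable`, which is the case the reviewer points out.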

gatorsmile · 2017-04-24T23:05:29Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+    val plan = if (sortOrder.nonEmpty) {
+      Sort(sortOrder, global = false, logicalPlan)
+    } else {
+      // SPARK-12662: If sort order is empty, we materialize the dataset to guarantee determinism


SPARK-12662 -> SPARK-20451?

We actually discussed materialization in https://issues.apache.org/jira/browse/SPARK-12662 so that ticket should provide direct context.

SparkQA · 2017-04-24T23:21:25Z

Test build #76118 has finished for PR 17751 at commit c859b60.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-04-24T23:53:37Z

LGTM pending Jenkins

SparkQA · 2017-04-25T02:00:13Z

Test build #76123 has finished for PR 17751 at commit b9dbb9c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…randomSplit

## What changes were proposed in this pull request?

In `randomSplit`, It is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized which could result in overlapping splits.

To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism.

## How was this patch tested?

Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes nested mapTypes.

Author: Sameer Agarwal <[email protected]>

Closes #17751 from sameeragarwal/randomsplit2.

(cherry picked from commit 31345fd)
Signed-off-by: Wenchen Fan <[email protected]>

cloud-fan · 2017-04-25T05:07:36Z

thanks, merging to master/2.2/2.1/2.0!

sameeragarwal added 2 commits April 21, 2017 15:41

fix

9206702

unit test

c859b60

gatorsmile reviewed Apr 24, 2017

CR

b9dbb9c

asfgit closed this in 31345fd Apr 25, 2017
