
[SPARK-19454][PYTHON][SQL] DataFrame.replace improvements #16793

Closed
wants to merge 5 commits

Conversation

zero323
Member

@zero323 zero323 commented Feb 3, 2017

What changes were proposed in this pull request?

  • Allows skipping value argument if to_replace is a dict:
     df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
     df.replace({"Alice": "Bob"}).show()
  • Adds validation step to ensure homogeneous values / replacements.
  • Simplifies internal control flow.
  • Improves unit test coverage.
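The dict shorthand in the first bullet can be illustrated in plain Python, independent of Spark. The following is a hypothetical sketch (the helper name `normalize_replace_args` is invented for illustration, not Spark's actual internals) of how a `{old: new}` mapping might be normalized into parallel to-replace/replacement lists, which is why the separate `value` argument becomes optional:

```python
def normalize_replace_args(to_replace, value=None):
    """Hypothetical sketch: normalize replace() arguments into two
    parallel lists. A dict supplies both sides at once; scalars are
    wrapped in lists, and a scalar value is broadcast to match."""
    if isinstance(to_replace, dict):
        # Keys are the values to replace, values are the replacements.
        return list(to_replace.keys()), list(to_replace.values())
    if not isinstance(to_replace, (list, tuple)):
        to_replace = [to_replace]
    if not isinstance(value, (list, tuple)):
        value = [value] * len(to_replace)
    return list(to_replace), list(value)
```

For example, `normalize_replace_args({"Alice": "Bob"})` yields `(["Alice"], ["Bob"])`, matching the `df.replace({"Alice": "Bob"})` call shown above.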

How was this patch tested?

Existing unit tests, additional unit tests, manual testing.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72318 has finished for PR 16793 at commit 904db24.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72319 has finished for PR 16793 at commit a3a3127.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72390 has finished for PR 16793 at commit a7b6dba.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72392 has finished for PR 16793 at commit f61b782.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72389 has finished for PR 16793 at commit c06b97c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72393 has finished for PR 16793 at commit a02e4ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@nchammas nchammas left a comment

All the new tests were intimidating at first, but then I realized that they were all testing for existing functionality. Are they all necessary?

Fixing the value API wart looks good to me. I can take a closer look at the reworked flow and all_of_ logic if necessary.

# replace with dict
row = self.spark.createDataFrame(
[(u'Alice', 10, 80.1)], schema).replace({10: 11}).first()
self.assertTupleEqual(row, (u'Alice', 11, 80.1))
Contributor

This is the only test of "new" functionality (excluding error cases), correct?

Member Author

@zero323 zero323 Feb 12, 2017

These tests are mostly a side effect of discussions related to #16792. Right now, test coverage is low and we depend on a certain behavior of Py4j and the Scala counterpart. Also, I wanted to be sure that all the expected types are still accepted after the changes I've made.

So maybe not strictly necessary, but I would argue it is a good idea to have them.

Contributor

I think (and I could be wrong) that @nchammas was suggesting it might make sense to have some more tests with dict, not that the other additional new tests are bad.

@zero323
Member Author

zero323 commented Feb 25, 2017

cc @holdenk

Contributor

@holdenk holdenk left a comment

Thanks for working on this! :) I've done a quick first pass and I've got a few questions/comments - let me know what you think and I'll follow up with a more thorough read through :)

@@ -1307,43 +1307,66 @@ def replace(self, to_replace, value, subset=None):
|null| null|null|
+----+------+----+
"""
if not isinstance(to_replace, (float, int, long, basestring, list, tuple, dict)):
# Helper functions
def all_of(types):
Contributor

Maybe give this a docstring to clarify what all_of does. Even though it's not user-facing, it's better to have a docstring than not.

subset = [subset]

if not isinstance(subset, (list, tuple)):
raise ValueError("subset should be a list or tuple of column names")
# Check if we won't pass mixed type generics
Contributor

This reads a bit awkwardly. How about "Verify we were not passed in mixed type generics."?
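The check under discussion, verifying that we were not passed mixed-type generics, can be sketched in isolation. This is a hypothetical standalone version (the function name `verify_homogeneous` is invented, and it is written in Python 3 terms, whereas the PR targeted a Python 2/3 codebase using `basestring`):

```python
def verify_homogeneous(xs):
    """Hypothetical sketch of the homogeneity check: reject a
    sequence that mixes strings with numbers, since one replace
    call maps onto a single column type on the JVM side."""
    all_strings = all(isinstance(x, str) for x in xs)
    all_numbers = all(isinstance(x, (int, float)) for x in xs)
    if not (all_strings or all_numbers):
        raise ValueError("Mixed type replacements are not supported")
```

For example, `["a", "b"]` and `[1, 2.0]` pass, while `["a", 1]` raises.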

@zero323
Member Author

zero323 commented Feb 25, 2017

I think (and I could be wrong) that @nchammas was suggesting it might make sense to have some more tests with dict, not that the other additional new tests are bad.

I am like Python - you have to be explicit :) I'll try to figure out some useful tests and get back to you. Thanks for the feedback, @holdenk and @nchammas.

@zero323
Member Author

zero323 commented Feb 27, 2017

Jenkins, retest this please

@SparkQA

SparkQA commented Feb 27, 2017

Test build #73517 has finished for PR 16793 at commit e014867.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 27, 2017

Test build #73524 has finished for PR 16793 at commit 17e6820.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if not isinstance(to_replace, (float, int, long, basestring, list, tuple, dict)):
# Helper functions
def all_of(types):
"""Given a type or tuple of types
Contributor

The formatting of this docstring seems odd here. Also, I'd clarify that all_of returns a function which you can use for the check, rather than doing the check itself.
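In that spirit, the helper might read as follows. This is a sketch consistent with the review comment (a closure returning a predicate), not necessarily the code that was merged:

```python
def all_of(types):
    """Given a type or a tuple of types, return a function that
    checks whether all elements of a sequence are instances of
    the given type(s)."""
    def all_of_(xs):
        return all(isinstance(x, types) for x in xs)
    return all_of_
```

Usage: `all_of(str)(["a", "b"])` returns `True`, while `all_of(int)(["a", "b"])` returns `False`.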

rep_dict = to_replace
if value is not None:
warnings.warn("to_replace is a dict, but value is not None. "
Contributor

Does this need to be split?

Member Author

Maybe not.


if not isinstance(value, (float, int, long, basestring, list, tuple)):
raise ValueError("value should be a float, int, long, string, list, or tuple")
if (not isinstance(value, valid_types) and
Contributor

This seems like a weird split.

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74146 has finished for PR 16793 at commit 03303df.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74153 has finished for PR 16793 at commit 03303df.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Mar 8, 2017

Jenkins retest this please (47b2f68).

@SparkQA

SparkQA commented Mar 8, 2017

Test build #74163 has finished for PR 16793 at commit 03303df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Mar 16, 2017

@holdenk Do you think it is realistic to see this merged into 2.2?

@holdenk
Contributor

holdenk commented Apr 3, 2017

Let me try and take a look tonight. It seems like there are some small formatting issues still at a quick glance but I feel like this should be close.

@holdenk
Contributor

holdenk commented Apr 5, 2017

LGTM

@holdenk
Contributor

holdenk commented Apr 5, 2017

Merged to master

@asfgit asfgit closed this in e277399 Apr 5, 2017
@zero323
Member Author

zero323 commented Apr 6, 2017

Thanks @holdenk

@zero323 zero323 deleted the SPARK-19454 branch April 6, 2017 10:57
@rxin
Contributor

rxin commented Feb 2, 2018

Sorry, I object to this change. Why would we put null as the default replace value, in a function called replace? That seems very counterintuitive and error-prone.

@rxin
Contributor

rxin commented Feb 2, 2018

Also, the implementation doesn't match what was proposed in https://issues.apache.org/jira/browse/SPARK-19454

Having a null value as the default in a function called replace is too risky and error-prone.

@HyukjinKwon
Member

I think the actual root cause is that we allowed a dictionary for to_replace in the first place.

So, would you prefer to have:

def replace(self, to_replace, value, subset=None):
    ...

But in this case, if to_replace is a dictionary, we would have to call it as below.

 df.replace({"Alice": "Bob"}, 1).show()

@HyukjinKwon
Member

Otherwise, please give me a few days. Let me give it a shot with def replace(self, to_replace, *args, **kwargs): and see if I can resolve it, if we are okay with that, although I guess pydoc will show a less pretty doc.
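The *args/**kwargs idea can be sketched as a standalone argument parser. This is purely hypothetical (the name `parse_replace_args` is invented; the real logic would live inside DataFrame.replace): it keeps `value` positional for the existing call style while making it optional only when to_replace is a dict.

```python
def parse_replace_args(to_replace, *args, **kwargs):
    """Hypothetical parser for the proposed
    replace(self, to_replace, *args, **kwargs) signature:
    `value` stays required unless to_replace is a dict."""
    if args:
        value = args[0]
    elif "value" in kwargs:
        value = kwargs["value"]
    elif isinstance(to_replace, dict):
        value = None  # the dict already carries the replacements
    else:
        raise TypeError("value is required unless to_replace is a dict")
    return to_replace, value
```

Under this sketch, both `replace({"Alice": "Bob"})` and the classic `replace(10, 11)` parse cleanly, while `replace(10)` fails fast with a TypeError.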

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 3, 2018
…na.replace in PySpark"

This reverts commit 0fcde87.

See the discussion in [SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658),  [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and apache#16793

Author: hyukjinkwon <[email protected]>

Closes apache#20496 from HyukjinKwon/revert-SPARK-21658.
asfgit pushed a commit that referenced this pull request Feb 3, 2018
…na.replace in PySpark"

This reverts commit 0fcde87.

See the discussion in [SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658),  [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and #16793

Author: hyukjinkwon <[email protected]>

Closes #20496 from HyukjinKwon/revert-SPARK-21658.

(cherry picked from commit 551dff2)
Signed-off-by: gatorsmile <[email protected]>