[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary #20499
Conversation
```diff
@@ -2245,11 +2245,6 @@ def test_replace(self):
             .replace(False, True).first())
         self.assertTupleEqual(row, (True, True))

-        # replace list while value is not given (default to None)
```
Seems we should disallow the case. Please see #16793 (comment).
@HyukjinKwon We need a separate JIRA and target it to 2.3.
cc @rxin, @gatorsmile, @holdenk, @zero323 and @viirya, this is an alternative to reverting its alias matching, and a fix to address #16793 (comment). Could you guys take a look and see if it makes sense?
Sure.
Thanks! Also cc @ueshin @cloud-fan
The linked JIRA targets 2.3.0 and it was an alternative to reverting #20496 (comment) .. Let me rebase it here anyway ..
(force-pushed from 13bdc24 to 198bda4)
I think this is what was originally proposed in the JIRA:

```python
df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
df.replace({"Alice": "Bob"}, 1)
```
python/pyspark/sql/dataframe.py (Outdated)

```python
        elif isinstance(to_replace, dict):
            value = None  # When to_replace is a dictionary, value can be omitted.
        else:
            raise TypeError("value is required when to_replace is not a dictionary.")
```
There are some old checks below, like:

```python
if not isinstance(value, valid_types) and value is not None \
        and not isinstance(to_replace, dict):
    raise ValueError("If to_replace is not a dict, value should be "
                     "a bool, float, int, long, string, list, tuple or None. "
                     "Got {0}".format(type(value)))
```

Should we clean it up too?
Btw, can't we just remove `value is not None` in the above to let `None` be disallowed when `to_replace` is not a dict?
Hm, I think that check is still valid. The newly added logic here focuses on checking for missing arguments, whereas the logic below focuses on checking whether arguments are of valid types.

Will try to add an explicit test for the #20499 (comment) case with a few comment changes.

For #20499 (comment), I just tried to check. Seems we should keep that `None` to support:

```python
>>> df.na.replace('Alice', None).show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```

If we remove that condition above, seems we will hit:

```
...
ValueError: If to_replace is not a dict, value should be a bool, float, int, long, string, list, tuple or None. Got <type 'NoneType'>
```
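To make the distinction concrete, here is a rough sketch of the two separate checks (reconstructed for illustration; `value_given` is a hypothetical flag, not the PR's actual code):

```python
valid_types = (bool, float, int, str, list, tuple)

def _validate(to_replace, value, value_given):
    # Missing-argument check: 'value' may be omitted only for a dict 'to_replace'.
    if not value_given and not isinstance(to_replace, dict):
        raise TypeError("value is required when to_replace is not a dictionary.")
    # Type check: an *explicitly passed* None is still a valid replacement value,
    # which is why the 'value is not None' escape hatch has to stay.
    if not isinstance(value, valid_types) and value is not None \
            and not isinstance(to_replace, dict):
        raise ValueError("If to_replace is not a dict, value should be "
                         "a bool, float, int, long, string, list, tuple or None. "
                         "Got {0}".format(type(value)))
```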
python/pyspark/sql/dataframe.py (Outdated)

```diff
@@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, subset=None):
         For example, if `value` is a string, and subset contains a non-string column,
         then the non-string column is simply ignored.

+        .. note:: `value` can only be omitted when `to_replace` is a dictionary. Otherwise,
+            it is required.
```
Shall we just describe this in `value`'s param doc?
Sure.
python/pyspark/sql/tests.py (Outdated)

```diff
@@ -2186,7 +2186,7 @@ def test_replace(self):
         # replace with subset specified with one column replaced, another column not in subset
         # stays unchanged.
         row = self.spark.createDataFrame(
-            [(u'Alice', 10, 10.0)], schema).replace(10, 20, subset=['name', 'age']).first()
+            [(u'Alice', 10, 10.0)], schema).replace(10, value=20, subset=['name', 'age']).first()
```
Are the above two test changes necessary?
I don't think it's necessary, but let me keep them since at least they test different combinations of valid cases.
python/pyspark/sql/dataframe.py (Outdated)

```python
                "%s arguments." % len([to_replace] + list(args) + list(kwargs.values())))

        is_unexpected_kwargs = \
            len(kwargs) == 2 and ("value" not in kwargs or "subset" not in kwargs)
```
```python
df.na.replace({'Alice': 'Bob'}, foo='bar').show()
```

Seems this case can't be detected?
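A stricter variant would reject any unknown keyword regardless of how many are passed; a minimal sketch (illustrative only, not the PR's final code):

```python
def _check_kwargs(kwargs):
    # Reject any keyword other than 'value'/'subset', so a single unknown
    # keyword such as foo='bar' is caught as well.
    unexpected = set(kwargs) - {"value", "subset"}
    if unexpected:
        raise TypeError(
            "Unexpected keyword arguments: %s" % ", ".join(sorted(unexpected)))
```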
I think that behavior is shipped in 2.2, right? Then we may need to add a note in the migration guide.
Yup, sounds good.
python/pyspark/sql/tests.py (Outdated)

```diff
@@ -2175,7 +2175,7 @@ def test_replace(self):

         # replace with subset specified by a string of a column name w/ actual change
         row = self.spark.createDataFrame(
-            [(u'Alice', 10, 80.1)], schema).replace(10, 20, subset='age').first()
+            [(u'Alice', 10, 80.1)], schema).replace(10, 'age', value=20).first()
```
Will this conflict with the convention for function arguments in Python? Usually, arguments before a keyword argument are resolved by position, but now `'age'` is resolved to `subset`, which is the third argument, behind `value`. Since the function signature is changed, this may not be a big issue.
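For context, a rough sketch of a `*args`/`**kwargs` dispatch with the behavior described above (a simplification for illustration, not the PR's exact code):

```python
def replace(to_replace, *args, **kwargs):
    # Positional args fill whichever of 'value'/'subset' was NOT passed by
    # keyword, in order. This is what lets replace(10, 'age', value=20)
    # resolve 'age' to 'subset', bypassing normal positional binding.
    free = [name for name in ("value", "subset") if name not in kwargs]
    kwargs.update(zip(free, args))
    return to_replace, kwargs.get("value"), kwargs.get("subset")

print(replace(10, 'age', value=20))  # (10, 20, 'age')
```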
python/pyspark/sql/dataframe.py (Outdated)

```python
        # It deals with a problem when 'value' is set to None and 'to_replace' is a dictionary.
        # Validate if arguments are missing or not.
        is_more_than_two = len(args) + len(kwargs) > 2
```
I read this a few times and still feel that it is kind of verbose. But it seems there is no better way to check whether an optional parameter is set in Python.
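Piecing the scattered hunks above together, the missing/extra-argument validation is roughly of this shape (a reconstruction for readability; the error-message wording is a guess, not the verbatim PR code):

```python
def replace(self, to_replace, *args, **kwargs):
    # 'value' and 'subset' are the only extra parameters allowed.
    is_more_than_two = len(args) + len(kwargs) > 2
    is_unexpected_kwargs = \
        len(kwargs) == 2 and ("value" not in kwargs or "subset" not in kwargs)
    if is_more_than_two or is_unexpected_kwargs:
        raise TypeError(
            "replace() takes at most 3 arguments but got "
            "%s arguments." % len([to_replace] + list(args) + list(kwargs.values())))
```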
Seems RC3 is about to be cut; do we want to get this in 2.3?
Is it a bug fix or a new feature?
I think it's a bug fix. For the context,
Looks like an existing issue since Spark 2.2, I don't think this should block 2.3.
python/pyspark/sql/dataframe.py (Outdated)

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```
What's the expectation? If `to_replace` is a dict, `value` should be ignored?
Yes, if `value` is explicitly given, I thought we'd ignore `value`, as we have done from the start.
I see the problem now. If `to_replace` is a dict, then `value` should be ignored and we should provide a default value. If `to_replace` is not a dict, then `value` is required and we should not provide a default value.

Can we use an invalid value as the default value for `value`? Then we can throw an exception if `value` is not set by the user.
Yea, I think that summarises the issue.

> Can we use an invalid value as the default value for value? Then we can throw exception if the value is not set by user.

Yea, we could define a class / instance to indicate no value like NumPy does -
https://github.com/numpy/numpy/blob/master/numpy/_globals.py#L76. I was thinking of something resembling this too, but it is kind of a new approach for Spark, and this is a single case so far.

To get to the point, yea, we could maybe use an invalid value and unset/ignore it if `to_replace` is a dictionary. For example, I can assign `{}`. But then the problem is the docstring produced by pydoc and the API documentation. It will show something like:

```
Help on method replace in module pyspark.sql.dataframe:

replace(self, to_replace, value={}, subset=None) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` replacing a value with another value.
    ...
```

This is pretty confusing. To my knowledge, we can't really override this signature in the doc - I tried a few times before and failed, if I remember correctly.

Maybe this is good enough, but I didn't want to start with that approach, because the issue @rxin raised sounds like it is, strictly speaking, about having a default value at all.

To be honest, it seems Pandas's `replace` also has `None` as the default value -
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace.
So, to cut it short, yea, if a less pretty doc is fine, I can try. That would reduce the change a lot.
What's the docstring for `def replace(self, to_replace, *args, **kwargs)`?
It's just as is:

```
replace(self, to_replace, *args, **kwargs)
```

but this is better than `replace(self, to_replace, value={}, subset=None)` IMHO.
I prefer `def replace(self, to_replace, value=_NoValue, subset=None)`.

`def replace(self, to_replace, *args, **kwargs)` loses the information about `value` and `subset`.
Yea, either way works for me. Let me look around this a bit more and give it a shot to show how it looks.
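For reference, a minimal sketch of the NumPy-style sentinel being discussed (the `_NoValue` name follows the thread; this is illustrative, not the final committed code):

```python
class _NoValueType(object):
    """Singleton sentinel meaning 'the caller did not pass this argument'.

    Unlike None, it can never collide with a legitimate user-supplied value.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
        return cls._instance

    def __repr__(self):
        return "<no value>"

_NoValue = _NoValueType()

def replace(to_replace, value=_NoValue, subset=None):
    # 'value' may be omitted only when 'to_replace' is a dictionary.
    if value is _NoValue:
        if isinstance(to_replace, dict):
            value = None  # ignored when 'to_replace' is a dict
        else:
            raise TypeError("value is required when to_replace is not a dictionary.")
    return to_replace, value, subset
```

This keeps the keyword signature (so `value` and `subset` stay visible in pydoc) while still distinguishing "not passed" from an explicit `None`.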
Sure, let me unset the target version.
I'd fix this in 2.3, and 2.2.1 as well. It's just bad API design for 2.2.
Will update this tonight.
(force-pushed from 1849f59 to 9f49b05)
(force-pushed from 9f49b05 to a349d07)
LGTM, waiting for more feedback.
LGTM
…ce when 'to_replace' is not a dictionary

## What changes were proposed in this pull request?

This PR proposes to disallow default value None when 'to_replace' is not a dictionary.

It seems weird we set the default value of `value` to `None` and we ended up allowing the case as below:

```python
>>> df.show()
```
```
+----+------+-----+
| age|height| name|
+----+------+-----+
|  10|    80|Alice|
...
```
```python
>>> df.na.replace('Alice').show()
```
```
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```

**After**

This PR targets to disallow the case above:

```python
>>> df.na.replace('Alice').show()
```
```
...
TypeError: value is required when to_replace is not a dictionary.
```

while we still allow when `to_replace` is a dictionary:

```python
>>> df.na.replace({'Alice': None}).show()
```
```
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```

## How was this patch tested?

Manually tested, tests were added in `python/pyspark/sql/tests.py` and doctests were fixed.

Author: hyukjinkwon <[email protected]>

Closes #20499 from HyukjinKwon/SPARK-19454-followup.

(cherry picked from commit 4b4ee26)
Signed-off-by: Wenchen Fan <[email protected]>
thanks, merging to master/2.3! Can you send a new PR for 2.2? it conflicts...
Yup, I should fix the guide for 2.2 anyway :-) Will open a backport tonight KST.