
[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary #20499

Closed
wants to merge 2 commits into from

Conversation


@HyukjinKwon HyukjinKwon commented Feb 3, 2018

What changes were proposed in this pull request?

This PR proposes to disallow the default value None for `value` when 'to_replace' is not a dictionary.

It seems odd that we set the default value of `value` to `None`; that ended up allowing the case below:

>>> df.show()
+----+------+-----+
| age|height| name|
+----+------+-----+
|  10|    80|Alice|
...
>>> df.na.replace('Alice').show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...

After

This PR disallows the case above:

>>> df.na.replace('Alice').show()
...
TypeError: value is required when to_replace is not a dictionary.

while we still allow it when to_replace is a dictionary:

>>> df.na.replace({'Alice': None}).show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
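The check being proposed can be sketched as a plain function (a hypothetical illustration with an invented name, not the actual pyspark source):

```python
def check_replace_args(to_replace, value=None):
    # When to_replace is a dictionary, value can be omitted (the dict
    # itself carries the replacement mapping).
    if isinstance(to_replace, dict):
        return to_replace, None
    # Otherwise a missing value is an error rather than a silent
    # replace-with-null.
    if value is None:
        raise TypeError("value is required when to_replace is not a dictionary.")
    return to_replace, value
```

One caveat: a plain `None` default cannot distinguish an omitted `value` from an explicit `None`, which is why the thread below ends up discussing a sentinel default.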

How was this patch tested?

Manually tested, tests were added in python/pyspark/sql/tests.py and doctests were fixed.

@@ -2245,11 +2245,6 @@ def test_replace(self):
.replace(False, True).first())
self.assertTupleEqual(row, (True, True))

# replace list while value is not given (default to None)
Member Author

Seems we should disallow the case. Please see #16793 (comment)


gatorsmile commented Feb 3, 2018

@HyukjinKwon We need a separate JIRA and target it to 2.3

@HyukjinKwon (Member Author)

cc @rxin, @gatorsmile, @holdenk, @zero323 and @viirya, this is an alternative to reverting its alias matching, and a fix to address #16793 (comment). Could you guys take a look and see if it makes sense?

@HyukjinKwon (Member Author)

We need a separate JIRA and target it to 2.3

Sure.

@gatorsmile (Member)

Thanks!

Also cc @ueshin @cloud-fan

@HyukjinKwon HyukjinKwon changed the title [SPARK-19454][PYTHON][FOLLOWUP] Disallow default value None when 'to_replace' is not a dictionary [SPARK-23328][PYTHON][FOLLOWUP] Disallow default value None when 'to_replace' is not a dictionary Feb 3, 2018
@HyukjinKwon (Member Author)

The linked JIRA targets 2.3.0 and it was an alternative to reverting #20496 (comment) .. Let me rebase it here anyway ..

@HyukjinKwon HyukjinKwon changed the title [SPARK-23328][PYTHON][FOLLOWUP] Disallow default value None when 'to_replace' is not a dictionary [SPARK-23328][PYTHON][FOLLOWUP] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary Feb 3, 2018

SparkQA commented Feb 3, 2018

Test build #87037 has finished for PR 20499 at commit 13bdc24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-23328][PYTHON][FOLLOWUP] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary [SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary Feb 3, 2018

SparkQA commented Feb 3, 2018

Test build #87038 has finished for PR 20499 at commit 198bda4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Feb 3, 2018

I think this is what was originally proposed in the JIRA:

It is possible to use a dict as to_replace, but we cannot skip or use None as the `value` argument (although it is ignored). This requires passing "magic" values:

df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
df.replace({"Alice": "Bob"}, 1)

elif isinstance(to_replace, dict):
value = None # When to_replace is a dictionary, value can be omitted.
else:
raise TypeError("value is required when to_replace is not a dictionary.")
Member

There are some old checks below, like:

        if not isinstance(value, valid_types) and value is not None \
                and not isinstance(to_replace, dict):
            raise ValueError("If to_replace is not a dict, value should be "
                             "a bool, float, int, long, string, list, tuple or None. "
                             "Got {0}".format(type(value)))

Should we clean those up too?

Member

Btw, can't we just remove `value is not None` in the check above, to disallow `None` when `to_replace` is not a dict?

@HyukjinKwon HyukjinKwon (Member Author) Feb 4, 2018

Hm, I think that check is still valid. The newly added logic here focuses on checking for missing arguments, whereas the logic below focuses on checking whether arguments are valid types.

Will try to add an explicit test for the #20499 (comment) case with a few comment changes.

For #20499 (comment), I just checked. It seems we should keep that None to support:

>>> df.na.replace('Alice', None).show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...

If we remove that condition above, it seems we will hit:

...
ValueError: If to_replace is not a dict, value should be a bool, float, int, long, string, list, tuple or None. Got <type 'NoneType'>
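To make the interplay concrete, here is a hedged, standalone sketch of that old type check (`valid_types` is simplified for plain Python 3; the real list also covers Python 2's long). Keeping the `value is not None` clause is what lets an explicit `None` through as a legitimate replacement value:

```python
valid_types = (bool, float, int, str, list, tuple)

def check_value_type(to_replace, value):
    # Mirrors the quoted check: an explicit None is accepted as a
    # replacement value when to_replace is not a dict.
    if not isinstance(value, valid_types) and value is not None \
            and not isinstance(to_replace, dict):
        raise ValueError("If to_replace is not a dict, value should be "
                         "a bool, float, int, long, string, list, tuple or None. "
                         "Got {0}".format(type(value)))

check_value_type("Alice", None)  # passes: replacing 'Alice' with null is valid
```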


SparkQA commented Feb 4, 2018

Test build #87042 has finished for PR 20499 at commit 3da7ba6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 4, 2018

Test build #87051 has finished for PR 20499 at commit 5c18594.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, subset=None):
For example, if `value` is a string, and subset contains a non-string column,
then the non-string column is simply ignored.

.. note:: `value` can only be omitted when `to_replace` is a dictionary. Otherwise,
it is required.
Member

Shall we just describe this in value's param doc?

Member Author

Sure.

@@ -2186,7 +2186,7 @@ def test_replace(self):
# replace with subset specified with one column replaced, another column not in subset
# stays unchanged.
row = self.spark.createDataFrame(
[(u'Alice', 10, 10.0)], schema).replace(10, 20, subset=['name', 'age']).first()
[(u'Alice', 10, 10.0)], schema).replace(10, value=20, subset=['name', 'age']).first()
Member

Are the above two test changes necessary?

Member Author

I don't think it's necessary, but let me keep them since they at least test different combinations of valid cases.

"%s arguments." % len([to_replace] + list(args) + list(kwargs.values())))

is_unexpected_kwargs = \
len(kwargs) == 2 and ("value" not in kwargs or "subset" not in kwargs)
Member

df.na.replace({'Alice': 'Bob'}, foo = 'bar').show()

Seems this case can't be detected?
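One hedged way to close that gap (a sketch, not the code in this PR): validate keyword names against an allowed set instead of only checking counts, so a single bogus keyword is also caught.

```python
def check_kwargs(kwargs):
    # Reject any keyword argument that is not 'value' or 'subset',
    # regardless of how many keywords were passed.
    unexpected = set(kwargs) - {"value", "subset"}
    if unexpected:
        raise TypeError(
            "Unexpected keyword arguments: %s" % ", ".join(sorted(unexpected)))
```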


SparkQA commented Feb 5, 2018

Test build #87068 has finished for PR 20499 at commit 1849f59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


viirya commented Feb 6, 2018

I think that behavior shipped in 2.2, right? Then we may need to add a note to the migration guide.

@HyukjinKwon (Member Author)

Yup, sounds good.

@@ -2175,7 +2175,7 @@ def test_replace(self):

# replace with subset specified by a string of a column name w/ actual change
row = self.spark.createDataFrame(
[(u'Alice', 10, 80.1)], schema).replace(10, 20, subset='age').first()
[(u'Alice', 10, 80.1)], schema).replace(10, 'age', value=20).first()
Member

Will this conflict with Python's convention for function arguments?

Usually, arguments before a keyword argument are resolved by position. But here 'age' is resolved to subset, which is the third parameter, behind value.

Since the function signature is changed, this may not be a big issue.
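The concern can be seen with the original named signature; this is a plain-Python illustration, not pyspark:

```python
def replace(to_replace, value=None, subset=None):
    return to_replace, value, subset

# Positional arguments bind in declaration order, so the second
# positional argument goes to value, not subset:
assert replace(10, 20, subset='age') == (10, 20, 'age')

# With the named signature, passing 'age' positionally while also
# passing value=20 as a keyword collides:
try:
    replace(10, 'age', value=20)
except TypeError:
    pass  # got multiple values for argument 'value'
```

The test change only works because the signature was changed to `*args`/`**kwargs`, which resolves such calls itself.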


# It deals with a problem when 'value' is set to None and 'to_replace' is a dictionary.
# Validate if arguments are missing or not.
is_more_than_two = len(args) + len(kwargs) > 2
Member

I read this a few times and still feel it is kind of verbose. But it seems there is no better way in Python to check whether an optional parameter was set.
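For reference, the usual Python idiom for detecting an unset optional parameter is a module-private sentinel default, the approach the thread later converges on as `_NoValueType`. A rough sketch with invented names, not the pyspark implementation:

```python
class _NoValueType(object):
    """Singleton marker meaning 'the caller did not pass this argument'."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(_NoValueType, cls).__new__(cls)
        return cls._instance

    def __repr__(self):
        return "<no value>"

_NoValue = _NoValueType()

def replace(to_replace, value=_NoValue, subset=None):
    if isinstance(to_replace, dict):
        value = None  # ignored when to_replace is a dictionary
    elif value is _NoValue:
        raise TypeError("value is required when to_replace is not a dictionary.")
    return to_replace, value, subset
```

Because the sentinel is a singleton, the `value is _NoValue` identity check is reliable, and an explicit `None` still passes through as a valid replacement value.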

@viirya (Member)

viirya commented Feb 6, 2018

Seems RC3 is about to be cut; do we want to get this into 2.3?

@cloud-fan (Contributor)

is it a bug fix or a new feature?


HyukjinKwon commented Feb 6, 2018

I think it's a bug fix. For the context,

  1. the dictionary support was added but value was required (and ignored) at the first place SPARK-6876:

    df.replace({"Alice": "Bob"}, value=1).show()
  2. value=None was added to allow the case below SPARK-19454:

    df.replace({"Alice": "Bob"}).show()

    but it happened to allow the case below too:

    df.replace("Alice").show()
    
  3. df.na.replace is an alias of df.replace. So, it has the same default `value=None` SPARK-21658:

    df.na.replace({"Alice": "Bob"}).show()
  4. A design issue was raised in SPARK-21658 and [SPARK-19454][PYTHON][SQL] DataFrame.replace improvements #16793 (comment) (SPARK-19454).

    To cut it short, I think the issue raised is basically that the case below is weird:

    df.replace("Alice").show()
  5. SPARK-21658 was reverted.

  6. This PR tries to throw an exception in case 4, and also deals with SPARK-21658

@cloud-fan (Contributor)

Looks like an existing issue since Spark 2.2, I don't think this should block 2.3.

@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

@since(1.4)
def replace(self, to_replace, value=None, subset=None):
def replace(self, to_replace, *args, **kwargs):
Contributor

what's the expectation? if to_replace is a dict, value should be ignored?

Member Author

Yes, if `value` is explicitly given, I thought we'd ignore it, as we have done from the start.

Contributor

I see the problem now. If to_replace is a dict, then value should be ignored and we should provide a default value. If to_replace is not a dict, then value is required and we should not provide a default value.

Can we use an invalid value as the default for value? Then we can throw an exception if value is not set by the user.

@HyukjinKwon HyukjinKwon (Member Author) Feb 6, 2018

Yea, I think that summarises the issue.

Can we use an invalid value as the default value for value? Then we can throw exception if the value is not set by user.

Yea, we could define a class / instance to indicate no value, like NumPy does -
https://github.com/numpy/numpy/blob/master/numpy/_globals.py#L76 . I was thinking of this approach too, but it is kind of new to Spark and this is a single case so far.

To get to the point, yea, we could maybe use an invalid value as the default and unset/ignore it if to_replace is a dictionary. For example, I could assign {}. But then the problem is the docstring generated by pydoc and the API documentation. It would show something like:

Help on method replace in module pyspark.sql.dataframe:

replace(self, to_replace, value={}, subset=None) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` replacing a value with another value.
...

This is pretty confusing. To my knowledge, we can't really override this signature in the docs - I tried a few times before and failed, if I remember correctly.

Maybe this is good enough, but I didn't want to start with that approach, because, strictly speaking, the issue @rxin raised is that it has a default value at all.

To be honest, it seems pandas's replace also has None as the default value -
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace.

Member Author

So, to cut it short, yea, if a less pretty doc is fine, I can try. That would reduce the change a lot.

Contributor

what's the docstring for def replace(self, to_replace, *args, **kwargs)?

Member Author

It's just as is:

def replace(self, to_replace, *args, **kwargs)

but this is better than replace(self, to_replace, value={}, subset=None) IMHO.

Contributor

I prefer def replace(self, to_replace, value=_NoValue, subset=None).

def replace(self, to_replace, *args, **kwargs) loses the information about value and subset.
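The point about lost information can be checked directly with `inspect`; this is a standalone illustration with stand-in functions:

```python
import inspect

def replace_varargs(to_replace, *args, **kwargs):
    pass

def replace_named(to_replace, value=None, subset=None):
    pass

# The *args/**kwargs form hides the real parameter names from tools
# (pydoc, IDEs, inspect), while the named form keeps them visible.
print(inspect.signature(replace_varargs))  # (to_replace, *args, **kwargs)
print(inspect.signature(replace_named))    # (to_replace, value=None, subset=None)
```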

Member Author

Yea, either way works for me. Let me look around a bit more and take a shot at showing how it looks.

@HyukjinKwon (Member Author)

Sure, let me unset the target version.


rxin commented Feb 7, 2018

I'd fix this in 2.3, and 2.2.1 as well. It's just bad API design for 2.2.

@HyukjinKwon (Member Author)

Will update this tonight

@cloud-fan (Contributor)

LGTM, waiting for more feedback.


SparkQA commented Feb 8, 2018

Test build #87196 has finished for PR 20499 at commit 9f49b05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _NoValueType(object):


SparkQA commented Feb 8, 2018

Test build #87198 has finished for PR 20499 at commit a349d07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class _NoValueType(object):


SparkQA commented Feb 8, 2018

Test build #87201 has finished for PR 20499 at commit 885b4d0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

retest this please


SparkQA commented Feb 8, 2018

Test build #87205 has finished for PR 20499 at commit 885b4d0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

retest this please


SparkQA commented Feb 8, 2018

Test build #87206 has finished for PR 20499 at commit 885b4d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya (Member) left a comment

LGTM

asfgit pushed a commit that referenced this pull request Feb 9, 2018
…ce when 'to_replace' is not a dictionary


Author: hyukjinkwon <[email protected]>

Closes #20499 from HyukjinKwon/SPARK-19454-followup.

(cherry picked from commit 4b4ee26)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan (Contributor)

thanks, merging to master/2.3!

Can you send a new PR for 2.2? It conflicts...

@asfgit asfgit closed this in 4b4ee26 Feb 9, 2018
@HyukjinKwon (Member Author)

Yup, I should fix the guide for 2.2 anyway :-) Will open a backport tonight KST.

robert3005 pushed a commit to palantir/spark that referenced this pull request Feb 12, 2018
…ce when 'to_replace' is not a dictionary

@HyukjinKwon HyukjinKwon deleted the SPARK-19454-followup branch October 16, 2018 12:44