[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary #20499
Conversation
```diff
@@ -2245,11 +2245,6 @@ def test_replace(self):
             .replace(False, True).first())
         self.assertTupleEqual(row, (True, True))

-        # replace list while value is not given (default to None)
```
Seems we should disallow the case. Please see #16793 (comment).
@HyukjinKwon We need a separate JIRA and target it to 2.3.
cc @rxin, @gatorsmile, @holdenk, @zero323 and @viirya, this is an alternative to reverting its alias matching, and a fix to address #16793 (comment). Could you guys take a look and see if it makes sense?
Sure.
Thanks! Also cc @ueshin @cloud-fan
The linked JIRA targets 2.3.0 and it was an alternative to reverting #20496 (comment) .. Let me rebase it here anyway ..
(force-pushed from 13bdc24 to 198bda4)
I think this is what was originally proposed in the JIRA:

```python
df = sc.parallelize([("Alice", 1, 3.0)]).toDF()
df.replace({"Alice": "Bob"}, 1)
```
python/pyspark/sql/dataframe.py (Outdated)

```python
        elif isinstance(to_replace, dict):
            value = None  # When to_replace is a dictionary, value can be omitted.
        else:
            raise TypeError("value is required when to_replace is not a dictionary.")
```
There are some old checks below, like:

```python
if not isinstance(value, valid_types) and value is not None \
        and not isinstance(to_replace, dict):
    raise ValueError("If to_replace is not a dict, value should be "
                     "a bool, float, int, long, string, list, tuple or None. "
                     "Got {0}".format(type(value)))
```

Should we clean it up too?
Btw, can't we just remove `value is not None` in the above to let `None` be disallowed when `to_replace` is not a dict?
Hm, I think that check is still valid. The newly added logic here focuses on checking for missing arguments, whereas the logic below focuses on checking whether arguments are of valid types.

Will try to add an explicit test for the #20499 (comment) case with a few comment changes.

For #20499 (comment), I just tried to check. Seems we should keep that `None` to support:

```python
>>> df.na.replace('Alice', None).show()
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```

If we remove that condition above, seems we will hit:

```
...
ValueError: If to_replace is not a dict, value should be a bool, float, int, long, string, list, tuple or None. Got <type 'NoneType'>
```
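To make the distinction concrete, here is a rough sketch of the two separate checks (reconstructed for illustration; `value_given` is a hypothetical flag, not the PR's actual code):

```python
valid_types = (bool, float, int, str, list, tuple)

def _validate(to_replace, value, value_given):
    # Missing-argument check: 'value' may be omitted only for a dict 'to_replace'.
    if not value_given and not isinstance(to_replace, dict):
        raise TypeError("value is required when to_replace is not a dictionary.")
    # Type check: an *explicitly passed* None is still a valid replacement value,
    # which is why the 'value is not None' escape hatch has to stay.
    if not isinstance(value, valid_types) and value is not None \
            and not isinstance(to_replace, dict):
        raise ValueError("If to_replace is not a dict, value should be "
                         "a bool, float, int, long, string, list, tuple or None. "
                         "Got {0}".format(type(value)))
```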
python/pyspark/sql/dataframe.py (Outdated)

```diff
@@ -1557,6 +1557,9 @@ def replace(self, to_replace, value=None, subset=None):
         For example, if `value` is a string, and subset contains a non-string column,
         then the non-string column is simply ignored.

+        .. note:: `value` can only be omitted when `to_replace` is a dictionary. Otherwise,
+            it is required.
```
Shall we just describe this in `value`'s param doc?
Sure.
python/pyspark/sql/tests.py (Outdated)

```diff
@@ -2186,7 +2186,7 @@ def test_replace(self):
         # replace with subset specified with one column replaced, another column not in subset
         # stays unchanged.
         row = self.spark.createDataFrame(
-            [(u'Alice', 10, 10.0)], schema).replace(10, 20, subset=['name', 'age']).first()
+            [(u'Alice', 10, 10.0)], schema).replace(10, value=20, subset=['name', 'age']).first()
```
Are the above two test changes necessary?
I don't think it's necessary, but let me keep them since at least they test different combinations of valid cases.
python/pyspark/sql/dataframe.py (Outdated)

```python
                "%s arguments." % len([to_replace] + list(args) + list(kwargs.values())))

        is_unexpected_kwargs = \
            len(kwargs) == 2 and ("value" not in kwargs or "subset" not in kwargs)
```
```python
df.na.replace({'Alice': 'Bob'}, foo='bar').show()
```

Seems this case can't be detected?
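A stricter variant would reject any unknown keyword regardless of how many are passed; a minimal sketch (illustrative only, not the PR's final code):

```python
def _check_kwargs(kwargs):
    # Reject any keyword other than 'value'/'subset', so a single unknown
    # keyword such as foo='bar' is caught as well.
    unexpected = set(kwargs) - {"value", "subset"}
    if unexpected:
        raise TypeError(
            "Unexpected keyword arguments: %s" % ", ".join(sorted(unexpected)))
```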
I think that behavior is shipped in 2.2, right? Then we may need to add a note in the migration guide.
Yup, sounds good.
python/pyspark/sql/tests.py (Outdated)

```diff
@@ -2175,7 +2175,7 @@ def test_replace(self):

         # replace with subset specified by a string of a column name w/ actual change
         row = self.spark.createDataFrame(
-            [(u'Alice', 10, 80.1)], schema).replace(10, 20, subset='age').first()
+            [(u'Alice', 10, 80.1)], schema).replace(10, 'age', value=20).first()
```
Will this conflict with the convention for function arguments in Python? Usually, arguments before a keyword argument are resolved by position, but now `'age'` is resolved to `subset`, which is the third argument, behind `value`. Since the function signature is changed, this may not be a big issue.
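For context, a rough sketch of a `*args`/`**kwargs` dispatch with the behavior described above (a simplification for illustration, not the PR's exact code):

```python
def replace(to_replace, *args, **kwargs):
    # Positional args fill whichever of 'value'/'subset' was NOT passed by
    # keyword, in order. This is what lets replace(10, 'age', value=20)
    # resolve 'age' to 'subset', bypassing normal positional binding.
    free = [name for name in ("value", "subset") if name not in kwargs]
    kwargs.update(zip(free, args))
    return to_replace, kwargs.get("value"), kwargs.get("subset")

print(replace(10, 'age', value=20))  # (10, 20, 'age')
```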
python/pyspark/sql/dataframe.py (Outdated)

```python
        # It deals with a problem when 'value' is set to None and 'to_replace' is a dictionary.
        # Validate if arguments are missing or not.
        is_more_than_two = len(args) + len(kwargs) > 2
```
I read this a few times and still feel that it is kind of verbose. But it seems there is no better way to check whether an optional parameter is set in Python.
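Piecing the scattered hunks above together, the missing/extra-argument validation is roughly of this shape (a reconstruction for readability; the error-message wording is a guess, not the verbatim PR code):

```python
def replace(self, to_replace, *args, **kwargs):
    # 'value' and 'subset' are the only extra parameters allowed.
    is_more_than_two = len(args) + len(kwargs) > 2
    is_unexpected_kwargs = \
        len(kwargs) == 2 and ("value" not in kwargs or "subset" not in kwargs)
    if is_more_than_two or is_unexpected_kwargs:
        raise TypeError(
            "replace() takes at most 3 arguments but got "
            "%s arguments." % len([to_replace] + list(args) + list(kwargs.values())))
```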
Seems RC3 is about to be cut; do we want to get this in 2.3?
Is it a bug fix or a new feature?
I think it's a bug fix. For the context,
Looks like an existing issue since Spark 2.2, I don't think this should block 2.3.
python/pyspark/sql/dataframe.py (Outdated)

```diff
@@ -1532,7 +1532,7 @@ def fillna(self, value, subset=None):
         return DataFrame(self._jdf.na().fill(value, self._jseq(subset)), self.sql_ctx)

     @since(1.4)
-    def replace(self, to_replace, value=None, subset=None):
+    def replace(self, to_replace, *args, **kwargs):
```
What's the expectation? If `to_replace` is a dict, `value` should be ignored?
Yes, if `value` is explicitly given, I thought we'd ignore `value`, as we have done from the start.
I see the problem now. If `to_replace` is a dict, then `value` should be ignored and we should provide a default value. If `to_replace` is not a dict, then `value` is required and we should not provide a default value.

Can we use an invalid value as the default value for `value`? Then we can throw an exception if `value` is not set by the user.
Yea, I think that summarises the issue.

> Can we use an invalid value as the default value for value? Then we can throw exception if the value is not set by user.

Yea, we could define a class / instance to indicate no value like NumPy does -
https://github.com/numpy/numpy/blob/master/numpy/_globals.py#L76. I was thinking of something resembling this too, but it is kind of a new approach for Spark, and this is a single case so far.

To get to the point, yea, we could maybe use an invalid value and unset/ignore it if `to_replace` is a dictionary. For example, I can assign `{}`. But then the problem is the docstring produced by pydoc and the API documentation. It will show something like:

```
Help on method replace in module pyspark.sql.dataframe:

replace(self, to_replace, value={}, subset=None) method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` replacing a value with another value.
    ...
```

This is pretty confusing. To my knowledge, we can't really override this signature in the doc - I tried a few times before and failed, if I remember correctly.

Maybe this is good enough, but I didn't want to start with that approach, because the issue @rxin raised sounds like it is, strictly speaking, about having a default value at all.

To be honest, it seems Pandas's `replace` also has `None` as the default value -
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html#pandas.DataFrame.replace.
So, to cut it short, yea, if a less pretty doc is fine, I can try. That would reduce the change a lot.
What's the docstring for `def replace(self, to_replace, *args, **kwargs)`?
It's just as is:

```
replace(self, to_replace, *args, **kwargs)
```

but this is better than `replace(self, to_replace, value={}, subset=None)` IMHO.
I prefer `def replace(self, to_replace, value=_NoValue, subset=None)`.

`def replace(self, to_replace, *args, **kwargs)` loses the information about `value` and `subset`.
Yea, either way works for me. Let me look around this a bit more and give it a shot to show how it looks.
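For reference, a minimal sketch of the NumPy-style sentinel being discussed (the `_NoValue` name follows the thread; this is illustrative, not the final committed code):

```python
class _NoValueType(object):
    """Singleton sentinel meaning 'the caller did not pass this argument'.

    Unlike None, it can never collide with a legitimate user-supplied value.
    """
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = object.__new__(cls)
        return cls._instance

    def __repr__(self):
        return "<no value>"

_NoValue = _NoValueType()

def replace(to_replace, value=_NoValue, subset=None):
    # 'value' may be omitted only when 'to_replace' is a dictionary.
    if value is _NoValue:
        if isinstance(to_replace, dict):
            value = None  # ignored when 'to_replace' is a dict
        else:
            raise TypeError("value is required when to_replace is not a dictionary.")
    return to_replace, value, subset
```

This keeps the keyword signature (so `value` and `subset` stay visible in pydoc) while still distinguishing "not passed" from an explicit `None`.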
Sure, let me unset the target version.
I'd fix this in 2.3, and 2.2.1 as well. It's just bad API design for 2.2.
Will update this tonight.
(force-pushed from 1849f59 to 9f49b05)
(force-pushed from 9f49b05 to a349d07)
LGTM, waiting for more feedback.
LGTM
…ce when 'to_replace' is not a dictionary

## What changes were proposed in this pull request?

This PR proposes to disallow default value None when 'to_replace' is not a dictionary.

It seems weird we set the default value of `value` to `None` and we ended up allowing the case as below:

```python
>>> df.show()
```
```
+----+------+-----+
| age|height| name|
+----+------+-----+
|  10|    80|Alice|
...
```
```python
>>> df.na.replace('Alice').show()
```
```
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```

**After**

This PR targets to disallow the case above:

```python
>>> df.na.replace('Alice').show()
```
```
...
TypeError: value is required when to_replace is not a dictionary.
```

while we still allow when `to_replace` is a dictionary:

```python
>>> df.na.replace({'Alice': None}).show()
```
```
+----+------+----+
| age|height|name|
+----+------+----+
|  10|    80|null|
...
```

## How was this patch tested?

Manually tested, tests were added in `python/pyspark/sql/tests.py` and doctests were fixed.

Author: hyukjinkwon <[email protected]>

Closes #20499 from HyukjinKwon/SPARK-19454-followup.

(cherry picked from commit 4b4ee26)
Signed-off-by: Wenchen Fan <[email protected]>
thanks, merging to master/2.3! Can you send a new PR for 2.2? it conflicts...
Yup, I should fix the guide for 2.2 anyway :-) Will open a backport tonight KST.