[SPARK-19454][PYTHON][SQL] DataFrame.replace improvements #16793
Conversation
Test build #72318 has finished for PR 16793 at commit
Test build #72319 has finished for PR 16793 at commit
Test build #72390 has finished for PR 16793 at commit
Test build #72392 has finished for PR 16793 at commit
Test build #72389 has finished for PR 16793 at commit
Test build #72393 has finished for PR 16793 at commit
All the new tests were intimidating at first, but then I realized that they were all testing for existing functionality. Are they all necessary?
Fixing the `value` API wart looks good to me. I can take a closer look at the reworked flow and `all_of_` logic if necessary.
# replace with dict
row = self.spark.createDataFrame(
    [(u'Alice', 10, 80.1)], schema).replace({10: 11}).first()
self.assertTupleEqual(row, (u'Alice', 11, 80.1))
This is the only test of "new" functionality (excluding error cases), correct?
These tests are mostly a side effect of discussions related to #16792. Right now test coverage is low and we depend on a certain behavior of Py4j and the Scala counterpart. Also, I wanted to be sure that all the expected types are still accepted after the changes I've made.
So maybe not necessary, but I will argue it is a good idea to have these.
I think (and I could be wrong) that @nchammas was suggesting it might make sense to have some more tests with dict, not that the other additional new tests are bad.
cc @holdenk
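For context on what a dict-based test exercises: the dict form maps each matched value to its replacement independently, leaving everything else untouched. A minimal pure-Python model of that semantics (an illustration only, not Spark's implementation) is:

```python
def replace_with_dict(row, mapping):
    """Return a new tuple with every value that appears as a key in
    `mapping` replaced by its mapped value; other values pass through."""
    return tuple(mapping.get(v, v) for v in row)

row = (u'Alice', 10, 80.1)
print(replace_with_dict(row, {10: 11}))  # -> ('Alice', 11, 80.1)
```

Extra tests with dicts would then cover multi-key mappings, keys absent from the data, and overlapping key/value pairs.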
Thanks for working on this! :) I've done a quick first pass and I've got a few questions/comments - let me know what you think and I'll follow up with a more thorough read through :)
@@ -1307,43 +1307,66 @@ def replace(self, to_replace, value, subset=None):
|null| null|null|
+----+------+----+
"""
if not isinstance(to_replace, (float, int, long, basestring, list, tuple, dict)):
# Helper functions
def all_of(types):
Maybe give this a docstring to clarify what `all_of` does; even though it's not user-facing, it's better to have a docstring than not.
python/pyspark/sql/dataframe.py
Outdated
    subset = [subset]

if not isinstance(subset, (list, tuple)):
    raise ValueError("subset should be a list or tuple of column names")
# Check if we won't pass mixed type generics
This reads a bit awkwardly. How about "Verify we were not passed in mixed type generics."?
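The check being discussed rejects heterogeneous replacement lists. A standalone sketch of such a validation (illustrative names and logic, not the PR's exact code):

```python
def check_not_mixed(values):
    """Raise ValueError when `values` mixes numeric and string elements,
    since replace() maps one value type onto another homogeneously."""
    has_numeric = any(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in values)
    has_string = any(isinstance(v, str) for v in values)
    if has_numeric and has_string:
        raise ValueError("Mixed type replacements are not supported")

check_not_mixed([1, 2, 3])   # homogeneous: passes silently
check_not_mixed(["a", "b"])  # homogeneous: passes silently
```

Calling it with something like `[1, "a"]` would raise the `ValueError`.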
# replace with dict
row = self.spark.createDataFrame(
    [(u'Alice', 10, 80.1)], schema).replace({10: 11}).first()
self.assertTupleEqual(row, (u'Alice', 11, 80.1))
I think (and I could be wrong) that @nchammas was suggesting it might make sense to have some more tests with dict, not that the other additional new tests are bad.
I am like Python - you have to be explicit :) I'll try to figure out some useful tests and get back to you. Thanks for the feedback @holdenk, @nchammas
Jenkins, retest this please
Test build #73517 has finished for PR 16793 at commit
Test build #73524 has finished for PR 16793 at commit
python/pyspark/sql/dataframe.py
Outdated
if not isinstance(to_replace, (float, int, long, basestring, list, tuple, dict)):
# Helper functions
def all_of(types):
    """Given a type or tuple of types
The formatting of this docstring seems odd here. Also, I'd clarify that all_of returns a function which you can use for the check, rather than it doing the check itself.
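A sketch of the helper in question: `all_of` takes a type (or tuple of types) and returns a predicate, so the check only happens when the returned function is applied. The name comes from the PR; the body and docstring here are an illustrative reconstruction:

```python
def all_of(types):
    """Given a type or tuple of types, return a function that checks
    whether every element of an iterable is an instance of `types`.

    Note: this returns a checker; it does not perform the check itself.
    """
    def all_of_(xs):
        return all(isinstance(x, types) for x in xs)
    return all_of_

all_of_strings = all_of(str)
print(all_of_strings(["a", "b"]))  # True
print(all_of_strings(["a", 1]))    # False
```

Building the predicate once lets the caller reuse it across several argument checks.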
python/pyspark/sql/dataframe.py
Outdated
rep_dict = to_replace
if value is not None:
    warnings.warn("to_replace is a dict, but value is not None. "
Does this need to be split?
Maybe not.
python/pyspark/sql/dataframe.py
Outdated
if not isinstance(value, (float, int, long, basestring, list, tuple)):
    raise ValueError("value should be a float, int, long, string, list, or tuple")
if (not isinstance(value, valid_types) and
This seems like a weird split.
Test build #74146 has finished for PR 16793 at commit
Test build #74153 has finished for PR 16793 at commit
Jenkins retest this please (47b2f68).
Test build #74163 has finished for PR 16793 at commit
@holdenk Do you think it is realistic to see this merged into 2.2?
Let me try and take a look tonight. It seems like there are some small formatting issues still at a quick glance, but I feel like this should be close.
LGTM
Merged to master
Thanks @holdenk
Sorry, I object to this change. Why would we put null as the default replace value, in a function called replace? That seems very counterintuitive and error-prone.
Also, the implementation doesn't match what was proposed in https://issues.apache.org/jira/browse/SPARK-19454. Having a null value as the default in a function called replace is too risky and error-prone.
I think the actual root cause is that we happen to allow a dictionary here. So, do you prefer to have?

def replace(self, to_replace, value, subset=None):
... but in this case, we should do as below if
Otherwise, please give me a few days .. let me give it a shot with
…na.replace in PySpark"

This reverts commit 0fcde87. See the discussion in [SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658), [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and apache#16793

Author: hyukjinkwon <[email protected]>

Closes apache#20496 from HyukjinKwon/revert-SPARK-21658.

…na.replace in PySpark"

This reverts commit 0fcde87. See the discussion in [SPARK-21658](https://issues.apache.org/jira/browse/SPARK-21658), [SPARK-19454](https://issues.apache.org/jira/browse/SPARK-19454) and #16793

Author: hyukjinkwon <[email protected]>

Closes #20496 from HyukjinKwon/revert-SPARK-21658.

(cherry picked from commit 551dff2)

Signed-off-by: gatorsmile <[email protected]>
What changes were proposed in this pull request?

`value` argument if `to_replace` is a `dict`:

How was this patch tested?

Existing unit tests, additional unit tests, manual testing.
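As a sketch of the behavior the description implies (a pure-Python model, not Spark's code): a sentinel default lets a function tell "argument omitted" apart from an explicit None, so `value` can be skipped when `to_replace` is a dict that already carries the replacements:

```python
_NO_VALUE = object()  # sentinel: distinguishes "not passed" from None

def replace(to_replace, value=_NO_VALUE):
    """Return the replacement mapping this call describes.

    When `to_replace` is a dict it already carries the replacements,
    so `value` may be omitted; otherwise `value` is required.
    """
    if isinstance(to_replace, dict):
        return to_replace
    if value is _NO_VALUE:
        raise TypeError("value is required unless to_replace is a dict")
    return {to_replace: value}

print(replace({10: 11}))  # {10: 11}
print(replace(10, 11))    # {10: 11}
```

This is one way to avoid the objection above: no silent null default, and omitting `value` without a dict is an explicit error.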