Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MINOR][DOCS][PYTHON] Adding missing boolean type for replacement value in fillna #17688

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion python/pyspark/sql/dataframe.py
Original file line number Diff line number Diff line change
Expand Up @@ -1238,7 +1238,7 @@ def fillna(self, value, subset=None):
Value to replace null values with.
If the value is a dict, then `subset` is ignored and `value` must be a mapping
from column name (string) to replacement value. The replacement value must be
an int, long, float, or string.
an int, long, float, boolean, or string.
Copy link
Member

@HyukjinKwon HyukjinKwon Apr 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this indicates the replacement If the value is a dict whereas param value can't be a bool as below:

>>> from pyspark.sql import Row
>>> spark.createDataFrame([Row(a=None), Row(a=True)]).fillna({"a": True}).first()
Row(a=True)
>>> spark.createDataFrame([Row(a=None), Row(a=True)]).fillna(True).first()
Row(a=None)

I can't find def fill(value: Boolean) in functions.scala. Namely, (I guess) this will call it with int (a parent type). So,

>>> spark.createDataFrame([Row(a=None), Row(a=0)]).fillna(True).first()
Row(a=1)
>>> spark.createDataFrame([Row(a=None), Row(a=0)]).fillna(False).first()
Row(a=0)

So, the current status looks correct to me.

BTW, probably, we should throw an exception in

if not isinstance(value, (float, int, long, basestring, dict)):
    raise ValueError("value should be a float, int, long, string, or dict")

in Python boolean is a int - https://www.python.org/dev/peps/pep-0285/

  1. Should bool inherit from int?

=> Yes.

>>> isinstance(True, int)
True

However, this looks just a documentation fix and I guess there are many instances similar with it. I think it is fine with not fixing it here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I'd say that the eventual improvement would maybe be offering fill for bool for symetry with the rest of the types but its not necessary here rather than type checking for bool on the input.

:param subset: optional list of column names to consider.
Columns specified in subset that do not have matching data type are ignored.
For example, if `value` is a string, and subset contains a non-string column,
Expand Down
4 changes: 4 additions & 0 deletions python/pyspark/sql/tests.py
Original file line number Diff line number Diff line change
Expand Up @@ -1711,6 +1711,10 @@ def test_fillna(self):
self.assertEqual(row.age, None)
self.assertEqual(row.height, None)

# fillna with dictionary for boolean types
row = self.spark.createDataFrame([Row(a=None), Row(a=True)]).fillna({"a": True}).first()
self.assertEqual(row.a, True)

def test_bitwise_operations(self):
from pyspark.sql import functions
row = Row(a=170, b=75)
Expand Down