
Basic Index support. #21

Merged
merged 15 commits into from Feb 22, 2019

Conversation

ueshin
Collaborator

@ueshin ueshin commented Feb 4, 2019

This PR introduces metadata to manage index information and supports basic index functions.

  • SparkSession: from_pandas for creating DataFrame with index metadata.
  • DataFrame: set_index and reset_index.
  • Column: reset_index.

The basic idea comes from the doc attached in #9 and from pyarrow's implementation for round-tripping to pandas.

Closes #9.

@ueshin ueshin changed the title Basic Index support. [WIP] Basic Index support. Feb 4, 2019
@ueshin ueshin changed the title [WIP] Basic Index support. Basic Index support. Feb 8, 2019
@thunterdb
Contributor

I have a pending review, will complete it later this week. Great work!

Contributor

@thunterdb thunterdb left a comment

Sorry for the delay @ueshin. This is a very powerful PR. I have a few comments, but I will prioritize the review of this PR; it adds a lot of welcome functionality.

pandorable_sparky/groups.py
Manages column names and index information
"""

def __init__(self, columns, index_info=[]):
Contributor

Given the assertions, you should add some documentation about what these arguments should be. Also, how about column_fields instead of columns? The latter is usually used for spark.sql.Column.

Also, does it work with sub-fields?

Contributor

I am not sure from looking at this code what index_info is supposed to contain

Contributor

Also, don't use [] as a default (IntelliJ also warns about this); use None and then later index_info = index_info or []
https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments

Collaborator Author

Thanks! good to know about the mutable default arguments.

Also, does it work with sub-fields?

What do you mean? Could you elaborate?

pandorable_sparky/metadata.py Outdated
return self._index_info

@property
def _index_columns(self):
Contributor

You are accessing these properties outside the class, so they should not be private. Just call them index_columns.

Collaborator Author

Made it index_fields to follow column_fields.


class Metadata(object):
"""
Manages column names and index information
Contributor

We need to add more information here about how it works and what it modifies in spark.

From reading the code in this file, I am a bit confused how it works, documentation would be very good.

rename = lambda i: 'level_{}'.format(i)
else:
rename = lambda i: \
'index' if 'index' not in self._metadata.columns else 'level_{}'.format(i)
Contributor

same thing here
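For context, the rename lambdas quoted above mirror pandas' own naming rules for reset_index on an unnamed index; a small sketch of that behavior in plain pandas:

```python
import pandas as pd

# An unnamed index becomes a column called 'index'...
df = pd.DataFrame({'x': [1, 2]})
cols = df.reset_index().columns.tolist()    # ['index', 'x']

# ...unless a column named 'index' already exists, in which case pandas
# falls back to 'level_0' (and 'level_{i}' for multi-level indexes).
df2 = pd.DataFrame({'index': [1, 2]})
cols2 = df2.reset_index().columns.tolist()  # ['level_0', 'index']
```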

pandorable_sparky/structures.py Outdated
return df

def reset_index(self, level=None, drop=False, inplace=False):
"""For DataFrame with multi-level index, return new DataFrame with labeling information in
Contributor

Pandas somehow keeps the names of the columns when we set the index. Can we do the same?

For example, you have in pandas:

df = pd.DataFrame({"x":[1], "y":[2]})
df.set_index("x").reset_index() == df

Collaborator Author

Yes,

>>> pdf = pd.DataFrame({"x": [1], "y": [2]})
>>> df = spark.from_pandas(pdf)
>>> df.set_index("x").reset_index().toPandas().equals(df.toPandas())
True

pandorable_sparky/structures.py
@@ -104,6 +103,7 @@ def _wrap_functions():
if isinstance(oldfun, types.FunctionType):
fun = wrap_column_function(oldfun)
setattr(F, fname, fun)
setattr(F, '_spark_' + fname, oldfun)
Contributor

why do we need that one?

Collaborator Author

It is for internal use, to avoid the check that columns are anchored.

Contributor

@thunterdb thunterdb left a comment

@ueshin thanks a lot for this PR, I am going to merge it.

index = pdf.index
if isinstance(index, pd.MultiIndex):
if index.names is None:
index_info = [('__index_level_{}__'.format(i), None)
Contributor

ok
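For reference, the __index_level_{}__ names in the snippet follow pyarrow's convention for serializing unnamed pandas index levels. A hedged stand-in for the naming logic (illustrative only, not the actual implementation):

```python
import pandas as pd

def default_index_info(pdf):
    """Illustrative: pair each index level with a storage name,
    falling back to pyarrow-style names for unnamed levels."""
    index = pdf.index
    if isinstance(index, pd.MultiIndex):
        return [(name if name is not None else '__index_level_{}__'.format(i), name)
                for i, name in enumerate(index.names)]
    name = index.name
    return [(name if name is not None else '__index_level_0__', name)]

info = default_index_info(pd.DataFrame({'x': [1]}))
# info == [('__index_level_0__', None)] for the default unnamed RangeIndex
```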

@thunterdb thunterdb merged commit a3e456e into databricks:master Feb 22, 2019
@ueshin ueshin deleted the indexing branch February 25, 2019 04:28
HyukjinKwon pushed a commit that referenced this pull request Jul 31, 2020
`ks.Series.hasnans` and `ks.Index.hasnans` do not work properly for non-DoubleType columns.


```python
>>> ks.Series([True, True, np.nan]).hasnans
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'isnan(`0`)' due to data type mismatch: argument 1 requires (double or float) type, however, '`0`' is of boolean type.;;
'Aggregate [max((isnull(0#12) OR isnan(0#12))) AS max(((0 IS NULL) OR isnan(0)))#21]
+- Project [__index_level_0__#11L, 0#12, monotonically_increasing_id() AS __natural_order__#15L]
   +- LogicalRDD [__index_level_0__#11L, 0#12], false
```

This PR fixed it.
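A hedged sketch of the idea behind the fix, using plain pandas/NumPy rather than the actual Koalas internals: apply isnan only to floating-point data and fall back to a null check for other types:

```python
import numpy as np
import pandas as pd

def hasnans(series):
    """Illustrative stand-in: type-aware NaN/null check."""
    if np.issubdtype(series.dtype, np.floating):
        # isnan is only defined for float/double data.
        return bool(series.isna().any() or np.isnan(series.to_numpy()).any())
    # For boolean, integer, or object data, only a null check is defined;
    # calling isnan here would fail, like the AnalysisException above.
    return bool(series.isna().any())

hasnans(pd.Series([True, True, None]))  # True, with no type error
hasnans(pd.Series([1.0, np.nan]))       # True
```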