Basic Index support. #21
Conversation
I have a pending review; I will complete it later this week. Great work!
Sorry for the delay @ueshin. This is a very powerful PR. I have a few comments, but I will prioritize the review of this PR; it adds a lot of welcome functionality.
pandorable_sparky/metadata.py
Outdated
    Manages column names and index information
    """

    def __init__(self, columns, index_info=[]):
Given the assertions, you should add some documentation about what these arguments should be. Also, how about `column_fields` instead of `columns`? The latter is usually used for `spark.sql.Column`.
Also, does it work with sub-fields?
I am not sure, from looking at this code, what `index_info` is supposed to contain.
Also, don't use `[]` as a default (it also triggers a warning in IntelliJ); use `None` and then later `index_info = index_info or []`. See https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments
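As a quick illustration of the gotcha behind that link (a standalone sketch, not code from this PR): the `[]` default is created once at definition time and shared across every call, while the `None` pattern builds a fresh list per call.

```python
# Standalone illustration of the mutable-default-argument gotcha; the function
# names here are made up for the example and do not exist in this PR.
def append_bad(item, items=[]):
    items.append(item)      # mutates the single shared default list
    return items

def append_good(item, items=None):
    items = items or []     # fresh list whenever no list is passed in
    items.append(item)
    return items

print(append_bad(1))   # [1]
print(append_bad(2))   # [1, 2]  <- surprising: the default list was reused
print(append_good(1))  # [1]
print(append_good(2))  # [2]
```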
Thanks! Good to know about the mutable default arguments.

> Also, does it work with sub-fields?

What do you mean? Could you elaborate?
pandorable_sparky/metadata.py
Outdated
        return self._index_info

    @property
    def _index_columns(self):
You are accessing these properties outside the class, so they should not be private. Just call them `index_columns`.
Made it `index_fields` to follow `column_fields`.
class Metadata(object):
    """
    Manages column names and index information
We need to add more information here about how it works and what it modifies in Spark. From reading the code in this file, I am a bit confused about how it works; documentation would be very good.
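For illustration, here is a hypothetical sketch of the kind of docstring this comment is asking for; the argument names and semantics are inferred from the discussion and snippets in this review thread, not taken from the final implementation.

```python
# Hypothetical documentation sketch; names and semantics are inferred from this
# review thread (column_fields, index_info pairs like ('__index_level_0__', None)).
class Metadata(object):
    """
    Manages column names and index information of a pandorable_sparky DataFrame.

    :param column_fields: list of Spark column names backing the data columns.
    :param index_info: list of (field, name) pairs, where ``field`` is the Spark
        column backing an index level (e.g. '__index_level_0__') and ``name`` is
        the pandas index name for that level (None if the level is unnamed).
    """

    def __init__(self, column_fields, index_info=None):
        assert all(isinstance(col, str) for col in column_fields)
        self._column_fields = column_fields
        self._index_info = index_info or []
```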
pandorable_sparky/structures.py
Outdated
            rename = lambda i: 'level_{}'.format(i)
        else:
            rename = lambda i: \
                'index' if 'index' not in self._metadata.columns else 'level_{}'.format(i)
same thing here
        return df

    def reset_index(self, level=None, drop=False, inplace=False):
        """For DataFrame with multi-level index, return new DataFrame with labeling information in
Pandas somehow keeps the names of the columns when we set the index. Can we do the same?
For example, in pandas you have:
df = pd.DataFrame({"x":[1], "y":[2]})
df.set_index("x").reset_index() == df
Yes,
>>> pdf = pd.DataFrame({"x": [1], "y": [2]})
>>> df = spark.from_pandas(pdf)
>>> df.set_index("x").reset_index().toPandas().equals(df.toPandas())
True
@@ -104,6 +103,7 @@ def _wrap_functions():
         if isinstance(oldfun, types.FunctionType):
             fun = wrap_column_function(oldfun)
             setattr(F, fname, fun)
+            setattr(F, '_spark_' + fname, oldfun)
why do we need that one?
It is for internal use, to avoid the check that columns are anchored.
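For context, here is a rough sketch of the pattern shown in the hunk above, based only on that diff and this reply; the wrapper body is illustrative, not the PR's actual implementation.

```python
# Illustrative sketch: wrap the public pyspark.sql.functions entries, but keep
# the unwrapped originals under a '_spark_' prefix for internal use.
import types
from pyspark.sql import functions as F

def wrap_column_function(oldfun):
    def fun(*args, **kwargs):
        # The real wrapper in this PR would check here that Column arguments
        # are anchored before delegating to the original function.
        return oldfun(*args, **kwargs)
    return fun

def _wrap_functions():
    for fname in list(dir(F)):
        oldfun = getattr(F, fname)
        if isinstance(oldfun, types.FunctionType):
            fun = wrap_column_function(oldfun)
            setattr(F, fname, fun)
            # Internal code can call the '_spark_'-prefixed original to skip
            # the anchoring checks.
            setattr(F, '_spark_' + fname, oldfun)
```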
@ueshin thanks a lot for this PR, I am going to merge it.
index = pdf.index
if isinstance(index, pd.MultiIndex):
    if index.names is None:
        index_info = [('__index_level_{}__'.format(i), None)
ok
`ks.Series.hasnans` and `ks.Index.hasnans` seem not to work properly for non-DoubleType.

```python
>>> ks.Series([True, True, np.nan]).hasnans
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'isnan(`0`)' due to data type mismatch: argument 1 requires (double or float) type, however, '`0`' is of boolean type.;;
'Aggregate [max((isnull(0#12) OR isnan(0#12))) AS max(((0 IS NULL) OR isnan(0)))#21]
+- Project [__index_level_0__#11L, 0#12, monotonically_increasing_id() AS __natural_order__#15L]
   +- LogicalRDD [__index_level_0__#11L, 0#12], false
```

This PR fixed it.
This PR introduces a metadata to manage index info and support basic index functions.

- `SparkSession`: `from_pandas` for creating `DataFrame` with index metadata.
- `DataFrame`: `set_index` and `reset_index`.
- `Column`: `reset_index`.

The basic idea is from the doc attached in #9, and from pyarrow's implementation of the roundtrip to pandas.
Closes #9.