
Basic Index support. #21

Merged
merged 15 commits into from Feb 22, 2019

Conversation

ueshin
Collaborator

@ueshin ueshin commented Feb 4, 2019

This PR introduces metadata to manage index information and supports basic index functions.

  • SparkSession: from_pandas for creating DataFrame with index metadata.
  • DataFrame: set_index and reset_index.
  • Column: reset_index.

The basic idea comes from the doc attached in #9 and from pyarrow's implementation for round-tripping to pandas.

Closes #9.

@ueshin ueshin changed the title Basic Index support. [WIP] Basic Index support. Feb 4, 2019
@ueshin ueshin changed the title [WIP] Basic Index support. Basic Index support. Feb 8, 2019
@thunterdb
Contributor

I have a pending review, will complete it later this week. Great work!

Contributor

@thunterdb thunterdb left a comment

Sorry for the delay @ueshin. This is a very powerful PR. I have a few comments, but I will prioritize the review of this PR; it adds a lot of welcome functionality.

pandorable_sparky/groups.py
Manages column names and index information
"""

def __init__(self, columns, index_info=[]):
Contributor

Given the assertions, you should add some documentation about what these arguments should be. Also, how about column_fields instead of columns? The latter is usually used for spark.sql.Column.

Also, does it work with sub-fields?

Contributor

I am not sure from looking at this code what index_info is supposed to contain

Contributor

Also, don't use [] as a default (IntelliJ also warns about this); use None and then later index_info = index_info or []
https://docs.python-guide.org/writing/gotchas/#mutable-default-arguments

Collaborator Author

Thanks! good to know about the mutable default arguments.

Also, does it work with sub-fields?

What do you mean? Could you elaborate?

pandorable_sparky/metadata.py Outdated
return self._index_info

@property
def _index_columns(self):
Contributor

You are accessing these properties outside the class, so they should not be private. Just call them index_columns.

Collaborator Author

Made it index_fields to follow column_fields.


class Metadata(object):
"""
Manages column names and index information
Contributor

We need to add more information here about how it works and what it modifies in spark.

From reading the code in this file, I am a bit confused how it works, documentation would be very good.

rename = lambda i: 'level_{}'.format(i)
else:
rename = lambda i: \
'index' if 'index' not in self._metadata.columns else 'level_{}'.format(i)
Contributor

same thing here
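For context, the rename lambdas quoted above mirror pandas' own naming rules for reset_index on an unnamed index; a small sketch of that behavior in plain pandas:

```python
import pandas as pd

# An unnamed index becomes a column called 'index'...
df = pd.DataFrame({'x': [1, 2]})
cols = df.reset_index().columns.tolist()    # ['index', 'x']

# ...unless a column named 'index' already exists, in which case pandas
# falls back to 'level_0' (and 'level_{i}' for multi-level indexes).
df2 = pd.DataFrame({'index': [1, 2]})
cols2 = df2.reset_index().columns.tolist()  # ['level_0', 'index']
```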

pandorable_sparky/structures.py Outdated
return df

def reset_index(self, level=None, drop=False, inplace=False):
"""For DataFrame with multi-level index, return new DataFrame with labeling information in
Contributor

Pandas somehow keeps the names of the columns when we set the index. Can we do the same?

For example, you have in pandas:

df = pd.DataFrame({"x":[1], "y":[2]})
df.set_index("x").reset_index() == df

Collaborator Author

Yes,

>>> pdf = pd.DataFrame({"x": [1], "y": [2]})
>>> df = spark.from_pandas(pdf)
>>> df.set_index("x").reset_index().toPandas().equals(df.toPandas())
True

pandorable_sparky/structures.py
@@ -104,6 +103,7 @@ def _wrap_functions():
if isinstance(oldfun, types.FunctionType):
fun = wrap_column_function(oldfun)
setattr(F, fname, fun)
setattr(F, '_spark_' + fname, oldfun)
Contributor

why do we need that one?

Collaborator Author

It is for internal use, to avoid the check that columns are anchored.

Contributor

@thunterdb thunterdb left a comment

@ueshin thanks a lot for this PR, I am going to merge it.

index = pdf.index
if isinstance(index, pd.MultiIndex):
if index.names is None:
index_info = [('__index_level_{}__'.format(i), None)
Contributor

ok
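For reference, the __index_level_{}__ names in the snippet follow pyarrow's convention for serializing unnamed pandas index levels. A hedged stand-in for the naming logic (illustrative only, not the actual implementation):

```python
import pandas as pd

def default_index_info(pdf):
    """Illustrative: pair each index level with a storage name,
    falling back to pyarrow-style names for unnamed levels."""
    index = pdf.index
    if isinstance(index, pd.MultiIndex):
        return [(name if name is not None else '__index_level_{}__'.format(i), name)
                for i, name in enumerate(index.names)]
    name = index.name
    return [(name if name is not None else '__index_level_0__', name)]

info = default_index_info(pd.DataFrame({'x': [1]}))
# info == [('__index_level_0__', None)] for the default unnamed RangeIndex
```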

@thunterdb thunterdb merged commit a3e456e into databricks:master Feb 22, 2019
@ueshin ueshin deleted the indexing branch February 25, 2019 04:28
HyukjinKwon pushed a commit that referenced this pull request Jul 31, 2020
`ks.Series.hasnans` and `ks.Index.hasnans` do not work properly for non-DoubleType columns.


```python
>>> ks.Series([True, True, np.nan]).hasnans
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'isnan(`0`)' due to data type mismatch: argument 1 requires (double or float) type, however, '`0`' is of boolean type.;;
'Aggregate [max((isnull(0#12) OR isnan(0#12))) AS max(((0 IS NULL) OR isnan(0)))#21]
+- Project [__index_level_0__#11L, 0#12, monotonically_increasing_id() AS __natural_order__#15L]
   +- LogicalRDD [__index_level_0__#11L, 0#12], false
```

This PR fixed it.
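A hedged sketch of the idea behind the fix, using plain pandas/NumPy rather than the actual Koalas internals: apply isnan only to floating-point data and fall back to a null check for other types:

```python
import numpy as np
import pandas as pd

def hasnans(series):
    """Illustrative stand-in: type-aware NaN/null check."""
    if np.issubdtype(series.dtype, np.floating):
        # isnan is only defined for float/double data.
        return bool(series.isna().any() or np.isnan(series.to_numpy()).any())
    # For boolean, integer, or object data, only a null check is defined;
    # calling isnan here would fail, like the AnalysisException above.
    return bool(series.isna().any())

hasnans(pd.Series([True, True, None]))  # True, with no type error
hasnans(pd.Series([1.0, np.nan]))       # True
```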