unique() has wrong return type #555

Closed
pwais opened this issue Jul 12, 2019 · 11 comments · Fixed by #836

Comments

pwais commented Jul 12, 2019

cc #233

Not sure why #249 decided to return a Series instead of a numpy array... ?

>>> import findspark
>>> findspark.init()
>>> import databricks.koalas as ks
>>> import pandas as pd

>>> findspark.__version__
'1.3.0'
>>> pd.__version__
'0.24.2'
>>> ks.__version__
'0.12.0'

>>> pdf = pd.DataFrame({'a': [1, 2]})
>>> kdf = ks.DataFrame(pdf)
... (spark output removed) ...
>>> kdf['a'].unique()
0    1                                                                          
1    2
Name: a, dtype: int64
>>> pdf['a'].unique()
array([1, 2])
>>> list(pdf['a'].unique())
[1, 2]
>>> list(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

>>> a = []
>>> a.extend(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

>>> import numpy as np
>>> np.__version__
'1.15.2'
>>> np.array(pdf['a'].unique())
array([1, 2])
>>> np.array(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

HyukjinKwon (Member) commented

> Not sure why #249 decided to return a Series instead of a numpy array... ?

This is because of https://github.com/databricks/koalas#guardrails-to-prevent-users-from-shooting-themselves-in-the-foot

pwais (Author) commented Jul 12, 2019

I guess you could claim that unique() may "materialize the entire working set in a single node's memory," but so could iloc (cc #556). I'm not so sure the current solution is a guardrail, though: what does that traceback even mean?

If RDD::toLocalIterator() were faster, you could probably use that in many cases and let the driver read as much as it can before it OOMs. The "guardrail" could then be a cap on how much the iterator reads (e.g. an exception after reading 2GB of data). Such a design would be more harmonious with how Spark uses explicit memory limits in the conf rather than software interface limits in the implementation.
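For illustration, a rough sketch of that kind of cap in plain pyspark (not koalas code; the 2GB limit and the sys.getsizeof() size estimate are only illustrative, and `sdf` stands for any pyspark DataFrame):

```py
import sys

def capped_collect(sdf, max_bytes=2 * 1024 ** 3):
    """Pull rows to the driver via toLocalIterator(), stopping at a byte cap."""
    rows, total = [], 0
    for row in sdf.rdd.toLocalIterator():
        total += sys.getsizeof(row)  # rough per-row size estimate
        if total > max_bytes:
            raise MemoryError(
                "Aborted local read after ~%d bytes; raise max_bytes to read more." % total)
        rows.append(row)
    return rows
```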

rxin (Contributor) commented Aug 5, 2019

The problem with returning a numpy array is that if somebody runs unique on say user id, that'd blow up the driver immediately.

pwais (Author) commented Aug 5, 2019

Hi @rxin! Thanks for your attention here. I've sadly had to dump koalas entirely due to this issue, #556, and others, so I hope this discussion results in actionable improvements to koalas.

So here's the root problem in my mind: koalas uses a facade for unique(), as introduced in #249. The outstanding problem is that the facade is far from feature-complete: the koalas Series does not fulfill basic functions of the numpy API, e.g. converting to a list. Such conversion could likely be facilitated with collect(), the safer toLocalIterator(), or another intermediate facade that provides a numpy-like API over a pyspark DataFrame. Neglecting this basic feature is not really a "guardrail", especially not with the error message shown above.

While I hear you, @rxin, on the driver memory limit problem, I would recommend basing a choice here on real usage (e.g. an evaluation of call sites in real programs). I bet you'd find there is an overwhelming number of cases (particularly in existing pandas programs) where it is indeed affordable to, e.g., realize all data locally via unique(). Moreover, now that pandas / pyarrow offer thorough support for larger-than-memory data, users will almost certainly begin to adapt to (and program around) the same sort of memory-limit issues that pyspark RDD::collect() and friends entail. (And unfortunately for them, when they're out of RAM and swap, there's no setting to simply raise the memory limit.) Perhaps we'll even see pandas release its own facade (instead of realized numpy arrays) in future versions. I'm sure Wes would consider it.

I believe that #249 shipped as-is because koalas today lacks substantial integration testing. Given the lofty goal of koalas, it would be desirable to have some full-blown existing pandas programs in the test suite. (Perhaps right now the focus has been on converting existing pyspark programs?) There are a variety of pandas-based ETL scripts in tensorflow/models that seem like viable candidates.

tensorflow/models is a great target because almost any ETL those people are using would almost certainly run ~8x faster on a modern multi-core machine with pyspark in local mode. That speed-up is critical for finding ETL errors faster; single-threaded ETL in machine learning is not a competitive practice today.

Pandas-Bokeh might also be a nice candidate, since today their pyspark-pandas interop is simply to read the entire pyspark DataFrame into memory: https://github.com/PatrikHlobil/Pandas-Bokeh/blob/master/pandas_bokeh/plot.py#L1740

Seeing how koalas performs on a test suite of real pandas programs would also be very useful for helping potential koalas users set expectations. My own use case is similar to the Pandas-Bokeh histogram example (https://github.com/PatrikHlobil/Pandas-Bokeh#histogram), except I really need pyspark for the expensive (and embarrassingly parallelizable) task of sampling examples for each bucket: https://github.com/pwais/au2018/blob/aba8936159c328777a2d9548443e927a78043bd2/au/plotting.py#L128

I mainly tried out koalas because I saw the video of you @rxin on stage with stuffed koalas everywhere.

rxin (Contributor) commented Aug 5, 2019

Thanks for the feedback. But you can convert from the facade to what you want by just calling to_numpy, can't you?
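
For example, continuing the session from the original report (a sketch only; this assumes to_numpy() is available on the koalas Series in that version, and the output shown is the expected one for the toy frame above):

```py
>>> kdf['a'].unique().to_numpy()   # explicitly collects the distinct values to the driver
array([1, 2])
```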

The problem with your argument (if in most cases the data would fit in memory, don't worry about it and just return a local series) is that then we should make all methods return a pandas DataFrame or Series, because in the vast majority of cases the data does fit in memory. So there's no point doing anything, since pandas would just work fine there.

rxin (Contributor) commented Aug 5, 2019

BTW, one alternative is to return a NumPy array by default, similar to pandas, and provide an option or a different method that returns a Series.
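
Concretely, that alternative could look something like the following (a hypothetical sketch, not current koalas behavior; the collect=False flag is made up purely for illustration):

```py
>>> kdf['a'].unique()                # would collect and return np.array([1, 2]), matching pandas
>>> kdf['a'].unique(collect=False)   # hypothetical option that keeps the result as a koalas Series
```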

pwais (Author) commented Aug 5, 2019

I didn't know to_numpy() was an option; it's certainly not apparent from the error message. I hear you on using it as a possible mitigation, but I thought the goal of koalas was to not require user changes to existing code. If my program has a large number of pandas operations and I have to add to_numpy() all over the place, I'm much more likely to just dump koalas and do something else. Moreover, if to_numpy() causes an OOM, then I've just wasted more time.

> then we should make all methods return a pandas DataFrame or Series, because in the vast majority of cases the data does fit in memory. So there's no point doing anything, since pandas would just work fine there.

And that's probably why people use pandas instead of koalas? Pandas-Bokeh is an apposite example where the author mostly punted on pyspark compatibility.

@rxin, your announcement on stage was that koalas exists because users don't like dealing with the pyspark DataFrame API, since it's different from pandas. At least those differences are explicit. In this case, koalas is lacking key things like implicit list conversion for the Series type. Your feedback here suggests that this deficiency is in line with expectations for koalas. I invite you to listen more closely to users.

rxin (Contributor) commented Aug 6, 2019

No, I hear you loud and clear, but you are pointing out a fundamental issue that we knew about when we started the project.

That is, pandas's API is not designed to scale at all. It simply has too many APIs that do sort-of-random things that even the pandas authors themselves regret adding. Those are incredibly powerful, though, and I don't think we will ever be able to support all of them with identical signatures.

unique is a good example here. unique returns a numpy array, which is a data structure that holds all of the data in memory. Implicit list conversion is another example: implicitly converting a large amount of data to a list in memory on the driver is also an anti-pattern. It makes it too easy for users to shoot themselves in the foot. This will unfortunately also create subtle differences between koalas and pandas, just like the ones that exist with all other "pandas on X" libraries.

I don't think I pitched koalas as running your pandas code unmodified? If I did, sorry about that. A better way to set expectations is that it reduces the friction when going from pandas to Spark, because most common functions have identical signatures.

This has been my largest fear with koalas. I'm never sure how much these differences matter in practice. But since the April release, feedback has been pretty positive, so we will continue pushing and reducing the gap. I do think there will always be pandas programs that don't work on koalas and require changes.

rxin (Contributor) commented Aug 6, 2019

One addition: we should at the very least improve the error message, so people will find out about to_numpy without reading all the docs.

smalory commented Sep 27, 2019

The way I see it, pandas made a very strange decision to return a numpy array in this case as opposed to a pandas Series. It feels inconsistent. I don't think this discussion would exist if pandas unique returned a pandas Series, since koalas unique returns a koalas Series and the consistency would hold.

Because of this, to have a truly consistent API with pandas, there would need to be a spark drop-in for numpy arrays.

I completely agree that the error message needs to be clarified. However, I think that returning a koalas Series, at least by default, is the right thing to do for consistency.

rxin (Contributor) commented Sep 27, 2019

Thanks for the comment @smalory.

@ueshin can we fix the error message?

HyukjinKwon pushed a commit that referenced this issue Sep 28, 2019
…es` and `Index`. (#836)

`DataFrame.__iter__()` returns an iterator of columns.

And explicitly disable `__iter__()` with a proper error message for `Series` and `Index`.

```py
>>> list(ks.Series([1,2,3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/series.py", line 3131, in __iter__
    return _MissingPandasLikeSeries.__iter__(self)
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/missing/__init__.py", line 24, in unsupported_function
    reason=reason)
databricks.koalas.exceptions.PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```

Resolves #555.