unique() has wrong return type #555

Closed
pwais opened this issue Jul 12, 2019 · 11 comments · Fixed by #836

Comments

pwais commented Jul 12, 2019

cc #233

Not sure why #249 decided to return a Series instead of a numpy array... ?

>>> import findspark
>>> findspark.init()
>>> import databricks.koalas as ks
>>> import pandas as pd

>>> findspark.__version__
'1.3.0'
>>> pd.__version__
'0.24.2'
>>> ks.__version__
'0.12.0'

>>> pdf = pd.DataFrame({'a': [1, 2]})
>>> kdf = ks.DataFrame(pdf)
... (spark output removed) ...
>>> kdf['a'].unique()
0    1                                                                          
1    2
Name: a, dtype: int64
>>> pdf['a'].unique()
array([1, 2])
>>> list(pdf['a'].unique())
[1, 2]
>>> list(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

>>> a = []
>>> a.extend(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

>>> import numpy as np
>>> np.__version__
'1.15.2'
>>> np.array(pdf['a'].unique())
array([1, 2])
>>> np.array(kdf['a'].unique())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 2407, in __getitem__
    return Series(self._scol.__getitem__(key), anchor=self._kdf, index=self._index_map)
  File "/usr/local/lib/python3.6/dist-packages/databricks/koalas/series.py", line 273, in __init__
    data=data, index=index, dtype=dtype, name=name, copy=copy, fastpath=fastpath)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 198, in __init__
    elif isinstance(data, (ABCSeries, ABCSparseSeries)):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/dtypes/generic.py", line 9, in _check
    return getattr(inst, attr, '_typ') in comp
  File "/opt/spark/python/pyspark/sql/column.py", line 682, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

HyukjinKwon (Member) commented

> Not sure why #249 decided to return a Series instead of a numpy array... ?

This is because of https://github.com/databricks/koalas#guardrails-to-prevent-users-from-shooting-themselves-in-the-foot

pwais (Author) commented Jul 12, 2019

I guess you could claim that unique() may "materialize the entire working set in a single node's memory," but so could iloc (cc #556). I'm not so sure the current solution is a guardrail, though: what does that traceback even mean?

If RDD::toLocalIterator() were faster, you could probably use that in many cases and let the driver read as much as it can before it OOMs. The "guardrail" could then be a cap on how much the iterator reads (e.g. an exception after reading 2GB of data). Such a design would be more harmonious with how Spark uses explicit memory limits in the conf rather than software interface limits in the implementation.
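For illustration, a rough sketch of that kind of cap in plain pyspark (not koalas code; the 2GB limit and the sys.getsizeof() size estimate are only illustrative, and `sdf` stands for any pyspark DataFrame):

```py
import sys

def capped_collect(sdf, max_bytes=2 * 1024 ** 3):
    """Pull rows to the driver via toLocalIterator(), stopping at a byte cap."""
    rows, total = [], 0
    for row in sdf.rdd.toLocalIterator():
        total += sys.getsizeof(row)  # rough per-row size estimate
        if total > max_bytes:
            raise MemoryError(
                "Aborted local read after ~%d bytes; raise max_bytes to read more." % total)
        rows.append(row)
    return rows
```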

rxin (Contributor) commented Aug 5, 2019

The problem with returning a numpy array is that if somebody runs unique on say user id, that'd blow up the driver immediately.

pwais (Author) commented Aug 5, 2019

Hi @rxin! Thanks for your attention here. I've sadly had to dump koalas entirely due to this issue, #556, and others, so I hope this discussion results in actionable improvements to koalas.

So here's the root problem in my mind: koalas uses a facade for unique(), as introduced in #249. The outstanding problem is that the facade is far from feature-complete: the koalas Series does not fulfill basic functions of the numpy API, e.g. converting to a list. Such conversion could likely be facilitated with collect(), the safer toLocalIterator(), or another intermediate facade that provides a numpy-like API over a pyspark DataFrame. Neglecting this basic feature is not really a "guardrail", especially not with the error message shown above.

While I hear you, @rxin, on the driver memory limit problem, I would recommend basing a choice here on real usage (e.g. an evaluation of call sites in real programs). I bet you'd find there is an overwhelming number of cases (particularly in existing pandas programs) where it is indeed affordable to, e.g., realize all data locally via unique(). Moreover, now that pandas / pyarrow offer thorough support for larger-than-memory data, users will almost certainly begin to adapt to (and program around) the same sort of memory-limit issues that pyspark RDD::collect() and friends entail. (And unfortunately for them, when they're out of RAM and swap, there's no setting to simply raise the memory limit.) Perhaps we'll even see pandas release its own facade (instead of realized numpy arrays) in future versions. I'm sure Wes would consider it.

I believe that #249 shipped as-is because koalas today lacks substantial integration testing. Given the lofty goal of koalas, it would be desirable to have some full-blown existing pandas programs in the test suite. (Perhaps right now the focus has been on converting existing pyspark programs?) There are a variety of pandas-based ETL scripts in tensorflow/models that seem like viable candidates.

tensorflow/models is a great target because almost any ETL those people are using would almost certainly run ~8x faster on a modern multi-core machine with pyspark in local mode. That speed-up is critical for finding ETL errors faster; single-threaded ETL in machine learning is not a competitive practice today.

Pandas-Bokeh might also be a nice candidate, since today their pyspark-pandas interop is simply to read the entire pyspark DataFrame into memory: https://github.com/PatrikHlobil/Pandas-Bokeh/blob/master/pandas_bokeh/plot.py#L1740

Seeing how koalas performs on a test suite of real pandas programs would also be very useful for helping potential koalas users set expectations. My own use case is similar to the Pandas-Bokeh histogram example (https://github.com/PatrikHlobil/Pandas-Bokeh#histogram), except I really need pyspark for the expensive (and embarrassingly parallelizable) task of sampling examples for each bucket: https://github.com/pwais/au2018/blob/aba8936159c328777a2d9548443e927a78043bd2/au/plotting.py#L128

I mainly tried out koalas because I saw the video of you @rxin on stage with stuffed koalas everywhere.

rxin (Contributor) commented Aug 5, 2019

Thanks for the feedback. But you can convert from the facade to what you want by just calling to_numpy, can't you?
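
For example, continuing the session from the original report (a sketch only; this assumes to_numpy() is available on the koalas Series in that version, and the output shown is the expected one for the toy frame above):

```py
>>> kdf['a'].unique().to_numpy()   # explicitly collects the distinct values to the driver
array([1, 2])
```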

The problem with your argument (if in most cases the data would fit in memory, don't worry about it and just return a local series) is that then we should make all methods return a pandas DataFrame or Series, because in the vast majority of cases the data does fit in memory. So there's no point doing anything, since pandas would just work fine there.

rxin (Contributor) commented Aug 5, 2019

BTW, one alternative is to return a NumPy array by default, similar to pandas, and provide an option or a different method that returns a Series.
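
Concretely, that alternative could look something like the following (a hypothetical sketch, not current koalas behavior; the collect=False flag is made up purely for illustration):

```py
>>> kdf['a'].unique()                # would collect and return np.array([1, 2]), matching pandas
>>> kdf['a'].unique(collect=False)   # hypothetical option that keeps the result as a koalas Series
```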

pwais (Author) commented Aug 5, 2019

I didn't know to_numpy() was an option; it's certainly not apparent from the error message. I hear you on using it as a possible mitigation, but I thought the goal of koalas was to not require user changes to existing code. If my program has a large number of pandas operations and I have to add to_numpy() all over the place, I'm much more likely to just dump koalas and do something else. Moreover, if to_numpy() causes an OOM, then I've just wasted more time.

> then we should make all methods return a pandas DataFrame or Series, because in the vast majority of cases the data does fit in memory. So there's no point doing anything, since pandas would just work fine there.

And that's probably why people use pandas instead of koalas? Pandas-Bokeh is an apposite example where the author mostly punted on pyspark compatibility.

@rxin, your announcement on stage was that koalas exists because users don't like dealing with the pyspark DataFrame API, since it's different from pandas. At least those differences are explicit. In this case, koalas is lacking key things like implicit list conversion for the Series type. Your feedback here suggests that this deficiency is in line with expectations for koalas. I invite you to listen more closely to users.

rxin (Contributor) commented Aug 6, 2019

No, I hear you loud and clear, but you are pointing out a fundamental issue that we knew about when we started the project.

That is, pandas's API is not designed to scale at all. It simply has too many APIs that do sort-of-random things that even the pandas authors themselves regret adding. Those are incredibly powerful, though, and I don't think we will ever be able to support all of them with identical signatures.

unique is a good example here. unique returns a numpy array, which is a data structure that holds all of the data in memory. Implicit list conversion is another example: implicitly converting a large amount of data to a list in memory on the driver is also an anti-pattern. It makes it too easy for users to shoot themselves in the foot. This will unfortunately also create subtle differences between koalas and pandas, just like the ones that exist with all other "pandas on X" libraries.

I don't think I pitched koalas as running your pandas code unmodified? If I did, sorry about that. A better way to set expectations is that it reduces the friction when going from pandas to Spark, because most common functions have identical signatures.

This has been my largest fear with koalas. I'm never sure how much these differences matter in practice. But since the April release, feedback has been pretty positive, so we will continue pushing and reducing the gap. I do think there will always be pandas programs that don't work on koalas and require changes.

rxin (Contributor) commented Aug 6, 2019

One addition: we should at the very least improve the error message, so people will find out about to_numpy without reading all the docs.

smalory commented Sep 27, 2019

The way I see it, pandas made a very strange decision to return a numpy array in this case as opposed to a pandas Series. It feels inconsistent. I don't think this discussion would exist if pandas unique returned a pandas Series, since koalas unique returns a koalas Series and the consistency would hold.

Because of this, to have a truly consistent API with pandas, there would need to be a spark drop-in for numpy arrays.

I completely agree that the error message needs to be clarified. However, I think that returning a koalas Series, at least by default, is the right thing to do for consistency.

rxin (Contributor) commented Sep 27, 2019

Thanks for the comment @smalory.

@ueshin can we fix the error message?

HyukjinKwon pushed a commit that referenced this issue Sep 28, 2019
…es` and `Index`. (#836)

`DataFrame.__iter__()` returns an iterator of columns.

And explicitly disable `__iter__()` with a proper error message for `Series` and `Index`.

```py
>>> list(ks.Series([1,2,3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/series.py", line 3131, in __iter__
    return _MissingPandasLikeSeries.__iter__(self)
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/missing/__init__.py", line 24, in unsupported_function
    reason=reason)
databricks.koalas.exceptions.PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```

Resolves #555.