`unique()` has wrong return type #555
Comments
This is because of https://github.com/databricks/koalas#guardrails-to-prevent-users-from-shooting-themselves-in-the-foot
I guess you could claim that If
The problem with returning a numpy array is that if somebody runs unique on, say, user id, that would blow up the driver immediately.
Hi @rxin ! Thanks for your attention here. I've sadly had to dump So here's the root problem in my mind: While @rxin I hear you on driver memory limit problem, I would recommend pursuing a choice here based upon real usage (e.g. an evaluation of call sites in real programs). I bet you'd find that there are an overwhelming number of cases (particularly in existing I believe that #249 shipped as-is because
Pandas-Bokeh might also be a nice candidate, since today their Seeing how I mainly tried out |
Thanks for the feedback. But you can convert from the facade to what you want by just calling to_numpy, can't you? The problem with your argument (if in most cases the data would fit in memory, don't worry about it and just return a local series) is that we should then make all methods return a pandas DataFrame or Series, because in the vast majority of cases the data does fit in memory. So there would be no point in doing anything, since pandas would just work fine there.
BTW, one alternative is to return a np array by default, similar to pandas, and provide an option or a different method that returns a series.
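The trade-off in the two comments above can be sketched in plain Python. This is only a toy illustration under stated assumptions, not Koalas code: `LazyColumn`, `to_local()`, and the `as_local` flag are hypothetical names, and a Python list stands in for both the distributed data and the NumPy array.

```python
class LazyColumn:
    """Hypothetical stand-in for a distributed column (not the Koalas API)."""

    def __init__(self, values):
        # Pretend this data lives on the cluster rather than on the driver.
        self._values = list(values)

    def unique(self, as_local=False):
        # Deduplicate while preserving first-seen order.
        distinct = list(dict.fromkeys(self._values))
        if as_local:
            # Opt-in: the caller accepts pulling everything to the driver.
            return distinct
        # Default: stay behind the facade and return another LazyColumn.
        return LazyColumn(distinct)

    def to_local(self):
        # Explicit materialization, playing the role of to_numpy().
        return list(self._values)


user_ids = LazyColumn([3, 1, 3, 2, 1])
facade = user_ids.unique()                      # still a LazyColumn
local_default = facade.to_local()               # explicit conversion
local_opt_in = user_ids.unique(as_local=True)   # opt-in local result
```

Flipping the default of the hypothetical `as_local` flag to `True` would give the pandas-like variant suggested in the comment above, at the cost of making the unsafe path the easy one.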
I didn't know
And that's probably why people use @rxin your announcement on stage was that |
No, I hear you loud and clear, but you are pointing out a fundamental issue that we knew about when we started the project. That is, pandas's API is not designed to scale at all. It simply has too many APIs that do sort of random things that even the pandas authors themselves regret adding. Those are incredibly powerful though, and I don't think we will ever be able to support all of them with identical signatures. unique is a good example here. unique returns a numpy array, which is a data structure that holds all of the data in memory. Implicit list conversion is another example: implicitly converting a large amount of data to a list in memory on the driver is also an anti-pattern. It makes it too easy for users to shoot themselves in the foot. This will unfortunately also create subtle differences between koalas and pandas, just like the ones that exist with all other "pandas on X" libraries. I don't think I pitched koalas as running your pandas code unmodified? If I did, sorry about that. A better way to set expectations is that it reduces the friction when going from pandas to Spark, because most common functions have identical signatures. This has been my largest fear with koalas. I'm never sure how much these differences matter in practice. But since this was released in Apr, feedback has been pretty positive, so we will continue pushing and reducing the gap. I do think there will always be pandas programs that don't work on koalas and require changes.
One addition: we should at the very least improve the error message, so people will find out about to_numpy without reading all the docs.
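A minimal sketch of what such an error message could look like, in plain Python. The names `NotImplementedWithHint`, `GuardedColumn`, and `to_local()` are hypothetical and chosen only for illustration; the point is that the exception itself names the explicit conversion method.

```python
class NotImplementedWithHint(NotImplementedError):
    """Hypothetical error type that always carries a 'what to do instead' hint."""

    def __init__(self, method, hint):
        super().__init__(f"The method `{method}` is not implemented. {hint}")


class GuardedColumn:
    """Hypothetical column that refuses implicit collection to the driver."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        # Implicit iteration would silently pull all data locally, so fail
        # loudly and point at the explicit alternative.
        raise NotImplementedWithHint(
            "__iter__()",
            "To collect the data locally, call to_local() explicitly.",
        )

    def to_local(self):
        # Explicit, opt-in materialization.
        return list(self._values)


try:
    list(GuardedColumn([1, 2, 3]))
except NotImplementedWithHint as exc:
    message = str(exc)
```

With this shape, a user who writes `list(column)` out of pandas habit learns about the explicit conversion from the traceback rather than from the docs.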
The way I see it, pandas made a very strange decision to return a numpy array in this case as opposed to a pandas Series. It feels inconsistent. I don't think this discussion would exist were pandas Because of this, to have a truly consistent API with pandas, there would need to be a spark I completely agree that the error message needs to be clarified. However, I think that returning a koalas Series, at least by default, is the right thing to do for consistency.
…es` and `Index`. (#836)
`DataFrame.__iter__()` returns an iterator of columns. And explicitly disable `__iter__()` with a proper error message for `Series` and `Index`.
```py
>>> list(ks.Series([1,2,3]))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/series.py", line 3131, in __iter__
    return _MissingPandasLikeSeries.__iter__(self)
  File "/Users/ueshin/workspace/databricks-koalas/master/databricks/koalas/missing/__init__.py", line 24, in unsupported_function
    reason=reason)
databricks.koalas.exceptions.PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```
Resolves #555.
cc #233
Not sure why #249 decided to return a `Series` instead of a `numpy` array...?