fetch_openml: Add an option which returns a DataFrame #11818
Comments
An extra challenge may be supporting SparseDataFrame output. But first we need to confirm that Scikit-learn generally supports SparseDataFrames...
@jorisvandenbossche would know more, but as far as I understand, at least for text processing with a typical sparse document-term matrix, in my experience the performance is not even remotely comparable to using CSR (as in, not usable). IMO, sparse xarrays (pydata/xarray#1375) would be more what would be needed for text processing, but that's not really in scope. Though maybe ...
I am not sure there is necessarily a difference (I didn't write the docs :-)); it's just that pandas uses NaN as the default "fill value" instead of 0, but you can easily choose 0 when your data is mostly zeros.
Compared to CSR, I assume the main (inherent) limitation comes from the fact that you have a huge number of 1D sparse arrays instead of one 2D sparse array (and, in addition, the implementation in pandas may well not be the most optimized either). So I think SparseDataFrame simply does not fit the needs of use cases with a huge number of features. That said, I would not consider the sparse functionality the most stable part of pandas (I never used it in practice myself, but there is currently some refactoring being done: pandas-dev/pandas#22325 (review)). So I think a good first step would already be to support dense DataFrames.
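As a concrete point of comparison, here is a minimal sketch contrasting the two layouts. It uses the pandas sparse accessor from more recent pandas releases (the SparseDataFrame class discussed above works similarly but has since been deprecated); the column names are made up:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A mostly-zero matrix: 5 rows x 4 columns.
dense = np.zeros((5, 4))
dense[0, 1] = 3.0
dense[2, 3] = 7.0

# One 2-D sparse structure: a single CSR matrix.
csr = sparse.csr_matrix(dense)

# pandas instead stores each column as its own 1-D sparse array; the fill
# value defaults to NaN, but from_spmatrix uses 0, matching the CSR layout.
df = pd.DataFrame.sparse.from_spmatrix(csr, columns=list("abcd"))
print(df.dtypes)          # Sparse[float64, 0] for every column
print(df.sparse.density)  # fraction of explicitly stored values

# Converting back to a scipy matrix when an estimator needs one.
coo = df.sparse.to_coo()
```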
Some datasets have binary and numerical features incorrectly marked as categorical, so we should not rely on the metadata but heuristically convert the values; I just call ...
We shouldn't be using heuristics here. What if the two values are ...?
It is better to process them as booleans in this case, rather than floats, isn't it?
Why? How would you know that? That depends entirely on the semantics of the data and your model, right? What if the test set has 3.5? Is that a different category or just a bit larger than 3.4?
We don't get test sets separately; we get whole datasets and split them ourselves. We decide based on all the data. And there is no difference between ...
This is not a good assumption for machine learning, and not the assumption scikit-learn is based on.
We are currently speaking not about ML in general, but about fetching datasets from OpenML.
Fair. But what if the dataset has three values? Then you don't do it? That's pretty unexpected behavior.
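To make the objection concrete, here is a small illustrative sketch (not anyone's actual implementation) of such a two-value heuristic and how adding a third value changes its behaviour:

```python
import pandas as pd

def heuristic_to_bool(column: pd.Series) -> pd.Series:
    """Illustrative heuristic: treat a numeric column as boolean
    when it contains exactly two distinct values."""
    uniques = column.dropna().unique()
    if len(uniques) == 2:
        # e.g. values {3.4, 3.5} silently become {False, True}
        return column == uniques.max()
    return column

two_valued = pd.Series([3.4, 3.5, 3.4, 3.5])
three_valued = pd.Series([3.4, 3.5, 3.6, 3.4])

print(heuristic_to_bool(two_valued).dtype)    # bool
print(heuristic_to_bool(three_valued).dtype)  # float64: the same data plus one
                                              # extra value is handled differently
```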
Btw, #12502 is somewhat related.
Yes. But why should I be surprised? The whole point of having machine-readable datasets is to apply models to them without any human intervention, so I shouldn't even know that something has changed; my library will do the preprocessing for me.
You're suggesting to actually discard the machine-readable information, i.e. the metadata.
misinformation
I was also surprised to find that OpenML did not have more standards about the representation of Booleans. There seems to be a convention around the use of TRUE / FALSE and I think we should be encoding these by default. If truly numeric values are being represented as categorical, I think this is a limitation of ARFF in not distinguishing between ordinals and unordered categoricals. OpenML should probably store metadata for these cases.
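For the TRUE / FALSE convention mentioned above, a minimal sketch of what encoding such a nominal column as booleans by default could look like (the column name and values here are made up):

```python
import pandas as pd

# Hypothetical nominal column following the TRUE / FALSE convention.
raw = pd.Series(["TRUE", "FALSE", "TRUE", None], name="defaulted")

# Map to pandas' nullable boolean dtype so missing values are preserved.
as_bool = raw.map({"TRUE": True, "FALSE": False}).astype("boolean")
print(as_bool.dtype)  # boolean
```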
Just to give some feedback on this as a user: I tried to load https://www.openml.org/d/1461, which is a heterogeneous dataset with [...]. When using [...]. In terms of usability of OpenML datasets, returning DataFrames would be really nice.
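As a rough workaround in the meantime, something along these lines can rebuild a readable DataFrame from the ordinally-encoded output. This is only a sketch: it assumes the Bunch exposes feature_names and a categories dict mapping column names to their nominal labels, and that the data comes back dense for this dataset.

```python
import pandas as pd
from sklearn.datasets import fetch_openml

bunch = fetch_openml(data_id=1461)  # the dataset linked above (d/1461)
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# Map the ordinal codes back to their nominal labels where metadata exists.
for column, labels in bunch.categories.items():
    codes = df[column].fillna(-1).astype(int)   # -1 marks missing values
    df[column] = pd.Categorical.from_codes(codes, categories=labels)

print(df.dtypes)
```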
fetch_openml currently rejects STRING-valued attributes and ordinal-encodes all NOMINAL attributes, in order to return an array or sparse matrix of floats by default.
We should have a parameter that instead returns a DataFrame of features as the 'data' entry in the returned Bunch. This would (by default) keep nominals as pd.Categorical and strings as objects. Columns would have names determined from the ARFF attribute names / OpenML metadata. Perhaps we would also set the DataFrame's index corresponding to the is_row_identifier attribute in OpenML.

See #10733 for the general issue of an API for returning DataFrames in sklearn.datasets.
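For illustration, a hypothetical sketch of how such an option might be used; the parameter name as_frame and the resulting dtypes are assumptions about the proposal, not an agreed API:

```python
from sklearn.datasets import fetch_openml

# `as_frame` is a hypothetical name for the proposed option; data_id 1461
# is the dataset linked earlier in the thread.
bunch = fetch_openml(data_id=1461, as_frame=True)
X = bunch.data                              # a pandas DataFrame instead of an ndarray

print(X.dtypes)                             # nominal columns as 'category'
print(X.select_dtypes("category").columns)  # columns kept as pd.Categorical
```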