fetch_openml: Add an option which returns a DataFrame #11818
Comments
An extra challenge may be supporting SparseDataFrame output. But first we need to confirm that Scikit-learn generally supports SparseDataFrames...
@jorisvandenbossche would know more, but as far as I understand, at least for text processing with a typical sparse document-term matrix, in my experience the performance is not even remotely comparable to using CSR (as in, not usable). IMO, sparse xarrays (pydata/xarray#1375) would be more what would be needed for text processing, but that's not really in scope. Though maybe ...
I am not sure there is necessarily a difference (I didn't write the docs :-)); it's just that pandas uses NaN as the default "fill value" instead of 0, but you can easily choose 0 when your data is mostly zeros.
Compared to CSR, I assume the main (inherent) limitation comes from the fact that you have a huge number of 1D sparse arrays instead of one 2D sparse array (and, in addition, the implementation in pandas may well not be the most optimized either). So I think SparseDataFrame simply does not fit the needs of use cases with a huge number of features. That said, I would not consider the sparse functionality the most stable part of pandas (I never used it in practice myself, but there is currently some refactoring being done: pandas-dev/pandas#22325 (review)). So I think a good first step would already be to support dense DataFrames.
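As a concrete point of comparison, here is a minimal sketch contrasting the two layouts. It uses the pandas sparse accessor from more recent pandas releases (the SparseDataFrame class discussed above works similarly but has since been deprecated); the column names are made up:

```python
import numpy as np
import pandas as pd
from scipy import sparse

# A mostly-zero matrix: 5 rows x 4 columns.
dense = np.zeros((5, 4))
dense[0, 1] = 3.0
dense[2, 3] = 7.0

# One 2-D sparse structure: a single CSR matrix.
csr = sparse.csr_matrix(dense)

# pandas instead stores each column as its own 1-D sparse array; the fill
# value defaults to NaN, but from_spmatrix uses 0, matching the CSR layout.
df = pd.DataFrame.sparse.from_spmatrix(csr, columns=list("abcd"))
print(df.dtypes)          # Sparse[float64, 0] for every column
print(df.sparse.density)  # fraction of explicitly stored values

# Converting back to a scipy matrix when an estimator needs one.
coo = df.sparse.to_coo()
```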
Some datasets have binary and numerical features incorrectly marked as categorical, so we should not rely on the metadata but heuristically convert the values; I just call ...
We shouldn't be using heuristics here. What if the two values are ...?
It is better to process them as booleans in this case, rather than floats, isn't it?
Why? How would you know that? That depends entirely on the semantics of the data and your model, right? What if the test set has 3.5? Is that a different category or just a bit larger than 3.4?
We don't get test sets separately; we get whole datasets and split them ourselves. We decide based on all the data. And there is no difference between ...
This is not a good assumption for machine learning, and not the assumption scikit-learn is based on.
We are currently speaking not about ML in general, but about fetching datasets from OpenML.
Fair. But what if the dataset has three values? Then you don't do it? That's pretty unexpected behavior.
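To make the objection concrete, here is a small illustrative sketch (not anyone's actual implementation) of such a two-value heuristic and how adding a third value changes its behaviour:

```python
import pandas as pd

def heuristic_to_bool(column: pd.Series) -> pd.Series:
    """Illustrative heuristic: treat a numeric column as boolean
    when it contains exactly two distinct values."""
    uniques = column.dropna().unique()
    if len(uniques) == 2:
        # e.g. values {3.4, 3.5} silently become {False, True}
        return column == uniques.max()
    return column

two_valued = pd.Series([3.4, 3.5, 3.4, 3.5])
three_valued = pd.Series([3.4, 3.5, 3.6, 3.4])

print(heuristic_to_bool(two_valued).dtype)    # bool
print(heuristic_to_bool(three_valued).dtype)  # float64: the same data plus one
                                              # extra value is handled differently
```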
Btw, #12502 is somewhat related.
Yes. But why should I be surprised? The whole point of having machine-readable datasets is to apply models to them without any human intervention, so I shouldn't even know that something has changed; my library will do the preprocessing for me.
You're suggesting to actually discard the machine-readable information, i.e. the metadata.
misinformation
I was also surprised to find that OpenML did not have more standards about the representation of Booleans. There seems to be a convention around the use of TRUE / FALSE and I think we should be encoding these by default. If truly numeric values are being represented as categorical, I think this is a limitation of ARFF in not distinguishing between ordinals and unordered categoricals. OpenML should probably store metadata for these cases.
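For the TRUE / FALSE convention mentioned above, a minimal sketch of what encoding such a nominal column as booleans by default could look like (the column name and values here are made up):

```python
import pandas as pd

# Hypothetical nominal column following the TRUE / FALSE convention.
raw = pd.Series(["TRUE", "FALSE", "TRUE", None], name="defaulted")

# Map to pandas' nullable boolean dtype so missing values are preserved.
as_bool = raw.map({"TRUE": True, "FALSE": False}).astype("boolean")
print(as_bool.dtype)  # boolean
```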
Just to give some feedback on this as a user: I tried to load https://www.openml.org/d/1461, which is a heterogeneous dataset with [...]. When using [...]. In terms of usability of OpenML datasets, returning DataFrames would be really nice.
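As a rough workaround in the meantime, something along these lines can rebuild a readable DataFrame from the ordinally-encoded output. This is only a sketch: it assumes the Bunch exposes feature_names and a categories dict mapping column names to their nominal labels, and that the data comes back dense for this dataset.

```python
import pandas as pd
from sklearn.datasets import fetch_openml

bunch = fetch_openml(data_id=1461)  # the dataset linked above (d/1461)
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# Map the ordinal codes back to their nominal labels where metadata exists.
for column, labels in bunch.categories.items():
    codes = df[column].fillna(-1).astype(int)   # -1 marks missing values
    df[column] = pd.Categorical.from_codes(codes, categories=labels)

print(df.dtypes)
```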
fetch_openml currently rejects STRING-valued attributes and ordinal-encodes all NOMINAL attributes, in order to return an array or sparse matrix of floats by default.
We should have a parameter that instead returns a DataFrame of features as the 'data' entry in the returned Bunch. This would (by default) keep nominals as pd.Categorical and strings as objects. Columns would have names determined from the ARFF attribute names / OpenML metadata. Perhaps we would also set the DataFrame's index corresponding to the is_row_identifier attribute in OpenML.

See #10733 for the general issue of an API for returning DataFrames in sklearn.datasets.
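For illustration, a hypothetical sketch of how such an option might be used; the parameter name as_frame and the resulting dtypes are assumptions about the proposal, not an agreed API:

```python
from sklearn.datasets import fetch_openml

# `as_frame` is a hypothetical name for the proposed option; data_id 1461
# is the dataset linked earlier in the thread.
bunch = fetch_openml(data_id=1461, as_frame=True)
X = bunch.data                              # a pandas DataFrame instead of an ndarray

print(X.dtypes)                             # nominal columns as 'category'
print(X.select_dtypes("category").columns)  # columns kept as pd.Categorical
```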