
Case Study: Criteo dataset #295

Open
mrocklin opened this issue Jul 15, 2018 · 24 comments

Labels
Case Study Large-scale example as stress tests

Comments

@mrocklin
Member

mrocklin commented Jul 15, 2018

The Criteo dataset is a 1TB dump of features around advertisements and whether or not someone clicked on the ad. It has both dense numeric and categorical/sparse data. I believe that the data is freely available on Azure.

There are some things that we might want to do with this dataset that are representative of other problems:

  1. Logistic regression on large sparse data. This could use existing algorithms like L-BFGS or ADMM, or it could use the more recent incremental SGD work (a rough sketch follows this list). It would be useful to compare the effectiveness of these algorithms.
  2. We could also add hyper-parameter optimization.
  3. Gradient boosted trees, presumably with the dask-xgboost connection. This raises a couple of questions. Can XGBoost support categorical data or scipy.sparse arrays? Or perhaps we have to provide a column of integers.
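
For illustration, here is a minimal sketch of the incremental-SGD route (not from the issue itself), using dask_ml.wrappers.Incremental around scikit-learn's SGDClassifier; the random toy arrays below merely stand in for the real Criteo features:

import dask.array as da
from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

# Toy stand-in for the Criteo features: blocked dask arrays, ~3% positives.
X = da.random.random((100_000, 40), chunks=(10_000, 40))
y = (da.random.random(100_000, chunks=10_000) < 0.03).astype(int)

# SGDClassifier with log loss is logistic regression trained with SGD
# (newer scikit-learn spells the loss 'log_loss'). Incremental feeds it one
# block at a time via partial_fit, so the data never has to fit in memory.
clf = Incremental(SGDClassifier(loss='log', penalty='l2', tol=1e-3))
clf.fit(X, y, classes=[0, 1])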

As always, it might be a good start to just download a little bit of the Criteo dataset (I think that each day of data is available separately) and work with sklearn directly to establish a baseline.

This came out of conversation with @ogrisel

@mrocklin mrocklin added the Case Study Large-scale example as stress tests label Jul 15, 2018
@stsievert
Member

The dataset is available on Azure (source). It has 13 integer features and 26 categorical features (the categorical features are hashed for anonymization), and is stored as 24 individual files, one for each day (source).

I would rather use an SGD implementation than L-BFGS or ADMM because, IIRC, those algorithms use the entire dataset in every optimization step.

@mrocklin
Member Author

I suspect that for initial use we'll want to play with a single day. I played with this once and it looks like I kept around a conversion script to parquet:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # start a local cluster; the dashboard helps track progress

categories = ['category_%d' % i for i in range(26)]
columns = ['click'] + ['numeric_%d' % i for i in range(13)] + categories

# The raw Criteo files are tab-separated with no header row.
df = dd.read_csv('day_0', sep='\t', names=columns, header=None)

# Store the hashed categorical columns as fixed-width byte strings (fastparquet options).
encoding = {c: 'bytes' for c in categories}
fixed = {c: 8 for c in categories}
df.to_parquet('day-0-bytes.parquet', object_encoding=encoding,
              fixed_text=fixed, compression='SNAPPY')
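
Presumably the result can then be read back quickly for later experiments with dd.read_parquet('day-0-bytes.parquet').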

@mrocklin
Member Author

Although, if memory serves, even a single day was a bit of a pain to work with on a single machine. It'll be fun to eventually scale up when we're ready.

Presumably we'll start with whatever seems easiest to get decent results from (as you say, SGD) but will eventually want to try a few things. I imagine that some of the preprocessing steps will be the same.

I also recall that when I tried things I got really poor scores. Someone told me this was because clicks were rare and I wasn't weighting them properly.

I suspect that this will force a few topics:

  1. Preprocessing
  2. Conversion to sparse matrices (dask#3738, "Test that dask collections can hold scipy.sparse arrays", might help)
  3. Some incremental learning with SGD as you suggest
  4. Maybe early stopping?
  5. Maybe hyper-parameter optimization?

This is also the part of the process (building a pipeline and model) about which I know very little. I'm looking forward to seeing what you end up doing. My guess is that it will be easier to iterate with a small amount of data, probably by just picking off a small bit of the parquet data with pandas and using scikit-learn natively.
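
For concreteness, a quick baseline along those lines might look roughly like this (a hedged sketch, assuming the Parquet file and column names from the conversion script above, and using only the numeric columns):

import dask.dataframe as dd
from sklearn.linear_model import LogisticRegression

# Pull just the first partition into pandas for quick local iteration.
sample = dd.read_parquet('day-0-bytes.parquet').get_partition(0).compute()

numeric = [c for c in sample.columns if c.startswith('numeric_')]
X = sample[numeric].fillna(0)
y = sample['click']

# Plain scikit-learn as a baseline; class_weight='balanced' compensates a
# little for the rarity of clicks.
LogisticRegression(solver='lbfgs', class_weight='balanced').fit(X, y)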

@stsievert
Member

And they're on Kaggle too: https://www.kaggle.com/c/criteo-display-ad-challenge. This will allow good evaluation of our model, with a predefined scoring function (cross-entropy loss). Top scores on Kaggle are around 0.44 (lower is better).

probably by just picking off a small bit of the parquet data with pandas and using scikit-learn natively.

That's my plan for today. I'll try to develop a preprocessing pipeline and a reasonable model and test it on a subset of the data.

@stsievert
Member

Here’s a gist of a simple pipeline I put together: https://gist.github.com/stsievert/30702575de95328f199ab1d7e50795ef

Notes:

  • this dataset is unbalanced, with only 3% positive examples.
  • I take a (very) small fraction of the dataset, I think about 0.002%.
  • I do some preprocessing to avoid outliers. I should probably make it less complex.
  • The scores are highly variable. I suspect we need more data.

@TomAugspurger
Member

Thanks for this. I'm playing around with it locally now.

Thinking about scaling this to larger datasets:

  1. dd.get_dummies doesn't support sparse. Opened pandas-dev/pandas#21993 ("SparseDataFrame empty slice coerces to loses dtype and fill_value") for that; will fix it later today (have to work around a pandas bug).
  2. Did you look at dask_ml.preprocessing.Categorizer and dask_ml.preprocessing.DummyEncoder? With data this sparse, I think it'll be important to use (pandas) categoricals so that the transformer doing the one-hot / dummy encoding has a consistent shape on train / test sets (a small sketch follows below).
  3. I wonder if any of http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html are easily scalable to larger datasets (cc. @glemaitre if he has any quick thoughts here).
  4. You might want to have your LogisticProbs.score return the negative log loss. I think as currently written, if you tried to do hyperparameter optimization it would go in the wrong direction.

I'll look into the sparse dask get_dummies now, and check back later.
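
As a hedged illustration of point 2 above (not code from this thread), Categorizer and DummyEncoder can be chained so that the dummy columns stay consistent between fit and transform; the tiny frame below just stands in for a couple of the hashed Criteo columns:

import pandas as pd
import dask.dataframe as dd
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, DummyEncoder

# Tiny stand-in for two hashed categorical columns.
pdf = pd.DataFrame({'category_0': ['a', 'b', 'a', 'c'],
                    'category_1': ['x', 'x', 'y', 'y']})
df = dd.from_pandas(pdf, npartitions=2)

# Categorizer fixes the set of categories at fit time; DummyEncoder then
# one-hot encodes them, so train and test transforms get identical columns.
pipe = make_pipeline(Categorizer(), DummyEncoder())
wide = pipe.fit_transform(df)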

@TomAugspurger
Member

TomAugspurger commented Jul 20, 2018

I do some preprocessing to avoid outliers. I should probably make it less complex.

We have QuantileTransformer. Does that suffice? Hmm, perhaps not, as that only works with dask arrays of known shape, which we may not have at that point...

@glemaitre
Contributor

wonder if any of http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html are easily scalable to larger datasets (cc. @glemaitre if he has any quick thoughts here)

If you have a nearest neighbors algorithm which is distributed, then it should be possible to make a distributed SMOTE. The under-sampling methods also rely on NN, so it would most probably be possible to scale them.

@TomAugspurger
Member

Thanks. We don't have a nearest neighbors algorithm yet, but it's a TODO (also blocking HDBSCAN).


@stsievert regarding pd.get_dummies, you may want to just avoid sparse=True. Turns out it's slow (pandas-dev/pandas#21997), and I'm not entirely sure how much memory it's actually saving... profiling now.

@mrocklin
Member Author

mrocklin commented Jul 20, 2018 via email

@TomAugspurger
Member

For the case study, I recommend we go down all paths simultaneously, to find things that are broken :)

  1. Try with dd.get_dummies(sparse=False) and just pay the memory cost. You'll convert to dense anyway, since we have a mixture of sparse and dense (though you do a bit of filtering before that point, so this will have higher memory usage, I think).
  2. Implement dd.get_dummies(sparse=True) (coming soon) and see where things go
  3. Try OrdinalEncoder on the categorical features (should work today with scikit-learn & dask-ml), though this likely won't perform as well with a linear model (a rough sketch follows this list)
  4. Use scikit-learn dev & the new OneHotEncoder (dask version coming later today)
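
And a minimal hedged sketch of option 3, the ordinal route with dask-ml (again toy data, not code from the thread):

import pandas as pd
import dask.dataframe as dd
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, OrdinalEncoder

pdf = pd.DataFrame({'category_0': ['a', 'b', 'a', 'c']})
df = dd.from_pandas(pdf, npartitions=2)

# Categorizer makes the column categorical; OrdinalEncoder then replaces each
# category with its integer code, keeping the output dense and narrow.
coded = make_pipeline(Categorizer(), OrdinalEncoder()).fit_transform(df)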

@mrocklin
Member Author

mrocklin commented Jul 20, 2018 via email

@TomAugspurger
Member

Ah, there may be many categories not observed in the sample. In that case, we'll go down the sparse side of things.

@jakirkham
Member

Regarding nearest neighbors, there is some brute-force stuff in dask-distance, which might be useful. It could probably benefit from tree approaches as well, if people are interested in that problem.

@stsievert
Member

stsievert commented Jul 20, 2018

I do some preprocessing to avoid outliers. I should probably make it less complex.

We have QuantileTransformer. Does that suffice?

I'm using sklearn's QuantileTransformer. We are passing fit_intercept=True to LogisticRegression, so I think I'm okay with it.

a nearest neighbors algorithm which is distributed

There's some work with UMAP (cc @lmcinnes) to integrate Dask with an approximate nearest neighbors algorithm: lmcinnes/umap#62, which talks about modifying https://github.com/lmcinnes/pynndescent.

Did you look at dask_ml....

My goal so far has been to get a working pipeline with small data. Some bugs were in the way and I wanted to make sure I had a working pandas/sklearn implementation first.

@TomAugspurger
Member

Here's an update to @stsievert's notebook that uses a different pre-processing scheme. I haven't played with the actual estimator at all yet (and don't have short-term plans to).

https://gist.github.com/bdb99a9d1e226cc2130016b9d9d1bad4

That uses a few new features (a rough sketch of the scheme follows the list):

  • For high-cardinality categorical columns, we just integer-code the values.
  • For lower-cardinality categorical columns, we one-hot encode them.
  • For numeric columns, we just fill the missing values.
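
This is not the notebook itself, but a hedged sketch of roughly that scheme with scikit-learn 0.20-style pieces; the column lists and the cardinality split are invented for illustration:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

high_card = ['category_0', 'category_3']           # many distinct hashed values
low_card = ['category_1', 'category_2']            # few distinct values
numeric = ['numeric_%d' % i for i in range(13)]

preprocess = ColumnTransformer([
    ('int_code', OrdinalEncoder(), high_card),                            # integer-code
    ('one_hot', OneHotEncoder(handle_unknown='ignore'), low_card),        # one-hot encode
    ('fill', SimpleImputer(strategy='constant', fill_value=0), numeric),  # fill missing
])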

This spawned a few todos:

  1. Implement log_loss / neg_log_loss (may just do that this afternoon)
  2. dask#3090 ("Add DataFrame.to_dask_array") would be useful instead of the values_extractor. This would also help in general with implementing support for dask dataframes in some estimators that only handle dask arrays (e.g. QuantileTransformer).
  3. There's a noticeable delay between calling pipe.fit(X, y) and things showing up on the dashboard. I think we're submitting a very large number of tasks (4,600 calls to getitem alone); we should see what's going on there.
  4. Our pre-processing transformers should gracefully handle missing values like scikit-learn is starting to with 0.20.

Once all the dependent PRs are in, I'll probably try it out on a cluster to see how things look.

@mrocklin
Member Author

Nice notebook!

I'm curious, what does the score mean in this case? Given the imbalance that @stsievert noticed (3% click rate) I wonder if a score of 0.95 is good or not :)

Are there ways that we should be handling things, given that the output is so imbalanced? For example, when we're sampling we might choose a more balanced set. I imagine that there are also fancy weighting things one can do, though I don't have the experience to know what these are.

@TomAugspurger
Member

The benchmark accuracy (guessing all 0s) is 0.97, so 0.95 is impressively bad :)

I'll implement a log-loss score, and then we can start tuning the hyperparameters of the pipeline.

After that we can investigate strategies for the imbalanced dataset (#317)

@TomAugspurger
Member

For posterity, I blogged about the feature preprocessing part of this on https://tomaugspurger.github.io/sklearn-dask-tabular.html. I plan to work on the hyperparameter optimization part next (not sure when). https://gist.github.com/TomAugspurger/4a058f00b32fc049ab5f2860d03fd579 has the pieces.

@mrocklin
Member Author

mrocklin commented Oct 9, 2018

@TomAugspurger I enjoyed playing with your criteo example notebook. Thanks for pushing that up.

I was starting with a different variant of the dataset that I had stored, where the text columns were still stored as text rather than categoricals. Rather than convert globally, I decided to stick with this to see what came out of it. I replaced the custom HashingEncoder and the OneHotEncoder with just a HashingVectorizer. Things ran, although the result predicts no clicks on the training set itself with the default hyperparameters.

@mrocklin
Member Author

OK, here is a gist with my latest attempt: https://gist.github.com/233810e6813e7fd5b5a40f08bde02758

This depends on a bunch of small PRs recently submitted. I've added notes on future work at the bottom of the notebook. I'll reproduce them here.

Future Work

  • We currently only do a single pass over our data with Incremental.fit. We probably need to replace this with a system that does multiple passes until we converge (a rough sketch follows this list).
  • We probably want to choose the class_weights value more carefully, and there are likely some other parameters to this pipeline that could use tuning.
  • Currently dask-ml has mechanisms for hyper-parameter optimization and incremental training like what we're doing here, but they don't work on full pipelines.
  • It's tricky working with both dask arrays and dask dataframes in column transformer pipelines. We have to use dask arrays in order to support scipy.sparse matrices from the HashingVectorizer, but we prefer dealing with column names provided by dask dataframes.
  • Should HashingVectorizer support None values? (scikit-learn/scikit-learn#12347, "HashingVectorizer fails when data has None values")
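
A deliberately crude sketch of what multiple passes could look like (first bullet above); it assumes an Incremental-wrapped classifier inc and dask arrays X, y, X_val, y_val are already in hand:

# Keep making full passes until the validation score stops moving (naive rule).
prev = None
for epoch in range(10):
    inc.partial_fit(X, y, classes=[0, 1])   # one more pass, block by block
    score = inc.score(X_val, y_val)
    if prev is not None and abs(score - prev) < 1e-4:
        break
    prev = score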

@ogrisel

ogrisel commented Oct 11, 2018

An alternative to setting class_weights would be to rebalance the training set: undersample the majority class so that the positive class is represented around 30% of the time, for instance. If we do several passes over the training data, each pass could scan all the positives but a different subset of the negatives each time.

To compute the performance metrics on the held-out test split (e.g. the last month), such as a precision-recall curve, one should use the full test set without any resampling.

Note that scikit-learn does not yet have the API to do resampling in a pipeline, so I would advise doing this resampling loop manually for now.
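
For concreteness, a rough sketch of that manual resampling loop (this is not a scikit-learn or imbalanced-learn API); it assumes df is a dask DataFrame with the click column, features is the list of feature columns, and inc is an incremental classifier as in the earlier sketches:

import dask.dataframe as dd

positives = df[df.click == 1]
negatives = df[df.click == 0]

# Keep all positives and just enough negatives for roughly 30% positives.
frac = min(1.0, (len(positives) * 7 / 3) / len(negatives))

for epoch in range(5):
    subset = negatives.sample(frac=frac, random_state=epoch)  # fresh negatives each pass
    balanced = dd.concat([positives, subset])
    inc.partial_fit(balanced[features].to_dask_array(lengths=True),
                    balanced['click'].to_dask_array(lengths=True),
                    classes=[0, 1])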

@ogrisel

ogrisel commented Oct 11, 2018

You might also want to tweak the value of the regularizer of the SGDClassifier, that is, decrease the default value of alpha.

@Sandy4321

Has anybody tried to run this locally? For example, like
https://github.com/rambler-digital-solutions/criteo-1tb-benchmark
