
Case Study: Criteo dataset #295

Open
mrocklin opened this issue Jul 15, 2018 · 24 comments

Labels
Case Study Large-scale example as stress tests

Comments

@mrocklin
Member

mrocklin commented Jul 15, 2018

The Criteo dataset is a 1TB dump of features around advertisements and whether or not someone clicked on the ad. It has both dense numeric and categorical/sparse data. I believe that the data is freely available on Azure.

There are some things that we might want to do with this dataset that are representative of other problems:

  1. Logistic regression on large sparse data. This could use existing algorithms like L-BFGS or ADMM, or it could use the more recent incremental SGD work (a rough sketch follows this list). It would be useful to compare the effectiveness of these algorithms.
  2. We could also add hyper-parameter optimization.
  3. Gradient boosted trees, presumably with the dask-xgboost connection. This raises a couple of questions. Can XGBoost support categorical data or scipy.sparse arrays? Or perhaps we have to provide a column of integers.
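
For illustration, here is a minimal sketch of the incremental-SGD route (not from the issue itself), using dask_ml.wrappers.Incremental around scikit-learn's SGDClassifier; the random toy arrays below merely stand in for the real Criteo features:

import dask.array as da
from sklearn.linear_model import SGDClassifier
from dask_ml.wrappers import Incremental

# Toy stand-in for the Criteo features: blocked dask arrays, ~3% positives.
X = da.random.random((100_000, 40), chunks=(10_000, 40))
y = (da.random.random(100_000, chunks=10_000) < 0.03).astype(int)

# SGDClassifier with log loss is logistic regression trained with SGD
# (newer scikit-learn spells the loss 'log_loss'). Incremental feeds it one
# block at a time via partial_fit, so the data never has to fit in memory.
clf = Incremental(SGDClassifier(loss='log', penalty='l2', tol=1e-3))
clf.fit(X, y, classes=[0, 1])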

As always, it might be a good start to just download a little bit of the Criteo dataset (I think that each day of data is available separately) and work with sklearn directly to establish a baseline.

This came out of conversation with @ogrisel

@mrocklin mrocklin added the Case Study Large-scale example as stress tests label Jul 15, 2018
@stsievert
Member

The dataset is available on Azure (source). It has 13 integer features and 26 categorical features (the categorical features are hashed for anonymization), and is stored as 24 individual files, one for each day (source).

I would rather use an SGD implementation than L-BFGS or ADMM because, IIRC, those algorithms use the entire dataset in every optimization step.

@mrocklin
Member Author

I suspect that for initial use we'll want to play with a single day. I played with this once and it looks like I kept around a conversion script to parquet:

import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # start a local cluster; the dashboard helps track progress

categories = ['category_%d' % i for i in range(26)]
columns = ['click'] + ['numeric_%d' % i for i in range(13)] + categories

# The raw Criteo files are tab-separated with no header row.
df = dd.read_csv('day_0', sep='\t', names=columns, header=None)

# Store the hashed categorical columns as fixed-width byte strings (fastparquet options).
encoding = {c: 'bytes' for c in categories}
fixed = {c: 8 for c in categories}
df.to_parquet('day-0-bytes.parquet', object_encoding=encoding,
              fixed_text=fixed, compression='SNAPPY')
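
Presumably the result can then be read back quickly for later experiments with dd.read_parquet('day-0-bytes.parquet').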

@mrocklin
Member Author

Although, if memory serves, even a single day was a bit of a pain to work with on a single machine. It'll be fun to eventually scale up when we're ready.

Presumably we'll start with whatever seems easiest to get decent results from (as you say, SGD) but will eventually want to try a few things. I imagine that some of the preprocessing steps will be the same.

I also recall that when I tried things I got really poor scores. Someone told me this was because clicks were rare and I wasn't weighting them properly.

I suspect that this will force a few topics:

  1. Preprocessing
  2. Conversion to sparse matrices (dask#3738, "Test that dask collections can hold scipy.sparse arrays", might help)
  3. Some incremental learning with SGD as you suggest
  4. Maybe early stopping?
  5. Maybe hyper-parameter optimization?

This is also the part of the process (building a pipeline and model) about which I know very little. I'm looking forward to seeing what you end up doing. My guess is that it will be easier to iterate with a small amount of data, probably by just picking off a small bit of the parquet data with pandas and using scikit-learn natively.
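
For concreteness, a quick baseline along those lines might look roughly like this (a hedged sketch, assuming the Parquet file and column names from the conversion script above, and using only the numeric columns):

import dask.dataframe as dd
from sklearn.linear_model import LogisticRegression

# Pull just the first partition into pandas for quick local iteration.
sample = dd.read_parquet('day-0-bytes.parquet').get_partition(0).compute()

numeric = [c for c in sample.columns if c.startswith('numeric_')]
X = sample[numeric].fillna(0)
y = sample['click']

# Plain scikit-learn as a baseline; class_weight='balanced' compensates a
# little for the rarity of clicks.
LogisticRegression(solver='lbfgs', class_weight='balanced').fit(X, y)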

@stsievert
Member

And they're on Kaggle too: https://www.kaggle.com/c/criteo-display-ad-challenge. This will allow good evaluation of our model, with a predefined scoring function (cross-entropy loss). Top scores on Kaggle are around 0.44 (lower is better).

probably by just picking off a small bit of the parquet data with pandas and using scikit-learn natively.

That's my plan for today. I'll try to develop a preprocessing pipeline and a reasonable model and test it on a subset of the data.

@stsievert
Member

Here’s a gist of a simple pipeline I put together: https://gist.github.com/stsievert/30702575de95328f199ab1d7e50795ef

Notes:

  • this dataset is unbalanced, with only 3% positive examples.
  • I take a (very) small fraction of the dataset, I think about 0.002%.
  • I do some preprocessing to avoid outliers. I should probably make it less complex.
  • The scores are highly variable. I suspect we need more data.

@TomAugspurger
Member

Thanks for this. I'm playing around with it locally now.

Thinking about scaling this to larger datasets:

  1. dd.get_dummies doesn't support sparse. Opened pandas-dev/pandas#21993 ("SparseDataFrame empty slice coerces to loses dtype and fill_value") for that; will fix it later today (have to work around a pandas bug).
  2. Did you look at dask_ml.preprocessing.Categorizer and dask_ml.preprocessing.DummyEncoder? With data this sparse, I think it'll be important to use (pandas) categoricals so that the transformer doing the one-hot / dummy encoding has a consistent shape on train / test sets (a small sketch follows below).
  3. I wonder if any of http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html are easily scalable to larger datasets (cc. @glemaitre if he has any quick thoughts here).
  4. You might want to have your LogisticProbs.score return the negative log loss. I think as currently written, if you tried to do hyperparameter optimization it would go in the wrong direction.

I'll look into the sparse dask get_dummies now, and check back later.
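
As a hedged illustration of point 2 above (not code from this thread), Categorizer and DummyEncoder can be chained so that the dummy columns stay consistent between fit and transform; the tiny frame below just stands in for a couple of the hashed Criteo columns:

import pandas as pd
import dask.dataframe as dd
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, DummyEncoder

# Tiny stand-in for two hashed categorical columns.
pdf = pd.DataFrame({'category_0': ['a', 'b', 'a', 'c'],
                    'category_1': ['x', 'x', 'y', 'y']})
df = dd.from_pandas(pdf, npartitions=2)

# Categorizer fixes the set of categories at fit time; DummyEncoder then
# one-hot encodes them, so train and test transforms get identical columns.
pipe = make_pipeline(Categorizer(), DummyEncoder())
wide = pipe.fit_transform(df)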

@TomAugspurger
Member

TomAugspurger commented Jul 20, 2018

I do some preprocessing to avoid outliers. I should probably make it less complex.

We have QuantileTransformer. Does that suffice? Hmm, perhaps not, as that only works with dask arrays of known shape, which we may not have at that point...

@glemaitre
Contributor

wonder if any of http://contrib.scikit-learn.org/imbalanced-learn/stable/api.html are easily scalable to larger datasets (cc. @glemaitre if he has any quick thoughts here)

If you have a nearest neighbors algorithm which is distributed, then it should be possible to make a distributed SMOTE. The under-sampling methods also rely on NN, so it would most probably be possible to scale them.

@TomAugspurger
Member

Thanks. We don't have a nearest neighbors algorithm yet, but it's a TODO (also blocking HDBSCAN).


@stsievert regarding pd.get_dummies, you may want to just avoid sparse=True. Turns out it's slow (pandas-dev/pandas#21997), and I'm not entirely sure how much memory it's actually saving... profiling now.

@mrocklin
Member Author

mrocklin commented Jul 20, 2018 via email

@TomAugspurger
Member

For the case study, I recommend we go down all paths simultaneously, to find things that are broken :)

  1. Try with dd.get_dummies(sparse=False) and just pay the memory cost. You'll convert to dense anyway, since we have a mixture of sparse and dense (though you do a bit of filtering before that point, so this will have higher memory usage, I think).
  2. Implement dd.get_dummies(sparse=True) (coming soon) and see where things go
  3. Try OrdinalEncoder on the categorical features (should work today with scikit-learn & dask-ml), though this likely won't perform as well with a linear model (a rough sketch follows this list)
  4. Use scikit-learn dev & the new OneHotEncoder (dask version coming later today)
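
And a minimal hedged sketch of option 3, the ordinal route with dask-ml (again toy data, not code from the thread):

import pandas as pd
import dask.dataframe as dd
from sklearn.pipeline import make_pipeline
from dask_ml.preprocessing import Categorizer, OrdinalEncoder

pdf = pd.DataFrame({'category_0': ['a', 'b', 'a', 'c']})
df = dd.from_pandas(pdf, npartitions=2)

# Categorizer makes the column categorical; OrdinalEncoder then replaces each
# category with its integer code, keeping the output dense and narrow.
coded = make_pipeline(Categorizer(), OrdinalEncoder()).fit_transform(df)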

@mrocklin
Member Author

mrocklin commented Jul 20, 2018 via email

@TomAugspurger
Member

Ah, there may be many categories not observed in the sample. In that case, we'll go down the sparse side of things.

@jakirkham
Member

Regarding nearest neighbors, there is some brute-force stuff in dask-distance, which might be useful. It could probably benefit from tree approaches as well, if people are interested in that problem.

@stsievert
Member

stsievert commented Jul 20, 2018

I do some preprocessing to avoid outliers. I should probably make it less complex.

We have QuantileTransformer. Does that suffice?

I'm using sklearn's QuantileTransformer. We are passing fit_intercept=True to LogisticRegression, so I think I'm okay with it.

a nearest neighbors algorithm which is distributed

There's some work with UMAP (cc @lmcinnes) to integrate Dask with an approximate nearest neighbors algorithm: lmcinnes/umap#62, which talks about modifying https://github.com/lmcinnes/pynndescent.

Did you look at dask_ml....

My goal so far has been to get a working pipeline with small data. Some bugs were in the way and I wanted to make sure I had a working pandas/sklearn implementation first.

@TomAugspurger
Member

Here's an update to @stsievert's notebook that uses a different pre-processing scheme. I haven't played with the actual estimator at all yet (and don't have short-term plans to).

https://gist.github.com/bdb99a9d1e226cc2130016b9d9d1bad4

That uses a few new features (a rough sketch of the scheme follows the list):

  • For high-cardinality categorical columns, we just integer-code the values.
  • For lower-cardinality categorical columns, we one-hot encode them.
  • For numeric columns, we just fill the missing values.
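
This is not the notebook itself, but a hedged sketch of roughly that scheme with scikit-learn 0.20-style pieces; the column lists and the cardinality split are invented for illustration:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

high_card = ['category_0', 'category_3']           # many distinct hashed values
low_card = ['category_1', 'category_2']            # few distinct values
numeric = ['numeric_%d' % i for i in range(13)]

preprocess = ColumnTransformer([
    ('int_code', OrdinalEncoder(), high_card),                            # integer-code
    ('one_hot', OneHotEncoder(handle_unknown='ignore'), low_card),        # one-hot encode
    ('fill', SimpleImputer(strategy='constant', fill_value=0), numeric),  # fill missing
])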

This spawned a few todos:

  1. Implement log_loss / neg_log_loss (may just do that this afternoon)
  2. dask#3090 ("Add DataFrame.to_dask_array") would be useful instead of the values_extractor. This would also help in general with implementing support for dask dataframes in some estimators that only handle dask arrays (e.g. QuantileTransformer).
  3. There's a noticeable delay between calling pipe.fit(X, y) and things showing up on the dashboard. I think we're submitting a very large number of tasks (4,600 calls to getitem alone); we should see what's going on there.
  4. Our pre-processing transformers should gracefully handle missing values like scikit-learn is starting to with 0.20.

Once all the dependent PRs are in, I'll probably try it out on a cluster to see how things look.

@mrocklin
Member Author

Nice notebook!

I'm curious, what does the score mean in this case? Given the imbalance that @stsievert noticed (3% click rate) I wonder if a score of 0.95 is good or not :)

Are there ways that we should be handling things, given that the output is so imbalanced? For example, when we're sampling we might choose a more balanced set. I imagine that there are also fancy weighting things one can do, though I don't have the experience to know what these are.

@TomAugspurger
Member

The benchmark accuracy (guessing all 0s) is 0.97, so 0.95 is impressively bad :)

I'll implement a log-loss score, and then we can start tuning the hyperparameters of the pipeline.

After that we can investigate strategies for the imbalanced dataset (#317)

@TomAugspurger
Member

For posterity, I blogged about the feature preprocessing part of this on https://tomaugspurger.github.io/sklearn-dask-tabular.html. I plan to work on the hyperparameter optimization part next (not sure when). https://gist.github.com/TomAugspurger/4a058f00b32fc049ab5f2860d03fd579 has the pieces.

@mrocklin
Member Author

mrocklin commented Oct 9, 2018

@TomAugspurger I enjoyed playing with your criteo example notebook. Thanks for pushing that up.

I was starting with a different variant of the dataset that I had stored, where the text columns were still stored as text rather than categoricals. Rather than convert globally, I decided to stick with this to see what came out of it. I replaced the custom HashingEncoder and the OneHotEncoder with just a HashingVectorizer. Things ran, although the result predicts no clicks on the training set itself with the default hyperparameters.

@mrocklin
Member Author

OK, here is a gist with my latest attempt: https://gist.github.com/233810e6813e7fd5b5a40f08bde02758

This depends on a bunch of small PRs recently submitted. I've added notes on future work at the bottom of the notebook. I'll reproduce them here.

Future Work

  • We currently only do a single pass over our data with Incremental.fit. We probably need to replace this with a system that does multiple passes until we converge (a rough sketch follows this list).
  • We probably want to choose the class_weights value more carefully, and there are likely some other parameters to this pipeline that could use tuning.
  • Currently dask-ml has mechanisms for hyper-parameter optimization and incremental training like what we're doing here, but they don't work on full pipelines.
  • It's tricky working with both dask arrays and dask dataframes in column transformer pipelines. We have to use dask arrays in order to support scipy.sparse matrices from the HashingVectorizer, but we prefer dealing with column names provided by dask dataframes.
  • Should HashingVectorizer support None values? (scikit-learn/scikit-learn#12347, "HashingVectorizer fails when data has None values")
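
A deliberately crude sketch of what multiple passes could look like (first bullet above); it assumes an Incremental-wrapped classifier inc and dask arrays X, y, X_val, y_val are already in hand:

# Keep making full passes until the validation score stops moving (naive rule).
prev = None
for epoch in range(10):
    inc.partial_fit(X, y, classes=[0, 1])   # one more pass, block by block
    score = inc.score(X_val, y_val)
    if prev is not None and abs(score - prev) < 1e-4:
        break
    prev = score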

@ogrisel

ogrisel commented Oct 11, 2018

An alternative to setting class_weights would be to rebalance the training set: undersample the majority class so that the positive class is represented around 30% of the time, for instance. If we do several passes over the training data, each pass could scan all the positives but a different subset of the negatives each time.

To compute the performance metrics on the held-out test split (e.g. the last month), such as a precision-recall curve, one should use the full test set without any resampling.

Note that scikit-learn does not yet have the API to do resampling in a pipeline, so I would advise doing this resampling loop manually for now.
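
For concreteness, a rough sketch of that manual resampling loop (this is not a scikit-learn or imbalanced-learn API); it assumes df is a dask DataFrame with the click column, features is the list of feature columns, and inc is an incremental classifier as in the earlier sketches:

import dask.dataframe as dd

positives = df[df.click == 1]
negatives = df[df.click == 0]

# Keep all positives and just enough negatives for roughly 30% positives.
frac = min(1.0, (len(positives) * 7 / 3) / len(negatives))

for epoch in range(5):
    subset = negatives.sample(frac=frac, random_state=epoch)  # fresh negatives each pass
    balanced = dd.concat([positives, subset])
    inc.partial_fit(balanced[features].to_dask_array(lengths=True),
                    balanced['click'].to_dask_array(lengths=True),
                    classes=[0, 1])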

@ogrisel

ogrisel commented Oct 11, 2018

You might also want to tweak the value of the regularizer of the SGDClassifier, that is, decrease the default value of alpha.

@Sandy4321

Has anybody tried to run this locally? For example, like
https://github.com/rambler-digital-solutions/criteo-1tb-benchmark
