Case Study: Criteo dataset #295
The dataset is available on Azure (source). It has 13 integer features and 26 categorical features (the categorical features are hashed for anonymization) and is stored as 24 individual files, one for each day (source). I would rather use some SGD implementation than L-BFGS or ADMM because, IIRC, those algorithms use the entire dataset in every optimization step.
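As a rough illustration of why an SGD-style approach fits (a sketch only; the file name, column names, chunk size, and loss choice are assumptions, not a final pipeline): an estimator with `partial_fit` can consume one chunk of a day's file at a time, so no optimization step ever needs the full 1TB in memory.

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

columns = ['click'] + ['numeric_%d' % i for i in range(13)] + \
          ['category_%d' % i for i in range(26)]
numeric = ['numeric_%d' % i for i in range(13)]

clf = SGDClassifier(loss='log_loss')  # 'log' on older scikit-learn versions

# Stream one day's file; each partial_fit step only ever sees one chunk.
for chunk in pd.read_csv('day_0', sep='\t', names=columns, header=None,
                         chunksize=500_000):
    X = chunk[numeric].fillna(0)
    y = chunk['click']
    clf.partial_fit(X, y, classes=[0, 1])
```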
I suspect that for initial use we'll want to play with a single day. I played with this once and it looks like I kept around a conversion script to parquet:

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client()

categories = ['category_%d' % i for i in range(26)]
columns = ['click'] + ['numeric_%d' % i for i in range(13)] + categories

df = dd.read_csv('day_0', sep='\t', names=columns, header=None)

encoding = {c: 'bytes' for c in categories}
fixed = {c: 8 for c in categories}
df.to_parquet('day-0-bytes.parquet', object_encoding=encoding,
              fixed_text=fixed, compression='SNAPPY')
```
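If it helps, reading the converted data back is then straightforward (a sketch; assumes the output path and the `categories` list from the script above):

```python
import dask.dataframe as dd

df = dd.read_parquet('day-0-bytes.parquet')
df = df.categorize(columns=categories)  # make the hashed values proper categoricals
print(df.dtypes)
```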
Although, if memory serves, even a single day was a bit of a pain to work with on a single machine. It'll be fun to eventually scale up when we're ready. Presumably we'll start with whatever seems easiest to get decent results from (as you say, SGD), but we will eventually want to try a few things. I imagine that some of the preprocessing steps will be the same. I also recall that when I tried things I got really poor scores. Someone told me this was because clicks were rare and I wasn't weighting them properly. I suspect that this will force a few topics:
This is also the part of the process (building a pipeline and model) about which I know very little. I'm looking forward to seeing what you end up doing. My guess is that it will be easier to iterate with a small amount of data, probably by just picking off a small bit of the parquet data with pandas and using scikit-learn natively.
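Something like the following might be a starting point for that small-data iteration (a sketch; the parquet path, the numeric-only feature set, and the estimator choice are assumptions):

```python
import dask.dataframe as dd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

numeric = ['numeric_%d' % i for i in range(13)]

# Pick off a single partition as a pandas DataFrame and work with it natively.
sample = dd.read_parquet('day-0-bytes.parquet').get_partition(0).compute()
X, y = sample[numeric].fillna(0), sample['click']

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)
print(pipe.score(X, y))
```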
And they're on Kaggle too: https://www.kaggle.com/c/criteo-display-ad-challenge. This will allow good evaluation of our model with a predefined score function, cross-entropy loss. Top scores on Kaggle are around 0.44 (lower is better).
That's my plan for today. I'll try to develop a preprocessing pipeline and a reasonable model and test it on a subset of the data.
Here’s a gist of a simple pipeline I put together: https://gist.github.com/stsievert/30702575de95328f199ab1d7e50795ef Notes:
Thanks for this. I'm playing around with it locally now. Thinking about scaling this to larger datasets:
I'll look into the sparse dask get_dummies now, and check back later.
We have
If you have a distributed nearest neighbors algorithm, then it should be possible to make a distributed SMOTE. The under-sampling methods also rely on NN, so they could most probably be scaled as well.
Thanks. We don't yet have a nearest neighbors algorithm, but it's a TODO (also blocking HDBSCAN). @stsievert regarding pd.get_dummies, you may want to just avoid sparse=True. Turns out it's slow (pandas-dev/pandas#21997), and I'm not entirely sure how much memory it's actually saving... profiling now.
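A quick way to check what sparse=True is actually buying on a small pandas sample (a sketch; `sample` and the category_* column names are assumptions):

```python
import pandas as pd

categories = ['category_%d' % i for i in range(26)]

dense = pd.get_dummies(sample[categories].astype('category'))
sparse = pd.get_dummies(sample[categories].astype('category'), sparse=True)

print(dense.memory_usage(deep=True).sum())
print(sparse.memory_usage(deep=True).sum())
```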
@TomAugspurger is there another approach that you would suggest? Constructing a scipy.sparse matrix manually? Or are there scikit-learn transformers to help with this?
For the case study, I recommend we go down all paths simultaneously, to find things that are broken :)
1. Try with dd.get_dummies(sparse=False) and just pay the memory cost. You'll convert to dense anyway, since we have a mixture of sparse and dense (though you do a bit of filtering before that point, so this will have higher memory usage I think).
2. Implement dd.get_dummies(sparse=True) (coming soon) and see where things go.
3. Try OrdinalEncoder on the categorical features (should work today with scikit-learn & dask-ml), though this likely won't perform as well with a linear model (rough sketch below).
4. Use scikit-learn dev & the new OneHotEncoder (dask version coming later today).
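For option 3, a minimal sketch on a small pandas sample (`sample` is assumed; the encoder here is scikit-learn's OrdinalEncoder, which needs >= 0.20):

```python
from sklearn.preprocessing import OrdinalEncoder

categories = ['category_%d' % i for i in range(26)]

enc = OrdinalEncoder()
# astype(str) so missing values become the string 'nan' rather than erroring.
codes = enc.fit_transform(sample[categories].astype(str))
print(codes.shape)  # (n_rows, 26), one integer column per categorical feature
```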
If memory serves, there are hundreds of thousands of categories. Dense may be tricky?
Ah, there may be many categories not observed in the sample. In that case, we'll push down the sparse side of things.
Regarding nearest neighbors, there is some brute force stuff in
I'm using sklearn's QuantileTransformer. We are passing
There's some work with UMAP (cc @lmcinnes) to integrate Dask with approximate nearest neighbors algorithms: lmcinnes/umap#62, which talks about modifying https://github.com/lmcinnes/pynndescent.
My goal so far has been to get a working pipeline with small data. Some bugs were in the way and I wanted to make sure I had a working pandas/sklearn implementation first.
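Re: the QuantileTransformer mentioned above, a minimal sketch of applying it to the numeric columns (the pandas `sample` and column names are assumptions, not the exact pipeline):

```python
from sklearn.preprocessing import QuantileTransformer

numeric = ['numeric_%d' % i for i in range(13)]

qt = QuantileTransformer(output_distribution='uniform')
sample[numeric] = qt.fit_transform(sample[numeric].fillna(0))
```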
Here's an update to @stsievert's notebook that uses a different pre-processing scheme. I haven't played with the actual estimator at all yet (and don't have short-term plans to). https://gist.github.com/bdb99a9d1e226cc2130016b9d9d1bad4 That uses a few new features:
For high-cardinality categorical columns, we just integer-code the values. This spawned a few todos:
Once all the dependent PRs are in, I'll probably try it out on a cluster to see how things look.
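A minimal sketch of the integer-coding idea for the high-cardinality columns, using dask's categorical support (the path and column names are assumptions; not necessarily what the notebook does):

```python
import dask.dataframe as dd

categories = ['category_%d' % i for i in range(26)]

df = dd.read_parquet('day-0-bytes.parquet')
df = df.categorize(columns=categories)   # learn the distinct hashed values
for col in categories:
    df[col] = df[col].cat.codes          # replace each value with its integer code
```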
Nice notebook! I'm curious, what does the score mean in this case? Given the imbalance that @stsievert noticed (3% click rate), I wonder if a score of 0.95 is good or not :) Are there ways that we should be handling things given that the output is so imbalanced? For example, when we're sampling we might choose a more balanced set. I imagine that there are also fancy weighting things one can do, though I don't have the experience to know what these are.
The benchmark accuracy (guessing all 0s) is 0.97, so 0.95 is impressively bad :) I'll implement a log-loss score, and then we can start tuning the hyperparameters of the pipeline. After that we can investigate strategies for the imbalanced dataset (#317) |
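For reference, that comparison can be made explicit with scikit-learn's metrics (a sketch; `y_test` and the predicted probabilities `proba` are assumed outputs of the pipeline):

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

baseline = np.zeros_like(y_test)          # the "never clicks" predictor
print(accuracy_score(y_test, baseline))   # ~0.97 with a ~3% click rate
print(log_loss(y_test, proba))            # the Kaggle metric; ~0.44 is a top score
```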
For posterity, I blogged about the feature preprocessing part of this on https://tomaugspurger.github.io/sklearn-dask-tabular.html. I plan to work on the hyperparameter optimization part next (not sure when). https://gist.github.com/TomAugspurger/4a058f00b32fc049ab5f2860d03fd579 has the pieces.
@TomAugspurger I enjoyed playing with your Criteo example notebook. Thanks for pushing that up. I was starting with a different variant of the dataset that I had stored, where the text columns were still stored as text rather than as categoricals. Rather than globally converting, I decided to stick with this to see what came out of it. I replaced the custom
OK, here is a gist with my latest attempt: https://gist.github.com/233810e6813e7fd5b5a40f08bde02758 This depends on a bunch of small PRs recently submitted. I added notes on future work at the bottom of the notebook; I'll reproduce them here. Future work:
An alternative to setting class_weights would be to rebalance the training set by undersampling the majority class such that the positive class is represented around 30% of the time, for instance. If we do several passes over the training data, each pass could scan all the positives but a different subset of the negatives each time. To compute the performance metrics on the held-out test split (e.g. the last month), such as a precision-recall curve, one should use the full test set without any resampling. Note that scikit-learn does not have the API to do resampling in a pipeline yet, so I would advise doing this resampling loop manually for now.
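A sketch of that manual resampling loop (the `train` DataFrame, feature list, number of passes, and estimator are all assumptions):

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

features = ['numeric_%d' % i for i in range(13)]
clf = SGDClassifier(loss='log_loss')   # 'log' on older scikit-learn

pos = train[train['click'] == 1]
neg = train[train['click'] == 0]
n_neg = int(len(pos) * 7 / 3)          # gives roughly 30% positives per pass

for seed in range(5):                  # several passes over all the positives,
    subset = neg.sample(n=n_neg, random_state=seed)           # fresh negatives each time
    batch = pd.concat([pos, subset]).sample(frac=1, random_state=seed)  # shuffle
    clf.partial_fit(batch[features].fillna(0), batch['click'], classes=[0, 1])

# Metrics (e.g. the precision-recall curve) are then computed on the full,
# un-resampled test split.
```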
You might also want to tweak the value of the regularizer of the
Has somebody tried to run this locally?
The Criteo dataset is a 1TB dump of features around advertisements and whether or not someone clicked on the ad. It has both dense and categorical/sparse data. I believe that the data is freely available on Azure.
There are some things that we might want to do with this dataset that are representative of other problems:
As always, it might be a good start to just download a little bit of the Criteo dataset (I think that each day of data is available separately) and work with sklearn directly to establish a baseline.
This came out of conversation with @ogrisel