We use the Amazon product data published by McAuley et al. at SIGIR 2015. You can obtain the data by following the provided instructions.
We make use of the Home_and_Kitchen, Clothing_Shoes_and_Jewelry, Pet_Supplies and Sports_and_Outdoors reviews and metadata files (full, not the 5-core).
Product lists, topics (i.e., textual representations of categories) and ground truth relevance information are provided in this repository. In addition, we provide utilities to convert the descriptions and reviews of the domains to TREC text format.
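As a rough illustration of what TREC text format looks like, the sketch below wraps a single review in the conventional `<DOC>`/`<DOCNO>`/`<TEXT>` markup. It is not the converter shipped with this repository: the function names `to_trec_doc` and `review_to_trec`, the way the document identifier is built, and the reliance on the `asin`, `reviewerID` and `reviewText` fields of the Amazon review records are assumptions made for illustration.

```python
def to_trec_doc(doc_id, text):
    """Wrap a piece of text in minimal TREC text markup."""
    # Collapse runs of whitespace so the text occupies a single line.
    clean = " ".join(text.split())
    return "<DOC>\n<DOCNO>{}</DOCNO>\n<TEXT>\n{}\n</TEXT>\n</DOC>\n".format(
        doc_id, clean)


def review_to_trec(review):
    """Convert one review record (a dict) to a TREC document string."""
    # Assumption: the identifier combines product and reviewer ids;
    # the repository's utilities may name documents differently.
    doc_id = "{}_{}".format(review["asin"], review["reviewerID"])
    return to_trec_doc(doc_id, review.get("reviewText", ""))
```

The sketch does not escape markup characters inside the review text; whether that is needed depends on the indexer that consumes the output.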
Here's an overview of all the files included in this repository used for dataset construction and model evaluation (see below for more information on how to use these files).
| | Home & Kitchen | Clothing, Shoes & Jewelry | Pet Supplies | Sports & Outdoors |
|---|---|---|---|---|
| MD5 hashes | md5 | md5 | md5 | md5 |
| Product lists | product_list | product_list | product_list | product_list |
| Document-product associations | assocs | assocs | assocs | assocs |
| Topics | topics | topics | topics | topics |
| Relevance | qrel_test, qrel_validation | qrel_test, qrel_validation | qrel_test, qrel_validation | qrel_test, qrel_validation |
To replicate the experiments of the paper on learning latent vector spaces for product search, have a look at the product-search.sh script.
First, obtain the data described above. Download the 8 files (metadata + reviews) to your local disk without decompressing them.
Here's an overview:
- meta_Clothing_Shoes_and_Jewelry.json.gz (268M)
- meta_Home_and_Kitchen.json.gz (146M)
- meta_Pet_Supplies.json.gz (40M)
- meta_Sports_and_Outdoors.json.gz (175M)
- reviews_Clothing_Shoes_and_Jewelry.json.gz (847M)
- reviews_Home_and_Kitchen.json.gz (783M)
- reviews_Pet_Supplies.json.gz (234M)
- reviews_Sports_and_Outdoors.json.gz (594M)
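Before running the pipeline, it can help to confirm that all eight archives are present and to compute their MD5 digests. The sketch below is one way to do this in Python. The helper names (`expected_files`, `missing_files`, `md5_of_file`) are hypothetical, and comparing the digests against the MD5 files in this repository is left to the reader, since the format of those files is not described here.

```python
import hashlib
import os

DOMAINS = [
    "Clothing_Shoes_and_Jewelry",
    "Home_and_Kitchen",
    "Pet_Supplies",
    "Sports_and_Outdoors",
]


def expected_files():
    """Names of the 8 archives (metadata + reviews for each domain)."""
    return ["{}_{}.json.gz".format(prefix, domain)
            for prefix in ("meta", "reviews")
            for domain in DOMAINS]


def missing_files(directory):
    """Return the expected archives that are not found in `directory`."""
    return [name for name in expected_files()
            if not os.path.isfile(os.path.join(directory, name))]


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `missing_files` on the download directory should return an empty list once all archives are in place.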
Afterwards, we can construct models from 4-grams using 300-dimensional word representations and 128-dimensional entity representations as follows (the third argument is optional and defaults to cpu):
[cvangysel@ilps SERT] ./product-search.sh \
<path-to-directory-with-gzipped-amazon-data> \
<path-to-nonexisting-temporary-directory> \
[cpu|gpu]
Processing clothing_shoes_and_jewelry.
Creating output directory.
Verifying corpus.
Extracting product descriptions and reviews.
Constructing LSE model on clothing_shoes_and_jewelry collection.
NDCG@100 (validation): 0.1790
NDCG@100 (test): 0.1479
Processing sports_and_outdoors.
Creating output directory.
Verifying corpus.
Extracting product descriptions and reviews.
Constructing LSE model on sports_and_outdoors collection.
NDCG@100 (validation): 0.1707
NDCG@100 (test): 0.1761
Processing home_and_kitchen.
Creating output directory.
Verifying corpus.
Extracting product descriptions and reviews.
Constructing LSE model on home_and_kitchen collection.
NDCG@100 (validation): 0.2151
NDCG@100 (test): 0.2423
Processing pet_supplies.
Creating output directory.
Verifying corpus.
Extracting product descriptions and reviews.
Constructing LSE model on pet_supplies collection.
NDCG@100 (validation): 0.2308
NDCG@100 (test): 0.2578
All done!
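The log above reports NDCG@100 on the validation and test topics of each domain. For readers unfamiliar with the measure, the following sketch shows one common formulation (linear gain with a log2 rank discount). It is an illustration only: the function names are hypothetical, and the exact variant and evaluation tooling used by the script may differ.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains, best rank first."""
    return sum(gain / math.log2(rank + 1)
               for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(ranking, relevance, k=100):
    """NDCG@k for one topic.

    ranking: list of product ids, best first.
    relevance: dict mapping product id to graded relevance (judged items only).
    """
    # Unjudged products are treated as non-relevant (gain 0).
    gains = [relevance.get(product, 0) for product in ranking[:k]]
    # The ideal ranking places the most relevant judged products first.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A per-domain score would then be the mean of `ndcg_at_k` over all topics in the corresponding relevance file.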
If you use SERT to produce results for your scientific publication, please refer to our CIKM 2016 paper on product search and our software overview paper:
@inproceedings{VanGysel2016products,
title={Learning Latent Vector Spaces for Product Search},
author={Van Gysel, Christophe and de Rijke, Maarten and Kanoulas, Evangelos},
booktitle={CIKM},
volume={2016},
pages={165--174},
year={2016},
organization={ACM}
}
@inproceedings{VanGysel2017sert,
title={Semantic Entity Retrieval Toolkit},
author={Van Gysel, Christophe and de Rijke, Maarten and Kanoulas, Evangelos},
booktitle={SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17)},
year={2017},
}