Datasets

This folder contains information and resources for the user interaction datasets used in our experiments with Transformer architectures for sequential and session-based recommendation.

REES46 eCommerce dataset

This is a large dataset covering 7 months of activity (from October 2019 to April 2020) in a large multi-category online store. It contains more than 411 million interactions, 89 million sessions, 15 million users and 386 thousand items. You can find more stats on this dataset in this EDA notebook.
The raw dataset and more info can be found on Kaggle Datasets.

Pre-processing

As this is a large dataset, its pre-processing was implemented mostly using PySpark. In general, the user interactions are split into sessions, and sessions are saved in parquet files. The parquet files are split by day, to allow incremental training and evaluation.

The notebooks for pre-processing can be found in the ecommerce_rees46/preprocessing/pyspark/ folder and must be executed in the order of their numeric prefixes (01, 02, and 03).
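
To make the idea of splitting interactions into sessions and grouping sessions by day more concrete, here is a minimal sketch in plain Python. It is not the code from the PySpark notebooks: the 30-minute inactivity threshold, the tuple layout and the function names are assumptions chosen for illustration, and the notebooks may apply a different sessionization rule.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumption: a session ends after 30 minutes of inactivity (hypothetical threshold).
SESSION_GAP = timedelta(minutes=30)


def _make_session(user_id, events):
    """Build one session record from a time-ordered list of (timestamp, item_id)."""
    return {
        "user_id": user_id,
        "start": events[0][0],
        "item_ids": [item_id for _, item_id in events],
    }


def split_into_sessions(interactions):
    """Group (user_id, item_id, timestamp) tuples into per-user sessions."""
    by_user = defaultdict(list)
    for user_id, item_id, ts in interactions:
        by_user[user_id].append((ts, item_id))

    sessions = []
    for user_id, events in by_user.items():
        events.sort(key=lambda e: e[0])  # order each user's events by time
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            # A long gap between consecutive events closes the current session.
            if event[0] - prev[0] > SESSION_GAP:
                sessions.append(_make_session(user_id, current))
                current = []
            current.append(event)
        sessions.append(_make_session(user_id, current))
    return sessions


def partition_by_day(sessions):
    """Map each ISO date to the sessions that started on that day."""
    parts = defaultdict(list)
    for session in sessions:
        parts[session["start"].date().isoformat()].append(session)
    return dict(parts)


interactions = [
    ("u1", "i10", datetime(2019, 10, 1, 9, 0)),
    ("u1", "i11", datetime(2019, 10, 1, 9, 5)),
    ("u1", "i12", datetime(2019, 10, 2, 14, 0)),
    ("u2", "i10", datetime(2019, 10, 1, 23, 50)),
]
daily = partition_by_day(split_into_sessions(interactions))
```

In this sketch each key of `daily` stands in for one daily parquet file, so a training job can consume the days one at a time in chronological order.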

G1 news dataset

This is a dataset of user interaction logs (page views) from G1, the most popular news portal in Brazil, provided by Globo.com.

The dataset contains a sample of user interactions (page views) in the news portal from Oct. 1 to 16, 2017, including about 3 million clicks, distributed in more than 1 million sessions from 314,000 users who read more than 46,000 different news articles during that period.

The raw datasets and more info can be found on Kaggle Datasets.

Pre-processing

As this dataset is not very large, its preprocessing was implemented using Pandas. In general, the user interactions are split into sessions, and sessions are saved in parquet files. The parquet files are split by hour, to allow incremental training and evaluation.

The preprocessing notebook can be found at news_g1/preprocessing/G1_news_preprocess.ipynb.
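
The reason for splitting the parquet files by time window is easier to see with a small sketch of an incremental loop. The example below assumes one common protocol, training on one window and evaluating on the next; that protocol, the function name and the file names are illustrative assumptions, not details taken from the notebook.

```python
def incremental_splits(partitions):
    """Yield (train_partition, eval_partition) pairs in chronological order.

    Assumes partition names sort chronologically, e.g. zero-padded hour indexes.
    """
    ordered = sorted(partitions)
    for train_part, eval_part in zip(ordered, ordered[1:]):
        yield train_part, eval_part


# Hypothetical file names, one parquet file per hour.
hourly_files = ["hour_0003.parquet", "hour_0001.parquet", "hour_0002.parquet"]

for train_part, eval_part in incremental_splits(hourly_files):
    # A real pipeline would fit the model on train_part and score it on eval_part.
    print(f"train on {train_part}, evaluate on {eval_part}")
```

Because every evaluation window lies strictly after its training window, the model is never scored on interactions that precede what it was trained on.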

Features config files

Transformers4Rec uses a features config file (YAML) to determine which features are available to the model. The only required feature is the one that contains the sequence of item ids. More features can be provided, such as item metadata/content features and user contextual features, which generally improves model accuracy. You can find examples of features config files for the REES46 eCommerce dataset and G1 news datasets.
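
The distinction between the one required feature and the optional ones can be sketched as a small check in Python. The dictionary below is a hypothetical stand-in for a parsed config: the keys `role` and `type` and all feature names are invented for illustration and are not the actual schema of the Transformers4Rec YAML files, so refer to the example config files for the real format.

```python
# Hypothetical feature descriptions; key names and values are illustrative only.
features = {
    "item_id_seq": {"role": "item_id", "type": "categorical"},
    "category_seq": {"role": "item_metadata", "type": "categorical"},
    "hour_of_day_seq": {"role": "user_context", "type": "continuous"},
}


def item_id_feature(features):
    """Return the name of the single feature holding the item id sequence."""
    names = [name for name, spec in features.items() if spec.get("role") == "item_id"]
    if len(names) != 1:
        raise ValueError(f"expected exactly one item id feature, found {len(names)}")
    return names[0]


required = item_id_feature(features)
optional = sorted(name for name in features if name != required)
```

Here `required` names the item id sequence, and `optional` lists the metadata and contextual features that a model may use in addition to it.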