SASNet is a novel approach to recommendation systems that leverages Large Language Models (LLMs) and a Siamese Transformer-based architecture. This project aims to address common challenges in recommendation systems, including overspecialization, the cold-start problem, and the need for in-depth item knowledge.
The project is organized into the following main directories:
- data/: Contains the dataset and embeddings
- src/: Source code for the models and data processing
- train/: Scripts for training the models
- results/: Stores evaluation results, plots, and graphs
- inference/: Scripts for running inference on trained models
- Utilizes a quantized LLM (Phi-3 3.8B, Q4 GPTQ) for rich embeddings
- Custom Siamese AttentionSetNet (SASNet) architecture
- Dynamic embedding generation for users and activities
The embeddings dataset can be downloaded from: [PLACEHOLDER_LINK]
The dataset is in parquet format and contains pre-computed embeddings for the Yelp reviews.
- Hulk3: SVM classifier using embeddings from Phi-3
- huLLK: SVM classifier using embeddings from sentence-transformer miniLM
- SiameseAttentionSetNet: Custom transformer architecture using embeddings from Phi-3
- SiameseAttentionSetNet + miniLM: Custom transformer architecture using embeddings from sentence-transformer miniLM (planned)
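The SVM baselines (Hulk3, huLLK) can be sketched with scikit-learn. The random features below stand in for Phi-3 or miniLM embeddings, and the kernel and scaling choices are assumptions rather than the project's tuned configuration.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Random vectors stand in for precomputed LLM embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))      # embedding vectors (toy dimension)
y = rng.integers(0, 3, size=200)    # 3 sentiment classes

# Standardize features, then fit an RBF-kernel SVM on the embeddings.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, y)
print(clf.score(X, y))
```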
pip install sasnet
To train the models, use the Jupyter notebooks in the train/ directory. You can modify the following parameters at the start of the notebook:
SAMPLE_SIZE = 5 # Recommended range: 3-10
BATCH_SIZE = 512 # Recommended range: 128-1024
LLM_dim = 3072 # Fixed for Phi-3
hidden_dim = 256 # Recommended range: 128-512
num_heads = 4 # Recommended range: 2-8
ffn_dim = 1024 # Recommended range: 512-2048
dropout_rate = 0.2 # Recommended range: 0.1-0.5
n_classes = 3 # Fixed for this classification task
# Training parameters
EPOCHS = 3000 # Recommended range: 1000-5000
PATIENCE = 150 # Recommended range: 50-200
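To show how these hyperparameters fit together, here is a hedged sketch of a Siamese attention-over-sets model in PyTorch. The actual SiameseAttentionSetNet lives in src/; the class names, layer choices, and pooling below are illustrative assumptions, not the project's real implementation.

```python
import torch
import torch.nn as nn

# Hyperparameters from the notebook defaults above.
SAMPLE_SIZE, LLM_dim, hidden_dim = 5, 3072, 256
num_heads, ffn_dim, dropout_rate, n_classes = 4, 1024, 0.2, 3

class AttentionSetBranch(nn.Module):
    """Encodes a set of LLM embeddings into a single vector (assumed design)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(LLM_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                          dropout=dropout_rate,
                                          batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_dim, ffn_dim), nn.ReLU(),
            nn.Dropout(dropout_rate), nn.Linear(ffn_dim, hidden_dim))

    def forward(self, x):                 # x: (batch, SAMPLE_SIZE, LLM_dim)
        h = self.proj(x)
        a, _ = self.attn(h, h, h)         # self-attention over the set
        return self.ffn(a).mean(dim=1)    # mean-pool the set to one vector

class SiameseSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.branch = AttentionSetBranch()  # shared weights (Siamese)
        self.classifier = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, user_set, item_set):
        u = self.branch(user_set)
        v = self.branch(item_set)
        return self.classifier(torch.cat([u, v], dim=-1))

model = SiameseSketch()
logits = model(torch.randn(2, SAMPLE_SIZE, LLM_dim),
               torch.randn(2, SAMPLE_SIZE, LLM_dim))
print(logits.shape)  # torch.Size([2, 3])
```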
We use custom metrics to evaluate both the embeddings and the network performance:
We assess the quality of embeddings using two key metrics:
- Sentiment Analysis: Measures the cosine similarity between the embeddings of the words "bad" and "excellent".
- Category Understanding: Compares embeddings of different business categories.
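The sentiment check reduces to a cosine similarity between two embedding vectors. A minimal sketch (the toy vectors stand in for the LLM embeddings of "bad" and "excellent"; an embedder that separates the two sentiments well should score them far apart):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors over the
    # product of their norms, in [-1, 1].
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for the "bad" and "excellent" embeddings.
bad = np.array([1.0, 0.2, -0.5])
excellent = np.array([-0.8, 0.1, 0.9])
print(round(cosine_similarity(bad, excellent), 3))
```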
The overall embedding score is calculated as:
s = (-0.25 * NMSE_c) + (0.25 * NPR_c) - (0.5 * NAE_s)
Where:
- NMSE_c: Normalized Mean Squared Error for category understanding
- NPR_c: Normalized Pearson's Correlation for category understanding
- NAE_s: Normalized Absolute Error for sentiment analysis
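The score above is a direct weighted combination of the three normalized components; the input values in the example are made up for illustration:

```python
def embedding_score(nmse_c, npr_c, nae_s):
    """Overall embedding score s: lower category error (NMSE_c) and
    sentiment error (NAE_s) help, higher category correlation (NPR_c) helps."""
    return -0.25 * nmse_c + 0.25 * npr_c - 0.5 * nae_s

# Illustrative normalized values in [0, 1]:
print(round(embedding_score(nmse_c=0.2, npr_c=0.8, nae_s=0.1), 3))  # 0.1
```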
The network is evaluated using standard classification metrics:
- Accuracy
- F1-score
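Both metrics are available in scikit-learn; the labels below are toy values for the 3-class task, and macro averaging for F1 is an assumption (the notebooks may use a different average):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy ground truth and predictions for the 3-class task.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

print(accuracy_score(y_true, y_pred))                  # fraction correct
print(f1_score(y_true, y_pred, average="macro"))       # unweighted mean of per-class F1
```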
- LLM embeddings captured greater semantic similarity than state-of-the-art sentence embedders.
- SVM with LLM embeddings outperformed the TF-IDF+SVM baseline by 12.25%.
- SASNet performed 5% worse than the TF-IDF+SVM baseline on the current dataset.
Training progress and network architecture visualizations can be found in the results/plots/ directory. These are generated during training in the SiameseAttentionSetNet notebook.
We welcome contributions to the SASNet project. If you'd like to contribute, please:
- Fork the repository
- Create a new branch for your feature
- Implement your changes
- Submit a pull request
For major changes or new features, please open an issue first to discuss the proposed changes.
sasnet is distributed under the terms of the MIT license.
[Add contact information for the project maintainers]
We would like to thank the Yelp Open Dataset for providing the data used in this research.
If you use this work in your research, please cite:
Amer, A., Çabuk, B., Chinello, F., & Rotov, D. (2024). SASNet: LLM Embeddings and Siamese Transformer Network for Recommendation Systems.