From dc6d635238c27e3844b77f5c3291013d8441659e Mon Sep 17 00:00:00 2001
From: rakri <78582691+rakri@users.noreply.github.com>
Date: Tue, 17 Sep 2024 10:15:32 +0530
Subject: [PATCH] First commit for amazon products shopping dataset

---
 dataset_preparation/amazon_products/readme.md | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 dataset_preparation/amazon_products/readme.md

diff --git a/dataset_preparation/amazon_products/readme.md b/dataset_preparation/amazon_products/readme.md
new file mode 100644
index 000000000..8e9a6e5a9
--- /dev/null
+++ b/dataset_preparation/amazon_products/readme.md
@@ -0,0 +1,5 @@
+This dataset contains around 2M vectors for Amazon products.
+The embeddings are generated using the Cohere embed-english-light-v3.0 model (https://huggingface.co/Cohere/Cohere-embed-english-light-v3.0).
+The base text used for generating embeddings is the title + description of each product.
+The queries are derived from products sampled at random from the base set: for each sampled product, we prompt GPT-3.5 to output a simple query phrase for which the product is a suitable result, then embed that phrase using the Cohere model.
+We also choose brands from the appropriate category of the query and provide them as OR filters. The price of the sampled item is used as an indicative value for a PRICE range filter.