From dc6d635238c27e3844b77f5c3291013d8441659e Mon Sep 17 00:00:00 2001
From: rakri <78582691+rakri@users.noreply.github.com>
Date: Tue, 17 Sep 2024 10:15:32 +0530
Subject: [PATCH] First commit for amazon products shopping dataset

---
 dataset_preparation/amazon_products/readme.md | 5 +++++
 1 file changed, 5 insertions(+)
 create mode 100644 dataset_preparation/amazon_products/readme.md

diff --git a/dataset_preparation/amazon_products/readme.md b/dataset_preparation/amazon_products/readme.md
new file mode 100644
index 000000000..8e9a6e5a9
--- /dev/null
+++ b/dataset_preparation/amazon_products/readme.md
@@ -0,0 +1,5 @@
+This dataset contains around 2M vectors for Amazon products.
+The embeddings are generated using the Cohere embed-english-light model (https://huggingface.co/Cohere/Cohere-embed-english-light-v3.0).
+The base text used for generating embeddings is the title + description of each product.
+The queries are modifications of randomly sampled products from the base: after sampling, we prompt GPT-3.5 to output a simple query phrase for which the product is a suitable result, and embed that phrase using the same Cohere model.
+We also choose brands from the appropriate category of the query and provide them as OR filters. The price of the sampled item is used to set an indicative PRICE range filter.
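
The readme describes the query filters only in words, so a minimal sketch of how they might be evaluated is given here. The patch does not specify a metadata schema, how the price range is derived from the sampled item's price, or how the brand and price filters combine. The names `Product`, `QueryFilter` and `price_range_around`, the symmetric relative width, and the AND combination are therefore all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    # Hypothetical metadata fields; the patch does not define a schema.
    brand: str
    price: float


@dataclass(frozen=True)
class QueryFilter:
    brands: frozenset[str]  # OR semantics: matching any one brand is enough
    price_min: float
    price_max: float

    def matches(self, product: Product) -> bool:
        # Brand filter: the product's brand must be one of the listed brands.
        brand_ok = product.brand in self.brands
        # Price filter: inclusive range check.
        price_ok = self.price_min <= product.price <= self.price_max
        # Assumption: the two filters are combined with AND.
        return brand_ok and price_ok


def price_range_around(price: float, rel_width: float = 0.25) -> tuple[float, float]:
    """Assumed rule: a symmetric band around the sampled item's price."""
    return price * (1 - rel_width), price * (1 + rel_width)


if __name__ == "__main__":
    low, high = price_range_around(40.0)
    query_filter = QueryFilter(frozenset({"Acme", "Globex"}), low, high)
    for item in (Product("Acme", 45.0), Product("Initech", 45.0), Product("Globex", 55.0)):
        print(item, query_filter.matches(item))
```

In a filtered nearest-neighbor search, a predicate like `matches` would restrict the candidate set before or during the vector search. Which of those the benchmark intends is not stated in the patch.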