The dataset module handles the generation of the synthetic Wikipedia article-based dataset as well as the generation of queries for that dataset.
You can use the multirag-cli datagen
command to generate the synthetic Wikipedia-based dataset, which comes with the following command line interface:
usage: multirag-cli datagen [-h] [-c [CONFIG]] [-m [MIN_LENGTH]] [-n [NUM_CATEGORIES]] [-o [OUTPUT]] [-s [SAMPLES]]
Synthetic Dataset Generation
optional arguments:
-h, --help show this help message and exit
-c [CONFIG], --config [CONFIG]
Path to the config file. (default: multirag/dataset/categories.json)
-m [MIN_LENGTH], --min-length [MIN_LENGTH]
Minimum number of characters in each document. (default: 800)
-n [NUM_CATEGORIES], --num-categories [NUM_CATEGORIES]
Number of categories. (default: 25)
-o [OUTPUT], --output [OUTPUT]
Path to the output file. (default: articles.json)
-s [SAMPLES], --samples [SAMPLES]
Number of documents per category. (default: 50)
The synthetic dataset consists of the summaries of Wikipedia articles, that are grouped into categories.
The number of categories (default: 25) and documents per category (default: 50) can be specified by the user.
The number of categories specified as the command line argument should not exceed the number of categories defined in the configuration file.
We provide an example configuration file for the categories, that we used during the evaluation of MRAG, in multirag/dataset/categories.json
.
We describe the format of the configuration file in the section Configuration File Format.
Additionally the user can set a minimum of characters that each document contains (default: 800) as well as the path to the output JSON file, where the synthetic dataset is stored.
You can find the synthetic Wikipedia article dataset, that we used during the evaluation of MRAG, in the datasets directory. Generating a synthetic dataset with the standard paramaters takes roughly 10 minutes.
[
{
"label": "countries",
"starting_points": [
"Member states of the United Nations"
],
"explicit_category": "Member states of the United Nations"
},
]
label
is the name of the category. starting_points
references the Wikipedia articles that serve as a staring point for the article discovery.
The explicit_category
attribute helps with the scraping of Wikipedia articles. Every articles that contains explicit_category
as a substring will be accepted as an article of the category.
[
{
"title": "...",
"text": "...",
"label": "..."
},
]
Each document in the output file of the synthetic dataset has three attributes: title
, text
and label
.
The title
attribute reflects the title of the respective Wikipedia article.
text
contains the summary, the first section, of the Wikipedia articles and serves as the document during the following stages of the pipeline.
The value of that attribute contains at least the minimum number of characters, that were specified during the synthetic dataset generation (default: 800).
label
specifies the category to which that document
belongs to.
The text
attribute is used during the subsequent embedding stage, while all three attributes are used during the query generation for the synthetic dataset.
You can use the multirag-cli querygen
command to generate synthetic queries for the synthetic Wikipedia-based dataset, which comes with the following command line interface:
usage: multirag-cli querygen [-h] [-a ASPECTS [ASPECTS ...]] [-d [DATASET_PATH]] [-m] [-n [NUM_ATTEMPTS]] [-f [NUM_FUSION_QUERIES]] [-o [OUTPUT]] [-q [NUM_QUERIES]]
Synthetic Query Generation
optional arguments:
-h, --help show this help message and exit
-a ASPECTS [ASPECTS ...], --aspects ASPECTS [ASPECTS ...]
List of aspect numbers to incorporate into the queries. (default: [1, 2, 3, 4, 5, 6, 10, 15, 20, 25])
-d [DATASET_PATH], --dataset-path [DATASET_PATH]
Path to the dataset. (default: articles.json)
-m, --manual-review Enable manual review of failed queries. (default: False)
-n [NUM_ATTEMPTS], --num-attempts [NUM_ATTEMPTS]
Number of attempts to retry the generation of a query that previously failed. (default: 2)
-f [NUM_FUSION_QUERIES], --num-fusion-queries [NUM_FUSION_QUERIES]
Number of fusion queries to generate per standard query. (default: 4)
-o [OUTPUT], --output [OUTPUT]
Path to the output file. (default: queries.json)
-q [NUM_QUERIES], --num-queries [NUM_QUERIES]
Number of queries to generate. (default: 25)
The synthetic query generation is based on stories, which cover the various aspects of a query.
The number of queries (default: 25) as well as the number of aspects per query can be specified by the user.
The number of aspects is a list, where for each specific number of aspects that number of queries is generated.
So the total number of generated queries is --dataset-path
parameter (default: articles.json
).
The format for input file can be found in the output format section of the synthetic dataset generation.
It is possible to manually review the repeated query generation in case the first attempt fails.
The number of attempts to generate a specific query can be specified with the --num-attempts
parameter (default: 2).
The user can also control the number of prompts for the fusion queries (default 4). If the user specifies zero prompts, no fusion queries are generated.
Additionally the user can state the path to the output JSON file, where the synthetic queries are stored.
The query generator uses the OpenAI API for the query generation, which will cost you.
Please export your OpenAI API key using export OPENAI_API_KEY=<api key>
before using the synthetic query generator.
You can find the synthetic queries that we used with synthetic Wikipedia article dataset for the evaluation fo MRAG in the datasets directory. Generating synthetic queries with the standard paramaters takes roughly 30 minutes.
Fusion queries are based on RAG Fusion and are an optional mechanism that is used to further enhance the benefits of MRAG at the cost of additional tokens. A fusion query uses an LLM to create a fixed number of questions about the standard query. Each question is separately applied through an embedding model.
[
{
"topics": ["...", "..." ...],
"text: "...",
"fusion": ["...", "...", ...]
}
]
Each query has at least two attributes: topics
and text
.
The text
attribute contains the actual query text.
topics
contains the article titles that were used during the generation of the query.
topics
is used during the evaluation to check, whether all necessary articles were fetched.
The attribute fusion
is used only for fusion queries and contains the prompts, which are used to query an LLM to improve the results.
The prompts were formatted for better readability. The following example shows the prompt for a query generation with two aspects.
{
"role": "system",
"content":
"You are a very curious scholar asking about possible connections between different topics."
},
{
"role": "user",
"content":
"Please create a story about the attached 2 articles on the topics Impostor syndrome, Alhambra
(board game). It is very important that each of the attached articles is relevant to the story,
in a way that references the content of the article, not just its title. But please also
mention each title at least once. Please make sure that all of the attached articles are
relevant to your story, and that each article is referenced in at least two sentences! They do
not necessarily have to be referenced in the same order, but make sure no article is forgotten.
Important: Output only the story, no additional text. And do not use bullet points, or
paragraphs.
Articles:
---------
Impostor syndrome:
Impostor syndrome, also known as impostor phenomenon or impostorism, is a psychological
occurrence. Those who have it may doubt their skills, talents, or accomplishments. They may
have a persistent internalized fear of being exposed as frauds. Despite external evidence of
their competence, those experiencing this phenomenon do not believe they deserve their success
or luck. They may think that they are deceiving others because they feel as if they are not as
intelligent as they outwardly portray themselves to be.
Impostor syndrome can stem from and result in strained personal relationships and can hinder
people from achieving their full potential in their fields of interest. The term
\"impostorization\" shifts the source of the phenomenon away from the supposed impostor to
institutions whose policies, practices, or workplace cultures \"either make or intend to make
individuals question their intelligence, competence, and sense of belonging.\"
---------
Alhambra (board game):
Alhambra (German: Der Palast von Alhambra, literally \"The Palace of Alhambra\") is a 2003
tile-based German-style board game designed by Dirk Henn. It was originally published in
Germany by Queen Games in a language-interdependent version; an English-specific version was
released in North America by the now-defunct \u00dcberplay. The game is a Muslim-themed
update, set during the construction of the Alhambra palace in 14th century Granada, of the
1998 stock trading board game Stimmt So!, which in turn was an update of the 1992 mafia
influence board game Al Capone; the original version was subsequently released as Alhambra:
The Card Game. Upon its release, Alhambra won numerous awards, including the Spiel des Jahres
award. Its success has led to the release of numerous expansion packs and spin-off games, and
is becoming Queen Games' flagship franchise.
---------
Again, make sure that you reference all the following topics in your story: Impostor syndrome,
Alhambra (board game)"
}
The following example shows an example prompt for the generation of fusion questions for the same query that was generated with the prompt above.
{
"role": "system",
"content":
"You are a helpful assistant that generates multiple search queries based on a single input
query."
},
{
"role": "user",
"content":
"Generate multiple search queries related to: Alice sat at the bustling board game caf\u00e9,
her hands trembling as she arranged the tiles in her Alhambra game. Despite her keen strategic
mind, doubt gnawed at her heart. The pervasive feeling of impostor syndrome crept in,
whispering that her victories were mere luck, not skill. She felt like an outsider in the
world of competitive board gaming, always fearing being exposed as a fraud.
As she pondered her next move, a fellow player, Sarah, noticed Alice's hesitance. Sensing her
unease, Sarah offered a kind smile and words of encouragement. \"Don't underestimate yourself,
Alice. Your talent shines through in every game we play together,\" Sarah reassured her. Those
words struck a chord deep within Alice, a glimmer of belief piercing through the veil of
self-doubt.
Emboldened by Sarah's support, Alice made bold decisions in the game, surprising even herself
with her strategic prowess. With each successful move, the shadows of impostor syndrome faded,
replaced by a newfound confidence. As the game drew to a close, Alice emerged victorious, her
skills undeniable to all who witnessed her triumph.
The Alhambra board game became a symbolic battleground where Alice confronted her inner
impostor, emerging stronger and more self-assured than ever before. In that moment of victory,
she realized that true success is not about luck but about belief in one's abilities, a lesson
that transcended the boundaries of the game and resonated in her journey towards overcoming
impostor syndrome."
},
{
"role": "user",
"content": "OUTPUT (4 queries):"
}
The queries were formatted for better readability. The following example query was generated using the prompt in section Example Prompts.
Alice sat at the bustling board game caf\u00e9, her hands trembling as she arranged the tiles in her
Alhambra game. Despite her keen strategic mind, doubt gnawed at her heart. The pervasive feeling of
impostor syndrome crept in, whispering that her victories were mere luck, not skill. She felt like
an outsider in the world of competitive board gaming, always fearing being exposed as a fraud.
As she pondered her next move, a fellow player, Sarah, noticed Alice's hesitance. Sensing her unease,
Sarah offered a kind smile and words of encouragement. \"Don't underestimate yourself, Alice. Your
talent shines through in every game we play together,\" Sarah reassured her. Those words struck a
chord deep within Alice, a glimmer of belief piercing through the veil of self-doubt.
Emboldened by Sarah's support, Alice made bold decisions in the game, surprising even herself with
her strategic prowess. With each successful move, the shadows of impostor syndrome faded, replaced by
a newfound confidence. As the game drew to a close, Alice emerged victorious, her skills undeniable
to all who witnessed her triumph. The Alhambra board game became a symbolic battleground where Alice
confronted her inner impostor, emerging stronger and more self-assured than ever before. In that
moment of victory, she realized that true success is not about luck but about belief in one's
abilities, a lesson that transcended the boundaries of the game and resonated in her journey towards
overcoming impostor syndrome.
The following example questions are intended for the fusion approach of the query above.
1. Strategies for overcoming impostor syndrome in competitive environments
2. Impact of supportive mentors in addressing feelings of self-doubt and impostor syndrome
3. Psychological effects of impostor syndrome on performance in board games
4. Alhambra board game tips for building confidence and strategic thinking