diff --git a/supporting-blog-content/evaluating-search-relevance-part-2/phi3-as-relevance-judge.ipynb b/supporting-blog-content/evaluating-search-relevance-part-2/phi3-as-relevance-judge.ipynb
index ce412781..1c227d23 100644
--- a/supporting-blog-content/evaluating-search-relevance-part-2/phi3-as-relevance-judge.ipynb
+++ b/supporting-blog-content/evaluating-search-relevance-part-2/phi3-as-relevance-judge.ipynb
@@ -72,7 +72,7 @@
    "id": "cfda1967-8feb-400e-b125-dc8e2c349467",
    "metadata": {},
    "source": [
-    "Let's gradually build our code structure. First, the necessary imports:"
+    "First, the necessary imports:"
    ]
   },
   {
@@ -106,7 +106,7 @@
    "source": [
     "Now, let's create a class that will responsible for loading the `Phi-3` model and perform inference on its inputs. A few notes before we dive into the code:\n",
     "* Even though Phi-3 is a small language model (SLM) with a parameter count of 3.8B we load it with 4-bit quantization that makes it a good choice even for consumer-grade GPUs\n",
-    "* Following the example code provided in the corresponding HF page [here](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) we are also using text generation pipelines to perform inference. More optimized setups are possible but out of scope for this document\n",
+    "* Following the example code provided in the corresponding HF page [here](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) we are also using text generation pipelines to perform inference. More optimized setups are possible but out of scope for this notebook\n",
     "* Regular expressions are used to extract the answer from the LLM output. The `response_types` argument defines the set of acceptable classes (e.g. `Relevant`, `Not Relevant`)\n",
     "* There are two options for decoding:\n",
     "  * `greedy decoding`, where sampling is disabled and the outputs are (more or less) deterministic\n",
@@ -274,26 +274,6 @@
     "    'You should provide your answer in the form of a boolean value: \"Relevant\" or \"Not Relevant\"\\n'\n",
     ")\n",
     "\n",
-    "\n",
-    "QA_PAIRWISE_RELEVANCE_PROMPT_TEMPLATE = (\n",
-    "    \"You are an expert in information retrieval and your task is to estimate the relevance of a retrieved document to a query.\\n\"\n",
-    "    \"More specifically, you will be provided with three pieces of information:\\n\"\n",
-    "    \"- Query, which is the question we want to answer\\n\"\n",
-    "    \"- Positive Document, which is a document that contains the correct answer to the query\\n\"\n",
-    "    \"- Retrieved Document, which is the document we want to evaluate\\n\"\n",
-    "    'Your task is to predict \"Relevant\" if the Retrieved Document contains the information required to provide an answer to the Query, otherwise you should print \"Not Relevant\" \\n'\n",
-    "    \"You can take advantage of the information in the Positive Document to identify the correct answer to the Query and then verify that the Retrieved Document contains that piece of information\\n\"\n",
-    "    \"#####\\n\"\n",
-    "    \"Here are your inputs:\\n\"\n",
-    "    \"Query: {query_text}\\n\"\n",
-    "    \"Positive Document: {positive_text}\\n\"\n",
-    "    \"Retrieved Document: {retrieved_text}\\n\"\n",
-    "    \"#####\\n\\n\"\n",
-    "    \"Take a step back and reflect carefully on how best to solve your task\\n\"\n",
-    "    'You should provide your answer in the form of a boolean value: \"Relevant\" or \"Not Relevant\"\\n'\n",
-    "    \"Good luck!\"\n",
-    ")\n",
-    "\n",
     "CHAIN_OF_THOUGHT_PROMPT_TEMPLATE = (\n",
     "    \"You are an expert in information retrieval and your task is to decide if a retrieved \"\n",
     "    \"document is relevant to a query or not. You will be provided with two pieces of information:\\n\"\n",
@@ -338,6 +318,26 @@
     "    \"Here are the QUERY and DOCUMENT for you to evaluate:\\n\"\n",
     "    \"QUERY: {query_text}\\n\"\n",
     "    \"DOCUMENT: {retrieved_text}\\n\"\n",
+    ")\n",
+    "\n",
+    "\n",
+    "QA_PAIRWISE_RELEVANCE_PROMPT_TEMPLATE = (\n",
+    "    \"You are an expert in information retrieval and your task is to estimate the relevance of a retrieved document to a query.\\n\"\n",
+    "    \"More specifically, you will be provided with three pieces of information:\\n\"\n",
+    "    \"- Query, which is the question we want to answer\\n\"\n",
+    "    \"- Positive Document, which is a document that contains the correct answer to the query\\n\"\n",
+    "    \"- Retrieved Document, which is the document we want to evaluate\\n\"\n",
+    "    'Your task is to predict \"Relevant\" if the Retrieved Document contains the information required to provide an answer to the Query, otherwise you should print \"Not Relevant\" \\n'\n",
+    "    \"You can take advantage of the information in the Positive Document to identify the correct answer to the Query and then verify that the Retrieved Document contains that piece of information\\n\"\n",
+    "    \"#####\\n\"\n",
+    "    \"Here are your inputs:\\n\"\n",
+    "    \"Query: {query_text}\\n\"\n",
+    "    \"Positive Document: {positive_text}\\n\"\n",
+    "    \"Retrieved Document: {retrieved_text}\\n\"\n",
+    "    \"#####\\n\\n\"\n",
+    "    \"Take a step back and reflect carefully on how best to solve your task\\n\"\n",
+    "    'You should provide your answer in the form of a boolean value: \"Relevant\" or \"Not Relevant\"\\n'\n",
+    "    \"Good luck!\"\n",
     ")"
    ]
   },
@@ -346,7 +346,12 @@
    "id": "b9f52ce9-198f-4219-b1e0-2d20ac13218f",
    "metadata": {},
    "source": [
-    "We also define a helper structure that allows us to store the requirements per case"
+    "We also define a helper structure containing:\n",
+    "* `prompt_inputs`, specifies the list of attributes that need to be set in the prompt template. These attributes have the same name in the training data\n",
+    "* `prompt_template`, the prompt template to use\n",
+    "* `response_types`, the names of the expected output classes.\n",
+    "* `metadata`, the extra attributes that need to be preserved\n",
+    "* `max_output_tokens`, the maximum number of tokens that the LLM outputs\n"
    ]
   },
   {
@@ -378,29 +383,9 @@
     "    \"metadata\": [\"qid\", \"retrieved_doc_id\", \"human_judgment\"],\n",
     "    \"max_output_tokens\": 250,\n",
     "    },\n",
-    "}"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "9765e537-2d7c-4360-9166-685193297d48",
-   "metadata": {},
-   "source": [
-    "More specifically:\n",
-    "* `prompt_inputs`, specifies the list of attributes that need to be set in the prompt template. These attributes have the same name in the training data\n",
-    "* `prompt_template`, the prompt template to use\n",
-    "* `response_types`, the names of the expected output classes.\n",
-    "* `metadata`, the extra attributes that need to be preserved\n",
-    "* `max_output_tokens`, the maximum number of tokens that the LLM outputs"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "53065ac4-d280-4177-ac61-6bb91bed1e1e",
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "}\n",
+    "\n",
+    "\n",
     "def get_llm_evaluator(\n",
     "    model_name_or_path: str, task_type: str, iterations: int, temperature: float\n",
     "):\n",