---
title: How Does Q Work
date: "2024-08-01T09:00:00.000Z"
description: "Curious about how Q, Qatium's AI assistant, operates under the hood? Dive into the technical details of this tool that leverages OpenAI's generative AI to handle user queries, troubleshoot issues, and manage network operations. Learn how the Retrieval-Augmented Generation (RAG) is used to combine instructions, help center data, and user context for precise responses. Discover how Q integrates predefined commands to execute network tasks, and explore the challenges faced, such as AI response variability, token costs, and handling large network data. This article offers a deep dive into the implementation and the technical decisions."
description: "Curious about how Q, Qatium's AI assistant, operates under the hood? Dive into the technical details of this tool that leverages OpenAI's generative AI to handle user queries, troubleshoot issues, and manage network operations. Learn how Retrieval-Augmented Generation (RAG) is used to combine instructions, help center data, and user context for precise responses. Discover how Q integrates predefined commands to execute network tasks, and explore the challenges faced, such as AI response variability, token costs, and handling large network data. This article offers a deep dive into the implementation and the technical decisions."
---

Q is the AI assistant in Qatium. They can answer user questions in natural language, help resolve issues, and even operate the network.

Q uses conversational generative AI from OpenAI to produce the text.

To maintain a consistent AI personality and keep usage focused on Qatium, we created an assistant. This involves appending a set of instructions to user questions before they are sent to the LLM.

![Diagram that shows how a block of instructions is appended to the prompt before passing it to the LLM gen](./prompt-1.png)

Instructions include things like the assistant's personality and guidance to keep answers focused on Qatium.
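As a rough illustration, here is a minimal sketch of how such an assistant could be set up with the OpenAI Node SDK. The name, model, and instruction text are placeholders, not Qatium's actual configuration:

```ts
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical instructions; Qatium's real instruction block is more detailed.
const assistant = await openai.beta.assistants.create({
  name: "Q",
  model: "gpt-4o",
  instructions:
    "You are Q, the assistant inside Qatium. Keep a friendly, consistent " +
    "personality and only answer questions about Qatium and water networks.",
});
```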

## How does Q know so much about Qatium?

Q knows the Qatium information they need to answer user questions because we also inject it into the prompt, in a similar way to the instructions.

This technique is called retrieval-augmented generation (RAG).

When the user writes a question, we search the Qatium help center for pieces of content that may be used to answer that question (Retrieval).

Retrieval is a service provided by OpenAI too, so we only need to periodically export the help center content and upload it.
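Assuming this runs through the Assistants API's file_search tool (the post doesn't name the mechanism explicitly), a periodic sync job could look roughly like the sketch below; the store name, file path, and assistant ID are placeholders:

```ts
import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical sync job: upload the exported help center articles so the
// assistant's file_search tool can retrieve chunks from them.
const store = await openai.beta.vectorStores.create({ name: "qatium-help-center" });

await openai.beta.vectorStores.fileBatches.uploadAndPoll(store.id, {
  files: [fs.createReadStream("./export/help-center.md")],
});

await openai.beta.assistants.update("asst_123", {
  tool_resources: { file_search: { vector_store_ids: [store.id] } },
});
```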

## How does Q know about the user and the network?

You probably guessed it already: we also concatenate this information to the prompt.

We call this part the Qatium context. It includes the user name, the network name, which layers are visible, the selected asset, any asset with warnings...
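To make that concrete, the context could be modeled as a small structure that is serialized into the prompt. This shape and its field names are my own illustration, not Qatium's actual schema:

```ts
// Hypothetical shape of the Qatium context; field names are illustrative.
interface QatiumContext {
  userName: string;
  networkName: string;
  visibleLayers: string[];
  selectedAssetId?: string;
  assetsWithWarnings: string[];
}

// Serialized and appended to the prompt alongside the user question.
const contextBlock = (ctx: QatiumContext): string =>
  `Qatium context:\n${JSON.stringify(ctx, null, 2)}`;
```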

Once the full prompt is assembled, the LLM generates the answer token by token and starts to stream the text back to the user.

If the LLM doesn't need any tools, the run completes and the question is answered.

If it wants to use a tool, the run response is [required_action](https://platform.openai.com/docs/api-reference/runs/object#runs/object-required_action) with the tool(s) to be used and their params. We then execute the code associated with the relevant tool(s) and call [submit_tool_outputs](https://platform.openai.com/docs/api-reference/runs/submitToolOutputs) with the results.

The AI then runs again; the response can be a textual answer for the user or a request for more tools.
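In code, this loop could look roughly like the sketch below, using the OpenAI Node SDK. The command registry and the `open_valve` tool are hypothetical stand-ins for Qatium's real commands:

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical registry mapping tool names to the code that operates the network.
const commands: Record<string, (args: unknown) => Promise<string>> = {
  open_valve: async (args) => `Valve opened: ${JSON.stringify(args)}`,
};

async function runUntilDone(threadId: string, runId: string) {
  let run = await openai.beta.threads.runs.retrieve(threadId, runId);

  // Execute the requested tools and feed the results back until the run settles.
  while (run.status === "requires_action") {
    const calls = run.required_action!.submit_tool_outputs.tool_calls;

    const outputs = await Promise.all(
      calls.map(async (call) => ({
        tool_call_id: call.id,
        output: await commands[call.function.name](JSON.parse(call.function.arguments)),
      })),
    );

    run = await openai.beta.threads.runs.submitToolOutputsAndPoll(threadId, runId, {
      tool_outputs: outputs,
    });
  }

  return run; // "completed" means the final answer is in the thread's messages
}
```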

## Limitations

### Nondeterminism

Generative AI may produce different results each time it is used. It's not possible to guarantee that the results provided to the user are always correct.

We follow a "best effort" approach: we manually test and iterate on our prompts and command descriptions until the results are consistently right.

LLMs are evolving fast and every new version improves reliability. My personal bet is that eventually this limitation will become insignificant.

### Cost

The final prompt is composed of instructions + documentation retrieval + Qatium context + commands + user question. This can add up to a lot of tokens.

The costliest aspect of the process is retrieving documentation. We halved the tokens used by lowering the maximum number of chunks from 20 to 5, with no visible effect on response quality.
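If the retrieval runs through the Assistants API's file_search tool, this kind of cap maps onto its max_num_results option; a sketch, reusing the hypothetical assistant ID from earlier:

```ts
// Hypothetical: cap file_search results to cut retrieval token usage.
await openai.beta.assistants.update("asst_123", {
  tools: [{ type: "file_search", file_search: { max_num_results: 5 } }],
});
```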

When a second question is added to a thread, the previous messages also count as input tokens. To prevent exponential growth in consumption, we don't add the commands again and we only add a diff of the Qatium context. Based on some consumption measurements, I see that OpenAI does not accumulate retrieval data, so the consumption of a second question in a thread is very similar to the first instead of growing exponentially.
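The context diff could be as simple as comparing the new context against the last one sent and serializing only the changed fields. This reuses the hypothetical `QatiumContext` shape sketched earlier:

```ts
// Hypothetical: send only the fields that changed since the previous message.
function contextDiff(prev: QatiumContext, next: QatiumContext): Partial<QatiumContext> {
  const changed = Object.entries(next).filter(
    ([key, value]) =>
      JSON.stringify(value) !== JSON.stringify(prev[key as keyof QatiumContext]),
  );
  return Object.fromEntries(changed) as Partial<QatiumContext>;
}
```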

### Model updates

Most LLM updates are positive. However, not all of them are improvements: for example, some versions of ChatGPT were 'lazier' than their predecessors.

There are now several initiatives to benchmark and compare LLMs that help predict overall results. But it's impossible to ensure that changes in the LLM's behavior won't worsen Q's results.

Writing e2e tests could help us, but they are challenging because:

- Cost - A big suite would use a lot of tokens
- Nondeterminism - Sometimes the AI finds different ways to complete the task, so test expectations are more difficult to write

### Context size and complexity

Qatium sometimes handles large networks (tens or even hundreds of MB).

If we provided the whole network to the AI for a full understanding of it, it would be extremely expensive, and we might even exceed the maximum size of the LLM context.

Simpler questions don't need all that information, but requests like "Increase the minimum pressure in this DMA with the least amount of losses" require knowledge of the whole network topology. Even with the whole network, it's possible that current AIs are not capable of providing good results to these kinds of questions without extra help. This is a problem we are still working on.
