Project presented in Mistral Fine Tuning Hackathon 2024.
Pakobot is an assistant that allows Spanish clinicians to obtain and refine an International Code for Diseases based in a decided diagnosis. Currently the official apps only allow string-based search. Using a chatbot interface, clinicians can nail down the desired code.
Pakobot doesn't provide the diagnosis; it just helps to find the correct code. Also, this is a highly experimental project. Do not use in profesional environments and confirm always the code in ICD page.
- Marta Josa Bordel: Synthetic data generation and communications.
- Carlos Vivar Rios: Concept, sythetic data generation, and fine-tuning.
Pakobot doesn't require any personal information from the patients. It just require the diagnosis terms that the clinician has identified. Therefore, no sensitive data is transmitted to the bot.
In simple terms, we want to check how far can we go just with fine tuning in terms of hallucinations. Also, the ultimate goal of pakobot is to be compiled into ONNX and be able to run serveless in the browser using transfromers.js
.
However, in this project we would like to explore the use of a RAG and perform automatic evaluation of the answers provided.
You can go to pakobot.streamlit.app and try the ft:open-mistral-7b:7ccb0f03:20240630:257f6a77
model if you have a Mistral API Key. Take in account that the project is not properly evaluated and most answer are wrong.
We recommend to use Docker. You can build the image with:
docker build -t pakobot .
Fill the .env.dist
with the right keys. And then run the gui with:
docker run -it -p 8501:8501 --env-file .env -t pakobot
This should make the tool availble in http://localhost:8501
graph LR
A[Data Extraction]
G[1st Round: Generate Questions]
H[2nd Round: Generate Answers]
J[Fine Tuning]
N[Streamlit App]
A --> G --> H --> J --> N
We used te ICD official API in order to retrieve the linearized terms. We obtained a dataframe per main category containing the following fields:
id
api-url
browser-url
title
synonyms
icd-code
description
parents
parents-title
parents-code
childrens
childrens-title
childrens-code
Sample for the data is provided in ./data/cie11
for the category 1435254666: "01 Algunas enfermedades infecciosas o parasitarias".
ICD-11 is licensed under the Creative Commons Attribution-NoDerivs 3.0 IGO license (CC BY-ND 3.0 IGO).
In order to generate the training data we performed two rounds of data synthesis using Nemotron 4 340B Instruct Model by Nvidia. The first round provided 20 questions per row or item in the ICD ontology. Then, a second round took each question and asked the model to provide a feasible answer based in the ground truth context. The result was stored in a jsonl
file and named as {ICD-ID}_{DF_Index}.jsonl
.
Nemotron-4 license allows the reuse of the output for training third-party models.
A sample of the data can be found in ./data
, while the scripts to generate the synthetica data are located in ./synthetic_data
.
After generating the synthetic data, we proceded to fine-tune the model using the MISTRAL API with the following parameters:
created_job = client.jobs.create(
model="open-mistral-7b",
training_files=[training_data.id],
hyperparameters=TrainingParameters(
training_steps=100,
learning_rate=0.0001,
)
)
The code for this can be found in ./fine-tuning
.
In order to demo the app we provide a chat interface using the library streamlit python library.
The evaluation was performed with a doctor but it requires further validation. One method we are exploring is to test the results against a RAG system where the ICD-DB has been stored.
This is a project under heavy development. Check out the roadmap and feel free to open an issue if you would like to contribute.
- Increase the number of conversations.
- Improve the prompts.
- English and French versions.
- Provide compatibility with ICD-9 and ICD-10 (Still in use in some areas such as Andalusia, Spain).
- Make use of
nemotron-4-340b-reward
to filter out the best conversations. - Implement an evaluation of the model performance based in RAG.
- Compile the model in ONNX and attempt its usage in the browser via
transformers.js