Amanuensis was awarded the BIG IDEA: Patient Safety Technology Prize at TreeHacks 2023, sponsored by the Pittsburgh Regional Health Initiative and the Patient Safety Technology Challenge. Learn more at www.patientsafetytech.com. We're grateful to the sponsors and judges for their consideration and recognition, and to the TreeHacks organizing team at Stanford University.
We're in search of bold new thinking. This is an invitation to solve the problem of medical error that harms millions of U.S. patients, kills approximately 250,000 patients, and costs billions of dollars every year. We're calling on TreeHacks teams to envision the best technology-enabled patient safety solution that has the potential to avert patient harm and save lives and will be awarding $2,000 to the top team. Your hack must align with one of the following five leading patient safety challenges facing health care across the continuum of care: Medication errors, procedural/surgical errors, errors during routine patient care (e.g. pressure ulcers, blood clots, falls), infections and diagnostic safety. Learn more about the problem and get access to resources to help your hack here.
AI-enabled physician assistant for automated clinical summarization and question generation. Empowering physicians to achieve accurate diagnoses and effective treatments. Project for TreeHacks 2023 at Stanford University.
The modern electronic health record (EHR) encompasses a treasure trove of information across patient demographics, medical history, clinical data, and other health system interactions (Jensen et al.). Although the EHR represents a valuable resource to track clinical care and retrospectively evaluate clinical decision-making, the data deluge of the EHR often obfuscates key pieces of information necessary for the physician to make an accurate diagnosis and devise an effective treatment plan (Noori and Magdamo et al.). Physicians may struggle to rapidly synthesize the lengthy medical histories of their patients; in the absence of data-driven strategies to extract relevant insights from the EHR, they are often forced to rely on intuition alone to generate patient questions. Further, the EHR search interface is rarely optimized for the physician search workflow, and manual search can be both time-consuming and error-prone.
The volume and complexity of the EHR can lead to missed opportunities for physicians to gather critical information pertinent to patient health, leading to medical errors or poor health outcomes. It is imperative to design tools and services to reduce the burden of manual EHR search on physicians and help them elicit the most relevant information from their patients.
Amanuensis is an AI-enabled physician assistant for automated clinical summarization and question generation. By arming physicians with relevant insights collected from the EHR as well as with patient responses to NLP-generated questions, we empower physicians to achieve more accurate diagnoses and effective treatment plans. The Amanuensis pipeline is as follows:
- Clinical Summarization: Through our web application, physicians can access medical records of each of their patients, where they are first presented with a clinical summary: a concise, high-level overview of the patient's medical history, including key information such as diagnoses, medications, and allergies. This clinical summary is automatically generated by Amanuensis using Generative Pre-Trained Transformer 3 (GPT-3), an autoregressive language model with a 2048-token-long context and 175 billion parameters. The clinical summary may be reviewed by the physician to ensure that the summary is accurate and relevant to the patient's health.
- Question Generation: Next, Amanuensis uses GPT-3 to automatically generate a list of questions that the physician can ask their patient to elicit more information and identify relevant information in the EHR that the physician may not have considered. The NLP-generated questions are automatically sent to the patient prior to their appointment (e.g., once the appointment is scheduled); then, the physician can review the patient's responses and use them to inform their clinical decision-making during the subsequent encounter. Importantly, we have tested Amanuensis on a large cohort of high-quality simulated EHRs generated by Synthea™.
By guiding doctors to elicit the most relevant information from their patients, Amanuensis can help physicians improve patient outcomes and reduce the incidence of all five types of medical errors: medication errors, patient care complications, procedure/surgery complications, infections, and diagnostic/treatment errors.
To both construct and validate Amanuensis, we used the Synthea™ library to generate synthetic patients and associated EHRs (Walonoski et al.). Synthea™ is an open-source software package that simulates the lifespans of synthetic patients using realistic models of disease progression and corresponding standards of care. These models rely on a diverse set of real-world data sources, including the United States Census Bureau demographics, Centers for Disease Control and Prevention (CDC) prevalence and incidence rates, and National Institutes of Health (NIH) reports. The Synthea™ package was developed by an international research collaboration involving the MITRE Corporation and the HIKER Group, and is in turn based on the Publicly Available Data Approach to the Realistic Synthetic EHR framework (Dube and Gallagher). We customized the Synthea™ synthetic data generation workflow to produce the following 18 data tables (see also the Synthea™ data dictionary):
Table | Description |
---|---|
Allergies | Patient allergy data. |
CarePlans | Patient care plan data, including goals. |
Claims | Patient claim data. |
ClaimsTransactions | Transactions per line item per claim. |
Conditions | Patient conditions or diagnoses. |
Devices | Patient-affixed permanent and semi-permanent devices. |
Encounters | Patient encounter data. |
ImagingStudies | Patient imaging metadata. |
Immunizations | Patient immunization data. |
Medications | Patient medication data. |
Observations | Patient observations including vital signs and lab reports. |
Organizations | Provider organizations including hospitals. |
Patients | Patient demographic data. |
PayerTransitions | Payer transition data (i.e., changes in health insurance). |
Payers | Payer organization data. |
Procedures | Patient procedure data including surgeries. |
Providers | Clinicians that provide patient care. |
Supplies | Supplies used in the provision of care. |
To simulate an EHR system, we pre-processed all synthetic data (see `code/construct_database.Rmd`) and standardized all fields. Next, we constructed a PostgreSQL database and linked related tables using manually defined primary and foreign keys. In total, our database contains 199,717 records from 20 patients across 262 different fields. Our data generation pipeline also scales to tens of thousands of patients, a capacity we have tested.
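As a rough illustration of keying tables together, the sketch below links a patients table to an encounters table through a primary key and a foreign key. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server. The table and column names (`patients.id`, `encounters.patient`) and the sample rows are simplified assumptions, not the schema built in `code/construct_database.Rmd`.

```python
import sqlite3


def build_demo_database() -> sqlite3.Connection:
    """Create a tiny in-memory database with one primary/foreign key pair."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        "CREATE TABLE patients (id TEXT PRIMARY KEY, last_name TEXT NOT NULL)"
    )
    conn.execute(
        """
        CREATE TABLE encounters (
            id TEXT PRIMARY KEY,
            patient TEXT NOT NULL REFERENCES patients (id),
            description TEXT
        )
        """
    )
    return conn


conn = build_demo_database()
# Hypothetical sample rows, for illustration only.
conn.execute("INSERT INTO patients VALUES ('p1', 'Doe')")
conn.execute("INSERT INTO encounters VALUES ('e1', 'p1', 'Well child visit')")

# Join each encounter to its patient through the foreign key.
rows = conn.execute(
    "SELECT p.last_name, e.description "
    "FROM encounters AS e JOIN patients AS p ON p.id = e.patient"
).fetchall()
print(rows)
```

With the foreign key in place, inserting an encounter that points at a nonexistent patient is rejected by the database instead of silently creating an orphaned record.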
Finally, we coupled the PostgreSQL database with the RedwoodJS full-stack web development framework to build a web application that allows:
- Physicians to access the clinical summaries and questions generated by Amanuensis for each of their patients.
- Patients to access the questions generated by Amanuensis and respond to them via a web form.
To generate both clinical summaries and questions for each patient, we used the OpenAI GPT-3 API. In both cases, GPT-3 was prompted with a subset of the EHR record for a given patient inserted into a prompt template for GPT-readability. Other key features of our web application include:
- Authentication: Users can log in with their email addresses; physicians are automatically redirected to their dashboard upon login, while patients are redirected to a page where they can respond to the questions generated by Amanuensis.
- EHR Access: Physicians can also access the full synthetic EHR for each patient as well as view autogenerated graphs and data visualizations, which they can use to review the accuracy of the clinical summaries and questions generated by Amanuensis.
- Patient Response Collection: Prior to an appointment, Amanuensis automatically collects the patient's responses to the NLP-generated questions and sends them to the physician. During the appointment, the physician can draw on these responses to inform clinical decision-making.
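The prompt-template step described above can be sketched roughly as follows. This is an illustration only: the template wording, the `format_records` and `build_prompts` helpers, and the sample records are all hypothetical, since the prompts Amanuensis actually uses are not given in this README. The call to the GPT-3 API is deliberately left out.

```python
from string import Template

# Hypothetical templates; the real Amanuensis prompts are not shown in this README.
SUMMARY_TEMPLATE = Template(
    "Below are selected records for a patient.\n\n$records\n\n"
    "Write a concise clinical summary covering diagnoses, medications, and allergies."
)
QUESTION_TEMPLATE = Template(
    "Below are selected records for a patient.\n\n$records\n\n"
    "List $n questions the physician could ask this patient before their appointment."
)


def format_records(ehr_subset: dict[str, list[str]]) -> str:
    """Render each table's entries as a labeled, bulleted section."""
    sections = []
    for table, entries in ehr_subset.items():
        lines = "\n".join(f"- {entry}" for entry in entries)
        sections.append(f"{table}:\n{lines}")
    return "\n\n".join(sections)


def build_prompts(ehr_subset: dict[str, list[str]], n_questions: int = 5) -> tuple[str, str]:
    """Insert the same EHR subset into the summary and question templates."""
    records = format_records(ehr_subset)
    summary_prompt = SUMMARY_TEMPLATE.substitute(records=records)
    question_prompt = QUESTION_TEMPLATE.substitute(records=records, n=n_questions)
    return summary_prompt, question_prompt


# Made-up records for illustration.
example = {
    "Conditions": ["Hypertension (2015)"],
    "Medications": ["Lisinopril 10 mg"],
    "Allergies": ["Penicillin"],
}
summary_prompt, question_prompt = build_prompts(example, n_questions=3)
print(question_prompt)
```

Keeping the record formatting separate from the templates means the same EHR subset feeds both tasks, and only the instruction at the end of the prompt changes.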
In the future, we hope to integrate Amanuensis into existing EHR systems (e.g., Epic, Cerner, etc.), providing physicians with a seamless, AI-powered assistant to help them make more informed clinical decisions. We also plan to enrich our NLP pipeline with real patient data rather than synthetic EHR records. In concert with gold-standard annotations generated by physicians, we intend to fine-tune our question generation and clinical summarization models on real-world data to improve the sophistication and fidelity of the generated text and enable more robust clinical reasoning capabilities.
First, create a new Anaconda environment. For example:
conda create --name amanuensis python=3.10
Then, install:
Generate synthetic patient data with Synthea™.
Synthea™ requires Java 11 or newer. First, install the Java Development Kit (JDK).
Next, clone the Synthea™ repo, then build and run the test suite:
git clone https://github.com/synthetichealth/synthea.git
cd synthea
./gradlew build check test
In the `synthea` directory, modify `./src/main/resources/synthea.properties`. Set both `exporter.csv.export` and `generate.only_alive_patients` to `true`. Output will then be generated in `./src/output/csv`.
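If you would rather script this edit than make it by hand, a minimal sketch is shown below. The `set_properties` helper is hypothetical (it is not part of Synthea™ or this repository), and the sample text is invented for the demonstration; in practice you would read and write the real `synthea.properties` file.

```python
import re


def set_properties(text: str, updates: dict[str, str]) -> str:
    """Return properties text with each key in `updates` set to its new value.

    Existing `key = value` lines are rewritten in place; keys that are not
    present are appended at the end. Comments and other lines are untouched.
    """
    remaining = dict(updates)
    out_lines = []
    for line in text.splitlines():
        # A property line starts with a key that is not a comment marker.
        match = re.match(r"\s*([^#!=:\s]+)\s*[=:]", line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            out_lines.append(f"{key} = {remaining.pop(key)}")
        else:
            out_lines.append(line)
    out_lines.extend(f"{key} = {value}" for key, value in remaining.items())
    return "\n".join(out_lines) + "\n"


# Invented sample content standing in for synthea.properties.
sample = "# CSV export\nexporter.csv.export = false\nexporter.fhir.export = true\n"
updated = set_properties(
    sample,
    {"exporter.csv.export": "true", "generate.only_alive_patients": "true"},
)
print(updated)
```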
Again in the `synthea` directory, use the following command to generate the desired number of patients. The parameters are as follows:
- `-p`: number of patients.
- `-s`: random seed.
- `-a`: patient age range.
./run_synthea -p 20 -s 42 -a 0-100
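To make runs reproducible from a script, the same command can be assembled in Python. This is a sketch under assumptions: the `synthea_command` and `run_synthea` helpers are hypothetical names, and only the three flags listed above are used. Running it requires a POSIX shell environment and a built Synthea™ checkout.

```python
import re
import subprocess


def synthea_command(n_patients: int, seed: int, age_range: str) -> list[str]:
    """Build the argument list for run_synthea using the -p, -s, and -a flags."""
    if n_patients < 1:
        raise ValueError("n_patients must be at least 1")
    if not re.fullmatch(r"\d+-\d+", age_range):
        raise ValueError("age_range must look like '0-100'")
    return ["./run_synthea", "-p", str(n_patients), "-s", str(seed), "-a", age_range]


def run_synthea(synthea_dir: str, n_patients: int = 20, seed: int = 42,
                age_range: str = "0-100") -> None:
    """Run Synthea from inside its checkout; raises if the command fails."""
    subprocess.run(
        synthea_command(n_patients, seed, age_range),
        cwd=synthea_dir,
        check=True,
    )


# Print the command without executing it.
print(" ".join(synthea_command(20, 42, "0-100")))
```

Fixing the seed (`-s`) is what makes a cohort repeatable across runs, so it is worth recording alongside the generated data.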
As specified in the Synthea™ wiki, the CSV exporter will generate files according to the CSV file data dictionary. Copy the generated files from `synthea/src/output/csv` to `amanuensis/patient_data`.
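The copy step can also be scripted, with a check that every expected table was exported. In this sketch the `copy_csv_exports` helper is hypothetical, and the file names in `EXPECTED_TABLES` are an assumption based on the 18 tables listed earlier and the naming used in the Synthea™ CSV data dictionary; adjust them if your export differs.

```python
import shutil
from pathlib import Path

# Assumed CSV base names for the 18 tables described earlier in this README.
EXPECTED_TABLES = [
    "allergies", "careplans", "claims", "claims_transactions", "conditions",
    "devices", "encounters", "imaging_studies", "immunizations", "medications",
    "observations", "organizations", "patients", "payer_transitions", "payers",
    "procedures", "providers", "supplies",
]


def copy_csv_exports(source_dir, target_dir) -> list[str]:
    """Copy every CSV from source_dir to target_dir.

    Returns the expected table names that are still missing from target_dir.
    """
    source, target = Path(source_dir), Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for csv_file in sorted(source.glob("*.csv")):
        shutil.copy2(csv_file, target / csv_file.name)
    present = {path.stem for path in target.glob("*.csv")}
    return [name for name in EXPECTED_TABLES if name not in present]


# Example call (paths taken from the instructions above):
# missing = copy_csv_exports("synthea/src/output/csv", "amanuensis/patient_data")
# print("Missing tables:", missing or "none")
```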
Next, construct the PostgreSQL database using `code/construct_database.Rmd`. Run the following command in Terminal to install PostgreSQL.
brew install postgres
Check the version of PostgreSQL as follows.
psql --version
To start PostgreSQL, run the following command.
brew services start postgresql@14
To stop PostgreSQL, run the following command.
brew services stop postgresql@14
Open the `psql` interactive terminal, which is designed to work with the PostgreSQL database.
psql postgres
Create a new database called `amanuensis`. List all users and databases.
CREATE DATABASE amanuensis;
\du
\l
Next, run the code in `code/construct_database.Rmd` to write the synthetic patient data (originally in CSV format) to the newly-created PostgreSQL database. Note that the keys must be specified according to the CSV file data dictionary.
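For readers who do not use R, the idea of writing a CSV table to a database with a declared key can be sketched in Python. This is not a port of `code/construct_database.Rmd`: the `load_csv` helper is hypothetical, `sqlite3` stands in for PostgreSQL so the example runs without a server, every column is stored as text, and the sample rows are made up.

```python
import csv
import io
import sqlite3


def load_csv(conn: sqlite3.Connection, table: str, csv_file, primary_key: str) -> int:
    """Create `table` from the CSV header and insert all rows.

    Table and column names are interpolated into SQL, so they must come from
    a trusted source such as the data dictionary. Returns the row count.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(
        f'"{name}" TEXT PRIMARY KEY' if name == primary_key else f'"{name}" TEXT'
        for name in header
    )
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return len(rows)


# Made-up sample standing in for patient_data/patients.csv.
sample = "Id,FIRST,LAST\np1,Ada,Doe\np2,Ben,Roe\n"
conn = sqlite3.connect(":memory:")
n_rows = load_csv(conn, "patients", io.StringIO(sample), primary_key="Id")
print(n_rows)
```

Declaring the key at load time means duplicate identifiers in a CSV surface as an error during import instead of later, when tables are joined.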
After constructing the PostgreSQL database, in the `psql` terminal, connect to the `amanuensis` database. List all tables in the database.
\c amanuensis;
\d
To remove all tables from the database, the following SQL command can be used.
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO public;
The database can be dumped for transfer by running the following in the Terminal.
pg_dump --dbname amanuensis > ./patient_data/db_dump/db_dump.sql
For text generation, we can also use the BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining model by Luo et al., 2022. Thus, the dependencies mirror those of microsoft/BioGPT commit f186d88. These include:
- PyTorch version == 1.13.1.
- transformers, which provides APIs and tools to easily download and train state-of-the-art pretrained models.
Alternatively, BioGPT can be used without Hugging Face 🤗 by installing the below dependencies. See the BioGPT installation instructions for more information.
- fairseq, tested with version == 0.12.0. Install as follows:
git clone https://github.com/pytorch/fairseq
cd fairseq
git checkout v0.12.0
pip install .
python setup.py build_ext --inplace
cd ..
- Moses, install as follows:
git clone https://github.com/moses-smt/mosesdecoder.git
export MOSES=${PWD}/mosesdecoder
- fastBPE, install as follows:
git clone https://github.com/glample/fastBPE.git
export FASTBPE=${PWD}/fastBPE
cd fastBPE
g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast
- sacremoses, install as follows:
pip install sacremoses
- sklearn, install as follows:
pip install scikit-learn
Remember to set the environment variables `MOSES` and `FASTBPE` to the paths of Moses and fastBPE, respectively, as they will be required later. According to the conda documentation, inside a conda environment, environment variables can be viewed and set as follows; reactivate the environment afterwards so that the changes take effect:
conda env config vars list
conda env config vars set MOSES=${PWD}/mosesdecoder
conda env config vars set FASTBPE=${PWD}/fastBPE
conda activate amanuensis
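Before loading the model, it can help to confirm that both variables are visible to Python. The check below is a small sketch; the `check_tool_paths` helper is a hypothetical name and not part of BioGPT or fairseq.

```python
import os
from pathlib import Path

REQUIRED_VARS = ("MOSES", "FASTBPE")


def check_tool_paths(env=os.environ) -> dict[str, str]:
    """Return a problem description for each variable that is unset or not a directory."""
    problems = {}
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems[name] = "not set"
        elif not Path(value).is_dir():
            problems[name] = f"not a directory: {value}"
    return problems


# An empty result means both variables point at existing directories.
print(check_tool_paths() or "MOSES and FASTBPE look fine")
```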
Then, run the following.
import torch
from fairseq.models.transformer_lm import TransformerLanguageModel

# Load the pre-trained BioGPT checkpoint with Moses tokenization and fastBPE.
m = TransformerLanguageModel.from_pretrained(
    "checkpoints/Pre-trained-BioGPT",
    "checkpoint.pt",
    "data",
    tokenizer="moses",
    bpe="fastbpe",
    bpe_codes="data/bpecodes",
    min_len=100,
    max_len_b=1024,
)
# Uncomment to move the model to the GPU when CUDA is available.
# m.cuda()

# Encode a prompt, generate with beam search, and decode the top hypothesis.
src_tokens = m.encode("The patient presents with a history of fever and abdominal cramps for the last 24 hours.")
generate = m.generate([src_tokens], beam=5)[0]
output = m.decode(generate[0]["tokens"])
print(output)
This project was completed during the TreeHacks 2023 hackathon at Stanford University.
- Noori, A. et al. Development and Evaluation of a Natural Language Processing Annotation Tool to Facilitate Phenotyping of Cognitive Status in Electronic Health Records: Diagnostic Study. Journal of Medical Internet Research 24, e40384 (2022).
- Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: towards better research applications and clinical care. Nature Reviews Genetics 13, 395–405 (2012).
- Walonoski, J. et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association 25, 230–238 (2018).
- Dube, K. & Gallagher, T. Approach and Method for Generating Realistic Synthetic Electronic Healthcare Records for Secondary Use. in Foundations of Health Information Engineering and Systems (eds. Gibbons, J. & MacCaull, W.) 69–86 (Springer, 2014). doi:10.1007/978-3-642-53956-5_6.