English-Thai Code-switched Machine Translation in Medical Domain

This repository contains the code, data, and evaluation scripts for our paper "On Creating an English-Thai Code-switched Machine Translation in Medical Domain".

Abstract

Machine translation (MT) in the medical domain plays a pivotal role in enhancing healthcare quality and disseminating medical knowledge. Despite advancements in English-Thai MT technology, common MT approaches often underperform in the medical field due to their inability to precisely translate medical terminologies. Our research prioritizes not merely improving translation accuracy but also maintaining medical terminology in English within the translated text through code-switched (CS) translation. We developed a method to produce CS medical translation data, fine-tuned a CS translation model with this data, and evaluated its performance against strong baselines, such as Google Neural Machine Translation (NMT) and GPT-3.5/GPT-4. Our model demonstrated competitive performance in automatic metrics and was highly favored in human preference evaluations. Our evaluation results also show that medical professionals significantly prefer CS translations that maintain critical English terms accurately, even when this slightly compromises fluency. Our code and test set are publicly available at https://github.com/preceptorai-org/NLLB_CS_EM_NLP2024.

Key Features

Code-Switched Translation: Our system specifically focuses on preserving English medical terminology within the translated Thai text, catering to the preferences of medical professionals.
Novel Data Generation: We introduce a unique masking-based approach to generate pseudo-CS medical translation data, addressing the lack of readily available resources.
Rigorous Evaluation: We conduct comprehensive evaluations using both automatic metrics (BLEU, chrF, METEOR, CER, WER, COMET, CS boundary F1) and human evaluations by medical professionals.
Open-Source Resources: We publicly release our code, test set, and evaluation scripts to facilitate further research in this critical domain.
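
As a rough illustration of one of the automatic metrics listed above, the sketch below computes character error rate (CER) as the Levenshtein distance between reference and hypothesis, divided by the reference length. It uses only the standard library and is not the code in eval/ (the requirements list jiwer, which the evaluation scripts may use instead). The function names edit_distance and cer, and the example sentence, are hypothetical and chosen for illustration only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in Unicode code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical code-switched example: Thai text with an English medical term.
# Note that Thai vowel and tone marks count as separate code points here.
reference = "ผู้ป่วยมี hypertension"
hypothesis = "ผู้ป่วยมี hypertention"  # one wrong character in the English term
print(f"CER = {cer(reference, hypothesis):.3f}")
```

A character-level metric is convenient for Thai because the script does not put spaces between words, so word-level metrics such as WER first require a word tokenizer (pythainlp is listed in the requirements and could serve that purpose).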

Repository Structure

inference/: Contains scripts for:
- Generating translations using various LLM-based translators.
- Implementing the Pseudo-translation Masking technique (Section 3.2 of the paper).
data_preprocess/:
- clean_human.py: Cleans and preprocesses the human-annotated dataset (Section 3.3).
- augment.py: Augments the training data using back-translation (Section 3.4).
- calculate_comet.py, filter_comet.ipynb: Filter generated translations based on COMET score (Section 3.4).
finetune/: Includes scripts for fine-tuning the NLLB model on the generated CS data (Section 4.1).
data/: Contains the test set used for evaluation.
eval/: Contains scripts for evaluating translation models using various metrics (Section 4.2.1).
glicko/: Contains scripts for analyzing Glicko rating data from human evaluations (Section 4.2.2).
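
To give a sense of what a Glicko rating analysis involves, the sketch below implements the classic Glicko-1 update for one rating period, treating each translation system as a "player" and each pairwise human preference as a "game". This is an assumption made for illustration: the scripts in glicko/ may use a different variant (for example Glicko-2), a library, or different starting values. The names g, expected_score, and glicko_update are hypothetical and not taken from the repository.

```python
import math

Q = math.log(10) / 400  # Glicko-1 scaling constant


def g(rd: float) -> float:
    """Down-weights a result according to the opponent's rating deviation."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)


def expected_score(r: float, r_opp: float, rd_opp: float) -> float:
    """Expected score of a system rated r against an opponent (r_opp, rd_opp)."""
    return 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))


def glicko_update(r, rd, games):
    """Return the updated (rating, deviation) after one rating period.

    games is a list of (opponent_rating, opponent_rd, score) tuples, where
    score is 1.0 for a win (preferred), 0.5 for a tie, and 0.0 for a loss.
    """
    if not games:
        return r, rd
    inv_d2 = 0.0  # accumulates 1 / d^2
    delta = 0.0   # accumulates weighted (score - expected score)
    for r_opp, rd_opp, score in games:
        e = expected_score(r, r_opp, rd_opp)
        inv_d2 += Q**2 * g(rd_opp) ** 2 * e * (1 - e)
        delta += g(rd_opp) * (score - e)
    precision = 1 / rd**2 + inv_d2
    return r + (Q / precision) * delta, math.sqrt(1 / precision)


# Hypothetical example: a system rated 1500 (RD 200) is preferred once
# and dispreferred twice against three other systems.
new_r, new_rd = glicko_update(
    1500, 200,
    [(1400, 30, 1.0), (1550, 100, 0.0), (1700, 300, 0.0)],
)
print(round(new_r), round(new_rd, 1))
```

One reason a Glicko-style rating suits human preference data is that it tracks a rating deviation alongside the rating itself, so systems that were compared only a few times carry visibly higher uncertainty.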

Requirements

We performed our experiments on Google Colaboratory with the additional dependencies listed below:

Python 3.8+
Libraries:
- requests
- httpx
- tqdm
- pandas
- seaborn
- matplotlib
- unbabel-comet
- nltk
- jiwer
- pythainlp
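
A quick way to confirm that an environment has the libraries above is to check whether each one can be imported. The sketch below does this with the standard library only. The mapping from package name to import name is an assumption (in particular, unbabel-comet is assumed to be imported as comet), and the helper name missing_packages is hypothetical.

```python
import importlib.util

# Package name (as listed above) -> module name used in import statements.
# These import names are assumptions; adjust them if your environment differs.
REQUIRED = {
    "requests": "requests",
    "httpx": "httpx",
    "tqdm": "tqdm",
    "pandas": "pandas",
    "seaborn": "seaborn",
    "matplotlib": "matplotlib",
    "unbabel-comet": "comet",
    "nltk": "nltk",
    "jiwer": "jiwer",
    "pythainlp": "pythainlp",
}


def missing_packages(required):
    """Return the package names whose module cannot be found on the import path."""
    return [
        package
        for package, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

This check only locates the modules; it does not verify versions, so it complements rather than replaces a pinned requirements file.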

Released Weights

Citation

If you use this code or data in your research, please cite our paper:

@inproceedings{pengpun-etal-2024-on,
      title={On Creating an English-Thai Code-switched Machine Translation in Medical Domain}, 
      author={Parinthapat Pengpun and Krittamate Tiankanon and Amrest Chinkamol and Jiramet Kinchagawat and Pitchaya Chairuengjitjaras and Pasit Supholkhan and Pubordee Aussavavirojekul and Chiraphat Boonnag and Kanyakorn Veerakanjana and Hirunkul Phimsiri and Boonthicha Sae-jia and Nattawach Sataudom and Piyalitt Ittichaiwong and Peerat Limkonchotiwat},
      year={2024},
      eprint={2410.16221},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.16221}, 
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
