
Code for the GGPOnc corpus - A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines


hpi-dhc/ggponc_preprocessing


GGPONC 1.0 - Pre-Processing and Basic Analysis

This repository contains the code to reproduce the results in: Florian Borchert, Christina Lohr, Luise Modersohn, Thomas Langer, Markus Follmann, Jan Philipp Sachs, Udo Hahn, Matthieu-P. Schapranow. GGPONC: A Corpus of German Medical Text with Rich Metadata Based on Clinical Practice Guidelines. ArXiv:2007.06400 [Cs]. Accepted at LOUHI@EMNLP'20. [arXiv] [Code on GitHub] [data-request@DKG] https://arxiv.org/abs/2007.06400

Prerequisites

Requesting text data

UMLS Terminology data

Software requirements

Configuration after downloading this repository

  • Configure the project as a Maven project
    • In Eclipse: right click on project => Configure => Convert to Maven Project
    • Command line: mvn compile

Processing the data

Conversion of GGPOnc corpus XML file to plain text and preprocessing

  • Run mvn compile, then execute mvn exec:java -Dexec.mainClass="de.hpi.guidelines.reader.GGPOncXMLReader" -Dexec.args="<Path to cpg-corpus-cms.xml>". Alternatively, run GGPOncXMLReader.java (in package de.hpi.guidelines.reader) in Eclipse (Run As => Java Application)
  • Wait for the conversion to finish (about a minute)
  • The results are written to the directory /output

Create PubMed abstract text files

  • We downloaded the PubMed data on February 21, 2020. If you download PubMed data with esearch commands today, you will receive a larger text corpus than our export. The file src/main/resources/usedPubMedIds_20200221.txt lists the PubMed identifiers we used as of February 21, 2020.
  • To recreate the described data set from PubMed, import your extracted XML file and run src/main/extractPubMedCaseAbstracts.java. This code filters your newly created download down to the PubMed texts we used.
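The filtering step can be illustrated with a minimal Python sketch. It is not the repository's extractPubMedCaseAbstracts.java; it only shows the idea of keeping abstracts whose identifier appears in the ID list. It assumes the ID file holds one identifier per line and that the download is a standard PubMed XML export (PubmedArticle elements with MedlineCitation/PMID and AbstractText children). The function names and the one-file-per-abstract output layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def load_pmids(id_file):
    """Read PubMed identifiers, assuming one identifier per line."""
    with open(id_file, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_abstracts(pubmed_xml, wanted_pmids):
    """Yield (pmid, abstract_text) for articles whose PMID is in wanted_pmids."""
    # Note: ET.parse loads the whole file into memory.
    root = ET.parse(pubmed_xml).getroot()
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        if pmid is None or pmid.strip() not in wanted_pmids:
            continue
        # An abstract may consist of several labelled AbstractText parts.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.iterfind(
                "MedlineCitation/Article/Abstract/AbstractText"
            )
        ]
        text = " ".join(part for part in parts if part)
        if text:
            yield pmid.strip(), text


def write_abstracts(pubmed_xml, id_file, out_dir):
    """Write one <pmid>.txt file per kept abstract and return the count."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    count = 0
    for pmid, text in filter_abstracts(pubmed_xml, load_pmids(id_file)):
        (out_path / f"{pmid}.txt").write_text(text + "\n", encoding="utf-8")
        count += 1
    return count
```

A call such as write_abstracts("my_download.xml", "src/main/resources/usedPubMedIds_20200221.txt", "pubmed_txt") would then write the matching abstracts; the first and last arguments are placeholder names.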

Processing dictionaries

Filtered dictionaries from UMLS by JuFiT

  • We worked with JuFiT v1.1; the corresponding jar file is included in this repository.
  • If you want to run JuFiT yourself, follow the step below:
  • Run the Java code RequestJuFiT.java (package de.julielab.dictionaryhandling) or the shell script extended_script_dictionaries/request-jufit.sh

Gene Dictionary

  • We used a list of gene names compiled from Entrez Gene and UniProt with the approach originating from Wermter et al.
  • Code of JULIELab/gene-name-mapping
  • The integration of this code in the GGPOnc Repository is coming soon.

Connect Dictionaries

  • To use the JCoRe pipelines, you need one large file, global_dictionary.txt
  • Run the script extended_script_dictionaries/createDics.py to create one large dictionary (before running it, adapt the path names in the script file)
  • Or run the Java code CreateLargeDictionary.java (package de.julielab.dictionaryhandling) (before running it, adapt the path names in the source file)
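As a rough illustration of what "connecting" the dictionaries means, the following sketch concatenates several dictionary files into a single file and drops blank and duplicate lines. It does not reproduce createDics.py or CreateLargeDictionary.java. It assumes the dictionaries are plain-text files with one entry per line; the function name and the de-duplication step are assumptions.

```python
def merge_dictionaries(source_files, target_file):
    """Concatenate dictionary files into one file.

    Blank lines and exact duplicate lines are dropped; the order of first
    occurrence is preserved. Returns the number of entries written.
    """
    seen = set()
    written = 0
    with open(target_file, "w", encoding="utf-8") as out:
        for source in source_files:
            with open(source, encoding="utf-8") as handle:
                for line in handle:
                    entry = line.rstrip("\r\n")
                    if not entry.strip() or entry in seen:
                        continue
                    seen.add(entry)
                    out.write(entry + "\n")
                    written += 1
    return written
```

For example, merge_dictionaries(["umls_filtered.txt", "genes.txt"], "global_dictionary.txt") would produce the single file expected by the pipeline; both input file names are placeholders.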

JCoRe Pipeline

  • Unpack the *.zip files in jcore-pipelines, there are 2 pipelines:
    • detectUMLSentries
    • detectStopwords
  • Create the folder data/files in each pipeline directory and put the data to be analyzed there (subdirectories are not read; be careful with *.tar files)
  • Put the global dictionary file into jcore-pipelines/detectUMLSentries/resources
  • Adapt the file names of the dictionary and the stopword dictionary in the following files:
    • desc/GazetteerAnnotator Template Descriptor with Configurable External Resource.xml
    • descAll/GazetteerAnnotator Template Descriptor with Configurable External Resource.xml
  • Open a terminal and change into one of the pipeline directories
  • Start the pipeline with java -jar ../jcore-pipeline-runner-base-0.4.1-SNAPSHOT-cli-assembly.jar run.xml
  • Results
    • offsets.tsv
    • data/outData/output-xmi
  • This JCoRe pipeline is derived from the JULIE Lab's own jcore-pipeline-modules (see also https://zenodo.org/record/4066619#.X3sPVS8Rp-U)
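Because subdirectories of data/files are not read and *.tar files need care, a quick check of the input folder before starting the pipeline can help. The sketch below is a hypothetical helper, not part of the repository. It sorts the entries of a directory into plain files, subdirectories, and *.tar archives so that anything in the last two groups can be unpacked or moved first.

```python
from pathlib import Path


def check_input_dir(data_dir):
    """Group the direct entries of data_dir by how they need to be handled.

    Returns three sorted lists of names: plain files, subdirectories
    (which the pipeline does not read), and *.tar archives.
    """
    plain_files, subdirectories, tar_archives = [], [], []
    for entry in sorted(Path(data_dir).iterdir()):
        if entry.is_dir():
            subdirectories.append(entry.name)
        elif entry.name.endswith(".tar"):
            tar_archives.append(entry.name)
        else:
            plain_files.append(entry.name)
    return plain_files, subdirectories, tar_archives
```

Calling check_input_dir("data/files") from inside a pipeline directory returns the three lists; if the second or third list is non-empty, the corresponding content should be flattened or unpacked before the run.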

Evaluation of Annotations

  • To calculate the inter-annotator agreement between human annotators, follow the instructions for bratiaa
  • To calculate precision and recall of the automatically created annotations against the human-annotated data, run:
    • pip install bratutils
    • python src/main/python/umls_evaluation.py <path to gold annotations> <path to automatic annotations>
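To make the reported metrics concrete, here is a small sketch of exact-match precision, recall, and F1 over annotation spans. It is not the logic of umls_evaluation.py and does not use bratutils. It assumes annotations come in brat standoff format (text-bound lines of the form ID, tab, "Type start end", tab, text) and, as a simplification, counts each fragment of a discontinuous annotation as its own span. The function names and the labels in the example are hypothetical.

```python
def read_brat_spans(ann_text):
    """Extract (start, end, label) triples from brat text-bound annotations."""
    spans = set()
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes, and other annotation kinds
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        label, _, offsets = fields[1].partition(" ")
        # Discontinuous annotations list several fragments separated by ';'.
        for fragment in offsets.split(";"):
            start, end = fragment.split()
            spans.add((int(start), int(end), label))
    return spans


def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 between two span collections."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = read_brat_spans("T1\tDisease 0 5\tKrebs\nT2\tDrug 10 15\tabcde\n")
auto = read_brat_spans("T1\tDisease 0 5\tKrebs\nT2\tDrug 20 25\tfghij\n")
print(precision_recall_f1(gold, auto))  # (0.5, 0.5, 0.5)
```

Exact matching is the strictest choice; an evaluation that also credits partially overlapping spans would report higher values on the same data.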
