SemClinBr – a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks - Oliveira et al - 2020
📌 Paper
We developed a semantically annotated corpus using clinical texts from multiple medical specialties, document types, and institutions. The result of this work is SemClinBr, a corpus of 1,000 clinical notes labeled with 65,117 entities and 11,263 relations.
- A survey described the essential aspects and lessons learned regarding biomedical semantic annotation
- A fine-grained and replicable annotation schema was defined
- An optimized web-based annotation tool supported the annotation process
- The intrinsic and extrinsic evaluations of the corpus corroborate its application in different clinical NLP tasks
To provide a foundation for biomedical NLP in Brazilian Portuguese (pt-br) and to address the lack of broad-scope, in-depth clinical corpora outside English, we developed a semantically annotated corpus to support both the development and the evaluation of clinical NLP tasks.
To realize cohesive, reliable, unbiased, and fast annotations, most studies share the following common steps:
- double annotation → to reduce bias and improve reliability
- guidelines/scheme definition → to improve reliability and support annotators
- annotation agreement measures → to ensure reliability
- use of an annotation tool → to ease/speed up the annotation work
- annotation characterization (e.g., semantic type, relations) based on the desired task → for better scope definition
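The agreement step above can be illustrated with a small sketch. These notes do not say which agreement measure the paper uses, so Cohen's kappa is an assumption here, chosen only because it is a common choice for two annotators. The function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for four text spans, one list per annotator.
annotator_1 = ["Sign or Symptom", "Negation", "Sign or Symptom", "Abbreviation"]
annotator_2 = ["Sign or Symptom", "Negation", "Clinical Attribute", "Abbreviation"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

This version assumes both annotators labeled the same pre-segmented spans. Real entity annotation also involves boundary disagreements, which this sketch does not cover.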
Our data are obtained from two different data sources:
- a corpus of 2,094,929 entries from a group of hospitals in Brazil between 2013 and 2018
- a corpus of 5,617 entries from a University Hospital between 2002 and 2007
Each entry has structured data (i.e., gender, birth date, begin date, end date) and a single free-text field.
We selected almost 500 clinical notes from cardiology and nephrology specialties. Owing to the lack of corpora for pt-br, we randomly selected documents from other medical areas to complete 1,000 clinical narratives.
We opted to use the UMLS semantic types (STY) as our annotation tags (e.g., "Body Location or Region", "Sign or Symptom", "Clinical Attribute"). We added two more types to our tagset: the "Negation" and "Abbreviation" tags.
Sometimes, when we want to extract semantic meaning from clinical text, the semantic value of a concept alone is not enough to infer important events and situations. Hence, we incorporated the annotation of relations between clinical concepts into the guidelines. The relation annotation schema was partially derived from the UMLS Relationship Hierarchy. Only two relationship types (RTY) were included: "associated_with" and "negation_of".
We decided to build our own tool, which is reported in Oliveira et al. (2017).
Our corpus comprised 100 UMLS semantic types representing the entities, 2 extra semantic types typifying Abbreviations and Negations, and 2 relationship types defining the relations between clinical entities.
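One way to picture the schema described above is as two record types, one for entities and one for relations. This is a minimal sketch, not the corpus's actual storage format. The class names, the field names, the character-offset convention, and the direction of "negation_of" (from cue to negated concept) are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The two tags added on top of the UMLS semantic types, and the two
# relationship types named in the notes.
EXTRA_TYPES = {"Negation", "Abbreviation"}
RELATION_TYPES = {"associated_with", "negation_of"}


@dataclass(frozen=True)
class Entity:
    id: str
    start: int  # character offset, inclusive (assumed convention)
    end: int    # character offset, exclusive (assumed convention)
    text: str
    sty: str    # a UMLS semantic type, or one of EXTRA_TYPES


@dataclass(frozen=True)
class Relation:
    rty: str
    source: str  # id of the source entity
    target: str  # id of the target entity

    def __post_init__(self):
        if self.rty not in RELATION_TYPES:
            raise ValueError(f"unknown relationship type: {self.rty}")


# Hypothetical note: "Patient denies pain."
note = "Paciente nega dor."
cue = Entity("T1", 9, 13, "nega", "Negation")
concept = Entity("T2", 14, 17, "dor", "Sign or Symptom")
# Assumed direction: the negation cue points at the concept it negates.
negation = Relation("negation_of", cue.id, concept.id)
```

Storing offsets next to the surface text makes it cheap to check that an annotation still matches the note it was made on.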
The annotation process was 100% double-annotated and adjudicated, and lasted 14 months, resulting in a corpus composed of 1,000 documents (148,033 tokens), with 65,129 entities and 11,263 relations labeled.
The functionality of an annotated corpus can be tested by applying it in a downstream NLP task.
To train and assess their method in the pt-br clinical scope, Dalloux et al. (2020) used a segment of our corpus with the negation-related annotations. This includes not only the negation cue labeled with the "Negation" STY, but also the concepts related to it through the "negation_of" relation, so that detecting the negation scope would be possible.
Souza et al. (2019) describe their preliminary work, with promising results, on exploring CRF algorithms to perform NER in clinical pt-br texts.
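As a rough illustration of how cue and scope could be recovered from such annotations, the sketch below groups negated concepts under their negation cue. It is not the method of Dalloux et al. (2020). The function name, the dictionary layout, the relation direction (cue to concept), and the sample annotations are hypothetical.

```python
def negated_concepts(entities, relations):
    """Map each negation cue id to the ids of the concepts it negates."""
    # Cues are the entities tagged with the extra "Negation" type.
    cue_ids = {eid for eid, ent in entities.items() if ent["sty"] == "Negation"}
    scopes = {cue: [] for cue in cue_ids}
    for rty, source, target in relations:
        # Compare case-insensitively so "Negation_of" and "negation_of" match.
        if rty.lower() == "negation_of" and source in cue_ids:
            scopes[source].append(target)
    return scopes


# Hypothetical annotations for "nega dor no peito e febre"
# ("denies chest pain and fever").
entities = {
    "T1": {"text": "nega", "sty": "Negation"},
    "T2": {"text": "dor", "sty": "Sign or Symptom"},
    "T3": {"text": "peito", "sty": "Body Location or Region"},
    "T4": {"text": "febre", "sty": "Sign or Symptom"},
}
relations = [
    ("negation_of", "T1", "T2"),
    ("associated_with", "T2", "T3"),
    ("negation_of", "T1", "T4"),
]
scopes = negated_concepts(entities, relations)
```

Here `scopes` links the cue `T1` to the two negated concepts, while the "associated_with" relation is ignored.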