# MLCommons Knowledge Graph Datasets

## Datasets

This section contains a non-exhaustive list of datasets that should be useful for evaluating *Relation Extraction (RE)* tasks with LLMs. The loose criterion was to find datasets in which "document level" relations can be extracted across the corpus, since this matches how relation extraction is currently used for knowledge graph construction from chunked texts.

### Datasets annotated for RE

Datasets associated with Relation Extraction tasks and evaluation (in no particular order):

* πŸ“Š [DocRED](https://github.com/thunlp/DocRED) - πŸ““ *"[DocRED: A Large-Scale Document-Level Relation Extraction Dataset](https://paperswithcode.com/paper/docred-a-large-scale-document-level-relation)"*, Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, Maosong Sun

DocRED is a relation extraction dataset constructed from Wikipedia and Wikidata. Each document in the dataset is human-annotated with named entity mentions, coreference information, intra- and inter-sentence relations, and supporting evidence. DocRED requires reading multiple sentences in a document to extract entities and infer their relations by synthesizing all information of the document. Along with the human-annotated data, the dataset provides large-scale distantly supervised data.
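
A rough sketch of how the release can be consumed (the field names follow the format documented in the DocRED repository; the file path is a placeholder):

```python
import json

# Sketch of reading DocRED's human-annotated split. Field names follow the
# format documented in the DocRED repository; the path may differ locally.
with open("train_annotated.json") as f:
    docs = json.load(f)

doc = docs[0]
for label in doc["labels"]:
    head = doc["vertexSet"][label["h"]][0]["name"]  # first mention of the head entity
    tail = doc["vertexSet"][label["t"]][0]["name"]  # first mention of the tail entity
    # label["r"] is a Wikidata property ID (e.g. "P569"); "evidence" lists
    # the indices of the sentences that support the relation.
    print(head, label["r"], tail, label["evidence"])
```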


* πŸ“Š [TACRED](https://nlp.stanford.edu/projects/tacred/) - πŸ““ *"[Position-aware Attention and Supervised Data Improve Slot Filling](https://paperswithcode.com/paper/position-aware-attention-and-supervised-data)"*, Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, Christopher D. Manning

Organized relational knowledge in the form of "knowledge graphs" is important for many applications. However, the ability to populate knowledge bases with facts automatically extracted from documents has improved frustratingly slowly. This paper simultaneously addresses two issues that have held back prior work. We first propose an effective new model, which combines an LSTM sequence model with a form of entity position-aware attention that is better suited to relation extraction. Then we build TACRED, a large (119,474 examples) supervised relation extraction dataset obtained via crowdsourcing and targeted towards TAC KBP relations. The combination of better supervised data and a more appropriate high-capacity model enables much better relation extraction performance. When the model trained on this new dataset replaces the previous relation extraction component of the best TAC KBP 2015 slot filling system, its F1 score increases markedly from 22.2% to 26.7%.
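
A minimal sketch of iterating the JSON release obtained via LDC (field names follow the format described in the TACRED data documentation; the path is a placeholder):

```python
import json

# Sketch of reading TACRED examples; each record carries the tokenized
# sentence plus subject/object spans and a TAC KBP relation label.
with open("tacred/train.json") as f:
    examples = json.load(f)

ex = examples[0]
subj = " ".join(ex["token"][ex["subj_start"]:ex["subj_end"] + 1])
obj = " ".join(ex["token"][ex["obj_start"]:ex["obj_end"] + 1])
print(subj, ex["relation"], obj)  # relation is a label such as "per:title"
```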

* πŸ“Š [ACE 2005 Multilingual Training Corpus](https://catalog.ldc.upenn.edu/LDC2006T06)

ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.

* πŸ“Š [Adverse Drug Events (ADE) Corpus](https://huggingface.co/datasets/ade-benchmark-corpus/ade_corpus_v2) - πŸ““ *"[Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports](https://pubmed.ncbi.nlm.nih.gov/22554702/)"*, Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, Luca Toldo

A significant amount of information about drug-related safety issues such as adverse effects is published in medical case reports that can only be explored by human readers due to their unstructured nature. The work presented here aims at generating a systematically annotated corpus that can support the development and validation of methods for the automatic extraction of drug-related adverse effects from medical case reports. The documents are systematically double annotated in various rounds to ensure consistent annotations. The annotated documents are finally harmonized to generate representative consensus annotations. In order to demonstrate an example use case scenario, the corpus was employed to train and validate models for the classification of informative versus non-informative sentences. A Maximum Entropy classifier trained with simple features and evaluated by 10-fold cross-validation achieved an F₁ score of 0.70, indicating a potentially useful application of the corpus.
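
A short sketch of loading the corpus from the Hugging Face Hub; the configuration and column names below are taken from the dataset card and may change over time:

```python
from datasets import load_dataset

# Sketch of loading the drug/adverse-effect relation split of ADE v2.
ade = load_dataset("ade-benchmark-corpus/ade_corpus_v2",
                   "Ade_corpus_v2_drug_ade_relation")

for row in ade["train"].select(range(3)):
    print(row["drug"], "->", row["effect"])  # spans annotated within row["text"]
```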

* πŸ“Š [WebNLG](https://synalp.gitlabpages.inria.fr/webnlg-challenge/) - πŸ““ *"[Creating Training Corpora for NLG Micro-Planners](https://paperswithcode.com/paper/creating-training-corpora-for-nlg-micro)"*, Claire Gardent, Anastasia Shimorina, Shashi Narayan, Laura Perez-Beltrachini

The WebNLG corpus comprises sets of triplets describing facts (entities and relations between them) paired with the corresponding facts in the form of natural language text. The corpus contains sets with up to 7 triplets each, along with one or more reference texts for each set. The test set is split into two parts: seen, containing inputs created for entities and relations belonging to DBpedia categories that were seen in the training data, and unseen, containing inputs extracted for entities and relations belonging to 5 unseen categories.

Initially, the dataset was used for the WebNLG natural language generation challenge, which consists of mapping the sets of triplets to text, including referring expression generation, aggregation, lexicalization, surface realization, and sentence segmentation. The corpus is also used for the reverse task of triplet extraction.
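
A toy, hand-written illustration of the triple/text pairing (the `subject | property | object` encoding follows the challenge's data format; the content here is illustrative, not copied from the corpus):

```python
# A WebNLG-style entry: a set of RDF-style triples and one reference text.
triples = [
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
]
reference = "Alan Bean, who was born in Wheeler, Texas, served as a test pilot."

# Forward task (NLG): generate `reference` from `triples`.
# Reverse task (RE): recover `triples` from `reference`.
for subj, prop, obj in triples:
    print(f"{subj} | {prop} | {obj}")
```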

* πŸ“Š [Plant Science Knowledge Graph Corpus - PICKLE](https://zenodo.org/records/10076664) - πŸ““ *"[Plant Science Knowledge Graph Corpus: a gold standard entity and relation corpus for the molecular plant sciences](https://academic.oup.com/insilicoplants/article/6/1/diad021/7413143#434923034)"*, Serena Lotreck, Kenia Segura AbΓ‘, Melissa D Lehti-Shiu, Abigail Seeger, Brianna N I Brown, Thilanka Ranaweera, Ally Schumacher, Mohammad Ghassemi, Shin-Han Shiu

Natural language processing (NLP) techniques can enhance our ability to interpret plant science literature. Many state-of-the-art algorithms for NLP tasks require high-quality labelled data in the target domain, in which entities like genes and proteins, as well as the relationships between entities, are labelled according to a set of annotation guidelines. While there exist such datasets for other domains, these resources need development in the plant sciences. Here, we present the Plant ScIenCe KnowLedgE Graph (PICKLE) corpus, a collection of 250 plant science abstracts annotated with entities and relations, along with its annotation guidelines. The annotation guidelines were refined by iterative rounds of overlapping annotations, in which inter-annotator agreement was leveraged to improve the guidelines. To demonstrate PICKLE’s utility, we evaluated the performance of pretrained models from other domains and trained a new, PICKLE-based model for entity and relation extraction (RE). The PICKLE-trained models exhibit the second-highest in-domain entity performance of all models evaluated, as well as a RE performance that is on par with other models. Additionally, we found that computer science-domain models outperformed models trained on a biomedical corpus (GENIA) in entity extraction, which was unexpected given the intuition that biomedical literature is more similar to PICKLE than computer science. Upon further exploration, we established that the inclusion of new types on which the models were not trained substantially impacts performance. The PICKLE corpus is, therefore, an important contribution to training resources for entity and RE in the plant sciences.

* πŸ“Š [Few-Shot Relation Classification Dataset (FewRel)](https://thunlp.github.io/fewrel) - πŸ““ *"[FewRel: A Large-Scale Supervised Few-Shot Relation Classification Dataset with State-of-the-Art Evaluation](https://paperswithcode.com/paper/fewrel-a-large-scale-supervised-few-shot)"*, Xu Han, Hao Zhu, Pengfei Yu, Ziyun Wang, Yuan Yao, Zhiyuan Liu, Maosong Sun

FewRel (Few-Shot Relation Classification Dataset) contains 100 relations and 70,000 instances from Wikipedia. The dataset is divided into three subsets: a training set (64 relations), a validation set (16 relations), and a test set (20 relations).
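
A sketch of the N-way K-shot episode sampling used for few-shot evaluation (e.g. 5-way 1-shot); the `dataset` mapping below is an assumed in-memory representation, not FewRel's file format:

```python
import random

# Sample one N-way K-shot episode: pick N relations, give the model K
# labelled instances per relation, and ask it to classify one query.
def sample_episode(dataset, n_way=5, k_shot=1):
    relations = random.sample(sorted(dataset), n_way)
    support = {r: random.sample(dataset[r], k_shot) for r in relations}
    query_relation = random.choice(relations)
    query = random.choice(dataset[query_relation])
    return support, query, query_relation  # model must recover query's relation
```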

* πŸ“Š [SciERC](http://nlp.cs.washington.edu/sciIE/) - πŸ““ *"[Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction](https://paperswithcode.com/paper/multi-task-identification-of-entities)"*, Yi Luan, Luheng He, Mari Ostendorf, Hannaneh Hajishirzi

The SciERC dataset is a collection of 500 scientific abstracts annotated with scientific entities, their relations, and coreference clusters. The abstracts are taken from 12 AI conference/workshop proceedings in four AI communities, drawn from the Semantic Scholar Corpus. SciERC extends previous datasets of scientific articles (SemEval 2017 Task 10 and SemEval 2018 Task 7) by extending entity types, relation types, and relation coverage, and by adding cross-sentence relations using coreference links.
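
A minimal sketch of reading SciERC's processed JSON-lines release (field names follow the processed-data format from the project page; the path is a placeholder):

```python
import json

# Each line is one document; "relations" holds per-sentence lists of
# (start1, end1, start2, end2, label) spans over document-level token offsets.
with open("sciERC/train.json") as f:
    doc = json.loads(f.readline())

for sentence_relations in doc["relations"]:
    for start1, end1, start2, end2, label in sentence_relations:
        print(label, (start1, end1), (start2, end2))
```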

## Dataset possibilities

* πŸ“Š [Medical Abbreviation Disambiguation - MeDAL](https://github.com/McGill-NLP/medal) - πŸ““ *"[MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining](https://arxiv.org/abs/2012.13978)"*, Zhi Wen, Xing Han Lu, Siva Reddy

One of the biggest challenges that prohibit the use of many current NLP methods in clinical settings is the availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.
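
A minimal sketch, assuming the CSV layout described in the MeDAL repository (`TEXT`, `LOCATION`, `LABEL` columns); the path is a placeholder:

```python
import pandas as pd

# Assumed layout: TEXT holds the abstract, LOCATION the token index of the
# abbreviation, and LABEL its correct expansion.
df = pd.read_csv("medal/train.csv")  # placeholder path

row = df.iloc[0]
tokens = row["TEXT"].split()
print(tokens[row["LOCATION"]], "->", row["LABEL"])  # abbreviation -> expansion
```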

## Papers

[[1]](#medgraphrag) *"Medical Graph RAG: Towards Safe Medical Large Language Model via Graph Retrieval-Augmented Generation"*, Junde Wu, Jiayuan Zhu, Yunli Qi, Jingkun Chen, Min Xu, Filippo Menolascina, Vicente Grau; [https://doi.org/10.48550/arXiv.2408.04187](https://doi.org/10.48550/arXiv.2408.04187)

We introduce a novel graph-based Retrieval-Augmented Generation (RAG) framework specifically designed for the medical domain, called **MedGraphRAG**, aimed at enhancing Large Language Model (LLM) capabilities for generating evidence-based medical responses, thereby improving safety and reliability when handling private medical data. Graph-based RAG (GraphRAG) leverages LLMs to organize RAG data into graphs, showing strong potential for gaining holistic insights from long-form documents. However, its standard implementation is overly complex for general use and lacks the ability to generate evidence-based responses, limiting its effectiveness in the medical field. To extend the capabilities of GraphRAG to the medical domain, we propose unique Triple Graph Construction and U-Retrieval techniques over it. In our graph construction, we create a triple-linked structure that connects user documents to credible medical sources and controlled vocabularies. In the retrieval process, we propose U-Retrieval, which combines Top-down Precise Retrieval with Bottom-up Response Refinement to balance global context awareness with precise indexing. These efforts enable both source information retrieval and comprehensive response generation. Our approach is validated on 9 medical Q&A benchmarks, 2 health fact-checking benchmarks, and one collected dataset testing long-form generation. The results show that MedGraphRAG consistently outperforms state-of-the-art models across all benchmarks, while also ensuring that responses include credible source documentation and definitions. Our code is released at [https://github.com/MedicineToken/Medical-Graph-RAG](https://github.com/MedicineToken/Medical-Graph-RAG).
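
An illustrative sketch of the triple-linked idea, not the authors' implementation; the document ID, source passage, and vocabulary code below are hypothetical examples:

```python
# An entity extracted from a user document is linked both to a credible
# source passage and to a controlled-vocabulary term.
graph = [
    ("note-17", "mentions", "metformin"),             # user document -> entity
    ("metformin", "grounded_in", "drug-label:4091"),  # entity -> credible source
    ("metformin", "normalized_to", "RxNorm:6809"),    # entity -> vocabulary code
]

for head, relation, tail in graph:
    print(head, relation, tail)
```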

[[2]](#revisitingre) *"Revisiting Relation Extraction in the era of Large Language Models"*, Somin Wadhwa, Silvio Amir, Byron C. Wallace; [https://doi.org/10.48550/arXiv.2305.05003](https://doi.org/10.48550/arXiv.2305.05003)

Relation extraction (RE) is the core NLP task of inferring semantic relationships between entities from text. Standard supervised RE techniques entail training modules to tag tokens comprising entity spans and then predict the relationship between them. Recent work has instead treated the problem as a *sequence-to-sequence* task, linearizing relations between entities as target strings to be generated conditioned on the input. Here we push the limits of this approach, using larger language models (GPT-3 and Flan-T5 large) than considered in prior work and evaluating their performance on standard RE tasks under varying levels of supervision. We address issues inherent to evaluating generative approaches to RE by doing human evaluations, in lieu of relying on exact matching. Under this refined evaluation, we find that: (1) Few-shot prompting with GPT-3 achieves near SOTA performance, i.e., roughly equivalent to existing fully supervised models; (2) Flan-T5 is not as capable in the few-shot setting, but supervising and fine-tuning it with Chain-of-Thought (CoT) style explanations (generated via GPT-3) yields SOTA results. We release this model as a new baseline for RE tasks.
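
A sketch of the linearization idea, with a hypothetical few-shot prompt rather than the paper's exact prompt:

```python
# Relations are emitted as target strings for a model such as GPT-3 or
# Flan-T5 to generate; the example sentences are invented.
few_shot = [
    ("Marie Curie won the Nobel Prize in Physics.",
     '("Marie Curie", "award received", "Nobel Prize in Physics")'),
]

def build_prompt(examples, text):
    lines = ["List the relations in each sentence as (head, relation, tail) tuples."]
    for sentence, relations in examples:
        lines += [f"Sentence: {sentence}", f"Relations: {relations}"]
    lines += [f"Sentence: {text}", "Relations:"]
    return "\n".join(lines)

print(build_prompt(few_shot, "Ada Lovelace worked with Charles Babbage."))
```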

[[3]](#resurvey) *"A survey on Relation Extraction"*, Kartik Detroja, C. K. Bhensdadia, Brijesh S. Bhatt; [https://doi.org/10.1016/j.iswa.2023.200244](https://doi.org/10.1016/j.iswa.2023.200244)

With the advent of the Internet, the daily production of digital text in the form of social media, emails, blogs, news items, books, research papers, and Q&A forums has increased significantly. This unstructured or semi-structured text contains a huge amount of information. Information Extraction (IE) can extract meaningful information from text sources and present it in a structured format. The sub-tasks of IE include Named Entity Recognition (NER), Event Extraction, Relation Extraction (RE), Sentiment Extraction, Opinion Extraction, Terminology Extraction, Reference Extraction, and so on.
One way to represent information in the text is in the form of entities and relations representing links between entities. The Entity Extraction task identifies entities from the text, and the Relation Extraction (RE) task can identify relationships between those entities. Many NLP applications can benefit from relational information derived from natural language, including Structured Search, Knowledge Base (KB) population, Information Retrieval, Question-Answering, Language Understanding, Ontology Learning, etc. This survey covers (1) basic concepts of Relation Extraction; (2) various Relation Extraction methodologies; (3) Deep Learning techniques for Relation Extraction; and (4) different datasets that can be used to evaluate the RE system.

[[4]](#science-llm-nerre) *"Structured information extraction from scientific text with large language models"*, Dagdelen, J., Dunn, A., Lee, S. et al.; Nature Communications 15, 1418 (2024). [https://doi.org/10.1038/s41467-024-45563-x](https://doi.org/10.1038/s41467-024-45563-x)

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
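
An illustrative parse of the kind of JSON-object output the paper describes; the schema below is a hypothetical example for the dopant/host task, not the paper's exact schema:

```python
import json

# A fine-tuned model's completion is parsed as a list of JSON records.
completion = '[{"host": "TiO2", "dopant": "nitrogen", "application": "photocatalysis"}]'

for record in json.loads(completion):
    print(record["dopant"], "doped into", record["host"], "for", record["application"])
```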

## Code

* [Medical Graph RAG](https://github.com/MedicineToken/Medical-Graph-RAG) code from [[1]](#medgraphrag)