CEAR: Creating a knowledge graph of chemical entities and roles in scientific literature

This file will be updated in the coming days.

How it works

The corresponding paper describes the following steps, which are used to create the knowledge graph (KG):

1. Text Extraction

We downloaded 8,000 chemistry research papers from ChemRxiv. Text extraction runs in a separate NestJS project, which uses pdf2txt and creates a JSON file that includes:

the papers' metadata (downloaded from ChemRxiv)
the papers' full text
de-duplication information
the PDF file's pages with the full text
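
As an illustration of the JSON structure listed above, one record per paper could be assembled roughly as follows. This is only a sketch in Python (the actual extraction runs in the NestJS project), and the field names `metadata`, `fullText`, `dedupHash` and `pages`, as well as the hash-based de-duplication key, are assumptions rather than the project's real schema.

```python
import hashlib
import json


def build_record(metadata, pages):
    """Assemble one per-paper record; all field names are hypothetical."""
    full_text = "\n".join(pages)
    # Assumed de-duplication key: hash of the lowercased, whitespace-normalized text.
    normalized = " ".join(full_text.split()).lower()
    return {
        "metadata": metadata,
        "fullText": full_text,
        "dedupHash": hashlib.sha256(normalized.encode("utf-8")).hexdigest(),
        "pages": pages,
    }


record = build_record({"title": "Example paper"}, ["Page one text.", "Page two text."])
print(json.dumps(record, indent=2))
```

Two records with the same `dedupHash` would be treated as duplicates under this assumption.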

2. Chemical Entity and Chemical Role Recognition

The ner-chem-trainer notebook fine-tunes a Google ELECTRA model for named entity recognition (NER) on the following datasets:

The BC5CDR dataset consists of human annotations of chemicals, diseases and their interactions from 1,500 PubMed articles.
NLM-Chem contains 150 full-text articles from the biomedical literature, carefully selected because they contain chemical entities that are difficult for NER tools to find. Ten domain experts annotated the chemical entities in three annotation rounds.
CRAFT contains 97 full-text open access articles from the PubMed Central Open Access subset. It identifies all mentions of nearly all concepts from nine prominent biomedical ontologies, including ChEBI.

Both NLM-Chem and BC5CDR lack annotations for chemical roles (for example: solvent, catalyst, drug). We annotate them using a lexical approach for all chemical roles defined in the ChEBI ontology. This is accomplished in the corresponding loader classes BC5CDRLoader and NLMChemLoader.
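
A lexical annotation of this kind can be sketched as a dictionary lookup over the text. The snippet below is a minimal illustration and not the code of BC5CDRLoader or NLMChemLoader: the three-entry lexicon, the function name `annotate_roles`, and the handling of a plural "s" are assumptions. In practice the lexicon would be filled with the role labels from the ChEBI ontology.

```python
import re

# Tiny illustrative lexicon; the real list would come from the ChEBI role labels.
ROLE_LEXICON = ["solvent", "catalyst", "drug"]


def annotate_roles(text, lexicon=ROLE_LEXICON):
    """Return (start, end, surface form) for case-insensitive whole-word matches."""
    # Try longer terms first so multi-word roles win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")s?\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


sentence = "Ethanol was used as the solvent and palladium as a catalyst."
spans = annotate_roles(sentence)
print(spans)
```

The character offsets can then be mapped onto the token-level labels that the NER training data uses.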

3. Link Validation

The llama2-role-validator notebook uses Llama-2-7b-chat-hf to check, for every co-occurrence of a chemical entity and a chemical role in a sentence, whether the chemical entity actually has the mentioned role.
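
The validation step can be pictured as enumerating every entity/role pair within a sentence and turning each pair into a yes/no question for the model. The following sketch leaves out the model call itself; the input layout, the prompt wording and the helper names `cooccurrence_pairs`, `build_prompt` and `is_confirmed` are hypothetical and are not taken from the notebook.

```python
from itertools import product


def cooccurrence_pairs(sentences):
    """Yield (sentence text, entity, role) for every pair found in the same sentence."""
    for s in sentences:
        for entity, role in product(s["entities"], s["roles"]):
            yield s["text"], entity, role


def build_prompt(sentence, entity, role):
    # Hypothetical prompt wording, not the prompt used in the notebook.
    return (
        f'Sentence: "{sentence}"\n'
        f'Question: According to this sentence, does "{entity}" '
        f'have the role "{role}"? Answer yes or no.'
    )


def is_confirmed(model_answer):
    """Assumed convention: an answer starting with 'yes' confirms the pair."""
    return model_answer.strip().lower().startswith("yes")


sentences = [
    {
        "text": "The reaction was run in ethanol with palladium as catalyst.",
        "entities": ["ethanol", "palladium"],
        "roles": ["catalyst"],
    }
]
pairs = list(cooccurrence_pairs(sentences))
prompts = [build_prompt(*p) for p in pairs]
print(prompts[0])
```

Only pairs for which the model's answer is interpreted as a confirmation would be passed on to the next step.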

4. Knowledge Graph Creation

The kg_data_construction notebook links all confirmed pairs of chemical entities and roles to ChEBI. After these pairs are grouped and counted, the hyperparameter minRef is applied to filter them by their frequency in the literature set. The knowledge graph consists of the remaining relations and is stored in the Terse RDF Triple Language (Turtle).

Each chemical entity (obo:CHEBI_24431) and each role (obo:CHEBI_50906) in the graph is identified by its ChEBI identifier. Chemical entities or roles that are unknown to ChEBI are defined in the cear: namespace (@prefix cear: <https://wwwiti.cs.uni-magdeburg.de/iti_dke/cear/> .). The property obo:RO_0000087 (has role) is used in ChEBI to define the roles of chemical entities. The following listing shows an example with two chemical entities, ethylene glycol bis(2-aminoethyl)tetraacetate and PBS, both of which have the chemical role buffer:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix obo: <http://purl.obolibrary.org/obo/> .
@prefix cear: <https://wwwiti.cs.uni-magdeburg.de/iti_dke/cear/> .
obo:CHEBI_35225 rdf:type obo:CHEBI_50906 .
obo:CHEBI_35225 rdfs:label "buffer" .
obo:CHEBI_30741 rdf:type obo:CHEBI_24431 .
obo:CHEBI_30741 rdfs:label "ethylene glycol bis(2-aminoethyl)tetraacetate" .
obo:CHEBI_30741 obo:RO_0000087 obo:CHEBI_35225 .
cear:chem_4023 rdf:type obo:CHEBI_24431 .
cear:chem_4023 rdfs:label "PBS" .
cear:chem_4023 obo:RO_0000087 obo:CHEBI_35225 .

Depending on the minRef hyperparameter specification, different KGs are created. They can all be accessed at https://wwwiti.cs.uni-magdeburg.de/iti_dke/cear/
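
The effect of minRef can be sketched as counting how often each confirmed pair occurs and keeping only the frequent ones. This is a minimal illustration, not the notebook's code: the function names are hypothetical, and whether the notebook keeps pairs with a count equal to minRef (`>=`) or only those above it is an assumption made here.

```python
from collections import Counter


def filter_pairs(confirmed_pairs, min_ref):
    """Count identical (entity, role) pairs and keep those seen at least min_ref times."""
    counts = Counter(confirmed_pairs)
    # Assumption: the threshold is inclusive.
    return {pair: n for pair, n in counts.items() if n >= min_ref}


def to_turtle(pairs):
    """Emit one obo:RO_0000087 triple per kept pair (prefix declarations omitted)."""
    return "\n".join(f"{e} obo:RO_0000087 {r} ." for e, r in sorted(pairs))


confirmed = (
    [("obo:CHEBI_30741", "obo:CHEBI_35225")] * 3
    + [("cear:chem_4023", "obo:CHEBI_35225")]
)
kept = filter_pairs(confirmed, min_ref=2)
print(to_turtle(kept))
```

With a higher minRef fewer pairs survive, which yields a smaller graph.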

Additionally, we create two JSON files, nodes.json and edges.json, which contain the chemical entities and roles as nodes and the :hasRole relationships between them as edges. These files are used in a separate VueJS project that visualizes the KG with v-network-graph. The following image shows the KG created with a very high minRef of 50:

Setup

This section will contain detailed information on how to set up this project.

