Creating and testing the terpene synthase dictionaries for C. sinensis and S. lycopersicum: Anam Anjum
- Affiliation: J.D.B. Govt. Girls College, University of Kota, Kota, Rajasthan, India.
- Funding by: DST KARYA, Rajasthan.
The present research work, entitled “Creating and Testing the Terpene Synthase Dictionaries”, was conducted remotely at CEVOpen, an online platform at the interface of the National Institute of Plant Genome Research (NIPGR), India and the University of Cambridge, U.K., with the following objectives:
- To extract possible insights from the open scientific literature.
- To establish a terpene synthase dictionary that can be used to search for annotated scientific literature.
- To build the corpus of Camellia sinensis and Solanum lycopersicum using the pygetpapers toolkit.
CEVOpen is a global network initiative led by a collaborative group of young scientists. This scientific community draws valuable contributions from individuals in diverse fields such as biosciences, statistics, mathematics, computational biology, and computer science. CEVOpen's work has been presented at various venues, including Wikicite, COAR, and the Flash Forward Workshop, under the direction of Prof. Dr. Peter Murray-Rust.
In this project, we aim to work on the open scientific literature available for terpene synthase enzymes. Plant metabolic pathways synthesize various volatile organic compounds (VOCs), which help in pollination, plant defense (against herbivore attack), and mediating abiotic stress responses. Among VOCs, terpenes account for a large proportion. Terpenes are synthesized mainly via the methylerythritol phosphate (MEP) and mevalonic acid (MVA) pathways. Terpene synthase (TPS) enzymes from these two pathways play a crucial role in converting one terpene into another. Although TPSs from some organisms have been identified and well characterized, there is a huge gap between the functional annotation and the actual enzymatic activity of a particular TPS in plants. Here, we attempt to collect and classify all TPS genes available from genomic, proteomic and enzymatic studies.
CEVOpen was started by three organizations (ContentMine, EssoilDB, and Verriclear) as an Open Notebook project: all primary records of the research are stored on publicly accessible data portals as they are created. The CEVOpen project aims to develop coherent knowledge tools and resources for the automatic conversion of the open scientific literature into a semantic atlas of plant chemistry and properties.
- ContentMine (https://github.com/petermr/contentMine): ContentMine uses machines to automatically extract and interpret content from the literature. ContentMine works on the philosophy of creating an open resource for everyone which is also created by everyone. ContentMine was funded by the Shuttleworth Foundation (Fellowship to Peter Murray-Rust, University of Cambridge, UK).
- EssoilDB (http://www.nipgr.ac.in/Essoildb/): The ESSential OIL DataBase (Dr. Gitanjali Yadav, NIPGR, New Delhi, India) is a knowledge resource for plant volatile emissions, containing experimental records of essential oil composition data from published reports.
- Verriclear (https://verriclear.com/): Verriclear Natural Skin Essentials Ltd. is an innovative developer of phytotherapy skincare products derived from bioactive plant extracts from around the world. Founded by Emanuel Faria, Verriclear creates 100% plant-based natural skincare formulations for skin conditions.
Using EuropePMC and ContentMine's getpapers/ami, up to 10,000 articles on essential oils are available. These report some or all of:
- plant identity
- literature survey of previously reported activities
- oil chemical composition (by chemical name)
- experimental determination of activity (often against organisms)
Modern medicine uses terpenes at large scale in various treatment drugs (Franklin et al. 2001). Commonly used plants such as tea tree (Melaleuca alternifolia), thyme, Cannabis, and citrus fruits (lemon, orange, mandarin) provide a wide range of medicinal value (Perry et al. 2000). Terpenes are also used to enhance skin penetration and to prevent inflammatory diseases (Franklin et al. 2001). Tea tree oil is a volatile essential oil famous for its antimicrobial properties, and acts as the active ingredient used to treat cutaneous infections (Carson et al. 2006). Tea tree oil has grown in popularity in recent years in alternative medicine (Perry et al. 2000). The plant has many medicinal properties: anticancer, antimicrobial, antifungal, antiviral, antihyperglycemic, analgesic, anti-inflammatory, and antiparasitic (Franklin et al. 2001).
In molecular biology, this protein domain belongs to the terpene synthase (TPS) family. Its role is to synthesize terpenes, which are part of primary metabolism (such as sterols and carotene) as well as secondary metabolism. This entry focuses on the N-terminal domain of the TPS protein. Terpenoids are structurally diverse and the most abundant plant secondary metabolites. They play several physiological and ecological roles in plant life through direct and indirect plant defenses, by attracting pollinators, and through other interactions between plants and their environment; they also matter to human society because of their enormous applications in the pharmaceutical, food and cosmetics industries. With the aid of genetic engineering, their role can be magnified to a broad spectrum by improving the genetic ability of crop plants, enhancing the aroma quality of fruits and flowers, and increasing the production of pharmaceutical terpenoid content in medicinal plants. Due to their broad distribution and functional versatility, efforts are being made to decode terpenoid biosynthetic pathways and to comprehend their regulatory mechanisms.
Figure 1: TPS from C. sinensis and S. lycopersicum
- The command below states that our purpose is to compare our corpus with the CEVOpen dictionaries such as activity, country, essential oil plant, and plant compound, which give insights into phytochemistry and its relevance to medicinal plants and chemicals. The ami section command is used to divide scientific articles into the following sections: front, body, back, floats, and groups. Sectioning the downloaded files creates a tree structure, which aids in navigating each file's content. Sectioning is accomplished by executing the following command at the command prompt:
ami -p "activity" section
- ami search performs a search and analysis of the terms in the project repository, returning a frequency data table of the terms and a histogram of the corpus.
ami -p "activity" search --dictionary activity.xml eoPlant.xml plant_compound.xml
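What ami search computes can be illustrated with a minimal sketch (this is not the ami implementation, which is in Java): count occurrences of dictionary terms per document, then aggregate across the corpus for the histogram. The document IDs, texts, and terms below are invented for illustration.

```python
# Toy model of a dictionary search over a corpus: per-document term
# frequencies plus a corpus-wide total (the data behind ami's histogram).
from collections import Counter

docs = {  # invented document IDs and texts
    "PMC0000001": "limonene and linalool showed antimicrobial activity",
    "PMC0000002": "linalool exhibited anti-inflammatory activity in vivo",
}
dictionary_terms = {"limonene", "linalool", "antimicrobial", "anti-inflammatory"}

per_doc = {
    doc_id: Counter(w for w in text.lower().split() if w in dictionary_terms)
    for doc_id, text in docs.items()
}
corpus_total = sum(per_doc.values(), Counter())  # aggregate frequencies

print(corpus_total.most_common())
```

The real tool additionally matches multi-word terms and synonyms from the dictionary XML; this sketch only shows the frequency-table idea.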
- For valuable insights, alternative dictionaries such as essential oil plants, plant compounds, and country were used to extract possible insights from the open scientific literature regarding the association of various essential oil plants and compounds with their medicinal activity. These alternative dictionaries are available at: https://github.com/petermr/CEVOpen/tree/master/dictionary
Here is the link to previous work done by Radhu Ladani (wiki page): https://github.com/petermr/CEVOpen/wiki/Mini-Project:-Phytochemical-ontologies-for-analyzing-the-literature-on-essential-oils
3.1.1 Pygetpapers
pygetpapers was created by Ayush Garg. It is a Python version of getpapers that helps text miners with their work: it interfaces with open scientific text repositories, makes requests to those repositories, gathers hits, and downloads the articles in a systematic and non-interactive manner.
Primary URL: https://github.com/petermr/pygetpapers
Installation: https://github.com/petermr/pygetpapers/blob/main/README.md#6-installation
- Download Python along with pip from: https://www.python.org/downloads/
- Clone the repository to the local computer: git clone https://github.com/petermr/pygetpapers
- Run the command: pip install git+git://github.com/petermr/pygetpapers
- Run the command pygetpapers in the command line to check that the installation succeeded; it prints the command options for pygetpapers as below:
Figure 2: Pygetpapers usage
- General syntax:
pygetpapers -q <"query"> -o <output directory> -x (xml) -p (pdf) -k <number of papers required> -c <csv metadata file>
3.1.2 ami
ami turns documents into knowledge. It includes tools for downloading scientific papers, processing documents into sections and XML, analyzing components (text, tables, diagrams), creating dictionaries, and searching. ami is a novel toolkit for querying and analyzing a small-to-medium collection of documents, usually on local storage; it is a declarative system comprised of commands and data modules, and is written in Java.
Primary URL: https://github.com/petermr/ami3
Installation: https://github.com/petermr/openVirus/wiki/INSTALLING-ami3
- Download the backend software (Java JDK, Maven, and Git) and set the path for each.
- Open the command line and clone the ami3 repository: git clone https://github.com/petermr/ami3
- In the ami3 directory, run the command: mvn install -Dmaven.test.skip=true
**ami section** divides research papers into the following sections: front, body, back, floats, and groups. Sectioning the downloaded files creates a tree structure, which aids in navigating each file's content. Sectioning is accomplished through ami's section function, executed via the command prompt.
General syntax: ami -p <cproject> section
**ami search** analyses and searches the keywords in the project repository, returning a frequency data table of the terms and a histogram of the corpus.
General syntax: ami -p <cproject> search --dictionary <dictionary>
**amidict** is a set of commands to create dictionaries.
General syntax: amidict -v --dictionary eo_CamelliaTPS --directory gene --input gene.txt create --informat list --outformats xml
Wikidata
Wikidata is a collaborative secondary database for structured data that is primarily used by the Wikimedia family of projects. Wikidata possesses numerous characteristics necessary for scientific knowledge, including multilingualism, human and machine editability, and a linked-data approach. Wikidata is structured in triples and primarily consists of items, each of which has a label, a description, and an unlimited number of aliases. Items are uniquely identified by a Q followed by a number. All of this information is available in many languages, even if the data originated in a different language. In Wikidata, statements describe an item's detailed characteristics and are composed of a property and a value.
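The data model described above can be sketched in a few lines of Python: items carry a label, description, and aliases, identifiers match "Q followed by a number", and a statement is a (item, property, value) triple. The identifiers used here are real Wikidata IDs (Q42 is Douglas Adams, P31 is "instance of", Q5 is "human"), chosen only because they are well-known examples.

```python
# Toy model of Wikidata's structure: Q-identifiers, items, and triples.
import re

def is_item_id(s: str) -> bool:
    """Items are uniquely identified by 'Q' followed by a number."""
    return re.fullmatch(r"Q[1-9][0-9]*", s) is not None

item = {
    "id": "Q42",                        # Douglas Adams
    "label": "Douglas Adams",
    "description": "English writer and humourist",
    "aliases": ["Douglas Noel Adams"],
}

# A statement is composed of an item, a property, and a value:
statement = ("Q42", "P31", "Q5")        # Douglas Adams - instance of - human

print(is_item_id(item["id"]), statement)
```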
Figure 4: Wikidata search
- The popular tools for searching and examining Wikidata items are the Wikidata Query Service, Geneawiki, Reasonator, and Tree of Life. Additionally, all data can be retrieved independently via the Wikidata API.
Wikidata: SPARQL query service
The Wikidata Query Service enables the extraction of specific information from Wikidata's vast network of linked and structured data. It is powered by SPARQL, a semantic query language for knowledge databases. SPARQL query forms include SELECT, which returns the values of variables or expressions as a table; ASK, which returns true/false; DESCRIBE, which returns a description of a resource; and CONSTRUCT, which builds RDF triples/graphs. The pilot SPARQL endpoint included a graphical user interface for query construction. SPARQL extracts semantically rich data via queries composed of logical triple patterns; it operates on a knowledge graph database such as Wikidata, enabling the extraction of knowledge and information through filters and constraints.
Figure 5: Wikidata SPARQL query service
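As an illustration of the SELECT form, here is the standard Wikidata Query Service example query (items that are an instance of (P31) house cat (Q146), with English labels). For terpene synthase entries, the Q/P identifiers would have to be looked up on Wikidata first; this example only shows the query shape.

```sparql
# All items that are an instance of (P31) house cat (Q146), with labels.
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
```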
This is the process workflow for searching open repositories using linked-data dictionaries based on Wikidata, retrieving metadata, and performing machine-learning analysis to produce catalogs and knowledge graphs.
Figure 6: Workflow diagram
Creation of Mini-Corpus
The software named pygetpapers was used to build a mini-corpus of open scientific literature on “TPS of Zea mays” from EuropePMC, a platform that offers free access to millions of articles in biomedical science. The software processes very quickly (approximately 10 minutes), whereas downloading the articles individually could have taken many hours. The command for the creation of the mini-corpus is listed below:
-q "terpene synthase volatile Zea mays AND (((SRC:MED OR SRC:PMC OR SRC:AGR OR SRC:CBA) NOT (PUB_TYPE:"Review")))" -o ZeaTPS -x -p
The command pygetpapers initiates the process, and -q refers to the query to be searched. The query is entered in quotation marks, as in "terpene synthase volatile Zea mays". The next element, -o, refers to the output directory, and the parameter that follows it is the name of the directory, ZeaTPS in this case. Finally, -x -p requests that XML and PDF files be included in the download. The resulting corpus is available at: https://github.com/petermr/CEVOpen/tree/master/minicorpora/activity
As shown in the image below, this query creates a corpus of 100 research articles in full-text XML and PDF format on the local machine.
Figure 7: Papers downloaded from EuropePMC using the pygetpapers toolkit
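The boolean structure of the query above can be made explicit by assembling it from its parts. This small sketch reproduces the exact query string passed to pygetpapers via -q; only the assembly code itself is new.

```python
# Assemble the EuropePMC query used above: topic keywords, restricted to
# four source databases, excluding papers typed as reviews.
topic = "terpene synthase volatile Zea mays"
sources = " OR ".join(f"SRC:{s}" for s in ["MED", "PMC", "AGR", "CBA"])
query = f'{topic} AND ((({sources}) NOT (PUB_TYPE:"Review")))'
print(query)
```

Building the string programmatically makes it easy to swap the topic (e.g. for C. sinensis or S. lycopersicum) while keeping the same source and review filters.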
Creation of Dictionary
Dictionaries are collections of terms accompanied by supporting information such as descriptions, provenance, and most importantly, links to other terminological resources, most notably Wikidata. The purpose of the project's dictionaries is:
- To identify words and phrases ("entities") within the documents.
- To establish connections between their meaning and context ("ontologies").
- To assemble a subset of terms that express a high-level concept of plant chemistry and properties.
Structure of dictionary
The format of dictionaries is straightforward and is best supported by XML or JSON. This section defines specific elements and their associated attributes.
- Dictionary/Title: The dictionary is the root element and contains the title, which must be a single word and MUST be the base of the filename.
- Header/Description: The header contains zero or more <desc> description elements. These can include metadata about dates, maintenance, and provenance.
- Entry/Body: A dictionary's primary component is its entries. An entry is a well-defined object that is typically associated with a Wikidata item. This assigns it a unique identifier (Q-number), obviating the need for ongoing identifier maintenance.
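The structure described above can be sketched with Python's standard library. The element and attribute names mirror the text (a dictionary root with a title, a desc element, and entry elements carrying term, name, and a Wikidata ID); the exact amidict schema may differ in detail, and the entry content and Wikidata ID below are placeholders, not looked-up values.

```python
# Sketch of a minimal dictionary file matching the structure described above.
import xml.etree.ElementTree as ET

dictionary = ET.Element("dictionary", title="eo_CAMSITps")  # title = filename base

desc = ET.SubElement(dictionary, "desc")
desc.text = "Terpene synthases of Camellia sinensis (illustrative entry only)"

ET.SubElement(dictionary, "entry",
              term="linalool synthase",       # lowercase lexical string
              name="Linalool synthase",       # preferred, case-sensitive name
              wikidataID="Q00000000")         # placeholder, not a real item

xml_text = ET.tostring(dictionary, encoding="unicode")
print(xml_text)
```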
For C. sinensis: I found TPS after removing duplicates.
- While creating the dictionary, we have to follow some rules:
- remove leading and trailing whitespace in attribute values:
<entry term=" abc " ... should be edited to <entry term="abc"
- remove whitespace in some hyphenated values:
<entry term="beta- ocimene" should be: <entry term="beta-ocimene"
- I used the following command to create a dictionary:
amidict -v --dictionary eo_CAMSITps --directory gene --input gene.txt create --informat list --outformats xml
- Create the corpus using this pygetpapers command:
pygetpapers -q "terpene synthase TPS plant volatile" -o TPSvolatile -x -p -k 20
Then, lastly, edit the dictionary with Wikidata IDs. The dictionary provides the following attributes, as well as metadata about the different entities:
- The description parameter defines a human-readable string that describes the entry. It is frequently generated directly from Wikidata and can be used for grouping or disambiguation purposes.
- The name is the preferred name for the term. It is case-sensitive and frequently appears in the text; the name and term may or may not be synonymous.
- The term is the entry's unique lexical string (word). Terms are always written in lowercase and begin with a letter. In documents, the term may or may not match the linguistic entity exactly.
- The wikidata ID & URL is the Wikidata item's identifier. It resolves to https://wikidata.org/wiki/<wikidataID>. A Wikidata item has a unique identifier, and its relationships and graphs are language-independent.
- The Wikipedia page is referred to as wikipedia. It is frequently the same as the term (for single words). It may lack spaces and contain escaped punctuation. It resolves to https://en.wikipedia.org/wiki/<wikipedia>.
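Following the attribute descriptions above, deriving an entry's URLs is a simple string substitution. The entry data below are placeholders for illustration (the wikidataID is not a looked-up item):

```python
# Derive the Wikidata and Wikipedia URLs from an entry's identifiers.
entry = {
    "term": "linalool synthase",        # lowercase lexical string
    "name": "Linalool synthase",        # preferred, case-sensitive name
    "wikidataID": "Q00000000",          # placeholder ID
    "wikipedia": "Linalool_synthase",   # page title with spaces replaced
}

wikidata_url = f"https://wikidata.org/wiki/{entry['wikidataID']}"
wikipedia_url = f"https://en.wikipedia.org/wiki/{entry['wikipedia']}"

print(wikidata_url, wikipedia_url)
```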
DICTIONARY OF Camellia sinensis: https://github.com/petermr/crops/blob/main/Camellia/eo_CAMSITps.xml
DICTIONARY OF Solanum lycopersicum: https://github.com/petermr/crops/blob/main/Solanum%20lycopersicum/eo_tomato.xml
The ami section command is used to divide scientific articles into the following sections: front, body, back, floats, and groups. Sectioning the downloaded files creates a tree structure, which aids in navigating each file's content. Sectioning is accomplished by executing the following command at the command prompt:
ami -p "TPSVolatile" section
- ami search performs a search and analysis of the terms in the project repository, returning a frequency data table of the terms and a histogram of the corpus. The command below compares our corpus with the CEVOpen dictionaries such as TPS, volatile compounds, and plant compound, giving insights into phytochemistry and its relevance to medicinal plants and chemicals.
ami -p "TPSvolatile" search --dictionary eo_Gene.xml
- For valuable insights, alternative dictionaries such as TPS volatile plants, plant compounds, and chemicals were used to extract possible insights from the open scientific literature regarding the association of various terpene synthases with their medicinal activity. These alternative dictionaries are available at: https://github.com/petermr/crops
The following results were obtained from the aforementioned materials and methods, which enable us to address the scientific question regarding terpene synthases and the volatile plant compounds associated with them.
(i) Results of ami section: Generally, the dataset is sectioned for greater precision. When the 'sections' folder was opened in the cProject directory after successful completion of the ami section command, the directory's papers were divided into the following sections.
Figure 8: ami section
(ii) Results of amidict: amidict creates the dictionary in the directory given in the command.
Figure 3: ami dictionary
(iii) Results of ami search: ami search returns the following results in the form of a table, a histogram, and results for each folder.
Figure 9: ami search
The most fundamental output is the complete data table: a rectangular table with columns representing the searches and rows representing the papers. Opened in a web browser, full.dataTables.html appears as follows:
Figure 10: Full Data Table
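The shape of that data table can be sketched directly: one row per paper, one column per dictionary search, with each cell holding the hit count. The IDs, column names, and counts below are invented for illustration.

```python
# Toy model of full.dataTables.html: rows = papers, columns = searches.
rows = ["PMC0000001", "PMC0000002"]           # papers (invented IDs)
cols = ["TPS", "volatile_compound"]           # dictionary searches
hits = {                                      # invented hit counts
    ("PMC0000001", "TPS"): 4,
    ("PMC0000002", "volatile_compound"): 2,
}

# table[i][j] = number of hits of dictionary j in paper i (0 if none)
table = [[hits.get((r, c), 0) for c in cols] for r in rows]
print(table)
```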
The open scientific literature on terpene synthases of Camellia sinensis and Solanum lycopersicum was analyzed and TPS enzyme terms were extracted. For these TPS terms, Wikidata IDs, TPS nomenclature and synonyms were searched and added. Using ami search, associations between different TPSs and volatile compounds can be determined. The results were obtained through fully automated metadata extraction from a corpus of open scientific literature on terpene synthases of Camellia sinensis and Solanum lycopersicum, in the supported formats of XML and PDF, using the ami toolkit, which divides the articles into sections such as front, body, back, floats, and groups, each of which contains unique insights. With advances in genomics and metabolomics, the open scientific literature is being flooded with papers mentioning terpenes, their activities and gene IDs. The eo_CAMSITps and eo_tomato dictionaries created here can therefore be further improved with this information.
CEVOpen's overall goal is to strengthen open-source multiplatform tools for discovering, aggregating, cleaning, and semantically enriching scholarly documents that contain significant amounts of phytochemical information. This project's objective was to create an automated system capable of reading scientific literature and extracting its structure and meaning, with a particular emphasis on essential oils, the volatile component of a plant's phytochemistry. The open scientific literature is flooded with reports on oils extracted from specific plants and on the TPSs of various crops, their methodology, chemical composition, and biological and medicinal activities. Manual and automated metadata analysis of this open scientific literature extracted TPS enzyme information, and the eo_CAMSITps and eo_tomato dictionaries were created and successfully tested.
These TPS dictionaries can be further explored to correlate with other important dictionaries such as eo_plant, plant_compound and plant_material_history. The project's future direction is text mining of the open scientific literature for a multilingual semantic atlas of volatile phytochemistry, including text categorization, clustering, entity extraction, document summarization, sentiment analysis and entity-relationship modelling from various machine-learning perspectives.