miniproject: viral epidemics and country
Ambreen H
Pooja Pareek
proposed activities
- Use the communal corpus of 50 articles on viral epidemics.
FINISHED
- Meticulously scrutinize the corpus to detect true-positive and false-positive articles, i.e. whether the articles are really about viral epidemics or not
STARTED
- Refine and rerun the query to create a corpus of 950 articles. This shall be the dataset for further meta-analysis.
STARTED
- Create the country dictionary using amidict
FINISHED
- Use ami search to get information about the countries where such epidemics are most likely to occur.
STARTED
- Test sectioning on epidemic50noCov/ to extract only those sections where the information about countries is most likely to be present. Annotation with dictionaries to create ami dataTables shall also be done.
FINISHED
- For ML techniques this shall be split into training, validation and test sets.
STARTED
- Use relevant machine learning techniques for the classification of data based on whether the papers are related to viral epidemics and the countries where the viral epidemics were reported. This shall primarily be done using Python.
STARTED
- The model shall be validated using the accuracy obtained when testing it upon the test data.
STARTED
outcomes
- Development of relevant spreadsheets and graphs of the countries where the viral epidemics were reported and their respective frequencies (a sketch of such a frequency plot is given after this list)
- Development of an ML model for data classification with acceptable accuracy
- Initially the communal corpus of 50 articles on viral epidemics
- Later a new corpus consisting of 950 papers shall be created using the country dictionary.
- country dictionary
- ami for the creation of corpus, use of dictionaries, sectioning
- ami/SPARQL for the creation of dictionaries
- Python and relevant libraries (Keras, TensorFlow, NLP, etc) for ML and data visualization (NumPy, Matplotlib, Seaborn, ggplot, etc)
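As an illustration of the first outcome, a minimal sketch of a country-frequency plot, assuming the ami dataTables have been exported to a CSV with one row per (paper, country) hit; the file name and column name are assumptions, not part of the project so far:
PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt

# Assumed input: a CSV exported from the ami dataTables, one row per (paper, country) hit.
hits = pd.read_csv("country_hits.csv")          # illustrative file name
counts = hits["country"].value_counts().head(20)

counts.plot(kind="bar")
plt.ylabel("Number of papers")
plt.title("Countries most frequently mentioned in the viral-epidemic corpus")
plt.tight_layout()
plt.savefig("country_frequencies.png")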
Time would be a major constraint since this must be completed within a maximum period of 6 weeks.
06/07/2020
- Updated ami
- Created the country dictionary using the following function:
amidict -v --dictionary country --directory country --input country.txt create --informat list --outformats xml,html --wikilinks wikipedia, wikidata
- Further details on dictionary creation: https://github.com/petermr/openVirus/blob/master/dictionaries/country/country_dict.md
- Link to the created dictionary: https://github.com/petermr/openVirus/blob/master/dictionaries/test/country_new.xml
- Started the creation of a spreadsheet of true and false positives for classification using both the communal corpus as well as Europe PMC search
- Tested ami section on a corpus of 50 articles. Details: https://github.com/petermr/openVirus/wiki/ami:section
- Tested ami search on a corpus of 50 articles. Details: https://github.com/petermr/openVirus/wiki/ami-search
08/07/2020
Created a communal corpus and pushed it to GitHub. This was accomplished by downloading and installing Visual Studio Code: https://code.visualstudio.com/download
Next Steps (for reference):
- Install git on your system
- Clone your repository using
git clone https://github.com/petermr/openVirus.git
- Remember the location of your cloned repository. Add folders in the specified location.
- Open VS Code and open the folder where you cloned the repository (check that Git is enabled in the settings)
- Go to the Source Control section and click on the Git icon
- Give a commit message and commit the changes
- Add remote repo (Github repo)
- Push committed changes to GitHub repo
- Check changes on GitHub repo
For troubleshooting, check the FAQ.
Pushed the corpus of 950 papers to GitHub: https://github.com/petermr/openVirus/tree/master/miniproject/corpus_950_papers
Updated ami by:
- Navigating to the ami3 folder in the command prompt using cd path/to/ami3
- Running the commands git pull and mvn clean install -DskipTests
Converted SPARQL query results to XML format using the following command:
amidict -p country_project -v --dictionary country --directory=target/dictionary --input=country_wikidata.xml create --informat=wikisparqlxml
Reference Dictionary: country_converted
13/07/2020
Used the amidict --validate command to validate the dictionary:
- Updated ami
- Used the following command for dictionary validation:
amidict --dictionary path/to/mydictionary -v display --fields --validate
- Result:
Generic values (DictionaryDisplayTool)
================================
--testString : d null
--wikilinks : d [Lorg.contentmine.ami.tools.AbstractAMIDictTool$WikiLink;@2631f68c
--fields : m []
--files : d []
--maxEntries : d 3
--remote : d [https://github.com/petermr/dictionary]
--validate : m true
--help : d false
--version : d false
--dictionary : d [C:\Users\eless\country_2]
--directory : d null
Specific values (DictionaryDisplayTool)
================================
list all fields
dictionaries from C:\Users\ContentMine\dictionaries
- Downloaded the requisite libraries in Python: TensorFlow, Seaborn, NumPy, pandas
- Sorted out false positives from the corpus for a sample of 100 papers.
- Used Python for XML parsing and conversion to text. This was done by importing the xml.etree.ElementTree library and using Python loops to iterate over all .xml files, extracting the abstracts and converting them to .txt (the os library is also required); see the sketch after this list.
- For the smoke test, this was done only for the abstracts within each paper.
- Prepared a rough layout for the classification in Python.
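A minimal sketch of this abstract-extraction step, assuming the corpus follows the CProject layout produced by getpapers (one folder per paper containing fulltext.xml); the directory and output file names are illustrative:
PYTHON CODE
import os
import xml.etree.ElementTree as ET

corpus_dir = "epidemic50noCov"   # illustrative CProject directory

for paper in os.listdir(corpus_dir):
    xml_path = os.path.join(corpus_dir, paper, "fulltext.xml")
    if not os.path.isfile(xml_path):
        continue
    tree = ET.parse(xml_path)
    # Gather all text inside <abstract> elements of the JATS full text.
    abstract = " ".join(
        "".join(node.itertext()) for node in tree.getroot().iter("abstract")
    )
    # Write the extracted abstract next to the XML as plain text.
    with open(os.path.join(corpus_dir, paper, "abstract.txt"), "w", encoding="utf-8") as out:
        out.write(abstract)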
20/07/2020
- Updated ami
- Ran ami search again to get the co-occurrences on the corpus of 950 papers downloaded using getpapers
- Results:
- __cooccurrence folder containing 2 subfolders (country, country-country) and allPlots.svg (each folder contains non-empty files)
- errors may arise due to a faulty dictionary
- Updated the wikidata query to remove irrelevant names from the country dictionary and also to sort it by country names:
- Query used:
#Find ISO 3166-1 alpha-2 country codes
SELECT ?country ?countryLabel ?code ?wikipedia ?countryAltLabel
WHERE
{ ?country wdt:P297 ?code .
OPTIONAL { ?wikipedia schema:about ?country .
?wikipedia schema:isPartOf <https://en.wikipedia.org/>.
} .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"
}
}
ORDER BY (?countryLabel)
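For reference, the results of this query can also be fetched programmatically; the following is only a sketch using the public Wikidata SPARQL endpoint and the Python requests library (the query and output file names are illustrative, and this is an assumption rather than the exact method used here):
PYTHON CODE
import requests

# Read the query shown above from a local file (illustrative name).
query = open("country_query.rq", encoding="utf-8").read()

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query},
    headers={
        # Request SPARQL XML results, the format converted earlier with amidict.
        "Accept": "application/sparql-results+xml",
        "User-Agent": "openVirus-miniproject-example",
    },
    timeout=60,
)
resp.raise_for_status()

# Save the raw SPARQL results for conversion with amidict.
with open("country_wikidata.xml", "wb") as f:
    f.write(resp.content)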
- One issue faced in the country dictionary was the occurrence of flags within synonyms. This was sorted out using Python.
- The SPARQL endpoint file was first converted into the standard format using amidict (for reference, see above)
- The new XML file was imported into Python and all characters within the grandchild elements (i.e. the synonyms) were converted to ASCII. This emptied the synonym elements that contained only flags.
PYTHON CODE
import re

iname = "E:\\ami_try\\Dictionaries\\country_converted.xml"
oname = "E:\\ami_try\\Dictionaries\\country_converted2.xml"

# Match a whole <synonym>...</synonym> line, keeping the tags and the content as groups.
pat = re.compile(r'(\s*<synonym>)(.*?)(</synonym>\s*)', re.U)

with open(iname, "rb") as fin:
    with open(oname, "wb") as fout:
        for line in fin:
            # Decode as ASCII and drop anything that is not ASCII (e.g. flag emoji).
            line = line.decode('ascii', errors='ignore')
            m = pat.search(line)
            if m:
                g = m.groups()
                # Rebuild the synonym line in lower case.
                line = g[0].lower() + g[1].lower() + g[2].lower()
            fout.write(line.encode('utf-8'))
- The empty elements were then deleted using Python to create a new .xml file with all synonyms except the flags.
PYTHON CODE
from lxml import etree

def remove_empty_tag(tag, original_file, new_file):
    root = etree.parse(original_file)
    for element in root.xpath(f".//*[self::{tag} and not(node())]"):
        element.getparent().remove(element)
    # Serialize "root" and create a new tree using an XMLParser to clean up
    # formatting caused by removing elements.
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.fromstring(etree.tostring(root), parser=parser)
    # Write to new file.
    etree.ElementTree(tree).write(new_file, pretty_print=True, xml_declaration=True, encoding="utf-8")

remove_empty_tag("synonym", "E:\\ami_try\\Dictionaries\\country_converted2.xml", "E:\\ami_try\\Dictionaries\\country_converted3.xml")
All code is reusable with a little modification
- A simpler approach, modifying the SPARQL query itself, was also tried; this eliminated the need for much of this post-processing.
Detailed Tutorial for the QUERY
SPARQL dictionary without flags
25/07/2020
Tester: Ambreen H
In order to run the machine learning model, proper data preparation is necessary.
- The following libraries were used: xml.etree.ElementTree (as ET), string, os and re
- A function was written to locate the XML files and extract the abstract from each of them
- This was done on a small number of papers (30 positives and 30 negatives)
- The abstracts were cleaned by removing unnecessary characters, converting the text to lower case and removing subheadings like 'abstract', etc.
- Finally, a single data file was created in CSV format with 3 columns: the name of the file, the entire cleaned text of the abstract, and whether the paper is a false positive or a true positive (a sketch of this step is given below)
- Create separate text files for each paper. Split it into training and testing datasets.
- Finalize the code for binary classification in Python
- Run the smoke test for a sample of hundred papers
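A minimal sketch of the data-preparation step described above, assuming the abstracts extracted earlier are stored as one abstract.txt per paper folder and that the true-positive paper IDs are listed in a hand-made text file; the file names and labelling scheme are illustrative, not the project's actual code:
PYTHON CODE
import csv
import os
import re
import string

def clean(text):
    # Lower-case, strip a leading 'abstract' subheading, remove punctuation and collapse whitespace.
    text = text.lower()
    text = re.sub(r'^\s*abstract[:\s]*', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()

corpus_dir = "epidemic50noCov"                                        # illustrative
true_positives = set(open("true_positive_ids.txt").read().split())   # illustrative hand-made list of paper IDs

with open("abstracts_labelled.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["paper", "abstract", "label"])
    for paper in sorted(os.listdir(corpus_dir)):
        txt_path = os.path.join(corpus_dir, paper, "abstract.txt")
        if not os.path.isfile(txt_path):
            continue
        abstract = clean(open(txt_path, encoding="utf-8").read())
        label = "TP" if paper in true_positives else "FP"
        writer.writerow([paper, abstract, label])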
11/08/2020
- Finalized the country dictionary
- Tried language variants for Hindi. Dictionary uploaded to GitHub Hindi_Dictionary: Dictioanry_with_hindi
- Finished sectioning on the corpus
- Tried ami search and uploaded the results: ami_search_results
- Downloaded all libraries for the ML project
17/08/2020
- Wrote the code to extract abstracts from ami section output. The code was written with comments to explain what each line does. The code does the following:
- Extracts abstracts from ami section
- Cleans the text
- Merges rows with similar paper IDs
- Adds it all to the final CSV file
This may be used for manual classification of all papers; a rough sketch of such a script is given below.
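The actual commented script lives in the repository; the following is only a sketch of the same idea, assuming ami section writes each paper's abstract paragraphs as XML files under a sections/ folder (the folder layout and file names here are assumptions):
PYTHON CODE
import csv
import glob
import os
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

corpus_dir = "corpus_950"   # illustrative CProject directory
abstracts = defaultdict(list)

# Assumed layout: <corpus>/<paper_id>/sections/**/*.xml, with "abstract" in the path.
for path in glob.glob(os.path.join(corpus_dir, "*", "sections", "**", "*.xml"), recursive=True):
    if "abstract" not in path.lower():
        continue
    paper_id = os.path.relpath(path, corpus_dir).split(os.sep)[0]
    text = "".join(ET.parse(path).getroot().itertext())
    text = re.sub(r"\s+", " ", text).strip().lower()
    if text:
        abstracts[paper_id].append(text)   # rows with the same paper ID are merged below

with open("abstracts_from_sections.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["paper", "abstract"])
    for paper_id, parts in sorted(abstracts.items()):
        writer.writerow([paper_id, " ".join(parts)])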
16/09/2020
- Finalized the country dictionary (MORE DETAILS)
- Ran ami search on the dictionary
ami -p ami_12_08_2020/corpus_950 search --dictionary ami_12_08_2020/amidict10/country.xml
- Ran ami search with two dictionaries (country and disease)
ami -p ami_12_08_2020/corpus_950 search --dictionary ami_12_08_2020/amidict10/country.xml
- Validated the country dictionary
- Started working on Machine Learning (NLP) in Jupyter Notebook
1/10/2020
- Used Machine Learning for Binary Classification
- Used getpapers to download a corpus of 2000 papers using the following queries:
getpapers -q (TITLE:"HIV" or TITLE:"EBOLA" NOT TITLE:"COVID-19" NOT TITLE:"Corona" NOT TITLE:"COVID") -o hackathon/test/ML_other_1 -f v_epid/log.txt -x -p -k 200
getpapers -q (TITLE:"COVID-19" or TITLE:"Corona" NOT TITLE:"Ebola" NOT TITLE:"HIV") -o hackathon/test/ML_covid_1 -f v_epid/log.txt -x -p -k 200
- Ran ami search to scrutinize the papers from the datatables generated using the disease dictionary:
ami -p hackathon/ML_covid_1 search --dictionary hackathon/dictionaries/disease.xml
ami -p hackathon/ML_other_1 search --dictionary hackathon/dictionaries/disease.xml
- Attempted binary classification on the two corpora after proper preprocessing: https://github.com/petermr/openVirus/tree/master/cambiohack2020
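The notebooks for this step are in the linked cambiohack2020 folder; as a minimal baseline only (TF-IDF plus logistic regression with scikit-learn, rather than the Keras/TensorFlow models mentioned above), assuming a labelled CSV of cleaned abstracts:
PYTHON CODE
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed input: one row per paper with the cleaned abstract and a binary label
# (e.g. covid vs other, or true positive vs false positive).
df = pd.read_csv("abstracts_labelled.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["abstract"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Turn the abstracts into TF-IDF features and fit a simple linear classifier.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))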
Submitter: Pooja Pareek
The project is all about viral epidemics with respect to different countries. The purpose of the project is to put all the essential data in one place, together with the country dictionary, so that it is easy for everyone to understand.
To get started, all the necessary software first needs to be installed. What I have done so far is listed below:
- Installed getpapers:
- installed getpapers using the information provided here: https://github.com/ContentMine/getpapers/blob/master/README.md
- went to the download page and downloaded nvm-setup-zip
- ran nvm-setup-zip and installed it with the included installer
- installed Node by using the command prompt and running the following commands one after another: nvm install 7 and nvm use 7.10.1
- tested the installation with node --version
- ran the command npm install --global getpapers and getpapers was installed successfully
- Installed ami:
- with the help of https://github.com/petermr/openVirus/wiki/INSTALLING-ami3
- installed Java
- tested the Java installation with the command java -version and got Java version 1.8
- installed the JDK
- set the path as per the instructions
- Installed git
- downloaded git
- launched git bash
- tested the installation with the command git --version and got git version 2.27.0
- Installed maven
- downloaded maven using https://maven.apache.org/install.html
- downloaded apache-maven-3.6.3-bin.tar.gz or apache-maven-3.6.3-bin.zip
- extracted all files
- set the path variable
- tested the installation using mvn -version and got Apache Maven version 3.6.3
- opened the command prompt and entered the following commands one after another:
git clone https://github.com/petermr/ami3.git
cd ami3
mvn install -Dmaven.test.skip=true
- tested the installation using ami --help
- ami was installed successfully
- set the path for ami3
- tried to run ami search and ami section initially on 20 papers
[error: java should point to the JDK, not the JRE]
After the successful installation of the basic software, the following things were learnt with the help of the mentor's instructions provided on the wiki:
- edit a page.
- create a corpus (a corpus is a group of articles, in which all the articles belong to the respective topic).
- section the articles with ami section.
- create a datatable with ami search.
- create a dictionary using amidict.
- create a SPARQL query.
- manually differentiate between true positives and false positives.