
miniproject: viral epidemics and country


What countries do viral epidemics occur in?

owner (mentor):

Ambreen H

collaborators (mentee):

Pooja Pareek

miniproject summary

proposed activities

  1. Use the communal corpus of 50 articles on viral epidemics. #c5f015 FINISHED
  2. Meticulously scrutinize the corpus to identify true-positive and false-positive articles, i.e. whether each article is really about viral epidemics or not. #1589F0 STARTED
  3. Refine and rerun the query to create a corpus of 950 articles; this shall be the dataset for further meta-analysis. #1589F0 STARTED
  4. Create the country dictionary using amidict. #c5f015 FINISHED
  5. Use ami search to get information about the countries where such epidemics are most likely to occur. #1589F0 STARTED
  6. Test sectioning on epidemic50noCov/ to extract only those sections where the information about countries is most likely to be present. Annotation with dictionaries to create ami dataTables shall also be done. #c5f015 FINISHED
  7. For ML techniques, the data shall be split into training, validation and test sets. #1589F0 STARTED
  8. Use relevant machine learning techniques to classify the data by whether the papers are related to viral epidemics and by the countries where the viral epidemics were reported. This shall primarily be done using Python. #1589F0 STARTED
  9. The model shall be validated using the accuracy obtained when testing it on the test data. #1589F0 STARTED

outcomes

  1. Development of relevant spreadsheets and graphs with regard to the countries where the viral epidemics were reported and their respective frequencies
  2. Development of an ML model for data classification with acceptable accuracy

corpora

  1. Initially, the communal corpus of 50 articles on viral epidemics
  2. Later, a new corpus of 950 papers shall be created using the country dictionary

dictionaries

  • country dictionary

software

  1. ami for the creation of corpus, use of dictionaries, sectioning
  2. ami/SPARQL for the creation of dictionaries
  3. Python and relevant libraries for ML and NLP (Keras, TensorFlow, etc.) and for data visualization (NumPy, Matplotlib, Seaborn, ggplot, etc.)

constraints

Time would be a major constraint since this must be completed within a maximum period of 6 weeks.



#c5f015 Update 1:

06/07/2020

  1. Updated ami
  2. Created the country dictionary using the following command: amidict -v --dictionary country --directory country --input country.txt create --informat list --outformats xml,html --wikilinks wikipedia, wikidata
  3. Further details on dictionary creation: https://github.com/petermr/openVirus/blob/master/dictionaries/country/country_dict.md
  4. Link to the created dictionary: https://github.com/petermr/openVirus/blob/master/dictionaries/test/country_new.xml
  5. Started creating a spreadsheet of true and false positives for classification, using both the communal corpus and a Europe PMC search
  6. Tested ami section on a corpus of 50 articles. Details: https://github.com/petermr/openVirus/wiki/ami:section
  7. Tested ami search on a corpus of 50 articles. Details: https://github.com/petermr/openVirus/wiki/ami-search


#c5f015 Update 2:

08/07/2020

CORPUS_950

Created a communal corpus and pushed it to GitHub. This was accomplished after downloading and installing Visual Studio Code (https://code.visualstudio.com/download). Steps (for reference):

  1. Install git on your system
  2. Clone the repository using git clone https://github.com/petermr/openVirus.git
  3. Remember the location of the cloned repository and add your folders in that location
  4. Open the cloned folder in VS Code (check that git is enabled in the settings)
  5. Go to the Source Control section and click on the git icon
  6. Give a commit message and commit the changes
  7. Add the remote repo (GitHub repo)
  8. Push the committed changes to the GitHub repo
  9. Check the changes on the GitHub repo

For troubleshooting, check the FAQ. Pushed the corpus of 950 papers to GitHub: https://github.com/petermr/openVirus/tree/master/miniproject/corpus_950_papers

ami

Updated ami by:

  1. Navigating to the ami3 folder on command prompt using cd path/to/ami3
  2. Running the commands: git pull and mvn clean install -DskipTests

Dictionary

Converted SPARQL query results to XML format using the following command:

amidict -p country_project -v --dictionary country --directory=target/dictionary --input=country_wikidata.xml create --informat=wikisparqlxml 

Reference Dictionary: country_converted



#c5f015 Update 3:

13/07/2020

Dictionary validation:

Used the amidict --validate command to validate the dictionary:

  1. Updated ami
  2. Used the following command for dictionary validation: amidict --dictionary path/to/mydictionary -v display --fields --validate
  3. Result:
Generic values (DictionaryDisplayTool)
================================
--testString        : d      null
--wikilinks         : d [Lorg.contentmine.ami.tools.AbstractAMIDictTool$WikiLink;@2631f68c
--fields            : m        []
--files             : d        []
--maxEntries        : d         3
--remote            : d [https://github.com/petermr/dictionary]
--validate          : m      true
--help              : d     false
--version           : d     false
--dictionary        : d [C:\Users\eless\country_2]
--directory         : d      null

Specific values (DictionaryDisplayTool)
================================
list all fields
dictionaries from C:\Users\ContentMine\dictionaries

Smoke-Test for Machine Learning:

  1. Downloaded the requisite Python libraries: TensorFlow, Seaborn, NumPy, pandas
  2. Sorted out the false positives from the corpus for a sample of 100 papers
  3. Used Python for XML parsing and conversion to text: the xml.etree.ElementTree library was imported and Python loops were used to iterate over all the .xml files, extracting the abstracts and writing them out as .txt files (the os library was also required); a sketch follows this list
  4. For the smoke test, this was only done for the abstracts within each paper
  5. Prepared a rough layout for the classification in Python
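
A minimal, illustrative sketch of this abstract-extraction step (not the original script). It assumes a CProject-style corpus in which each paper folder holds a JATS fulltext.xml containing an <abstract> element; the folder names epidemic50noCov and abstracts_txt are placeholders.

PYTHON CODE

import os
import xml.etree.ElementTree as ET

corpus_dir = "epidemic50noCov"      # hypothetical corpus folder (one subfolder per paper)
out_dir = "abstracts_txt"           # hypothetical output folder for the .txt files
os.makedirs(out_dir, exist_ok=True)

for paper_id in os.listdir(corpus_dir):
    xml_path = os.path.join(corpus_dir, paper_id, "fulltext.xml")
    if not os.path.isfile(xml_path):
        continue
    tree = ET.parse(xml_path)
    # itertext() flattens any nested markup (paragraphs, italics, etc.) inside <abstract>
    abstracts = ["".join(node.itertext()) for node in tree.iter("abstract")]
    if abstracts:
        with open(os.path.join(out_dir, paper_id + ".txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(abstracts))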

#c5f015 Update 4:

20/07/2020

ami search for co-occurrence:

  1. Updated ami
  2. Ran ami search again to get the co-occurrences on the corpus of 950 papers downloaded using getpapers
  3. Results:
  • a __cooccurrence folder containing two subfolders, country and country-country, plus allPlots.svg (each folder contains non-empty files)
  • errors may arise due to a faulty dictionary

Refining the country dictionary:

  1. Updated the Wikidata query to remove irrelevant names from the country dictionary and to sort it by country name:
  2. Query used:
#Find ISO 3166-1 alpha-2 country codes
SELECT ?country ?countryLabel ?code ?wikipedia ?countryAltLabel
WHERE
{
  ?country wdt:P297 ?code .
  OPTIONAL {
    ?wikipedia schema:about ?country .
    ?wikipedia schema:isPartOf <https://en.wikipedia.org/> .
  }
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"
  }
}
ORDER BY (?countryLabel)
  3. One issue faced with the country dictionary was the occurrence of flag emoji among the synonyms. This was sorted out using Python.
  • The SPARQL endpoint file was first converted into the standard format using amidict (for reference, see above)
  • The new XML file was read into Python and all characters within the grandchild elements (i.e. the synonyms) were converted to ASCII; this emptied the synonym elements that held only flags

PYTHON CODE

import re

iname = "E:\\ami_try\\Dictionaries\\country_converted.xml"
oname = "E:\\ami_try\\Dictionaries\\country_converted2.xml"
# Match a whole <synonym>...</synonym> element so that its contents can be rewritten
pat = re.compile(r'(\s*<synonym>)(.*?)(</synonym>\s*)', re.U)
with open(iname, "rb") as fin:
    with open(oname, "wb") as fout:
        for line in fin:
            # Decoding as ASCII and ignoring errors drops all non-ASCII characters,
            # which removes the flag emoji from the synonym text
            line = line.decode('ascii', errors='ignore')
            m = pat.search(line)
            if m:
                g = m.groups()
                line = g[0].lower() + g[1].lower() + g[2].lower()
            fout.write(line.encode('utf-8'))
  • The empty synonym elements were then deleted using Python, creating a new .xml file containing all synonyms except the flags.

PYTHON CODE

from lxml import etree
def remove_empty_tag(tag, original_file, new_file):
    root = etree.parse(original_file)
    for element in root.xpath(f".//*[self::{tag} and not(node())]"):
        element.getparent().remove(element)
    # Serialize "root" and create a new tree using an XMLParser to clean up
    # formatting caused by removing elements.
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.fromstring(etree.tostring(root), parser=parser)
    # Write to new file.
    etree.ElementTree(tree).write(new_file, pretty_print=True, xml_declaration=True, encoding="utf-8")
remove_empty_tag("synonym", "E:\\ami_try\\Dictionaries\\country_converted2.xml", "E:\\ami_try\\Dictionaries\\country_converted3.xml")

All code is reusable with a little modification.

New dictionary for reference

  4. A simpler approach, modifying the SPARQL query itself, was also tried; this eliminated the need for most of the post-processing.

Detailed Tutorial for the QUERY

SPARQL dictionary without flags


#c5f015 Update 5:

25/07/2020

Smoke_test

Tester: Ambreen H

Data preparation

In order to run the machine learning model, proper data preparation was necessary:

  • The following libraries were used: xml.etree.ElementTree (as ET), string, os and re
  • A function was written to locate the XML files and extract the abstract from each one
  • This was done on a small number of papers (30 positives and 30 negatives)
  • The abstracts were cleaned by removing unnecessary characters, converting the text to lowercase and removing subheadings such as 'abstract'
  • Finally, a single data file was created in CSV format with three columns: the name of the file, the entire cleaned abstract text, and whether the paper is a true positive or a false positive (a sketch follows below)

Code File
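
A minimal, illustrative sketch of the cleaning and CSV-building steps described above (the actual code is in the Code File link). It assumes the extracted abstracts have already been saved as .txt files in two hypothetical folders, abstracts_true and abstracts_false; all file and column names are placeholders.

PYTHON CODE

import csv
import os
import re
import string

def clean_abstract(text):
    text = text.lower()
    text = re.sub(r"^\s*abstract[:\s]*", "", text)                     # drop a leading 'abstract' subheading
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()

rows = []
for label, folder in [(1, "abstracts_true"), (0, "abstracts_false")]:  # hypothetical folders
    for fname in os.listdir(folder):
        with open(os.path.join(folder, fname), encoding="utf-8") as f:
            rows.append([fname, clean_abstract(f.read()), label])

with open("smoke_test_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "abstract", "true_positive"])
    writer.writerows(rows)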

#f0b215 Up Next:

  1. Create separate text files for each paper and split them into training and testing datasets
  2. Finalize the code for binary classification in Python
  3. Run the smoke test on a sample of one hundred papers

#c5f015 Update 6:

11/08/2020

  1. Finalized the country dictionary

  2. Tried language variants for Hindi. Dictionary uploaded to GitHub (Hindi_Dictionary): Dictioanry_with_hindi

  3. Finished sectioning on the corpus

  4. Tried ami search and uploaded the results: ami_search_results

  5. Downloaded all libraries for the ML project



#c5f015 Update 7:

17/08/2020

  1. Wrote code to extract abstracts from the ami section output. The code was commented to explain what each line does. It does the following:
  • extracts abstracts from the ami section output
  • cleans the text
  • merges rows with the same paper ID
  • adds everything to the final CSV file

This may be used for the manual classification of all papers (a sketch of the merge step follows below).
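
A minimal, illustrative sketch of the merge step (not the original code): rows that share a paper ID are concatenated into one record before the final CSV is written. The file name abstract_sections.csv and the column names paper_id and text are assumptions for the example.

PYTHON CODE

import pandas as pd

# Hypothetical input: one row per extracted section, columns 'paper_id' and 'text'
df = pd.read_csv("abstract_sections.csv")
df["text"] = df["text"].fillna("").str.lower()

# Merge rows with the same paper ID by joining their text
merged = df.groupby("paper_id", as_index=False)["text"].agg(" ".join)
merged.to_csv("abstracts_final.csv", index=False)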

#c5f015 Update 8:

16/09/2020

  1. Finalized the country dictionary (MORE DETAILS)
  2. Ran ami search with the country dictionary: ami -p ami_12_08_2020/corpus_950 search --dictionary ami_12_08_2020/amidict10/country.xml
  3. Ran ami search with two dictionaries (country and disease): ami -p ami_12_08_2020/corpus_950 search --dictionary ami_12_08_2020/amidict10/country.xml
  4. Validated the country dictionary
  5. Started working on Machine Learning (NLP) in a Jupyter Notebook

#c5f015 Update 9:

1/10/2020

  1. Used Machine Learning for Binary Classification
  2. Used getpapers to download a corpus of 2000 papers using the following queries:

getpapers -q (TITLE:"HIV" or TITLE:"EBOLA" NOT TITLE:"COVID-19" NOT TITLE:"Corona" NOT TITLE:"COVID") -o hackathon/test/ML_other_1 -f v_epid/log.txt -x -p -k 200

getpapers -q (TITLE:"COVID-19" or TITLE:"Corona" NOT TITLE:"Ebola" NOT TITLE:"HIV") -o hackathon/test/ML_covid_1 -f v_epid/log.txt -x -p -k 200

  3. Ran ami search to scrutinize the papers via the dataTables generated using the disease dictionary:

ami -p hackathon/ML_covid_1 search --dictionary hackathon/dictionaries/disease.xml

ami -p hackathon/ML_other_1 search --dictionary hackathon/dictionaries/disease.xml

  4. Attempted binary classification on the two corpora after proper preprocessing (see the sketch below): https://github.com/petermr/openVirus/tree/master/cambiohack2020
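
A minimal, illustrative sketch of the binary-classification step. It assumes the two corpora have already been reduced to a single CSV of cleaned text plus a 0/1 label (the file covid_vs_other.csv and its column names are placeholders) and uses Keras/TensorFlow, which the project lists among its libraries; the actual notebook is in the cambiohack2020 folder linked above.

PYTHON CODE

import pandas as pd
import tensorflow as tf

# Hypothetical input: one row per paper, 'text' = cleaned abstract, 'label' = 1 (COVID) or 0 (other)
df = pd.read_csv("covid_vs_other.csv")
df = df.sample(frac=1, random_state=42).reset_index(drop=True)   # shuffle before splitting
split = int(0.8 * len(df))                                       # 80/20 train/test split
train_text = df["text"].iloc[:split].astype(str).values
test_text = df["text"].iloc[split:].astype(str).values
train_y = df["label"].iloc[:split].values
test_y = df["label"].iloc[split:].values

# Turn raw text into padded integer sequences; the vocabulary is learned from the training set only
vectorize = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=300)
vectorize.adapt(train_text)
X_train = vectorize(tf.constant(train_text)).numpy()
X_test = vectorize(tf.constant(test_text)).numpy()

# Small bag-of-embeddings classifier with a sigmoid output for the binary decision
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=20000, output_dim=64),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, train_y, validation_split=0.1, epochs=5, batch_size=32)

loss, acc = model.evaluate(X_test, test_y)
print("test accuracy:", acc)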



Initial Summary

Submitter: Pooja Pareek

The project is about viral epidemics with respect to different countries. The purpose of the project is to put all the essential data in one place, together with the country dictionary, so that it is easy for everyone to understand.

Initial work:

To get started, all the necessary software needed to be installed. What has been done so far is listed below:

  1. Installed getpapers:

Installed getpapers using the information provided here: https://github.com/ContentMine/getpapers/blob/master/README.md

  • went to the download page and downloaded nvm-setup.zip
  • ran nvm-setup.zip and completed the included installer
  • installed Node by running the commands nvm install 7 and then nvm use 7.10.1 in the command prompt
  • tested the installation with node --version
  • ran the command npm install --global getpapers and getpapers was installed successfully
  2. Installed ami:
  1. Installed git
  • downloaded git
  • launched Git Bash
  • tested the installation with git --version, which reported git version 2.27.0
  2. Installed Maven
  • downloaded Maven following https://maven.apache.org/install.html
  • downloaded apache-maven-3.6.3-bin.tar.gz or apache-maven-3.6.3-bin.zip
  • extracted all the files
  • set the path variable
  • tested the installation with mvn -version, which reported Apache Maven 3.6.3
  3. Opened the command prompt and entered the following commands one after another:
git clone https://github.com/petermr/ami3.git

cd ami3

mvn install -Dmaven.test.skip=true
  • tested the installation with ami --help
  • ami was installed successfully
  • set the path for ami3
  • tried to run ami search and ami section on an initial set of 20 papers.

[error: java should point to the JDK, not the JRE]

work done

After the successful installation of the basic software, the following skills were learnt with the help of the mentor's instructions provided on the wiki:

  • edit a wiki page.
  • create a corpus (a corpus is a group of articles, all of which belong to the same topic).
  • section the articles with ami section.
  • create a datatable with ami search.
  • create a dictionary using amidict.
  • create a SPARQL query.
  • manually differentiate between true positives and false positives.