miniproject: viral epidemics and country
Ambreen H
Pooja Pareek
proposed activities
- Use the communal corpus of 50 articles on viral epidemics.
FINISHED
- Meticulously scrutinize the corpus to detect true-positive and false-positive articles, i.e. whether the articles are really about viral epidemics or not
STARTED
- Refine and rerun the query to create a corpus of 950 articles. This shall be the dataset for further meta-analysis.
STARTED
- Create the country dictionary using amidict
FINISHED
- Use ami search to get information about the countries where such epidemics are most likely to occur.
STARTED
- Test sectioning on epidemic50noCov/ to extract only those sections where the information about countries is most likely to be present. Annotation with dictionaries to create ami dataTables shall also be done.
FINISHED
- For ML techniques this shall be split into training, validation and test sets.
STARTED
- Use relevant machine learning techniques for the classification of data based on whether the papers are related to viral epidemics and the countries where the viral epidemics were reported. This shall primarily be done using Python.
STARTED
- The model shall be validated using the accuracy obtained when testing it upon the test data.
STARTED
outcomes
- Development of relevant spreadsheets and graphs of the countries where the viral epidemics were reported and their respective frequencies (a sketch of such a frequency plot is given after this list)
- Development of an ML model for data classification with acceptable accuracy
- Initially the communal corpus of 50 articles on viral epidemics
- Later a new corpus consisting of 950 papers shall be created using the country dictionary.
- country dictionary
- ami for the creation of corpus, use of dictionaries, sectioning
- ami/SPARQL for the creation of dictionaries
- Python and relevant libraries (Keras, TensorFlow, NLP, etc) for ML and data visualization (NumPy, Matplotlib, Seaborn, ggplot, etc)
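As an illustration of the first outcome, a minimal sketch of a country-frequency plot, assuming the ami dataTables have been exported to a CSV with one row per (paper, country) hit; the file name and column name are assumptions, not part of the project so far:
PYTHON CODE
import pandas as pd
import matplotlib.pyplot as plt

# Assumed input: a CSV exported from the ami dataTables, one row per (paper, country) hit.
hits = pd.read_csv("country_hits.csv")          # illustrative file name
counts = hits["country"].value_counts().head(20)

counts.plot(kind="bar")
plt.ylabel("Number of papers")
plt.title("Countries most frequently mentioned in the viral-epidemic corpus")
plt.tight_layout()
plt.savefig("country_frequencies.png")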
Time would be a major constraint since this must be completed within a maximum period of 6 weeks.
06/07/2020
- Updated ami
- Created the country dictionary using the following function:
amidict -v --dictionary country --directory country --input country.txt create --informat list --outformats xml,html --wikilinks wikipedia, wikidata
- Further details on dictionary creation: https://github.com/petermr/openVirus/blob/master/dictionaries/country/country_dict.md
- Link to the created dictionary: https://github.com/petermr/openVirus/blob/master/dictionaries/test/country_new.xml
- Started the creation of a spreadsheet of true and false positives for classification using both the communal corpus as well as Europe PMC search
- Tested ami section on a corpus of 50 articles. Details: https://github.com/petermr/openVirus/wiki/ami:section
- Tested ami search on a corpus of 50 articles. Details: https://github.com/petermr/openVirus/wiki/ami-search
08/07/2020
Created a communal corpus and pushed it to GitHub. This was accomplished by downloading and installing Visual Studio Code: https://code.visualstudio.com/download
Next Steps (for reference):
- Install git on your system
- Clone your repository using
git clone https://github.com/petermr/openVirus.git
- Remember the location of your cloned repository. Add folders in the specified location.
- Open VS Code and open the folder where you cloned the repository (check that Git is enabled in the settings)
- Go to the Source Control section and click on the Git icon
- Give a commit message and commit the changes
- Add remote repo (Github repo)
- Push committed changes to GitHub repo
- Check changes on GitHub repo
For troubleshooting, check the FAQ.
Pushed the corpus of 950 papers to GitHub: https://github.com/petermr/openVirus/tree/master/miniproject/corpus_950_papers
Updated ami by:
- Navigating to the ami3 folder in the command prompt using cd path/to/ami3
- Running the commands git pull and mvn clean install -DskipTests
Converted SPARQL query results to XML format using the following command:
amidict -p country_project -v --dictionary country --directory=target/dictionary --input=country_wikidata.xml create --informat=wikisparqlxml
Reference Dictionary: country_converted
13/07/2020
Used the amidict --validate command to validate the dictionary:
- Updated ami
- Used the following command for dictionary validation:
amidict --dictionary path/to/mydictionary -v display --fields --validate
- Result:
Generic values (DictionaryDisplayTool)
================================
--testString : d null
--wikilinks : d [Lorg.contentmine.ami.tools.AbstractAMIDictTool$WikiLink;@2631f68c
--fields : m []
--files : d []
--maxEntries : d 3
--remote : d [https://github.com/petermr/dictionary]
--validate : m true
--help : d false
--version : d false
--dictionary : d [C:\Users\eless\country_2]
--directory : d null
Specific values (DictionaryDisplayTool)
================================
list all fields
dictionaries from C:\Users\ContentMine\dictionaries
- Downloaded the requisite libraries in Python: TensorFlow, Seaborn, NumPy, pandas
- Sorted out false positives from the corpus for a sample of 100 papers.
- Used Python for XML parsing and conversion to text. This was done by importing the xml.etree.ElementTree library and using Python loops to iterate over all .xml files, extracting the abstracts and converting them to .txt (the os library is also required); see the sketch after this list.
- For the smoke test, this was done only for the abstracts within each paper.
- Prepared a rough layout for the classification in Python.
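A minimal sketch of this abstract-extraction step, assuming the corpus follows the CProject layout produced by getpapers (one folder per paper containing fulltext.xml); the directory and output file names are illustrative:
PYTHON CODE
import os
import xml.etree.ElementTree as ET

corpus_dir = "epidemic50noCov"   # illustrative CProject directory

for paper in os.listdir(corpus_dir):
    xml_path = os.path.join(corpus_dir, paper, "fulltext.xml")
    if not os.path.isfile(xml_path):
        continue
    tree = ET.parse(xml_path)
    # Gather all text inside <abstract> elements of the JATS full text.
    abstract = " ".join(
        "".join(node.itertext()) for node in tree.getroot().iter("abstract")
    )
    # Write the extracted abstract next to the XML as plain text.
    with open(os.path.join(corpus_dir, paper, "abstract.txt"), "w", encoding="utf-8") as out:
        out.write(abstract)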
20/07/2020
- Updated ami
- Ran ami search again to get the co-occurrences on the corpus of 950 papers downloaded using getpapers
- Results:
- __cooccurrence folder containing 2 subfolders (country, country-country) and allPlots.svg (each folder contains non-empty files)
- errors may arise due to a faulty dictionary
- Updated the wikidata query to remove irrelevant names from the country dictionary and also to sort it by country names:
- Query used:
#Find ISO 3166-1 alpha-2 country codes
SELECT ?country ?countryLabel ?code ?wikipedia ?countryAltLabel
WHERE
{ ?country wdt:P297 ?code .
OPTIONAL { ?wikipedia schema:about ?country .
?wikipedia schema:isPartOf <https://en.wikipedia.org/>.
} .
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en"
}
}
ORDER BY (?countryLabel)
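For reference, the results of this query can also be fetched programmatically; the following is only a sketch using the public Wikidata SPARQL endpoint and the Python requests library (the query and output file names are illustrative, and this is an assumption rather than the exact method used here):
PYTHON CODE
import requests

# Read the query shown above from a local file (illustrative name).
query = open("country_query.rq", encoding="utf-8").read()

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query},
    headers={
        # Request SPARQL XML results, the format converted earlier with amidict.
        "Accept": "application/sparql-results+xml",
        "User-Agent": "openVirus-miniproject-example",
    },
    timeout=60,
)
resp.raise_for_status()

# Save the raw SPARQL results for conversion with amidict.
with open("country_wikidata.xml", "wb") as f:
    f.write(resp.content)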
- One issue faced in the country dictionary was the occurrence of flags within synonyms. This was sorted out using Python.
- The SPARQL endpoint file was first converted into the standard format using amidict (for reference, see above)
- The new XML file was imported into Python and all characters within the grandchild elements (i.e. the synonyms) were converted to ASCII. This emptied the synonym elements that contained only flags.
PYTHON CODE
import re

iname = "E:\\ami_try\\Dictionaries\\country_converted.xml"
oname = "E:\\ami_try\\Dictionaries\\country_converted2.xml"

# Match a whole <synonym>...</synonym> line, keeping the tags and the content as groups.
pat = re.compile(r'(\s*<synonym>)(.*?)(</synonym>\s*)', re.U)

with open(iname, "rb") as fin:
    with open(oname, "wb") as fout:
        for line in fin:
            # Decode as ASCII and drop anything that is not ASCII (e.g. flag emoji).
            line = line.decode('ascii', errors='ignore')
            m = pat.search(line)
            if m:
                g = m.groups()
                # Rebuild the synonym line in lower case.
                line = g[0].lower() + g[1].lower() + g[2].lower()
            fout.write(line.encode('utf-8'))
- The empty elements were then deleted using Python to create a new .xml file with all synonyms except the flags.
PYTHON CODE
from lxml import etree

def remove_empty_tag(tag, original_file, new_file):
    root = etree.parse(original_file)
    for element in root.xpath(f".//*[self::{tag} and not(node())]"):
        element.getparent().remove(element)
    # Serialize "root" and create a new tree using an XMLParser to clean up
    # formatting caused by removing elements.
    parser = etree.XMLParser(remove_blank_text=True)
    tree = etree.fromstring(etree.tostring(root), parser=parser)
    # Write to new file.
    etree.ElementTree(tree).write(new_file, pretty_print=True, xml_declaration=True, encoding="utf-8")

remove_empty_tag("synonym", "E:\\ami_try\\Dictionaries\\country_converted2.xml", "E:\\ami_try\\Dictionaries\\country_converted3.xml")
All code is reusable with a little modification
- A simpler approach, modifying the SPARQL query itself, was also tried; this eliminated the need for much of this post-processing.
Detailed Tutorial for the QUERY
SPARQL dictionary without flags
25/07/2020
Tester: Ambreen H
In order to run the machine learning model, proper data preparation is necessary.
- The following libraries were used: xml.etree.ElementTree (as ET), string, os and re
- A function was written to locate the XML files and extract the abstract from each of them
- This was done on a small number of papers (30 positives and 30 negatives)
- The abstracts were cleaned by removing unnecessary characters, converting the text to lower case and removing subheadings like 'abstract', etc.
- Finally, a single data file was created in CSV format with 3 columns: the name of the file, the entire cleaned text of the abstract, and whether the paper is a false positive or a true positive (a sketch of this step is given below)
- Create separate text files for each paper. Split it into training and testing datasets.
- Finalize the code for binary classification in Python
- Run the smoke test for a sample of hundred papers
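A minimal sketch of the data-preparation step described above, assuming the abstracts extracted earlier are stored as one abstract.txt per paper folder and that the true-positive paper IDs are listed in a hand-made text file; the file names and labelling scheme are illustrative, not the project's actual code:
PYTHON CODE
import csv
import os
import re
import string

def clean(text):
    # Lower-case, strip a leading 'abstract' subheading, remove punctuation and collapse whitespace.
    text = text.lower()
    text = re.sub(r'^\s*abstract[:\s]*', '', text)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text).strip()

corpus_dir = "epidemic50noCov"                                        # illustrative
true_positives = set(open("true_positive_ids.txt").read().split())   # illustrative hand-made list of paper IDs

with open("abstracts_labelled.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["paper", "abstract", "label"])
    for paper in sorted(os.listdir(corpus_dir)):
        txt_path = os.path.join(corpus_dir, paper, "abstract.txt")
        if not os.path.isfile(txt_path):
            continue
        abstract = clean(open(txt_path, encoding="utf-8").read())
        label = "TP" if paper in true_positives else "FP"
        writer.writerow([paper, abstract, label])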
11/08/2020
- Finalized the country dictionary
- Tried language variants for Hindi. Dictionary uploaded to GitHub Hindi_Dictionary: Dictioanry_with_hindi
- Finished sectioning on the corpus
- Tried ami search and uploaded the results: ami_search_results
- Downloaded all libraries for the ML project
17/08/2020
- Wrote the code to extract abstracts from ami section output. The code was written with comments to explain what each line does. The code does the following:
- Extracts abstracts from ami section
- Cleans the text
- Merges rows with similar paper IDs
- Adds it all to the final CSV file
This may be used for manual classification of all papers; a rough sketch of such a script is given below.
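The actual commented script lives in the repository; the following is only a sketch of the same idea, assuming ami section writes each paper's abstract paragraphs as XML files under a sections/ folder (the folder layout and file names here are assumptions):
PYTHON CODE
import csv
import glob
import os
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

corpus_dir = "corpus_950"   # illustrative CProject directory
abstracts = defaultdict(list)

# Assumed layout: <corpus>/<paper_id>/sections/**/*.xml, with "abstract" in the path.
for path in glob.glob(os.path.join(corpus_dir, "*", "sections", "**", "*.xml"), recursive=True):
    if "abstract" not in path.lower():
        continue
    paper_id = os.path.relpath(path, corpus_dir).split(os.sep)[0]
    text = "".join(ET.parse(path).getroot().itertext())
    text = re.sub(r"\s+", " ", text).strip().lower()
    if text:
        abstracts[paper_id].append(text)   # rows with the same paper ID are merged below

with open("abstracts_from_sections.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["paper", "abstract"])
    for paper_id, parts in sorted(abstracts.items()):
        writer.writerow([paper_id, " ".join(parts)])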
16/09/2020
- Finalized the country dictionary (MORE DETAILS)
- Ran ami search on the dictionary
ami -p ami_12_08_2020/corpus_950 search --dictionary ami_12_08_2020/amidict10/country.xml
- Ran ami search with two dictionaries (country and disease)
ami -p ami_12_08_2020/corpus_950 search --dictionary ami_12_08_2020/amidict10/country.xml
- Validated the country dictionary
- Started working on Machine Learning (NLP) in Jupyter Notebook
1/10/2020
- Used Machine Learning for Binary Classification
- Used getpapers to download a corpus of 2000 papers using the following queries:
getpapers -q (TITLE:"HIV" or TITLE:"EBOLA" NOT TITLE:"COVID-19" NOT TITLE:"Corona" NOT TITLE:"COVID") -o hackathon/test/ML_other_1 -f v_epid/log.txt -x -p -k 200
getpapers -q (TITLE:"COVID-19" or TITLE:"Corona" NOT TITLE:"Ebola" NOT TITLE:"HIV") -o hackathon/test/ML_covid_1 -f v_epid/log.txt -x -p -k 200
- Ran ami search to scrutinize the papers from the datatables generated using the disease dictionary:
ami -p hackathon/ML_covid_1 search --dictionary hackathon/dictionaries/disease.xml
ami -p hackathon/ML_other_1 search --dictionary hackathon/dictionaries/disease.xml
- Attempted binary classification on the two corpora after proper preprocessing: https://github.com/petermr/openVirus/tree/master/cambiohack2020
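The notebooks for this step are in the linked cambiohack2020 folder; as a minimal baseline only (TF-IDF plus logistic regression with scikit-learn, rather than the Keras/TensorFlow models mentioned above), assuming a labelled CSV of cleaned abstracts:
PYTHON CODE
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumed input: one row per paper with the cleaned abstract and a binary label
# (e.g. covid vs other, or true positive vs false positive).
df = pd.read_csv("abstracts_labelled.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df["abstract"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# Turn the abstracts into TF-IDF features and fit a simple linear classifier.
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_vec, y_train)

print("Test accuracy:", accuracy_score(y_test, clf.predict(X_test_vec)))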
Submitter: Pooja Pareek
The project is all about viral epidemics with respect to different countries. The purpose of the project is to put all the essential data in one place, together with the country dictionary, so that it is easy for everyone to understand.
To get started, all the necessary software first needs to be installed. What I have done so far is listed below:
- Installed getpapers:
- installed getpapers using the information provided here: https://github.com/ContentMine/getpapers/blob/master/README.md
- went to the download page and downloaded nvm-setup-zip
- ran nvm-setup-zip and installed it with the included installer
- installed Node by using the command prompt and running the following commands one after another: nvm install 7 and nvm use 7.10.1
- tested the installation with node --version
- ran the command npm install --global getpapers and getpapers was installed successfully
- Installed ami:
- with the help of https://github.com/petermr/openVirus/wiki/INSTALLING-ami3
- installed Java
- tested the Java installation with the command java -version and got Java version 1.8
- installed the JDK
- set the path as per the instructions
- Installed git
- downloaded git
- launched git bash
- tested the installation with the command git --version and got git version 2.27.0
- Installed maven
- downloaded maven using https://maven.apache.org/install.html
- downloaded apache-maven-3.6.3-bin.tar.gz or apache-maven-3.6.3-bin.zip
- extracted all files
- set the path variable
- tested the installation using mvn -version and got Apache Maven version 3.6.3
- opened the command prompt and entered the following commands one after another:
git clone https://github.com/petermr/ami3.git
cd ami3
mvn install -Dmaven.test.skip=true
- tested the installation using ami --help
- ami was installed successfully
- set the path for ami3
- tried to run ami search and ami section initially on 20 papers
[error: java should point to the JDK, not the JRE]
After the successful installation of the basic software, the following things were learnt with the help of the mentor's instructions provided on the wiki:
- edit a page.
- create a corpus (a corpus is a group of articles, in which all the articles belong to the respective topic).
- section the articles with ami section.
- create a datatable with ami search.
- create a dictionary using amidict.
- create a SPARQL query.
- manually differentiate between true positives and false positives.