
Dataset

This folder contains the code needed to create a dataset for analysis from the PubMed Central Open Access (PMC OA) collection.

config folder: contains the config files, the ground truth, the list of BMC and PLoS journals, and the Science-Metrix journal classification.
das classifier folder: contains code and instructions to reproduce the data availability statement (DAS) classification step.
dev set folder: contains a uniform random sample of 1000 articles from the PMC OA collection, created with the sample_dev_set.py script, which can be used for faster development.
exports folder: contains the exports produced by the scripts.
logs folder: empty; intended for log files.
A set of scripts to create the dataset; see the instructions below. You might need to adjust some parameters at the beginning of each script before running it.
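
The development set described above is a uniform sample of the collection. A minimal sketch of how such a sample could be drawn is shown here; it is not the code of sample_dev_set.py. The function names, the fixed seed, and the .nxml file suffix are assumptions made for illustration.

```python
import random
from pathlib import Path


def find_articles(root):
    """Collect article files under root.

    The .nxml suffix is an assumption about how the downloaded
    collection is stored on disk; adjust it to match your copy.
    """
    return [str(p) for p in Path(root).rglob("*.nxml")]


def sample_articles(paths, k=1000, seed=0):
    """Return k paths drawn uniformly at random, without replacement."""
    # Sort first so that the same seed always yields the same sample,
    # regardless of the order in which the file system lists files.
    paths = sorted(paths)
    if k >= len(paths):
        return paths
    rng = random.Random(seed)
    return rng.sample(paths, k)
```

For example, sample_articles(find_articles("path/to/collection"), k=1000) would return 1000 file paths, where "path/to/collection" is a placeholder for the local download location.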

Download the PMC OA collection, e.g. via the PMC FTP service (https://www.ncbi.nlm.nih.gov/pmc/tools/ftp). Optionally, sample it with the sample_dev_set.py script (or use the development set of 1000 articles provided in the dev set folder).
Set up a MongoDB instance and update the config file accordingly.
Run the parser_main.py script, which will create a first collection of articles in MongoDB.
Run the calculate_stats.py script, which will calculate citation counts for articles and authors and create the corresponding collections in MongoDB.
Run the get_export.py script, which will create a first export of the dataset in the exports folder.
Run the get_das_unique.py script, which will pull out unique DAS for classification.
Follow the instructions in the DAS classifier README.
Run the get_export_merged.py script, to create the final dataset for analysis.
Optionally, run the evaluation_plos.py and get_authors_top.py scripts for evaluation.
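
The order of the steps above can be sketched as a small runner. This is only an illustration: it assumes the scripts take no command-line arguments (parameters are edited inside each script) and that it is started from the folder containing them. The names BEFORE_CLASSIFICATION, AFTER_CLASSIFICATION, and run_stage are hypothetical and not part of the repository. The pipeline is split into two stages because the DAS classification step in between follows its own README.

```python
import subprocess
import sys

# Script names and order are taken from the steps listed above.
BEFORE_CLASSIFICATION = [
    "parser_main.py",
    "calculate_stats.py",
    "get_export.py",
    "get_das_unique.py",
]
# Run this stage only after completing the DAS classifier instructions.
AFTER_CLASSIFICATION = [
    "get_export_merged.py",
]


def run_stage(scripts, dry_run=True):
    """Run the scripts in order and return the commands used.

    With dry_run=True the commands are only printed. Otherwise each
    script is executed, and a failing script raises an error and
    stops the stage (check=True).
    """
    commands = [[sys.executable, script] for script in scripts]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_stage(BEFORE_CLASSIFICATION, dry_run=True)
```

Keeping dry_run=True by default makes it possible to check the order before anything touches the database.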