Folder containing the necessary code to create a dataset for analysis from the PubMed Central Open Access collection.
- config folder: contains config files, ground truth, the list of BMC and PLoS journals as well as the Science-Metrix journal classification.
- das classifier folder: contains code and instructions to reproduce the DAS (data availability statement) classification step.
- dev set folder: contains a uniform sample of 1000 articles from the PMC OA collection, created using the sample_dev_set.py script, which can be used for agile development.
- exports folder: contains exports from scripts.
- logs folder: empty, for log files.
- A set of scripts to create the dataset (see the instructions below). You might need to adjust some parameters at the beginning of each script before using it.
Instructions:
- Download the PubMed Central OA collection, e.g. via the PMC FTP service: https://www.ncbi.nlm.nih.gov/pmc/tools/ftp. Optionally, sample it using the sample_dev_set.py script (or use the development dataset of 1000 articles provided in the dev set folder).
- Set up a MongoDB instance and update the config file accordingly.
- Run the parser_main.py script, which will create a first collection of articles in Mongo.
- Run the calculate_stats.py script, which will calculate citation counts for articles and authors and create the corresponding collections in Mongo.
- Run the get_export.py script, which will create a first export of the dataset in the exports folder.
- Run the get_das_unique.py script, which will pull out unique DAS for classification.
- Follow the instructions in the DAS classifier README.
- Run the get_export_merged.py script, to create the final dataset for analysis.
- Optionally, run the evaluation_plos.py and get_authors_top.py scripts for evaluation.
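
The script order above can be sketched as a small runner. This is only an illustration, not part of the repository: the helper names (build_commands, run_stage) are hypothetical, and it assumes that each script is run without command-line arguments, since parameters are adjusted at the beginning of each script. The DAS classification step is manual and sits between the two stages.

```python
import subprocess
import sys
from pathlib import Path

# Scripts in the order given in the instructions above.
# The DAS classifier README is followed by hand between the two stages.
BEFORE_CLASSIFIER = [
    "parser_main.py",
    "calculate_stats.py",
    "get_export.py",
    "get_das_unique.py",
]
AFTER_CLASSIFIER = ["get_export_merged.py"]


def build_commands(scripts, script_dir=".", python=sys.executable):
    """Return one command (as an argument list) per script, in order.

    Assumption: the scripts take no command-line arguments.
    """
    return [[python, str(Path(script_dir) / name)] for name in scripts]


def run_stage(scripts, script_dir=".", dry_run=True):
    """Print each command and, unless dry_run is set, execute it.

    check=True stops the stage at the first script that fails, so later
    scripts never run against an incomplete Mongo collection or export.
    """
    commands = build_commands(scripts, script_dir)
    for cmd in commands:
        print("Running:", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default: only shows what would be executed.
    run_stage(BEFORE_CLASSIFIER, dry_run=True)
```

Calling run_stage(BEFORE_CLASSIFIER, dry_run=False) would execute the first four scripts; run_stage(AFTER_CLASSIFIER, dry_run=False) would then be called once the DAS classification is done.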
See requirements.
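
For the optional sampling step, a uniform sample without replacement can be sketched as follows. This is a minimal illustration and not necessarily how sample_dev_set.py works: the function names, the "*.nxml" file pattern and the fixed seed are all assumptions.

```python
import random
from pathlib import Path


def list_articles(root, pattern="*.nxml"):
    """Collect article files under root, sorted so results are reproducible.

    Assumption: articles are stored as files matching `pattern`.
    """
    return sorted(Path(root).rglob(pattern))


def uniform_sample(items, k=1000, seed=0):
    """Return k items drawn uniformly at random, without replacement.

    If there are k items or fewer, all of them are returned.
    """
    items = list(items)
    if len(items) <= k:
        return items
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    return rng.sample(items, k)
```

A development set would then be obtained with something like uniform_sample(list_articles("path/to/pmc_oa"), k=1000), where the path is a placeholder.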