AlphaTest (petermr) 2
petermr edited this page Jul 25, 2021 · 8 revisions
Not covered in first alphatest
-z, --zip download files from ftp endpoint if available (only eupmc supported)
-l LOGLEVEL, --loglevel LOGLEVEL
-f LOGFILE, --logfile LOGFILE
-k LIMIT, --limit LIMIT
-r RESTART, --restart RESTART
-u UPDATE, --update UPDATE
--onlyquery Saves json file containing the result of the query in storage. (only eupmc
-c, --makecsv Stores the per-document metadata as csv.
--makehtml Stores the per-document metadata as html.
--synonym Results contain synonyms as well.
--startdate STARTDATE
--enddate ENDDATE Gives papers till given date. Format: YYYY-MM-DD
--terms TERMS Location of the txt file which contains terms serperated by a comma which
--api API API to search [eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist] (default:
--filter FILTER filter by key value pair (only crossref supported)
not easily testable (need some pointers to PMCIDs)
pygetpapers -q 'TPS30' -k 10 -o TPS30
INFO: Final query is TPS30
INFO: Total Hits are 18
0it [00:00, ?it/s]WARNING: Keywords not found for paper 4
WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 10
1it [00:00, 268.38it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00, 2.28it/s]
(base) pm286macbook:petermr pm286$ tree TPS30/
TPS30/
├── PMC4457800
│ └── eupmc_result.json
├── PMC5122590
│ └── eupmc_result.json
├── PMC5161391
│ └── eupmc_result.json
├── PMC5655044
│ └── eupmc_result.json
├── PMC6266747
│ └── eupmc_result.json
├── PMC6742361
│ └── eupmc_result.json
├── PMC7305226
│ └── eupmc_result.json
├── PMC7600171
│ └── eupmc_result.json
├── PMC8036305
│ └── eupmc_result.json
├── PMC8201348
│ └── eupmc_result.json
└── eupmc_results.json
10 directories, 11 files
This should take the partial output of the previous search (limited to 10) and extend it to 100 (in practice only 18 hits). I couldn't get it to work.
$ pygetpapers -q 'TPS30' --update TPS30/eupmc_results.json -k 100 -o TPS30
INFO: Final query is TPS30
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
Traceback (most recent call last):
File "/opt/anaconda3/bin/pygetpapers", line 8, in <module>
sys.exit(main())
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 583, in main
callpygetpapers.handlecli()
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 564, in handlecli
self.handle_update(args)
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 73, in handle_update
self.europe_pmc.eupmc_update(args)
File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/europe_pmc.py", line 190, in eupmc_update
os.chdir(os.path.dirname(args.update))
FileNotFoundError: [Errno 2] No such file or directory: 'TPS30'
I don't understand the error. Current files:
ls
2021_07_24_19_18_09 README.md TPS31 tps_terms_3.txt
2021_07_25_18_21_12 TPS30 tps_terms_2.txt tps_terms_50.txt
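One possible reading of the traceback, offered as an assumption rather than a diagnosis: `os.path.dirname('TPS30/eupmc_results.json')` is the relative path `TPS30`, and `os.chdir` on a relative path only succeeds from the directory that contains it. If the working directory had already moved (for example into the `-o TPS30` output directory, which is a guess), the same call would fail with this message even though `ls` shows `TPS30`. The sketch below reproduces that situation in a temporary directory; `reproduce_relative_chdir_error` is a hypothetical helper, not pygetpapers code.

```python
import os
import tempfile


def reproduce_relative_chdir_error(update_path="TPS30/eupmc_results.json"):
    """Return True if a second relative chdir fails with FileNotFoundError.

    Hypothetical helper: it mimics the failing line from the traceback,
    os.chdir(os.path.dirname(args.update)), after the working directory
    has already moved into TPS30 (an assumed cause, not a confirmed one).
    """
    start = os.getcwd()
    with tempfile.TemporaryDirectory() as tmp:
        try:
            os.chdir(tmp)
            corpus_dir = os.path.dirname(update_path)  # 'TPS30'
            os.makedirs(corpus_dir)
            open(update_path, "w").close()  # empty stand-in for the results file

            os.chdir(corpus_dir)  # works: TPS30 exists under tmp
            try:
                os.chdir(corpus_dir)  # fails: there is no TPS30 inside TPS30
            except FileNotFoundError:
                return True
            return False
        finally:
            os.chdir(start)  # leave the temporary directory before cleanup


if __name__ == "__main__":
    print(reproduce_relative_chdir_error())
```

If this is what happened, resolving the path with `os.path.abspath` before any change of directory would avoid the failure; whether pygetpapers changes directory earlier in the run has not been checked here.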
(Cleaned TPS31/)
pygetpapers -q 'TPS31' --loglevel debug --logfile TPS31/logfile.txt -k 100 -x -o TPS31
gives
├── TPS31
│ ├── PMC3193516
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC3997964
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC4457800
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC5122590
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC5378189
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC6360234
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC7049213
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC7214349
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC7304153
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC7422722
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ ├── PMC8002989
│ │ ├── eupmc_result.json
│ │ └── fulltext.xml
│ └── eupmc_results.json
Extraction of metadata as CSV or HTML
pygetpapers -q 'TPS32' --makecsv --makehtml -k 100 -x -o TPS32
gives
├── TPS32
│ ├── PMC3997964
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC4457800
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC5378189
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC6546838
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC6742361
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC7214349
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC7305226
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC7422722
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC7821278
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── PMC8168065
│ │ ├── eupmc_result.html
│ │ ├── eupmc_result.json
│ │ ├── fulltext.csv
│ │ └── fulltext.xml
│ ├── eupmc_results.html
│ ├── eupmc_results.json
│ └── europe_pmc.csv
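To check a tree like this without reading it by eye, a small script can list the files in each per-paper folder and report which folders lack an expected file. This is only a sketch: `summarise_corpus` and `papers_missing` are hypothetical helper names, and the only thing assumed about the layout is what the `tree` output above shows (one folder per PMCID under the output directory).

```python
from pathlib import Path


def summarise_corpus(corpus_dir):
    """Map each per-paper folder name to the sorted list of file names in it."""
    summary = {}
    for entry in sorted(Path(corpus_dir).iterdir()):
        if entry.is_dir():
            summary[entry.name] = sorted(
                p.name for p in entry.iterdir() if p.is_file()
            )
    return summary


def papers_missing(summary, filename):
    """Return the folders whose file list does not include `filename`."""
    return [paper for paper, files in summary.items() if filename not in files]


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python check_corpus.py TPS32
    corpus = summarise_corpus(sys.argv[1])
    for name in ("eupmc_result.json", "fulltext.xml"):
        print(name, "missing in:", papers_missing(corpus, name))
```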
"fulltext.csv" contents:
more TPS32/PMC8168065/fulltext.csv
,Info_By_EuropePMC_Api
downloaded,True
htmlmade,False
full,"{'id': '33556995', 'source': 'MED', 'pmid': '33556995', 'pmcid': 'PMC8168065', 'fullTextIdList': {'fullTextId': 'PMC8168065'}, 'doi': '10.1002/jmrs.456', 'title': 'The cognitive and perceptual processes that affect observer performance in lung cancer detection: a scoping review.', 'authorString': 'Van De Luecht MR, Reed WM.', 'authorList': {'author': [{'fullName': 'Van De Luecht MR', 'firstName': 'Monica-Rose', 'lastName': 'Van De Luecht', 'initials': 'MR', 'authorId': {'@type': 'ORCID', '#text': '0000-0002-1168-3199'}, 'authorAffiliationDetailsList': {'authorAffiliation': {'affiliation': 'Discipline of Medical Imaging Science, Faculty of Medicine and Health, Sydney School of Health Sciences, The University of Sydney, Sydney, NS
...
Seems to have a column containing the complete JSON metadata.
- is this a good idea??
- is fulltext.csv a good name? (surely it's metadata)
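To make the concern concrete: the file is laid out as key,value rows, and the `full` cell holds what looks like a Python dict literal (single-quoted) rather than strict JSON. A minimal sketch of reading it follows, assuming that layout holds for every record. The sample string is an abbreviated copy of the output above, and `read_key_value_csv` and `parse_full` are hypothetical helper names.

```python
import ast
import csv
import io

# Abbreviated copy of the fulltext.csv content shown above (most keys omitted).
SAMPLE = (
    ",Info_By_EuropePMC_Api\n"
    "downloaded,True\n"
    "htmlmade,False\n"
    "full,\"{'id': '33556995', 'pmcid': 'PMC8168065', 'doi': '10.1002/jmrs.456'}\"\n"
)


def read_key_value_csv(text):
    """Read the two-column key,value layout into a dict of strings."""
    rows = csv.reader(io.StringIO(text))
    next(rows)  # skip the header row (",Info_By_EuropePMC_Api")
    return {row[0]: row[1] for row in rows if len(row) >= 2}


def parse_full(record):
    """Turn the 'full' cell back into a dict.

    ast.literal_eval is used because the cell uses single quotes,
    so json.loads would reject it.
    """
    return ast.literal_eval(record["full"])


if __name__ == "__main__":
    record = read_key_value_csv(SAMPLE)
    print(record["downloaded"], parse_full(record)["pmcid"])
```

Needing `ast.literal_eval` to recover the metadata suggests the nested record is awkward to consume from CSV, which supports the question raised above.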
TPS32/PMC8168065/eupmc_result.html: this file is a useful single-row table of metadata.
<!doctype html>
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js">
</script>
<link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
<script type="text/javascript" src="https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
<style>
# table {
height: 250px;
overflow-y:scroll;
}
</style>
</head>
<body><table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>downloaded</th>
<th>htmllinks</th>
<th>abstract</th>
<th>Keywords</th>
<th>pdflinks</th>
<th>journaltitle</th>
<th>authorinfo</th>
<th>title</th>
</tr>
</thead>
<tbody>
<tr>
<th>PMC8168065</th>
<td>True</td>
<td><a target="_blank" href="https://europepmc.org/articles/PMC8168065">Link</a></td>
<td><div id="table"><h4>Introduction</h4>Early detection of malignant pulmonary nodules through screening has been shown to reduce lung cancer-related mortality by 20%. However, perceptual and cognitive factors that affect nodule detection are poorly understood. This review examines the cognitive and visual processes of various observers, with a particular focus on radiologists, during lung nodule detection.<h4>Methods</h4>Four databases (Medline, Embase, Scopus and PubMed) were searched to extract studies on eye-tracking in pulmonary nodule detection. Studies were included if they used eye-tracking to assess the search and detection of lung nodules in computed tomography or 2D radiographic imaging. Data were charted according to identified themes and synthesised using a thematic narrative approach.<h4>Results</h4>The literature search yielded 25 articles and five themes were discovered: 1 - functional visual field and satisfaction of search, 2 - expert search patterns, 3 - error classification through dwell time, 4 - the impact of the viewing environment and 5 - the effect of prevalence expectation on search. Functional visual field reduced to 2.7° in 3D imaging compared to 5° in 2D radiographs. Although gr(base) ....
TPS32/eupmc_results.html: this is concatenated metadata, but it includes JSON and is unwieldy.
no dates
$ pygetpapers -q 'TPS' -n
INFO: Final query is TPS
INFO: Total number of hits for the query are 19509
start
pygetpapers -q 'TPS' --startdate 2019-01-01 -n
INFO: Final query is (TPS) AND (FIRST_PDATE:[2019-01-01 TO 2021-07-25])
INFO: Total number of hits for the query are 6277
end
$ pygetpapers -q 'TPS' --enddate 2018-12-31 -n
INFO: Final query is (TPS) AND (FIRST_PDATE:[TO 2018-12-31])
INFO: Total number of hits for the query are 2251
NOTE: the start-date and end-date hits should add up to the undated total, but they do not; the discrepancy is probably due to records with missing date metadata.
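The three runs can be summarised in a short sketch. `build_dated_query` is reconstructed only from the logged "Final query" strings and is not the pygetpapers implementation; `unaccounted_hits` is a hypothetical helper that quantifies the gap from the logged hit counts.

```python
from datetime import date


def build_dated_query(query, startdate=None, enddate=None, today=None):
    """Compose a query string in the shape shown in the log lines above.

    Assumption: with only a start date the range ends at today's date,
    and with only an end date the range start is left empty, as logged.
    """
    if startdate is None and enddate is None:
        return query
    if startdate is not None and enddate is None:
        enddate = (today or date.today()).isoformat()
    if startdate is None:
        date_range = f"[TO {enddate}]"
    else:
        date_range = f"[{startdate} TO {enddate}]"
    return f"({query}) AND (FIRST_PDATE:{date_range})"


def unaccounted_hits(total, from_start, until_end):
    """Hits in the undated total that neither dated query returned."""
    return total - (from_start + until_end)


if __name__ == "__main__":
    print(build_dated_query("TPS", startdate="2019-01-01", today=date(2021, 7, 25)))
    print(build_dated_query("TPS", enddate="2018-12-31"))
    # Hit counts copied from the three runs above.
    print(unaccounted_hits(19509, 6277, 2251))
```

With the logged counts, the two dated queries cover 8528 of the 19509 hits, leaving 10981 unaccounted for.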