AlphaTest (petermr) 2

petermr edited this page Jul 25, 2021 · 8 revisions

Alphatest continued

Not covered in first alphatest

  -z, --zip             download files from ftp endpoint if available (only eupmc supported)
  -l LOGLEVEL, --loglevel LOGLEVEL
  -f LOGFILE, --logfile LOGFILE
  -k LIMIT, --limit LIMIT
  -r RESTART, --restart RESTART
  -u UPDATE, --update UPDATE
  --onlyquery           Saves json file containing the result of the query in storage. (only eupmc
  -c, --makecsv         Stores the per-document metadata as csv.
  --makehtml            Stores the per-document metadata as html.
  --synonym             Results contain synonyms as well.
  --startdate STARTDATE
  --enddate ENDDATE     Gives papers till given date. Format: YYYY-MM-DD
  --terms TERMS         Location of the txt file which contains terms serperated by a comma which
  --api API             API to search [eupmc, crossref,arxiv,biorxiv,medrxiv,rxivist] (default:
  --filter FILTER       filter by key value pair (only crossref supported)

zip

Not easily testable: this needs pointers to PMCIDs that have files available on the FTP endpoint.

-k LIMIT

pygetpapers -q 'TPS30' -k 10 -o TPS30
INFO: Final query is TPS30
INFO: Total Hits are 18
0it [00:00, ?it/s]WARNING: Keywords not found for paper 4
WARNING: Keywords not found for paper 5
WARNING: Keywords not found for paper 10
1it [00:00, 268.38it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00,  2.28it/s]
(base) pm286macbook:petermr pm286$ tree TPS30/
TPS30/
├── PMC4457800
│   └── eupmc_result.json
├── PMC5122590
│   └── eupmc_result.json
├── PMC5161391
│   └── eupmc_result.json
├── PMC5655044
│   └── eupmc_result.json
├── PMC6266747
│   └── eupmc_result.json
├── PMC6742361
│   └── eupmc_result.json
├── PMC7305226
│   └── eupmc_result.json
├── PMC7600171
│   └── eupmc_result.json
├── PMC8036305
│   └── eupmc_result.json
├── PMC8201348
│   └── eupmc_result.json
└── eupmc_results.json

10 directories, 11 files
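
A quick way to check a run like this without reading the tree by eye is to count the per-paper directories and compare the count with the value given to -k. The sketch below is only an illustration: `count_paper_dirs` is a hypothetical helper, not part of pygetpapers, and it assumes every per-paper directory name starts with "PMC", as in the listing above. It runs against a throwaway directory so that it is self-contained.

```python
import tempfile
from pathlib import Path


def count_paper_dirs(output_dir):
    """Count per-paper directories (names starting with 'PMC') in an output folder."""
    return sum(
        1
        for entry in Path(output_dir).iterdir()
        if entry.is_dir() and entry.name.startswith("PMC")
    )


# Self-contained demonstration on a temporary directory with three fake papers.
with tempfile.TemporaryDirectory() as root:
    for pmcid in ("PMC4457800", "PMC5122590", "PMC5161391"):
        (Path(root) / pmcid).mkdir()
    (Path(root) / "eupmc_results.json").touch()  # summary file, not a paper
    demo_count = count_paper_dirs(root)

print(demo_count)
```

For the run above, `count_paper_dirs("TPS30")` would be expected to equal the -k value of 10.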

update [PROBLEM]

This should take the partial output of the previous search (limited to 10) and raise the limit to 100 (in practice there are only 18 hits). I could not get it to work.

$ pygetpapers -q 'TPS30' --update TPS30/eupmc_results.json -k 100 -o TPS30
INFO: Final query is TPS30
INFO: Please ensure that you are providing the same --api as the one in the corpus or you may get errors
Traceback (most recent call last):
  File "/opt/anaconda3/bin/pygetpapers", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 583, in main
    callpygetpapers.handlecli()
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 564, in handlecli
    self.handle_update(args)
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/pygetpapers.py", line 73, in handle_update
    self.europe_pmc.eupmc_update(args)
  File "/opt/anaconda3/lib/python3.8/site-packages/pygetpapers/europe_pmc.py", line 190, in eupmc_update
    os.chdir(os.path.dirname(args.update))
FileNotFoundError: [Errno 2] No such file or directory: 'TPS30'

I don't understand the error, since TPS30 exists in the current directory. Current files:

ls
2021_07_24_19_18_09	README.md		TPS31			tps_terms_3.txt
2021_07_25_18_21_12	TPS30			tps_terms_2.txt		tps_terms_50.txt
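
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: `os.path.dirname("TPS30/eupmc_results.json")` is the relative path `TPS30`, and `os.chdir` resolves a relative path against the current working directory of the process, not of the shell. If the program has already changed directory by the time line 190 of europe_pmc.py runs (for example into the -o output directory), `TPS30` no longer resolves even though `ls` shows it. The sketch below reproduces that behaviour in a temporary directory; `demo_relative_chdir` is a hypothetical helper, not pygetpapers code.

```python
import os
import tempfile


def demo_relative_chdir():
    """Show a relative path failing after a chdir, and an absolute path surviving it."""
    start = os.getcwd()
    outcomes = {}
    with tempfile.TemporaryDirectory() as root:
        try:
            os.chdir(root)
            os.mkdir("TPS30")
            update = os.path.join("TPS30", "eupmc_results.json")
            open(update, "w").close()

            # Resolve to an absolute path while the relative path is still valid.
            absolute_dir = os.path.dirname(os.path.abspath(update))

            # Assumption: something changes directory before the update step.
            os.chdir("TPS30")
            try:
                os.chdir(os.path.dirname(update))  # now looks for TPS30/TPS30
                outcomes["relative"] = "ok"
            except FileNotFoundError:
                outcomes["relative"] = "FileNotFoundError"

            os.chdir(absolute_dir)  # an absolute path does not depend on the cwd
            outcomes["absolute"] = "ok"
        finally:
            os.chdir(start)  # leave the temporary directory before it is removed
    return outcomes


print(demo_relative_chdir())
```

If this is what is happening, passing an absolute path to --update may be worth trying as a workaround; that suggestion is untested here.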

log and xml [Some problems]

(Cleaned TPS31/)

pygetpapers -q 'TPS31' --loglevel debug --logfile TPS31/logfile.txt -k 100 -x -o TPS31

gives

├── TPS31
│   ├── PMC3193516
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC3997964
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC4457800
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC5122590
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC5378189
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC6360234
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC7049213
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC7214349
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC7304153
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC7422722
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   ├── PMC8002989
│   │   ├── eupmc_result.json
│   │   └── fulltext.xml
│   └── eupmc_results.json

csv and html

Extraction of metadata as CSV or HTML

 pygetpapers -q 'TPS32' --makecsv --makehtml -k 100 -x -o TPS32

gives

├── TPS32
│   ├── PMC3997964
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC4457800
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC5378189
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC6546838
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC6742361
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC7214349
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC7305226
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC7422722
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC7821278
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── PMC8168065
│   │   ├── eupmc_result.html
│   │   ├── eupmc_result.json
│   │   ├── fulltext.csv
│   │   └── fulltext.xml
│   ├── eupmc_results.html
│   ├── eupmc_results.json
│   └── europe_pmc.csv

"fulltext.csv" contents:

more TPS32/PMC8168065/fulltext.csv
,Info_By_EuropePMC_Api
downloaded,True
htmlmade,False
full,"{'id': '33556995', 'source': 'MED', 'pmid': '33556995', 'pmcid': 'PMC8168065', 'fullTextIdList': {'fullTextId': 'PMC8168065'}, 'doi': '10.1002/jmrs.456', 'title': 'The cognitive and perceptual processes that affect observer performance in lung cancer detection: a scoping review.', 'authorString': 'Van De Luecht MR, Reed WM.', 'authorList': {'author': [{'fullName': 'Van De Luecht MR', 'firstName': 'Monica-Rose', 'lastName': 'Van De Luecht', 'initials': 'MR', 'authorId': {'@type': 'ORCID', '#text': '0000-0002-1168-3199'}, 'authorAffiliationDetailsList': {'authorAffiliation': {'affiliation': 'Discipline of Medical Imaging Science, Faculty of Medicine and Health, Sydney School of Health Sciences, The University of Sydney, Sydney, NS
...

The "full" row seems to hold the complete metadata record, serialised into a single cell.

  • Is this a good idea?
  • Is fulltext.csv a good name? (Surely this is metadata.)
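
For anyone who needs to read the file as it stands, note that the "full" cell shown above uses single-quoted keys, which suggests a Python dict literal rather than strict JSON. Under that assumption `json.loads` would reject it, while `ast.literal_eval` can decode it. The sketch below uses an abbreviated stand-in for the file, with values copied from the listing above; `read_metadata_csv` is a hypothetical helper, and real cells may contain values that `ast.literal_eval` cannot parse.

```python
import ast
import csv
import io

# Abbreviated stand-in for TPS32/PMC8168065/fulltext.csv.
SAMPLE = (
    ",Info_By_EuropePMC_Api\n"
    "downloaded,True\n"
    "htmlmade,False\n"
    "full,\"{'id': '33556995', 'source': 'MED', 'pmcid': 'PMC8168065', 'doi': '10.1002/jmrs.456'}\"\n"
)


def read_metadata_csv(handle):
    """Return {row label: value}, decoding the 'full' cell into a dict."""
    rows = {}
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for label, value in reader:
        rows[label] = value
    # The cell is a Python literal with single quotes, so json.loads would fail.
    rows["full"] = ast.literal_eval(rows["full"])
    return rows


metadata = read_metadata_csv(io.StringIO(SAMPLE))
print(metadata["full"]["pmcid"])
```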

TPS32/PMC8168065/eupmc_result.html: this file is a useful single-row table of metadata.


    <!doctype html>
    <html>
      <head>
          <meta http-equiv="Content-type" content="text/html; charset=utf-8">
          <script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.2/jquery.min.js">
          </script>
          <link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/1.10.16/css/jquery.dataTables.css">
          <script type="text/javascript" src="https://cdn.datatables.net/1.10.16/js/jquery.dataTables.js"></script>
          <style>
          # table {
              height: 250px;
              overflow-y:scroll;
          }
          </style>
      </head>
      <body><table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>downloaded</th>
      <th>htmllinks</th>
      <th>abstract</th>
      <th>Keywords</th>
      <th>pdflinks</th>
      <th>journaltitle</th>
      <th>authorinfo</th>
      <th>title</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>PMC8168065</th>
      <td>True</td>
      <td><a target="_blank" href="https://europepmc.org/articles/PMC8168065">Link</a></td>
      <td><div id="table"><h4>Introduction</h4>Early detection of malignant pulmonary nodules through screening has been shown to reduce lung cancer-related mortality by 20%. However, perceptual and cognitive factors that affect nodule detection are poorly understood. This review examines the cognitive and visual processes of various observers, with a particular focus on radiologists, during lung nodule detection.<h4>Methods</h4>Four databases (Medline, Embase, Scopus and PubMed) were searched to extract studies on eye-tracking in pulmonary nodule detection. Studies were included if they used eye-tracking to assess the search and detection of lung nodules in computed tomography or 2D radiographic imaging. Data were charted according to identified themes and synthesised using a thematic narrative approach.<h4>Results</h4>The literature search yielded 25 articles and five themes were discovered: 1 - functional visual field and satisfaction of search, 2 - expert search patterns, 3 - error classification through dwell time, 4 - the impact of the viewing environment and 5 - the effect of prevalence expectation on search. Functional visual field reduced to 2.7° in 3D imaging compared to 5° in 2D radiographs. Although gr(base) ....

TPS32/eupmc_results.html: this is the concatenated metadata, but it includes JSON and is unwieldy.

dates

no dates

$ pygetpapers -q 'TPS' -n 
INFO: Final query is TPS
INFO: Total number of hits for the query are 19509

start

pygetpapers -q 'TPS' --startdate 2019-01-01 -n 
INFO: Final query is (TPS) AND (FIRST_PDATE:[2019-01-01 TO 2021-07-25])
INFO: Total number of hits for the query are 6277

end

$ pygetpapers -q 'TPS' --enddate 2018-12-31 -n 
INFO: Final query is (TPS) AND (FIRST_PDATE:[TO 2018-12-31])
INFO: Total number of hits for the query are 2251

NOTE: the two date-restricted counts should add up to the unrestricted total, but 6277 + 2251 = 8528, well short of 19509. The discrepancy is probably due to missing date metadata.
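
To make the two date-restricted forms easier to compare, here is a minimal sketch that rebuilds the "Final query" strings from the log lines in this section. It is reconstructed from the logged output only and is not pygetpapers code; `build_dated_query` and its `today` parameter are hypothetical. Note that the end-only form in the log has no lower bound before `TO`.

```python
from datetime import date


def build_dated_query(query, start=None, end=None, today=None):
    """Compose a date-restricted query string in the shape shown in the logs.

    'today' stands in for the current date, which the start-only log line
    uses as its upper bound.
    """
    if start is None and end is None:
        return query
    if start is not None:
        upper = end or today or date.today().isoformat()
        date_range = f"[{start} TO {upper}]"
    else:
        date_range = f"[TO {end}]"  # end-only form, copied from the log line
    return f"({query}) AND (FIRST_PDATE:{date_range})"


print(build_dated_query("TPS", start="2019-01-01", today="2021-07-25"))
print(build_dated_query("TPS", end="2018-12-31"))
```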

simulated user problems

  • no hits
pygetpapers -q 'TPS999' -n 
INFO: Final query is TPS999
INFO: Total number of hits for the query are 0

(getpapers would fail)

  • broad query
pygetpapers -q 'method' -n 
INFO: Final query is method
INFO: Total number of hits for the query are 6387597

(The default for --limit would stop this before it spent 2 months downloading.)

  • garbage characters
pygetpapers -q '@$%^^' -n 
INFO: Final query is @$%^^
INFO: Total number of hits for the query are 0

  • bad parameters
pygetpapers -q 'TPS' --limit -10  
INFO: Final query is TPS
1it [00:00, 2462.89it/s]
0it [00:00, ?it/s]

Is -10 being treated as an escape sequence? A negative limit should give an error.
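
One way a negative limit could be rejected at parse time is a custom `type` function, sketched here with the standard-library argparse. This is an illustration only: `positive_int` is a hypothetical validator, the parser is a stand-in rather than the pygetpapers parser, and the default of 100 is a placeholder.

```python
import argparse


def positive_int(text):
    """argparse type: accept only integers greater than zero."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not an integer")
    if value <= 0:
        raise argparse.ArgumentTypeError(f"limit must be positive, got {value}")
    return value


parser = argparse.ArgumentParser(prog="limit-demo")
# default=100 is a placeholder for this demo, not a documented pygetpapers value
parser.add_argument("-k", "--limit", type=positive_int, default=100)

print(parser.parse_args(["--limit", "10"]).limit)
# parser.parse_args(["--limit", "-10"]) prints a usage error and raises SystemExit
```

Because this parser defines no options that look like negative numbers, argparse passes `-10` to the validator as the value of --limit rather than treating it as a separate flag.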

  • repeated parameters
pygetpapers -q 'TPS' -q 'Lantana' --limit 5 -o TPSX  
INFO: Final query is Lantana
INFO: Total Hits are 1912
0it [00:00, ?it/s]WARNING: html url not found for paper 1

The first query appears to be ignored.

pygetpapers -q 'TPS' -q 'Lantana' --limit 5 --limit 10 -o TPSX  
INFO: Final query is Lantana
INFO: Total Hits are 1912
0it [00:00, ?it/s]WARNING: html url not found for paper 1
WARNING: pdf url not found for paper 1
WARNING: Keywords not found for paper 2
WARNING: Keywords not found for paper 3
WARNING: Author list not found for paper 5
WARNING: Author list not found for paper 8
WARNING: Keywords not found for paper 9

pygetpapers appears to ignore the earlier values of any repeated arguments.
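
This matches how the standard-library argparse behaves with its default "store" action: each repeat overwrites the previous value, so the last one wins. Assuming the CLI relies on those semantics (not verified against the pygetpapers source), the sketch below reproduces the behaviour and shows one way a repeat could be turned into an error. `StoreOnce` is a hypothetical action class.

```python
import argparse

# Default behaviour: a repeated option silently keeps the last value.
lenient = argparse.ArgumentParser(prog="repeat-demo")
lenient.add_argument("-q", "--query")
lenient.add_argument("-k", "--limit", type=int)
args = lenient.parse_args(
    ["-q", "TPS", "-q", "Lantana", "--limit", "5", "--limit", "10"]
)
print(args.query, args.limit)


class StoreOnce(argparse.Action):
    """Store a value, but report an error if the option is given twice."""

    def __call__(self, parser, namespace, values, option_string=None):
        if getattr(namespace, self.dest, None) is not None:
            parser.error(f"{option_string} given more than once")
        setattr(namespace, self.dest, values)


strict = argparse.ArgumentParser(prog="repeat-demo-strict")
strict.add_argument("-q", "--query", action=StoreOnce)
# strict.parse_args(["-q", "TPS", "-q", "Lantana"]) exits with a usage error
```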