From b4797e93e037492b941e06e4995c6e0883f8a47a Mon Sep 17 00:00:00 2001
From: petermr
Date: Tue, 15 Mar 2022 10:03:27 +0000
Subject: [PATCH 1/9] Update paper.md

---
 joss/paper.md | 59 ++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/joss/paper.md b/joss/paper.md
index ae9f646..a23433c 100644
--- a/joss/paper.md
+++ b/joss/paper.md
@@ -35,6 +35,7 @@ In 2015, we reviewed tools for scraping websites and decided that none met our n

An important aspect is to provide a simple cross-platform approach for scientists who may find tools like `curl` too complex and want a one-line command to combine the search, download, and analysis into a single command: "please give me the results". We've tested this on many interns, who learn `pygetpapers` in minutes. It was also easy to wrap it in a `tkinter` GUI [@tkinter]. The architecture of the results is simple and natural, based on full-text files in the normal filesystem. The result of `pygetpapers` is interfaced using a “master” JSON file (e.g. eupmc_results.json), which allows the corpus to be reused and added to. This allows maximum flexibility of re-use, and some projects have large amounts of derived data in these directories.
+
``` pygetpapers -q "METHOD: invasive plant species" -k 10 -o "invasive_plant_species_test" -c --makehtml -x --save_query ``` @@ -51,9 +52,11 @@ INFO: Saving XML files to C:\Users\shweata\invasive_plant_species_test\*\fulltex ```

Fig.1 Example query of `pygetpapers`

- +
The number and type of scientific repositories (especially preprints) is expanding, and users do not want to use a different tool for each new one. `pygetpapers` is built on a modular system and repository-specific code can be swapped in as needed. Often they use different query systems and `pygetpapers` makes a start on simplifying this. By configuring repositories in a configuration file, users can easily add support for new repositories.
+
+ ``` [europe_pmc] query_url=https://www.ebi.ac.uk/europepmc/webservices/rest/searchPOST @@ -72,6 +75,7 @@ features_not_supported = ["filter",] ```

Fig.2 Example configuration for a repository (EuropePMC)

+
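The configuration in Fig.2 is an INI-style file, so repository definitions can be inspected with stock tooling. A minimal sketch of reading one (the file name and loading code are illustrative, not the actual `pygetpapers` internals):

```python
# Sketch: reading a repository definition like Fig.2 with configparser.
# The file name and this loading code are illustrative, not pygetpapers internals.
import ast
import configparser

config = configparser.ConfigParser()
config.read("repository_config.ini")  # hypothetical file name

epmc = config["europe_pmc"]
query_url = epmc["query_url"]
# list-valued entries (e.g. features_not_supported) are stored as strings;
# ast.literal_eval recovers the Python list
not_supported = ast.literal_eval(epmc["features_not_supported"])
print(query_url, not_supported)
```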
Many **searches** are simple keywords or phrases. However, these often fail to include synonyms and related phrases, and authors spend time creating complex, error-prone Boolean queries. We have developed a dictionary-based approach to automate much of the creation of complex queries (a command-line sketch appears below, after the data list).

@@ -82,6 +86,8 @@ Frequently users want to search **incrementally**, e.g. downloading part and res
 `pygetpapers` takes the approach of downloading once and re-analyzing later on local filestore. This saves repeated querying where connections are poor or where there is suspicion that publishers may surveil users. Moreover, publishers rarely provide more than full-text Boolean searches, whereas local tools can analyze sections and non-textual material.
 
+The number of repositories is rapidly expanding, driven by the rise in preprint use (both per-subject and per-country), institutional repositories and aggregation sites such as EuropePMC, HAL, SciELO, etc. Each uses its own dialect of query syntax and API access. A major aspect of `pygetpapers` is to make it easy to add new repositories, often by people who have little coding experience.
+
 We do not know of other tools which have the same functionality. `curl` [@curl] requires detailed knowledge of the download protocol. VosViewer [@VOSviewer] is mainly aimed at bibliography/citations.

# Overview of the architecture

@@ -98,15 +104,25 @@ The download may be repository-dependent but usually contains:
 - PDF - usually includes the whole material but not machine-sectioned
 - HTML. Often available on websites
 * supplemental data. This is very variable, often as PDF but also raw data files and sometimes zipped. It is not systematically arranged but `pygetpapers` allows for some heuristics.
+* figures. These are not supported by some repositories, and others may require custom code.
+
+
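As promised above, a sketch of dictionary-driven and incremental searching on the command line. The `--terms` and `--update` flags are taken from the `pygetpapers` documentation as we understand it, and the dictionary and directory names are hypothetical; check `pygetpapers --help` on the installed version:

```
# expand the query with terms from a dictionary (hypothetical dictionary file)
pygetpapers -q "invasive plant species" --terms plant_terms.xml -k 50 -o invasive_corpus -x

# later, add new hits to the same corpus instead of re-downloading it
pygetpapers -q "invasive plant species" --update -k 50 -o invasive_corpus -x
```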
![Fig.3 Architecture of `pygetpapers`](../resources/archietecture.png)

Fig.3 Architecture of `pygetpapers`

+
-This directory structure is designed so that analysis tools can add computed data for articles
+For this reason we create a directory structure with a root (`CProject`) and a subdirectory (`CTree`) for each downloaded article or document. `pygetpapers` will routinely populate this with 1-5 files or subdirectories (see above). At present `pygetpapers` always creates a *_result.json file (possibly empty) and this can be used as a marker for identifying CTrees. This means that a `CProject` contains subdirectories which may be CTrees or not, distinguished by this marker.
 
+## derived data
 
-```
+Besides the downloaded data (already quite variable) users often wish to create new derived data and this directory structure is designed so that tools can add an arbitrary amount of new data, normally in sub-directory trees. For example we have sibling projects that add data to the `CTree`:
+* docanalysis (text analysis including NLTK and spaCy/sciSpaCy) [URL]
+* pyamiimage (image processing and analysis of figures). [URL]
+
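A minimal sketch of how a downstream tool can use this marker to distinguish CTrees (the helper functions are hypothetical, not part of the `pygetpapers` API):

```python
# Sketch: identify CTrees inside a CProject via the *_result.json marker.
# Hypothetical helpers, not part of the pygetpapers API.
from pathlib import Path

def is_ctree(directory: Path) -> bool:
    # a CTree is any subdirectory containing a *_result.json marker file
    return any(directory.glob("*_result.json"))

def ctrees(cproject: Path):
    # yield only those children of the CProject that are CTrees
    for child in sorted(cproject.iterdir()):
        if child.is_dir() and is_ctree(child):
            yield child

for ctree in ctrees(Path("invasive_plant_species_test")):
    print(ctree.name)
```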
+
```directory
C:.
│   eupmc_results.json
│
@@ -118,23 +134,42 @@ C:.
 │       eupmc_result.json
 │       fulltext.xml
 │
+```
+and with examples of derived data
+```
 ├───PMC8198815
 │       eupmc_result.json
 │       fulltext.xml
+│       bag_of_words.txt
+│       figure/
+│           raw.jpg
+│           skeleton.png
 │
-├───PMC8216501
-│       eupmc_result.json
-│       fulltext.xml
-│
-├───PMC8309040
-│       eupmc_result.json
+├───10.9999_123456                        # CTree due to fooRxiv_result.json
+│       fooRxiv_result.json
 │       fulltext.xml
+│       bag_of_words.txt
+│       search/
+│           results/
+│               terpenes.csv
 │
-└───PMC8325914
-        eupmc_result.json
-        fulltext.xml
+├───univ_bar_thesis_studies_on_lantana    # CTree due to thesis_12345_result.json
+│       thesis_12345_result.json
+│       fulltext.pdf
+│       figures/
+│           figure/
+│               Fig1/
+│
+└───summary                               # not a CTree as no child *_result.json
+        bag_of_words.txt
+        figures/
+│
 ```

Fig.4 Typical download directory

+

Several types of download have been combined in this CProject, and some CTrees have derived data.
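Because the layout is plain files, downstream analysis needs no special API. A hedged sketch of reading the master JSON and per-CTree metadata (file names as in the listing above; the JSON schema is repository-dependent, so the key access is illustrative):

```python
# Sketch: consuming a CProject from the filesystem (file names as in
# the directory listing; JSON schema is repository-dependent).
import json
from pathlib import Path

cproject = Path("invasive_plant_species_test")

# corpus-level "master" metadata, written once per download
master = json.loads((cproject / "eupmc_results.json").read_text(encoding="utf-8"))

# per-article metadata and fulltext inside each CTree
for result_json in sorted(cproject.glob("*/eupmc_result.json")):
    ctree = result_json.parent
    metadata = json.loads(result_json.read_text(encoding="utf-8"))
    print(ctree.name, (ctree / "fulltext.xml").exists())
```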

+ ## Code From 628f03a1198c499eafdc27cf239aeb5a332ec9b9 Mon Sep 17 00:00:00 2001 From: petermr Date: Tue, 15 Mar 2022 10:06:20 +0000 Subject: [PATCH 2/9] Update paper.md --- joss/paper.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/joss/paper.md b/joss/paper.md index a23433c..38d087d 100644 --- a/joss/paper.md +++ b/joss/paper.md @@ -122,7 +122,7 @@ Besides the downloaded data (already quite variable) users often wish to create * pyamiimage (image processing and analysis of figures). [URL]
-```directory +``` C:. │ eupmc_results.json │ From ccfea8d2f68e0c677e11497767f2271f6d7f83f4 Mon Sep 17 00:00:00 2001 From: petermr Date: Tue, 15 Mar 2022 10:07:47 +0000 Subject: [PATCH 3/9] Update paper.md --- joss/paper.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/joss/paper.md b/joss/paper.md index 38d087d..47cf9d7 100644 --- a/joss/paper.md +++ b/joss/paper.md @@ -122,6 +122,7 @@ Besides the downloaded data (already quite variable) users often wish to create * pyamiimage (image processing and analysis of figures). [URL]
+
 ```
 C:.
 │   eupmc_results.json
 │
@@ -135,7 +136,9 @@ C:.
 │       fulltext.xml
 │
 ```
+
 and with examples of derived data
+
 ```
 ├───PMC8198815
 │       eupmc_result.json
@@ -166,6 +169,7 @@ and with examples of derived data
 │
 ```
+

Fig.4 Typical download directory

Several types of download have been combined in this CProject, and some CTrees have derived data.

@@ -218,7 +222,6 @@ Among our own downstream tools are
 
 # Acknowledgements
 
-We thank Dr. Peter Murray-Rust for the support and help with the design of the manuscript.
 
From e2f1aa6ff30b6cdc9f80f92001fbcb7e344047de Mon Sep 17 00:00:00 2001
From: petermr
Date: Wed, 16 Mar 2022 10:23:30 +0000
Subject: [PATCH 4/9] Update paper.md

---
 joss/paper.md | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/joss/paper.md b/joss/paper.md
index 47cf9d7..bffa045 100644
--- a/joss/paper.md
+++ b/joss/paper.md
@@ -94,6 +94,8 @@ We do not know of other tools which have the same functionality. `curl` [@curl]
 
 ## Data
 
+### raw data
+
 The download may be repository-dependent but usually contains:
 * download metadata. (query, date, errors, etc.)
 * journal/article metadata. We use JATS-NISO [@JATS] which is widely used by publishers and repository owners, especially in bioscience and medicine. There are over 200 tags.
@@ -115,12 +117,13 @@ The download may be repository-dependent but usually contains:
 
 For this reason we create a directory structure with a root (`CProject`) and a subdirectory (`CTree`) for each downloaded article or document. `pygetpapers` will routinely populate this with 1-5 files or subdirectories (see above). At present `pygetpapers` always creates a *_result.json file (possibly empty) and this can be used as a marker for identifying CTrees. This means that a `CProject` contains subdirectories which may be CTrees or not, distinguished by this marker.
 
-## derived data
+### derived data
 
 Besides the downloaded data (already quite variable) users often wish to create new derived data and this directory structure is designed so that tools can add an arbitrary amount of new data, normally in sub-directory trees. For example we have sibling projects that add data to the `CTree`:
 * docanalysis (text analysis including NLTK and spaCy/sciSpaCy) [URL]
 * pyamiimage (image processing and analysis of figures). [URL]
 
+
``` @@ -173,6 +176,7 @@ and with examples of derived data

Fig.4 Typical download directory

Several types of download have been combined in this CProject, and some CTrees have derived data.

+

@@ -211,6 +215,18 @@ The control module `pygetpapers` reads the commandline and
 
 `getpapers` was implemented in `NodeJS` which allows multithreading and therefore potentially download rates of several XML documents per second on a fast line. Installing `NodeJS` was a problem on some systems (especially Windows) and was not well suited for integration with scientific libraries (mainly coded in Java and Python). We therefore decided to rewrite it in Python, keeping only the command line and output structure, and have found very easy integration with other tools, including GUIs. `pygetpapers` can be run both as a command-line tool and a module, which makes it versatile.
 
+## core
+The core mainly consists of:
+* `pygetpapers.py` (query-builder and runner). This includes query abstractions such as dates and Boolean queries for terms
+* `download_tools.py` (generic code for query/download (REST))
+
+## repository interfaces
+We have tried to minimise the amount of repository-specific code, choosing to use declarative configuration files. To add a new repository you will need to:
+* create a configuration file (Fig. 2)
+* subclass the repo from `repository_interface.py`
+* add any repository_specific code to add features or disable others
+
+
 # Interface with other tools
 
 Downloading is naturally modular, rather slow, and we interface by writing all output to the filesystem. This means that a wide range of tools (Unix, Windows, Java, Python, etc.) can analyze and transform it. The target documents are usually static so downloads only need to be done once.
 
From 837be2614d689d813ad0d0f9e85f9a44f32fc69b Mon Sep 17 00:00:00 2001
From: Ayush Garg
Date: Wed, 16 Mar 2022 16:02:47 +0800
Subject: [PATCH 5/9] Update paper.md

---
 joss/paper.md | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/joss/paper.md b/joss/paper.md
index bffa045..c9f2533 100644
--- a/joss/paper.md
+++ b/joss/paper.md
@@ -53,7 +53,7 @@ INFO: Saving XML files to C:\Users\shweata\invasive_plant_species_test\*\fulltex

Fig.1 Example query of `pygetpapers`

-The number and type of scientific repositories (especially preprints) is expanding, and users do not want to use a different tool for each new one. `pygetpapers` is built on a modular system and repository-specific code can be swapped in as needed. Often they use different query systems and `pygetpapers` makes a start on simplifying this. By configuring repositories in a configuration file, users can easily add support for new repositories.
+The number of repositories is rapidly expanding, driven by the rise in preprint use (both per-subject and per-country), institutional repositories and aggregation sites such as EuropePMC, HAL, SciELO, etc. Each uses its own dialect of query syntax and API access. A major aspect of `pygetpapers` is to make it easy to add new repositories, often by people who have little coding experience. `pygetpapers` is built on a modular system and repository-specific code can be swapped in as needed. By configuring repositories in a configuration file, users can easily add support for new repositories.
@@ -86,8 +86,6 @@ Frequently users want to search **incrementally**, e.g. downloading part and res
 `pygetpapers` takes the approach of downloading once and re-analyzing later on local filestore. This saves repeated querying where connections are poor or where there is suspicion that publishers may surveil users. Moreover, publishers rarely provide more than full-text Boolean searches, whereas local tools can analyze sections and non-textual material.
 
-The number of repositories is rapidly expanding, driven by the rise in preprint use (both per-subject and per-country), institutional repositories and aggregation sites such as EuropePMC, HAL, SciELO, etc. Each uses its own dialect of query syntax and API access. A major aspect of `pygetpapers` is to make it easy to add new repositories, often by people who have little coding experience.
-
 We do not know of other tools which have the same functionality. `curl` [@curl] requires detailed knowledge of the download protocol. VosViewer [@VOSviewer] is mainly aimed at bibliography/citations.
 
 # Overview of the architecture
@@ -188,12 +186,13 @@ Most repository APIs provide a cursor-based approach to querying:
 1. A query is sent and the repository creates a list of M hits (pointers to documents), sets a cursor start, and returns this information to the `pygetpapers` client.
 2. The client requests a chunk of size N <= M (normally 25-1000) and the repository replies with N pointers to documents.
 3. The server response is pages of hits (metadata) as XML, normally <= 1000 hits per page (~1 sec per page)
-4. `pygetpapers` - incremental aggregates XML metadata as python dict in memory - small example for paper
-5. If cursor indicates next page, submits a query for next page, else if end terminates this part
-6. When finished all pages, writes metadata to CProject (Top level project directory) as JSON (total, and creates CTrees (per-article directories) with individual metadata)
-7. Recover from crashes, restart (if needed)
+4. `pygetpapers` incrementally aggregates the XML metadata as a Python dict in memory
+5. If the cursor indicates a next page, `pygetpapers` submits a query for the next page; otherwise it terminates the data collection and processes the Python dict
+6. If the user has requested supplemental data (e.g. references, citations, fulltext), `pygetpapers` iterates through the Python dict and uses the identifier, usually a DOI, to query and download the supplemental data separately.
+7. When the search is finished, `pygetpapers` writes the total metadata to the CProject (top-level project directory) as JSON, and creates CTrees (per-article directories) with individual metadata.
+8. It also recovers from crashes and restarts if needed.
 
-The control module `pygetpapers` reads the commandline and
+The control module `pygetpapers.py` reads the command line and
 * Selects the repository-specific downloader
 * Creates a query from user input and/or terms from dictionaries
 * Adds options and constraints
@@ -201,7 +200,7 @@ The control module `pygetpapers` reads the commandline and
 
 # Generic downloading concerns
 
-* Download speeds. Excessively rapid or voluminous downloads can overload servers and are sometimes hostile (DOS). We have discussed this with major sites (EPMC, biorXiv, Crossref etc. and have a default (resettable) delay in `pygetpapers`.
+* Download speeds. Excessively rapid or voluminous downloads can overload servers and are sometimes hostile (DOS). We have discussed this with major sites (EPMC, bioRxiv, Crossref, etc.) and therefore choose to download sequentially instead of sending parallel requests in `pygetpapers`.
 * Authentication (alerting repo to downloader header). `pygetpapers` supports anonymous, non-authenticated access but includes a header (e.g. for Crossref)
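A minimal sketch of the cursor loop in steps 1-8 and the sequential-download policy above (hypothetical helper, not `pygetpapers` internals; the `cursorMark`/`nextCursorMark` parameters follow EuropePMC's documented convention and other repositories differ):

```python
# Sketch of the cursor-based paging loop (steps 1-8); hypothetical helper,
# not pygetpapers internals. cursorMark/nextCursorMark follow EuropePMC's
# documented convention; other repositories differ.
import time
import requests

def fetch_all_metadata(query_url, query, page_size=1000, delay=1.0):
    records = {}                       # step 4: aggregate metadata in memory
    cursor = "*"                       # EuropePMC's initial cursor value
    while True:
        response = requests.get(query_url, params={
            "query": query,
            "pageSize": page_size,
            "cursorMark": cursor,
            "format": "json",
        })
        response.raise_for_status()
        page = response.json()
        for hit in page.get("resultList", {}).get("result", []):
            records[hit["id"]] = hit   # keyed by identifier (used in step 6)
        next_cursor = page.get("nextCursorMark")
        if not next_cursor or next_cursor == cursor:
            break                      # step 5: no further page, stop
        cursor = next_cursor
        time.sleep(delay)              # sequential, polite downloading
    return records                     # step 7 then writes CProject/CTrees
```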
 # Design
 
@@ -224,7 +223,7 @@ We have tried to minimise the amount of repository-specific code, choosing to use declarative configuration files. To add a new repository you will need to:
 * create a configuration file (Fig. 2)
 * subclass the repo from `repository_interface.py`
-* add any repository_specific code to add features or disable others
+* add any repository-specific code to add features or disable others
 
 # Interface with other tools
 
From 9e0d1cf3933d95b952fff5a4a3b8e6a72cfcf088 Mon Sep 17 00:00:00 2001
From: petermr
Date: Thu, 24 Mar 2022 09:47:42 +0000
Subject: [PATCH 6/9] Update europe_pmc.py

---
 pygetpapers/repository/europe_pmc.py | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/pygetpapers/repository/europe_pmc.py b/pygetpapers/repository/europe_pmc.py
index fdc4d5a..061df65 100644
--- a/pygetpapers/repository/europe_pmc.py
+++ b/pygetpapers/repository/europe_pmc.py
@@ -65,7 +65,23 @@ HTML = "html"
 
 class EuropePmc(RepositoryInterface):
-    """ """
+    """Downloads metadata and optionally fulltext from https://europepmc.org
+
+    Can optionally download supplemental author data, the content of which is irregular and
+    not well specified.
+    For articles with figures, the links to the figures on the EPMC site are included in the fulltext.xml
+    but the figures are NOT included. (We are adding this functionality to our `docanalysis` and `pyamiimage`
+    codes.)
+
+    In some cases a "zip" file is provided by EPMC which does contain the figures in the paper and supplemental author data;
+    this can be downloaded.
+
+    EPMC has a number of additional services including:
+    - references and citations denoted by 3-letter codes
+
+    pygetpapers can translate a standard date into EPMC format and include it in the query.
+    """
 
     def __init__(self):
         self.download_tools = DownloadTools(EUROPEPMC)
 
From d3482a8664e4ce1c265bd72117ca28f79ba88bdc Mon Sep 17 00:00:00 2001
From: petermr
Date: Thu, 24 Mar 2022 09:57:55 +0000
Subject: [PATCH 7/9] Update rxiv.py

---
 pygetpapers/repository/rxiv.py | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/pygetpapers/repository/rxiv.py b/pygetpapers/repository/rxiv.py
index 0bc7c90..b1b0e0e 100644
--- a/pygetpapers/repository/rxiv.py
+++ b/pygetpapers/repository/rxiv.py
@@ -28,7 +28,12 @@
 
 class Rxiv(RepositoryInterface):
-    """Rxiv class which handles Biorxiv and Medrxiv repository"""
+    """bioRxiv and medRxiv repositories
+
+    At present (2022-03) the API appears only to support date searches.
+    The `rxivist` system is layered on top and supports fuller queries.
+    """
 
     def __init__(self,api="biorxiv"):
         """initiate Rxiv class"""
 
From 6eea6547ad449df9b1cf145be502d18248d162ec Mon Sep 17 00:00:00 2001
From: petermr
Date: Thu, 24 Mar 2022 10:04:25 +0000
Subject: [PATCH 8/9] Update rxivist.py

---
 pygetpapers/repository/rxivist.py | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/pygetpapers/repository/rxivist.py b/pygetpapers/repository/rxivist.py
index 21616b7..0999fe0 100644
--- a/pygetpapers/repository/rxivist.py
+++ b/pygetpapers/repository/rxivist.py
@@ -32,7 +32,15 @@
 
 class Rxivist(RepositoryInterface):
-    """Rxivist class which handles the rxivist wrapper"""
+    """Rxivist wrapper for bioRxiv and medRxiv
+
+    From the site (rxivist.org):
+    "Rxivist combines biology preprints from bioRxiv and medRxiv with data from Twitter
+    to help you find the papers being discussed in your field."
+
+    Appears to be metadata-only. To get full-text you may have to submit the IDs to bioRxiv or medRxiv,
+    or to EPMC as this aggregates preprints.
+    """
 
     def __init__(self):
         self.download_tools = DownloadTools(RXIVIST)
@@ -189,4 +197,4 @@ def noexecute(self, query_namespace):
         result_dict = self.rxivist(query_namespace.query, size=10)
         results = result_dict[NEW_RESULTS]
         totalhits = results[TOTAL_HITS]
-        logging.info("Total number of hits for the query are %s", totalhits)
\ No newline at end of file
+        logging.info("Total number of hits for the query are %s", totalhits)
 
From 6deb5ce1eb1461e7a7ef37d41eb43517e0eefebb Mon Sep 17 00:00:00 2001
From: petermr
Date: Thu, 24 Mar 2022 10:26:25 +0000
Subject: [PATCH 9/9] Update arxiv.py

---
 pygetpapers/repository/arxiv.py | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/pygetpapers/repository/arxiv.py b/pygetpapers/repository/arxiv.py
index 0010836..40206dc 100644
--- a/pygetpapers/repository/arxiv.py
+++ b/pygetpapers/repository/arxiv.py
@@ -50,7 +50,17 @@
 from pygetpapers.repositoryinterface import RepositoryInterface
 
 class Arxiv(RepositoryInterface):
-    """Arxiv class which handles arxiv repository"""
+    """arxiv.org repository
+
+    This uses the PyPI package `arxiv` to download metadata. It is not clear whether this is
+    created by the `arXiv` project or layered on top of the public API.
+
+    `arXiv`'s current practice for bulk data download (e.g. PDFs) is described in
+    https://arxiv.org/help/bulk_data. Please be considerate and also include a rate limit.
+    """
 
     def __init__(self):
         self.download_tools = DownloadTools(ARXIV)
@@ -176,4 +186,4 @@ def apipaperdownload(self, query_namespace):
             makecsv=query_namespace["makecsv"],
             makexml=query_namespace["xml"],
             makehtml=query_namespace["makehtml"],
-        )
\ No newline at end of file
+        )
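As a postscript to patch 9's advice to "be considerate and also include a rate limit", a hedged sketch of what that looks like for bulk downloads (illustrative helper, not `pygetpapers` code):

```python
# Sketch: polite, rate-limited sequential downloading, as the arXiv
# docstring above recommends. Illustrative helper, not pygetpapers code.
import time
import requests

def polite_download(urls, delay_seconds=3.0, user_agent="pygetpapers-example"):
    headers = {"User-Agent": user_agent}   # identify the downloader to the server
    for url in urls:
        response = requests.get(url, headers=headers, timeout=60)
        response.raise_for_status()
        yield url, response.content
        time.sleep(delay_seconds)          # pause between requests (rate limit)
```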