Upgrade of pathway figure collection, processing and deposition pipeline #20

AlexanderPico · 2020-12-08T22:29:04Z

We plan to improve the processing of pathway figure content and annotations incrementally. First, we will modularize the processing pipeline and define input/output interfaces for each step. For example, one module will take any image file, along with an optional PMCID, as input and perform the OCR and processing required to generate a standard output of OCR-extracted text and metadata. An independent module will take this standardized content as input and perform normalization, transformation, matching and other processing steps to generate a standard output of identified genes, chemicals and diseases, along with metadata. As part of the modularization and refactoring, we will also increase the automation of the pipeline, focusing initially on command-line interface implementations that can later be called programmatically and scheduled.
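To make the two-module idea above concrete, here is a rough Python sketch of what the input/output interfaces might look like. All names in it (`OcrResult`, `EntityResult`, `run_ocr`, `identify_entities`, the `engine` callable and the lexicon format) are hypothetical and chosen only for illustration; they are not the actual PFOCR code, and the matching logic is a deliberately naive stand-in for the real normalization and matching steps.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class OcrResult:
    """Standard output of the (hypothetical) OCR module."""
    image_path: str
    pmcid: Optional[str]
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class EntityResult:
    """Standard output of the (hypothetical) entity-identification module."""
    image_path: str
    genes: List[str]
    chemicals: List[str]
    diseases: List[str]
    metadata: Dict[str, str] = field(default_factory=dict)


def run_ocr(image_path: str,
            pmcid: Optional[str] = None,
            engine: Callable[[str], str] = lambda path: "") -> OcrResult:
    """Module 1: image file plus optional PMCID in, OCR text and metadata out.

    `engine` is a placeholder for whatever OCR backend is used; it takes an
    image path and returns the extracted text.
    """
    text = engine(image_path)
    return OcrResult(image_path, pmcid, text, {"chars": str(len(text))})


def identify_entities(ocr: OcrResult, lexicon: Dict[str, str]) -> EntityResult:
    """Module 2: standardized OCR output in, identified entities out.

    `lexicon` maps an uppercase term to "gene", "chemical" or "disease".
    Matching here is a simple exact lookup on whitespace-separated tokens.
    """
    buckets: Dict[str, List[str]] = {"gene": [], "chemical": [], "disease": []}
    for token in ocr.text.split():
        term = token.strip(".,;:()").upper()  # minimal normalization
        kind = lexicon.get(term)
        if kind in buckets and term not in buckets[kind]:
            buckets[kind].append(term)
    return EntityResult(
        image_path=ocr.image_path,
        genes=buckets["gene"],
        chemicals=buckets["chemical"],
        diseases=buckets["disease"],
        metadata={"pmcid": ocr.pmcid or ""},
    )
```

Because the second module depends only on `OcrResult`, either side could be swapped out or wrapped in a command-line entry point without touching the other, which is the point of defining the interfaces first.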

AlexanderPico · 2021-12-03T20:32:17Z

@ariutta Can you bullet point some of the upgrades performed over the past year, and add bullet points for things to upgrade in the next round?

ariutta · 2021-12-07T21:21:54Z

Completed:

- Implemented and documented a reproducible installation process: https://github.com/wikipathways/pathway-figure-ocr#install
- Increased standardization and automation of the PFOCR pipeline, including:
  - getting a list of likely pathway figures
  - downloading those figures
  - using the machine learning model we trained earlier to distinguish pathway vs. non-pathway figures
  - running OCR on the figures classified as pathway
  - extracting genes from the text output from OCR
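As a rough illustration of how the automated steps listed above chain together, the sketch below wires five placeholder callables into one loop. The function name `run_pipeline` and all of its parameters are hypothetical; each callable stands in for a real pipeline step (figure listing, download, classifier, OCR, gene extraction) and none of this is the actual PFOCR implementation.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    candidates: Iterable[str],
    download: Callable[[str], bytes],
    is_pathway: Callable[[bytes], bool],
    ocr: Callable[[bytes], str],
    extract_genes: Callable[[str], List[str]],
) -> List[Dict[str, object]]:
    """Run each candidate figure through the steps in order.

    candidates    -- identifiers of likely pathway figures
    download      -- fetches the image bytes for one identifier
    is_pathway    -- classifier decision: pathway vs. non-pathway figure
    ocr           -- extracts text from an image
    extract_genes -- pulls gene symbols out of OCR text
    """
    results: List[Dict[str, object]] = []
    for figure_id in candidates:
        image = download(figure_id)
        if not is_pathway(image):
            continue  # OCR is only run on figures classified as pathway
        text = ocr(image)
        results.append({"figure": figure_id, "genes": extract_genes(text)})
    return results
```

Passing the steps in as callables keeps each one independently testable and replaceable, at the cost of a little indirection.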

Upcoming upgrades:

- Integrate the code for extracting non-gene content (small molecules, amino acids and diseases) into the pipeline
- Automate the updating of our gene name lexicon
- Explore how the PMC Open Access Subset can make this process easier and better. We could additionally explore other sources of potential pathway figures, such as preprints.
- Automate submission of results to third-party data deposition targets

AlexanderPico added the enhancement and Group 4 labels Dec 8, 2020

AlexanderPico added this to the Segment 2 milestone Dec 8, 2020

AlexanderPico assigned ariutta and AlexanderPico Dec 8, 2020
