Upgrade of pathway figure collection, processing and deposition pipeline #20

Open
AlexanderPico opened this issue Dec 8, 2020 · 2 comments
Labels: enhancement (New feature or request), Group 4

Comments

@AlexanderPico
Member

We plan to make incremental improvements to the processing of pathway figure content and annotations. First, we will modularize the processing pipeline and define input/output interfaces for each step. For example, one module will take any image file along with an optional PMCID as input and perform the OCR and processing required to generate a standard output of OCR-extracted text and metadata. An independent module will take this standardized content as input to perform normalization, transformations, matching and other processing steps in order to generate a standard output of identified genes, chemicals and diseases, along with metadata. We will also increase the automation of the pipeline as part of the modularization and refactoring, focusing initially on command line interface implementations that can later be programmatically called and scheduled.
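
One way to picture the module interfaces described above is the minimal Python sketch below. Every name in it (`OcrResult`, `EntityMatch`, `run_ocr_module`, the `ocr_engine` callable and the JSON field names) is a hypothetical illustration, not the project's actual API. The OCR step itself is passed in as a function, so the sketch does not assume any particular OCR service.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Callable, Optional


@dataclass
class OcrResult:
    """Hypothetical standard output of the OCR module."""
    image_path: str
    text: str
    pmcid: Optional[str] = None  # optional, as in the description above
    metadata: dict = field(default_factory=dict)


@dataclass
class EntityMatch:
    """Hypothetical standard output record of the matching module."""
    entity_type: str  # "gene", "chemical" or "disease"
    source_text: str  # the OCR token that matched
    identifier: str   # identifier from whichever lexicon was used


def run_ocr_module(image_path: str,
                   ocr_engine: Callable[[str], str],
                   pmcid: Optional[str] = None) -> OcrResult:
    """Step 1: image (plus optional PMCID) in, standardized OCR record out."""
    text = ocr_engine(image_path)
    engine_name = getattr(ocr_engine, "__name__", "unknown")
    return OcrResult(image_path=image_path, text=text, pmcid=pmcid,
                     metadata={"engine": engine_name})


def to_json(record) -> str:
    """Serialize a record so independent command line steps can exchange it."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(payload: str) -> OcrResult:
    """Parse a standardized OCR record back into a dataclass."""
    return OcrResult(**json.loads(payload))
```

Exchanging plain JSON between steps is one possible design choice here: it would let each module run as a separate command that reads the previous module's output, which fits the goal of steps that can later be called and scheduled programmatically.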

@AlexanderPico added the enhancement and Group 4 labels Dec 8, 2020
@AlexanderPico added this to the Segment 2 milestone Dec 8, 2020
@AlexanderPico
Member Author

@ariutta Can you list, as bullet points, some of the upgrades made over the past year? Please also add bullet points for what to upgrade in the next round.

@ariutta
Member

ariutta commented Dec 7, 2021

Completed:

  • Implemented and documented a reproducible installation process: https://github.com/wikipathways/pathway-figure-ocr#install
  • Increased standardization and automation of the PFOCR pipeline, including:
    • getting a list of likely pathway figures
    • downloading those figures
    • using the machine learning model we trained earlier to distinguish pathway vs. non-pathway figures
    • running OCR on the figures classified as pathway
    • extracting genes from the OCR text output
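
The last step in the list above, extracting genes from OCR text, can be sketched as a lexicon lookup. This is a simplified illustration under stated assumptions, not the PFOCR implementation: the function names, the normalization rule (uppercase, then drop non-alphanumeric characters) and the tiny in-memory lexicon with placeholder identifiers are all hypothetical.

```python
import re
from typing import Dict, List, Tuple


def normalize(token: str) -> str:
    """Assumed normalization: uppercase and drop non-alphanumeric characters."""
    return re.sub(r"[^A-Z0-9]", "", token.upper())


def extract_genes(ocr_text: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (matched token, lexicon identifier) pairs in order of first appearance.

    `lexicon` maps normalized gene symbols to identifiers. Each gene is
    reported once, even if it appears several times in the text.
    """
    seen = set()
    hits = []
    for token in ocr_text.split():
        key = normalize(token)
        if key in lexicon and key not in seen:
            seen.add(key)
            # Report the token without surrounding punctuation.
            hits.append((token.strip(".,;:()[]"), lexicon[key]))
    return hits


# Toy lexicon for illustration only; the identifiers are placeholders.
toy_lexicon = {"TP53": "gene:1", "MDM2": "gene:2"}
print(extract_genes("p53 activates Mdm2, TP53 -> MDM-2", toy_lexicon))
```

Note that in this toy example `p53` is not matched, because the lexicon only holds the symbol `TP53`. Handling aliases like this is the kind of case a maintained gene name lexicon would need to cover.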

Upcoming upgrades:

  • Integrate the code for extracting non-gene content (small molecules, amino acids and diseases) into the pipeline
  • Automate the updating of our gene name lexicon
  • Explore how the PMC Open Access Subset can make this process easier and better. We could additionally explore using other sources of potential pathway figures, such as those in preprints.
  • Automate submissions of results to third-party data deposition targets
