We plan to make incremental improvements to the processing of pathway figure content and annotations. First, we will modularize the processing pipeline and define input/output interfaces for each step. For example, one module will take any image file along with an optional PMCID as input and perform the OCR and processing required to generate a standard output of OCR-extracted text and metadata. An independent module will take this standardized content as input to perform normalization, transformations, matching and other processing steps in order to generate a standard output of identified genes, chemicals and diseases, along with metadata. We will also increase the automation of the pipeline as part of the modularization and refactoring, focusing initially on command line interface implementations that can later be programmatically called and scheduled.
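As a rough illustration of the two-module split described above, the sketch below defines one possible pair of standard outputs and a thin command line entry point. Every name here (`OcrOutput`, `Entity`, `run_ocr`, `identify_entities`, `main`, the lexicon layout) is an assumption made for illustration, not the project's actual interface, and the OCR engine is passed in as a plain function so that no particular OCR backend is implied.

```python
import argparse
import json
from dataclasses import dataclass, asdict
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class OcrOutput:
    """Hypothetical standard output of the OCR module."""
    image_path: str
    pmcid: Optional[str]
    text: str


@dataclass
class Entity:
    """Hypothetical standard record for one identified entity."""
    symbol: str
    entity_type: str  # "gene", "chemical" or "disease"
    matched_text: str


def run_ocr(image_path: str, pmcid: Optional[str],
            engine: Callable[[str], str]) -> OcrOutput:
    # The OCR engine is injected, so this step is not tied to any one backend.
    return OcrOutput(image_path=image_path, pmcid=pmcid, text=engine(image_path))


def identify_entities(ocr: OcrOutput,
                      lexicon: Dict[str, Tuple[str, str]]) -> List[Entity]:
    # Normalize each whitespace-separated token, then look it up in the lexicon.
    # The lexicon maps an upper-cased name to (symbol, entity_type); this layout
    # is an assumption for the sketch.
    entities = []
    for token in ocr.text.split():
        key = token.strip(".,;:()[]").upper()
        if key in lexicon:
            symbol, entity_type = lexicon[key]
            entities.append(Entity(symbol, entity_type, token))
    return entities


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Sketch of a modular OCR step")
    parser.add_argument("image", help="path to any image file")
    parser.add_argument("--pmcid", default=None, help="optional PMCID")
    args = parser.parse_args(argv)
    # Placeholder engine returning empty text; a real module would call an
    # actual OCR backend here.
    ocr = run_ocr(args.image, args.pmcid, engine=lambda path: "")
    print(json.dumps(asdict(ocr)))
    return 0
```

Because each module only consumes and produces these plain records, the second step could be run on stored OCR output without repeating the OCR, and `main` could be called from a scheduler with an explicit argument list, for example `main(["figure.png", "--pmcid", "PMC0000000"])`.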
Increased standardization and automation of the PFOCR pipeline, including:
getting a list of likely pathway figures
downloading those figures
using the machine learning model we trained earlier to distinguish pathway vs. non-pathway figures
running OCR on the figures classified as pathway
extracting genes from the text output from OCR
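The five steps listed above can be sketched as a single driver that chains them. This is only a minimal outline under stated assumptions: `run_pipeline` and its parameter names are hypothetical, and each step is supplied as a function so that the classifier, OCR service and gene extractor remain interchangeable.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    list_candidates: Callable[[], Iterable[str]],
    download: Callable[[str], str],
    is_pathway: Callable[[str], bool],
    ocr: Callable[[str], str],
    extract_genes: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Chain the five steps; returns genes keyed by figure identifier."""
    results: Dict[str, List[str]] = {}
    for figure_id in list_candidates():      # 1. likely pathway figures
        path = download(figure_id)           # 2. fetch the figure
        if not is_pathway(path):             # 3. classifier filters non-pathways
            continue
        text = ocr(path)                     # 4. OCR only on pathway figures
        results[figure_id] = extract_genes(text)  # 5. genes from OCR text
    return results
```

Figures rejected by the classifier never reach the OCR step, which matches the ordering in the list and avoids spending OCR calls on non-pathway images.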
Upcoming upgrades:
Integrate the code for extracting non-gene content (small molecules, amino acids and diseases) into the pipeline
Automate the updating of our gene name lexicon
Explore how the PMC Open Access Subset could simplify and improve this process. We could also explore other sources of potential pathway figures, such as preprints.
Automate submissions of results to third-party data deposition targets