PFOCR for prioritization / clustering #451

Closed
ariutta opened this issue May 25, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@ariutta
Collaborator

ariutta commented May 25, 2022

We've discussed various ideas for using PFOCR to do prioritization and/or clustering of trapi results. Based on our 9am meeting today, we decided to create a new issue for this and to use the following strategy:

  1. initially explore the idea using Jupyter notebooks to post-process trapi results for sample queries
  2. if the results from 1. are good, move the processing from the Jupyter notebooks into the JS code (probably in query_graph_handler). This code could run in conjunction with the actual calling of the APIs to get records, and potentially also in part afterwards to supplement the trapi results
  3. if the results from 2. are good, we can explore how to extend the ideas to the UI

This issue replaces old issue biothings/biothings_explorer #151. It is related to but does not replace these other two issues: wikipathways/pathway-figure-ocr #24 and biothings/BioThings_Explorer_TRAPI #420.

@ariutta
Collaborator Author

ariutta commented May 28, 2022

Here's a simple co-occurrence analysis in a notebook:
https://github.com/wikipathways/pathway-figure-ocr/blob/master/notebooks/bte_with_pfocr_cooccurrence.ipynb

I've got another analysis going for using clustering, but it's not ready to publish yet. I'm trying locality-sensitive hashing (like minhash or simhash) as well as an all-pair-binary algorithm.

Another avenue of exploration: using PFOCR as a toggle for the user to specify whether they want to see known findings in the field vs novel connections. When the user wants known findings, we would prioritize results where the gene sets are found in multiple figures, especially when those gene sets are core to multiple figures. For novel connections, perhaps the user could specify which edge(s) should be novel. Then we could prioritize results where the two genes on that edge are rarely or never found in other figures.
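A minimal sketch of the known-vs-novel toggle described above, assuming PFOCR figures are available as a mapping from figure ID to a set of gene CURIEs. The function names, the `mode` parameter and the toy data are hypothetical and not taken from the notebooks.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(figures):
    """Count, for every unordered CURIE pair, how many figures contain both."""
    counts = Counter()
    for curies in figures.values():
        for pair in combinations(sorted(curies), 2):
            counts[pair] += 1
    return counts


def rank_edges(edges, counts, mode="known"):
    """Order edges by figure co-occurrence.

    mode="known": pairs seen together in many figures come first.
    mode="novel": pairs rarely or never seen together come first.
    """
    def n_figures(edge):
        return counts[tuple(sorted(edge))]

    return sorted(edges, key=n_figures, reverse=(mode == "known"))


# Hypothetical toy data
figures = {
    "fig_a": {"NCBIGene:1", "NCBIGene:2", "NCBIGene:3"},
    "fig_b": {"NCBIGene:1", "NCBIGene:2"},
    "fig_c": {"NCBIGene:3", "NCBIGene:4"},
}
edges = [
    ("NCBIGene:1", "NCBIGene:2"),
    ("NCBIGene:3", "NCBIGene:4"),
    ("NCBIGene:1", "NCBIGene:4"),
]

counts = cooccurrence_counts(figures)
known_first = rank_edges(edges, counts, mode="known")
novel_first = rank_edges(edges, counts, mode="novel")
```

Counting every pair is quadratic in figure size, so for large figures it may be preferable to count only the pairs that appear on the edge(s) the user marked as novel.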

@ariutta
Collaborator Author

ariutta commented Jun 15, 2022

Updated notebooks (still rough draft / exploratory)

Visual Explainer: show how BTE + PFOCR can serve as a Visual Explainer by showing how one disease, gene or small molecule is related to another in context of PFOCR figures. Low priority -- just a demo.

Dampen combinatorial explosions: there is a list of questions in the “View TRAPI results and figures” section. This one is higher priority.

@andrewsu, do you have specific goals for deliverables beyond what's in issue #420 (incorporate PFOCR results into the BTE results)? Will it be up to the UI team (Translator, ARAX, any other) to move things forward from there? I'm happy to continue working on notebooks to explore ideas on more advanced clustering, but I just want to note that once we get beyond issue #420 and the ideas in the two notebooks linked in this comment, it starts getting much more exploratory.

@ariutta
Collaborator Author

ariutta commented Jun 21, 2022

@andrewsu, does the TODO NEXT list at the bottom of this notebook appear reasonable?
https://github.com/wikipathways/pathway-figure-ocr/blob/master/notebooks/bte_clustering.ipynb

Copying the list over here too for convenience.

TODO NEXT:

  1. Excluding the CURIEs from the best figure, repeat the process to get the second-best figure. Continue until we've excluded all the TRAPI results CURIEs. Now we have a set of figures to use to cluster the TRAPI results. We can call these our "cluster figures".
  2. Group each TRAPI result with the cluster figure most relevant to it, based on Fisher's Exact Test, similar to how we did the grouping in the bte_sleeve.ipynb notebook that is a sibling of this notebook. The main difference will be that instead of grouping TRAPI results by all PFOCR figures like in that notebook, we'll instead just group by the selected subset of PFOCR figures we're calling the cluster figures.
  3. Display each cluster figure along with the top 10 TRAPI results most relevant to it, similar to the "By figure" section of the bte_sleeve.ipynb notebook.
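The first two steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the p-value is a one-sided (enrichment) Fisher's exact test computed as a hypergeometric tail, `n_universe` (the size of the background CURIE set) is a hypothetical parameter, and the toy figures and results are made up.

```python
from math import comb


def fisher_p(overlap, n_figure, n_results, n_universe):
    """One-sided Fisher's exact test: P(X >= overlap) under the hypergeometric."""
    upper = min(n_figure, n_results)
    tail = sum(
        comb(n_figure, k) * comb(n_universe - n_figure, n_results - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(n_universe, n_results)


def select_cluster_figures(figures, results, n_universe):
    """Step 1: repeatedly pick the best figure, then exclude its CURIEs."""
    remaining = set().union(*results.values())
    candidates = dict(figures)
    chosen = []
    while remaining and candidates:
        pvals = {
            name: fisher_p(len(c & remaining), len(c), len(remaining), n_universe)
            for name, c in candidates.items()
        }
        best = min(pvals, key=pvals.get)
        if not candidates[best] & remaining:
            break  # no candidate overlaps the leftover CURIEs
        chosen.append(best)
        remaining -= candidates.pop(best)
    return chosen


def assign_results(results, cluster_figures, figures, n_universe):
    """Step 2: group each result with the cluster figure of lowest p-value."""
    groups = {name: [] for name in cluster_figures}
    for result_id, curies in results.items():
        best = min(
            cluster_figures,
            key=lambda name: fisher_p(
                len(figures[name] & curies), len(figures[name]), len(curies), n_universe
            ),
        )
        groups[best].append(result_id)
    return groups


# Hypothetical toy data
figures = {
    "fig_tgfb": {"G1", "G2", "G3"},
    "fig_mtor": {"G4", "G5"},
    "fig_other": {"G8", "G9"},
}
results = {"r1": {"G1", "G2"}, "r2": {"G2", "G3"}, "r3": {"G4", "G5"}}

cluster_figures = select_cluster_figures(figures, results, n_universe=100)
groups = assign_results(results, cluster_figures, figures, n_universe=100)
```

For step 3, each list in `groups` could then be sorted by the same p-value and truncated to the top 10 before display.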

@ariutta
Collaborator Author

ariutta commented Jun 21, 2022

At the end of the same notebook, I also listed two questions, which I'm copying over here for convenience.

QUESTIONS:

  1. What should we do in cases of ties for p-values for identifying cluster figures? We could treat them like one large figure and display them all.
  2. Should we do anything special for cluster figures and TRAPI results that have any of the ids the user specified?

To expand on question 1, we could get multiple PFOCR figures that get identical p-values for Fisher's exact test against all unique CURIEs in all TRAPI results. In that case, we could merge the figures into one large "meta figure" if they meet some threshold for CURIE overlap. But if the overlap threshold is not met, then we could choose one figure randomly and leave the other figures to be potentially selected in future rounds, after excluding the CURIEs in the selected figure.

To expand on question 2, a user-specified id in a TRAPI query is probably more important than other ids for returning relevant results to the user. If we had the following simplified scenario:

  • TRAPI results CURIEs {"NCBIGene:1234", "NCBIGene:1235", "NCBIGene:1236"} (with user specified id of "NCBIGene:1234")
  • PFOCR figure1 CURIEs {"NCBIGene:1234", "NCBIGene:0000"}
  • PFOCR figure2 CURIEs {"NCBIGene:1236", "NCBIGene:0000"}

If we just run Fisher’s exact test, we’d actually get a tied p-value for figure1 and figure2. But in this case, it would make sense to choose figure1 before figure2 as a cluster figure.

What about a more ambiguous case, such as when the user-specified id is in a PFOCR figure but the p-value is worse than for a different figure that doesn't have the user-specified id, e.g., a figure3 with {"NCBIGene:1235", "NCBIGene:1236"}?
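To make the scenario concrete, here is a small sketch that computes the p-values for figure1, figure2 and figure3 and compares two possible orderings. It assumes a one-sided Fisher's exact test (hypergeometric tail) and a hypothetical background of 1000 CURIEs; neither choice comes from the notebook.

```python
from math import comb


def fisher_p(overlap, n_figure, n_results, n_universe):
    """One-sided Fisher's exact test: P(X >= overlap) under the hypergeometric."""
    upper = min(n_figure, n_results)
    tail = sum(
        comb(n_figure, k) * comb(n_universe - n_figure, n_results - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(n_universe, n_results)


N_UNIVERSE = 1000  # hypothetical background size
result_curies = {"NCBIGene:1234", "NCBIGene:1235", "NCBIGene:1236"}
user_ids = {"NCBIGene:1234"}
figures = {
    "figure1": {"NCBIGene:1234", "NCBIGene:0000"},
    "figure2": {"NCBIGene:1236", "NCBIGene:0000"},
    "figure3": {"NCBIGene:1235", "NCBIGene:1236"},
}

pvals = {
    name: fisher_p(len(c & result_curies), len(c), len(result_curies), N_UNIVERSE)
    for name, c in figures.items()
}


def lacks_user_id(name):
    """True when the figure contains none of the user-specified IDs."""
    return not (figures[name] & user_ids)


# Policy A: p-value decides; a user-specified ID only breaks ties.
by_p_then_id = sorted(figures, key=lambda n: (pvals[n], lacks_user_id(n)))

# Policy B: figures with a user-specified ID always come first.
by_id_then_p = sorted(figures, key=lambda n: (lacks_user_id(n), pvals[n]))
```

Under policy A, figure3 is chosen first and figure1 beats figure2 only on the tie. Under policy B, figure1 is chosen first even though its p-value is worse than figure3's, which is the ambiguous case raised above.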

@andrewsu
Member

That TODO NEXT list sounds pretty great. I don't think you need to iterate until you've excluded all of the TRAPI result CURIEs (in your step 1) -- maybe iterate until 50% of CURIEs are excluded with a maximum of 10 iterations? (or something like that?)
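A rough sketch of that stopping rule, assuming figures are a mapping from figure ID to a set of CURIEs. To keep the example short it ranks figures by raw overlap size as a placeholder for the Fisher p-value; `max_fraction` and `max_iterations` are hypothetical parameter names for the 50% and 10-iteration limits.

```python
def select_with_budget(figures, result_curies, max_fraction=0.5, max_iterations=10):
    """Greedily pick figures until enough CURIEs are excluded or the budget runs out.

    NOTE: ranking by overlap size is a stand-in for the Fisher p-value.
    """
    total = len(result_curies)
    remaining = set(result_curies)
    candidates = dict(figures)
    chosen = []
    for _ in range(max_iterations):
        if not candidates or total == 0:
            break
        excluded_fraction = (total - len(remaining)) / total
        if excluded_fraction >= max_fraction:
            break
        best = max(candidates, key=lambda name: len(candidates[name] & remaining))
        if not candidates[best] & remaining:
            break  # nothing left to gain from any candidate
        chosen.append(best)
        remaining -= candidates.pop(best)
    return chosen, remaining


# Hypothetical toy data: 10 result CURIEs, 4 candidate figures
result_curies = {f"G{i}" for i in range(1, 11)}
figures = {
    "fig_a": {"G1", "G2", "G3"},
    "fig_b": {"G4", "G5", "G6"},
    "fig_c": {"G7"},
    "fig_d": {"G8"},
}

chosen, remaining = select_with_budget(figures, result_curies)
chosen_one_round, _ = select_with_budget(figures, result_curies, max_iterations=1)
```

With the toy data, two figures already exclude 60% of the CURIEs, so the loop stops before considering `fig_c` or `fig_d`.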

Q1: you're asking if two figures have the exact same p-value but a different set of overlapping entities? That should be pretty rare I think, right? Even if the overlap between the TRAPI list and a figure is the same size, it's unlikely that the figure has the exact same number of entities in it? Regardless, in that scenario, I'd be fine randomly choosing one of them. (If it's the same set of overlapping entities, then would be great if both figures could be associated with that cluster.)

Q2: For now, I think you can treat the user-specified IDs the same as the BTE-retrieved IDs -- no special weighting or consideration. (But on each iteration when you remove a set of TRAPI results, the user-specified CURIEs will still be present in the next round because you'll have them in all the other results. So basically, those IDs will be incorporated in every iteration.)

@ariutta
Collaborator Author

ariutta commented Jun 21, 2022

> Q1: you're asking if two figures have the exact same p-value but a different set of overlapping entities? That should be pretty rare I think, right? Even if the overlap between the TRAPI list and a figure is the same size, it's unlikely that the figure has the exact same number of entities in it?

I thought this would be rare too, but in my test query, multiple ties showed up. I don't have anything rigorous and quantitative to indicate how common ties will be, but I didn't expect to have multiple exact ties in my first query.

@ariutta
Collaborator Author

ariutta commented Jun 24, 2022

I have a work-in-progress version I pushed today. We'll need to double check the specifics of how the clustering is being done, but this is the first version where you can see some clustering by PFOCR figure. To see it right away, scroll down to the "View TRAPI Results Clustered by PFOCR Figure" section.
https://github.com/wikipathways/pathway-figure-ocr/blob/master/notebooks/bte_clustering.ipynb

There is definitely some clustering happening:

  1. The TRAPI results clustered with the first figure ("Targeting the Transforming Growth FactorB Signaling Pathway in Human Cancer") do seem to be potentially related to cancer.
  2. For the second figure ("Regulation of autophagy by mTOR-dependent and -independent pathways"), we get a cluster of INS and calcium channel blockers, but that's partially just because the figure included a list of calcium channel blockers. I'm not familiar enough with the details of the biology to say whether the other items like Resveratrol and Valproic acid make sense as part of this cluster.
  3. For "Diagram of genes in the PI3K/AKT signaling pathway", we just have a cluster that includes INS.
  4. For "Tryptophan metabolic pathways", we have a cluster where the third query nodes are Melatonin, Tryptamine, Serotonin, Tryptophan, Nadide, Acetic acid and Kynurenic acid. This seems like a plausible cluster, but again, I'd need someone with more familiarity with the details of the biology to weigh in.

I'm not sure whether it's a problem, but you'll notice none of the cluster figures include RAB13 (NCBIGene:5872), which the user specified in ids for n0. There are over 200 PFOCR figures that have RAB13, but none of them were closely related enough to the unique CURIEs from all TRAPI results for this query to be selected as cluster figures.

In this iteration, I didn't do anything for handling tied p-values. If 3 figures tied for the best p-value when choosing the cluster figures, I just included all three and moved on to find the next cluster figures.

@andrewsu
Member

A few thoughts on potential next steps:

  • Just so we can clearly communicate the procedure to others (and to ourselves), I think sketching out a schematic and/or a block of pseudocode would be very useful.
  • It would be useful to run the procedure through some additional use cases where the "right answer" might be more apparent
  • Regarding the observation in @ariutta's example above that "none of the cluster figures include RAB13 (NCBIGene:5872), which the user specified in ids for n0", I think it would be worth having a parameter that indicates whether explicitly specified IDs are required to be present in the returned pathway figures.
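The parameter in the last bullet could look something like the sketch below. The names `require_user_ids` and `match` are hypothetical, and whether "required" should mean any or all of the user-specified IDs is an open choice that the thread does not settle.

```python
def filter_figures(figures, user_ids, require_user_ids=False, match=any):
    """Optionally keep only figures that contain the user-specified IDs.

    match=any keeps figures containing at least one user-specified ID;
    match=all keeps only figures containing every one of them.
    """
    if not require_user_ids:
        return dict(figures)
    return {
        name: curies
        for name, curies in figures.items()
        if match(uid in curies for uid in user_ids)
    }


# Hypothetical toy data; NCBIGene:5872 is RAB13 from the example query
figures = {
    "fig_x": {"NCBIGene:5872", "NCBIGene:1"},
    "fig_y": {"NCBIGene:2", "NCBIGene:3"},
}
user_ids = {"NCBIGene:5872"}

unfiltered = filter_figures(figures, user_ids)
required = filter_figures(figures, user_ids, require_user_ids=True)
```

Applying the filter before choosing cluster figures means the p-values are only compared among figures that mention the user's IDs, which would avoid the RAB13 situation at the cost of a smaller candidate pool.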

@ariutta
Collaborator Author

ariutta commented Jul 5, 2022

I updated the documentation to help explain what's going on (edit: for the latest, read the documentation in any of the updated notebooks in the two comments after this one). Here's the updated notebook with the previous query, plus the queries you mentioned:

For imatinib - [Gene] - leukemia vs. imatinib - [Gene] - asthma, there's a clear difference in results, and we do see leukemia-related figures vs. asthma-related figures. For the asthma-related figures, it appears it may be worthwhile to take into account the explicitly-specified IDs. None of the PFOCR figures appear to have Imatinib, which is a little surprising -- could be useful for me to double check that.

@ariutta
Collaborator Author

ariutta commented Jul 7, 2022

The analysis and documentation are updated in the following notebooks.

I specified two disease IDs (leukemia and asthma) in the same query just for demo purposes. A user would be unlikely to run this actual query, but I figured this would be a way to see whether the clustering allows a user to see results for leukemia and asthma.

In both examples, there are many more results for leukemia than for asthma. Should we expect to see results clustered for leukemia vs. asthma based on the n1 gene query node? In the first notebook above, we have a cluster for asthma based on the word "asthma" appearing in the figure, but when clustering by gene, it's mostly leukemia with a couple of asthma references scattered in there.

When user-specified IDs were not required, notice there was a 15-way tie for lowest p-value. In this case, it could be helpful to merge figures with similar CURIE sets, e.g., when the CURIEs in one figure are a subset of those in another.
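One possible way to do the subset-based merge, sketched under the assumption that the tied figures are given as a mapping from figure ID to a set of CURIEs. The function name and toy data are hypothetical, and a figure that is a subset of several larger figures is simply folded into the first one encountered.

```python
def merge_subset_figures(tied):
    """Fold each tied figure into a larger tied figure that contains all its CURIEs.

    Returns a mapping from representative figure to the figures merged into it
    (the representative itself is listed first).
    """
    # Visit larger figures first so supersets become representatives
    # before their subsets are considered; names break size ties.
    order = sorted(tied, key=lambda name: (-len(tied[name]), name))
    groups = {}
    for name in order:
        for rep in groups:
            if tied[name] <= tied[rep]:
                groups[rep].append(name)
                break
        else:
            groups[name] = [name]
    return groups


# Hypothetical tied figures
tied = {
    "fig1": {"A", "B", "C"},
    "fig2": {"A", "B"},
    "fig3": {"C"},
    "fig4": {"D", "E"},
}

merged = merge_subset_figures(tied)
```

A 15-way tie would collapse to however many figures are not contained in another, and a looser criterion (for example a Jaccard threshold) could be swapped in for the subset check if strict containment merges too little.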

@ariutta
Collaborator Author

ariutta commented Jul 7, 2022

I did another set of examples, similar to the previous set:

When user-specified IDs were not required, there was a 35-way tie for lowest p-value.

This set of examples raises two notes:

  1. The PFOCR gene extraction is robust, with multiple optimizations to handle issues at different points in the pipeline, such as minor OCR errors. The chemical and disease extraction, however, doesn't yet have those optimizations, so it's more brittle and more easily tripped up by imperfections at any point in the pipeline. PFOCR didn't extract "CANCER" from this figure (extraction results). If we port over a few of the gene extraction optimizations to the extraction for chemicals and diseases, we would extract more chemicals and diseases, including CANCER from that figure.
  2. For this figure, I checked the source data, and PFOCR did extract "tumor metastasis" (MESH:D009362), but it didn't get included in the export file for the API (extraction results in API), potentially because it's more of a disease process than a disease itself. If that term should be included in diseases in the API, we could create an updated export to the API to include that one and other cases like it.

@andrewsu
Member

Going to close this exploratory ticket in favor of the concrete game plan laid out in #538


3 participants