PFOCR for prioritization / clustering #451
Comments
Here's a simple co-occurrence analysis in a notebook:

I have another analysis in progress that uses clustering, but it's not ready to publish yet. I'm trying locality-sensitive hashing (such as MinHash or SimHash) as well as an all-pairs-binary algorithm.

Another avenue of exploration: using PFOCR as a toggle that lets the user specify whether they want to see known findings in the field or novel connections. When the user wants known findings, we would prioritize results where the gene sets are found in multiple figures, especially when those gene sets are core to multiple figures. For novel connections, perhaps the user could specify which edge(s) should be novel. Then we could prioritize results where the two genes on that edge are rarely or never found in other figures. |
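To make the known-vs-novel toggle concrete, here is a minimal sketch. It assumes figures are available as a mapping from figure ID to a set of gene IDs and that each result can be reduced to the two genes on the edge of interest. It also interprets "found in other figures" as the two genes appearing in the same figure. The names `pair_figure_counts` and `rank_results` are hypothetical, not part of BTE or PFOCR.

```python
from collections import Counter
from itertools import combinations


def pair_figure_counts(figures):
    """Count, for each unordered gene pair, how many figures contain both genes."""
    counts = Counter()
    for genes in figures.values():
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


def rank_results(results, figures, mode="known"):
    """Order (gene, gene) results by figure co-occurrence.

    mode="known": pairs seen together in many figures come first.
    mode="novel": pairs rarely or never seen together come first.
    """
    counts = pair_figure_counts(figures)

    def score(result):
        # Sort the pair so it matches the key order used in pair_figure_counts.
        return counts[tuple(sorted(result))]

    return sorted(results, key=score, reverse=(mode == "known"))


# Toy data, for illustration only.
figures = {"f1": {"A", "B", "C"}, "f2": {"A", "B"}, "f3": {"C", "D"}}
results = [("A", "B"), ("A", "D"), ("C", "D")]
known_first = rank_results(results, figures, mode="known")
novel_first = rank_results(results, figures, mode="novel")
```

A raw co-occurrence count ignores whether a gene set is "core" to a figure, so a real scoring function would likely need some weighting on top of this.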
Updated notebooks (still rough draft / exploratory):

- Visual Explainer: show how BTE + PFOCR can serve as a Visual Explainer by showing how one disease, gene, or small molecule is related to another in the context of PFOCR figures. Low priority -- just a demo.
- Dampen combinatorial explosions: there is a list of questions in the “View TRAPI results and figures” section. This one is higher priority.

@andrewsu, do you have specific goals for deliverables beyond what's in issue #420 (incorporate PFOCR results into the BTE results)? Will it be up to the UI team (Translator, ARAX, any other) to move things forward from there? I'm happy to continue working on notebooks to explore ideas on more advanced clustering, but I just want to note that once we get beyond issue #420 and the ideas in the two notebooks linked in this comment, it starts getting much more exploratory. |
@andrewsu, does the TODO NEXT list at the bottom of this notebook appear reasonable? Copying the list over here too for convenience. TODO NEXT:
|
At the end of the same notebook, I also listed two questions, which I'm copying over here for convenience. QUESTIONS:
To expand on question 1, multiple PFOCR figures could get identical p-values from Fisher's exact test against all unique CURIEs in all TRAPI results. In that case, we could merge the figures into one large "meta figure" if they meet some threshold for CURIE overlap. But if the overlap threshold is not met, then we could choose one figure randomly and leave the other figures to be potentially selected in future rounds, after excluding the CURIEs in the selected figure.

To expand on question 2, a user-specified ID in a TRAPI query is probably more important than other IDs for returning relevant results to the user. If we had the following simplified scenario:
If we just run Fisher’s exact test, we’d actually get a tied p-value for figure1 and figure2. But in this case, it would make sense to choose figure1 before figure2 as a cluster figure. What about a more ambiguous case, such as when the user-specified ID is in a PFOCR figure but the p-value is worse than for a different figure that doesn't have the user-specified ID, e.g., a figure3 with |
That TODO NEXT list sounds pretty great. I don't think you need to iterate until you've excluded all of the TRAPI result CURIEs (in your step 1). Maybe iterate until 50% of CURIEs are excluded, with a maximum of 10 iterations (or something like that)?

Q1: You're asking what to do if two figures have the exact same p-value but a different set of overlapping entities? That should be pretty rare, I think. Even if the overlap between the TRAPI list and each figure is the same size, it's unlikely that the figures have the exact same number of entities in them. Regardless, in that scenario, I'd be fine randomly choosing one of them. (If it's the same set of overlapping entities, then it would be great if both figures could be associated with that cluster.)

Q2: For now, I think you can treat the user-specified IDs the same as the BTE-retrieved IDs -- no special weighting or consideration. (But on each iteration when you remove a set of TRAPI results, the user-specified CURIEs will still be present in the next round because you'll have them in all the other results. So basically, those IDs will be incorporated in every iteration.) |
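The iteration with the suggested stopping rule (50% of CURIEs excluded, at most 10 rounds) could be sketched roughly as follows. This is a simplified illustration, not the notebook's code: it works on a flat set of CURIEs rather than on whole TRAPI results, it computes a one-sided Fisher's exact test as a hypergeometric upper tail, and it breaks ties by figure ID order instead of randomly. The names `enrichment_p`, `cluster_by_figure`, and `universe_size` are hypothetical.

```python
from math import comb


def enrichment_p(overlap, figure_size, query_size, universe_size):
    """One-sided Fisher's exact test: chance of at least `overlap` shared CURIEs."""
    upper = min(figure_size, query_size)
    favorable = sum(
        comb(figure_size, k) * comb(universe_size - figure_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / comb(universe_size, query_size)


def cluster_by_figure(result_curies, figures, universe_size,
                      max_iter=10, stop_fraction=0.5):
    """Greedily pick the most enriched figure, exclude its CURIEs, and repeat."""
    remaining = set(result_curies)
    # Stop once this many CURIEs (or fewer) are still unassigned.
    target = len(remaining) * (1 - stop_fraction)
    clusters = []
    for _ in range(max_iter):
        if len(remaining) <= target:
            break
        best = None
        # Sorted iteration makes the pick among tied figures deterministic.
        for fig_id, fig_curies in sorted(figures.items()):
            overlap = remaining & fig_curies
            if not overlap:
                continue
            p = enrichment_p(len(overlap), len(fig_curies),
                             len(remaining), universe_size)
            if best is None or p < best[0]:
                best = (p, fig_id, overlap)
        if best is None:
            break  # no figure shares any CURIE with what is left
        clusters.append((best[1], best[2], best[0]))
        remaining -= best[2]
    return clusters


# Toy data, for illustration only.
figures = {"f1": {"A", "B", "X"}, "f2": {"C", "Y", "Z", "W"}}
clusters = cluster_by_figure({"A", "B", "C", "D"}, figures, universe_size=20)
```

Because the sketch excludes CURIEs rather than whole results, it does not reproduce the behavior described in Q2, where user-specified CURIEs reappear in every round.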
I thought this would be rare too, but in my test query, multiple ties showed up. I don't have anything rigorous and quantitative to indicate how common ties will be, but I didn't expect to have multiple exact ties in my first query. |
I have a work-in-progress version I pushed today. We'll need to double check the specifics of how the clustering is being done, but this is the first version where you can see some clustering by PFOCR figure. To see it right away, scroll down to the "View TRAPI Results Clustered by PFOCR Figure" section. There is definitely some clustering happening:
I'm not sure whether it's a problem, but you'll notice none of the cluster figures include RAB13 (NCBIGene:5872), which the user specified in

In this iteration, I didn't do anything for handling tied p-values. If 3 figures tied for the best p-value when choosing the cluster figures, I just included all three and moved on to find the next cluster figures. |
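The "include every tied figure" behavior described here might look something like the sketch below. It assumes the p-values for one round have already been computed and are held in a dict keyed by figure ID. The function name `best_figures` and the tolerance parameter are hypothetical.

```python
from math import isclose


def best_figures(p_values, rel_tol=1e-9):
    """Return every figure ID whose p-value ties for the lowest in this round."""
    if not p_values:
        return []
    best = min(p_values.values())
    # isclose guards against floating-point noise in otherwise equal p-values.
    return sorted(
        fig_id for fig_id, p in p_values.items()
        if isclose(p, best, rel_tol=rel_tol)
    )


# Toy p-values, for illustration only.
tied = best_figures({"f1": 0.01, "f2": 0.01, "f3": 0.2})
```

Exact ties are plausible because the test only sees counts: two figures of the same size with the same overlap size get the same p-value even when the overlapping CURIEs differ.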
A few thoughts on potential next steps:
|
I updated the documentation to help explain what's going on (edit: for the latest, read the documentation in any of the updated notebooks in the two comments after this one). Here's the updated notebook with the previous query, plus the queries you mentioned:
For |
The analysis and documentation are updated in the following notebooks. I specified two disease IDs (leukemia and asthma) in the same query just for demo purposes. A user would be unlikely to run this actual query, but I figured this would be a way to see whether the clustering allows a user to see results for leukemia and asthma.
In both examples, there are many more results for leukemia than for asthma. Should we expect to see results clustered for leukemia vs. asthma based on the

When user-specified IDs were not required, notice there was a 15-way tie for lowest p-value. In this case, it could be helpful to merge figures with similar CURIE sets, e.g., when the CURIEs in one figure are a subset of those in another. |
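One way to merge tied figures when one figure's CURIEs are a subset of another's is sketched below. It assumes the tied figures are given as a mapping from figure ID to a set of CURIEs. The name `merge_subset_figures` and the group layout are hypothetical, and a looser overlap threshold would need a different comparison.

```python
def merge_subset_figures(figures):
    """Group each figure under a larger figure whose CURIE set contains it."""
    # Visit larger figures first so supersets become group representatives.
    ordered = sorted(figures.items(), key=lambda item: (-len(item[1]), item[0]))
    groups = []
    for fig_id, curies in ordered:
        for group in groups:
            if curies <= group["curies"]:  # subset (or equal) test
                group["members"].append(fig_id)
                break
        else:
            # No existing group contains this figure, so it starts its own.
            groups.append({
                "representative": fig_id,
                "curies": set(curies),
                "members": [fig_id],
            })
    return groups


# Toy data, for illustration only.
tied_figures = {"f1": {"A", "B", "C"}, "f2": {"A", "B"}, "f3": {"D"}}
groups = merge_subset_figures(tied_figures)
```

With a 15-way tie, this kind of grouping would only shrink the list when the tied figures really do nest. Figures that overlap partially stay separate.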
I did another set of examples, similar to the previous set:
When user-specified IDs were not required, there was a 35-way tie for lowest p-value. This set of examples raises two notes:
|
Going to close this exploratory ticket in favor of the concrete game plan laid out in #538 |
We've discussed various ideas for using PFOCR to do prioritization and/or clustering of TRAPI results. Based on our 9am meeting today, we decided to create a new issue for this and to use the following strategy:
This issue replaces old issue biothings/biothings_explorer #151. It is related to but does not replace these other two issues: wikipathways/pathway-figure-ocr #24 and biothings/BioThings_Explorer_TRAPI #420.