Create simplified notebook that executes PFOCR clustering #538

andrewsu · 2022-12-15T22:08:56Z

This ticket is based on the exploratory work done in #451, but the goal here is to create a notebook that can be easily used and modified by other data analysts within Translator. This notebook should read in a TRAPI result set, perform a PFOCR enrichment analysis over the entities in each TRAPI result, and then report groups of related results with the PFOCR figures that join them.

Requirements:

Take TRAPI results as input
- Read in from URL (e.g., ) or from local file (or directly assigned). Example TRAPI results can be found here:
  - Imatinib - [Gene] - [Gene] - Asthma: https://arax.ncats.io/api/arax/v1.3/response/49d80ecb-7fd9-4ee6-a642-6d7994903f04 (41 MB, 2862 results); "results" in n1 and n2
  - Imatinib - [Gene] - Asthma: https://arax.ncats.io/api/arax/v1.3/response/7b14f961-9066-41f7-9e3b-d76b2b4a7fac (83kB, 7 results); "results" in n1
- Notebook should not worry about executing the query itself
User-specified parameters:
- which node ID from message.query_graph.nodes should be used for clustering. The entities mapped to that node ID are the "result entities"
- a parameter n corresponding to the number of clusters desired
- whether clusters must include user-specified node IDs for the subject, the object, both, or neither
Implementation details:
- The enrichment analysis will be iterative in design. On the first iteration, all result entities will be grouped in one list, and an enrichment analysis will be run relative to all PFOCR figures. The most significantly enriched PFOCR figure will be set aside as PFOCR Figure 1.
- The result entities that occurred in PFOCR Figure 1 will be removed from the result entity list, and a new enrichment analysis will be performed. The most significantly-enriched PFOCR figure will be set aside as PFOCR Figure 2.
- The process will be repeated a total of n times.
- The basic flow has been worked out in https://github.com/wikipathways/pathway-figure-ocr/blob/master/notebooks/bte_clustering.ipynb, but that notebook needs to be simplified for broader use.
Output:
- The n figures corresponding to the n most enriched PFOCR figure clusters
- A heatmap diagram
  - each column is a PFOCR figure cluster (n)
  - each row is a result entity
  - the cell is colored if the result entity is in the PFOCR figure cluster
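
For reference, the iterative enrichment-with-exclusion loop described under "Implementation details" could be sketched roughly as follows. This is only an illustration, not the code in the linked notebook: the function names, the `figure_genes` mapping (figure ID to set of gene CURIEs), the `background_size` argument, and the choice of a one-sided hypergeometric test are all assumptions made for the example.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from a population of N containing K hits."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def iterative_enrichment(result_entities, figure_genes, background_size, n_clusters):
    """Pick the most enriched figure, remove its entities, and repeat.

    figure_genes: hypothetical mapping of figure ID -> set of gene CURIEs.
    background_size: assumed size of the gene universe used for the test.
    Returns a list of (figure_id, p_value, sorted_overlap) in iteration order.
    """
    remaining = set(result_entities)
    available = dict(figure_genes)
    clusters = []
    for _ in range(n_clusters):
        best = None
        for fig_id, genes in available.items():
            overlap = remaining & genes
            if not overlap:
                continue  # a figure with no remaining entities cannot be enriched
            p = hypergeom_sf(len(overlap), background_size, len(genes), len(remaining))
            if best is None or p < best[1]:
                best = (fig_id, p, overlap)
        if best is None:
            break  # nothing left to cluster
        fig_id, p, overlap = best
        clusters.append((fig_id, p, sorted(overlap)))
        remaining -= overlap   # exclusion step: drop entities already explained
        del available[fig_id]  # each figure is reported at most once
    return clusters


# Toy data, invented for illustration only
toy_figures = {"F1": {"A", "B", "C"}, "F2": {"A", "B"}, "F3": {"D", "E"}}
toy_clusters = iterative_enrichment({"A", "B", "C", "D", "E"}, toy_figures, 100, 2)
print([fig for fig, _, _ in toy_clusters])
```

In the toy run, F2 drops out after the first iteration because its entities were already removed with F1, which is the redundancy-reducing behavior the exclusion step is meant to provide.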

andrewsu · 2023-01-27T23:05:20Z

That log message refers to the scores that are reported under message.results.score, which are based on the Normalized Google Distance (NGD). Those scores are unrelated to the PFOCR-based scoring and clustering in this issue. You should use all the results in the TRAPI output under message.results, regardless of what is reported for the score.
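
A minimal sketch of what "use all the results" could look like in practice is below. It assumes the TRAPI 1.3 layout in which each entry of message.results carries a node_bindings mapping from query-node ID to a list of objects with an "id" CURIE; the helper names `load_trapi` and `result_entities` and the sample CURIEs are made up for illustration. The score field is deliberately never read.

```python
import json
from urllib.request import urlopen


def load_trapi(source):
    """Load a TRAPI response from a URL or a local file path."""
    if source.startswith(("http://", "https://")):
        with urlopen(source) as resp:
            return json.load(resp)
    with open(source, encoding="utf-8") as fh:
        return json.load(fh)


def result_entities(trapi, qnode_id):
    """Collect the unique CURIEs bound to one query-graph node, in result order.

    Every result is used; the per-result score is ignored on purpose.
    """
    entities = []
    seen = set()
    for result in trapi["message"]["results"]:
        for binding in result.get("node_bindings", {}).get(qnode_id, []):
            curie = binding["id"]
            if curie not in seen:
                seen.add(curie)
                entities.append(curie)
    return entities


# Tiny hand-written response (placeholder CURIEs, not real data)
sample = {
    "message": {
        "results": [
            {"node_bindings": {"n0": [{"id": "X:1"}], "n1": [{"id": "G:1"}]}, "score": 0.1},
            {"node_bindings": {"n1": [{"id": "G:2"}, {"id": "G:1"}]}, "score": None},
        ]
    }
}
print(result_entities(sample, "n1"))
```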

ayushi-agrawal-gladstone · 2023-01-27T23:16:51Z

Got it. Thanks Andrew.

AlexanderPico · 2023-01-28T00:57:31Z

This is not from our BTE-PFOCR notebook, but I'm just putting this here as an example since I made it recently for another project. The rows and columns are swapped from the suggestion above and it is for a "normal" enrichment of PFOCR rather than the iterative enrichment-with-exclusion that we are planning. Nevertheless, it illustrates what can be learned from a heatmap view and highlights why we are interested in the exclusion strategy. (Note the redundant representation by these 16 pathways of gene groupings that could equally be represented by just 2 or 3 pathways.)

ayushi-agrawal-gladstone · 2023-01-30T23:00:59Z

Here is the repo I created for this issue. The PFOCR figure results using the example of 7 TRAPI results are in the jupyter notebook: https://github.com/wikipathways/BioThings_Explorer_PFOCR_prioritization/blob/main/bte_clustering_AA.ipynb

In the current implementation, the notebook expects the third user-specified parameter (i.e., whether clusters must include user-specified node IDs for the subject, the object, both, or neither) to be "true" or "false". If true, we require that the PFOCR figures include at least one of the user-specified CURIEs. If false, we do not require this. This functionality will be improved in the next version of the notebook to cover all options, i.e., include some node IDs, all node IDs, or none.
We have not yet implemented any clustering in this notebook. We are just mapping PFOCR figures that have at least one user-specified CURIE from the query to the TRAPI results. As seen in this notebook, the final data frame significant_trapi_results_with_figures_df has more than double the number of results because we show the same result multiple times for different figures. Don't we need another step to further refine these results?
Can you please give an example of "result entities"? I want to make sure I am understanding the term correctly.
The current version does not include the heatmap as output. This will be added in the next version. @AlexanderPico Please correct me if I am wrong.
Does the final data frame significant_trapi_results_with_figures_df in the latest notebook correspond to the required output below?

The n figures corresponding to the n most enriched PFOCR figure clusters
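
The true/false behavior described in this comment could be sketched roughly like this. It is an illustration only, not the notebook's code: the function name `filter_figures`, the `figure_genes` mapping (figure ID to set of CURIEs), and the toy identifiers are assumptions.

```python
def filter_figures(figure_genes, required_curies, require_specified_query_nodes=True):
    """Keep figures containing at least one required CURIE when the flag is True.

    figure_genes: hypothetical mapping of figure ID -> set of CURIEs in the figure.
    required_curies: the user-specified CURIEs from the query (subject/object).
    """
    if not require_specified_query_nodes:
        return dict(figure_genes)  # flag off: keep every figure
    required = set(required_curies)
    return {fig: genes for fig, genes in figure_genes.items() if required & set(genes)}


# Toy data, invented for illustration only
figs = {"F1": {"DRUG:1", "G:1"}, "F2": {"G:2", "G:3"}, "F3": {"DISEASE:1"}}
print(sorted(filter_figures(figs, ["DRUG:1", "DISEASE:1"])))
print(sorted(filter_figures(figs, ["DRUG:1", "DISEASE:1"], False)))
```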

AlexanderPico · 2023-01-31T02:21:33Z

Heatmap view from notebook run on example query with 2862 results. Note: not all row labels are being shown (for legibility). The plot is 167 rows × 9 columns.

AlexanderPico · 2023-01-31T20:19:36Z

OK. We're almost there! Some key feedback and decisions for the completion of this v1 issue:

1. We don't need to resolve all the edge types via knowledge graph lookups. For example, we can work with the df of 7 results without expanding it to 18 fully resolved results.
2. For v1, users can specify a single boolean: require_specified_query_nodes that prioritizes PFOCR results that include these nodes (e.g., Imatinib and/or Asthma). No need to explicitly list node IDs or n#s. This should be True by default. Ayushi and I will work out a good default behavior and then we really need more user feedback to refine how this should work in later versions.
3. We should be using this service for unifying node IDs https://nodenormalization-sri.renci.org/docs, EXCEPT in this case the knowledge graph item attributes already include all the xrefs, so we can simply pull them from there.
4. Outputs:
- A. Figures by CURIEs (binary heatmap): group CURIEs (like in heatmap above) and sort figures by iteration order.
- B. Figures by Results (binary heatmap): rows are concatenated result nodes (from results df in notebook), sorted by original result number (e.g., 0-7), and columns are figures, sorted by iteration order.
- C. Figure-grouped results (image + table): Render each figure in iteration order with a table of the results containing intersecting CURIEs (not including the "required" CURIEs) such that users can browse groups of results that have representative figures.
5. This notebook will always start with a results URL. The primary users will be internal for now and will likely have result URLs of interest already in hand, so no need to implement the query execution.
6. This notebook will always end with the summary outputs (item 4 above), so no need to construct a specific df or json object. If this approach proves useful after rounds of testing and tweaking, it might get implemented as a feature of the Translator UI, to be used on results per user intention.
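
As a rough illustration of output A (figures by CURIEs), the binary matrix behind such a heatmap could be built as below. The function name, the `clusters` structure (a list of figure ID and member-set pairs in iteration order), and the row-grouping rule (sort rows by membership pattern) are assumptions for this sketch; the notebook may group rows differently.

```python
def membership_matrix(clusters, curies):
    """Build a binary CURIE x figure matrix.

    clusters: list of (figure_id, set_of_curies), already in iteration order.
    Returns (row_labels, column_labels, matrix); rows are sorted so that CURIEs
    sharing the same figures sit next to each other.
    """
    columns = [fig_id for fig_id, _ in clusters]
    rows = [(c, [1 if c in members else 0 for _, members in clusters]) for c in curies]
    # Rows hitting earlier figures come first; ties are broken by CURIE label
    rows.sort(key=lambda r: ([-v for v in r[1]], r[0]))
    return [c for c, _ in rows], columns, [vals for _, vals in rows]


# Toy data, invented for illustration only
labels, cols, matrix = membership_matrix(
    [("F1", {"A", "B"}), ("F2", {"C"})], ["C", "A", "B", "D"]
)
print(cols)
for label, row in zip(labels, matrix):
    print(label, row)
```

The resulting list of lists can be handed to any plotting routine that accepts a 2D array (for example matplotlib's `imshow`) to color the cells.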

AlexanderPico · 2023-02-04T01:14:12Z

All the features above are complete. Looks pretty good! Should be ready to demo next week.

I propose opening new issues for additional bugs and features, or other varieties of PFOCR notebooks.

AlexanderPico · 2023-02-04T02:10:10Z

And we're trying out a new name :)

PET Notebook
A notebook for exploring Pathway Enriched TRAPI results

https://github.com/wikipathways/BioThings_Explorer_PFOCR_prioritization/blob/main/PET_notebook_v1.ipynb

ayushi-agrawal-gladstone · 2023-02-05T01:39:27Z

PET Notebook uses PFOCR CSV files as input. The Jupyter notebook that generates the required PFOCR input CSV files has been moved to the PFOCR repo. I have also updated the PFOCR pipeline details to ensure this notebook is run in the next release of PFOCR and the required CSV files are generated. This will keep the inputs required by PET notebook up to date with the latest release of PFOCR.

andrewsu · 2023-02-13T18:30:58Z

(Sorry, forgot to update comments here.) I think the notebook looks great. I don't completely love the examples I suggested in the initial post to demonstrate the utility. Would be great to explore other options (I think one of you had suggested "genes related to Alzheimer's" and I think that would be great).

Also noting that Alex presented to the Translator User-Centered Working Group and that was well-received. Would be good to follow up with them in a month or two from now...

ayushi-agrawal-gladstone · 2023-02-14T22:20:26Z

The issue tracker for the PET notebook v1 is here.

We tested the Alzheimer's disease query and found a bug in the notebook. Kristina is helping us test more examples. We have an active issue for this here.

ayushi-agrawal-gladstone · 2023-02-16T22:31:24Z

@andrewsu What steps did you follow for getting the below example URLs? Kristina is testing some queries on the notebook and we are facing issues which I think might link back to the JSON URL.

Imatinib - [Gene] - [Gene] - Asthma: https://arax.ncats.io/api/arax/v1.3/response/49d80ecb-7fd9-4ee6-a642-6d7994903f04 (41 MB, 2862 results); "results" in n1 and n2
Imatinib - [Gene] - Asthma: https://arax.ncats.io/api/arax/v1.3/response/7b14f961-9066-41f7-9e3b-d76b2b4a7fac (83kB, 7 results); "results" in n1

andrewsu · 2023-02-16T23:54:34Z

When a query is posted to the ARS, you get back a JSON object with a primary key (PK). For example, these are the first few lines of an ARS response:

{
    "model": "tr_ars.message",
    "pk": "8d85bbb4-2085-4ad8-a71a-b4dd5099c4a0",
    "fields": {
        "name": "",
...

That PK can then be plugged into the ARAX UI. For the example above: https://arax.ncats.io/?r=8d85bbb4-2085-4ad8-a71a-b4dd5099c4a0. You can also get JSON output here: https://arax.ncats.io/api/arax/v1.3/response/8d85bbb4-2085-4ad8-a71a-b4dd5099c4a0. In either case, you can see that there is a child PK for BTE's response to this query: 6e71e100-0ec8-493e-9bbe-b69886487ec5. The JSON for that response can be retrieved at this URL: https://arax.ncats.io/api/arax/v1.3/response/6e71e100-0ec8-493e-9bbe-b69886487ec5. That content should be equivalent to the imatinib/asthma examples linked above. Let me know if you need more clarification!
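
For convenience, turning a PK into the JSON URL can be scripted. The sketch below only encodes the URL pattern shown in the examples above; the helper name `response_url` and the UUID check are assumptions, and locating the child PK inside the parent response is left out because that layout is not described here.

```python
import re

ARAX_RESPONSE_BASE = "https://arax.ncats.io/api/arax/v1.3/response/"
UUID_RE = re.compile(r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$")


def response_url(pk):
    """Return the JSON URL for a PK, following the URL pattern used in this thread."""
    pk = pk.strip().lower()
    if not UUID_RE.match(pk):
        raise ValueError(f"not a UUID-style PK: {pk!r}")
    return ARAX_RESPONSE_BASE + pk


# Child PK for BTE's response, taken from the comment above
print(response_url("6e71e100-0ec8-493e-9bbe-b69886487ec5"))
```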

ayushi-agrawal-gladstone · 2023-02-17T00:02:07Z

@khanspers Tagging you here so you can see Andrew's response as well for getting the response JSON URL.

AlexanderPico · 2024-09-30T19:09:19Z

Done. Pursuing joint strategies with other Translator teams for result list-level enrichment.

andrewsu assigned AlexanderPico Dec 15, 2022

andrewsu changed the title ~~Create simplified notebook that executes~~ Create simplified notebook that executes PFOCR clustering Dec 15, 2022

andrewsu mentioned this issue Dec 15, 2022

PFOCR for prioritization / clustering #451

Closed

ayushi-agrawal-gladstone self-assigned this Dec 16, 2022

colleenXu mentioned this issue Apr 19, 2023

overview and management of TRAPI 1.4 features #613

Closed

15 tasks

colleenXu mentioned this issue May 5, 2023

Query graph node id is blank #636

Closed

AlexanderPico closed this as completed Sep 30, 2024
