Create simplified notebook that executes PFOCR clustering #538

Closed
andrewsu opened this issue Dec 15, 2022 · 15 comments

@andrewsu
Member

This ticket is based on the exploratory work done in #451, but the goal here is to create a notebook that can be easily used and modified by other data analysts within Translator. This notebook should read in a TRAPI result, perform a PFOCR enrichment analysis over the entities in each TRAPI result, and then report groups of related results with the PFOCR figures that join them.

Requirements:

  • Take TRAPI results as input
  • User-specified parameters:
    • which node ID from message.query_graph.nodes should be used for clustering. The entities mapped to that node ID are the "result entities"
    • a parameter n corresponding to the number of clusters desired
    • whether clusters must include user-specified node IDs for the subject, the object, both, or neither
  • Implementation details:
    • The enrichment analysis will be iterative in design. On the first iteration, all result entities will be grouped in one list, and an enrichment analysis will be run relative to all PFOCR figures. The most significantly enriched PFOCR figure will be set aside as PFOCR Figure 1.
    • The result entities that occurred in PFOCR Figure 1 will be removed from the result entity list, and a new enrichment analysis will be performed on the remaining entities. The most significantly enriched PFOCR figure will be set aside as PFOCR Figure 2.
    • The process will be repeated a total of n times.
    • The basic flow has been worked out in https://github.com/wikipathways/pathway-figure-ocr/blob/master/notebooks/bte_clustering.ipynb, but that notebook needs to be simplified for broader use.
  • Output:
    • The n figures corresponding to the n most enriched PFOCR figure clusters
    • A heatmap diagram
      • each column is a PFOCR figure cluster (n)
      • each row is a result entity
      • the cell is colored if the result entity is in the PFOCR figure cluster
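The iterative enrichment-with-exclusion described under "Implementation details" can be sketched as follows. This is a minimal illustration, not the code from the linked notebook: the function names, the choice of a one-sided hypergeometric test as the enrichment statistic, and the fixed `background_size` are all assumptions made for the example.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N items from a population of M holding n successes."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def iterative_enrichment(result_entities, figures, background_size, n_clusters):
    """Pick up to n_clusters figures, removing covered entities after each pick.

    figures: dict mapping a figure ID to the set of entity IDs in that figure.
    background_size: assumed size of the entity universe (hypothetical parameter).
    Returns a list of (figure_id, p_value, sorted overlapping entities).
    """
    remaining = set(result_entities)
    available = dict(figures)
    picked = []
    for _ in range(n_clusters):
        best = None
        for fig_id, members in available.items():
            overlap = remaining & members
            if not overlap:
                continue  # a figure with no remaining entities cannot be enriched
            p = hypergeom_sf(len(overlap), background_size, len(members), len(remaining))
            if best is None or p < best[1]:
                best = (fig_id, p, overlap)
        if best is None:
            break  # nothing left to explain
        fig_id, p, overlap = best
        picked.append((fig_id, p, sorted(overlap)))
        remaining -= overlap      # exclusion step
        del available[fig_id]
    return picked


# Toy data: F2 is redundant with F1, so exclusion skips it in round two.
toy_figures = {"F1": {"A", "B", "C"}, "F2": {"B", "C"}, "F3": {"D", "E"}}
clusters = iterative_enrichment(["A", "B", "C", "D"], toy_figures, 20, 2)
```

In the toy data, F2 would rank second in a plain enrichment, but once the entities covered by F1 are removed it has no overlap left, so F3 is picked instead.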
@andrewsu changed the title from "Create simplified notebook that executes" to "Create simplified notebook that executes PFOCR clustering" on Dec 15, 2022
@andrewsu
Member Author

That log message refers to the scores reported under message.results.score, which are based on the Normalized Google Distance (NGD). Those scores are unrelated to the PFOCR-based scoring and clustering in this issue. You should use all the results in the TRAPI output under message.results, regardless of what is reported for the score.
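As a rough sketch of "use all the results regardless of score": the helper below collects the CURIEs bound to one query-graph node across every entry in message.results and never looks at the score field. The function name and the toy response are hypothetical; the `node_bindings` layout follows the TRAPI result structure as I understand it, so check it against the TRAPI version of your response.

```python
def collect_result_entities(trapi_response, qnode_id):
    """Return the unique CURIEs bound to qnode_id, in first-seen order.

    Every result is used; the `score` field is deliberately ignored.
    """
    results = trapi_response.get("message", {}).get("results") or []
    seen = {}
    for result in results:
        for binding in result.get("node_bindings", {}).get(qnode_id, []):
            seen.setdefault(binding["id"], None)
    return list(seen)


# Made-up response with placeholder CURIEs, including low and missing scores.
toy_response = {
    "message": {
        "results": [
            {"node_bindings": {"n0": [{"id": "EX:drug"}], "n1": [{"id": "EX:gene1"}]}, "score": 0.9},
            {"node_bindings": {"n0": [{"id": "EX:drug"}], "n1": [{"id": "EX:gene2"}]}, "score": 0.0},
            {"node_bindings": {"n0": [{"id": "EX:drug"}], "n1": [{"id": "EX:gene1"}]}, "score": None},
        ]
    }
}
entities = collect_result_entities(toy_response, "n1")
```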

@ayushi-agrawal-gladstone
Collaborator

Got it. Thanks Andrew.

@AlexanderPico
Collaborator

AlexanderPico commented Jan 28, 2023

[Image: Screen Shot 2023-01-27 at 4 52 44 PM]

This is not from our BTE-PFOCR notebook; I'm just putting it here as an example, since I made it recently for another project. The rows and columns are swapped relative to the suggestion above, and it shows a "normal" enrichment of PFOCR rather than the iterative enrichment-with-exclusion that we are planning. Nevertheless, it illustrates what can be learned from a heatmap view and highlights why we are interested in the exclusion strategy. Note the redundancy: these 16 pathways represent gene groupings that could equally be represented by just 2 or 3 pathways.

@ayushi-agrawal-gladstone
Collaborator

Here is the repo I created for this issue. The PFOCR figure results using the example of 7 TRAPI results are in the jupyter notebook: https://github.com/wikipathways/BioThings_Explorer_PFOCR_prioritization/blob/main/bte_clustering_AA.ipynb

  1. In the current implementation, the notebook expects the 3rd user-specified parameter (i.e., whether clusters must include user-specified node IDs for the subject, the object, both, or neither) to be "true" or "false". If true, we require that the PFOCR figures include at least one of the user-specified CURIEs. If false, we do not require this. This functionality will be improved in the next version of the notebook to cover all options, i.e., include some node IDs, all node IDs, or neither.

  2. We have not yet implemented any clustering in this notebook. We are just mapping PFOCR figures that have at least one user-specified CURIE from the query to the TRAPI results. As seen in this notebook, the final data frame significant_trapi_results_with_figures_df has more than double the number of results, because the same result is shown multiple times for different figures. Don't we need another step to further refine these results?

  3. Can you please give an example of "result entities"? I want to make sure I am understanding the term correctly.

  4. The current version does not include the heatmap as output. This will be added in the next version. @AlexanderPico Please correct me if I am wrong.

  5. Does the final data frame significant_trapi_results_with_figures_df in the latest notebook correspond to the required output below?

The n figures corresponding to the n most enriched PFOCR figure clusters
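To make the true/false behaviour from item 1 concrete, here is a minimal sketch of the filter. It is not the notebook's code: the function name and data layout are hypothetical, and the flag is named `require_specified_query_nodes` only because that name is proposed later in this thread.

```python
def filter_figures(figures, query_curies, require_specified_query_nodes=True):
    """Keep figures containing at least one query CURIE when the flag is True.

    figures: dict mapping a figure ID to the set of CURIEs found in that figure.
    With the flag set to False, every figure is kept.
    """
    if not require_specified_query_nodes:
        return dict(figures)
    required = set(query_curies)
    return {fig: members for fig, members in figures.items() if members & required}


# Placeholder CURIEs, not real identifiers.
toy_figures = {
    "F1": {"EX:drug", "EX:gene1"},
    "F2": {"EX:gene2", "EX:gene3"},
    "F3": {"EX:disease", "EX:gene3"},
}
kept = filter_figures(toy_figures, ["EX:drug", "EX:disease"])
kept_all = filter_figures(toy_figures, ["EX:drug", "EX:disease"], require_specified_query_nodes=False)
```

Supporting "subject only", "object only", or "both" would mean passing a different `query_curies` list, or requiring every CURIE instead of any one of them.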

@AlexanderPico
Collaborator

[Image: Screen Shot 2023-01-30 at 6 19 23 PM]

Heatmap view from notebook run on example query with 2862 results. Note: not all row labels are being shown (for legibility). The plot is 167 rows × 9 columns.

@AlexanderPico
Collaborator

AlexanderPico commented Jan 31, 2023

OK. We're almost there! Some key feedback and decisions for the completion of this v1 issue:

  1. We don't need to resolve all the edge types via knowledge graph lookups. For example, we can work with the df of 7 results without expanding it to 18 fully resolved results.
  2. For v1, users can specify a single boolean, require_specified_query_nodes, that prioritizes PFOCR results that include these nodes (e.g., Imatinib and/or Asthma). No need to explicitly list node IDs or n#s. This should be True by default. Ayushi and I will work out a good default behavior, and then we really need more user feedback to refine how this should work in later versions.
  3. We should be using this service for unifying node IDs: https://nodenormalization-sri.renci.org/docs. In this case, however, the knowledge graph item attributes already include all the xrefs, so we can simply pull them from there.
  4. Outputs:
    A. Figures by CURIEs (binary heatmap): group CURIEs (like in the heatmap above) and sort figures by iteration order.
    B. Figures by Results (binary heatmap): rows are concatenated result nodes (from the results df in the notebook), sorted by original result number (e.g., 0-7), and columns are figures, sorted by iteration order.
    C. Figure-grouped results (image + table): render each figure in iteration order with a table of the results containing intersecting CURIEs (not including the "required" CURIEs), so that users can browse groups of results that have representative figures.
  5. This notebook will always start with a results URL. The primary users will be internal for now and will likely have result URLs of interest already in hand, so there is no need to implement the query execution.
  6. This notebook will always end with the summary outputs (item 4 above), so there is no need to construct a specific df or JSON object. If this approach proves useful after rounds of testing and tweaking, it might get implemented as a feature of the Translator UI, to be used on results per user intention.
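Outputs B and C both reduce to a binary results-by-figures matrix. The sketch below is one way to build it, under assumptions: each result is represented as a set of CURIEs, figures are already in iteration order, and the function names and `ignore` parameter are hypothetical. Whether the "required" CURIEs should also be ignored for output B is not stated above, so the example applies that exclusion only to the grouping for output C.

```python
def results_by_figures(results, figure_order, figures, ignore=()):
    """Binary matrix: rows are results (original order), columns are figures.

    A cell is 1 when the result shares at least one CURIE with the figure,
    after dropping any CURIEs listed in `ignore`.
    """
    skip = set(ignore)
    return [
        [int(bool((set(r) - skip) & figures[fig])) for fig in figure_order]
        for r in results
    ]


def group_results_by_figure(results, figure_order, figures, required=()):
    """For each figure, list the indices of results that intersect it,
    not counting the required query CURIEs."""
    matrix = results_by_figures(results, figure_order, figures, ignore=required)
    return {
        fig: [i for i, row in enumerate(matrix) if row[j]]
        for j, fig in enumerate(figure_order)
    }


# Placeholder CURIEs; EX:drug and EX:disease stand in for the query nodes.
toy_results = [
    {"EX:drug", "EX:gene1", "EX:disease"},
    {"EX:drug", "EX:gene2", "EX:disease"},
    {"EX:drug", "EX:gene3", "EX:disease"},
]
toy_figures = {
    "F1": {"EX:drug", "EX:gene1", "EX:gene2"},
    "F2": {"EX:disease", "EX:gene3"},
}
order = ["F1", "F2"]
heatmap = results_by_figures(toy_results, order, toy_figures)
groups = group_results_by_figure(toy_results, order, toy_figures, required=["EX:drug", "EX:disease"])
```

Without the exclusion, every result matches every figure through the shared query CURIEs, which is why the grouped view ignores them.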

@AlexanderPico
Collaborator

All the features above are complete. Looks pretty good! Should be ready to demo next week.

I propose opening new issues for additional bugs and features, or other varieties of PFOCR notebooks.

@AlexanderPico
Collaborator

And we're trying out a new name :)

PET Notebook
A notebook for exploring Pathway Enriched TRAPI results

https://github.com/wikipathways/BioThings_Explorer_PFOCR_prioritization/blob/main/PET_notebook_v1.ipynb

@ayushi-agrawal-gladstone
Collaborator

ayushi-agrawal-gladstone commented Feb 5, 2023

PET Notebook uses PFOCR CSV files as input. The Jupyter notebook that generates the required PFOCR input CSV files has been moved to the PFOCR repo. I have also updated the PFOCR pipeline details to ensure this notebook is run in the next release of PFOCR and the required CSV files are generated. This will keep the inputs required by PET notebook up to date with the latest release of PFOCR.

@andrewsu
Member Author

(Sorry, forgot to update comments here.) I think the notebook looks great. I don't completely love the examples I suggested in the initial post to demonstrate the utility. Would be great to explore other options (I think one of you had suggested "genes related to Alzheimer's" and I think that would be great).

Also noting that Alex presented to the Translator User-Centered Working Group and that was well-received. Would be good to follow up with them in a month or two from now...

@ayushi-agrawal-gladstone
Collaborator

ayushi-agrawal-gladstone commented Feb 14, 2023

The issue tracker for the PET notebook v1 is here.

We tested the Alzheimer's disease query and found a bug in the notebook. Kristina is helping us test more examples. We have an active issue for this here.

@ayushi-agrawal-gladstone
Collaborator

@andrewsu What steps did you follow to get the example URLs below? Kristina is testing some queries on the notebook, and we are facing issues that I think might trace back to the JSON URL.

Imatinib - [Gene] - [Gene] - Asthma: https://arax.ncats.io/api/arax/v1.3/response/49d80ecb-7fd9-4ee6-a642-6d7994903f04 (41 MB, 2862 results); "results" in n1 and n2
Imatinib - [Gene] - Asthma: https://arax.ncats.io/api/arax/v1.3/response/7b14f961-9066-41f7-9e3b-d76b2b4a7fac (83 kB, 7 results); "results" in n1

@andrewsu
Member Author

When a query is posted to the ARS, you get back a JSON object with a primary key (PK). For example, these are the first few lines of an ARS response:

{
    "model": "tr_ars.message",
    "pk": "8d85bbb4-2085-4ad8-a71a-b4dd5099c4a0",
    "fields": {
        "name": "",
...

That PK can then be plugged into the ARAX UI. For the example above: https://arax.ncats.io/?r=8d85bbb4-2085-4ad8-a71a-b4dd5099c4a0. You can also get JSON output here: https://arax.ncats.io/api/arax/v1.3/response/8d85bbb4-2085-4ad8-a71a-b4dd5099c4a0. In either case, you can see that there is a child PK for BTE's response to this query: 6e71e100-0ec8-493e-9bbe-b69886487ec5. The JSON for that response can be retrieved at this URL: https://arax.ncats.io/api/arax/v1.3/response/6e71e100-0ec8-493e-9bbe-b69886487ec5. That content should be equivalent to the imatinib/asthma examples linked above. Let me know if you need more clarification!
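The PK-to-URL steps above can be sketched as small helpers. The function names are hypothetical and the URL patterns are copied from the examples in this comment. Locating the child PK inside the parent response is left out, because its exact position in the JSON is not described here.

```python
import json
from urllib.request import urlopen

ARAX_BASE = "https://arax.ncats.io"


def arax_ui_url(pk):
    """URL for viewing a response in the ARAX UI."""
    return f"{ARAX_BASE}/?r={pk}"


def arax_json_url(pk):
    """URL for retrieving the same response as JSON."""
    return f"{ARAX_BASE}/api/arax/v1.3/response/{pk}"


def fetch_response(pk, timeout=60):
    """Download and parse the JSON for a PK (network access required)."""
    with urlopen(arax_json_url(pk), timeout=timeout) as resp:
        return json.load(resp)


parent_pk = "8d85bbb4-2085-4ad8-a71a-b4dd5099c4a0"
bte_child_pk = "6e71e100-0ec8-493e-9bbe-b69886487ec5"
print(arax_ui_url(parent_pk))
print(arax_json_url(bte_child_pk))
```

Calling `fetch_response(bte_child_pk)` should return the TRAPI message that the notebook takes as input, provided the response is still available on the server.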

@ayushi-agrawal-gladstone
Collaborator

@khanspers Tagging you here so you can see Andrew's response as well for getting the response JSON URL.

@AlexanderPico
Collaborator

Done. Pursuing joint strategies with other Translator teams for result list-level enrichment.
