x-bte-refactoring: multiple input/output ID namespaces #748

Open
colleenXu opened this issue Oct 21, 2023 · 2 comments

@colleenXu
Collaborator

colleenXu commented Oct 21, 2023

The motivation

(This is a specific problem that involves large-scale x-bte refactoring. Let's discuss the "large issues" of x-bte refactoring one at a time.)

This proposal is to refactor the x-bte annotation unit (the operation) so that it can include multiple input/output ID namespaces, along with the slightly different querying, response-mapping, and processing that may be needed to handle them.

Currently, there are many operations whose only real difference is the input and/or output ID namespace. The other distinguishing parts of the MetaEdge are the same (subject/object category, predicate, qualifier-set, source, KL/AT). This causes repetition at best and, at worst, a combinatorial explosion of operations that is overwhelming to write and maintain.

Scope: this occurs in a significant fraction of operations for every kind of API

  • ALL core BioThings APIs: 11-57%
    • MyDisease (8/26 operations): the disease <-> phenotype and disease <-> chemical ones
    • MyGene (8/18 operations): pathway <-> gene, (I think homolog-homolog can be done with 1 input/1 output namespace)
    • MyChem (4/35 right now): chembl drug-mech
    • MyVariant (16/28): Clinvar disease <-> gene/variant operations
  • some pending BioThings APIs (7/30 APIs) - but can be a major problem for them:
    • semmeddb: hundreds of operations may have both ncbigene and umls versions (estimate based on ~15% of operations (739/4675) being ncbigene after that namespace was added)
    • Multiomics Wellness: 108 operations due to combinatorial explosion with 7 chem namespaces, 2 clinical finding namespaces, 1 gene namespace
    • Multiomics EHR risk: 148 operations due to combinatorial explosion w/ 3 disease namespaces, 3 pheno, 2 chem, 1 procedure
    • AGR: combinatorial explosion with 7 gene namespaces
    • TTD: combinatorial explosion with 2 chem namespaces, 2 disease namespaces, 2 gene/protein namespaces
    • 2 others, less of a deal: pfocr, ncats rare-source
  • some external APIs (3/8) - but can cause problems needing bespoke post-processing solutions right now:
    • Monarch: we handle by setting specific input/output namespace in param but if we didn't, we could retrieve more data
    • CTD: haven't annotated some data because cannot handle multiple output ID namespaces from 1 field
    • RaMP (although not written yet because of other stuff in the issue)

(came out of many discussions, some documented in #656)

@colleenXu
Collaborator Author

colleenXu commented Oct 22, 2023

Initial thoughts

This isn't as simple as listing all the ID namespaces:

  • different request info (request body / parameters): happens for input (more often?) and output. I have examples for both below.
  • different response-mapping: happens for input (input_name keyword) and output (more affected)

Stuff not included in this proposal, but that I'm thinking over

A. The following proposal handles both "multiple input" and "multiple output" namespaces, but I wonder if handling just 1 (only "multiple outputs"?) is easier

B. I wonder if the "multiple output namespaces" can be handled in post-processing, so only 1 sub-query is needed to get the info for all output namespaces

C. One feature that may be nice for both inputs and outputs is a flag to toggle "multiple namespace prioritization" behavior. This is an issue with BioThings TTD output namespaces, currently handled by using different request info (see example 4 below). But I don't know how feasible this is (it may be more feasible if B is implemented).

  • prioritize by the order in which namespaces are listed in the operation. Example for outputs: for each response hit, check the output namespace fields in the order the namespaces appear in the operation. Once a namespace field is found in the hit, stop.
  • VS no priority: all namespaces should be queried (input) / looked for in the response (output)
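
To make the two behaviors in C concrete for outputs, here is a rough Python sketch (not BTE code). The `prioritize` flag, the helper names, and the toy hit are all made up for illustration, and it assumes the response fields hold bare local IDs that still need the prefix added.

```python
from typing import Any, List, Optional


def get_path(hit: dict, dotted: str) -> Optional[Any]:
    """Follow a dotted field path such as 'object.mondo' through nested dicts."""
    current: Any = hit
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_output_ids(hit: dict, namespaces: List[dict], prioritize: bool) -> List[str]:
    """Collect output CURIEs from one response hit.

    namespaces: entries like {"prefix": ..., "id_field": ...}, in operation order.
    prioritize: hypothetical flag; if True, stop at the first namespace found.
    """
    curies: List[str] = []
    for ns in namespaces:
        value = get_path(hit, ns["id_field"])
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        # Assumption: the field stores local IDs without a prefix.
        curies.extend(f'{ns["prefix"]}:{v}' for v in values)
        if prioritize:
            break
    return curies


# Toy data, not taken from any real API response.
namespaces = [
    {"prefix": "MONDO", "id_field": "object.mondo"},
    {"prefix": "ICD11", "id_field": "object.icd11"},
]
hit = {"object": {"mondo": "0005148", "icd11": "5A11"}}
print(extract_output_ids(hit, namespaces, prioritize=True))
print(extract_output_ids(hit, namespaces, prioritize=False))
```

With prioritization on, the hit above yields only the MONDO ID; with it off, both IDs come back. If something like this ran in post-processing (idea B), one sub-query could serve all output namespaces.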

First proposal

  1. unnest operation contents. Right now every x-bte-kgs-operation object is a one-element array, where that element is the main contents. I don't see a need for an array here. (fairly minor change?)
  2. inputs: make this an object, with semantic (-> semanticType) field held separately from id (-> namespaces) info. Also include input_name info (-> inputs.namespaces.name_field)
  3. requestInfo: includes all sub-query-construction info (requestBody, requestBodyObject, parameters). Uses tags / structure to support when the "multiple input namespaces have different sub-query info" OR the "multiple output namespaces have different sub-query info". But not supporting both in the same operation
  4. outputs: same as inputs, except also including response field for the output ID (-> outputs.namespaces.id_field) and output_name (-> outputs.namespaces.name_field)
  5. response-mapping: only holds edge-attributes / trapi_sources handling. So it's now an optional field (sometimes we don't have that info)
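
A parser could check the structural rules implied by this list up front. The sketch below is only an illustration in Python: `validate_operation` is a hypothetical helper, and apart from the "input OR output, not both" rule from item 3, the checks (per-namespace keys must match the declared prefixes, every output namespace needs an `id_field`) are my reading of the examples rather than settled requirements.

```python
from typing import List


def validate_operation(op: dict) -> List[str]:
    """Return a list of problems found in one proposed-format operation."""
    errors: List[str] = []
    info = op.get("requestInfo", {})
    checks = (
        ("differsByInputNamespace", "byInputNamespace", "inputs"),
        ("differsByOutputNamespace", "byOutputNamespace", "outputs"),
    )
    # Item 3: sub-query info may differ by input OR by output namespace.
    if all(info.get(flag) for flag, _, _ in checks):
        errors.append("requestInfo may differ by input OR output namespace, not both")
    for flag, key, side in checks:
        declared = {ns["prefix"] for ns in op.get(side, {}).get("namespaces", [])}
        keyed = set(info.get(key, {}))
        # Assumed rule: one per-namespace entry for each declared prefix.
        if info.get(flag) and keyed != declared:
            errors.append(f"{key} keys {sorted(keyed)} != {side} prefixes {sorted(declared)}")
        if not info.get(flag) and key in info:
            errors.append(f"{key} is present but {flag} is false")
    # Assumed rule: outputs need a response field to read the ID from (item 4).
    for ns in op.get("outputs", {}).get("namespaces", []):
        if "id_field" not in ns:
            errors.append(f"output namespace {ns.get('prefix')} has no id_field")
    return errors


operation = {
    "inputs": {"semanticType": "Disease",
               "namespaces": [{"prefix": "OMIM"}, {"prefix": "ORPHANET"}]},
    "requestInfo": {"differsByInputNamespace": True,
                    "differsByOutputNamespace": False,
                    "byInputNamespace": {"OMIM": {}, "ORPHANET": {}}},
    "outputs": {"semanticType": "PhenotypicFeature",
                "namespaces": [{"prefix": "HP",
                                "id_field": "hpo.phenotype_related_to_disease.hpo_id"}]},
}
print(validate_operation(operation))
```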

Examples

1: multiple input namespaces that can be queried together (not different sub-query info)

(Based on MyGene's PathwayHasGene2, PathwayHasGene3, PathwayHasGene4 operations. Not including PathwayHasGene1 because it has a different source)

  x-bte-kgs-operations:
    cpdb-PathwayHasGene:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: Pathway
        namespaces: 
          - prefix: "KEGG.PATHWAY"
            name_field: pathway.kegg.name
          - prefix: WIKIPATHWAYS
            name_field: pathway.wikipathways.name
          - prefix: BIOCARTA
            name_field: pathway.biocarta.name
      requestInfo:
        differsByInputNamespace: false
        differsByOutputNamespace: false
        requestBody:
          body:
            q: "{{ queryInputs }}"
            scopes: pathway.kegg.id,pathway.wikipathways.id,pathway.biocarta.id
        parameters:
          fields: entrezgene,pathway.kegg.name,pathway.wikipathways.name,pathway.biocarta.name
          species: human
          size: 1000 
      outputs:
        semanticType: Gene
        namespaces: 
          - prefix: NCBIGene
            id_field: entrezgene
      predicate: has_participant
      source: "infores:cpdb"
      knowledge_level: knowledge_assertion
      agent_type: manual_agent
      ## NO RESPONSE MAPPING: no edge-attributes 

2: multiple input namespaces that are queried separately (diff sub-query info)

  • Must query separately because OMIM and ORPHANET IDs can be mistaken for each other: they have the same "pattern for local unique identifiers" (numeric)
  • no input name field or output name field info
  • still has response-mapping for edge-attributes
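
The ID collision in the first bullet can be shown with a toy in-memory search (Python, purely illustrative). The documents, the `search` helper, and the ID value are invented, and the sketch assumes that both the API fields and the query use bare local IDs.

```python
from typing import Any, List, Optional


def rm_prefix(curie: str) -> str:
    """Drop the namespace prefix: 'OMIM:101' -> '101'."""
    return curie.split(":", 1)[1]


def get_path(doc: dict, dotted: str) -> Optional[Any]:
    """Follow a dotted field path such as 'hpo.omim' through nested dicts."""
    current: Any = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def search(docs: List[dict], curie: str, scopes: List[str]) -> List[str]:
    """Return the _id of every doc where any scope field equals the local ID."""
    local_id = rm_prefix(curie)
    return [d["_id"] for d in docs if any(get_path(d, s) == local_id for s in scopes)]


# Toy records: the same numeric local ID exists in both namespaces.
TOY_DOCS = [
    {"_id": "omim-record", "hpo": {"omim": "101"}},
    {"_id": "orphanet-record", "hpo": {"orphanet": "101"}},
]

# One combined sub-query: the OMIM input also matches the Orphanet record.
print(search(TOY_DOCS, "OMIM:101", ["hpo.omim", "hpo.orphanet"]))
# Separate sub-query per input namespace: only the intended record matches.
print(search(TOY_DOCS, "OMIM:101", ["hpo.omim"]))
```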

(Based on MyDisease's disease-phenotype, disease-phenotype2)

  x-bte-kgs-operations:
    disease-phenotype:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: Disease
        namespaces: 
          - prefix: OMIM
          - prefix: ORPHANET
      requestInfo:
        differsByInputNamespace: true
        differsByOutputNamespace: false
        byInputNamespace:
          OMIM:
            requestBody:
              body:
                q: "{{ queryInputs }}"
                scopes: hpo.omim
            ## using $ref to make less repetitive
            parameters:
              "$ref": "#/components/x-bte-refs/disease-phenotype-parameters"
          ORPHANET:
            requestBody:
              body:
                q: "{{ queryInputs }}"
                scopes: hpo.orphanet
            parameters:
              "$ref": "#/components/x-bte-refs/disease-phenotype-parameters"
      outputs:
        semanticType: PhenotypicFeature
        namespaces: 
          - prefix: HP
            id_field: hpo.phenotype_related_to_disease.hpo_id
      predicate: has_phenotype
      source: "infores:hpo-annotations"
      knowledge_level: knowledge_assertion
      agent_type: manual_agent
      response_mapping:
        "$ref": "#/components/x-bte-response-mapping/disease-phenotype"
  x-bte-refs:
    disease-phenotype-parameters:
      fields: >-
        hpo.phenotype_related_to_disease.hpo_id,
        hpo.phenotype_related_to_disease.pmid_refs,
        hpo.phenotype_related_to_disease.isbn_refs,
        hpo.phenotype_related_to_disease.website_refs,
        hpo.phenotype_related_to_disease.numeric_freq,
        hpo.phenotype_related_to_disease.hp_freq,
        hpo.phenotype_related_to_disease.freq_numerator,
        hpo.phenotype_related_to_disease.freq_denominator
  x-bte-response-mapping:
    disease-phenotype:
      ref_pmid: hpo.phenotype_related_to_disease.pmid_refs
      ref_isbn: hpo.phenotype_related_to_disease.isbn_refs
      ref_url: hpo.phenotype_related_to_disease.website_refs
      "biolink:has_quotient": hpo.phenotype_related_to_disease.numeric_freq
      "biolink:frequency_qualifier": hpo.phenotype_related_to_disease.hp_freq
      "biolink:has_count": hpo.phenotype_related_to_disease.freq_numerator
      "biolink:has_total": hpo.phenotype_related_to_disease.freq_denominator

3: multiple output namespaces (not different sub-query info)

  • Reverse of example 1
  • name_field used for outputs
  • doesn't have response-mapping

(Based on MyGene's involvedInPathway2, involvedInPathway3, involvedInPathway4 operations. Not including involvedInPathway1 because it has a different source)

  x-bte-kgs-operations:
    cpdb-involvedInPathway:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: Gene
        namespaces:
          - prefix: NCBIGene
      requestInfo:
        differsByInputNamespace: false
        differsByOutputNamespace: false
        requestBody:
          body:
            q: "{{ queryInputs }}"
            scopes: entrezgene
        parameters:
          fields: >-
            pathway.kegg.id,pathway.kegg.name,
            pathway.wikipathways.id,pathway.wikipathways.name,
            pathway.biocarta.id,pathway.biocarta.name
          species: human
          size: 1000
      outputs:
        semanticType: Pathway
        namespaces: 
          - prefix: "KEGG.PATHWAY"
            id_field: pathway.kegg.id
            name_field: pathway.kegg.name
          - prefix: WIKIPATHWAYS
            id_field: pathway.wikipathways.id
            name_field: pathway.wikipathways.name
          - prefix: BIOCARTA
            id_field: pathway.biocarta.id
            name_field: pathway.biocarta.name
      predicate: participates_in
      source: "infores:cpdb"
      knowledge_level: knowledge_assertion
      agent_type: manual_agent
      ## NO RESPONSE MAPPING

4: multiple input namespaces AND output namespaces, different sub-query info for outputs

  • 2 input namespaces: "PUBCHEM.COMPOUND" and "TTD.DRUG". They have different "patterns for local unique identifiers":
    • PUBCHEM.COMPOUND:139600308 (numeric)
    • TTD.DRUG:DZJ3D5 (has letters and numbers, Bioregistry entry looks different...)
  • 2 output namespaces: MONDO and ICD11
  • using filter so sub-query info won't differ by input namespace
  • but output namespaces do have different sub-query info (parameters)
  • doesn't have response-mapping

(Based on BioThings TTD operations: pubchem_treats_mondo, pubchem_treats_icd11, ttd_drug_id_treats_mondo, ttd_drug_id_treats_icd11)

  x-bte-kgs-operations:
    chemical-treats-disease:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: SmallMolecule
        namespaces: 
          - prefix: "PUBCHEM.COMPOUND"
            name_field: subject.name
          - prefix: "TTD.DRUG"
            name_field: subject.name
      requestInfo:
        differsByInputNamespace: false 
        differsByOutputNamespace: true
        byOutputNamespace:
          MONDO:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
            parameters:
              fields: object.mondo,object.name,subject.name
              filter: association.predicate:"biolink:treats"
              size: 1000
          ICD11:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
            parameters:
              fields: object.icd11,object.name,subject.name
              filter: association.predicate:"biolink:treats" AND (NOT _exists_:object.mondo)
              size: 1000
      outputs:
        semanticType: Disease
        namespaces: 
          - prefix: MONDO
            id_field: object.mondo
            name_field: object.name
          - prefix: ICD11
            id_field: object.icd11
            name_field: object.name
      predicate: treats
      source: "infores:ttd"
      knowledge_level: knowledge_assertion
      agent_type: manual_agent
      ## NO RESPONSE MAPPING
  x-bte-refs:
    requestBody-chemTreatsDisease:
      body:
        q: "{{ queryInputs }}"
        scopes: subject.pubchem_compound,subject.ttd_drug_id
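
Since these examples lean on "$ref" to cut repetition, whatever parses them needs to expand local references and should fail loudly when a target is missing. A minimal Python sketch of that, assuming the YAML has already been loaded into plain dicts/lists; `lookup_ref` and `resolve_refs` are hypothetical names, and there is no cycle detection.

```python
from typing import Any


def lookup_ref(root: dict, ref: str) -> Any:
    """Resolve a local reference like '#/components/x-bte-refs/name'."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local refs are handled in this sketch: {ref!r}")
    current: Any = root
    for part in ref[2:].split("/"):
        # JSON Pointer escapes: ~1 means '/', ~0 means '~'.
        part = part.replace("~1", "/").replace("~0", "~")
        if not isinstance(current, dict) or part not in current:
            raise KeyError(f"dangling $ref {ref!r}: no key {part!r}")
        current = current[part]
    return current


def resolve_refs(node: Any, root: dict) -> Any:
    """Return a copy of node with every {'$ref': ...} object expanded."""
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            return resolve_refs(lookup_ref(root, node["$ref"]), root)
        return {key: resolve_refs(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [resolve_refs(item, root) for item in node]
    return node


spec = {
    "components": {
        "x-bte-refs": {
            "requestBody-chemTreatsDisease": {
                "body": {"q": "{{ queryInputs }}",
                         "scopes": "subject.pubchem_compound,subject.ttd_drug_id"},
            },
        },
        "x-bte-kgs-operations": {
            "chemical-treats-disease": {
                "requestInfo": {
                    "requestBody": {
                        "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
                    },
                },
            },
        },
    },
}
resolved = resolve_refs(spec, spec)
operation = resolved["components"]["x-bte-kgs-operations"]["chemical-treats-disease"]
print(operation["requestInfo"]["requestBody"])
```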

5: multiple input namespaces AND output namespaces, different sub-query info for inputs

  • 3 input ID namespaces: HP, NCIT, SNOMEDCT
    • must query separately because HP and SNOMEDCT IDs can be mistaken for each other (both are numeric) -> different request bodies
  • using filter for cleaner text
  • 3 output ID namespaces: MONDO, NCIT, SNOMEDCT
  • no source/KL/AT fields because it's in the response-mapping.

Based on 6 current Multiomics EHR Risk operations (since 3 combinations don't actually exist in the data...)

  • PhenoHP_increased_DiseaseMONDO
  • PhenoHP_increased_DiseaseNCIT
  • PhenoHP_increased_DiseaseSNOMEDCT
  • PhenoNCIT_increased_DiseaseMONDO
  • PhenoSNOMEDCT_increased_DiseaseMONDO
  • PhenoSNOMEDCT_increased_DiseaseSNOMEDCT

  x-bte-kgs-operations:
    pheno-increased-disease:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: PhenotypicFeature
        namespaces: 
          - prefix: HP
            name_field: subject.name
          - prefix: NCIT
            name_field: subject.name
          - prefix: SNOMEDCT
            name_field: subject.name
      requestInfo:
        differsByInputNamespace: true
        differsByOutputNamespace: false
        byInputNamespace:
          HP:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestInfo_HP"
            parameters:
              "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
          NCIT:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestInfo_NCIT"
            parameters:
              "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
          SNOMEDCT:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestInfo_SNOMEDCT"
            parameters:
              "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
      outputs:
        semanticType: Disease
        namespaces: 
          - prefix: MONDO
            id_field: object.MONDO
            name_field: object.name
          - prefix: NCIT
            id_field: object.NCIT
            name_field: object.name
          - prefix: SNOMEDCT
            id_field: object.SNOMEDCT
            name_field: object.name
      predicate: associated_with
      qualifiers:
        object_direction_qualifier: increased
        object_aspect_qualifier: likelihood
      response_mapping:
        "$ref": "#/components/x-bte-response-mapping/edge-info"
  x-bte-response-mapping:
    edge-info:
      edge-attributes: association.edge_attributes
      trapi_sources: source.edge_sources
  x-bte-refs:
    ## these are referenced from under "requestBody", so they start at "body"
    requestInfo_HP:
      body:
        q: "{{ queryInputs | rmPrefix() }}"
        scopes: subject.HP
    requestInfo_NCIT:
      body:
        q: "{{ queryInputs | rmPrefix() }}"
        scopes: subject.NCIT
    requestInfo_SNOMEDCT:
      body:
        q: "{{ queryInputs | rmPrefix() }}"
        scopes: subject.SNOMEDCT
    params_phenoIncreasedDisease:
      fields: >-
        object.MONDO,object.NCIT,object.SNOMEDCT,
        association.edge_attributes,source.edge_sources,
        subject.name,object.name
      size: 1000
      filter: >-
        subject.type:"biolink:PhenotypicFeature" AND 
        association.predicate:associated_with_increased_likelihood_of AND
        object.type:"biolink:Disease"
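
Taken together, the five examples suggest three ways a requestInfo block turns into sub-queries: shared info (examples 1 and 3), one sub-query per input namespace (examples 2 and 5), and one per output namespace (example 4). Here is a rough Python sketch of that planning step; `plan_subqueries` and the shape of the returned plan dicts are assumptions for illustration, not how smartapi-kg actually does it.

```python
from typing import Dict, List

REQUEST_KEYS = ("requestBody", "requestBodyObject", "parameters")


def curie_prefix(curie: str) -> str:
    """'TTD.DRUG:DZJ3D5' -> 'TTD.DRUG'."""
    return curie.split(":", 1)[0]


def prefixes(side: dict) -> List[str]:
    """List the namespace prefixes of an inputs/outputs block, in order."""
    return [ns["prefix"] for ns in side["namespaces"]]


def plan_subqueries(op: dict, curies: List[str]) -> List[Dict]:
    """Decide which sub-queries to send for a batch of input CURIEs."""
    info = op["requestInfo"]
    in_prefixes = prefixes(op["inputs"])
    out_prefixes = prefixes(op["outputs"])
    # Ignore inputs whose namespace the operation does not declare.
    supported = [c for c in curies if curie_prefix(c) in in_prefixes]
    if not supported:
        return []
    if info.get("differsByInputNamespace"):
        plans = []
        for prefix in in_prefixes:
            batch = [c for c in supported if curie_prefix(c) == prefix]
            if batch:
                plans.append({"inputs": batch, "outputPrefixes": out_prefixes,
                              "request": info["byInputNamespace"][prefix]})
        return plans
    if info.get("differsByOutputNamespace"):
        return [{"inputs": supported, "outputPrefixes": [prefix],
                 "request": info["byOutputNamespace"][prefix]}
                for prefix in out_prefixes]
    shared = {key: info[key] for key in REQUEST_KEYS if key in info}
    return [{"inputs": supported, "outputPrefixes": out_prefixes, "request": shared}]


# Trimmed-down version of example 5 (request details replaced by placeholders).
operation = {
    "inputs": {"namespaces": [{"prefix": "HP"}, {"prefix": "NCIT"}, {"prefix": "SNOMEDCT"}]},
    "outputs": {"namespaces": [{"prefix": "MONDO"}, {"prefix": "NCIT"}, {"prefix": "SNOMEDCT"}]},
    "requestInfo": {
        "differsByInputNamespace": True,
        "differsByOutputNamespace": False,
        "byInputNamespace": {"HP": {"scopes": "subject.HP"},
                             "NCIT": {"scopes": "subject.NCIT"},
                             "SNOMEDCT": {"scopes": "subject.SNOMEDCT"}},
    },
}
for plan in plan_subqueries(operation, ["HP:0001250", "SNOMEDCT:1234", "HP:0000118"]):
    print(plan["inputs"], plan["request"])
```

Only namespaces that actually appear in the batch get a sub-query, so the call above plans two requests (HP and SNOMEDCT) rather than three.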

@rjawesome
Contributor

I set up this proposal in the multiple-input-output branch using the smartapi-kg and api-respone-transform.js repositories
