x-bte-refactoring: multiple input/output ID namespaces #748

colleenXu · 2023-10-21T06:49:45Z

The motivation

(This is a specific problem that involves large-scale x-bte refactoring. Let's discuss the "large issues" of x-bte refactoring one at a time.)

This proposal is to refactor the x-bte annotation unit/operation so that it can include multiple input/output ID namespaces, along with the slightly different querying/response-mapping/processing that may be needed to handle them.

Currently, there are many operations where the core difference is the input and/or output ID namespace. The other unique parts of the MetaEdge are the same (subject/object category, predicate, qualifier-set, source, KL/AT). This causes repetition and annoyance, up to combinatorial explosion, and makes the operations overwhelming to write and maintain.

Scope: this occurs in a significant fraction of operations for every kind of API

ALL core BioThings APIs: 11-57%
- MyDisease (8/26 operations): the disease <-> phenotype and disease <-> chemical ones
- MyGene (8/18 operations): pathway <-> gene (I think homolog-homolog can be done with 1 input/1 output namespace)
- MyChem (4/35 right now): chembl drug-mech
- MyVariant (16/28): Clinvar disease <-> gene/variant operations
Some pending BioThings APIs (7/30 APIs), where it can be a major problem:
- semmeddb: hundreds of operations may have both ncbigene and umls versions (after adding the ncbigene namespace, ~15% of operations (739/4675) were ncbigene)
- Multiomics Wellness: 108 operations due to combinatorial explosion with 7 chem namespaces, 2 clinical finding namespaces, 1 gene namespace
- Multiomics EHR risk: 148 operations due to combinatorial explosion w/ 3 disease namespaces, 3 pheno, 2 chem, 1 procedure
- AGR: combinatorial explosion with 7 gene namespaces
- TTD: combinatorial explosion with 2 chem namespaces, 2 disease namespaces, 2 gene/protein namespaces
- 2 others, less of a deal: pfocr, ncats rare-source
Some external APIs (3/8), where it currently causes problems that need bespoke post-processing solutions:
- Monarch: we handle this by setting a specific input/output namespace in a parameter, but if we didn't, we could retrieve more data
- CTD: haven't annotated some data because cannot handle multiple output ID namespaces from 1 field
- RaMP (although not written yet because of other stuff in the issue)

(came out of many discussions, some documented in #656)

colleenXu · 2023-10-22T23:40:42Z

Initial thoughts

This isn't as simple as listing all the ID namespaces:

- different request info (request body / parameters): happens for inputs (more often?) and outputs. I have examples for both below.
- different response-mapping: happens for inputs (input_name keyword) and outputs (more affected)

Stuff not included in this proposal, but I'm thinking over

A. The following proposal handles both "multiple input" and "multiple output" namespaces, but I wonder if handling just 1 (only "multiple outputs"?) is easier

B. I wonder if the "multiple output namespaces" can be handled in post-processing, so only 1 sub-query is needed to get the info for all output namespaces

C. One feature that may be nice for both inputs and outputs is a flag to toggle "multiple namespace prioritization" behavior. This is an issue with the BioThings TTD output namespaces, currently handled by using different request info (see example 4 below). But I don't know how feasible this is... (it may be more feasible if B is implemented)

- prioritize: follow the order in which the namespaces are listed in the operation. Example for outputs: for each response hit, look for each output namespace field in the hit, in the order of the namespaces in the operation. Once a namespace field is found, stop.
- no priority: all namespaces should be queried (input) / looked for in the response (output)
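
The two output behaviors in C can be sketched in Python as below. This is only an illustration under assumptions: the `prioritize` flag, the helper names, and the sample hit are hypothetical, and the namespace entries reuse the `prefix` / `id_field` keys from the proposal's examples.

```python
from typing import Any


def get_path(record: dict, dotted: str) -> Any:
    """Follow a dotted field path through nested dicts; None if a key is missing."""
    current: Any = record
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def extract_output_ids(hit: dict, namespaces: list, prioritize: bool) -> list:
    """Collect output CURIEs from one response hit.

    namespaces: ordered list of {"prefix": ..., "id_field": ...} entries.
    prioritize: hypothetical flag; if True, stop at the first namespace found.
    """
    curies = []
    for ns in namespaces:
        value = get_path(hit, ns["id_field"])
        if value is None:
            continue
        values = value if isinstance(value, list) else [value]
        curies.extend(f"{ns['prefix']}:{v}" for v in values)
        if prioritize:
            break  # first namespace present in the hit wins
    return curies


namespaces = [
    {"prefix": "MONDO", "id_field": "object.mondo"},
    {"prefix": "ICD11", "id_field": "object.icd11"},
]
hit = {"object": {"mondo": "0005148", "icd11": "5A11"}}  # made-up record

print(extract_output_ids(hit, namespaces, prioritize=True))
print(extract_output_ids(hit, namespaces, prioritize=False))
```

With prioritization on, the hit yields only its MONDO ID; with it off, both IDs are returned. This kind of post-processing is also what B would need, since a single sub-query would return hits carrying several namespace fields.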

First proposal

- Unnest operation contents: right now every x-bte-kgs-operation object is a one-element array, where that element is the main contents. I don't see a need for an array here. (fairly minor change?)
- inputs: make this an object, with the semantic (-> semanticType) field held separately from the id (-> namespaces) info. Also include the input_name info (-> inputs.namespaces.name_field).
- requestInfo: includes all sub-query-construction info (requestBody, requestBodyObject, parameters). Uses tags / structure to support the case where "multiple input namespaces have different sub-query info" OR where "multiple output namespaces have different sub-query info", but not both in the same operation.
- outputs: same as inputs, except it also includes the response field for the output ID (-> outputs.namespaces.id_field) and output_name (-> outputs.namespaces.name_field).
- response-mapping: only holds edge-attributes / trapi_sources handling, so it's now an optional field (sometimes we don't have that info).
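
As a rough sketch of how a consumer might pick the sub-query info out of the proposed requestInfo structure (Python, for illustration only): the function name `select_request_info` is hypothetical, and the keys (`differsByInputNamespace`, `differsByOutputNamespace`, `byInputNamespace`, `byOutputNamespace`) are the ones used in the examples below.

```python
REQUEST_KEYS = ("requestBody", "requestBodyObject", "parameters")


def select_request_info(request_info: dict, input_prefix: str, output_prefix: str) -> dict:
    """Return the sub-query info that applies to one input/output namespace pair."""
    by_input = request_info.get("differsByInputNamespace", False)
    by_output = request_info.get("differsByOutputNamespace", False)
    if by_input and by_output:
        # The proposal does not support both in the same operation.
        raise ValueError("requestInfo cannot differ by input AND output namespace")
    if by_input:
        return request_info["byInputNamespace"][input_prefix]
    if by_output:
        return request_info["byOutputNamespace"][output_prefix]
    # Shared case: the sub-query info sits directly under requestInfo.
    return {key: request_info[key] for key in REQUEST_KEYS if key in request_info}


request_info = {
    "differsByInputNamespace": True,
    "differsByOutputNamespace": False,
    "byInputNamespace": {
        "OMIM": {"requestBody": {"body": {"q": "{{ queryInputs }}", "scopes": "hpo.omim"}}},
        "ORPHANET": {"requestBody": {"body": {"q": "{{ queryInputs }}", "scopes": "hpo.orphanet"}}},
    },
}

selected = select_request_info(request_info, "ORPHANET", "HP")
print(selected["requestBody"]["body"]["scopes"])
```

Rejecting the "both flags true" case up front keeps the restriction from the requestInfo bullet explicit rather than silently picking one branch.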

Examples

1: multiple input namespaces that can be queried together (not different sub-query info)

has different input name_fields
doesn't have response-mapping
can query the ID namespaces together because field only uses ID (no prefix) and each ID namespace has unique patterns
- KEGG.PATHWAY:hsa00120 (looks like three lower-case letters, then numeric?)
- WIKIPATHWAYS:WP2034 (looks like WP then numeric?)
- BIOCARTA:raspathway (looks like all lowercase? The Bioregistry entry looks different)
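
To illustrate why unique patterns make a combined query safe, here is a small Python check. The regular expressions are guesses transcribed from the "looks like" notes above, not the official Bioregistry patterns, and the helper name is hypothetical.

```python
import re

# Guessed local-ID patterns (assumptions, not official registry patterns).
GUESSED_PATTERNS = {
    "KEGG.PATHWAY": re.compile(r"[a-z]{3}\d+"),  # three lowercase letters, then digits
    "WIKIPATHWAYS": re.compile(r"WP\d+"),        # "WP", then digits
    "BIOCARTA": re.compile(r"[a-z]+"),           # all lowercase letters
}

# Two namespaces whose local IDs are both purely numeric.
NUMERIC_PATTERNS = {
    "OMIM": re.compile(r"\d+"),
    "ORPHANET": re.compile(r"\d+"),
}


def matching_prefixes(local_id: str, patterns: dict) -> list:
    """Return every prefix whose pattern matches the whole local ID."""
    return [prefix for prefix, pattern in patterns.items() if pattern.fullmatch(local_id)]


for local_id in ["hsa00120", "WP2034", "raspathway"]:
    print(local_id, matching_prefixes(local_id, GUESSED_PATTERNS))

# A numeric ID matches both namespaces, so it is ambiguous without its prefix.
print("154700", matching_prefixes("154700", NUMERIC_PATTERNS))
```

Each pathway ID matches exactly one guessed pattern, so a prefix-less ID sent to all three scopes can only hit its own namespace. The numeric case is the situation in example 2, where the namespaces have to be queried separately.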

(Based on MyGene's PathwayHasGene2, PathwayHasGene3, PathwayHasGene4 operations. Not including PathwayHasGene1 because it has a different source)

  x-bte-kgs-operations:
    cpdb-PathwayHasGene:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: Pathway
        namespaces: 
          - prefix: "KEGG.PATHWAY"
            name_field: pathway.kegg.name
          - prefix: WIKIPATHWAYS
            name_field: pathway.wikipathways.name
          - prefix: BIOCARTA
            name_field: pathway.biocarta.name
      requestInfo:
        differsByInputNamespace: false
        differsByOutputNamespace: false
        requestBody:
          body:
            q: "{{ queryInputs }}"
            scopes: pathway.kegg.id,pathway.wikipathways.id,pathway.biocarta.id
        parameters:
          fields: entrezgene,pathway.kegg.name,pathway.wikipathways.name,pathway.biocarta.name
          species: human
          size: 1000 
      outputs:
        semanticType: Gene
        namespaces: 
          - prefix: NCBIGene
            id_field: entrezgene
      predicate: has_participant
      source: "infores:cpdb"
      knowledge_level: knowledge_assertion
      agent_type: manual_agent
      ## NO RESPONSE MAPPING: no edge-attributes

2: multiple input namespaces that are queried separately (diff sub-query info)

Must query separately because OMIM and ORPHANET IDs can be mistaken for each other: they have the same "pattern for local unique identifiers" (numeric)
no input name field or output name field info
still has response-mapping for edge-attributes
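
A minimal Python sketch of "queried separately", assuming the inputs arrive as CURIEs and are split by prefix before building one sub-query per namespace. The function name and the sample IDs are illustrative, and how `{{ queryInputs }}` is actually rendered into `q` is not modeled here (the IDs are simply kept as a list).

```python
from collections import defaultdict

# Scope per input namespace, as in the byInputNamespace entries of this example.
SCOPES = {"OMIM": "hpo.omim", "ORPHANET": "hpo.orphanet"}


def build_sub_queries(curies: list) -> list:
    """Group input CURIEs by prefix and emit one request body per namespace."""
    grouped = defaultdict(list)
    for curie in curies:
        prefix, local_id = curie.split(":", 1)
        grouped[prefix].append(local_id)
    return [
        {"q": ids, "scopes": SCOPES[prefix]}
        for prefix, ids in grouped.items()
        if prefix in SCOPES  # skip namespaces this operation does not list
    ]


sub_queries = build_sub_queries(["OMIM:154700", "ORPHANET:558", "OMIM:134797"])
for body in sub_queries:
    print(body)
```

Because each sub-query is restricted to a single scope, a numeric ID such as 558 can no longer match a record in the other namespace.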

(Based on MyDisease's disease-phenotype, disease-phenotype2)

  x-bte-kgs-operations:
    disease-phenotype:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: Disease
        namespaces: 
          - prefix: OMIM
          - prefix: ORPHANET
      requestInfo:
        differsByInputNamespace: true
        differsByOutputNamespace: false
        byInputNamespace:
          OMIM:
            requestBody:
              body:
                q: "{{ queryInputs }}"
                scopes: hpo.omim
            ## using $ref to make less repetitive
            parameters:
              "$ref": "#/components/x-bte-refs/disease-phenotype-parameters"
          ORPHANET:
            requestBody:
              body:
                q: "{{ queryInputs }}"
                scopes: hpo.orphanet
            parameters:
              "$ref": "#/components/x-bte-refs/disease-phenotype-parameters"
      outputs:
        semanticType: PhenotypicFeature
        namespaces: 
          - prefix: HP
            id_field: hpo.phenotype_related_to_disease.hpo_id
      predicate: has_phenotype
      source: "infores:hpo-annotations"
      knowledge_level: knowledge_assertion
      agent_type: manual_agent
      response_mapping:
        "$ref": "#/components/x-bte-response-mapping/disease-phenotype"
  x-bte-refs:
    disease-phenotype-parameters:
      fields: >-
        hpo.phenotype_related_to_disease.hpo_id,
        hpo.phenotype_related_to_disease.pmid_refs,
        hpo.phenotype_related_to_disease.isbn_refs,
        hpo.phenotype_related_to_disease.website_refs,
        hpo.phenotype_related_to_disease.numeric_freq,
        hpo.phenotype_related_to_disease.hp_freq,
        hpo.phenotype_related_to_disease.freq_numerator,
        hpo.phenotype_related_to_disease.freq_denominator
  x-bte-response-mapping:
    disease-phenotype:
      ref_pmid: hpo.phenotype_related_to_disease.pmid_refs
      ref_isbn: hpo.phenotype_related_to_disease.isbn_refs
      ref_url: hpo.phenotype_related_to_disease.website_refs
      "biolink:has_quotient": hpo.phenotype_related_to_disease.numeric_freq
      "biolink:frequency_qualifier": hpo.phenotype_related_to_disease.hp_freq
      "biolink:has_count": hpo.phenotype_related_to_disease.freq_numerator
      "biolink:has_total": hpo.phenotype_related_to_disease.freq_denominator

3: multiple output namespaces (not different sub-query info)

Reverse of example 1
name_field used for outputs
doesn't have response-mapping

(Based on MyGene's involvedInPathway2, involvedInPathway3, involvedInPathway4 operations. Not including involvedInPathway1 because it has a different source)

  x-bte-kgs-operations:
    cpdb-involvedInPathway:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: Gene
        namespaces:
          - prefix: NCBIGene
      requestInfo:
        differsByInputNamespace: false
        differsByOutputNamespace: false
        requestBody:
          body:
            q: "{{ queryInputs }}"
            scopes: entrezgene
        parameters:
          fields: >-
            pathway.kegg.id,pathway.kegg.name,
            pathway.wikipathways.id,pathway.wikipathways.name,
            pathway.biocarta.id,pathway.biocarta.name
          species: human
          size: 1000
      outputs:
        semanticType: Pathway
        namespaces: 
          - prefix: "KEGG.PATHWAY"
            id_field: pathway.kegg.id
            name_field: pathway.kegg.name
          - prefix: WIKIPATHWAYS
            id_field: pathway.wikipathways.id
            name_field: pathway.wikipathways.name
          - prefix: BIOCARTA
            id_field: pathway.biocarta.id
            name_field: pathway.biocarta.name
      predicate: participates_in
      source: "infores:cpdb"
      knowledge_level: knowledge_assertion
      agent_type: manual_agent
      ## NO RESPONSE MAPPING

4: multiple input namespaces AND output namespaces, different sub-query info for outputs

2 input namespaces: "PUBCHEM.COMPOUND" and "TTD.DRUG". They have different "patterns for local unique identifiers":
- PUBCHEM.COMPOUND:139600308 (numeric)
- TTD.DRUG:DZJ3D5 (has letters and numbers; the Bioregistry entry looks different...)
2 output namespaces: MONDO and ICD11
using filter so sub-query info won't differ by input namespace
but output namespaces do have different sub-query info (parameters)
doesn't have response-mapping

(Based on BioThings TTD operations: pubchem_treats_mondo, pubchem_treats_icd11, ttd_drug_id_treats_mondo, ttd_drug_id_treats_icd11)

  x-bte-kgs-operations:
    chemical-treats-disease:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: SmallMolecule
        namespaces: 
          - prefix: "PUBCHEM.COMPOUND"
            name_field: subject.name
          - prefix: "TTD.DRUG"
            name_field: subject.name
      requestInfo:
        differsByInputNamespace: false 
        differsByOutputNamespace: true
        byOutputNamespace:
          MONDO:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
            parameters:
              fields: object.mondo,object.name,subject.name
              filter: association.predicate:"biolink:treats"
              size: 1000
          ICD11:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestBody-chemTreatsDisease"
            parameters:
              fields: object.icd11,object.name,subject.name
              filter: association.predicate:"biolink:treats" AND (NOT _exists_:object.mondo)
              size: 1000
      outputs:
        semanticType: Disease
        namespaces: 
          - prefix: MONDO
            id_field: object.mondo
            name_field: object.name
          - prefix: ICD11
            id_field: object.icd11
            name_field: object.name
      predicate: treats
      source: "infores:ttd"
      knowledge_level: knowledge_assertion
      agent_type: manual_agent
      ## NO RESPONSE MAPPING
  x-bte-refs:
    requestBody-chemTreatsDisease:
      body:
        q: "{{ queryInputs }}"
        scopes: subject.pubchem_compound,subject.ttd_drug_id

5: multiple input namespaces AND output namespaces, different sub-query info for inputs

3 input ID namespaces: HP, NCIT, SNOMEDCT
- must query separately because HP and SNOMEDCT IDs can be mistaken for each other (both are numeric) -> different request bodies
using filter for cleaner text
3 output ID namespaces: MONDO, NCIT, SNOMEDCT
no source/KL/AT fields because that info is in the response-mapping

Based on 6 current Multiomics EHR Risk operations (since 3 combinations don't actually exist in the data...)

PhenoHP_increased_DiseaseMONDO
PhenoHP_increased_DiseaseNCIT
PhenoHP_increased_DiseaseSNOMEDCT
PhenoNCIT_increased_DiseaseMONDO
PhenoSNOMEDCT_increased_DiseaseMONDO
PhenoSNOMEDCT_increased_DiseaseSNOMEDCT
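
The "6 of 9 combinations" count can be checked with a few lines of Python. The operation names are the ones listed above; the parsing assumes the `Pheno<NS>_increased_Disease<NS>` naming seen in that list.

```python
from itertools import product

INPUT_NAMESPACES = ["HP", "NCIT", "SNOMEDCT"]
OUTPUT_NAMESPACES = ["MONDO", "NCIT", "SNOMEDCT"]

EXISTING_OPERATIONS = [
    "PhenoHP_increased_DiseaseMONDO",
    "PhenoHP_increased_DiseaseNCIT",
    "PhenoHP_increased_DiseaseSNOMEDCT",
    "PhenoNCIT_increased_DiseaseMONDO",
    "PhenoSNOMEDCT_increased_DiseaseMONDO",
    "PhenoSNOMEDCT_increased_DiseaseSNOMEDCT",
]


def parse_operation(name: str) -> tuple:
    """Split an operation name into its (input namespace, output namespace) pair."""
    subject, _, obj = name.split("_")
    return subject.removeprefix("Pheno"), obj.removeprefix("Disease")


present = {parse_operation(name) for name in EXISTING_OPERATIONS}
all_pairs = set(product(INPUT_NAMESPACES, OUTPUT_NAMESPACES))
missing = sorted(all_pairs - present)

print(len(all_pairs), "possible,", len(present), "existing")
print("not in the data:", missing)
```

The single merged operation below covers all 9 pairs on paper, and the 3 pairs that are absent from the data simply return no hits.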

  x-bte-kgs-operations:
    pheno-increased-disease:
      supportBatch: true
      useTemplating: true
      inputs:
        semanticType: PhenotypicFeature
        namespaces: 
          - prefix: HP
            name_field: subject.name
          - prefix: NCIT
            name_field: subject.name
          - prefix: SNOMEDCT
            name_field: subject.name
      requestInfo:
        differsByInputNamespace: true
        differsByOutputNamespace: false
        byInputNamespace:
          HP:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestInfo_HP"
            parameters:
              "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
          NCIT:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestInfo_NCIT"
            parameters:
              "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
          SNOMEDCT:
            requestBody:
              "$ref": "#/components/x-bte-refs/requestInfo_SNOMEDCT"
            parameters:
              "$ref": "#/components/x-bte-refs/params_phenoIncreasedDisease"
      outputs:
        semanticType: Disease
        namespaces: 
          - prefix: MONDO
            id_field: object.MONDO
            name_field: object.name
          - prefix: NCIT
            id_field: object.NCIT
            name_field: object.name
          - prefix: SNOMEDCT
            id_field: object.SNOMEDCT
            name_field: object.name
      predicate: associated_with
      qualifiers:
        object_direction_qualifier: increased
        object_aspect_qualifier: likelihood
      response_mapping:
        "$ref": "#/components/x-bte-response-mapping/edge-info"
  x-bte-response-mapping:
    edge-info:
      edge-attributes: association.edge_attributes
      trapi_sources: source.edge_sources
  x-bte-refs:
    requestInfo_HP:
      body:
        q: "{{ queryInputs | rmPrefix() }}"
        scopes: subject.HP
    requestInfo_NCIT:
      body:
        q: "{{ queryInputs | rmPrefix() }}"
        scopes: subject.NCIT
    requestInfo_SNOMEDCT:
      body:
        q: "{{ queryInputs | rmPrefix() }}"
        scopes: subject.SNOMEDCT
    params_phenoIncreasedDisease:
      fields: >-
        object.MONDO,object.NCIT,object.SNOMEDCT,
        association.edge_attributes,source.edge_sources,
        subject.name,object.name
      size: 1000
      filter: >-
        subject.type:"biolink:PhenotypicFeature" AND 
        association.predicate:associated_with_increased_likelihood_of AND
        object.type:"biolink:Disease"

rjawesome · 2023-12-05T01:15:39Z

I set up this proposal in the multiple-input-output branch using the smartapi-kg and api-respone-transform.js repositories

colleenXu added the enhancement, needs discussion, and x-bte labels on Oct 21, 2023

This was referenced Oct 23, 2023:
- summary: x-bte-refactoring related issues #750 (Open)
- x-bte refactoring: what is 1 x-bte operation (unit of annotation)? #752 (Closed)
- x-bte annotation refactoring discussion #656 (Open)

colleenXu mentioned this issue Feb 2, 2024:
- adjust SmartAPI yaml, x-bte annotation for Biolink/Monarch API migration #774 (Closed)

colleenXu assigned colleenXu and tokebe Dec 5, 2024
