API endpoint to link data objects with upstream collections #355

Open
aclum opened this issue Nov 3, 2023 · 7 comments
Assignees
Labels
enhancement New feature or request

Comments

@aclum
Contributor

aclum commented Nov 3, 2023

Running list of difficult Mongo queries

https://docs.google.com/spreadsheets/d/1a9cN9ZDyjVOp6NtHiaUlpP_92sMInZtQV-Q-L5iWcOk/edit?usp=sharing

Please add to or edit the Google Sheet.

While working on the Jupyter notebooks and fielding user requests, we found that we need endpoints that make it easier to combine a study or biosample filter with workflow execution activities and/or data objects.

See related issue #246

Example requests, from Jupyter notebook work or from users, related to linking data objects:

Example searches the API should support in order to match data portal queries:
- Return studies and biosamples from study X that have a processing institution of Y. Return the study based on {'id':'nmdc:styX'} and the biosamples based on {'part_of':'nmdc:styX'}, then trace through the PlannedProcess classes to determine which Biosamples, or ProcessedSamples derived from Biosamples, are the has_input values for OmicsProcessing records where {'processing_institution':'Y'}.
- Return studies and biosamples where the annotation results have a hit to 'KEGG.ORTHOLOGY:K00005'. Implementation sketch: search for 'KEGG.ORTHOLOGY:K00005' in the gene_function_id slot of functional_annotation_agg, get the metagenome_annotation_id, and trace back from that: WorkflowExecutionActivity -> OmicsProcessing -> PlannedProcess classes -> Biosample/ProcessedSample (loop until reaching a Biosample) -> Study.
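
The "loop until getting back to a Biosample" step in the second search can be sketched in memory as follows. This is only an illustration, not how the API implements or would implement it: the function name `traceToBiosamples` and the toy IDs are hypothetical, and it assumes every process record carries `has_input` and `has_output` arrays of IDs.

```javascript
// Hypothetical sketch: walk upstream from sample IDs until only Biosample IDs remain.
// `processes` stands in for documents from the PlannedProcess collections
// (pooling, extraction, library preparation, ...), each with has_input/has_output arrays.
function traceToBiosamples(startIds, processes, biosampleIds) {
  const found = new Set();
  const seen = new Set();
  let frontier = [...startIds];
  while (frontier.length > 0) {
    const next = [];
    for (const id of frontier) {
      if (seen.has(id)) continue; // guard against cycles and repeated work
      seen.add(id);
      if (biosampleIds.has(id)) {
        found.add(id); // reached a Biosample; stop following this branch
        continue;
      }
      // Any process that produced this ID contributes its inputs to the next round.
      for (const p of processes) {
        if (p.has_output.includes(id)) next.push(...p.has_input);
      }
    }
    frontier = next;
  }
  return [...found].sort();
}

// Toy data with made-up IDs: two biosamples are pooled, then the pool is extracted.
const processes = [
  { id: "pool-1", has_input: ["bsm-1", "bsm-2"], has_output: ["procsm-1"] },
  { id: "extr-1", has_input: ["procsm-1"], has_output: ["procsm-2"] },
];
const biosamples = new Set(["bsm-1", "bsm-2"]);
const result = traceToBiosamples(["procsm-2"], processes, biosamples);
console.log(result);
```

The `seen` set matters because pooling means several paths can converge on the same ProcessedSample; without it the same branch would be walked repeatedly.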

@shreddd to determine if this can be worked on in the next month in advance of the webinar with NEON.

cc @cmungall @brynnz22 @kheal

@aclum
Contributor Author

aclum commented Nov 7, 2023

@dwinston
Collaborator

After discussion with @PeopleMakeCulture, I'd like to generate a single derived collection that allows graph-like queries across the documents of all current collections in a more streamlined manner, i.e. with less complicated Mongo aggregation queries.
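
One possible shape for such a derived collection, sketched under assumptions: every linking slot is flattened into an edge document with `downstream` and `upstream` fields, so a single recursive walk replaces the per-collection lookups. The names `buildEdges`, `upstreamOf`, the `edges` collection, and the toy IDs are all hypothetical; the `$graphLookup` stage at the end is a real MongoDB aggregation stage, but it is only constructed here, not executed.

```javascript
// Hypothetical sketch: flatten linking slots into edge documents.
// For each slot, `true` means the referenced ID is upstream of the document itself.
const LINK_SLOTS = { has_input: true, has_output: false, was_informed_by: true, part_of: true };

function buildEdges(collections) {
  const edges = [];
  for (const docs of Object.values(collections)) {
    for (const doc of docs) {
      for (const [slot, targetIsUpstream] of Object.entries(LINK_SLOTS)) {
        if (doc[slot] === undefined) continue;
        // Slots may hold a single ID or an array of IDs; normalize to an array.
        const targets = Array.isArray(doc[slot]) ? doc[slot] : [doc[slot]];
        for (const target of targets) {
          edges.push(
            targetIsUpstream
              ? { downstream: doc.id, upstream: target, relation: slot }
              : { downstream: target, upstream: doc.id, relation: slot }
          );
        }
      }
    }
  }
  return edges;
}

// In-memory equivalent of a recursive upstream walk over the edge documents.
function upstreamOf(startId, edges) {
  const seen = new Set();
  const stack = [startId];
  while (stack.length > 0) {
    const current = stack.pop();
    for (const e of edges) {
      if (e.downstream === current && !seen.has(e.upstream)) {
        seen.add(e.upstream);
        stack.push(e.upstream);
      }
    }
  }
  return [...seen].sort();
}

// Toy documents with made-up IDs.
const collections = {
  pooling_set: [{ id: "pool-1", has_input: ["bsm-1"], has_output: ["procsm-1"] }],
  omics_processing_set: [{ id: "omprc-1", has_input: ["procsm-1"] }],
  metagenome_annotation_activity_set: [{ id: "wf-1", was_informed_by: "omprc-1" }],
};
const edges = buildEdges(collections);
const ancestors = upstreamOf("wf-1", edges);
console.log(ancestors);

// The same walk expressed as an aggregation over a hypothetical "edges" collection.
const graphPipeline = [
  { $match: { downstream: "wf-1" } },
  {
    $graphLookup: {
      from: "edges",
      startWith: "$upstream",
      connectFromField: "upstream",
      connectToField: "downstream",
      as: "ancestors",
    },
  },
];
```

Storing a direction per edge, rather than the raw slot name alone, is a design choice made for this sketch: it lets one traversal cover `has_input`, `has_output`, and `was_informed_by` without slot-specific logic at query time.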

@dwinston
Collaborator

@jeffbaumes the approach here may benefit from the work you did to move the data portal's Postgres tables to Mongo. If you have any ideas here, please note them.

@PeopleMakeCulture
Collaborator

sample query from @brynnz22: https://github.com/microbiomedata/notebook_hackathons/blob/soil-contig-tax/taxonomic_dist_by_soil_layer/python/mongodb_query.txt.js

db.getCollection("biosample_set").aggregate(
    [
        { $match: { 'soil_horizon': { '$in': ['O horizon', 'M horizon'] } } },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1
            }
        },
        {
            $lookup:
            {
                from: "pooling_set",
                localField: "id",
                foreignField: "has_input",
                as: "pooling_set"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1
            }
        },
        {
            $lookup:
            {
                from: "processed_sample_set",
                localField: "pooling_set.has_output",
                foreignField: "id",
                as: "processed_sample_set"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1,
                "processed_sample_set.id": 1
            }
        },
        {
            $lookup:
            {
                from: "extraction_set",
                localField: "processed_sample_set.id",
                foreignField: "has_input",
                as: "extraction_set"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1,
                "processed_sample_set.id": 1,
                "extraction_set.has_input": 1,
                "extraction_set.has_output": 1,
                "extraction_set.id": 1
            }
        },
        {
            $lookup:
            {
                from: "processed_sample_set",
                localField: "extraction_set.has_output",
                foreignField: "id",
                as: "processed_sample_set2"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1,
                "processed_sample_set.id": 1,
                "extraction_set.has_input": 1,
                "extraction_set.has_output": 1,
                "extraction_set.id": 1,
                "processed_sample_set2.id": 1
            }
        },
        {
            $lookup:
            {
                from: "library_preparation_set",
                localField: "processed_sample_set2.id",
                foreignField: "has_input",
                as: "library_preparation_set"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1,
                "processed_sample_set.id": 1,
                "extraction_set.has_input": 1,
                "extraction_set.has_output": 1,
                "extraction_set.id": 1,
                "processed_sample_set2.id": 1,
                "library_preparation_set.has_input": 1,
                "library_preparation_set.has_output": 1,
                "library_preparation_set.id": 1
            }
        },
        {
            $lookup:
            {
                from: "processed_sample_set",
                localField: "library_preparation_set.has_output",
                foreignField: "id",
                as: "processed_sample_set3"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1,
                "processed_sample_set.id": 1,
                "extraction_set.has_input": 1,
                "extraction_set.has_output": 1,
                "extraction_set.id": 1,
                "processed_sample_set2.id": 1,
                "library_preparation_set.has_input": 1,
                "library_preparation_set.has_output": 1,
                "library_preparation_set.id": 1,
                "processed_sample_set3.id": 1
            }
        },
        {
            $lookup:
            {
                from: "omics_processing_set",
                localField: "processed_sample_set3.id",
                foreignField: "has_input",
                as: "omics_processing_set"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1,
                "processed_sample_set.id": 1,
                "extraction_set.has_input": 1,
                "extraction_set.has_output": 1,
                "extraction_set.id": 1,
                "processed_sample_set2.id": 1,
                "library_preparation_set.has_input": 1,
                "library_preparation_set.has_output": 1,
                "library_preparation_set.id": 1,
                "processed_sample_set3.id": 1,
                "omics_processing_set.has_input": 1,
                "omics_processing_set.id": 1
            }
        },
        {
            $lookup:
            {
                from: "metagenome_annotation_activity_set",
                localField: "omics_processing_set.id",
                foreignField: "was_informed_by",
                as: "metagenome_annotation_activity_set"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1,
                "processed_sample_set.id": 1,
                "extraction_set.has_input": 1,
                "extraction_set.has_output": 1,
                "extraction_set.id": 1,
                "processed_sample_set2.id": 1,
                "library_preparation_set.has_input": 1,
                "library_preparation_set.has_output": 1,
                "library_preparation_set.id": 1,
                "processed_sample_set3.id": 1,
                "omics_processing_set.has_input": 1,
                "omics_processing_set.id": 1,
                "metagenome_annotation_activity_set.was_informed_by": 1,
                "metagenome_annotation_activity_set.has_output": 1
            }
        },
        {
            $lookup:
            {
                from: "data_object_set",
                localField: "metagenome_annotation_activity_set.has_output",
                foreignField: "id",
                as: "data_object_set"
            }
        },
        {
            $project: {
                "id": 1,
                "soil_horizon": 1,
                "pooling_set.has_input": 1,
                "pooling_set.has_output": 1,
                "processed_sample_set.id": 1,
                "extraction_set.has_input": 1,
                "extraction_set.has_output": 1,
                "extraction_set.id": 1,
                "processed_sample_set2.id": 1,
                "library_preparation_set.has_input": 1,
                "library_preparation_set.has_output": 1,
                "library_preparation_set.id": 1,
                "processed_sample_set3.id": 1,
                "omics_processing_set.has_input": 1,
                "omics_processing_set.id": 1,
                "metagenome_annotation_activity_set.was_informed_by": 1,
                "metagenome_annotation_activity_set.has_output": 1,
                "data_object_set.id": 1,
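                // A string literal in $project overwrites the field rather than filtering on it;
                // project the field here and filter for "Scaffold Lineage tsv" after this stage.
                "data_object_set.data_object_type": 1,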
                "data_object_set.url": 1
            }
        }
    ]
)
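
The query above repeats one pattern per hop: a `$lookup` followed by a `$project` that re-lists every field kept so far. A minimal sketch of generating those stages from a list of hops is shown below. The helper names `buildLookupChain` and `keepDataObjectType` are hypothetical, and the `$filter` stage reflects an assumption about the intent of the `data_object_type` entry, namely to keep only data objects of one type.

```javascript
// Hypothetical helper: build the repeated $lookup/$project pairs from a list of hops.
function buildLookupChain(baseFields, hops) {
  const projection = {};
  for (const f of baseFields) projection[f] = 1;
  const stages = [{ $project: { ...projection } }];
  for (const hop of hops) {
    stages.push({
      $lookup: {
        from: hop.from,
        localField: hop.localField,
        foreignField: hop.foreignField,
        as: hop.as,
      },
    });
    // Accumulate the fields to keep, so each $project lists everything seen so far.
    for (const f of hop.keep) projection[`${hop.as}.${f}`] = 1;
    stages.push({ $project: { ...projection } });
  }
  return stages;
}

// Hypothetical helper: keep only joined data objects of one type.
function keepDataObjectType(type) {
  return {
    $addFields: {
      data_object_set: {
        $filter: {
          input: "$data_object_set",
          as: "d",
          cond: { $eq: ["$$d.data_object_type", type] },
        },
      },
    },
  };
}

// The first two hops of the query above, written as data.
const hops = [
  { from: "pooling_set", localField: "id", foreignField: "has_input",
    as: "pooling_set", keep: ["has_input", "has_output"] },
  { from: "processed_sample_set", localField: "pooling_set.has_output", foreignField: "id",
    as: "processed_sample_set", keep: ["id"] },
];
const pipeline = [
  { $match: { soil_horizon: { $in: ["O horizon", "M horizon"] } } },
  ...buildLookupChain(["id", "soil_horizon"], hops),
];
console.log(pipeline.length);
```

The resulting array could be passed to `db.getCollection("biosample_set").aggregate(pipeline)` in the mongo shell; the remaining hops would be added to `hops` in the same form.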

@aclum
Contributor Author

aclum commented Jan 16, 2024

See the work done by @brynnz22 and @kheal in this notebook to connect studies to taxonomic information: https://github.com/microbiomedata/notebook_hackathons/tree/main/taxonomic_dist_by_soil_layer
cc @cmungall @shreddd

@PeopleMakeCulture PeopleMakeCulture moved this from Lineup to At bat in Polyneme mixset Jan 20, 2024
@PeopleMakeCulture PeopleMakeCulture moved this from At bat to Lineup in Polyneme mixset Jan 20, 2024
@dwinston dwinston moved this from Lineup to At bat in Polyneme mixset Feb 9, 2024
@PeopleMakeCulture PeopleMakeCulture moved this from At bat to On base in Polyneme mixset Feb 13, 2024
@PeopleMakeCulture PeopleMakeCulture moved this from On base to At bat in Polyneme mixset Feb 15, 2024
@PeopleMakeCulture
Collaborator

Duplicates #401

@github-project-automation github-project-automation bot moved this from At bat to Scored in Polyneme mixset Feb 15, 2024
@aclum aclum reopened this Aug 28, 2024
@aclum
Contributor Author

aclum commented Aug 28, 2024

The needs here have not been addressed. To address them, we need:

  1. filtering options besides just providing a study
  2. the ability to return more complete intermediate records

cc @shreddd
