Edit `FunctionalAnnotationAggMember` to work with MetaP data #1253

aclum · 2023-10-27T23:43:38Z

We added a nmdc schema Class for the gene aggregation for the sequencing workflows, see https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/, but not for the metaproteomics. I believe we wanted all the collections in mongo to have a slot on class Database, in order to do that we need a Class for the metaproteomics gene function aggregation. The collection in prod mongo that current contains this data is 'metap_gene_function_aggregation'

related to
#1252

Based on that we need class with slots (based on how the collection is currently populated)

metaproteomic_analysis_id
gene_function_id (slot already exists -update domain_of for this slot)
count (slot already exists - update domain_of for this slot)
is_best_protein (need new slot)

all slots should be required

Then Class Database needs a slot 'metap_gene_function_aggregation' with a range of the new metap gene function aggregation Class

@SamuelPurvine @mslarae13 @pdpiehowski @picowatt please confirm requirements, in particular if the slots should be required or not.

cc @turbomam

update schema (Alicia)
write migration code for mongo prod (Michael)

aclum · 2023-11-07T17:56:24Z

This was discussed today at the metap meeting and folks agree on adding this as a Class to the schema with the slots as required. I talked with @cmungall about this last week in person and IGB and he was supportive as well.

mslarae13 · 2023-11-22T19:58:36Z

@aclum is this a "WorkflowExecution" subclass based on the new schema structure we're doing?

SamuelPurvine · 2023-11-22T20:21:32Z

And, as mentioned over in issue 1309, the best_protein slot in this aggregation class should be renamed to something like has_best_protein or is_best_protein and be given range boolian false/true.

aclum · 2023-11-27T18:11:47Z

@mslarae13 no, this is an aggregation table from the output of the MetaproteomicsAnalysisActivity/
i am setting this up the same way that @turbomam created the FunctionalAnnotationAggMember class to aggregate the sequencing based annotation results.

aclum · 2023-12-06T23:00:37Z

@naglepuff or @marySalvi Could you comment on what fields from the metap_gene_function_aggregation collection the data portal ingest code uses? We have to change the name of best_protein since it is already being used in a different way within the schema.

ssarrafan · 2023-12-15T21:44:55Z

@naglepuff @marySalvi can you please respond to this issue?

Will move to the next sprint. @aclum let me know if it should be in the backlog.

naglepuff · 2023-12-15T22:13:43Z

We use:

gene_function_id
count
best_protein

aclum · 2023-12-16T00:39:05Z

Okay, we need to change the name of that so just a note that this will be a minor breaking change to the data portal. best_protein will be come is_best_protein @naglepuff

aclum · 2023-12-16T00:42:21Z

@mbthornton-lbl this is ready for the migration code to be written.

aclum · 2023-12-21T23:48:43Z

Migration code has been written. Ready for code review and to get merged in but I think Mark as signed off for the year so will move to 'in review' for the first sprint in January.

turbomam · 2024-01-09T19:31:00Z

@aclum @lamccue @pdpiehowski @SamuelPurvine @scanon @GrantFujimoto @picowatt

see https://github.com/microbiomedata/nmdc_automation/blob/metap_agg/src/generate_metap_agg.py

I don't think a metaproteomics-based functional annotation aggregation record should have a slot called is_best_protein.

These aggregations are about Biosamples, via the annotation process that took the Biosample's data objects as input. They say "there are N annotations that say this Biosample has at least one gene with some function". The boolean value we are talking about indicates whether any of the individual annotations that have been aggregated are evidenced by a best protein accounting of an observed peptide.

Therefore I propose that the boolean slot should be called evidenced_by_best_protein

the description or a comment could be

Is this aggregated functional annotation on a Biosample based on functions that are associated with some protein that is the best (most parsimonious) accounting for a peptide, whose presence was inferred from some technology like mass spectrometry

turbomam · 2024-01-09T19:47:37Z

This is all irrelevant if we decide that non-best-evidenced functions should not be stored or exposed in the data portal

eecavanna · 2024-05-10T17:07:27Z

One implication of this class definition not being in the schema is that code that performs validation based upon the schema can't check data representing instances of this class (without "hard-coding" special validation code for this class).

Example: Referential integrity validation

For example, code that uses the schema to determine which fields of which Mongo documents might contain the ids of other documents, and which Mongo collections those other documents reside in, wouldn't know where to look for referenced documents that represent instances of the (hypothetical) MetaPGeneFunctionAggMember class.

Example: Document validation

As another example, code that validates whether a given Mongo document conforms to the schema would not be able to validate documents in the (hypothetical) metap_gene_function_agg_member_set collection.

The referential integrity validation-related implication came up recently as some team members have been developing referential integrity checkers.

eecavanna · 2024-05-22T21:05:34Z

I think this is going to be implemented after the Berkeley schema gets merged into the upstream schema.

SamuelPurvine · 2024-05-22T21:36:30Z

@eecavanna that is correct. There's a fair amount of churn about this task, including

coming up with a different name for best_protein that better describes what this thing is/does
moving the slot we are currently naming best_protein off of PeptideQuantification and ProteinQuantification classes and onto something that will fit in the Berkeley schema, likely keeping the range as GeneProduct
adding a new slot that describes the method by which the unique peptide to protein association was determined
removing data that is associated with the PeptideQuantification class from the mongo db
removing data that is associated with the ProteinQuantification class from the mongo db
deleting PeptideQuantification and ProteinQuantification classes from the schema(s)
change the json that is submitted by metaproteomics analysis activity to only include what is currently termed best_protein
reconfigure how the metaproteomics aggregation table is structured and is built to take these changes into consideration

All of this would more or less obviate the re-id-ing effort work for metaproteomics (not currently performed as of this writing), and also break the aggregation table implementation, which 'just' got fixed to work with current data. Getting the metaproteomics re-id-ing done will run up against the Berkeley schema light freeze in a way that virtually guarantees aggregation of technical debt if we try and "make it happen".

kheal · 2024-08-21T16:53:39Z

@aclum and @SamuelPurvine. I've given this some thought, and below is my current proposal. I'd appreciate your input before I try to implement in the schema.

The proteomcis team is moving away from using any annotation of "best protein" and instead will be supplying a list of proteins annotated (at a high level of scrutiny) for each MetaproteomicsAnalysis (see #2157). For the resulting aggregation, I believe we will simply have a functional gene ID and a MetaproteomicsAnalysis ID; which can fit in the current structure of the FunctionalAnnotationAggMember class with minimal changes.

See proposal below:

SamuelPurvine · 2024-08-21T18:11:40Z

Makes sense to me. Curious about whether adding values direxctly to mongoDB for gene_function_id would/will be possible when we start using non-NMDC-generated-metagenome dependent functional annotations? It certainly seems the structure you have here would make relaxing this dependency easier.

Just as an aside, current thinking for replacing best_protein is to use razor_protein with parsimony such that: 1) peptides that belong to one protein have that protein as its razor_protein; 2) peptides belonging to multiple proteins (i.e. non-unique), only one of which has a separate unique peptide, will associate with that protein as the razor_protein; 3) non-unique peptides belonging to proteins with no other unique peptide will be associated with one or more proteins having a maximal count of other peptides associated, with these being the razor_protein entries. We would then upload the entire list of the razor_protein entries for a given WorkflowExecution to allow access via API and aggregation as per @kheal's above suggestion. In the case where there isn't matching between the razor_protein and a GeneProduct from a metagenomics analysis, the WorkflowExecution would supply the list of 'gene_function_id' entries directly to the aggregation table, which is why I'm asking the above question.

I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here...

kheal · 2024-08-21T18:22:50Z

@SamuelPurvine

I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here...

I don't think I quite understand this part. What records are you worried about duplicating? The FunctionalAnnotationAggMember tables when we upload new MetaproteomicsAnalysis records? Those are unique combinations of WorkflowExecution:GeneProduct, so I don't think we need to worry about duplication from that sense.

Or are you worried about current mongo records being duplicated when we implement this? If so, @aclum and I have discussed removing all existing proteomics aggregation table records from mongo and regenerating with the chron job after all the schema and scripting for the aggregation tables have been updated.

SamuelPurvine · 2024-08-21T18:27:30Z

I mean specifically that there is a separate storage mechanism for gene_function_id values, is there not? If the metagenome-free workflow comes up with a gene_function_id that is already in that data chunk, we wouldn't want it to be added, only things not yet entered via the metagenome route. Maybe I'm just thinking further ahead than I should be :)

kheal · 2024-08-21T18:33:14Z

I see. I think we will have to deal with that (either or possibly both in the metadata generation for the MetaproteomicsAnalysis or in the aggregation scripts). But yes, I think that's a 'later us' problem that will become more clear once we have example data that we need to contort into the correct shape. It's encouraging to hear that you think the proposed FunctionalAnnotationAggMember class could work with those data though.

SamuelPurvine · 2024-08-21T18:44:19Z

Yes, this proposal generalizes the FunctionalAnnotationAggMember' class to allow for, checks notes, any WorkflowExectution` to make entries into the functional annotation space.

Is anyone thinking to add a description anywhere for the count slot. Would love to know explicitly what we are counting for this aggregation class...

aclum · 2024-08-22T00:58:30Z

This is sleek, i like it. WRT to duplicates I think we'd want to store both in the aggregation table, but then handle display with nmdc-server logic. It's potentially useful to know the venn diagram of shared gene_function_id between reference based and reference free metap workflows.

SamuelPurvine · 2024-08-22T15:47:31Z

@aclum excellent point! might we think of adding a/the source for a given functional annotation to more easily enable this, or just use the workflow "version", however we are going to denote suchlike? I suppose you could build a query that walks up the chain to biosample, walks back down to a/the related metagenome, pulls an example GeneProduct name and compares it to the GeneProduct IDs from the WorkflowExecution in question... if there's more than one metagenome run then it's moar fun...

aclum · 2024-08-29T18:00:35Z

I'd assumed the reference free workflow would have a different workflow exeuction subclass name which would allow for easy aggregations.
Since the current results are only presence/absence no work has been done on deduplicating such that if reference based annotation runs twice it would only show the results for the latest version.

aclum mentioned this issue Nov 8, 2023

Make FunctionalAnnotationAggMember slots required #1309

Merged

aclum linked a pull request Nov 8, 2023 that will close this issue

Make FunctionalAnnotationAggMember slots required #1309

Merged

aclum added this to 2023 Squad Sprint 24: November 20 - December 1, 2023 Nov 21, 2023

aclum moved this to In Progress in 2023 Squad Sprint 24: November 20 - December 1, 2023 Nov 27, 2023

aclum assigned aclum and mbthornton-lbl Nov 28, 2023

mbthornton-lbl removed this from 2023 Squad Sprint 24: November 20 - December 1, 2023 Dec 1, 2023

ssarrafan added this to 2023 Squad Sprint 25: December 4 - December 15, 2023 Dec 1, 2023

ssarrafan moved this to In Progress in 2023 Squad Sprint 25: December 4 - December 15, 2023 Dec 1, 2023

aclum mentioned this issue Dec 5, 2023

make slots for Class FunctionalAnnotationAggMember required #1252

Closed

aclum mentioned this issue Dec 6, 2023

update code that generates metap_gene_function_aggregation microbiomedata/nmdc_automation#32

Closed

ssarrafan removed this from 2023 Squad Sprint 25: December 4 - December 15, 2023 Dec 15, 2023

ssarrafan added this to 2023 Squad Sprint 26: December 18 - December 29, 2023 Dec 15, 2023

ssarrafan moved this to In Progress in 2023 Squad Sprint 26: December 18 - December 29, 2023 Dec 15, 2023

mbthornton-lbl moved this from In Progress to In Review in 2023 Squad Sprint 26: December 18 - December 29, 2023 Dec 20, 2023

aclum removed this from 2023 Squad Sprint 26: December 18 - December 29, 2023 Dec 21, 2023

aclum added this to 2024 - Sprint 27 - January 2-January 12, 2024 Dec 21, 2023

aclum moved this to In Review in 2024 - Sprint 27 - January 2-January 12, 2024 Dec 21, 2023

aclum unassigned aclum and mbthornton-lbl Jan 10, 2024

aclum added the backlog Issue not assigned to a sprint or not completed during a sprint. Needs to be reprioritized. label Jan 10, 2024

aclum removed this from 2024 - Sprint 27 - January 2-January 12, 2024 Jan 10, 2024

aclum assigned kheal Aug 14, 2024

kheal mentioned this issue Aug 20, 2024

[POST-berkeley-MERGER] Refactor MetaproteomicsAnalysis slots and implement migrator microbiomedata/berkeley-schema-fy24#236

Closed

7 tasks

kheal mentioned this issue Aug 28, 2024

[POST-berkeley-MERGER] Update FunctionalAnnotationAggMember class for compatibility with MetaP Aggregation tables microbiomedata/berkeley-schema-fy24#242

Closed

5 tasks

aclum mentioned this issue Sep 30, 2024

Is it too late to make metap_gene_function_aggregation's name more similar to functional_annotation_agg #1461

Closed

kheal changed the title ~~create class for metap gene function aggregation~~ Edit FunctionalAnnotationAggMember to work with MetaP data Oct 2, 2024

kheal closed this as completed in #2203 Nov 8, 2024

ssarrafan added this to 2024 - Sprint 49 - November 4 - 15, 2024 Nov 8, 2024

ssarrafan moved this to Done in 2024 - Sprint 49 - November 4 - 15, 2024 Nov 8, 2024

ssarrafan removed the backlog Issue not assigned to a sprint or not completed during a sprint. Needs to be reprioritized. label Nov 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Edit `FunctionalAnnotationAggMember` to work with MetaP data #1253

Edit `FunctionalAnnotationAggMember` to work with MetaP data #1253

aclum commented Oct 27, 2023 •

edited

Loading

aclum commented Nov 7, 2023

mslarae13 commented Nov 22, 2023

SamuelPurvine commented Nov 22, 2023

aclum commented Nov 27, 2023

aclum commented Dec 6, 2023

ssarrafan commented Dec 15, 2023

naglepuff commented Dec 15, 2023

aclum commented Dec 16, 2023

aclum commented Dec 16, 2023

aclum commented Dec 21, 2023 •

edited

Loading

turbomam commented Jan 9, 2024 •

edited

Loading

turbomam commented Jan 9, 2024

eecavanna commented May 10, 2024 •

edited

Loading

eecavanna commented May 22, 2024

SamuelPurvine commented May 22, 2024

kheal commented Aug 21, 2024

SamuelPurvine commented Aug 21, 2024

kheal commented Aug 21, 2024 •

edited

Loading

SamuelPurvine commented Aug 21, 2024

kheal commented Aug 21, 2024

SamuelPurvine commented Aug 21, 2024

aclum commented Aug 22, 2024

SamuelPurvine commented Aug 22, 2024

aclum commented Aug 29, 2024

Edit FunctionalAnnotationAggMember to work with MetaP data #1253

Edit FunctionalAnnotationAggMember to work with MetaP data #1253

Comments

aclum commented Oct 27, 2023 • edited Loading

aclum commented Nov 7, 2023

mslarae13 commented Nov 22, 2023

SamuelPurvine commented Nov 22, 2023

aclum commented Nov 27, 2023

aclum commented Dec 6, 2023

ssarrafan commented Dec 15, 2023

naglepuff commented Dec 15, 2023

aclum commented Dec 16, 2023

aclum commented Dec 16, 2023

aclum commented Dec 21, 2023 • edited Loading

turbomam commented Jan 9, 2024 • edited Loading

turbomam commented Jan 9, 2024

eecavanna commented May 10, 2024 • edited Loading

Example: Referential integrity validation

Example: Document validation

eecavanna commented May 22, 2024

SamuelPurvine commented May 22, 2024

kheal commented Aug 21, 2024

SamuelPurvine commented Aug 21, 2024

kheal commented Aug 21, 2024 • edited Loading

SamuelPurvine commented Aug 21, 2024

kheal commented Aug 21, 2024

SamuelPurvine commented Aug 21, 2024

aclum commented Aug 22, 2024

SamuelPurvine commented Aug 22, 2024

aclum commented Aug 29, 2024

Edit `FunctionalAnnotationAggMember` to work with MetaP data #1253

Edit `FunctionalAnnotationAggMember` to work with MetaP data #1253

aclum commented Oct 27, 2023 •

edited

Loading

aclum commented Dec 21, 2023 •

edited

Loading

turbomam commented Jan 9, 2024 •

edited

Loading

eecavanna commented May 10, 2024 •

edited

Loading

kheal commented Aug 21, 2024 •

edited

Loading