Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edit FunctionalAnnotationAggMember to work with MetaP data #1253

Closed
2 tasks done
aclum opened this issue Oct 27, 2023 · 25 comments · Fixed by #2203
Closed
2 tasks done

Edit FunctionalAnnotationAggMember to work with MetaP data #1253

aclum opened this issue Oct 27, 2023 · 25 comments · Fixed by #2203
Assignees

Comments

@aclum
Copy link
Contributor

aclum commented Oct 27, 2023

We added a nmdc schema Class for the gene aggregation for the sequencing workflows, see https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/, but not for the metaproteomics. I believe we wanted all the collections in mongo to have a slot on class Database, in order to do that we need a Class for the metaproteomics gene function aggregation. The collection in prod mongo that current contains this data is 'metap_gene_function_aggregation'

related to
#1252

Based on that we need class with slots (based on how the collection is currently populated)

metaproteomic_analysis_id
gene_function_id (slot already exists -update domain_of for this slot)
count (slot already exists - update domain_of for this slot)
is_best_protein (need new slot)

all slots should be required

Then Class Database needs a slot 'metap_gene_function_aggregation' with a range of the new metap gene function aggregation Class

@SamuelPurvine @mslarae13 @pdpiehowski @picowatt please confirm requirements, in particular if the slots should be required or not.

cc @turbomam

  • update schema (Alicia)
  • write migration code for mongo prod (Michael)
@aclum
Copy link
Contributor Author

aclum commented Nov 7, 2023

This was discussed today at the metap meeting and folks agree on adding this as a Class to the schema with the slots as required. I talked with @cmungall about this last week in person and IGB and he was supportive as well.

@mslarae13
Copy link
Contributor

@aclum is this a "WorkflowExecution" subclass based on the new schema structure we're doing?

@SamuelPurvine
Copy link
Contributor

And, as mentioned over in issue 1309, the best_protein slot in this aggregation class should be renamed to something like has_best_protein or is_best_protein and be given range boolian false/true.

@aclum
Copy link
Contributor Author

aclum commented Nov 27, 2023

@mslarae13 no, this is an aggregation table from the output of the MetaproteomicsAnalysisActivity/
i am setting this up the same way that @turbomam created the FunctionalAnnotationAggMember class to aggregate the sequencing based annotation results.

@aclum
Copy link
Contributor Author

aclum commented Dec 6, 2023

@naglepuff or @marySalvi Could you comment on what fields from the metap_gene_function_aggregation collection the data portal ingest code uses? We have to change the name of best_protein since it is already being used in a different way within the schema.

@ssarrafan
Copy link
Collaborator

@naglepuff @marySalvi can you please respond to this issue?

Will move to the next sprint. @aclum let me know if it should be in the backlog.

@naglepuff
Copy link

We use:

gene_function_id
count
best_protein

@aclum
Copy link
Contributor Author

aclum commented Dec 16, 2023

Okay, we need to change the name of that so just a note that this will be a minor breaking change to the data portal. best_protein will be come is_best_protein @naglepuff

@aclum
Copy link
Contributor Author

aclum commented Dec 16, 2023

@mbthornton-lbl this is ready for the migration code to be written.

@aclum
Copy link
Contributor Author

aclum commented Dec 21, 2023

Migration code has been written. Ready for code review and to get merged in but I think Mark as signed off for the year so will move to 'in review' for the first sprint in January.

@turbomam
Copy link
Member

turbomam commented Jan 9, 2024

@aclum @lamccue @pdpiehowski @SamuelPurvine @scanon @GrantFujimoto @picowatt

see https://github.com/microbiomedata/nmdc_automation/blob/metap_agg/src/generate_metap_agg.py

I don't think a metaproteomics-based functional annotation aggregation record should have a slot called is_best_protein.

These aggregations are about Biosamples, via the annotation process that took the Biosample's data objects as input. They say "there are N annotations that say this Biosample has at least one gene with some function". The boolean value we are talking about indicates whether any of the individual annotations that have been aggregated are evidenced by a best protein accounting of an observed peptide.

Therefore I propose that the boolean slot should be called evidenced_by_best_protein

the description or a comment could be

Is this aggregated functional annotation on a Biosample based on functions that are associated with some protein that is the best (most parsimonious) accounting for a peptide, whose presence was inferred from some technology like mass spectrometry

@turbomam
Copy link
Member

turbomam commented Jan 9, 2024

This is all irrelevant if we decide that non-best-evidenced functions should not be stored or exposed in the data portal

@aclum aclum added the backlog Issue not assigned to a sprint or not completed during a sprint. Needs to be reprioritized. label Jan 10, 2024
@eecavanna
Copy link
Collaborator

eecavanna commented May 10, 2024

One implication of this class definition not being in the schema is that code that performs validation based upon the schema can't check data representing instances of this class (without "hard-coding" special validation code for this class).

Example: Referential integrity validation

For example, code that uses the schema to determine which fields of which Mongo documents might contain the ids of other documents, and which Mongo collections those other documents reside in, wouldn't know where to look for referenced documents that represent instances of the (hypothetical) MetaPGeneFunctionAggMember class.

Example: Document validation

As another example, code that validates whether a given Mongo document conforms to the schema would not be able to validate documents in the (hypothetical) metap_gene_function_agg_member_set collection.

The referential integrity validation-related implication came up recently as some team members have been developing referential integrity checkers.

@eecavanna
Copy link
Collaborator

I think this is going to be implemented after the Berkeley schema gets merged into the upstream schema.

@SamuelPurvine
Copy link
Contributor

@eecavanna that is correct. There's a fair amount of churn about this task, including

  • coming up with a different name for best_protein that better describes what this thing is/does
  • moving the slot we are currently naming best_protein off of PeptideQuantification and ProteinQuantification classes and onto something that will fit in the Berkeley schema, likely keeping the range as GeneProduct
  • adding a new slot that describes the method by which the unique peptide to protein association was determined
  • removing data that is associated with the PeptideQuantification class from the mongo db
  • removing data that is associated with the ProteinQuantification class from the mongo db
  • deleting PeptideQuantification and ProteinQuantification classes from the schema(s)
  • change the json that is submitted by metaproteomics analysis activity to only include what is currently termed best_protein
  • reconfigure how the metaproteomics aggregation table is structured and is built to take these changes into consideration

All of this would more or less obviate the re-id-ing effort work for metaproteomics (not currently performed as of this writing), and also break the aggregation table implementation, which 'just' got fixed to work with current data. Getting the metaproteomics re-id-ing done will run up against the Berkeley schema light freeze in a way that virtually guarantees aggregation of technical debt if we try and "make it happen".

@kheal
Copy link
Contributor

kheal commented Aug 21, 2024

@aclum and @SamuelPurvine. I've given this some thought, and below is my current proposal. I'd appreciate your input before I try to implement in the schema.

The proteomcis team is moving away from using any annotation of "best protein" and instead will be supplying a list of proteins annotated (at a high level of scrutiny) for each MetaproteomicsAnalysis (see #2157). For the resulting aggregation, I believe we will simply have a functional gene ID and a MetaproteomicsAnalysis ID; which can fit in the current structure of the FunctionalAnnotationAggMember class with minimal changes.

See proposal below:

Screenshot 2024-08-21 at 9 51 33 AM

@SamuelPurvine
Copy link
Contributor

Makes sense to me. Curious about whether adding values direxctly to mongoDB for gene_function_id would/will be possible when we start using non-NMDC-generated-metagenome dependent functional annotations? It certainly seems the structure you have here would make relaxing this dependency easier.

Just as an aside, current thinking for replacing best_protein is to use razor_protein with parsimony such that: 1) peptides that belong to one protein have that protein as its razor_protein; 2) peptides belonging to multiple proteins (i.e. non-unique), only one of which has a separate unique peptide, will associate with that protein as the razor_protein; 3) non-unique peptides belonging to proteins with no other unique peptide will be associated with one or more proteins having a maximal count of other peptides associated, with these being the razor_protein entries. We would then upload the entire list of the razor_protein entries for a given WorkflowExecution to allow access via API and aggregation as per @kheal's above suggestion. In the case where there isn't matching between the razor_protein and a GeneProduct from a metagenomics analysis, the WorkflowExecution would supply the list of 'gene_function_id' entries directly to the aggregation table, which is why I'm asking the above question.

I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here...

@kheal
Copy link
Contributor

kheal commented Aug 21, 2024

@SamuelPurvine

I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here...

I don't think I quite understand this part. What records are you worried about duplicating? The FunctionalAnnotationAggMember tables when we upload new MetaproteomicsAnalysis records? Those are unique combinations of WorkflowExecution:GeneProduct, so I don't think we need to worry about duplication from that sense.

Or are you worried about current mongo records being duplicated when we implement this? If so, @aclum and I have discussed removing all existing proteomics aggregation table records from mongo and regenerating with the chron job after all the schema and scripting for the aggregation tables have been updated.

@SamuelPurvine
Copy link
Contributor

I mean specifically that there is a separate storage mechanism for gene_function_id values, is there not? If the metagenome-free workflow comes up with a gene_function_id that is already in that data chunk, we wouldn't want it to be added, only things not yet entered via the metagenome route. Maybe I'm just thinking further ahead than I should be :)

@kheal
Copy link
Contributor

kheal commented Aug 21, 2024

I see. I think we will have to deal with that (either or possibly both in the metadata generation for the MetaproteomicsAnalysis or in the aggregation scripts). But yes, I think that's a 'later us' problem that will become more clear once we have example data that we need to contort into the correct shape. It's encouraging to hear that you think the proposed FunctionalAnnotationAggMember class could work with those data though.

@SamuelPurvine
Copy link
Contributor

Yes, this proposal generalizes the FunctionalAnnotationAggMember' class to allow for, checks notes, any WorkflowExectution` to make entries into the functional annotation space.

Is anyone thinking to add a description anywhere for the count slot. Would love to know explicitly what we are counting for this aggregation class...

@aclum
Copy link
Contributor Author

aclum commented Aug 22, 2024

This is sleek, i like it. WRT to duplicates I think we'd want to store both in the aggregation table, but then handle display with nmdc-server logic. It's potentially useful to know the venn diagram of shared gene_function_id between reference based and reference free metap workflows.

@SamuelPurvine
Copy link
Contributor

@aclum excellent point! might we think of adding a/the source for a given functional annotation to more easily enable this, or just use the workflow "version", however we are going to denote suchlike? I suppose you could build a query that walks up the chain to biosample, walks back down to a/the related metagenome, pulls an example GeneProduct name and compares it to the GeneProduct IDs from the WorkflowExecution in question... if there's more than one metagenome run then it's moar fun...

@aclum
Copy link
Contributor Author

aclum commented Aug 29, 2024

I'd assumed the reference free workflow would have a different workflow exeuction subclass name which would allow for easy aggregations.
Since the current results are only presence/absence no work has been done on deduplicating such that if reference based annotation runs twice it would only show the results for the latest version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment