Edit FunctionalAnnotationAggMember
to work with MetaP data
#1253
Comments
This was discussed today at the metap meeting, and folks agreed on adding this as a class to the schema, with the slots as required. I talked with @cmungall about this in person at IGB last week, and he was supportive as well. |
@aclum is this a "WorkflowExecution" subclass based on the new schema structure we're doing? |
And, as mentioned over in issue 1309, the best_protein slot in this aggregation class should be renamed to something like has_best_protein or is_best_protein and be given a boolean range (true/false). |
@mslarae13 no, this is an aggregation table from the output of the MetaproteomicsAnalysisActivity/ |
@naglepuff or @marySalvi Could you comment on what fields from the metap_gene_function_aggregation collection the data portal ingest code uses? We have to change the name of best_protein since it is already being used in a different way within the schema. |
@naglepuff @marySalvi can you please respond to this issue? Will move to the next sprint. @aclum let me know if it should be in the backlog. |
We use:
|
Okay, we need to change that name, so note that this will be a minor breaking change to the data portal: best_protein will become is_best_protein. @naglepuff |
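For illustration, the rename from best_protein to is_best_protein could be applied to existing aggregation documents along these lines. This is only a minimal sketch over plain Python dicts, not the actual migration code; the function name `rename_best_protein` and the sample values are hypothetical.

```python
from typing import Any


def rename_best_protein(doc: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `doc` with best_protein renamed to is_best_protein.

    Hypothetical helper: the value is carried over unchanged, and
    documents without the old field are returned as an unmodified copy.
    """
    migrated = dict(doc)  # shallow copy, so the caller's dict is not mutated
    if "best_protein" in migrated:
        migrated["is_best_protein"] = migrated.pop("best_protein")
    return migrated


# Made-up example document using the field names discussed in this issue.
old_doc = {
    "metaproteomic_analysis_id": "example-analysis-id",
    "gene_function_id": "example-gene-function-id",
    "count": 3,
    "best_protein": True,
}
new_doc = rename_best_protein(old_doc)
```

Any consumer that still reads best_protein (such as the data portal ingest code mentioned above) would need to switch to the new name at the same time.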
@mbthornton-lbl this is ready for the migration code to be written. |
Migration code has been written. It is ready for code review and merging, but I think Mark has signed off for the year, so I will move this to 'in review' for the first sprint in January. |
@aclum @lamccue @pdpiehowski @SamuelPurvine @scanon @GrantFujimoto @picowatt see https://github.com/microbiomedata/nmdc_automation/blob/metap_agg/src/generate_metap_agg.py I don't think a metaproteomics-based functional annotation aggregation record should have a slot called These aggregations are about Therefore I propose that the boolean slot should be called the description or a comment could be
|
This is all irrelevant if we decide that non-best-evidenced functions should not be stored or exposed in the data portal |
One implication of this class definition not being in the schema is that code that performs validation based upon the schema can't check data representing instances of this class (without "hard-coding" special validation code for this class).

Example: Referential integrity validation
For example, code that uses the schema to determine which fields of which Mongo documents might contain the

Example: Document validation
As another example, code that validates whether a given Mongo document conforms to the schema would not be able to validate documents in the (hypothetical)

The referential integrity validation-related implication came up recently as some team members have been developing referential integrity checkers. |
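To make the implication above concrete, here is a small sketch of a schema-driven validator that simply cannot check a collection the schema does not describe. The `TOY_SCHEMA` mapping, the `described_collection` name, and the `missing_required_slots` function are all hypothetical stand-ins; the real schema and validation tooling are far richer than this.

```python
from typing import Any, Optional

# Toy stand-in for a schema: collection name -> required slot names.
# Both the collection name and its slots are made up for illustration.
TOY_SCHEMA: dict[str, set[str]] = {
    "described_collection": {"gene_function_id", "count"},
}


def missing_required_slots(
    collection: str, doc: dict[str, Any], schema: dict[str, set[str]]
) -> Optional[list[str]]:
    """Return the sorted missing required slots for `doc`.

    Returns None when the schema has no definition for the collection,
    meaning the document cannot be validated from the schema at all.
    """
    required = schema.get(collection)
    if required is None:
        return None  # no class in the schema, so nothing to check against
    return sorted(name for name in required if name not in doc)


# A collection with a definition can be checked...
checked = missing_required_slots(
    "described_collection", {"gene_function_id": "g"}, TOY_SCHEMA
)
# ...but one without a definition is skipped unless special-cased.
unchecked = missing_required_slots(
    "metap_gene_function_aggregation", {"count": 1}, TOY_SCHEMA
)
```

In this sketch, the only way to validate the undescribed collection is to add a special case outside the schema, which is the "hard-coding" the comment warns about.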
I think this is going to be implemented after the Berkeley schema gets merged into the upstream schema. |
@eecavanna that is correct. There's a fair amount of churn about this task, including
All of this would more or less obviate the re-id-ing effort for metaproteomics (not performed as of this writing) and also break the aggregation table implementation, which 'just' got fixed to work with current data. Getting the metaproteomics re-id-ing done will run up against the Berkeley schema light freeze in a way that virtually guarantees an accumulation of technical debt if we try to "make it happen". |
@aclum and @SamuelPurvine, I've given this some thought, and below is my current proposal. I'd appreciate your input before I try to implement it in the schema. The proteomics team is moving away from using any annotation of "best protein" and instead will be supplying a list of proteins annotated (at a high level of scrutiny) for each See proposal below: |
Makes sense to me. Curious about whether adding values directly to MongoDB for Just as an aside, current thinking for replacing I do think this would require the 'WorkflowExecution' to be "aware" of currently extant 'gene_function_id' entries to avoid duplication... Or try and make the import smart enough to ignore attempts at adding duplicates? Sorry for the free thought blathering here... |
I don't think I quite understand this part. What records are you worried about duplicating? The Or are you worried about current mongo records being duplicated when we implement this? If so, @aclum and I have discussed removing all existing proteomics aggregation table records from mongo and regenerating with the cron job after all the schema and scripting for the aggregation tables have been updated. |
I mean specifically that there is a separate storage mechanism for gene_function_id values, is there not? If the metagenome-free workflow comes up with a gene_function_id that is already in that data chunk, we wouldn't want it to be added, only things not yet entered via the metagenome route. Maybe I'm just thinking further ahead than I should be :) |
I see. I think we will have to deal with that (either or possibly both in the metadata generation for the |
Yes, this proposal generalizes the Is anyone thinking to add a description anywhere for the |
This is sleek, I like it. WRT duplicates, I think we'd want to store both in the aggregation table, but then handle display with nmdc-server logic. It's potentially useful to know the Venn diagram of shared gene_function_id values between reference-based and reference-free metap workflows. |
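If both sets of aggregation rows are stored, the overlap mentioned above is a straightforward set computation. The sketch below assumes the gene_function_id values for each workflow type have already been collected into iterables; the function name `compare_gene_functions` and the sample IDs are hypothetical.

```python
from collections.abc import Iterable


def compare_gene_functions(
    reference_based: Iterable[str], reference_free: Iterable[str]
) -> dict[str, set[str]]:
    """Split gene_function_id values into the three Venn diagram regions."""
    based, free = set(reference_based), set(reference_free)
    return {
        "shared": based & free,               # found by both workflows
        "reference_based_only": based - free,  # only via the reference-based route
        "reference_free_only": free - based,   # only via the reference-free route
    }


# Made-up identifiers, purely for illustration.
regions = compare_gene_functions(
    ["func-a", "func-b", "func-c"],
    ["func-b", "func-c", "func-d"],
)
```

Keeping both sets of rows in the aggregation table, as suggested, is what makes this comparison possible later; deduplicating at import time would lose the "shared" region.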
@aclum excellent point! might we think of adding a/the source for a given functional annotation to more easily enable this, or just use the workflow "version", however we are going to denote suchlike? I suppose you could build a query that walks up the chain to biosample, walks back down to a/the related metagenome, pulls an example |
I'd assumed the reference-free workflow would have a different workflow execution subclass name, which would allow for easy aggregations. |
We added an NMDC schema class for the gene aggregation for the sequencing workflows (see https://microbiomedata.github.io/nmdc-schema/FunctionalAnnotationAggMember/), but not for the metaproteomics workflows. I believe we wanted every collection in Mongo to have a slot on class Database; to do that, we need a class for the metaproteomics gene function aggregation. The collection in prod Mongo that currently contains this data is 'metap_gene_function_aggregation'.
related to
#1252
Based on that, we need a class with the following slots (based on how the collection is currently populated):
metaproteomic_analysis_id
gene_function_id (slot already exists - update domain_of for this slot)
count (slot already exists - update domain_of for this slot)
is_best_protein (need new slot)
All slots should be required.
Then Class Database needs a slot 'metap_gene_function_aggregation' with a range of the new metap gene function aggregation Class
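As a rough picture of what a record of the proposed class would look like, here is a sketch using a Python dataclass with the four slots listed above, all treated as required. The class name `MetapGeneFunctionAggregationRecord`, the helper `from_document`, and the Python types are assumptions for illustration; the actual class would be defined in the schema, and its name and ranges are still to be decided.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class MetapGeneFunctionAggregationRecord:  # hypothetical class name
    metaproteomic_analysis_id: str  # assumed to be an identifier string
    gene_function_id: str
    count: int
    is_best_protein: bool


def from_document(doc: dict[str, Any]) -> MetapGeneFunctionAggregationRecord:
    """Build a record from a Mongo-style dict, requiring every slot."""
    required = [f.name for f in fields(MetapGeneFunctionAggregationRecord)]
    missing = [name for name in required if name not in doc]
    if missing:
        raise ValueError(f"missing required slots: {missing}")
    # Extra keys (for example Mongo's _id) are ignored in this sketch.
    return MetapGeneFunctionAggregationRecord(
        **{name: doc[name] for name in required}
    )


# Made-up values, purely for illustration.
record = from_document(
    {
        "metaproteomic_analysis_id": "example-analysis-id",
        "gene_function_id": "example-gene-function-id",
        "count": 5,
        "is_best_protein": False,
    }
)
```

If the slots end up optional rather than required, the `missing` check above is the part that would change.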
@SamuelPurvine @mslarae13 @pdpiehowski @picowatt please confirm requirements, in particular if the slots should be required or not.
cc @turbomam