Refactor metaproteomics aggregation script #27

Merged
merged 16 commits into from
Dec 3, 2024

Conversation

kheal
Contributor

@kheal kheal commented Nov 7, 2024

This PR will refactor the generate_metap_agg.py script to address #26.

Overall, generate_metap_agg.py has been refactored to:

  1. Use the runtime API to reduce reliance on direct connections to MongoDB and take advantage of the API's validation
  2. Read the raw data objects (rather than the data stored in mongo) to generate the functional annotations that are aggregated for metaproteomics results (see Add ADR for metaP mongo data issues#920)
  3. Validate and submit data via the API

Will not be ready for release until microbiomedata/nmdc-schema#2203 has been merged in (done).

@kheal kheal changed the title Start to refactor metaproteomics aggregation class Refactor metaproteomics aggregation script Nov 7, 2024
@kheal kheal requested a review from aclum November 7, 2024 22:04
@kheal
Contributor Author

kheal commented Nov 7, 2024

@aclum - I'd appreciate any early feedback you have on the general approach, feel free to tag anyone else who should have eyes on this.

I'm not sure exactly how we'll set the API bearer tokens as environment variables, but that's how I've been doing development (I updated the README to describe the environment variables we'll need).

@kheal kheal linked an issue Nov 7, 2024 that may be closed by this pull request
@aclum

This comment was marked as resolved.

@aclum

This comment was marked as resolved.

@aclum

This comment was marked as resolved.

@kheal
Contributor Author

kheal commented Nov 9, 2024

@aclum, thanks for the input. I've rewritten the script to include a call that gets a bearer token using the API username and password (set as environment variables).
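
For readers following along, the token step described above might look roughly like the sketch below. It only builds the request and does not send it. The environment variable names, the `/token` path, and the form fields are assumptions for illustration (a common OAuth2 password-style form), not taken from the script or the README.

```python
import os
import urllib.parse
import urllib.request


def build_token_request(base_url, env=os.environ):
    """Build (but do not send) a POST request that asks for a bearer token.

    NMDC_API_USERNAME / NMDC_API_PASSWORD and the "/token" path are
    hypothetical names used only for this sketch.
    """
    form = {
        "grant_type": "password",
        "username": env["NMDC_API_USERNAME"],
        "password": env["NMDC_API_PASSWORD"],
    }
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/token",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Reading the credentials from the environment keeps them out of the repository; sending the request (for example with `urllib.request.urlopen`) and parsing the returned token are left out here.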

I've also pulled the functions that the classes can reuse into an abstract Aggregator class. That should make translating the other aggregator to use the API more straightforward: the other class, MetaGenomeFuncAgg, really just has to write its own process_activity method and declare the filters that find its records.
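
A minimal sketch of that layout, assuming a simplified record shape: the `Aggregator` and `process_activity` names come from the comment above, while `workflow_filter`, `sweep`, the `MetaProtAgg` subclass name, and the `workflow_id` / `term` / `count` keys are hypothetical stand-ins rather than the real schema fields.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class Aggregator(ABC):
    """Shared logic lives here; subclasses supply the record-specific parts."""

    # Hypothetical attribute: each subclass declares how to find its records.
    workflow_filter: Dict[str, str] = {}

    @abstractmethod
    def process_activity(self, activity: dict) -> Dict[str, int]:
        """Return a mapping of annotation term to count for one workflow."""

    def sweep(self, activities: Iterable[dict]) -> List[dict]:
        """Build one aggregation row per (workflow, term) pair."""
        rows = []
        for activity in activities:
            counts = self.process_activity(activity)
            for term, count in sorted(counts.items()):
                rows.append(
                    {"workflow_id": activity["id"], "term": term, "count": count}
                )
        return rows


class MetaProtAgg(Aggregator):
    # Hypothetical filter value, for illustration only.
    workflow_filter = {"type": "metaproteomics"}

    def process_activity(self, activity: dict) -> Dict[str, int]:
        # Count how often each annotation term appears in this workflow.
        counts: Dict[str, int] = {}
        for term in activity.get("annotations", []):
            counts[term] = counts.get(term, 0) + 1
        return counts
```

With this shape, a second aggregator only overrides `workflow_filter` and `process_activity`; the sweep loop is inherited unchanged.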

I'll leave this in draft until the next release since it depends on the migrated database and next schema release.

@kheal kheal requested a review from picowatt November 19, 2024 17:46
@kheal
Contributor Author

kheal commented Nov 19, 2024

@picowatt I'll let you know when this is ready for review - I'm going to incorporate Alicia's comments and test this after the new release first.

@kheal

This comment was marked as outdated.

@kheal kheal marked this pull request as ready for review November 20, 2024 02:43
@kheal
Contributor Author

kheal commented Nov 25, 2024

With my updated permissions (thanks @eecavanna), I checked that the json:submit endpoint is working as expected.

I loaded the annotations for a single metaP workflow into the functional_annotation_agg set in dev mongo.
The ID of that workflow record is nmdc:wfmp-11-x0zhd078.1.

@aclum - is there a server/data portal issue to make sure the functional searches/ingests are expecting MetaProteomics records in the functional_annotation_agg?

@aclum aclum requested review from dwinston and eecavanna November 25, 2024 19:18
@aclum
Collaborator

aclum commented Nov 25, 2024

No, there is not a corresponding nmdc-server ticket yet, would you please make one @kheal ?

@eecavanna @dwinston what is the max payload json:submit can handle? @kheal what is the max length for expected aggregation results?

@aclum aclum requested a review from shreddd November 25, 2024 19:19
@kheal
Contributor Author

kheal commented Nov 25, 2024

I wrote the script so that json:submit only submits one workflow's aggregation results at a time to avoid payload issues, though I haven't tested the full lot yet. I can run the whole script locally to write into dev mongo overnight as a test.
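
As a rough illustration of the one-workflow-per-submission idea, the sketch below groups aggregation rows by workflow and yields one payload per workflow. The `workflow_id` key and the payload shape (a dictionary keyed by the set name) are assumptions for this example, not the script's actual structures.

```python
from itertools import groupby
from operator import itemgetter


def payloads_by_workflow(rows):
    """Yield (workflow_id, payload) pairs, one payload per workflow.

    Each payload holds only that workflow's rows, which keeps any single
    submission small. "workflow_id" is a hypothetical key name.
    """
    by_workflow = itemgetter("workflow_id")
    # groupby needs its input sorted by the grouping key; sorted() is stable,
    # so rows keep their original relative order within each workflow.
    for workflow_id, group in groupby(sorted(rows, key=by_workflow), key=by_workflow):
        yield workflow_id, {"functional_annotation_agg": list(group)}
```

Each yielded payload would then be handed to whatever function performs the json:submit call.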

@kheal
Contributor Author

kheal commented Nov 25, 2024

> No, there is not a corresponding nmdc-server ticket yet, would you please make one @kheal ?

Associated ticket filed here: microbiomedata/nmdc-server#1468

@kheal
Contributor Author

kheal commented Nov 26, 2024

Moving this back into draft. Testing revealed the first API call to be exceptionally slow, attempting to fix now.

@kheal kheal marked this pull request as draft November 26, 2024 20:38
@kheal
Contributor Author

kheal commented Nov 26, 2024

> Moving this back into draft. Testing revealed the first API call to be exceptionally slow, attempting to fix now.

I've implemented a partial fix for this, but it will likely not work for future subclasses of the Aggregator class. I will submit a follow-up issue to fix this once the issue on nmdc-runtime is addressed. Issue filed here: #28

> @kheal what is the max length for expected aggregation results?

I tested the full run of the MetaP aggregator sweep method and did not encounter any payload issues with json:submit.

@kheal kheal marked this pull request as ready for review November 26, 2024 22:44
Collaborator

@shreddd shreddd left a comment

I think this looks reasonable (my comments are minor and can be ignored if they don't apply).

Also - Is there a way to test this code to make sure it works?

@kheal
Contributor Author

kheal commented Nov 27, 2024

> I think this looks reasonable (my comments are minor and can be ignored if they don't apply).
>
> Also - Is there a way to test this code to make sure it works?

@shreddd I'll add your suggested logging options tomorrow, thanks for that helpful feedback. I've tested the script locally and it successfully loaded the records into dev mongo. When I reran it, the script did nothing (as expected).
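
The rerun behaviour described above (a second run finds nothing left to do) can be pictured as a simple skip check. This is a sketch under the assumption that the IDs of already aggregated workflows can be listed up front; the function and parameter names are hypothetical.

```python
def workflows_to_process(candidate_ids, already_aggregated_ids):
    """Return the candidate workflow IDs that have no aggregation yet.

    Order of the candidates is preserved; IDs already aggregated are skipped.
    """
    done = set(already_aggregated_ids)
    return [wf_id for wf_id in candidate_ids if wf_id not in done]
```

On a first run nothing is in the "done" set, so every candidate is processed; on a rerun the list comes back empty and the script has nothing to submit.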

kheal and others added 4 commits November 27, 2024 09:41
From <https://www.mongodb.com/docs/v6.0/reference/operator/query/regex/#index-use>:

> Further optimization can occur if the regular expression is a "prefix expression", which means that all potential matches start with the same string. This allows MongoDB to construct a "range" from that prefix and only match against those values from the index that fall within that range.

> A regular expression is a "prefix expression" if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
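
To make the quoted MongoDB note concrete, here is a small sketch of building a filter whose regular expression is anchored at the start of the value. The helper name `prefix_filter` is hypothetical, and the `nmdc:wfmp` prefix in the usage note is taken from the workflow ID mentioned earlier in this thread; whether a given query actually benefits depends on the indexes available.

```python
import re


def prefix_filter(field, prefix):
    """Build a MongoDB-style filter matching values that start with `prefix`.

    The leading caret anchors the pattern to the start of the value, and
    re.escape() keeps characters such as "." from acting as wildcards.
    """
    return {field: {"$regex": "^" + re.escape(prefix)}}
```

For example, `prefix_filter("id", "nmdc:wfmp")` produces a filter that matches IDs beginning with `nmdc:wfmp`, in contrast to an unanchored pattern that would have to be tested against every value.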
@aclum aclum merged commit b466213 into main Dec 3, 2024
@aclum aclum deleted the refactor_metap_agg branch December 3, 2024 21:08
Development

Successfully merging this pull request may close these issues.

Refactor metaP aggregation
5 participants