update aggregation scripts to use API to submit instead of pymongo #11

Open
aclum opened this issue Aug 5, 2024 · 8 comments

Comments

@aclum
Collaborator

aclum commented Aug 5, 2024

Justification: In order to migrate runtime to the cloud for increased stability we need to transition code that interacts with mongo directly to API queries.

blocked by:
microbiomedata/nmdc-runtime#611 - resolved, we can use json:submit now to enter these records.

Acceptance criteria:
Both generate_functional_agg.py and generate_metap_agg.py generate a request body which is submitted to a Runtime API endpoint instead of using pymongo insert statements.
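
A minimal sketch of what this criterion could look like in practice: build a request body from the aggregation rows and prepare an HTTP POST, rather than calling insert_many. The base URL, the /metadata/json:submit path, the bearer-token header, the body shape (collection name mapped to a list of documents), and the example row fields are all assumptions for illustration, not confirmed by this issue.

```python
import json
import urllib.request

# Hypothetical values: the base URL and endpoint path are assumptions.
API_BASE = "https://api.example.org"
SUBMIT_PATH = "/metadata/json:submit"


def build_submit_request(collection, rows, token):
    """Prepare (but do not send) a POST request carrying the rows as JSON."""
    # Assumed body shape: {collection_name: [document, ...]}
    body = json.dumps({collection: rows}).encode("utf-8")
    return urllib.request.Request(
        API_BASE + SUBMIT_PATH,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Illustrative row; the field names and values are placeholders.
rows = [{"metagenome_annotation_id": "example-id-1", "count": 3}]
req = build_submit_request("functional_annotation_agg", rows, token="hypothetical-token")
# Sending it would be: urllib.request.urlopen(req)
```

Keeping the request construction separate from the send step makes the generated body easy to inspect or unit test without network access.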

cc @sanjaypjana @eecavanna @shreddd @mbthornton-lbl

Subtasks:

@eecavanna
Collaborator

eecavanna commented Aug 7, 2024

Thanks for summarizing the situation and laying out the acceptance criteria.

I took a look at this today. Here are my English translations of all the database queries performed within generate_functional_agg.py, specifically.

Query 1

"Get all the distinct metagenome_annotation_id values among all documents in the functional_annotation_agg collection."

done = self.agg_col.distinct("metagenome_annotation_id")

Query 2

"For each document in the metagenome_annotation_activity_set collection..."

for actrec in self.act_col.find({}):

Query 3

"Insert these documents into the data_object_set collection."

self.agg_col.insert_many(rows)

Query 4

"Get the document having this id value, from the data_object_set collection."

do = self.do_col.find_one({"id": doid})

Finally, here are the aliases that appear in the list of queries above.

self.agg_col = self.db.functional_annotation_agg
self.act_col = self.db.metagenome_annotation_activity_set
self.do_col = self.db.data_object_set

@eecavanna
Collaborator

Similarly, here are my English translations of all the database queries performed within generate_metap_agg.py. They mirror the ones in the other file (i.e. same operations, different operands).

Query 1

"Get all the distinct metaproteomic_analysis_id values among all documents in the metap_gene_function_aggregation collection."

done = self.agg_col.distinct("metaproteomic_analysis_id")

Query 2

"For each document in the metaproteomics_analysis_activity_set collection..."

for actrec in self.act_col.find({}):

Query 3

"Insert these documents into the metap_gene_function_aggregation collection."

self.agg_col.insert_many(rows)

Query 4

"Get the document having this id value, from the data_object_set collection."

do = self.do_col.find_one({"id": doid})

Finally, here are the aliases that appear in the list of queries above.

self.agg_col = self.db.metap_gene_function_aggregation
self.act_col = self.db.metaproteomics_analysis_activity_set
self.do_col = self.db.data_object_set
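
Since the two scripts perform the same operations on different operands, the differences can be captured as data. The sketch below is one hypothetical way to do that; the `AggOperands` class, the `OPERANDS` table, and `describe_queries` are illustrative names that do not exist in either script. The collection and field names are taken from the aliases and queries listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AggOperands:
    """Collection names and id field used by one aggregation script."""
    agg_collection: str
    activity_collection: str
    data_object_collection: str
    id_field: str


OPERANDS = {
    "generate_functional_agg.py": AggOperands(
        agg_collection="functional_annotation_agg",
        activity_collection="metagenome_annotation_activity_set",
        data_object_collection="data_object_set",
        id_field="metagenome_annotation_id",
    ),
    "generate_metap_agg.py": AggOperands(
        agg_collection="metap_gene_function_aggregation",
        activity_collection="metaproteomics_analysis_activity_set",
        data_object_collection="data_object_set",
        id_field="metaproteomic_analysis_id",
    ),
}


def describe_queries(op):
    """Summarize the four cataloged queries for one set of operands."""
    return [
        f"distinct {op.id_field} in {op.agg_collection}",
        f"find all documents in {op.activity_collection}",
        f"insert rows into {op.agg_collection}",
        f"find one document by id in {op.data_object_collection}",
    ]
```

With the operands isolated like this, any replacement for the pymongo calls would only need to be written once and parameterized per script.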

@eecavanna
Collaborator

At this point, I'm wondering whether the Runtime API already provides the endpoints necessary for performing those operations. If it does, I think this is ready for implementation.

@aclum
Collaborator Author

aclum commented Aug 7, 2024

query 4 inserts into the aggregation tables (functional_annotation_agg and metap_gene_function_aggregation) not data_object_set.

The blocking ticket linked in the description, microbiomedata/nmdc-runtime#611, prevents us from using json:submit to add documents via the API. We could possibly use queries:run (I haven't tested that), but it would be nice to use an endpoint that has more validation. Additionally, metap_gene_function_aggregation is not defined in the schema, so I believe this rules out using any existing endpoints at this time.
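
For reference, if queries:run were used, and assuming it accepts MongoDB-style command documents (untested, as noted above), the cataloged operations might be expressed as plain dictionaries. The `distinct`, `find`, and `insert` shapes below follow MongoDB's database command format; whether the Runtime endpoint accepts each of them, and for which collections, is an assumption. The helper names are hypothetical.

```python
def distinct_cmd(collection, key):
    """Command document for: distinct values of `key` in `collection`."""
    return {"distinct": collection, "key": key}


def find_cmd(collection, filter_=None, limit=None):
    """Command document for: find documents matching `filter_`."""
    cmd = {"find": collection, "filter": filter_ or {}}
    if limit is not None:
        cmd["limit"] = limit
    return cmd


def insert_cmd(collection, documents):
    """Command document for: insert `documents` into `collection`."""
    return {"insert": collection, "documents": list(documents)}


# The queries from generate_functional_agg.py, expressed as command documents.
q1 = distinct_cmd("functional_annotation_agg", "metagenome_annotation_id")
q2 = find_cmd("metagenome_annotation_activity_set")
q4 = find_cmd("data_object_set", {"id": "example-data-object-id"}, limit=1)
```

Note that this route would not add the extra validation mentioned above; it only changes how the same operations are transported.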

@eecavanna
Collaborator

eecavanna commented Aug 8, 2024

"query 4 inserts into the aggregation tables (functional_annotation_agg and metap_gene_function_aggregation) not data_object_set."

I think you are referring to the query I referred to as "Query 3." In both files, the query I referred to as "Query 4" is a find_one and not an insertion.

image

The numbering I used was arbitrary (my objective was to catalog the queries, not so much to convey the algorithm) and might not match the order in which the queries are performed.

@eecavanna
Collaborator

eecavanna commented Aug 8, 2024

I'll add a topic to the agenda for tomorrow's infrastructure meeting, about addressing the things (in the Runtime) that are—or may be—blocking this.

@ssarrafan

@aclum @eecavanna who is this issue assigned to? Who's working on this?

@aclum
Collaborator Author

aclum commented Dec 3, 2024

@kheal addressed generate_metap_agg.py in #26 recently. @mbthornton-lbl could work on generate_functional_agg.py but this hasn't been prioritized yet. This is not currently a blocker but would have to get addressed before moving our mongo instance off of SPIN.
