Dagster apply_metadata_in op fails to insert functional_annotation_agg documents #611

Closed
aclum opened this issue Jul 24, 2024 · 5 comments · Fixed by #620

@aclum
Contributor

aclum commented Jul 24, 2024

Describe the bug
I tried to submit test data for functional_annotation_agg to runtime dev. The data validates with json:validate, and json:submit returns all okay, but the Dagster job fails. For the error, see https://dagit-dev.microbiomedata.org/runs/76495c80-9076-4da4-94e5-4baf162a1d6d

KeyError: 'functional_annotation_agg'
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/utils.py", line 54, in op_execution_error_boundary
    yield
  File "/usr/local/lib/python3.10/site-packages/dagster/_utils/__init__.py", line 468, in iterate_with_context
    next_output = next(iterator)
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 141, in _coerce_op_compute_fn_to_iterator
    result = invoke_compute_fn(
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/plan/compute_generator.py", line 129, in invoke_compute_fn
    return fn(context, **args_to_pass) if context_arg_provided else fn(**args_to_pass)
  File "/opt/dagster/lib/nmdc_runtime/site/ops.py", line 530, in perform_mongo_updates
    if all(coll_has_id_index[coll] for coll in docs.keys()):
  File "/opt/dagster/lib/nmdc_runtime/site/ops.py", line 530, in <genexpr>
    if all(coll_has_id_index[coll] for coll in docs.keys()): 

To Reproduce
Steps to reproduce the behavior:

  1. Go to runtime dev and log in with personal credentials
  2. Submit the request (see snippet below)
curl -X 'POST' \
  'https://api-dev.microbiomedata.org/metadata/json:submit' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"functional_annotation_agg":[
{"metagenome_annotation_id":"nmdc:wfmtan-13-hemh0a82.1",
"gene_function_id": "KEGG.ORTHOLOGY:K00005",
"count":10},
{"metagenome_annotation_id":"nmdc:wfmtan-13-hemh0a82.1",
"gene_function_id": "KEGG.ORTHOLOGY:K01426",
"count":5}
]}'
  3. Go to https://dagit-dev.microbiomedata.org/locations/repo@nmdc_runtime.site.repository%3Arepo/jobs/apply_metadata_in to check on the status

Expected behavior
After the submission, this request should return the submitted records:

curl -X 'GET' \
  'https://api-dev.microbiomedata.org/nmdcschema/functional_annotation_agg?filter=%7B%22metagenome_annotation_id%22%3A%22nmdc%3Awfmtan-13-hemh0a82.1%22%7D&max_page_size=20' \
  -H 'accept: application/json'
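
For anyone scripting the check above, here is a minimal Python sketch of how the percent-encoded `filter` parameter can be built from a plain dictionary. The helper name `build_query_url` is made up for illustration; only the endpoint URL and the filter value come from the request above.

```python
import json
from urllib.parse import quote

BASE_URL = "https://api-dev.microbiomedata.org/nmdcschema/functional_annotation_agg"


def build_query_url(filter_doc: dict, max_page_size: int = 20) -> str:
    """Hypothetical helper: build the GET URL with a percent-encoded JSON filter."""
    # Compact separators keep the JSON free of spaces, matching the URL above.
    filter_json = json.dumps(filter_doc, separators=(",", ":"))
    # safe="" percent-encodes reserved characters such as '{', '"' and ':'.
    encoded = quote(filter_json, safe="")
    return f"{BASE_URL}?filter={encoded}&max_page_size={max_page_size}"


url = build_query_url({"metagenome_annotation_id": "nmdc:wfmtan-13-hemh0a82.1"})
print(url)
```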

Additional context
I think this is because those documents don't have an id field. Normally these records are inserted directly via pymongo by code from nmdc-aggregator.

@aclum aclum added the bug Something isn't working label Jul 24, 2024
@eecavanna
Collaborator

eecavanna commented Jul 24, 2024

@aclum and I met via Zoom a few minutes ago. She explained the situation to me and showed me where I can find the error message in the Dagster UI (Dagit).

Since this is the development environment (not production), and in order to unblock what she is working on (which is to enable someone else downstream), I was in favor of adding the documents to the database directly, via a Mongo account that has write access to that collection. I want to emphasize that this was in the development environment.

So, I did that during the Zoom call; i.e. I added the two documents to the database. As a result, @aclum's work is not blocked by this issue.

I will continue looking into this issue.

@eecavanna eecavanna changed the title from "issue submitting functional_annotation_agg with json:submit endpoint on runtime dev" to "Dagster apply_metadata_in op fails to insert functional_annotation_agg documents" Jul 24, 2024
@eecavanna
Collaborator

Here's the final couple of lines from the stack trace:

  File "/opt/dagster/lib/nmdc_runtime/site/ops.py", line 530, in <genexpr>
    if all(coll_has_id_index[coll] for coll in docs.keys()):

Here are lines 529-531 of the source code the error message is referring to:

    coll_has_id_index = collection_indexed_on_id(mongo.db)
    if all(coll_has_id_index[coll] for coll in docs.keys()):
        replace = True

Source: nmdc-runtime/nmdc_runtime/site/ops.py

Note that the first line of that source code calls a function named collection_indexed_on_id (I don't know why the name is a noun instead of a verb). Based on its name, I assume it returns collections, or names of collections, that have an index on their id field.

The collection @aclum was trying to insert data into is functional_annotation_agg. Documents in that collection do not have an id field. So, I doubt the collection has an index on its id field.
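
To make the failure mode concrete, here is a small sketch that reproduces the same kind of KeyError with a plain dictionary. The contents of `coll_has_id_index` are an assumption for illustration (the real value comes from the Mongo database and is not shown in this thread); only the `all(...)` expression is taken from the source lines quoted above.

```python
# Assumed stand-in for what collection_indexed_on_id(mongo.db) returns:
# a mapping from collection name to "has an index on id".
coll_has_id_index = {"biosample_set": True, "study_set": True}

# Shape of the submitted payload: collection name -> list of documents.
docs = {
    "functional_annotation_agg": [
        {
            "metagenome_annotation_id": "nmdc:wfmtan-13-hemh0a82.1",
            "gene_function_id": "KEGG.ORTHOLOGY:K00005",
            "count": 10,
        },
    ]
}

missing = None
try:
    # Same expression as line 530 of ops.py: indexing with [] raises
    # KeyError as soon as a submitted collection is absent from the mapping.
    replace = all(coll_has_id_index[coll] for coll in docs.keys())
except KeyError as err:
    missing = err.args[0]
    print(f"KeyError: {missing!r}")
```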

@eecavanna
Collaborator

Since I'll be OOO for the next few days (back next Wednesday), I'll hand this issue off to @dwinston and @PeopleMakeCulture.

@eecavanna
Collaborator

eecavanna commented Jul 26, 2024

Looks like the collection_indexed_on_id function only accounts for collections whose names end in _set (that explains to me why things involving functional_annotation_agg might not work). I don't know the implications of changing that behavior for the rest of the code base.
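
A rough sketch of that observation, using a hypothetical stand-in function rather than the real collection_indexed_on_id (whose implementation is not quoted in this thread): if only names ending in `_set` are considered, functional_annotation_agg never becomes a key in the returned mapping.

```python
def collections_with_id_index(collection_names):
    """Hypothetical stand-in: only names ending in `_set` are reported."""
    return {name: True for name in collection_names if name.endswith("_set")}


# Illustrative collection names; the real list comes from the database.
names = ["biosample_set", "study_set", "functional_annotation_agg"]
coll_has_id_index = collections_with_id_index(names)

# The aggregation collection is simply absent from the mapping.
print("functional_annotation_agg" in coll_has_id_index)

# A lookup with a default reports "no id index" instead of raising.
print(coll_has_id_index.get("functional_annotation_agg", False))
```

Whether a default of False is the right behavior for the rest of the code base is exactly the open question raised in the comment.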

@aclum
Copy link
Contributor Author

aclum commented Aug 5, 2024

@PeopleMakeCulture any update on this?

dwinston added a commit that referenced this issue Aug 7, 2024
interpret as simple insertion. leave note in code about decision to insist on schema-supplied uniqueness signal.

fix #611
dwinston added a commit that referenced this issue Aug 8, 2024
* fix: allow "update" of non-`id`-having document collections

interpret as simple insertion. leave note in code about decision to insist on schema-supplied uniqueness signal.

fix #611

* refactor to add test

* fix: rm abandoned candidate test

* Update nmdc_runtime/site/ops.py

Co-authored-by: eecavanna <[email protected]>

---------

Co-authored-by: eecavanna <[email protected]>
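
The commit message describes treating an "update" to a collection without `id` as a simple insertion. The sketch below illustrates that decision under stated assumptions; it is not the code from the referenced commit. `write_docs` and `RecordingCollection` are made-up names, and the stub only mimics the pymongo `Collection.replace_one` and `Collection.insert_many` methods so the example runs without a database.

```python
class RecordingCollection:
    """In-memory stub for demonstration; records which method was called."""

    def __init__(self):
        self.calls = []

    def replace_one(self, filter, replacement, upsert=False):
        self.calls.append(("replace_one", filter, upsert))

    def insert_many(self, documents):
        self.calls.append(("insert_many", len(documents)))


def write_docs(collection, docs, has_id_index):
    """Upsert by `id` when the collection is indexed on it; otherwise insert."""
    if has_id_index:
        for doc in docs:
            collection.replace_one({"id": doc["id"]}, doc, upsert=True)
    else:
        # No `id` to match on, so the documents are inserted as new.
        collection.insert_many(docs)


agg = RecordingCollection()
write_docs(
    agg,
    [
        {
            "metagenome_annotation_id": "nmdc:wfmtan-13-hemh0a82.1",
            "gene_function_id": "KEGG.ORTHOLOGY:K00005",
            "count": 10,
        }
    ],
    has_id_index=False,
)
print(agg.calls)
```

One consequence of plain insertion is that resubmitting the same payload would add the documents again, which may be why the commit message mentions a schema-supplied uniqueness signal.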