
Inconsistent schema validation between /v1/workflows/activities and /metadata/json:validate API endpoints #462

Closed
aclum opened this issue Jan 31, 2024 · 5 comments · Fixed by #497
Labels: bug, priority

Comments

@aclum (Contributor) commented Jan 31, 2024

Does a data object record with a url value of null submit without errors? This record does not validate with json:submit, yet it ended up in Mongo, we believe via the v1 workflows POST endpoint, so we need to double-check. If that endpoint does not validate against the schema, it needs to be fixed.

{
    "_id": {"$oid": "65affcdc07a1abea27b3fa0a"},
    "alternative_identifiers": [],
    "compression_type": null,
    "data_object_type": "Metagenome Raw Reads",
    "description": "Raw sequencer read data",
    "file_size_bytes": 2861414297,
    "id": "nmdc:dobj-11-77e9jm46",
    "md5_checksum": null,
    "name": "9422.8.132674.GTTTCG.fastq.gz",
    "type": "nmdc:DataObject",
    "url": null,
    "was_generated_by": null
}

@aclum (Contributor, Author) commented Feb 12, 2024

I was able to confirm that a record that does not validate with json:validate can be entered into Mongo via v1/workflows/activities. The desired behavior is consistent validation across POST endpoints.

When I try this record with json:validate I get a few errors, but with v1/workflows/activities I am able to submit it (tested against the dev version of the runtime).

cc @Michal-Babins @mbthornton-lbl for coordination. The related ticket to fix things on the record-generation side is microbiomedata/nmdc_automation#24.
cc @turbomam @eecavanna @pkalita-lbl because I expect this to cause problems with the next migration, since the data in Mongo prod is not currently schema compliant.

I'd like to get this addressed this sprint if possible, to prevent more invalid data from getting into Mongo. @ssarrafan @shreddd

Test record

{
    "metagenome_annotation_activity_set": [
        {
            "id": "nmdc:wfmgan-11-mmt28267.2",
            "name": "Metagenome Annotation Analysis Activity for nmdc:wfmgan-11-mmt28267.2",
            "started_at_time": "2024-02-07T22:56:28.922223+00:00",
            "ended_at_time": "2024-02-08T06:03:41.210740+00:00",
            "was_informed_by": "nmdc:omprc-11-9mvz7z22",
            "used": null,
            "execution_resource": "NERSC-Perlmutter",
            "git_url": "https://github.com/microbiomedata/mg_annotation",
            "has_input": [
                "nmdc:dobj-11-5eb6v689"
            ],
            "type": "nmdc:MetagenomeAnnotationActivity",
            "has_output": [
                "nmdc:dobj-11-xg79v192",
                "nmdc:dobj-11-ztbqrz10",
                "nmdc:dobj-11-v8ktb336",
                "nmdc:dobj-11-7bwq0v72",
                "nmdc:dobj-11-9fthdp31",
                "nmdc:dobj-11-xpfkx256",
                "nmdc:dobj-11-47q33b43",
                "nmdc:dobj-11-7k704568",
                "nmdc:dobj-11-5qfq9q36",
                "nmdc:dobj-11-5sfp4s54",
                "nmdc:dobj-11-fge4rm69",
                "nmdc:dobj-11-a2fwy597",
                "nmdc:dobj-11-0qywmr14",
                "nmdc:dobj-11-aqf88r67",
                "nmdc:dobj-11-42bfmh62",
                "nmdc:dobj-11-jkfqv467",
                "nmdc:dobj-11-te9q8e26",
                "nmdc:dobj-11-r403jz94",
                "nmdc:dobj-11-f26zbz28",
                "nmdc:dobj-11-qg6mr936",
                "nmdc:dobj-11-1mek9m87",
                "nmdc:dobj-11-wycjgd54",
                "nmdc:dobj-11-x8r3q961"
            ],
            "part_of": [
                "nmdc:omprc-11-9mvz7z22"
            ],
            "version": "v1.0.4",
            "qc_status": null,
            "qc_comment": null,
            "has_failure_categorization": [],
            "gold_analysis_project_identifiers": []
        }
    ]
}

runtime dev json:validate response body

{
  "result": "errors",
  "detail": {
    "metagenome_annotation_activity_set": [
      "None is not of type 'string'",
      "None is not one of ['pass', 'fail']",
      "None is not of type 'string'",
      "None is not of type 'string'"
    ]
  }
}

runtime dev v1/workflows/activities response body

{
  "message": "jobs accepted"
}
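
For reproducibility, a minimal sketch of the comparison (assuming Python requests, the endpoint paths as named in this issue's title, and a placeholder bearer token; the abbreviated payload stands in for the full test record above):

import requests

BASE = "https://api-dev.microbiomedata.org"  # dev runtime host used elsewhere in this thread
TOKEN = "..."  # placeholder; the workflows endpoint requires an authenticated client

# Abbreviated payload; in practice this would be the full test record above.
payload = {
    "metagenome_annotation_activity_set": [
        {"id": "nmdc:wfmgan-11-mmt28267.2", "used": None, "qc_status": None}
    ]
}

# Validation-only endpoint: reports the null-valued slots as errors.
r_validate = requests.post(f"{BASE}/metadata/json:validate", json=payload)
print(r_validate.json())

# v1 workflows endpoint: accepts the same payload ("jobs accepted").
r_submit = requests.post(
    f"{BASE}/v1/workflows/activities",
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
print(r_submit.json())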

@aclum aclum changed the title check if workflow_execution_activities/post_activity_v1_workflows_activities_post endpoint does schema validation Inconsistent schema validation between workflow_execution_activities/post_activity_v1_workflows_activities_post endpoint and metadata/Validate_JSON_metadata_json_validate_post Feb 12, 2024

@PeopleMakeCulture (Collaborator) commented

Duplicates: #478

Solution:

  • deprecate v1 endpoint
  • use json:validate for workflows/activities (see the sketch below)
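
A minimal sketch of that second bullet, assuming the same FastAPI router, dependencies, and validate_json utility quoted later in this thread (an illustration of the proposed direction, not the merged fix in #497):

# Proposed shape: the workflows endpoint defers entirely to validate_json,
# so it rejects exactly the payloads that /metadata/json:validate rejects.
@router.post("/workflows/activities")
async def post_activity(
    activity_set: dict[str, Any],
    site: Site = Depends(get_current_client_site),
    mdb: MongoDatabase = Depends(get_mongo_db),
):
    _ = site  # must be authenticated
    rv = validate_json(activity_set, mdb)
    if rv["result"] == "errors":
        raise HTTPException(
            status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
            detail=str(rv),
        )
    # persist the documents and claim jobs only after validation passes
    return {"message": "jobs accepted"}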

@aclum (Contributor, Author) commented Mar 14, 2024

The changes that were made to the endpoints make it so you can't submit data_object_set records, only workflow execution subclasses. A typical submission is a mix of data_object_set records and workflow subclass records. This needs to be hotfixed, as it blocks workflows from using these endpoints.
@scanon suggested a fix on lines 83/84 (sketched in context below): if collection_name not in activity_collection_names(mdb) and collection_name != "data_object_set":
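
A minimal sketch of how that condition would slot into the endpoint's key check (the surrounding code is quoted in full later in this thread; this illustrates the suggestion rather than the exact merged patch):

# allow data_object_set through alongside the activity collections,
# so mixed submissions of activities plus data objects can validate
for collection_name in activity_set:
    if (
        collection_name not in activity_collection_names(mdb)
        and collection_name != "data_object_set"
    ):
        raise ValueError("keys must be nmdc-schema activity collection names`")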

Example record that is expected to pass

curl -X 'POST' \
  'https://api-dev.microbiomedata.org/workflows/activities' \
  -H 'accept: application/json' \
  -H "Authorization: Bearer $TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{"mags_activity_set": [
    {
      "id": "nmdc:wfmag-11-zcwca422.5",
      "name": "TEST nmdc:wfmag-11-zcwca422.5",
      "started_at_time": "2024-03-13T21:45:28.521604+00:00",
      "ended_at_time": "2024-03-13T23:56:16.431104+00:00",
      "was_informed_by": "nmdc:omprc-11-9mvz7z22",
      "execution_resource": "NERSC-Perlmutter",
      "git_url": "https://github.com/microbiomedata/metaMAGs",
      "has_input": [
        "nmdc:dobj-11-5eb6v689",
        "nmdc:dobj-11-7k80qv75",
        "nmdc:dobj-11-y563v150",
        "nmdc:dobj-11-72e7f129",
        "nmdc:dobj-11-vn9pwz37",
        "nmdc:dobj-11-9x2zaf16",
        "nmdc:dobj-11-9kpz9641",
        "nmdc:dobj-11-9prnyr33",
        "nmdc:dobj-11-feses595",
        "nmdc:dobj-11-0z5rhk53",
        "nmdc:dobj-11-sb28nx57",
        "nmdc:dobj-11-vrcm1x60",
        "nmdc:dobj-11-xx1tb938",
        "nmdc:dobj-11-y3f47w18",
        "nmdc:dobj-11-k62fk420"
      ],
      "type": "nmdc:MagsAnalysisActivity",
      "has_output": [
        "nmdc:dobj-12-gvntvq90"],
      "part_of": [
        "nmdc:omprc-11-9mvz7z22"
      ],
      "version": "v1.0.8"
    }
  ],

  "data_object_set": [
    {
      "id": "nmdc:dobj-12-gvntvq90",
      "name": "nmdc_wfmag-11-zcwca422.5_checkm_qa.out",
      "description": "CheckM for nmdc:wfmag-11-zcwca422.5",
      "file_size_bytes": 14027,
      "md5_checksum": "dcaa3973977aac97c74eb1610ffdba45",
      "data_object_type": "CheckM Statistics",
      "type": "nmdc:DataObject"
    }]
}'

current error on runtime-dev

{
  "detail": "keys must be nmdc-schema activity collection names`"
}

@eecavanna (Collaborator) commented Mar 15, 2024

Here's a link to a Slack conversation about this issue:

I will rename the issue so its name shows the endpoints as URL paths. For reference, here are the Swagger UI links the original issue name was based upon:

@eecavanna eecavanna changed the title Inconsistent schema validation between workflow_execution_activities/post_activity_v1_workflows_activities_post endpoint and metadata/Validate_JSON_metadata_json_validate_post Inconsistent schema validation between /v1/workflows/activities and /metadata/json:validate API endpoints Mar 15, 2024

@eecavanna (Collaborator) commented Mar 15, 2024

Here's the validation code used by the /v1/workflows/activities endpoint (the embedded code block is scrollable):

# TODO: Create activity.py in ../models
@router.post("/workflows/activities")
async def post_activity(
    activity_set: dict[str, Any],
    site: Site = Depends(get_current_client_site),
    mdb: MongoDatabase = Depends(get_mongo_db),
):
    """
    Please migrate all workflows from `v1/workflows/activities` to this endpoint.
    -------
    Post activity set to database and claim job.
    Parameters
    -------
    activity_set: dict[str, Any]
        Set of activities for specific workflows.
    Returns
    -------
    dict[str, str]
    """
    _ = site  # must be authenticated
    try:
        # verify activities in activity_set are nmdc-schema compliant
        for collection_name in activity_set:
            if collection_name not in activity_collection_names(mdb):
                raise ValueError("keys must be nmdc-schema activity collection names`")
        # validate request JSON
        rv = validate_json(activity_set, mdb)
        if rv["result"] == "errors":
            raise HTTPException(
                status_code=status.HTTP_422_UNPROCESSABLE_ENTITY,
                detail=str(rv),
            )

Here's the validation code used by the /metadata/json:validate endpoint:

@router.post("/metadata/json:validate", name="Validate JSON")
async def validate_json_nmdcdb(docs: dict, mdb: MongoDatabase = Depends(get_mongo_db)):
"""
Validate a NMDC JSON Schema "nmdc:Database" object.
"""
return validate_json(docs, mdb)

Finally, here's the validate_json utility function they both call (the embedded code block is scrollable):

def validate_json(in_docs: dict, mdb: MongoDatabase):
    validator = Draft7Validator(get_nmdc_jsonschema_dict())
    docs = deepcopy(in_docs)
    validation_errors = {}
    known_coll_names = set(nmdc_database_collection_names())
    for coll_name, coll_docs in docs.items():
        if coll_name not in known_coll_names:
            if coll_name == "@type" and coll_docs in ("Database", "nmdc:Database"):
                continue
            else:
                validation_errors[coll_name] = [
                    f"'{coll_name}' is not a known schema collection name"
                ]
                continue

        errors = list(validator.iter_errors({coll_name: coll_docs}))
        validation_errors[coll_name] = [e.message for e in errors]
        if coll_docs:
            if not isinstance(coll_docs, list):
                validation_errors[coll_name].append("value must be a list")
            elif not all(isinstance(d, dict) for d in coll_docs):
                validation_errors[coll_name].append(
                    "all elements of list must be dicts"
                )
            if not validation_errors[coll_name]:
                try:
                    with OverlayDB(mdb) as odb:
                        odb.replace_or_insert_many(coll_name, coll_docs)
                except OverlayDBError as e:
                    validation_errors[coll_name].append(str(e))

    if all(len(v) == 0 for v in validation_errors.values()):
        # Second pass. Try instantiating linkml-sourced dataclass
        in_docs.pop("@type", None)
        try:
            NMDCDatabase(**in_docs)
        except Exception as e:
            return {"result": "errors", "detail": str(e)}
        return {"result": "All Okay!"}
    else:
        return {"result": "errors", "detail": validation_errors}

Of the two endpoints listed above, only /v1/workflows/activities does the following additional check (which it does before calling the validate_json utility function):

# verify activities in activity_set are nmdc-schema compliant
for collection_name in activity_set:
    if collection_name not in activity_collection_names(mdb):
        raise ValueError("keys must be nmdc-schema activity collection names`")

I don't know why the additional check is necessary or why the validate_json utility function doesn't already do it. I don't really know how the /v1/workflows/activities endpoint is used in the real world.
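
To make the mismatch concrete, a small hypothetical illustration (the collection names here are stand-ins for activity_collection_names(mdb), which needs a live database): validate_json treats any known nmdc:Database collection as valid input, while the endpoint's pre-check rejects every key that is not an activity collection before validate_json ever runs.

# Hypothetical mixed payload: both keys are known nmdc:Database collections,
# so validate_json() above would validate both, but only "mags_activity_set"
# is an activity collection, so the pre-check raises first.
mixed_payload = {
    "mags_activity_set": [{"id": "nmdc:wfmag-11-zcwca422.5"}],
    "data_object_set": [{"id": "nmdc:dobj-12-gvntvq90"}],
}

# Stand-in for activity_collection_names(mdb), just for this illustration.
activity_collections = {"mags_activity_set", "metagenome_annotation_activity_set"}

for collection_name in mixed_payload:
    if collection_name not in activity_collections:
        # This is the branch that produced "keys must be nmdc-schema activity
        # collection names`" for the curl example above.
        print(f"pre-check rejects {collection_name!r}")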

dwinston added a commit that referenced this issue Mar 15, 2024
…ivities

Documents other than activities may be generated and submittable at the same time.

closes #462
dwinston added a commit that referenced this issue Mar 15, 2024
…ivities (#497)

Documents other than activities may be generated and submittable at the same time.

closes #462
dwinston added a commit that referenced this issue Apr 19, 2024
…ivities (#497)

Documents other than activities may be generated and submittable at the same time.

closes #462