Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Formal specification of Circe cohort definitions #156

Open
ablack3 opened this issue Feb 11, 2021 · 13 comments
Open

Formal specification of Circe cohort definitions #156

ablack3 opened this issue Feb 11, 2021 · 13 comments

Comments

@ablack3
Copy link

ablack3 commented Feb 11, 2021

Is there a document that formally specifies the json representation of circe cohorts and possibly a function that takes in json and returns true or false depending on if the json conforms to the circe specification?

@chrisknoll
Copy link
Collaborator

Not at the moment, but this could be provided by generating a JavaDoc from this class: https://github.com/OHDSI/circe-be/blob/master/src/main/java/org/ohdsi/circe/cohortdefinition/CohortExpression.java. This class definition should be considered the 'formal specification'. We are considering describing the Circe CohortExpression object with a json schema, such that you could use something like this R package that would allow you to provide a JSON document and check for validation issues related to a schema specification.

To do what you want (is a json object 'valid' for a cohort expression) you can try...catch on the CohortExpression.fromJson()to catch if a json is passable into a java object.

We'll leave this issue open as a ask to convert the cohort expression object model into a json schema specification.

@chrisknoll
Copy link
Collaborator

chrisknoll commented Feb 12, 2021

@ablack3 , would something like the attached file work? I'm not happy with some of the generated names, but it will validate a json file, I think.

CohortExpression.txt

@ablack3
Copy link
Author

ablack3 commented Feb 12, 2021

Yes I think so. I was mostly interested for improving my understanding of what was allowed in circe cohort definitions but was also thinking of writing an R function to validate cohort JSON which might be useful for Capr (assuming it does not already exist). Thanks!

@chrisknoll
Copy link
Collaborator

So, with this schema, you can use the R package jsonvalidate to see if the json you generate conforms with the schema I provided. Want to give it a shot?

@ablack3
Copy link
Author

ablack3 commented Feb 12, 2021

Oh great. Yes I'll give it a try with the phenotype library cohorts.

@ablack3
Copy link
Author

ablack3 commented Feb 12, 2021

I think it works! All 797 phenotype library cohorts pass the test.

# validate cohort json files
library(magrittr)

schemaPath <- "C:/Users/Adam Black/Google Drive/OHDSI_R_packages/PhenotypeLibrary/work/CohortExpression.json"
validateCohortJson <- jsonvalidate::json_validator(schemaPath)

pkgDir <- system.file(package = "PhenotypeLibrary")
phenotypeIds <- list.files(pkgDir) %>%
  stringr::str_subset("\\d+")

jsonFiles <- purrr::map(phenotypeIds, ~paste0(glue::glue("{pkgDir}/{.}/"), list.files(glue::glue("{pkgDir}/{.}")))) %>%
  unlist() %>%
  stringr::str_subset("\\.json")

validationResult <- purrr::map_lgl(jsonFiles, validateCohortJson)

mean(validationResult)
#> [1] 1

Created on 2021-02-12 by the reprex package (v0.3.0)

@chrisknoll
Copy link
Collaborator

you should also check that it identifies an invalid json. Make a copy of one that uses ConditionOccurrence , and change the expression:

{
  ConditionOccurrence: {...}
}

to:

{
  ConditionOccurrenceXXX: {...}
}

That should fail validation, I think.

@ablack3
Copy link
Author

ablack3 commented Feb 12, 2021

That passes unfortunately :/

@chrisknoll
Copy link
Collaborator

Hnmm, so...not sure what to think here. On one hand that could just look like another field that isn't invalid, just unused. I am not sure if it should be strick that every field has to be defined, but we make it so that older defs work fine in new versions (the older defs won't have the newer fields defined)...so...I need to think about what we need to do to validate...

Another test is: find a conceptId and set the json from a number to a string (like codesetId: 1 -> codesetId: 'invalid'.

@chrisknoll
Copy link
Collaborator

Hi,
I found tat I generated a schema in a certain JSON Schema version ('2019-09') but there are actually serveral schema versions (04, 06 and 07). I have generated a schema in 6 draft and 7 draft, and atttached here, maybe it will work different with the R library?

CohortExpression_06.txt
CohortExpression_07.txt

@ablack3
Copy link
Author

ablack3 commented Feb 15, 2021

I did a few small checks that I think should all be considered invalid. All three of the schema versions gave identical results.

  1. "CodesetId" : "blah" => Valid
  2. "CONCEPT_ID" : -24134 => Valid
  3. "CONCEPT_ID" : "blah" => Invalid
  4. "ConditionTypeExclude" : "blah" => Valid
  5. "isExcluded" : "blah" => Invalid
  6. "QualifiedLimit" : {"Type" : "blah"} => Valid

@chrisknoll
Copy link
Collaborator

I'll run some examples too, and maybe open an issue on the rjson validator repo with some simple test cases to demonstrate where we think it should result in an invalid eval.

@clarkevans
Copy link

Could the official version of this schema into the repository so that it could be referenced (and bug reports made?). Also, I noticed that some of the fields in the JSON out there are missing, while the "type" is not marked as being optional. For example, "Title" and "cdmVersionRange" is missing in many JSONs in the wild.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants