Identify which types of schema modifications are currently performed by yq #230

Open
pkalita-lbl opened this issue Oct 24, 2024 · 5 comments · May be fixed by #247

@pkalita-lbl
Collaborator

This task is to catalog the uses of yq in the submission schema build process. Once we have that list we should decide on how we can codify those into new sheets_and_friends functionality. The output of this task should be issues in the sheets_and_friends repo detailing the new features to add. The eventual goal is to have all schema modifications managed by sheets_and_friends and not have any dependence on yq.

See also:

@turbomam
Member

src/nmdc_submission_schema/schema/nmdc_submission_schema.yaml is a good example of a schema that needs modifications

@turbomam
Member

grep yq project.Makefile | grep -v _yq | sort | sed 's/^[[:space:]]*//'
# use yq for global modifications
# use yq to add examples when the examples themselves include the packed value separator |
# use yq to add patterns with a secondary condition like mutivalued
# # using | cat > because yq fails to write to STDOUT (permissions error?!)
yq eval-all \
yq eval-all \
#	yq eval-all \
#	yq -i '(.classes.[].slot_usage.[] | select(has("range") | not  ) | .range ) = "string"' $@
yq -i '(.classes.[].slot_usage.[] | select(.name=="chem_administration") | .examples) = [{"value": "agar [CHEBI:2509];2018-05-11|agar [CHEBI:2509];2018-05-22"}, {"value": "agar [CHEBI:2509];2018-05"}]' $@
#	yq -i '(.classes.[].slot_usage.[] | select(.name == "dna_dnase") | .range) = "boolean"' $@
yq -i '(.classes.[].slot_usage.[] | select(.name == "dna_dnase") | .range) = "YesNoEnum"' $@
yq -i '(.classes.[].slot_usage.[] | select(.name == "dnase_rna") | .range) = "YesNoEnum"' $@
yq -i '(.classes.[].slot_usage.[] | select(.name == "oxy_stat_samp") | .range) = "OxyStatSampEnum"' $@
yq -i '(.classes.[].slot_usage.[] | select(.range == "GeolocationValue")  | .pattern) = "^[-+]?([1-8]?\d(\.\d{1,8})?|90(\.0{1,8})?)\s[-+]?(180(\.0{1,8})?|((1[0-7]\d)|([1-9]?\d))(\.\d{1,8})?)$$"' $@
yq -i '(.classes.[].slot_usage.[] | select(.range == "GeolocationValue")  | .range) = "string"' $@
yq -i '(.classes.[].slot_usage.[] | select(.range == "QuantityValue" and .multivalued == true)  | .pattern) = "^([-+]?[0-9]*\.?[0-9]+ +\S.*\|)*([-+]?[0-9]*\.?[0-9]+ +\S.*)$$"' $@
yq -i '(.classes.[].slot_usage.[] | select(.range == "QuantityValue") | .pattern) = "^[-+]?[0-9]*\.?[0-9]+ +\S.*$$"' $@
yq -i '(.classes.[].slot_usage.[] | select(.range=="string") | .multivalued) = false' $@
#yq -i '(.classes.[].slot_usage.[] | select(.string_serialization=="{termLabel} {[termID]}") | .range) = "string"' $@
yq -i '(.classes.[].slot_usage.[] | select(.string_serialization=="{text};{float} {unit}" and .multivalued == true ) | .pattern) = "^([^;\t\r\x0A]+;[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)? [^;\t\r\x0A]+\|)*([^;\t\r\x0A]+;[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)? [^;\t\r\x0A]+)$$"' $@
yq -i '(.classes.[].slot_usage.[] | select(.string_serialization=="{text};{float} {unit}") | .pattern) = "^[^;\t\r\x0A\|]+;[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)? [^;\t\r\x0A\|]+$$"' $@
yq -i 'del(.classes.Activity)'  $@
yq -i 'del(.classes.Agent)'  $@
yq -i 'del(.classes.AttributeValue)'  $@
yq -i 'del(.classes.ControlledIdentifiedTermValue)'  $@
yq -i 'del(.classes.ControlledTermValue)'  $@
yq -i 'del(.classes.GeolocationValue)'  $@
yq -i 'del(.classes.JgiMgInterface.rules.[] | select(.title == "rna*"))' [email protected]
yq -i 'del(.classes.JgiMtInterface.rules.[] | select(.title == "dna*"))' [email protected]
yq -i 'del(.classes.OntologyClass)'  $@
yq -i 'del(.classes.QuantityValue)'  $@
yq -i 'del(.classes.[].slot_usage.[] | select(.multivalued == "false").inlined)' $@
yq -i 'del(.classes.[].slot_usage.[] | select(.multivalued == "false").inlined_as_list)' $@
yq -i 'del(.classes.TextValue)'  $@
yq -i 'del(.classes.TimestampValue)'  $@
yq -i 'del(.slots.[] | select(.multivalued == "false").inlined)' $@
yq -i 'del(.slots.[] | select(.multivalued == "false").inlined_as_list)' $@
yq -i 'del(.slots.[] | select(.name == "acted_on_behalf_of"))' $@
yq -i 'del(.slots.[] | select(.name == "ended_at_time"))' $@
yq -i 'del(.slots.[] | select(.name == "has_maximum_numeric_value"))' $@
yq -i 'del(.slots.[] | select(.name == "has_minimum_numeric_value"))' $@
yq -i 'del(.slots.[] | select(.name == "has_numeric_value"))' $@
yq -i 'del(.slots.[] | select(.name == "has_raw_value"))' $@
yq -i 'del(.slots.[] | select(.name == "has_unit"))' $@
yq -i 'del(.slots.[] | select(.name == "latitude"))' $@
yq -i 'del(.slots.[] | select(.name == "longitude"))' $@
yq -i 'del(.slots.[] | select(.name == "started_at_time"))' $@
yq -i 'del(.slots.[] | select(.name == "term"))' $@
yq -i 'del(.slots.[] | select(.name == "used"))' $@
yq -i 'del(.slots.[] | select(.name == "was_associated_with"))' $@
yq -i 'del(.slots.[] | select(.name == "was_generated_by"))' $@
yq -i 'del(.slots.[] | select(.name == "was_informed_by"))' $@
yq -i 'del(.slots.[] | select(.name == "was_informed_by"))' $@
yq -i '(.slots.[] | select(.domain == "Activity") | .domain ) = "NamedThing"' $@
yq -i '(.slots.[] | select(.domain == "Agent") | .domain ) = "NamedThing"' $@
yq -i '(.slots.[] | select(.domain == "AttributeValue") | .domain ) = "NamedThing"' $@
yq -i '(.slots.[] | select(.domain == "AttributeValue") | .domain ) = "NamedThing"' $@
yq -i '(.slots.[] | select(.domain == "ControlledTermValue") | .domain ) = "NamedThing"' $@
yq -i '(.slots.[] | select(.domain == "GeolocationValue") | .domain ) = "NamedThing"' $@
#	yq -i '(.slots.[] | select(has("range") | not  ) | .range ) = "string"' $@
#	yq -i '(.slots.[] | select(.name == "dna_dnase") | .range) = "boolean"' $@
yq -i '(.slots.[] | select(.name == "dna_dnase") | .range) = "YesNoEnum"' $@
yq -i '(.slots.[] | select(.name == "dnase_rna") | .range) = "YesNoEnum"' $@
yq -i '(.slots.[] | select(.name == "oxy_stat_samp") | .range) = "OxyStatSampEnum"' $@
yq -i '(.slots.[] | select(.name == "oxy_stat_samp") | .range) = "rel_to_oxygen_enum"' $@
yq -i '(.slots.[] | select(.name == "rel_to_oxygen") | .range) = "rel_to_oxygen_enum"' $@
yq -i '(.slots.[] | select(.name == "sample_link") | .range ) = "string"' $@
yq -i '(.slots.[] | select(.range == "ControlledIdentifiedTermValue") | .range) = "string"' $@
yq -i '(.slots.[] | select(.range == "ControlledTermValue") | .range) = "string"' $@
yq -i '(.slots.[] | select(.range == "GeolocationValue") | .range) = "string"' $@
yq -i '(.slots.[] | select(.range == "OntologyClass") | .range) = "string"' $@
yq -i '(.slots.[] | select(.range == "QuantityValue") | .range) = "string"' $@
yq -i '(.slots.[] | select(.range == "string") | .multivalued ) = false' $@
yq -i '(.slots.[] | select(.range == "TextValue") | .range) = "string"' $@
yq -i '(.slots.[] | select(.range == "TimestampValue") | .range) = "string"' $@
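
A first pass at the catalog can be done mechanically by bucketing each command in the listing above by what it targets and what it does. The following is a minimal sketch, not part of the build: the function name `classify` and the bucket labels are made up for illustration, and the substring checks only cover the expression shapes that appear in this listing.

```python
import re
from collections import Counter


def classify(command: str) -> str:
    """Return a coarse label for one line of the grep output (hypothetical labels)."""
    line = command.strip()
    if line.startswith("#"):
        return "commented out"
    if not line.startswith("yq -i"):
        return "other"
    # Which part of the schema does the expression address?
    if ".slot_usage.[]" in line:
        target = "slot_usage"
    elif ".slots.[]" in line:
        target = "slot"
    elif ".rules.[]" in line:
        target = "class rule"
    elif ".classes." in line:
        target = "class"
    else:
        target = "unknown"
    if "del(" in line:
        # del(... select(...).inlined) removes one attribute, not the whole element.
        removes_attribute = re.search(r"select\([^)]*\)\.\w+\)", line) is not None
        action = "delete attribute of" if removes_attribute else "delete"
    else:
        action = "set attribute of"
    return f"{action} {target}"


# A few lines copied from the listing above.
SAMPLE = r"""
# use yq for global modifications
yq -i 'del(.classes.Activity)'  $@
yq -i 'del(.slots.[] | select(.name == "term"))' $@
yq -i 'del(.slots.[] | select(.multivalued == "false").inlined)' $@
yq -i '(.slots.[] | select(.range == "TextValue") | .range) = "string"' $@
yq -i '(.classes.[].slot_usage.[] | select(.name == "dna_dnase") | .range) = "YesNoEnum"' $@
"""

tally = Counter(classify(line) for line in SAMPLE.splitlines() if line.strip())
for label, count in sorted(tally.items()):
    print(f"{count:3d}  {label}")
```

Running the same function over the full grep output would give a rough count per kind of modification, which could then be turned into one sheets_and_friends issue per kind.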

@turbomam
Member

turbomam commented Oct 29, 2024

broadly speaking, we might want to

  • change attributes of a slot, a slot usage, or both, based on some criteria
  • delete slots or slot usages based on some criteria
  • change attributes of a class or delete classes based on some criteria
  • change attributes of an enum or delete enums based on some criteria
  • change attributes of a permissible value or delete permissible values based on some criteria

Or we might do any of those without any criteria at all, such as removing the name attribute from every element that has one, with the goal of slimming the schema.

We might also want non-binary criteria, such as testing whether a value belongs to a set, for example a set of enum names or type names.
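
To make the operations above concrete, here is a minimal sketch of "change an attribute", "delete an attribute", and "delete an element", each with optional criteria. It assumes the schema has already been loaded into plain dicts and that each section (slots, classes, enums) maps an element name to a dict of attributes. The function names are hypothetical and are not the sheets_and_friends API.

```python
from typing import Any, Callable, Optional

Element = dict[str, Any]
Criteria = Optional[Callable[[Element], bool]]


def _matches(element: Element, criteria: Criteria) -> bool:
    # No criteria means "apply to every element".
    return criteria is None or criteria(element)


def set_attribute(elements: dict[str, Element], attr: str, value: Any,
                  criteria: Criteria = None) -> int:
    """Set attr to value on every matching element; return how many changed."""
    changed = 0
    for element in elements.values():
        if _matches(element, criteria):
            element[attr] = value
            changed += 1
    return changed


def delete_attribute(elements: dict[str, Element], attr: str,
                     criteria: Criteria = None) -> int:
    """Remove attr from every matching element that has it."""
    removed = 0
    for element in elements.values():
        if _matches(element, criteria) and attr in element:
            del element[attr]
            removed += 1
    return removed


def delete_elements(elements: dict[str, Element], criteria: Criteria = None) -> int:
    """Remove every matching element from the mapping."""
    doomed = [name for name, element in elements.items() if _matches(element, criteria)]
    for name in doomed:
        del elements[name]
    return len(doomed)


# Toy data shaped like the slots section of a schema (illustrative values only).
slots = {
    "lat_lon": {"name": "lat_lon", "range": "GeolocationValue"},
    "depth": {"name": "depth", "range": "QuantityValue"},
    "latitude": {"name": "latitude", "range": "float"},
}

set_attribute(slots, "range", "string", lambda s: s.get("range") == "GeolocationValue")
delete_elements(slots, lambda s: s.get("name") == "latitude")
delete_attribute(slots, "name")  # no criteria: slim every remaining slot
print(slots)
```

The same three functions would work for slot usages, classes, enums, or permissible values, as long as the caller passes the right mapping.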

@turbomam
Member

what is the best specification format?

  • TSV (tabular)
  • YAML (or JSON for arbitrary docs with no explicit schema)
  • LinkML YAML or JSON
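
One way to compare the options is to write the same two rules in more than one format and check that they parse to the same structure. The sketch below does that for TSV and JSON using only the standard library. The column names (`target`, `match_attr`, `match_value`, `action`, `set_attr`, `set_value`) are invented for illustration and are not an existing sheets_and_friends format.

```python
import csv
import io
import json

# Two hypothetical rules: set range to string where range is QuantityValue,
# and delete the slot named "term".
TSV_SPEC = (
    "target\tmatch_attr\tmatch_value\taction\tset_attr\tset_value\n"
    "slots\trange\tQuantityValue\tset\trange\tstring\n"
    "slots\tname\tterm\tdelete\t\t\n"
)

JSON_SPEC = """
[
  {"target": "slots", "match_attr": "range", "match_value": "QuantityValue",
   "action": "set", "set_attr": "range", "set_value": "string"},
  {"target": "slots", "match_attr": "name", "match_value": "term",
   "action": "delete", "set_attr": "", "set_value": ""}
]
"""


def rules_from_tsv(text: str) -> list[dict[str, str]]:
    """Parse one rule per TSV row, keyed by the header line."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


def rules_from_json(text: str) -> list[dict[str, str]]:
    """Parse a JSON array of rule objects."""
    return json.loads(text)


print(rules_from_tsv(TSV_SPEC) == rules_from_json(JSON_SPEC))
```

A flat table like this handles one criterion per rule well. Rules with several criteria or with sets of allowed values fit more naturally in a nested format such as YAML or JSON.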

@turbomam
Member

Are there cases where we would want to require satisfying multiple criteria before doing an operation?

Are there cases where we would want to batch operations together in a more succinct form, like setting the range of slots to string if their range had been ControlledTermValue or GeolocationValue or QuantityValue etc.?
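
Both questions can be handled by one matching rule: every listed attribute must match (logical AND), and a criterion may be either a single value or a collection of allowed values. This is a minimal sketch under the assumption that elements are plain dicts with scalar attribute values; `matches` and `set_where` are hypothetical names.

```python
from typing import Any

Element = dict[str, Any]


def matches(element: Element, criteria: dict[str, Any]) -> bool:
    """True if the element satisfies every criterion.

    A criterion value that is a set, frozenset, list, or tuple means
    "the attribute must be one of these"; anything else means equality.
    """
    for attr, expected in criteria.items():
        actual = element.get(attr)
        if isinstance(expected, (set, frozenset, list, tuple)):
            if actual not in expected:
                return False
        elif actual != expected:
            return False
    return True


def set_where(elements: dict[str, Element], criteria: dict[str, Any],
              attr: str, value: Any) -> list[str]:
    """Set attr to value on matching elements; return the names that changed."""
    changed = []
    for name, element in elements.items():
        if matches(element, criteria):
            element[attr] = value
            changed.append(name)
    return changed


# Toy slots (illustrative values only).
slots = {
    "lat_lon": {"range": "GeolocationValue"},
    "depth": {"range": "QuantityValue"},
    "temps": {"range": "QuantityValue", "multivalued": True},
    "env_medium": {"range": "ControlledTermValue"},
    "notes": {"range": "string"},
}

# Multiple criteria: only multivalued QuantityValue slots get the marker.
multi = set_where(slots, {"range": "QuantityValue", "multivalued": True},
                  "needs_list_pattern", True)

# Batched operation: one rule instead of one yq command per range.
batch = set_where(
    slots,
    {"range": {"ControlledTermValue", "GeolocationValue", "QuantityValue"}},
    "range",
    "string",
)
print(multi, batch)
```

The `needs_list_pattern` attribute is only a placeholder to show a multi-criteria rule; in the build it would be the pattern assignment from the listing above.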

@turbomam linked a pull request Oct 30, 2024 that will close this issue