-
-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update and improve json schemas to draft 7 #161
update and improve json schemas to draft 7 #161
Conversation
Great, thanks. Will you address the merge conflict? And can you briefly explain how you managed to upgrade the schemas? Are there software libraries to upgrade schemas to newer versions, or was it hand-work?
Can you give an example of what I should be looking for in this PR? |
Whoops, didn't see that there was a conflict.
Mostly by hand.. I used quicktype to quickly generate raw object definitions, but most of those had to be adjusted for better accuracy. Then I used the validator I liked above to double check that the object types checked out using a few known-accurate objects. I also used a few schemas on schemastore.org for inspiration.
I'll annotate some lines where changes were made and where I think the schema can be improved in this PR after I fix the conflict. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a few comments. I probably forgot some stuff, but that should start the conversation for now...
"type": "array", | ||
"items": { | ||
"type": "string", | ||
"format": "uri" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to specifically require uri
format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be breaking so I'm gonna leave it.
}, | ||
"additionalProperties": false | ||
} | ||
}, | ||
"type": "object", | ||
"properties": { | ||
"schema": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See https://github.com/rmzelle/ref-extractor/wiki#single-item-zotero-citation again. Zotero and Mendeley both reference the "csl-citation.json" schema in their embedded citation metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but why cant they use JSON schema as it's indended and just reference the base schema using the top-level $id
property? It seems redundant and unnecessary.
}, | ||
"additionalProperties" : false | ||
"required": ["schema", "citationID", "citationItems"], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to include citationItems
because it should always return an array, even if empty IMHO.
}, | ||
{ | ||
"properties": { | ||
"circa": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boolean? What is this exactly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a toggle to indicate whether a date is uncertain. See http://docs.citationstyles.org/en/stable/specification.html#approximate-dates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be narrowed to just boolean then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like citeproc-js might be using "circa": 1
, but yeah, just allowing a boolean would be neater.
csl-data.json
Outdated
"type": [ | ||
"number" | ||
], | ||
"minimum": 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to be a better spec fit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove the "minimum"/"maximum" for "season" here. (I don't think the range makes sense, and it's unrelated to the schema draft update, so to be discussed separately)
"maxItems": 3 | ||
}, | ||
"maxItems": 2 | ||
"literal": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made this an "either-or" between literal
and the rest of the fields.
If no literal
, then it should at least have family
(see below)
"suffix": { | ||
"type": "string" | ||
}, | ||
"comma-suffix": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what this is, nor static-ordering
or parse-names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two attributes were cruft from the early days. I've removed them from fixtures in the test suite, they can be removed from the schema.
Just realized that I removed I'll add that fix in after I hear back. |
@rmzelle Yikes... also just realized that there are Specifically, these: |
@rmzelle What is the difference between |
None of these have a corresponding CSL variable with the same name. "language" indicates the language of the item in question, and is used by the CSL processor to determine if special processing is necessary (see e.g. http://docs.citationstyles.org/en/stable/specification.html#non-english-items). We chose not to make it a CSL variable, since it generally wouldn't make much sense to print the field contents in a formatted reference. The field contents should ideally be an easily parseable locale code. And even if the field would contain a human readable language descriptor (e.g. "English"), we don't have the infrastructure in place to translate that for CSL styles set to non-English locales (e.g. to "Anglais" for French).
Regarding "journalAbbreviation" and "shortTitle", CSL originally only made the abbreviated "short" versions of "container-title" and "title" available in CSL styles through |
@dsifford, do you still have the versions of the CSL JSON schemas on hand right after you upgraded them from schema draft 3 to 7? I'm happy to review any changes you'd like to suggest to the schema itself, but I much rather have separate pull requests for the schema upgrade and such changes. |
Yep, I still have access to the old schemas through git. None of the changes (to my knowledge) in this PR currently will affect backwards compatibility. Though a couple will make checks a bit more strict (as noted above). After I get around to adding in those 3 missing fields, do you have any other suggestions from there? |
It would be better to put anything that is not purely related to the draft-3 to draft-7 upgrade in one or more separate pull request(s).
Add which fields where? |
|
Right, but I removed them in this PR. I have to add them back in. |
Ah, okay. Yes. Sorry for the confusion. |
I created a PR into this PR to update the Python test suite: dsifford#1. In short, we must upgrade the version of the python There is still one failure in |
Nice job @dhimmel... Just pushed 2 of the spec fixes that were talked about above that add back in a few fields I missed. Only thing left to do is rewind some things that aren't fully 100% backwards compatible and then we can probably merge this and open a new PR for draft-7 related schema updates. |
Update tests for draft-07 JSON Schemas
"type": "string" | ||
} | ||
}, | ||
"additionalProperties" : false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"additionalProperties" : false` is removed, which is why the test fails. Is that Intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, it was not.
Although, just so you're aware, there are some large APIs that currently serve "CSL JSON" responses that are peppered with all kinds of additional properties, so not sure how that would possibly affect those API consumers/maintainers...
Here's an example of a typical response from CrossRef
{ "indexed": { "date-parts": [[2018, 12, 18]], "date-time": "2018-12-18T19:41:10Z", "timestamp": 1545162070737 }, "reference-count": 0, "publisher": "BMJ", "license": [ { "URL": "http://www.bmj.com/company/legal-information/terms-conditions/legal-information/tdm-licencepolicy", "start": { "date-parts": [[2018, 12, 13]], "date-time": "2018-12-13T00:00:00Z", "timestamp": 1544659200000 }, "delay-in-days": 0, "content-version": "tdm" } ], "content-domain": { "domain": ["bmj.com"], "crossmark-restriction": true }, "abstract": "Abstract\n \n Objective\n To determine if using a parachute prevents death or major traumatic injury when jumping from an aircraft.\n \n \n Design\n Randomized controlled trial.\n \n \n Setting\n Private or commercial aircraft between September 2017 and August 2018.\n \n \n Participants\n 92 aircraft passengers aged 18 and over were screened for participation. 23 agreed to be enrolled and were randomized.\n \n \n Intervention\n Jumping from an aircraft (airplane or helicopter) with a parachute versus an empty backpack (unblinded).\n \n \n Main outcome measures\n Composite of death or major traumatic injury (defined by an Injury Severity Score over 15) upon impact with the ground measured immediately after landing.\n \n \n Results\n \n Parachute use did not significantly reduce death or major injury (0% for parachute\n v\n 0% for control; P>0.9). This finding was consistent across multiple subgroups. Compared with individuals screened but not enrolled, participants included in the study were on aircraft at significantly lower altitude (mean of 0.6 m for participants\n v\n mean of 9146 m for non-participants; P<0.001) and lower velocity (mean of 0 km/h\n v\n mean of 800 km/h; P<0.001).\n \n \n \n Conclusions\n Parachute use did not reduce death or major traumatic injury when jumping from aircraft in the first randomized evaluation of this intervention. However, the trial was only able to enroll participants on small stationary aircraft on the ground, suggesting cautious extrapolation to high altitude jumps. When beliefs regarding the effectiveness of an intervention exist in the community, randomized trials might selectively enroll individuals with a lower perceived likelihood of benefit, thus diminishing the applicability of the results to clinical practice.\n ", "DOI": "10.1136/bmj.k5094", "type": "article-journal", "created": { "date-parts": [[2018, 12, 13]], "date-time": "2018-12-13T22:34:15Z", "timestamp": 1544740455000 }, "page": "k5094", "update-policy": "http://dx.doi.org/10.1136/crossmarkpolicy", "source": "Crossref", "is-referenced-by-count": 0, "title": "Parachute use to prevent death and major trauma when jumping from aircraft: randomized controlled trial", "prefix": "10.1136", "author": [ { "given": "Robert W", "family": "Yeh", "sequence": "first", "affiliation": [] }, { "given": "Linda R", "family": "Valsdottir", "sequence": "additional", "affiliation": [] }, { "given": "Michael W", "family": "Yeh", "sequence": "additional", "affiliation": [] }, { "given": "Changyu", "family": "Shen", "sequence": "additional", "affiliation": [] }, { "given": "Daniel B", "family": "Kramer", "sequence": "additional", "affiliation": [] }, { "given": "Jordan B", "family": "Strom", "sequence": "additional", "affiliation": [] }, { "given": "Eric A", "family": "Secemsky", "sequence": "additional", "affiliation": [] }, { "given": "Joanne L", "family": "Healy", "sequence": "additional", "affiliation": [] }, { "given": "Robert M", "family": "Domeier", "sequence": "additional", "affiliation": [] }, { "given": "Dhruv S", "family": "Kazi", "sequence": "additional", "affiliation": [] }, { "given": "Brahmajee K", "family": "Nallamothu", "sequence": "additional", "affiliation": [] } ], "member": "239", "published-online": { "date-parts": [[2018, 12, 13]] }, "container-title": "BMJ", "original-title": [], "language": "en", "link": [ { "URL": "http://data.bmj.org/tdm/10.1136/bmj.k5094", "content-type": "unspecified", "content-version": "vor", "intended-application": "text-mining" }, { "URL": "https://syndication.highwire.org/content/doi/10.1136/bmj.k5094", "content-type": "unspecified", "content-version": "vor", "intended-application": "similarity-checking" } ], "deposited": { "date-parts": [[2018, 12, 18]], "date-time": "2018-12-18T19:05:10Z", "timestamp": 1545159910000 }, "score": 1.0, "subtitle": [], "short-title": [], "issued": { "date-parts": [[2018, 12, 13]] }, "references-count": 0, "alternative-id": ["10.1136/bmj.k5094"], "URL": "http://dx.doi.org/10.1136/bmj.k5094", "relation": {}, "ISSN": ["0959-8138", "1756-1833"], "subject": ["General Medicine"], "container-title-short": "BMJ" }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so not sure how that would possibly affect those API consumers/maintainers...
Not at all... they currently return CSL that does not abide by the JSON Schema. I.e. this is the status quo. For Manubot, we actually use the JSON schema to filter the CSL returned by Crossref and remove all the crap.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we continue to use allOf
and use refs as I do in this PR, then there's no way currently possible to have additionalProperties: false
. So pick your poison. I personally think it's fine, but if you feel strongly about it, we'd have to scrap this.
cc: @rmzelle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping @rmzelle -- My comment above still holds relevance.
@dsifford, I haven't merged this yet since it's my understanding you still have some rewinding to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review of csl-data.json. I checked all the changes made to csl-data.json by eye, and identified a few which should probably be put in a separate PR. Also a few suggestions on how to organize the definitions.
csl-data.json
Outdated
"type": "number" | ||
}, | ||
"maxItems": 3, | ||
"minItems": 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put this in a new pull request, as it makes the schema stricter.
"type": "array", | ||
"items": { | ||
"type": "number" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is stricter than before (was "string" or "number"). Separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zotero currently formats years as strings rather than integers during CSL export.
I think upgrading the schema draft is a great enhancement. However, I think it is crucial that backwards incompatible changes are not made in the initial draft change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few things that got more strict that I didn't flag, like setting required fields for date-parts (either a "date-parts", "literal" or "raw" date), but those seemed pretty basic. Agree we should be careful here.
csl-data.json
Outdated
"minItems": 1 | ||
}, | ||
"maxItems": 2, | ||
"minItems": 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, minItems is stricter here.
csl-data.json
Outdated
"season": { | ||
"type": [ | ||
"number" | ||
], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Value of "season" is stricter than before (was "string" or "number"). Separate PR?
}, | ||
{ | ||
"properties": { | ||
"circa": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like citeproc-js might be using "circa": 1
, but yeah, just allowing a boolean would be neater.
csl-data.json
Outdated
"type": [ | ||
"number" | ||
], | ||
"minimum": 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove the "minimum"/"maximum" for "season" here. (I don't think the range makes sense, and it's unrelated to the schema draft update, so to be discussed separately)
csl-data.json
Outdated
"additionalProperties" : false | ||
"required": [ | ||
"family" | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure we can require "family".
I just checked and a Zotero user can store just a given name, which gives this as exported CSL JSON:
{
"family": "Nielsen",
"given": "Jens"
},
{
"family": "",
"given": "bla"
}
Other implementations might leave out the field entirely.
}, | ||
"language": { | ||
"type": "string" | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it would make sense to carve out a separate definition for non-CSL variable metadata fields like "language"?
"type": "string" | ||
}, | ||
"journalAbbreviation": { | ||
"type": "string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should use a separate definition here too, to indicate that "journalAbbreviation" is deprecated in favor of "container-title-short". Idem for "shortTitle" and "title-short".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review of csl-citation.json.
"string", | ||
"number" | ||
] | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original schema requires "id" for each citation item, which is a requirement we should keep.
] | ||
}, | ||
"itemData": { | ||
"$ref": "csl-data.json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tried using your version schema, but are you sure this shouldn't be "$ref": "csl-data.json/#/items"
like in the original?
Took changes to test suite from #161 (and https://github.com/citation-style-language/schema/pull/161/commits/b3543 2493553c10532dbbc9b7150c7ed303d2ddd specifically)
"version": { | ||
"type": "string" | ||
}, | ||
"year-suffix": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a property of the item in rendered context. Any input value would be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So can this be removed?
"call-number": { | ||
"type": "string" | ||
}, | ||
"citation-label": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is generated by processors, not set on input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So can this be removed?
"citation-label": { | ||
"type": "string" | ||
}, | ||
"citation-number": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a property of the item in context. Processors generate the value internally. Any input value will be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So can this be removed?
"event-place": { | ||
"type": "string" | ||
}, | ||
"first-reference-note-number": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a processor-generated value that depends on the position of references in a document. Any value set on input will be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So can this be removed?
I'll try and look at this again this weekend and address issues raised. Sorry for stalling! |
@dhimmel, I prepared a minimal update to draft 6 (it looks like no changes are necessary for draft 7) at #168, inspired by this PR. I would prefer to merge that, and then create one or more additional PRs to pull in improvements/changes proposed here (such as using proper definitions etc.). |
@rmzelle Ah, shoot. I missed your message. I went ahead and addressed all(?) of the comments/concerns raised here. It should theoretically be good to go. But your call. |
Note @dsifford that I'm not a maintainer of this repo, but I appreciate the work you've put in, and hope to see as much of it integrated as makes sense. Is it possible to split this PR up into smaller ones following #168? I think it will help with review and the pace of things, if the PRs are very focused and do one and only one thing. In addition, each change should indicate whether it functionally changes the schema or not.
Note that you have a test failure of @pytest.mark.skip(reason="Setting additionalProperties to false not possible with allOf") |
Any blockers on getting this merged? I'm working on a downstream package and would prefer not to have to start everything over when the updated schema lands. |
Closed this in favor of #168, which have now been merged. Any missing pieces can be re-added via dedicated PRs. |
Follow up from #153, closes #127.
This should be backwards compatible with the old schemas. And because of that, there's some glaring issues with some of the fields and types scatter throughout that were grandfathered in. These I think should be discussed and improved upon in another PR.
If you'd like, copy and paste the
csl-data.json
schema to this validator to give it a shot. Thecsl-citation.json
schema will not validate because it references the current broken schema url.Let me know if you have any questions.
Ping: @rmzelle @dhimmel