-
Let's try to answer these points as factually and precisely as possible:
This is an academic paper, an empirical study in the field of software engineering. Its scientific goals are to:
The last part of the paper is a new data structure description language based on JSON, which was initially a separate paper but was unfortunately merged into this one at the request of the program chair. The goal of the paper is not to promote JSON Schema; it is to study it, and to provide evidence-based feedback about the language design.
About the rationale and the controversies/disagreements so far, on each part:
You do not need to adopt all the suggested spec changes to get some of the benefit. There is a continuum of spec changes: some things can be forbidden in the syntax, others in the semantics, and some errors can be kept as-is if you think they do not need to be addressed because fixing them would harm some use cases and thus is not worth the benefit.
Dunno!
It seems that currently the rules depend on who you are: advocating breaking changes is okay for the spec team, but not so for outsiders with a differing agenda (in our case, an academic study: we are not serving the JSON Schema Community, we are not paid by a company which has a vested interest in promoting this technology because it sells services based on it; we are independent, which is a requirement for academics).
-
The paper has been peer reviewed by researchers. Indeed, it has not been reviewed by the JSON Schema people, but this is not a requirement from our perspective, although such a review would be fine. We read in detail most research papers about JSON Schema.
Please elaborate your claims, otherwise this is empty:
We tried to do a short blog post presenting the main results of the paper. The (interesting) discussion/debate you propose would be a longer one. Why not.
The paper has been accepted at a conference after being peer-reviewed. We are fine/would be happy to update the research report based on a detailed review by other people, if we find it appropriate. We may plan to submit an extended version to a Journal.
Hm... No, not really, IMHO. It is rather a less ambitious data description language.
Yes.
Mostly yes, but JSON Schema is not a programming language as such. Probably most data languages may allow some nonsense stuff at the syntax level, but much less so than a programming language because their power of expression/description is usually reduced compared to a full-fledged programming language, so there is less leeway for errors.
Sure. Our evidence shows unintentional errors in a majority of public schemas, without ambiguity in most cases.
Ok, this is your position. Too bad, no such linter is available for JSON Schema after 14 years and 10 drafts. There is a gradation of checks that can be performed on a DSL, from the syntax to the semantics to external tools: the current design puts very few constraints at the syntax/semantics level and leaves everything to a hypothetical linter. We indirectly prototyped one for our research. This research demonstrates that reasonable constraints on the syntax would detect most defects found by our tool in public schemas, a majority of which were proven defective. Our opinion is that the language design should be updated significantly in light of these findings. We are fine if you want to keep your language as is.
Ok. Our opinion is that you could/should tackle the issue from the spec instead of from the tooling, and this opinion is backed by evidence that this approach works for most purposes: we tested the reduced language, it detects most defects, and it is practical. Now you may argue that these constraints are unbearable/impractical/bad/whatever, but we have yet to be provided with concrete examples and use cases that can be investigated and discussed.
Ok... so what? The recommendations do work for avoiding defects, at the price of changing the spec.
Good! The paper basically describes the defects we investigated, and all our tooling is available online in the public domain; feel free to look at, use, and improve it.
Please do proceed, we are actually really interested in that point. Up to now we have not found any actual concrete example of a use case which would be made impractical by the restrictions we have suggested. If a few examples are found, then the cost/benefit of covering them with more or less convenience could be debated, but for now all we have had is "believe me", which is not factual enough.
Hm. Maybe. It may just be that a language which silently ignores typos makes it likely that typos are not found, but then we are back to the syntax and our suggestion that it should be much more rigid. ISTM that there is a cluster of mistaken keywords around min/max related keywords, though, which is why we proposed this explanation. As a professor, I have experienced that the more students have to memorize, the harder it is for them. As an engineer who has practiced dozens of programming languages in the last 40 years, memory is an issue. Ok, maybe age as well :-) Research shows that humans have a limited memory capacity (e.g. working memory, Miller 1956...). On the other hand, it is good to be able to give a word to a concept so that it can be named and discussed. So it is complicated. My opinion is that 60 keywords for describing JSON data structures looks like a lot anyway.
Yes, and no. We think that these typos happen because THEY ARE NOT DETECTED by tools (validators, ...). In a normal development process, when you have a typo in a programming language, the program does not even compile or cannot run, so usually it would not be committed, and if it is committed it would be fixed at the first occasion and the dev would get an earful from the lead, so they would strive to do clean commits. This self-correcting behavior does not happen with JSON Schema because typos are fine wrt the spec. This is basically what we are suggesting to change.
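To make this concrete, here is a minimal hypothetical schema (not one from our dataset) with a misspelled keyword; under the current drafts the unknown keyword is simply ignored, so the intended length constraint is silently dropped and any string is accepted:

```json
{
  "type": "string",
  "maxLenght": 10
}
```

A conforming validator reports no error here, whereas rejecting or at least flagging unknown keywords would have caught the typo immediately.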
Please pinpoint where our reasoning is wrong, so we can improve/correct the arguments and/or conclusions.
Interesting point. ISTM that we do not have enough data for that type of analysis.
We suggested removing keywords which are seldom used and present implementation challenges that do not seem justified for the purpose of describing data structures. Now if typos are not ignored and types are checked in the syntax, the issue of typo errors would be fixed, so the argument for reducing these is much weaker. Anyway, this is not the main point of our recommendations.
We partially disagree: it is not obvious that more expressive is better, it should be a case-by-case discussion: this feature allows writing this, which is simpler than that (maybe; define simpler), but it adds these constraints and this complexity to the implementation, and so on. Probably you did that with the 60 keywords, but the resulting spec is quite complex and people usually write schemas with defects (fact), and 1/3 of keywords are nearly never used in the public schemas we found (fact). Note that we also suggested adding a keyword to combine object properties, which would be a significant help in handling inheritance, a use case which does not seem well served by the current proposal.
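To illustrate the inheritance point (our own sketch with made-up names, not an excerpt from the paper): today, combining the properties of two object schemas is usually done with `allOf`, which composes assertions but does not merge the property sets into a single object description:

```json
{
  "allOf": [
    { "$ref": "#/$defs/base" },
    {
      "properties": {
        "extra": { "type": "integer" }
      },
      "required": ["extra"]
    }
  ],
  "$defs": {
    "base": {
      "properties": {
        "id": { "type": "string" }
      },
      "required": ["id"]
    }
  }
}
```

In particular, closing such a schema with `additionalProperties: false` at the top level would reject both `id` and `extra`, because `additionalProperties` only considers the adjacent `properties` keyword, not the ones reached through `allOf` or `$ref`; this is the kind of friction a dedicated property-combination keyword could remove.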
Yes, that is one example, there may be other posts as well.
Good.
This is debatable: the research process involves feedback from academic colleagues and anonymous reviewers who are experts in their field of research.
Well, it is what we are doing now by proposing a blog post and discussing the paper on the side.
Please do not presume about the level of diligence in the academic community!
Sure. We are doing that. We have tried to present and list the controversies in our first post above in this discussion.
This is noted. We'll see.
In research, the fact that you are paid by someone who has an interest in your results requires an explicit declaration because it taints the results. This cannot be helped.
-
Hi, this is Ben, Community Manager serving the JSON Schema Community. I am not an expert in JSON Schema, so I am not going to join the technical discussion going on in this issue; instead I'd like to start a parallel discussion regarding the benefits of this blog post for the JSON Schema Community: The first time I read the message, it appeared to me that your goal was just to present a brief highlight of your study, your findings, and your conclusions and then leave the forum, like you really don't care about how the broader community will perceive those findings and whether someone will be able to understand them to take some or all of the proposals forward to evolve the spec and/or the tooling. I am sure that that wasn't your intention, but it is what I personally perceived. I see this as an opportunity to improve JSON Schema, and this is why I'd like to collaborate with you on some elements of the blog:
I hope this makes sense. 🙏 🙏🏾
-
Ok. You may have gathered that I'm not a soft-spoken diplomat, so I must work on that. My colleague is much better at that, so with some care and active proofreading the situation should improve.
Fine with me.
The usual acceptance in computer science is that a programming language is a language which is somehow Turing-complete, whether it is general-purpose (GP) or domain-specific (DS) is irrelevant. A programming language usually includes a part about describing/declaring data structures, and a part about control flow. JSON Schema only focuses on the data structure side. There is no control flow as such (you may argue about if/then/else, but this is rather a means to express predicates about the data structure). A DSL may or may not be a programming language depending on what it does; there is no inclusion one way or the other.
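As a small illustration (a hypothetical schema with made-up property names): `if`/`then`/`else` does not execute anything, it just states a conditional constraint on the data, i.e. a predicate of the form "if the instance looks like X, then it must also satisfy Y":

```json
{
  "if": {
    "properties": { "country": { "const": "US" } },
    "required": ["country"]
  },
  "then": { "required": ["zipCode"] },
  "else": { "required": ["postalCode"] }
}
```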
Thanks for this contextual information.
As we have written a kind of linter (free and open source) for our paper, we can provide an opinion, and this objective seems rather elusive:
So even if a linter is available, it may not be widely used. A benefit is that the spec does not need to be changed explicitly, which we understand is a primary concern. The syntax-level language restriction strategy that we have put forward in our proposal would ensure that conforming tools detect many defects, and the implementation is easy to do and check (with the standard test suite), compared to constraints in the semantics (which means that all conforming tools would have to implement some kind of analysis) or through external tools (the linter option above: they must exist and you have to use them). There have been a few counter-arguments. Against our data and analyses:
About the proposed changes:
About the general logic: we claim that the changes we suggest fix the identified issues.
Thanks for the history. From the output, it was pretty clear that the road had been bumpy, so your explanations say why.
Ok, preserving backward compatibility is a usual and useful objective for such a project. This is somewhat of a change of direction as, up to now, the design seems mainly to have focused on allowing upward compatibility (e.g. accepting unknown keywords), and past versions have often broken backward compatibility, as the next one will. As academics, we are not committed to keeping fundamental concepts established a long time ago, especially if these concepts lead to error-prone schemas. We looked at how to try to improve the situation at the spec level. It is very unclear whether it is possible to preserve both worlds, but at least we have checked that a significant part of existing schemas already conform to our proposed changes.
If you introduce significant breaking changes, ISTM that the pain will be the same whether you go all the way or half way.
We understand that you mean to focus on backward compatibility, i.e. old schemas are accepted by newer versions, and that this will be true after the next version, so this version would be the last to include breaking changes. This suggests doing all the appropriate changes now. We are looking forward to a benevolent, constructive and productive discussion.
-
The first thing I want to cover is foundational concepts. You have classified JSON Schema as a data description language. While we are aware that many people try to use it in that capacity, that's not what it's designed for. When it is used for that purpose, people do encounter many of the difficulties the paper highlights. JSON Schema is actually designed to be a "data validation language". More specifically, it's a "JSON validation language" because it aims to validate JSON, not necessarily any data. I know this distinction may seem trivial, but when you view things from that perspective, I think it should help explain why some things work the way they do. (JSON Schema is also designed to be a "JSON annotation language", but that's not relevant to the paper, so we can ignore that aspect for the purposes of this discussion.)

As a validation language, each keyword represents a specific assertion about the data. That means that a JSON Schema is a collection of assertions to be applied to a JSON instance. A JSON instance is valid if it passes every assertion in the schema. An empty schema is making no assertions. In general, keywords are designed to assert one thing and be combined as necessary to make more complex assertions.

This is why schemas are "loose" by default. Being "strict" implies an assertion that properties that aren't mentioned in the schema aren't allowed. We use the terms "open" and "closed" generally in the same way you use "loose" and "strict". It's "open" because it's open for extension using composition.

The above are the foundational concepts of JSON Schema. If you want to study problems inherent in these foundational concepts, that's fine, but it's not constructive to ask us to change those concepts. If we did, it wouldn't be JSON Schema anymore. It would be something else. This foundation is going to lead to some good properties and some bad. It will work very well in some applications and be problematic in others. (I want to be clear I'm not saying all of your suggestions violate these fundamental concepts, just that any that do are not something we would consider changing.)

Here's an example to illustrate where these properties work to our advantage.

```json
{
  "type": "object",
  "properties": {
    "foo": { "type": "string" },
    "bar": {}
  },
  "additionalProperties": false,
  "if": {
    "properties": {
      "foo": { "const": "bar" }
    },
    "required": ["foo"]
  },
  "then": { "required": ["bar"] }
}
```
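For concreteness (these instances are mine, not part of the original example), here is how a draft 2019-09 or 2020-12 validator treats a few instances against the schema above. This one is invalid, because the `if` subschema matches and `then` then requires `bar`:

```json
{ "foo": "bar" }
```

These two are valid; in the second, the `if` subschema does not match (`foo` is not `"bar"`), so `then` is not applied, while `additionalProperties: false` still restricts the object to `foo` and `bar`:

```json
{ "foo": "bar", "bar": 42 }
```

```json
{ "foo": "baz" }
```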
I know it seems odd for additional properties to be ignored, but I wanted to point out that an important reason for that to be allowed is to support polymorphism. Let's say I have a schema that represents a vehicle and another that extends that schema to represent a car.
```json
{
  "type": "object",
  "properties": {
    "isHumanPowered": { "type": "boolean" }
    ...
  },
  "required": ["isHumanPowered", ...]
}
```

```json
{
  "$ref": "./vehicle",
  "properties": {
    "make": { "type": "string" },
    "model": { "type": "string" }
  },
  "required": ["make", "model"]
}
```

Let's say my application has an instance like this:
```json
{
  "isHumanPowered": false,
  "make": "Toyota",
  "model": "Tacoma"
}
```

The desired behavior is that the "make" and "model" properties of this JSON instance are ignored when validating against the vehicle schema. Any kind of vehicle should pass. We could also have a bicycle schema and that should be allowed as well. This is an example of a good reason to use an open schema, but your argument is well taken that this isn't always what people want and it's easy to forget to close the schema when you do want it closed.
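For what it's worth, one way to close such a composed schema in draft 2019-09 and later (a sketch, not part of the original example) is `unevaluatedProperties`, which, unlike `additionalProperties`, also takes into account the properties evaluated through the referenced vehicle schema:

```json
{
  "$ref": "./vehicle",
  "properties": {
    "make": { "type": "string" },
    "model": { "type": "string" }
  },
  "required": ["make", "model"],
  "unevaluatedProperties": false
}
```

With this, "isHumanPowered", "make", and "model" are allowed and anything else is rejected, while the vehicle schema itself stays open for other kinds of vehicles.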
The paper correctly identifies that the reason the meta-schema is loose is to allow for extensions. We agree 100% that this causes more problems than it needs to in order to support an extension feature. We have a plan to address this in the next release. We consider it important enough to make a breaking change to address this problem. People will have two ways to define custom keywords; one of them is to use the vocabulary system to define their keywords.

I'll have more next week. I hope this was helpful.
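For reference, the vocabulary system already exists in draft 2020-12: a custom meta-schema declares which vocabularies it uses via `$vocabulary`, roughly like this (the dialect and custom vocabulary URIs below are illustrative, not real ones):

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/my-dialect",
  "$vocabulary": {
    "https://json-schema.org/draft/2020-12/vocab/core": true,
    "https://json-schema.org/draft/2020-12/vocab/applicator": true,
    "https://json-schema.org/draft/2020-12/vocab/validation": true,
    "https://example.com/vocab/custom-keywords": false
  }
}
```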
-
@benjagm: It seems that the discussion about the paper contents has stalled. Too bad, shame on us. We have updated the blog entry to highlight the disagreements as we understood them and to add context and caveats, so as to possibly make it acceptable… Nevertheless, the blog entry is not about supporting JSON Schema as it is, thus it is somewhat controversial. Please consider deciding to: (1) accept the blog entry with its controversies, (2) ask for more changes… or (3) ditch the entry because you do not want dissenting opinions shown on the community web site.
-
Hi @zx80. As you know, we had a lengthy discussion showing everyone's efforts to learn from each other, so first of all we'd like to thank everyone for the effort in engaging in this conversation. After long consideration, in the last OCWM, we have decided not to publish the blog post and encourage you to use other channels of the JSON Schema Community, like GitHub Discussions, to continue the exchange of ideas.

The JSON Schema Blog's main goal is to promote JSON Schema adoption, and this is why the content needs to fulfill specific criteria as per the blog guidelines, and as said before, this was not the case. The JSON Schema Community is proud to be a diverse and safe space, and we encourage everyone to respectfully share their opinions, keeping in mind the JSON Schema Code of Conduct; here we celebrate diversity. We are just providing the appropriate channel for this type of content/discussion in this case. We are mindful that this can be frustrating at this point, but trust us, this was a difficult decision, and we took a lot of care in every step of the process. We hope you stay in the Community so we can continue learning from each other.
-
Hi everyone!
We recently received this blog proposal PR by researchers external to the JSON Schema Community sharing defects found in public schemas. Even though we are excited to receive new contributions in any possible way, the content differs from the initial purpose of the blog, and we are unclear about the goals of the publication.
We are respectful of everyone's work, and we recognize the effort behind the study backing the article; however, we'd like to start a conversation between the authors and the Community to make sure we publish content that supports the JSON Schema Community in the best way.
How to start?
Who can participate?
We are expecting primarily the blog authors and JSON Schema maintainers; however, everyone is invited and we'd love to have more opinions about this.