Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Email format not properly validated #403

Closed
staab opened this issue May 8, 2018 · 11 comments
Closed

Email format not properly validated #403

staab opened this issue May 8, 2018 · 11 comments

Comments

@staab
Copy link
Contributor

staab commented May 8, 2018

from jsonschema import Draft4Validator, FormatChecker

validator = Draft4Validator({'type': 'string', 'format': 'email'}, format_checker=FormatChecker())
validator.validate('[email protected] ')

The above code doesn't raise an error, but should, at least as far as I understand RFC 5322.


Upon further investigation, your email checking function is terrible. I've been trusting this library to handle the hard parts of validation for me (particularly emails!), but even a regex would have been a better way to validate emails. Just for instance, this passes validation:

validator.validate('#@%^%#$@#$@#')

I'd really like to see the email validator pass this suite of tests.

It's fair that you haven't implemented the validation, since the jsonschema spec doesn't require format validation, but partway implementing it, and not warning users (either through raising errors on schemas relying on incompletely-implemented formats, or through documentation), that your format checker doesn't actually validate against the RFC, is pretty bad. My trust in this library has been shaken, and I'll be auditing my code to add custom validation logic where necessary, or switching away entirely.

@staab staab changed the title {'format': 'email'} not properly validated Email format not properly validated May 8, 2018
@Julian
Copy link
Member

Julian commented May 8, 2018

The email validator is intentionally unimplemented, because validating email addresses isn't really a reasonable concept. It's the one and only way that jsonschema intentionally deviates from the specification, its done to prevent downstream users from footgunning themselves with a concept that I wish didn't exist in the world, and it was my intention to voice upstream that I think we should remove the email validator. I could have sworn that was noted in the docs, but it's certainly noted in previous issue tickets about this. Patches are certainly welcome.

That being said, your attitude and commentary isn't welcome. Use the library if it helps you, don't if it doesn't, I don't really care about your "trust" in the library.

@staab
Copy link
Contributor Author

staab commented May 8, 2018

I'm sorry, I thought I did a thorough job of checking previous issues and the docs, but I missed those caveats.

My intention here wasn't to complain, I've enjoyed and stuck with this library for about 4 years now, so thanks for your work on it. I am concerned, though, that I spent all that time not validating email addresses, and I imagine your other thousands of users might be in the same place, particularly given the number of issues similar to this one. This did cause a (small) problem in production for us, so if you want to prevent downstream users from footgunning themselves, maybe do a logger.warn in the email or schema validator?

Could you link to the discussion about potentially removing the validator, just out of curiosity?

@staab
Copy link
Contributor Author

staab commented May 8, 2018

After looking through the validation section of the docs again and not seeing anything calling out email validation in particular, I went ahead and created #404.

It still seems worth it to me to remove or warn about the email validator, as well as validators that require dependencies or formats that aren't implemented, even though the library is following jsonschema's specification to let that stuff pass. You'd at least get fewer people creating duplicate issues like this one.

@Julian
Copy link
Member

Julian commented May 11, 2018

@staab how do you feel here after that clarification (thanks again!)

The issue with a warning (beyond the fact that the stdlib crippled its own warning system, which is somewhat fixable) is that it's not really possible to distinguish between someone who's accidentally forgot to install a dependency from someone who actually doesn't mean to use format for validation.

If you still think there's more here though happy to discuss.

@staab
Copy link
Contributor Author

staab commented May 11, 2018

I'm beginning to understand the stickiness of the issue, for sure. I do still feel pretty unsatisfied though, for the following reasons:

  • It seems like it's the fault of the jsonschema spec itself rather than this library, but "failing fast" is generally considered best-practice in software development. The default of returning True for un-implemented formats in general seems problematic for the same reason php's decision to log errors and keep going, even if they're likely to cause a bug, is a bad decision. There's also the principle of being tolerant in the requests you receive, and strict in the responses you give back, but I think jsonschema is doing the opposite given it's role as a validation library (the errors are the api, so raising them as often as possible is self-strictness, not user-strictness). I realize the ship has probably already sailed, since this would be a major backwards incompatibility, but my preference for things like this is to get an exception if the function (read: format) I'm calling isn't defined.
  • Partial implementation, in my opinion, is even worse. If I ask a question, like "is this a valid email", and get a "yes" back, I assume it's answering the question I was asking. In the case of email validation, you're right that it's actually impossible to return a real "yes", which I think makes the naive validation justified, if not totally safe.
  • Docs aren't a real solution to the problem (although they may be the best option in a situation like this), because people generally only read the parts they think they need. Some projects have a Gotchas section; maybe that would attract the proper amount of attention.

As far as warnings, I might suggest requiring the user to configure jsonschema (globally, or even better, as an argument to FormatChecker or DraftXValidator) to opt-in to various dangerous features. If they haven't opted in, log a warning when validating a schema. This would be safe for backwards compatibility (I'd even go so far as to suggest a deprecation warning and upgrade to an exception in the future), and would be more effective than documentation, because no one likes warnings clogging up their logs.

Hope that helps, and I'd be happy to help implement some of this stuff if you decide on a way forward.

@handrews
Copy link

@staab

It seems like it's the fault of the jsonschema spec itself rather than this library, but "failing fast" is generally considered best-practice in software development.

We are working on improving this situation, and others regarding extensibility and modularity (specifically being able to know whether an implementation handles your extensions and failing appropriately if not) See json-schema-org/json-schema-spec#561 (basics) and json-schema-org/json-schema-spec#563 (formats). This approach would retain the default permissive behavior, while providing both schema authors and schema implementors control over what they require / accept.

I would advise against sinking too much time into implementation-specific band-aids on this as we do plan to address it in the next draft. Better to spend that effort adding support for draft-08 as soon as it's out.

@staab
Copy link
Contributor Author

staab commented May 11, 2018

Neat! That sounds like the ideal solution to me.

@Julian
Copy link
Member

Julian commented May 11, 2018

(I'm speaking only about existing drafts, though thankfully @handrews is helpfully chiming in)

It seems like it's the fault of the jsonschema spec itself rather than this library, but "failing fast" is generally considered best-practice in software development.

I think in this comment (and others) you're taking an extreme view on what format is -- one that is inconsistent with the spec by very nature of its defined semantics. You're viewing it as if format has equal standing to other validators, and that therefore someone who puts it in a schema has equal expectations that it needs to function and function 100%, and that if that for some reason wasn't feasible, that the same way that if the properties validator for some reason couldn't work, loud glaring error bells should go off.

This just isn't the case -- the format validator is "second class" -- it's a validator that to some is just informatory. For myself even, that's often been the case, where I use format: email, and my intention is "FYI this is an email". No validation whatsoever is done or needed there, it's simply for documentation.

The fact that you've taken such an extreme definition means that you have different expectations, so I think you're applying some principles here that just don't apply to format in its current form.

Do some users expect format to fail loudly? Certainly. But whether that's because they (or you) are just misunderstanding that format isn't that kind of validator unless explicitly enabled is really the key here for earlier drafts.

If you do press me on whether I think that's a good idea, no probably not, I'd prefer to probably split format into a few that are well defined and likely widely supported, and "the rest", and then to mandate behavior one way or the other.

Partial implementation, in my opinion, is even worse. If I ask a question, like "is this a valid email", and get a "yes" back, I assume it's answering the question I was asking. In the case of email validation, you're right that it's actually impossible to return a real "yes", which I think makes the naive validation justified, if not totally safe.

I think the end of this paragraph is agreeing with the following, but just to be clear again: my issue with the email validator is that I literally call what is implemented in jsonschema full validation -- I disagree with any notion that checking for just an "@" constitutes "partial" implementation -- it's not conforming with the spec, but I maintain it is a fully formed representation of what someone should be doing for email on the backend (and of course again I intend to raise that upstream though if @handrews is here he's now likely seen it :)

Docs aren't a real solution to the problem

Definitely agree with this one :/

As far as warnings, I might suggest requiring the user to configure jsonschema

So I'm somewhat open to this one, if by this what you mean is "no more default behavior for format at all, if you want no format validation, you also need to pass in a NullFormatChecker that does nothing".

I'll admit though that I'm quite behind on reading up on the things @handrews mentioned, so maybe my opinions will change once I'm caught up.

@staab
Copy link
Contributor Author

staab commented May 11, 2018

I haven't read the spec (I read half of draft 3 a number of years ago), so my suggestions here are entirely based on my own instincts. If the philosophy of jsonschema conflicts, with my own, for sure go with jsonschema's.

It sounds like you've heard me loud and clear, so I'm pretty satisfied here. I think the NullFormatChecker (or at least a BaseFormatChecker with all the well-defined ones) is definitely in the spirit of what I've been saying. Also, thanks for your thoughts on format; I didn't realize it was meant to be quite so lenient. At any rate, I've got my implementation sorted out, so thanks for going over this, and I look forward to seeing how jsonschema continues to evolve.

@handrews
Copy link

@staab part of the recent spec work has involved classifying keywords into assertions (which produce a boolean output) annotations (which attach some data in some way that an application can use), and applicators (which apply subschema(s) and combine or modify their results in some way (e.g. not is an applicator that inverts its subschemas result, properties is an applicator that applies based on matching property names, etc.)

format is really more of an annotation, but since it's purpose is related to validation it's also kind of an optional assertion. But it's not a full-fledged assertion keyword. It's an annotation with a recommended default behavior. Applications can then do further validation, such as sending a test email to the address, whether the validator did any of its own validation or not (I think we can all agree that that would not be a good thing for a validator to do automatically).

The hybrid behavior is a little weird, and I'm hoping that by clarifying that framework it's a little more obvious when the spec is read that the keywords behave a bit differently.

@Julian
Copy link
Member

Julian commented May 12, 2018

So, for a concrete resolution here I'm going to close this one, but @staab I've opened #407 to track the possible change in behavior going forward.

Appreciated!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants