-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Be more spec conformant #64
base: master
Are you sure you want to change the base?
Conversation
We untagle the relativeRef and uriParser here. They are weirdly intertwined, although they have different BNFs. We also specify the 'hierPartParser' and the 'rrPathParser' (now named 'relativePartParser') more closely according to the spec, which also fixes Soostone#63. Further, we don't run urlDecodeQuery over the path components, just the query components. "tel:+1-816-555-1212" was parsed correctly out of sheer luck, because the 'rrPathParser' didn't run 'urlDecodeQuery' over the first segment, only the subsequent ones. We now use urlDecode' instead for path components.
b438cec
to
ad0488d
Compare
This does adhere to the spec, because we do one-pass parsing and no manual splitting. This is now also much stricter and invalid percent encoding (e.g. one octet only) will not parse.
Allow to parse a valid URI even if there's junk at the end.
@@ -70,8 +70,6 @@ module URI.ByteString | |||
normalizeURIRef', | |||
|
|||
-- * Low level utility functions | |||
urlDecode, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious if we really need to remove these from the public API. The PR description mentions not calling them internally anymore but I'm missing why these aren't provided to the end user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hasufell just bringing this comment to your attention since I posed it before you requested review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also removed the + to conversion completely, because it has nothing to do with the RFC3986 spec. An end user can post-process the query parameters accordingly, if they want. These are breaking changes. The first commit is not.
It is removed, because we do the "percent encoded" decoding elsewhere and the special treatment of +
is not spec-conformant and has been removed.
@hasufell thank you for your continued work on this. Could you catch me up on the outlook of this MR? Like is it converging on a "complete" state or does it still have a ways to go? |
@MichaelXavier it's been complete and used in GHCup for a while. The most recent commit fc2bd4a was an experiment, because it turned out using the attoparsec parser was impossible due to the excessive use of That's how a proper parser should work. The high-level functions then make sure that we're at the end of the input, but the main parser can be combined freely. I can revert that part. The rest, again, has been used in GHCup for a while. |
@hasufell Sounds good. Yeah if you could revert that and then review the comment I left yesterday, I'd like to look at getting this merged. I'm all for making the parser more correct and spec compliant but would like to minimize disruption to the API as this is a pretty low-level library. |
This reverts commit fc2bd4a.
We untagle the relativeRef and uriParser here. They are weirdly intertwined, although they have different BNFs.
We also specify the 'hierPartParser' and the 'rrPathParser' (now named 'relativePartParser') more closely according to the spec, which also fixes #63.
Further, we don't run urlDecodeQuery over the path components, just the query components. "tel:+1-816-555-1212" was parsed correctly out of sheer luck, because the 'rrPathParser' didn't run 'urlDecodeQuery' over the first segment, only the subsequent ones. We now use urlDecode' instead for path components.
The second commit makes the
pchar
orpct-encoded
more spec conformant: we now decode consistently during parsing and don't need a two-pass strategy. For that we have to be careful to not use anyBS.break
etc. after parsing, which is the case now.I also removed the
conversion completely, because it has nothing to do with the RFC3986 spec. An end user can post-process the query parameters accordingly, if they want. These are breaking changes. The first commit is not.
+
toIt is quite possible that these changes impact performance negatively, but imo, correctness here is much more important than speed.