Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Be more spec conformant #64

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

hasufell
Copy link
Collaborator

@hasufell hasufell commented Dec 31, 2023

We untagle the relativeRef and uriParser here. They are weirdly intertwined, although they have different BNFs.

We also specify the 'hierPartParser' and the 'rrPathParser' (now named 'relativePartParser') more closely according to the spec, which also fixes #63.

Further, we don't run urlDecodeQuery over the path components, just the query components. "tel:+1-816-555-1212" was parsed correctly out of sheer luck, because the 'rrPathParser' didn't run 'urlDecodeQuery' over the first segment, only the subsequent ones. We now use urlDecode' instead for path components.


The second commit makes the pchar or pct-encoded more spec conformant: we now decode consistently during parsing and don't need a two-pass strategy. For that we have to be careful to not use any BS.break etc. after parsing, which is the case now.

I also removed the + to conversion completely, because it has nothing to do with the RFC3986 spec. An end user can post-process the query parameters accordingly, if they want. These are breaking changes. The first commit is not.


It is quite possible that these changes impact performance negatively, but imo, correctness here is much more important than speed.

We untagle the relativeRef and uriParser here. They
are weirdly intertwined, although they have different BNFs.

We also specify the 'hierPartParser' and the 'rrPathParser'
(now named 'relativePartParser') more closely according to the
spec, which also fixes Soostone#63.

Further, we don't run urlDecodeQuery over the path components,
just the query components. "tel:+1-816-555-1212" was parsed
correctly out of sheer luck, because the 'rrPathParser'
didn't run 'urlDecodeQuery' over the first segment, only the
subsequent ones. We now use urlDecode' instead for path
components.
@hasufell hasufell mentioned this pull request Dec 31, 2023
@hasufell hasufell force-pushed the issue-63 branch 2 times, most recently from b438cec to ad0488d Compare January 1, 2024 13:14
This does adhere to the spec, because we do one-pass parsing
and no manual splitting. This is now also much stricter
and invalid percent encoding (e.g. one octet only) will
not parse.
Allow to parse a valid URI even if there's junk at the end.
@@ -70,8 +70,6 @@ module URI.ByteString
normalizeURIRef',

-- * Low level utility functions
urlDecode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious if we really need to remove these from the public API. The PR description mentions not calling them internally anymore but I'm missing why these aren't provided to the end user.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hasufell just bringing this comment to your attention since I posed it before you requested review.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also removed the + to conversion completely, because it has nothing to do with the RFC3986 spec. An end user can post-process the query parameters accordingly, if they want. These are breaking changes. The first commit is not.

It is removed, because we do the "percent encoded" decoding elsewhere and the special treatment of + is not spec-conformant and has been removed.

@MichaelXavier
Copy link
Contributor

@hasufell thank you for your continued work on this. Could you catch me up on the outlook of this MR? Like is it converging on a "complete" state or does it still have a ways to go?

@hasufell
Copy link
Collaborator Author

@MichaelXavier it's been complete and used in GHCup for a while.

The most recent commit fc2bd4a was an experiment, because it turned out using the attoparsec parser was impossible due to the excessive use of endOfInput. I removed those uses, but at the same time, it means that the error messages are a little less clear: if we can't parse some query, we just stop and have a partial result.

That's how a proper parser should work. The high-level functions then make sure that we're at the end of the input, but the main parser can be combined freely.

I can revert that part.

The rest, again, has been used in GHCup for a while.

@MichaelXavier
Copy link
Contributor

@hasufell Sounds good. Yeah if you could revert that and then review the comment I left yesterday, I'd like to look at getting this merged. I'm all for making the parser more correct and spec compliant but would like to minimize disruption to the API as this is a pretty low-level library.

@hasufell hasufell requested a review from MichaelXavier January 2, 2025 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Does not accept valid file:c:/foo paths
2 participants