Be more spec conformant #64

hasufell · 2023-12-31T12:00:38Z

We untangle the relativeRef and uriParser here. They are weirdly intertwined, although they have different BNFs.

We also specify the 'hierPartParser' and the 'rrPathParser' (now named 'relativePartParser') more closely according to the spec, which also fixes #63.

Further, we no longer run urlDecodeQuery over the path components, only over the query components. "tel:+1-816-555-1212" was parsed correctly out of sheer luck, because the 'rrPathParser' didn't run 'urlDecodeQuery' over the first segment, only over the subsequent ones. We now use urlDecode' for path components instead.
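To illustrate why query-style decoding must not touch path segments, here is a minimal sketch using only base and plain String. The name decodeWith and its Bool flag are hypothetical stand-ins for illustration, not this library's API, and bytes are mapped to Chars one to one (no UTF-8 handling).

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- | Hypothetical decoder for illustration only (not the library's API).
-- With the flag set, '+' becomes a space (form-style query decoding);
-- without it, only percent-escapes are decoded.
decodeWith :: Bool -> String -> String
decodeWith plusAsSpace = go
  where
    go ('%' : h : l : rest)
      | isHexDigit h && isHexDigit l =
          chr (16 * digitToInt h + digitToInt l) : go rest
    go ('+' : rest)
      | plusAsSpace = ' ' : go rest
    go (c : rest) = c : go rest
    go [] = []

-- The path of "tel:+1-816-555-1212".
telPath :: String
telPath = "+1-816-555-1212"
```

With these definitions, decodeWith True telPath turns the leading + into a space and corrupts the number, while decodeWith False telPath returns it unchanged.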

The second commit makes the handling of pchar and pct-encoded more spec conformant: we now decode consistently during parsing and don't need a two-pass strategy. For that to be correct we must not use BS.break etc. on the result after parsing, and the code no longer does.
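A rough sketch of the one-pass idea, written against ReadP from base rather than the attoparsec/ByteString code in this PR. The names pctEncoded, pchar, segment, and parseSegment are made up for this example; the grammar comments follow RFC 3986.

```haskell
import Data.Char (chr, digitToInt, isAlphaNum, isAscii, isHexDigit)
import Text.ParserCombinators.ReadP (ReadP, char, eof, readP_to_S, satisfy, (<++))

-- pct-encoded = "%" HEXDIG HEXDIG, decoded as soon as it is recognised.
pctEncoded :: ReadP Char
pctEncoded = do
  _ <- char '%'
  h <- satisfy isHexDigit
  l <- satisfy isHexDigit
  pure (chr (16 * digitToInt h + digitToInt l))

-- pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
pchar :: ReadP Char
pchar = pctEncoded <++ satisfy isPlain
  where
    isPlain c = isAscii c && (isAlphaNum c || c `elem` "-._~!$&'()*+,;=:@")

-- Greedy repetition: keep going while the parser succeeds.
manyGreedy :: ReadP a -> ReadP [a]
manyGreedy p = ((:) <$> p <*> manyGreedy p) <++ pure []

-- segment = *pchar; the result is already decoded, so no second pass.
segment :: ReadP String
segment = manyGreedy pchar

-- Run the segment parser over the whole input.
parseSegment :: String -> Maybe String
parseSegment input =
  case readP_to_S (segment <* eof) input of
    [(decoded, "")] -> Just decoded
    _ -> Nothing
```

Here parseSegment "a%2Fb" gives Just "a/b", and a truncated escape such as "%4" is rejected. The decoded segment can contain a literal '/', which is why splitting on delimiters after parsing would be wrong.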

I also removed the + to space conversion completely, because it has nothing to do with the RFC3986 spec. An end user can post-process the query parameters accordingly, if they want. These are breaking changes; the first commit is not.
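For users who relied on the old behaviour, the post-processing could look roughly like this. It is a sketch over plain String pairs; plusToSpace and formStyle are hypothetical helper names, and the real query type in this library is ByteString-based.

```haskell
import Data.Bifunctor (bimap)

-- Replace every '+' with a space, as form-style decoding does.
plusToSpace :: String -> String
plusToSpace = map (\c -> if c == '+' then ' ' else c)

-- Apply the replacement to both keys and values of parsed query pairs.
formStyle :: [(String, String)] -> [(String, String)]
formStyle = map (bimap plusToSpace plusToSpace)
```

For example, formStyle [("q", "a+b")] yields [("q", "a b")]. One caveat: if the parser has already decoded %2B into +, this step can no longer tell an encoded plus from a raw one.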

It is quite possible that these changes impact performance negatively, but imo, correctness here is much more important than speed.

MichaelXavier · 2024-12-30T18:57:08Z

src/URI/ByteString.hs

@@ -70,8 +70,6 @@ module URI.ByteString
    normalizeURIRef',

    -- * Low level utility functions
-    urlDecode,


I'm curious if we really need to remove these from the public API. The PR description mentions not calling them internally anymore but I'm missing why these aren't provided to the end user.

@hasufell just bringing this comment to your attention since I posed it before you requested review.

I also removed the + to space conversion completely, because it has nothing to do with the RFC3986 spec. An end user can post-process the query parameters accordingly, if they want. These are breaking changes; the first commit is not.

It is removed because we do the "percent encoded" decoding elsewhere, and the special treatment of + is not spec-conformant, so that has been removed as well.

MichaelXavier · 2024-12-30T18:58:55Z

@hasufell thank you for your continued work on this. Could you catch me up on the outlook of this MR? Like is it converging on a "complete" state or does it still have a ways to go?

hasufell · 2024-12-31T07:45:38Z

@MichaelXavier it's been complete and used in GHCup for a while.

The most recent commit fc2bd4a was an experiment, because it turned out using the attoparsec parser was impossible due to the excessive use of endOfInput. I removed those uses, but at the same time, it means that the error messages are a little less clear: if we can't parse some query, we just stop and have a partial result.

That's how a proper parser should work. The high-level functions then make sure that we're at the end of the input, but the main parser can be combined freely.
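The layering described here can be sketched with ReadP from base (the PR itself uses attoparsec, where the corresponding combinator is endOfInput). The names schemeP, parsePrefix, and parseStrict are invented for this example.

```haskell
import Data.Char (isAlpha, isAlphaNum, isAscii)
import Text.ParserCombinators.ReadP (ReadP, char, eof, munch, readP_to_S, satisfy)

-- Core parser: scheme ":" per RFC 3986. It never demands end of input,
-- so it can be embedded in larger parsers.
schemeP :: ReadP String
schemeP = do
  first <- satisfy (\c -> isAscii c && isAlpha c)
  rest <- munch (\c -> isAscii c && (isAlphaNum c || c `elem` "+-."))
  _ <- char ':'
  pure (first : rest)

-- Lenient runner: returns the scheme plus whatever input is left over.
parsePrefix :: String -> Maybe (String, String)
parsePrefix input =
  case readP_to_S schemeP input of
    [(s, rest)] -> Just (s, rest)
    _ -> Nothing

-- High-level entry point: the only place that insists on end of input.
parseStrict :: String -> Maybe String
parseStrict input =
  case readP_to_S (schemeP <* eof) input of
    [(s, "")] -> Just s
    _ -> Nothing
```

parsePrefix "tel:+1-816" returns the scheme together with the unconsumed rest, while parseStrict rejects the same input because of the trailing data. The trade-off mentioned above shows up here too: the lenient form cannot say why it stopped.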

I can revert that part.

The rest, again, has been used in GHCup for a while.

MichaelXavier · 2024-12-31T19:40:09Z

@hasufell Sounds good. Yeah if you could revert that and then review the comment I left yesterday, I'd like to look at getting this merged. I'm all for making the parser more correct and spec compliant but would like to minimize disruption to the API as this is a pretty low-level library.

hasufell mentioned this pull request Dec 31, 2023

Encoded + sign in URI #55

Open

hasufell force-pushed the issue-63 branch 2 times, most recently from b438cec to ad0488d, January 1, 2024 13:14

Decode percent encoded characters properly

4fb5ed1

This does adhere to the spec, because we do one-pass parsing and no manual splitting. This is now also much stricter and invalid percent encoding (e.g. one octet only) will not parse.

hasufell force-pushed the issue-63 branch from ad0488d to 4fb5ed1, January 1, 2024 13:16

Fix doctests

fc987cb

This was referenced Jan 2, 2024

Chokes on URL with + sign haskell/ghcup-hs#408

Closed

file:foo/bar is broken haskell/ghcup-hs#965

Closed

hasufell mentioned this pull request Jan 19, 2024

Improve URI handling haskell/ghcup-hs#978

Closed

hseg mentioned this pull request Sep 11, 2024

Support GHC-9.6 (rebased) haskell/ghcup-hs#1127

Closed

Don't prematurly stop parsing

fc2bd4a

Allow to parse a valid URI even if there's junk at the end.

MichaelXavier reviewed Dec 30, 2024

Revert "Don't prematurly stop parsing"

d473630

This reverts commit fc2bd4a.

hasufell requested a review from MichaelXavier January 2, 2025 09:52

MichaelXavier approved these changes Jan 20, 2025
