Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement readNatural plus readInt and readWord for 8, 16, 32, 64 bit and native machine bit sizes #438

Merged
merged 1 commit into from
Dec 16, 2021

Conversation

vdukhovni
Copy link
Contributor

@vdukhovni vdukhovni commented Nov 12, 2021

Some applications want to read either unsigned or explicitly 64-bit integers
(e.g. warp). Provide the missing interfaces. Based on a suggestion by @Bodigrim the code has been moved to a single module enabling simpler maintenance and reduced duplication. A very helpful nudge...

@sjakobi
Copy link
Member

sjakobi commented Nov 12, 2021

Some applications want to read either unsigned or explicitly 64-bit integers
(e.g. warp).

Could you check whether the warp maintainers are actually inclined to use these additions?

@vdukhovni
Copy link
Contributor Author

vdukhovni commented Nov 12, 2021

Some applications want to read either unsigned or explicitly 64-bit integers
(e.g. warp).

Could you check whether the warp maintainers are actually inclined to use these additions?

I don't think that's necessary. It is IMHO sufficient to observe that they did something really kludgey and didn't even check for overflows. Warp maintainers are unlikely to switch quickly to relying on an as yet unreleased version of bytestring, presumably they'll want support GHC 8.8 and 8.10 for some time still.

While seeing the warp kludge inspired the change, I think the result is cleaner code. The base functionality is in readWord64, which is now cleaner than the original readInt (no need to worry about signs and separate bounds for underflow vs overflow), and the other functions are simple wrappers. It is easy to add readInt32 and word32 if that is deemed desirable (maybe even Word/Int16 and Word/Int8, though I really don't expect demand for those).

So I think the new refactored code is justified on its own merits, but of course opinions may differ, so I'm open to further discussion...

Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to accept these additions – it does seem likely that they will be useful.

@Bodigrim What do you think?

I haven't looked at the implementation yet.

@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 12, 2021

I'm on board with these changes, looks like a good idea. Have not looked at implementation details yet.

@vdukhovni
Copy link
Contributor Author

I'm on board with these changes, looks like a good idea. Have not looked at implementation details yet.

Thanks! In terms of implementation, it is basically a small refactor readInt -> readWord64, which actually leads to simplification of not having to deal with signs or sign-specific overflow/underflow detection. So the code is mostly the same, but cleaner.

With that out of the way, the rest of the API is then just wrappers that read a Word64 and if it is in range return the requested data type (Int, Word, Int64), and if not then Nothing. In particular it is easy, and perhaps even warranted (opinions sought) to add readInt32, readWord32, and perhaps even smaller sizes, though I don't see much demand for 16-bit decimals in real life. But perhaps reading IP4 addresses from a ByteString could benefits from a readWord8 used to reach each quad.

Don't know whether there's demand for readHexWord64 et. al. In network protocols we'd typically expect either binary wire forms or decimals (in HTTP, SMTP, ...) Hex serialisation forms are not terribly common...

Comment on lines 1017 to 1019
| w <= wordMaxAsWord64
, let !i = fromIntegral w
, i >= 0 = Just (i, str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| w <= wordMaxAsWord64
, let !i = fromIntegral w
, i >= 0 = Just (i, str)
| w <= intMaxAsWord64 = Just (fromIntegral w, str)

Seems a bit simpler, no?

I'd also suggest to introduce monomorphic helpers word64ToInt = fromIntegral, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to be sure to force fromIntegral w, so that ideally GHC would elide the thunk. Hence !i = ...
Otherwise, you're right, your bounds check is equivalent, so I can drop one conditional. So it would be:

        | w <= intMaxAsWord64 = let !i = fromIntegral w in Just (i, str)

@vdukhovni
Copy link
Contributor Author

I just pushed a fixup commit, that addresses the int bounds check, and tidies up some comments, cosmetic issues. Whoever decides to merge should squash first (I can do that as a force push once all is approved if you like).

cvtneg (Just (w, str))
| w <= wordMaxAsWord64
, let !i = negate $ fromIntegral w
, i <= 0 = Just (i, str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could probably also use w <= negatedIntMinAsWord64 to save a comparison.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you cleanly express negatedIntMinAsWord64? The naive fromIntegral (minBound :: Int) :: Word64 isn't it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My less naive version is: fromIntegral (fromIntegral (minBound :: Int) :: Word) :: Word64

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which, withTypeApplications I guess becomes:

intMinAsWord64  = fromIntegral @Word @Word64 $ fromIntegral @Int @Word minBound 

Do you want to force "neg" into the name, or is it OK to leave it implicit that this is an absolute value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you.

@vdukhovni
Copy link
Contributor Author

Oops, forgot to add the monomorphic helpers. I guess another fixup commit?
Do such monomorphic helpers help performance, or improve type safety? Or just readability?

@Bodigrim
Copy link
Contributor

Readability (it is easier to validate correctness of casts when types are explicit) and type-guided refactoring (it's harder to introduce a mistake after changing some types).

Since we target GHC 8.0+, alternatively you can use {-# LANGUAGE TypeApplications #-} to spell out types of fromIntegral, if you like it better.

@vdukhovni
Copy link
Contributor Author

Readability (it is easier to validate correctness of casts when types are explicit) and type-guided refactoring (it's harder to introduce a mistake after changing some types).

Since we target GHC 8.0+, alternatively you can use {-# LANGUAGE TypeApplications #-} to spell out types of fromIntegral, if you like it better.

OK, we're on the same page then. Just wanted to know I wasn't missing something deeper...

@vdukhovni
Copy link
Contributor Author

I think that's everything noted so far...

@vdukhovni
Copy link
Contributor Author

What we don't have is a 32-bit test environment to really make sure that 32-bit Ints are handled correctly, we presently only know this by code inspection, but perhaps 32-bit systems are no longer a concern? In any case, don't know how to specify that in CI, or find GHC builds for 32-bit systems, ...

@vdukhovni
Copy link
Contributor Author

vdukhovni commented Nov 13, 2021

I guess if we actually add the corresponding readInt32 and readWord32 wrappers, then we'd know. :-)

@Bodigrim
Copy link
Contributor

We do have 32-bit CI, both arm and intel: https://cloud.drone.io/haskell/bytestring/132

@vdukhovni
Copy link
Contributor Author

I didn't know that bytestring has 32-bit CI tests. No worries.

About minBound :: Int as Word64, another way to write it, that uses explicit negation of a Word64 vaue is:

intMinAsWord64 = negate $ fromIntegral @Int @Word64 minBound

Any preference for that over the non-negating previous version?

@Bodigrim
Copy link
Contributor

negate is probably more illuminating, but I do not have a strong preference.

@vdukhovni
Copy link
Contributor Author

negate is probably more illuminating, but I do not have a strong preference.

Perhaps so, but I do like not calling negate on unsigned quantities. I'll do whatever Simon says. :-)

Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeApplications is a blessing!

Data/ByteString/Char8.hs Outdated Show resolved Hide resolved
Data/ByteString/Char8.hs Outdated Show resolved Hide resolved
Data/ByteString/Char8.hs Outdated Show resolved Hide resolved
Data/ByteString/Char8.hs Outdated Show resolved Hide resolved
Data/ByteString/Internal.hs Show resolved Hide resolved
tests/Properties/ByteString.hs Outdated Show resolved Hide resolved
Data/ByteString/Lazy/Char8.hs Outdated Show resolved Hide resolved
@vdukhovni
Copy link
Contributor Author

Two more commits (ultimately to be squashed I think). These should address all outstanding issues.

@@ -1,7 +1,12 @@
{-# LANGUAGE AllowAmbiguousTypes #-}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I'll hate myself for this suggestion, when I'll need to backport something to bytestring-0.11, but...)

Let's move read{Word,Int}-related functionality into a separate internal module, so that the main one remains unpolluted by new language extensions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, works for me. I don't think it needs to be visible to users, so it should be no problem...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move read{Word,Int}-related functionality into a separate internal module, so that the main one remains unpolluted by new language extensions.

I don't mind how this has played out for this PR, but what's the motivation for not using TypeApplications in D.B.Char8?

@Bodigrim Bodigrim added this to the 0.12.0.0 milestone Nov 13, 2021
@vdukhovni
Copy link
Contributor Author

@Bodigrim's suggestion to factor out the code into a separate module turned out to be a major win. This pretty much eliminated all the code duplication, and made it easier to mimic the more efficient code path from the overflow-checked functions to similarly simplify and rework readInteger also adding readNatural.

I like the result. Sorry it is basically a clean slate now, but actually should be easy to review...

@vdukhovni
Copy link
Contributor Author

Thanks for the prompt review, pushed a fixup.

Data/ByteString/Lazy/ReadInt.hs Outdated Show resolved Hide resolved
Data/ByteString/Lazy/ReadInt.hs Outdated Show resolved Hide resolved
tests/Properties.hs Outdated Show resolved Hide resolved
tests/Properties/ByteString.hs Outdated Show resolved Hide resolved
@vdukhovni
Copy link
Contributor Author

And we comparing both n bits by n bits against n bits by 64 bits.
In one case it is:

  • n of 64 x 64, then n/2 of 128 x 128, then n/4 of 256 * 256, ..., finally one (n * 32) x (n * 32)
    The other case is:
  • 64 x 64, 64 * 128, 64 x 192, 64 x 256, ..., 64 x (n-1) * 64

Either way there are n-1 products, but the size distribution of operations is different.

@Bodigrim
Copy link
Contributor

And how big is eps in 1 + eps?

Any eps > 0 will do.

And we comparing both n bits by n bits against n bits by 64 bits.

There is no difference in native backend, because it takes O(n*m) time to multiply n bits by m bits. But in libgmp backend, which is used by default on the majority of platforms, the former approach is faster.

See https://github.com/Bodigrim/fast-digits/blob/aef0438dba67d49b814857663c24e283ea95c2ed/src/Data/FastDigits.hs#L114-L119. It deals with a reverse scenario, when you split a large number into digits, but asymptotics are the same. It is faster to divide-and-conquer than chip digits one by one. Same for reconstructing a number from digits.

@vdukhovni
Copy link
Contributor Author

So it sounds like I'll be reverting the most recent fixup, but I'm not sure what to say about asymptotic performance benefits of divide-and-conquer in the code commentary. I think it is difficult to be accurate...

@Bodigrim
Copy link
Contributor

You can refer https://gmplib.org/manual/Multiplication-Algorithms in a comment, it lists all subquadratic algorithms. For reasonably-long integers I'd quote Toom-4 estimate, O(n^1.4).

@vdukhovni
Copy link
Contributor Author

Reverted to divide and conquer and addressed comment nits.

@vdukhovni vdukhovni force-pushed the readWord branch 5 times, most recently from 87a5277 to 17aba92 Compare December 11, 2021 19:55
@vdukhovni
Copy link
Contributor Author

This might be it? Any further nits?

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, these are the last nitpicks from my side.

Data/ByteString/Lazy/ReadInt.hs Outdated Show resolved Hide resolved
Data/ByteString/Lazy/ReadInt.hs Outdated Show resolved Hide resolved
Data/ByteString/Lazy/ReadInt.hs Outdated Show resolved Hide resolved
@vdukhovni vdukhovni force-pushed the readWord branch 2 times, most recently from a2a5813 to fa2a08f Compare December 13, 2021 21:07
Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. Overall count of changes in Data is 568 insertions, 313 deletions, and the delta of 250 lines is mostly due to comments. And we provide a lot of new safe and fast (on par with bytestring-lexing) routines.

@vdukhovni
Copy link
Contributor Author

  • Dropped leading indents from CPP guards that are no longer nested.
  • Negate full-width words to avoid an extra instruction:
fromIntegral @Int64 @a $ negate $ fromIntegral @Word64 @Int64 acc

rather than:

negate $ fromIntegral @Word64 @a acc

the latter looks simpler but involves more underlying conversions to actually perform the negation (NCG backend, GHC 9.2).

Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments regarding the library.

In the future I hope that we can avoid huge PRs like this one. I'd much prefer to review a series of smallish PRs that incrementally builds the same functionality.

Data/ByteString/Lazy/ReadInt.hs Outdated Show resolved Hide resolved
Data/ByteString/Lazy/ReadInt.hs Outdated Show resolved Hide resolved
Data/ByteString/Lazy/ReadNat.hs Outdated Show resolved Hide resolved
Data/ByteString/Lazy/ReadNat.hs Outdated Show resolved Hide resolved
@vdukhovni
Copy link
Contributor Author

A few more comments regarding the library.

In the future I hope that we can avoid huge PRs like this one. I'd much prefer to review a series of smallish PRs that incrementally builds the same functionality.

I am surprised to see this called a large PR... It is just 100 lines of code or so in each of two modules. It is hard to see how this would happen incrementally, it is a rewrite that makes readInt polymorphic over all the fixed sizes and does more of the foreign memory ops in a single loop rather than back and forth via the various combinators.

That said, I don't have more such things in the pipeline, unless you'd welcome a follow for analogous functions for hex, perhaps even octal, ... (full range of functions from bytestring-lexing, but without uncaught overflows).

Some applications want to read either unsigned or explicitly 64-bit integers
(e.g. warp).  Provide all the missing overflow-checked interfaces.

    * readInt8, readInt16, readInt32, readInt64
    * readWord, readWord8, readWord16, readWord32, readWord64
    * readNatural

Cleaned up the code and improved tests.  Uses Word as the accumular
for all types other than Int64 and Word64, which use Word64.  When
words are 64 bit uses base 10^19 rather than 10^9 when assembling
Natural and Integer values.
@sjakobi
Copy link
Member

sjakobi commented Dec 16, 2021

I am surprised to see this called a large PR... It is just 100 lines of code or so in each of two modules. It is hard to see how this would happen incrementally, it is a rewrite that makes readInt polymorphic over all the fixed sizes and does more of the foreign memory ops in a single loop rather than back and forth via the various combinators.

That said, I don't have more such things in the pipeline, unless you'd welcome a follow for analogous functions for hex, perhaps even octal, ... (full range of functions from bytestring-lexing, but without uncaught overflows).

It's not just the number of lines changed. It's also the huge number of review comments and corresponding changes. I think the repeated squashing of commits also made it very hard to review this PR incrementally. IIRC the scope of the PR was also increased by adding readNatural.

This isn't meant to criticize you, @vdukhovni. I wish I had realized my problem earlier and communicated it.

In the future I'll try to request limiting the size of similar PRs at an early stage. For example, it would be possible to add naive implementations combined with tests and benchmarks in a first PR and do optimizations in a second one.

Copy link
Member

@sjakobi sjakobi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment on lines +132 to +141
prop_readIntBoundsCC = rdWordBounds @Word
&& rdWordBounds @Word8
&& rdWordBounds @Word16
&& rdWordBounds @Word32
&& rdWordBounds @Word64
&& rdIntBounds @Int
&& rdIntBounds @Int8
&& rdIntBounds @Int16
&& rdIntBounds @Int32
&& rdIntBounds @Int64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that combining properties with && will make them a bit hard to debug if they do end up failing. Combining them with (.&&.) should make that easier, I think.

At this stage, it's probably not worth changing though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know about ".&&.", improving the tests without touching the main code may be reasonable? Your call...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to get this PR finished as soon as possible. Feel free to send more improvements in follow-up PRs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to get this PR finished as soon as possible. Feel free to send more improvements in follow-up PRs.

Makes sense. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good use case for conjoin.

@Bodigrim Bodigrim merged commit b701111 into haskell:master Dec 16, 2021
@Bodigrim
Copy link
Contributor

Thanks!

@Bodigrim
Copy link
Contributor

I think the repeated squashing of commits also made it very hard to review this PR incrementally.

+1, please do not squash any sizeable branches during review process. We'll squash when merging. (I remember that I fat-fingered rebase instead of squash in #309, sorry for that.)

For example, it would be possible to add naive implementations combined with tests and benchmarks in a first PR and do optimizations in a second one.

+1, to avoid possible disappointment, I'd prefer a simple PR to agree on API first and elaborate implementation in subsequent PRs.

I personally do not have use cases for octal / hexadecimal parsing, but I assume their implementation could be much simpler than for decimals?.. Providing it out of the box would be nice. @sjakobi what do you think?

@sjakobi
Copy link
Member

sjakobi commented Dec 18, 2021

I personally do not have use cases for octal / hexadecimal parsing, but I assume their implementation could be much simpler than for decimals?.. Providing it out of the box would be nice. @sjakobi what do you think?

For reading hexadecimal numbers, I'm aware of a use case in http-client. Readers for which types did you intend to provide here, @vdukhovni?

What's the use case for reading octal numbers? If we aren't aware of one, I think it would be better to wait until one comes up – API size does affect maintainability after all.

Also, let's please move this discussion to a proper issue. A closed PR is a bad place for it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants