Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Builder: unsound reuse of buffers #690

Closed
adamgundry opened this issue Aug 2, 2024 · 3 comments · Fixed by #691
Closed

Builder: unsound reuse of buffers #690

adamgundry opened this issue Aug 2, 2024 · 3 comments · Fixed by #691
Milestone

Comments

@adamgundry
Copy link
Member

Summary

toLazyByteString :: Builder -> LazyByteString has a race condition that may generate wrong results if two threads concurrently evaluate the result. This originally manifested itself in an application using Cassava, which would produce CSV files in which part of the file was overwritten with data from later in the file. The underlying cause appears to be an interaction between unsafeDupablePerformIO and buffer reuse in buildStepToCIOS arising since version 0.11.5.0 (specifically #581).

Reproducer

Since the race condition is non-deterministic, it is not entirely trivial to reproduce, but the following module races two threads parsing the same lazy bytestring (which should get the same result) and typically hits a "bad" case after a few iterations.

Full code
#!/usr/bin/env cabal
{- cabal:
build-depends:
  base, async, bytestring, cassava
ghc-options: -threaded -with-rtsopts=-N
-}
{- project:
with-compiler: ghc-9.8.1
optimization: False
-}
-- It is important we disable optimization for this reproducer, as otherwise the
-- bug goes away (presumably due to inlining, CSE or suchlike).  But the issue
-- has been observed with optimizations enabled as well.
--
-- I've reproduced on bytestring 0.11.5.3 (GHC 9.6.4, cassava 0.5.3.0) and
-- bytestring HEAD (594966bbfa84255adfb644d43212e5e05f73c235) (GHC 9.8.1, cassava 0.5.3.1).

import qualified Control.Concurrent.Async as Async
import qualified Control.Monad            as Monad
import qualified Data.ByteString          as B
import qualified Data.ByteString.Builder  as BB
import           Data.ByteString.Builder  (Builder)
import qualified Data.ByteString.Char8    as BS
import qualified Data.ByteString.Lazy     as LBS
import qualified Data.Csv                 as CSV
import qualified Data.List as List
import Prelude hiding (unlines)

-- Generate some test data.  This is based on Cassava code but minimized to make
-- clear there is no dependency on anything unsafe.
build :: IO LBS.LazyByteString
build = encode <$> mkRecord

-- The magic number 1853 means that show [1..1853] occupies 8159 bytes.
-- Thus we end up with chunks close to maximalCopySize = 8160.
mkRecord :: IO [[BS.ByteString]]
mkRecord = pure $ let xs = [ 1 .. 1853 ] :: [Int]
                  in [[BS.pack (show xs), BS.pack (show xs)]]

encode :: [[BS.ByteString]] -> LBS.LazyByteString
encode = BB.toLazyByteString
       . unlines (BB.word8 10)
       . map (mconcat . intersperse (BB.word8 44) . map BB.byteString . map escape)

unlines :: Builder -> [Builder] -> Builder
unlines _ [] = mempty
unlines sep (b:bs) = b <> sep <> unlines sep bs

intersperse :: Builder -> [Builder] -> [Builder]
intersperse _   []      = []
intersperse sep (x:xs)  = x : prependToAll sep xs

prependToAll :: Builder -> [Builder] -> [Builder]
prependToAll _   []     = []
prependToAll sep (x:xs) = sep <> x : prependToAll sep xs

escape :: BS.ByteString -> BS.ByteString
escape s
         = LBS.toStrict . BB.toLazyByteString $
            BB.word8 dquote
            <> B.foldl
                (\ acc b -> acc <> if b == dquote
                    then BB.byteString (BS.pack "\"\"")
                    else BB.word8 b)
                mempty
                s
            <> BB.word8 dquote
  where
    dquote = 34
    cr     = 13


-- Consume the resulting bytesting by parsing the CSV file.  I haven't managed
-- to minimize the reproducer to avoid the need for this, but I don't believe
-- Cassava is doing anything relevant here, we just need a suitable demand
-- pattern from both threads to trigger the race.

consume :: LBS.LazyByteString -> IO CSV.Csv
consume = either fail pure . CSV.decode CSV.NoHeader

main :: IO ()
main = Monad.replicateM_ 100 $ do
    bs <- build
    -- Forcing the length of the bytestring makes the problem disappear
    -- print (LBS.length bs)
    (r1, r2) <- Async.concurrently (consume bs) (consume bs)
    if r1 == r2
        then putStrLn "Ok!"
        else do fail "Bad!"

Diagnosis

Due to the use of unsafeDupablePerformIO for performance, it is critical that any IO actions executed when running a Builder can be interrupted or executed multiple times. In principle, filling a buffer is safe provided the buffer is used only once and the same bytes are written each time. However, wrapChunk in buildStepToCIOS re-uses a buffer in the trimming case.

More precisely, I believe what is happening is the following:

  • The Builder includes large chunks using insertChunk that hit the trim case of wrapChunk.
  • Two threads are racing to insert the same chunk, sharing a buffer fpbuf.
  • Thread A creates a fresh fpbuf' and starts copying bytes from fpbuf.
  • Thread B also creates a fresh fpbuf', copies it, yields bs and then continues using the same buffer.
  • Thread B starts filling fpbuf with more data from later in the output.
  • Thread A is still copying bytes from fpbuf, so it copies bogus data from later in the output to earlier in (its version of) the output.

It looks like the bug was introduced in #581 which is first present in 0.11.5.0 (see 5c4d236 and 0c030bb).

Solution

An easy fix is for the trim case of wrapChunk to unconditionally allocate a new buffer after trimming, rather than re-using the old buffer. This appears to fix the problem in my reproducer. (I think it is safe to use the same buffer if the chunk is actually empty, though.) I don't know how much impact this will have on performance, but presumably it will slightly reduce it for builders inserting many trimmed chunks.

Replacing the unsafeDupablePerformIO calls with unsafePerformIO should also fix the issue, but is likely to significantly impact performance (4ac4be0 claimed introducing unsafeDupablePerformIO "increases the performance of bytestring chunk insertion by 20%.").

Perhaps users who know that they will not evaluate the resulting bytestring in multiple threads concurrently could somehow indicate this and opt in to more aggressive buffer reuse, e.g. via a custom AllocationStrategy?

I notice that using customStrategy for AllocationStrategy can in principle choose to reuse buffers "if it can guarantee that this referentially transparent" (sic). It's not clear to me what the caller needs to guarantee, but this issue suggests that (amongst other things?) the caller must ensure that the resulting lazy bytestring is evaluated only by a single thread. Could we document this more clearly?

adamgundry added a commit to adamgundry/bytestring that referenced this issue Aug 2, 2024
adamgundry added a commit to adamgundry/bytestring that referenced this issue Aug 8, 2024
`toLazyByteString :: Builder -> LazyByteString` had a race condition that could
generate wrong results if two threads concurrently evaluated the result.  This
bug was introduced in haskell#581 (5c4d236) and first
present in release 0.11.5.0 (as 0c030bb).

Due to the use of `unsafeDupablePerformIO` for performance, it is critical that
any IO actions executed when running a `Builder` can be interrupted or executed
multiple times.  In principle, filling a buffer is safe provided the buffer is
used only once and the same bytes are written each time. However, `wrapChunk` in
`buildStepToCIOS` would re-use a buffer in the trimming case after copying its
contents to produce a new trimmed chunk. This is safe when run in a single
thread, but if two threads simultaneously execute the code, one of them may
still be copying the contents while the other starts overwriting the buffer.

This patch fixes `wrapChunk` to unconditionally allocate a new buffer after
trimming, rather than re-using the old buffer. This will presumably come at a
slight performance cost for builders inserting many trimmed chunks.
@clyring
Copy link
Member

clyring commented Aug 9, 2024

noDuplicate# as in unsafePerformIO is indeed a rather large and expensive hammer, if memory serves. We don't really need its full power since we don't care if the initial "allocate a new buffer" effect gets duplicated: I'm reasonably sure eagerly black-holing the "continue work on the next chunk" thunks would be sufficient, and this only requires a single CAS-write and branch. But I don't think there is currently any way to request eager black-holing of a specific thunk from Haskell code.

Perhaps users who know that they will not evaluate the resulting bytestring in multiple threads concurrently could somehow indicate this and opt in to more aggressive buffer reuse, e.g. via a custom AllocationStrategy?

I notice that using customStrategy for AllocationStrategy can in principle choose to reuse buffers "if it can guarantee that this referentially transparent" (sic). It's not clear to me what the caller needs to guarantee, but this issue suggests that (amongst other things?) the caller must ensure that the resulting lazy bytestring is evaluated only by a single thread. Could we document this more clearly?

In practice, I think the only practical ways for a client to guarantee this are:

  1. ...running the entire Haskell program single-threaded
  2. ...using buildStepToCIOS directly and consuming the chunks in an obviously single-threaded or non-duplicable way

Well, option 2 could have reasonable applications, but I'm not entirely sure about the merits of documenting that this is possible. It seems less dangerous to for such applications to use their own analogue of the AllocationStrategy type, to emphasize that buffer-reuse is generally not compatible with the consumers in the bytestring Builder internal modules.

@adamgundry
Copy link
Member Author

I agree with the above. But I'm wondering what the concrete plan is to address this issue @clyring @Bodigrim? As a soundness bug leading to wrong-results it seems fairly severe. I've proposed a fix in #691 on which I'd appreciate review.

clyring pushed a commit that referenced this issue Sep 18, 2024
`toLazyByteString :: Builder -> LazyByteString` had a race condition that could
generate wrong results if two threads concurrently evaluated the result.  This
bug was introduced in #581 (5c4d236) and first
present in release 0.11.5.0 (as 0c030bb).

Due to the use of `unsafeDupablePerformIO` for performance, it is critical that
any IO actions executed when running a `Builder` can be interrupted or executed
multiple times.  In principle, filling a buffer is safe provided the buffer is
used only once and the same bytes are written each time. However, `wrapChunk` in
`buildStepToCIOS` would re-use a buffer in the trimming case after copying its
contents to produce a new trimmed chunk. This is safe when run in a single
thread, but if two threads simultaneously execute the code, one of them may
still be copying the contents while the other starts overwriting the buffer.

This patch fixes `wrapChunk` to unconditionally allocate a new buffer after
trimming, rather than re-using the old buffer. This will presumably come at a
slight performance cost for builders inserting many trimmed chunks.
@clyring clyring added this to the 0.11.5.4 milestone Sep 18, 2024
adamgundry added a commit to adamgundry/bytestring that referenced this issue Sep 19, 2024
Bodigrim pushed a commit that referenced this issue Sep 19, 2024
* Clarify documentation of 'customStrategy' based on #690

* Remove outdated comments on AllocationStrategy

(These were not visible in the Haddock output anyway.)
Bodigrim pushed a commit that referenced this issue Oct 9, 2024
`toLazyByteString :: Builder -> LazyByteString` had a race condition that could
generate wrong results if two threads concurrently evaluated the result.  This
bug was introduced in #581 (5c4d236) and first
present in release 0.11.5.0 (as 0c030bb).

Due to the use of `unsafeDupablePerformIO` for performance, it is critical that
any IO actions executed when running a `Builder` can be interrupted or executed
multiple times.  In principle, filling a buffer is safe provided the buffer is
used only once and the same bytes are written each time. However, `wrapChunk` in
`buildStepToCIOS` would re-use a buffer in the trimming case after copying its
contents to produce a new trimmed chunk. This is safe when run in a single
thread, but if two threads simultaneously execute the code, one of them may
still be copying the contents while the other starts overwriting the buffer.

This patch fixes `wrapChunk` to unconditionally allocate a new buffer after
trimming, rather than re-using the old buffer. This will presumably come at a
slight performance cost for builders inserting many trimmed chunks.
Bodigrim pushed a commit that referenced this issue Oct 9, 2024
* Clarify documentation of 'customStrategy' based on #690

* Remove outdated comments on AllocationStrategy

(These were not visible in the Haddock output anyway.)
Bodigrim pushed a commit that referenced this issue Oct 15, 2024
`toLazyByteString :: Builder -> LazyByteString` had a race condition that could
generate wrong results if two threads concurrently evaluated the result.  This
bug was introduced in #581 (5c4d236) and first
present in release 0.11.5.0 (as 0c030bb).

Due to the use of `unsafeDupablePerformIO` for performance, it is critical that
any IO actions executed when running a `Builder` can be interrupted or executed
multiple times.  In principle, filling a buffer is safe provided the buffer is
used only once and the same bytes are written each time. However, `wrapChunk` in
`buildStepToCIOS` would re-use a buffer in the trimming case after copying its
contents to produce a new trimmed chunk. This is safe when run in a single
thread, but if two threads simultaneously execute the code, one of them may
still be copying the contents while the other starts overwriting the buffer.

This patch fixes `wrapChunk` to unconditionally allocate a new buffer after
trimming, rather than re-using the old buffer. This will presumably come at a
slight performance cost for builders inserting many trimmed chunks.
Bodigrim pushed a commit that referenced this issue Oct 15, 2024
* Clarify documentation of 'customStrategy' based on #690

* Remove outdated comments on AllocationStrategy

(These were not visible in the Haddock output anyway.)
@Bodigrim Bodigrim modified the milestones: 0.11.5.4, 0.12.2.0 Oct 15, 2024
@Bodigrim
Copy link
Contributor

This is meant to be both 0.11.5.4 and 0.12.2.0, but GitHub allows only one milestone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants