Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Boxed mutable vector is buggy in doctests #385

Closed
lehins opened this issue Apr 28, 2021 · 16 comments
Closed

Boxed mutable vector is buggy in doctests #385

lehins opened this issue Apr 28, 2021 · 16 comments

Comments

@lehins
Copy link
Contributor

lehins commented Apr 28, 2021

Since I've added some doctests for grow function we've been experiencing some occasional segfaults on CI, but it was hard to spot where exactly the problem is lurking, since examples themselves seem to be correct and problem can't be replicated outside of doctests. I've experienced very similar problem elsewhere with boxed MutableArray# and SmallMutableArray#, so I had a hunch (#382 (comment)) that it had to do with boxed arrays and today we got this error on CI:

src/Data/Vector/Mutable.hs:368: failure in expression `V.unsafeFreeze mv''
expected: [10,20,888,999,777]
 but got: [10,20,283563033355,999,777]
                 ^

Examples: 339  Tried: 338  Errors: 0  Failures: 1

See https://github.com/haskell/vector/runs/2461343762?check_suite_focus=true

I'll inline example from haddock for the record:

-- >>> import qualified Data.Vector as V
-- >>> import qualified Data.Vector.Mutable as MV
-- >>> mv <- V.thaw $ V.fromList ([10, 20, 30] :: [Integer])
-- >>> mv' <- MV.grow mv 2
--
-- The two extra elements at the end of the newly allocated vector will be
-- uninitialized and will result in an error if evaluated, so me must overwrite
-- them with new values first:
--
-- >>> MV.write mv' 3 999
-- >>> MV.write mv' 4 777
-- >>> V.unsafeFreeze mv'
-- [10,20,30,999,777]
--
-- It is important to note that the source mutable vector is not affected when
-- the newly allocated one is mutated.
--
-- >>> MV.write mv' 2 888
-- >>> V.unsafeFreeze mv'
-- [10,20,888,999,777]
-- >>> V.unsafeFreeze mv
-- [10,20,30]

It is clear that there is something really weird going on. I am not 100% sure where the problem is really coming from, although I do suspect it probably is doctests, but since it originated here I'd like to have it as a starting point.

@Shimuuar You've expressed interest in investigating it. Any ideas?

@andrewthad
Copy link
Contributor

This doctest is bad. It calls unsafeFreeze on the same mutable vector twice, which is explicitly undefined behavior.

@lehins
Copy link
Contributor Author

lehins commented Apr 28, 2021

@andrewthad You know, I thought exact same thing a moment ago and did just that in 997615a

However, this is not described anywhere in a good detail. The only place I saw any mention of this was in primitive: "The array should not be modified after the conversion.", which doesn't tell us the reason why it should not be modified. The primop itself doesn't talk about this at all.

I know you are more familiar with the actual primops, so I have few questions for you:

  • how come this is not a problem for MutableByteArray#, just the boxed arrays?
  • why mutable array cannot be mutated? Is there some flag that is flipped when unsafeFreeze is called on it which somehow affects GC?
  • how come I cannot reproduce it with compiled code or in ghci, just in doctests?
  • would this be a legit fix (not that this would be put into documentation, just curious):
>>> mv <- V.thaw $ V.fromList ([10, 20, 30] :: [Integer])
>>> mv' <- MV.grow mv 2
>>> MV.write mv' 3 999
>>> MV.write mv' 4 777
>>> v' <- V.unsafeFreeze mv'
>>> print v'
[10,20,30,999,777]
>>> mv'' <- V.unsafeThaw v'
>>> MV.write mv'' 2 888
>>> v''' <- V.unsafeFreeze mv''
>>> print v'''
[10,20,888,999,777]
>>> V.unsafeFreeze mv
[10,20,30]

@andrewthad
Copy link
Contributor

Here's the reason I'm certain of for why mutating after performing an in-place freeze is bad. For any type of array (Array# or ByteArray#), let's say that you do this:

foo :: Char
foo = runST $ do
  dst <- new 3 -- starts out uninitialized
  write dst 0 'a'
  write dst 1 'b'
  write dst 2 'c'
  arr <- unsafeFreeze dst
  let x = index arr 1
  write dst 1 'd'
  pure x

What is the value of foo? Should it be 'd' or 'b'. It depends on the optimizations that GHC applies. I've not testing this case specifically, but at -O0 and at -O2, GHC might produce programs with different behaviors. (It should be possible for this to happen with either primitive or vector.) This is because one of GHC's core-to-core involving sinking let bindings, that is, floating them inward toward the place whether they are actually used. Another pass does the opposite, hoisting them upward, which could also change the behavior of a program that fails to respect the no-use-after-freeze invariant, but the floating inward pass is the one that matter more in this example. Essentially, the program could become:

foo :: Char
foo = runST $ do
  dst <- new 3 -- starts out uninitialized
  write dst 0 'a'
  write dst 1 'b'
  write dst 2 'c'
  arr <- unsafeFreeze dst
  write dst 1 'd'
  let x = index arr 1
  pure x

I'll write up some more observations later tonight.

@lehins
Copy link
Contributor Author

lehins commented Apr 28, 2021

Right. The floating out part certainly makes sense, but that is not the problem in the original doctest. There is exact sequencing because the value is printed to stdout, thus it can't be floated out.

@lehins
Copy link
Contributor Author

lehins commented Apr 28, 2021

Digging through the actual implementation does not reveal the reason to me. unsafeThawArray# and unsafeFreezeArray# thawArray# and freezeArray# are both implemented using the same macro cloneArray: namely create a new array of pointers and copy over all of the addresses with memcpy. The only difference between the two is that the new array is created with a different flag: stg_MUT_ARR_PTRS_DIRTY_info and stg_MUT_ARR_PTRS_FROZEN_CLEAN_info respectfully.

This is not the most efficient way to freeze/thaw arrays, although it is probably necessary for GC, but it should allow for the behavior in the original doctest, because the mutable array that is being frozen is left untouched upon freezing. That would explain why I can't reproduce it outside doctests. My guess is doctests executes those tests in some weird way.

@andrewthad
Copy link
Contributor

unsafeThaw# and unsafeFreeze# do not use memcpy. They just mark the header word since they are in-place operations.

@lehins
Copy link
Contributor Author

lehins commented Apr 28, 2021

Oh damn, you are right, I was looking at freezeArray#!!!

@andrewthad
Copy link
Contributor

andrewthad commented Apr 28, 2021

Also, fwiw, I consider unsafeThaw# to be a generally bad primop. The only somewhat legitimate use for it that I've ever seen is in unordered-containers, where is used to make in-place modifications to a hashmap that is very carefully used linearly. Typically though, if you know that you are going to have to in-place thaw an array, you shouldn't have frozen it yet. And then since you'll be using write and read (and not index), all operations are guaranteed to be sequenced correctly.

You're looking at stg_freezzeArrayzh and stg_thawArrayzh, which are out-of-line primops. EDIT: Ah, you already figured this out.

@lehins
Copy link
Contributor Author

lehins commented Apr 28, 2021

Now that I looked into the right primop it all makes total sense. @andrewthad Thank you for your feedback.

Also, fwiw, I consider unsafeThaw# to be a generally bad primop.

100% agree: #139 (comment)

@lehins
Copy link
Contributor Author

lehins commented Apr 29, 2021

Fix for the bug is in #384

@andrewthad
Copy link
Contributor

I still have no idea what's really happening that causes the original problem. Only a hand-wavy "this is undefined behavior, so anything goes". Other random thoughts. You asked:

how come this is not a problem for MutableByteArray#, just the boxed arrays?

I don't know. Writing and in-place freezing are both more complicated for boxed arrays than for MutableByteArray#. In particular, for boxed array, you have four closure types (two for mutable and two for immutable) in GHC, and these cause the garbage collector (but not any primops) to treat them differently:

#define MUT_ARR_PTRS_CLEAN            43
#define MUT_ARR_PTRS_DIRTY            44
#define MUT_ARR_PTRS_FROZEN_DIRTY     45
#define MUT_ARR_PTRS_FROZEN_CLEAN     46

By contrast, unboxed arrays just a single closure type used for both ByteArray# and MutableByteArray#:

#define ARR_WORDS                     42

So there's that. And then there's also the issue of GHC, as a memory ordering concern, has to make sure that writes to the memory backing a freshly-allocated boxed value are seen before its pointer gets written to an array. This only matters when multithreading is in play, and historically, this caused problems on ARM, but Ben Gamari fixed all of this in the GHC 8.10.x series.

Those are the differences that I'm aware of that might matter. But I don't really know how doctest works, and it might be introducing other problems.

@andrewthad
Copy link
Contributor

I like the fix you chose. Using freeze is definitely the right thing to do here. If there are still issues that happen after switching to freeze, then it's a doctest issue for sure.

@Shimuuar
Copy link
Contributor

So far I can't reproduce bug with either GHC or doctests (I'm using GHC 8.8)

src/Data/Vector/Mutable.hs:368: failure in expression `V.unsafeFreeze mv''
expected: [10,20,888,999,777]
 but got: [10,20,283563033355,999,777]

What worry me is 283563033355. What is it? Could it be that GC happening at inopportune moment is able to write some junk to vector?

@lehins
Copy link
Contributor Author

lehins commented Apr 29, 2021

So far I can't reproduce bug with either GHC or doctests (I'm using GHC 8.8)

I am able to reproduce it with doctests, but very rarely. It is not triggered consistently and outside doctests I've never seen it happened at all and I've written some property tests, tried it ghci while forcing GC at various steps.

What worry me is 283563033355. What is it?

I suspect it reads some arbitrary memory location and interprets its contents as Integer. What happened to the value that was written into it? I don't know, maybe GC collected it or something. It is a bit of a black box to me. (which would explain why we don't see this behavior for unboxed arrays, since GC doesn't deal with actual elements)

One way or another unsafeFreezeArray# does mutate a header in the mutable boxed array and thus changes its state. This means that after such a call RTS might be interpreting it in some totally different way in some specific circumstances. Looking at the primops convinced me that mutable boxed array should not be mutated after unsafeFreezeArray#. That being said, I do believe that this piece of code is actually safe:

>>> v' <- V.unsafeFreeze mv'
>>> print v'
[10,20,30,999,777]
>>> mv'' <- V.unsafeThaw v'
>>> MV.write mv'' 2 888

because unsafeThaw reverts the flag back to mutable and as long as this code is executed in the same thread we are guaranteed the ordering, because v' is printed and not bound to something that can be floated out.

@lehins
Copy link
Contributor Author

lehins commented Apr 29, 2021

Closing this issue with #384 PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants