Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk #528

Open
jmid opened this issue Jan 19, 2025 · 4 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Copy link
Collaborator

jmid commented Jan 19, 2025

On #521 the musl trunk workflow triggered a dummy found! error exiting the Lin Dynarray stress test:
https://github.com/ocaml-multicore/multicoretests/actions/runs/12846368030/job/35821476446?pr=521

random seed: 225104823
generated error fail pass / total     time test name

[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray test with Domain
[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray test with Domain (generating)
[✓]    2    0    1    1 / 1000     2.1s Lin Dynarray test with Domain

dummy found!
File "_build/.dune/default/src/dynarray/dune", line 3, characters 7-16:
3 |  (name lin_tests)
           ^^^^^^^^^
(cd _build/default/src/dynarray && ./lin_tests.exe --verbose)
Command exited with code 1.
[ ]    0    0    0    0 / 1000     0.0s Lin Dynarray stress test with Domain

This is caused by these lines implementing a get instrumented with a tag check:

let get_check a i =
let v = Dynarray.get a i in
if not (Obj.is_int (Obj.repr v)) then (Printf.eprintf "dummy found!\n%!"; exit 1) else v

IIUC, this means the test is observing a non-int dummy value from the unboxed implementation from ocaml/ocaml#12885.
A code comment in stdlib/dynarray.ml reads:

   [...] This dummy is *not* type-safe, it is not a valid value of type ['a], so
   we must be very careful never to return it to the user. After
   accessing any element of the array, we must check that it is not
   the dummy. In particular, this dummy must be distinct from any
   other value the user could provide -- we ensure this by using
   a dynamically-allocated mutable reference as our dummy.

At the same time, the module is documented to be unsafe for parallel usage:

Unsynchronized accesses

    Unsynchronized accesses to dynamic arrays are a programming error.

   Concurrent accesses to dynamic arrays must be synchronized
   (for instance with a {!Mutex.t}). Unsynchronized accesses to
   a dynamic array are a programming error that may lead to an invalid
   dynamic array state, on which some operations would fail with an
   [Invalid_argument] exception.

Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔

@OlivierNicole
Copy link
Contributor

Can we conclude that parallel usage may be type unsafe? Ping @OlivierNicole: WDYT? 🤔

It shouldn’t be, and yet it seems that it is.

I guess my next move, if I wanted to track this, would be to modify the test to return a result or something like that, so that qcheck-stm will show me the sequence of commands that led to the error rather than merely exiting; and try to work it out from there.

Cc-ing @gasche who might be interested to know.

@gasche
Copy link

gasche commented Jan 20, 2025

It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test?

Slightly more information: there is a dynamic check in the Dynarray implementation that should ensures that we fail with an exception instead of ever returning a dummy value. Here this check does not suffice to guarantee this property. Different dynarrays may have different dummy values (but only marshalling creates distinct dummies, and I suppose you don't exercise it?), and it may be the case that an implementation bug causes two different dummies to be mixed up in the same backing array, which would result in the dynamic check being incomplete. For example, the implementation of blit looks suspiciously complex and could hide such a bug.

@gasche
Copy link

gasche commented Jan 20, 2025

I suppose that the operations being exercised are the ones listed in

[ val_ "get_check" get_check (t @-> int @-> returning_or_exc elem);
val_ "set" Dynarray.set (t @-> int @-> elem @-> returning_or_exc unit);
val_ "length" Dynarray.length (t @-> returning int);
val_freq 3 "add_last" Dynarray.add_last (t @-> elem @-> returning_or_exc unit);
val_ "append_seq" Dynarray.append_seq (t @-> seq_small elem @-> returning_or_exc unit);
val_ "get_last" Dynarray.get_last (t @-> returning_or_exc elem);
val_ "pop_last" Dynarray.pop_last (t @-> returning_or_exc elem);
val_freq 2 "remove_last" Dynarray.remove_last (t @-> returning_or_exc unit);
val_ "clear" Dynarray.clear (t @-> returning_or_exc unit);
val_ "truncate" Dynarray.truncate (t @-> int @-> returning_or_exc unit);
val_ "ensure_capacity" Dynarray.ensure_capacity (t @-> int @-> returning_or_exc unit);
val_ "fit_capacity" Dynarray.fit_capacity (t @-> returning_or_exc unit);
(*val_ "blit" Dynarray.blit (t @-> int_not_too_big @-> t @-> int_not_too_big @-> int_not_too_big @-> returning_or_exc unit);*)
val_freq 2 "set_capacity" Dynarray.set_capacity (t @-> int @-> returning_or_exc unit);
val_ "reset" Dynarray.reset (t @-> returning_or_exc unit);
]

I see in particular that blit is commented out, so it should not be the one causing this issue. (It's not obvious whether there is marshalling involved in the machinery around it; could a dynarray used in the test be produced by marshalling then unmarshalling?)

@jmid
Copy link
Collaborator Author

jmid commented Jan 20, 2025

Thanks for the feedback both! 🙏

It looks like you found a bug in the implementation indeed. What is the set of operations that are used in this test?

The test just exercises the operations you list above. Indeed neither blit nor marshalling is involved.
Since the Lin test exits with the above message when encountering a dummy value, we regrettably don't know which sequence of operations caused it... 😬
As Olivier suggests, adjusting the STM test to express "unexpected value", e.g., as a result type, should allow us to print and understand what caused it.

I'll see if I can create a reproducer. A few quick local experiments however indicate that this is a rare, hard-to reproduce one, which also explain why it hasn't shown up before. Further note, that it showed up on a musl workflow, which tend to exhibit sufficiently different behaviour from glibc to surface things like these...

@jmid jmid added the ocaml5-issue A potential issue in the OCaml5 compiler/runtime label Jan 20, 2025
@jmid jmid changed the title dummy found! in Lin Dynarray stress test with Domain on musl trunk [ocaml5-issue] dummy found! in Lin Dynarray stress test with Domain on musl trunk Jan 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime
Projects
None yet
Development

No branches or pull requests

3 participants