consistent 32-bit CI error #29923
I just got another of these.
This is reproducible for me on an i686 build from a 64-bit Ubuntu 18.04 host. The bug reduces to:
Well, it makes absolutely no sense, but reverting e2f7b2f fixes this for me locally.
More precisely, reverting only the following line seems to fix the problem, but I don't understand why.

```diff
diff --git a/base/array.jl b/base/array.jl
index 9e8da6c384..74823710ab 100644
--- a/base/array.jl
+++ b/base/array.jl
@@ -272,7 +272,6 @@ function copyto!(dest::Array{T}, doffs::Integer, src::Array{T}, soffs::Integer,
         throw(BoundsError())
     end
     unsafe_copyto!(dest, doffs, src, soffs, n)
-    return dest
 end
 # Outlining this because otherwise a catastrophic inference slowdown
```

Edit: having woken my machine up from its sleep state, this one-line change no longer "fixes" the problem. Which makes sense if there's something nasty and intermittent going on here; an uninitialized memory read or some such thing, perhaps.
It looks like my machine has fallen out of whatever state it was in which allowed me to reproduce this. What I did when reproducing was to build 0999799 with … This was yesterday, but the build of 0999799 no longer exhibits the problem today. I'll chalk that up to it being intermittent (I don't think it's just me losing my mind :-) ).
Got another window of reproducibility; it looks like the cached inferred code for … Applying the following patch:

```diff
diff --git a/base/compiler/typeinfer.jl b/base/compiler/typeinfer.jl
index 01fb88b024..e1912b71da 100644
--- a/base/compiler/typeinfer.jl
+++ b/base/compiler/typeinfer.jl
@@ -594,6 +594,11 @@ function typeinf_type(method::Method, @nospecialize(atypes), sparams::SimpleVector
         inf = code.inferred
         if !isa(inf, CodeInfo) || (inf::CodeInfo).inferred
             i == 2 && ccall(:jl_typeinf_end, Cvoid, ())
+            println("typeinf_type code = ", code)
+            println("typeinf_type ", method, " code.rettype = ", code.rettype)
+            println("typeinf_type ", code.inferred)
+            println("typeinf_type ", ccall(:jl_uncompress_ast, Any, (Any, Any), method,
+                                           code.inferred::Vector{UInt8}))
             return code.rettype
         end
     end
```

I get the following output: … Given these …
This would also explain why it's not seen on 64-bit Linux, as we don't use GMP there for implementing `UInt128` division.
Here's another oddity:

```julia
julia> @code_typed rem(UInt128(1), UInt128(1))
CodeInfo(
778 1 ─ %1 = invoke Base.BigInt(_2::UInt128)::BigInt      │
    │   %2 = invoke Base.BigInt(_3::UInt128)::BigInt      │
    │   %3 = Base.GMP.MPZ.tdiv_r::typeof(tdiv_r)          │╻ rem
    │   %4 = invoke %3(%1::BigInt, %2::BigInt)::BigInt    ││
    │   %5 = invoke Base.UInt128(%4::BigInt)::UInt128     │
    └──      return %5                                    │
) => UInt128

julia> Base.return_types(rem, (UInt128, UInt128))
1-element Array{Any,1}:
 Any
```
More fun:

```julia
$ ./julia
[...]
julia> Base.return_types(rem, (UInt128, UInt128))
1-element Array{Any,1}:
 Any

# Now replace `rem` with exactly the same definition, to cause recompilation:
julia> Base.eval(:(
       function rem(x::UInt128, y::UInt128)
           return UInt128(rem(BigInt(x), BigInt(y)))
       end
       ))
rem (generic function with 136 methods)

julia> Base.return_types(rem, (UInt128, UInt128))
1-element Array{Any,1}:
 UInt128
```

So I guess the incorrect return type for `rem(::UInt128, ::UInt128)` is baked in during the sysimg build.
The following patch dumps the inferred return type whenever a method named `rem` is added to the cache:

```diff
diff --git a/src/gf.c b/src/gf.c
index ae4a1413b7..414a52e369 100644
--- a/src/gf.c
+++ b/src/gf.c
@@ -432,6 +432,13 @@ JL_DLLEXPORT jl_method_instance_t* jl_set_method_inferred(
     li->functionObjectsDecls.functionObject = NULL;
     li->functionObjectsDecls.specFunctionObject = NULL;
     li->rettype = rettype;
+    if (li->def.method->name == jl_symbol("rem")) {
+        jl_printf(JL_STDOUT, "jl_set_method_inferred rem; rettype = ");
+        jl_static_show(JL_STDOUT, li->rettype);
+        jl_printf(JL_STDOUT, " ");
+        jl_static_show(JL_STDOUT, li->specTypes);
+        jl_printf(JL_STDOUT, "\n");
+    }
     jl_gc_wb(li, rettype);
     li->inferred = inferred;
     jl_gc_wb(li, inferred);
```

It demonstrates that the 32-bit build sometimes infers the wrong type for `rem(::UInt128, ::UInt128)`: sometimes it produces `UInt128`, and other times it produces `Any`. I'm not sure whether this depends on being inside the sysimg build or not.
Ok, to reproduce this, you need an i686 build; apply the patch above and run the sysimg build step several times until you see the uninferred version of `rem(::UInt128, ::UInt128)` show up twice in the log. Another change which might have caused this to appear: #29795.
Nice work tracking this down, @c42f! @Keno, @vtjnash, @JeffBezanson: any idea what's going on?
Thanks, it's a head-scratcher, that's for sure. I caught the following stack trace from inside …
This is no longer what I'd describe as "intermittent"; it seems to happen quite consistently now.
I think I have a handle on this one: on 32-bit systems, `ndigits(x, base=2)` for a `UInt128` apparently ends up calling back into the `BigInt` conversion (via `rem(::UInt128, ::UInt128)`, which is implemented with `BigInt`), so inference runs into a recursive cycle. To verify, I've added

```julia
function BigInt(x::UInt128)
    x == 0 && return BigInt(Culong(0))
    nd = sizeof(x)<<3 - leading_zeros(x) # instead of ndigits(x, base=2)
    z = MPZ.realloc2(nd)
    s = sign(x)
    s == -1 && (x = -x)
    x = unsigned(x)
    size = 0
    limbnbits = sizeof(Limb) << 3
    while nd > 0
        size += 1
        unsafe_store!(z.d, x % Limb, size)
        x >>>= limbnbits
        nd -= limbnbits
    end
    z.size = s*size
    z
end
```

which is a copy of the generic `BigInt(::Integer)` constructor, with the `ndigits(x, base=2)` call replaced by plain bit arithmetic, breaking the cycle. Not sure how to properly fix this, though. Options include: …
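For intuition, the limb-filling loop in that constructor can be sketched in Python (the helper name is mine; 32-bit limbs, as on an i686 build):

```python
def to_limbs(x, limb_bits=32):
    """Split a nonnegative integer into little-endian limbs, mirroring the
    while-loop above: nd counts significant bits, and each pass stores the
    low limb and shifts it out."""
    nd = x.bit_length()          # same as sizeof(x)<<3 - leading_zeros(x)
    mask = (1 << limb_bits) - 1
    limbs = []
    while nd > 0:
        limbs.append(x & mask)   # unsafe_store!(z.d, x % Limb, size)
        x >>= limb_bits          # x >>>= limbnbits
        nd -= limb_bits
    return limbs

# a value with only a bit in the third 32-bit limb and the low limb set:
assert to_limbs((1 << 64) + 5) == [5, 0, 1]
```

The point of the workaround is that the bit count comes from `leading_zeros`-style bit arithmetic alone, so nothing here calls back into integer division.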
@martinholters Nice! That sounds like part of the puzzle, and it means the problem started happening with the change at 6fa3e45#diff-c39f3bbf0cb1c2381243b42647d9ff1dR287. Though it doesn't explain the inconsistency of sometimes getting `UInt128` and sometimes `Any`.
The symptoms are mitigated by #30032, but the underlying inference issue remains.
Thinking about this again, my analysis above doesn't explain anything. If the arguments during recursion don't change, inference is happy with it. Only if the same method gets called with different argument types may problems appear; i.e., inference may bail out (infer the result as `Any`) to guarantee termination.
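To illustrate that distinction, here is a toy model in Python (an illustration only, much simpler than Julia's actual algorithm): self-recursion with unchanged argument types can converge to a precise result, while re-entering the same method with different argument types forces the analysis to widen to `Any` so that it terminates.

```python
# Hypothetical call graph: each (function, argtype) signature lists the
# signatures it calls. "f" recurses with the same argument type;
# "g" recurses with ever-different argument types.
CALLS = {
    ("f", "Int"):    [("f", "Int")],     # same signature: benign recursion
    ("g", "Int"):    [("g", "Tuple1")],  # same method, growing argtypes
    ("g", "Tuple1"): [("g", "Tuple2")],
    ("g", "Tuple2"): [],
}

def infer(sig, stack=()):
    if sig in stack:
        # exact self-recursion: assume the fixed point is precise
        return "Int"
    if any(func == sig[0] for func, _ in stack):
        # same method already on the stack with different argument types:
        # bail out with Any to guarantee termination
        return "Any"
    results = [infer(callee, stack + (sig,)) for callee in CALLS[sig]]
    # a widened callee result poisons this frame's result too
    return "Any" if "Any" in results else "Int"

assert infer(("f", "Int")) == "Int"   # unchanged recursion stays precise
assert infer(("g", "Int")) == "Any"   # changing argtypes get widened
```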
I ran valgrind against the offending sysimg compile step. In one run, the following log occurred: … Which is extremely suspicious on the face of it, but repeated reruns with …
@martinholters I think I've got a repro of this outside of bootstrap. Try the following:

```julia
# Force recompilation
julia> Base.eval(:(rem(x::UInt128, y::UInt128) = UInt128(rem(BigInt(x), BigInt(y)))))
rem (generic function with 136 methods)

# Compiling directly gives the right result
julia> Base.return_types(rem, (UInt128, UInt128))
1-element Array{Any,1}:
 UInt128

# Now force recompilation again, entering inference via `parse`:
julia> Base.eval(:(rem(x::UInt128, y::UInt128) = UInt128(rem(BigInt(x), BigInt(y)))))

julia> parse(UInt128, "1")

julia> Base.return_types(rem, (UInt128, UInt128))
1-element Array{Any,1}:
 Any
```

I think the reason this is inconsistent from one sysimg compile to the next is that the compilation inside … In particular, in test runs I found that …
Any time there is inconsistency in the handling of cycles during inference, we are likely to get this kind of inconsistent behavior, due to the unstable ordering coming from … If we sorted the list of specializations (line 375 in 9a7c79c), that would at least make such bugs deterministic.
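A minimal Python sketch of why unstable ordering matters (the specialization names here are made up for illustration): the same set of work items, enqueued in different orders, gets processed in different orders, so an order-dependent bug appears in some runs and not others; sorting by a stable key removes that variable.

```python
# Two builds discover the same specializations, but in different orders
# (insertion-ordered dicts stand in for hash-ordered tables).
build_a = dict.fromkeys(
    ["rem(UInt128,UInt128)", "BigInt(UInt128)", "ndigits(UInt128)"])
build_b = dict.fromkeys(
    ["ndigits(UInt128)", "rem(UInt128,UInt128)", "BigInt(UInt128)"])

assert set(build_a) == set(build_b)    # same work items...
assert list(build_a) != list(build_b)  # ...processed in different orders

# Sorting by a stable key makes every run process the list identically,
# so an order-dependent caching bug reproduces deterministically.
assert sorted(build_a) == sorted(build_b)
```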
Seems to have been missing since the original implementation of IPO constant-prop. It's impressive how long we got away with just poisoning the cache with `Any` if there was a self-recursive call. Fix #29923
That's a really great repro, and it helped narrow this down quickly (spoiler: it wasn't where I expected). Usually it's enough to apply this diff to spot where inference diverges from my expectations:

```diff
diff --git a/base/compiler/abstractinterpretation.jl b/base/compiler/abstractinterpretation.jl
index e30b5496b0..65c1d6a2b0 100644
--- a/base/compiler/abstractinterpretation.jl
+++ b/base/compiler/abstractinterpretation.jl
@@ -300,6 +300,11 @@ function abstract_call_method(method::Method, @nospecialize(sig), sparams::SimpleVector
         newsig = limit_type_size(sig, comparison, sv.linfo.specTypes, sv.params.TUPLE_COMPLEXITY_LIMIT_DEPTH, spec_len)
         if newsig !== sig
+            println()
+            println(sig)
+            println(newsig)
+            print_callstack(sv)
+            println()
             # continue inference, but note that we've limited parameter complexity
             # on this call (to ensure convergence), so that we don't cache this result
             if call_result_unused(sv)
```

But I should perhaps have attempted to record (or repeat?) this debug session; since this didn't turn up anything, I had to go hunting further. Next I took a look at what was going into the cache:

```diff
diff --git a/base/compiler/typeinfer.jl b/base/compiler/typeinfer.jl
index 01fb88b024..b26860d341 100644
--- a/base/compiler/typeinfer.jl
+++ b/base/compiler/typeinfer.jl
@@ -28,6 +27,15 @@ function typeinf(frame::InferenceState)
     # empty!(frames)
     min_valid = frame.min_valid
     max_valid = frame.max_valid
+    if cached && frame.parent !== nothing
+        println()
+        for caller in results
+            println(caller.argtypes, " => ", caller.result)
+            # println(caller.src)
+        end
+        print_callstack(frame)
+        println()
+    end
     if cached || frame.parent !== nothing
         for caller in results
             opt = caller.src
```

This was suspiciously pointing the finger at something permitting results to go into the cache that were clearly bogus, and they had the structure of a self-recursive call on a constant (these are marked …).
Yes. While we try to make inference fairly predictable and stable(*), sorting the precompile lists there by something stable (such as Method min_world when iterating over bindings) might help with faster isolation of similar problems.

(*) In particular, we want it to have the property that any particular call to inference (e.g. `code_typed` or `Base.return_types`) should not return a wider result than expected due to cache effects. We're fine with it giving unpredictably narrower answers based on stuff that happened to be done in the past, and we're fine with it giving entirely unpredictable answers when running recursively, as part of converging inference (up to the limit that it shouldn't widen the final type given as the answer, though that does mean the set of possible optimizations and inlining afterwards is expected to be unpredictable in general).
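The contract in that footnote can be phrased as a one-line check (a toy widening order in Python; this four-element chain is a stand-in for illustration, not Julia's real subtype lattice):

```python
# Widths in increasing order: any inference answer sits somewhere on this scale.
ORDER = ["Union{}", "UInt128", "Integer", "Any"]
width = ORDER.index

def cache_effect_ok(with_cache, fresh):
    # cache effects may only narrow an answer, never widen it
    return width(with_cache) <= width(fresh)

assert cache_effect_ok("UInt128", "Any")      # narrower than fresh: allowed
assert not cache_effect_ok("Any", "UInt128")  # this issue: widened by the cache
```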
This should help in making ordering-dependent bugs such as JuliaLang#29923 (an error in caching of inferred methods) deterministic, which will make them much easier to find in the future.
Great! I was hoping someone with knowledge of inference would step in and fix this. Thanks to Martin as well for giving just the hint I needed to repro it.
How about sorting …
Seems to have been missing since the original implementation of IPO constant-prop. It's impressive how long we got away with just poisoning the cache with `Any` if there was a self-recursive call, but it requires a fairly convoluted setup to trigger this case. Fix #29923
This seems to have started recently, e.g. https://travis-ci.org/JuliaLang/julia/jobs/450035986