-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix type instability in getproperty(::Schema, ::Symbol)
#340
Conversation
I have two comments:
|
Keep in mind that this PR does not modify the code's behavior at all. It just switches from If return types === nothing ? (T !== nothing ? T : nothing) : ntuple(i -> fieldtype(types, i), Val(fieldcount(types)))
Any tips on how to do this? I've never unit-tested compiler type inference before. It feels like something that would be closely tied to a specific Julia version too. |
Oh it looks like there is |
The point of my question is that I want to make sure that if the compiler tries to infer it it is not too costly. (maybe it does not make a difference, but I want to make sure; in general passing
Yes |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Happened to have some free time this evening so ran some detailed benchmarking; see below. Both TTFX and runtime are shown. The key takeaways are:
So, up to you. I personally think it's worth it, especially because large number of columns are already dealt with in the package. Furthermore you'd get type stability in downstream packages which will bring even more speedups. Some of the code used below:Benchmarks: using Tables
using BenchmarkTools
using Plots
compiled = BenchmarkGroup()
ttfx = Dict{Int,Float64}()
all_n = map(i -> round(Int, 2^i), 1.0:0.5:12.0);
function get_types(X)
return Tables.schema(X).types
end
# Test run:
test_X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:1])
@timed get_types(test_X)
for n in all_n
local X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:n])
ttfx[n] = (@timed get_types(X)).time
end
for n in all_n
compiled[n] = @benchmarkable(get_types(X), setup = (X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:$n])))
end
tune!(compiled)
compiled_results = run(compiled, verbose=true)
times = [minimum(compiled_results["compiled"][n]).time for n in all_n]
# (Save to CSV file) Plots: using Plots
using CSV
# mm for plots:
using Plots.PlotMeasures
ttfx = (
old=CSV.read("ttfx_old.csv", NamedTuple),
new=CSV.read("ttfx_new.csv", NamedTuple),
)
ttx = (
old=CSV.read("ttx_old.csv", NamedTuple),
new=CSV.read("ttx_new.csv", NamedTuple),
)
theme(:bright)
p = plot(
plot(
[ttfx.old.n, ttfx.new.n],
[ttfx.old.times, ttfx.new.times],
xaxis=(:log, "Columns in Table"),
yaxis=(:log, "Time [s]"),
title="Time to first `.types`",
label=["old" "new"],
),
plot(
[ttx.old.n, ttx.new.n],
[ttx.old.times, ttx.new.times],
xaxis=(:log, "Columns in Table"),
yaxis=(:log, "Time [ns]"),
title="Compiled runtime of `.types`",
label=["old" "new"],
),
layout=(1,2),
size=(1000,500),
# Prevent xlabel being cut off:
margin=5mm,
# Darker major grid lines:
gridalpha=0.5,
)
savefig(p, "plot.png")
|
You mean that if one uses large number of columns then anyway type-unstable table would be used (like |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #340 +/- ##
==========================================
+ Coverage 94.46% 94.58% +0.11%
==========================================
Files 7 7
Lines 723 739 +16
==========================================
+ Hits 683 699 +16
Misses 40 40
☔ View full report in Codecov by Sentry. |
I was just reading that from the docstring:
From which it sounds clear to me that users would not use thousands of columns in the type parameter anyways. (If they did, they would be screwed in other ways as well) |
If the compilation time is a big worry for you, I could do this instead: n = fieldcount(types)
if n <= 512
return ntuple(i -> fieldtype(types, i), Val(n))
else
return Tuple(fieldtype(types, i) for i=1:n)
end since |
I will be traveling for a bit so please feel free to edit the branch directly and merge if you want |
The decision maker is @quinnj here. I wanted to make sure we tested everything before making a decision. I agree that users probably do not use super wide tables in type-stable mode. |
Thanks, no worries. In any case with the change in 4d4c962, there is not really any regressions here – in compile time or run time – even for large tables. At worst it's about the same performance; at best you get a 100x-1000x runtime speedup, and downstream packages can do type inference. |
Ping @quinnj when you get a chance could you take a look at potentially merging this? Would be a big help for downstream packages |
Ping @quinnj |
Ping @quinnj 😅 |
@quinnj is OOO for a few weeks so we need to wait |
Gotcha, no worries! |
Hey @quinnj, would you be free to take a look at merging this? It’s holding up some type stability issues in MLJ. Thanks, Miles |
Sending another ping. This is still a source of type instabilities throughout the MLJ ecosystem, but is a really simple fix. |
Ping |
Is there another way to reach @quinnj that he checks more frequently? |
I think @quinnj is off-line on all channels currently. |
Awesome! Thanks! |
do you need a release or |
A release would be necessary (to get type stability across MLJ ecosystem) |
Fixes #339. The issue was that inside
getproperty
, the generator for table types:is type unstable.
This PR changes it to:
which is stable.
Edit: 4d4c962 changes it so that
getproperty
falls back to the type unstableTuple
when the number of columns goes above 512. This prevents any major slow downs in compile time.This gives some major speed ups in both runtime and compile time:
e.g., here is an inspection of
@code_warntype
on a function that returns.types
for a schema:whereas before this change, it was inferring this as
Vararg{DataType}
(bad).@quinnj @bkamins could you please review and merge when you get a chance? This is necessary to fix a variety of type instabilities in MLJ interfaces.
cc @ablaom @OkonSamuel. This should speed up calls to
fit
andpredict
across the MLJ ecosystem as it removes a key type instability.Thanks!
Miles