
Fix type instability in getproperty(::Schema, ::Symbol) #340

Merged
merged 7 commits on Oct 20, 2023

Conversation

MilesCranmer
Contributor

@MilesCranmer commented Aug 5, 2023

Fixes #339. The issue was that inside getproperty, the generator for table types:

Tuple(fieldtype(types, i) for i = 1:fieldcount(types))

is type unstable.

This PR changes it to:

ntuple(i -> fieldtype(types, i), Val(fieldcount(types)))

which is stable.
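The effect is easy to reproduce outside the package with a small stand-in type (`MySchema`, `types_unstable`, and `types_stable` below are hypothetical names for illustration, not Tables.jl API):

```julia
using Test

# Hypothetical stand-in for Tables.Schema: the column types live in a type
# parameter, so `fieldcount(T)` is a compile-time constant.
struct MySchema{T} end

# Generator version: the compiler cannot infer the tuple's length.
types_unstable(::MySchema{T}) where {T} =
    Tuple(fieldtype(T, i) for i = 1:fieldcount(T))

# ntuple + Val version: unrolled at compile time, fully inferred.
types_stable(::MySchema{T}) where {T} =
    ntuple(i -> fieldtype(T, i), Val(fieldcount(T)))

s = MySchema{Tuple{Int, Float64}}()
@test types_unstable(s) == (Int64, Float64)  # same value either way
# `@inferred` errors if the runtime type differs from the inferred type;
# it passes for the ntuple/Val version:
@test @inferred(types_stable(s)) == (Int64, Float64)
```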

Edit: 4d4c962 changes it so that getproperty falls back to the type-unstable Tuple construction when the number of columns goes above 512. This prevents any major slowdowns in compile time.

This gives some major speed ups in both runtime and compile time:

[plot: runtime and compile-time benchmarks, old vs. new]

e.g., here is an inspection of @code_warntype on a function that returns .types for a schema:

julia> X = (x=randn(32), y=randn(32));

julia> f(X) = Tables.schema(X).types
f (generic function with 1 method)

julia> @code_warntype f(X)
MethodInstance for f(::@NamedTuple{x::Vector{Float64}, y::Vector{Float64}})
  from f(X) @ Main REPL[5]:1
Arguments
  #self#::Core.Const(f)
  X::@NamedTuple{x::Vector{Float64}, y::Vector{Float64}}
Body::Tuple{DataType, DataType}
1 ─ %1 = Tables.schema::Core.Const(Tables.schema)
│   %2 = (%1)(X)::Core.Const(Tables.Schema:
 :x  Float64
 :y  Float64)
│   %3 = Base.getproperty(%2, :types)::Core.Const((Float64, Float64))
└──      return %3

whereas before this change, it was inferred as Vararg{DataType} (bad).

@quinnj @bkamins could you please review and merge when you get a chance? This is necessary to fix a variety of type instabilities in MLJ interfaces.

cc @ablaom @OkonSamuel. This should speed up calls to fit and predict across the MLJ ecosystem as it removes a key type instability.

Thanks!
Miles

@bkamins
Member

bkamins commented Aug 5, 2023

I have two comments:

  1. Have you benchmarked how Val scales for wide tables (like 1000+ columns)?
  2. Could you please add a test that checks for type stability of the result?

@MilesCranmer
Contributor Author

  1. Have you benchmarked how Val scales for wide tables (like 1000+ columns)?

Keep in mind that this PR does not modify the code's behavior at all. It just switches from Tuple(... for i = 1:n) to ntuple(..., Val(n)), taking advantage of the fact that n is a compile-time constant (if types is not nothing). The return type is the same; the change just helps the compiler infer it.

If types is nothing, as I guess would happen for a large table like that (?), the Val branch would not even get hit:

return types === nothing ? (T !== nothing ? T : nothing) : ntuple(i -> fieldtype(types, i), Val(fieldcount(types)))

  2. Could you please add a test that checks for type stability of the result?

Any tips on how to do this? I've never unit-tested compiler type inference before. It feels like something that would be closely tied to a specific Julia version too.

@MilesCranmer
Contributor Author

Oh it looks like there is Test.@inferred. Is that what you mean?
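(For reference: Test.@inferred runs the call, compares the compiler's inferred return type against the actual runtime type, and throws on a mismatch. A toy example, with made-up functions:)

```julia
using Test

stable_f(x::Int) = x + 1               # always returns Int
unstable_f(x::Int) = x > 0 ? x : 0.0   # inferred as Union{Float64, Int64}

@inferred stable_f(1)                  # passes, returns 2
# For the type-unstable function, @inferred throws an ErrorException:
@test_throws ErrorException @inferred(unstable_f(1))
```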

@bkamins
Member

bkamins commented Aug 5, 2023

The return type is the same, it just helps the compiler infer it.

The point of my question is that I want to make sure that if the compiler tries to infer it, the inference is not too costly. (Maybe it does not make a difference, but I want to be sure; in general, passing Val to ntuple leads to code generation.)

Oh it looks like there is Test.@inferred. Is that what you mean?

Yes


@MilesCranmer
Contributor Author

MilesCranmer commented Aug 6, 2023

Happened to have some free time this evening, so I ran some detailed benchmarks; see below. Both TTFX and runtime are shown.

[plot: TTFX and compiled runtime of `.types`, old vs. new]

The key takeaways are:

  • Runtime:
    • 100-1000x speedup for fewer than 40 columns
    • 100x speedup for more than 40 columns
  • Time to first .types:
    • 1-3x speedup for fewer than 150 columns
    • increasingly slower beyond 150 columns, with a 7x slowdown at 1000 columns

So, up to you. I personally think it's worth it, especially because large numbers of columns are already dealt with in the package.

Furthermore you'd get type stability in downstream packages which will bring even more speedups.

Some of the code used is below:

Benchmarks:

using Tables
using BenchmarkTools
using Plots

compiled = BenchmarkGroup()
ttfx = Dict{Int,Float64}()

all_n = map(i -> round(Int, 2^i), 1.0:0.5:12.0);

function get_types(X)
    return Tables.schema(X).types
end

# Test run:
test_X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:1])
@timed get_types(test_X)

for n in all_n
    local X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:n])
    ttfx[n] = (@timed get_types(X)).time
end

for n in all_n
    compiled[n] = @benchmarkable(get_types(X), setup = (X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:$n])))
end

tune!(compiled)
compiled_results = run(compiled, verbose=true)

times = [minimum(compiled_results[n]).time for n in all_n]

# (Save to CSV file)

Plots:

using Plots
using CSV

# mm for plots:
using Plots.PlotMeasures

ttfx = (
    old=CSV.read("ttfx_old.csv", NamedTuple),
    new=CSV.read("ttfx_new.csv", NamedTuple),
)

ttx = (
    old=CSV.read("ttx_old.csv", NamedTuple),
    new=CSV.read("ttx_new.csv", NamedTuple),
)

theme(:bright)

p = plot(
    plot(
        [ttfx.old.n, ttfx.new.n],
        [ttfx.old.times, ttfx.new.times],
        xaxis=(:log, "Columns in Table"),
        yaxis=(:log, "Time [s]"),
        title="Time to first `.types`",
        label=["old" "new"],
    ),
    plot(
        [ttx.old.n, ttx.new.n],
        [ttx.old.times, ttx.new.times],
        xaxis=(:log, "Columns in Table"),
        yaxis=(:log, "Time [ns]"),
        title="Compiled runtime of `.types`",
        label=["old" "new"],
    ),
    layout=(1,2),
    size=(1000,500),
    # Prevent xlabel being cut off:
    margin=5mm,
    # Darker major grid lines:
    gridalpha=0.5,
)
savefig(p, "plot.png")

@bkamins
Member

bkamins commented Aug 6, 2023

especially because large number of columns are already dealt with in the package.

You mean that if one uses a large number of columns then a type-unstable table (like DataFrame) would be used anyway?

@codecov

codecov bot commented Aug 6, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dac28d9) 94.46% compared to head (fc33a5a) 94.58%.
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #340      +/-   ##
==========================================
+ Coverage   94.46%   94.58%   +0.11%     
==========================================
  Files           7        7              
  Lines         723      739      +16     
==========================================
+ Hits          683      699      +16     
  Misses         40       40              
Files Coverage Δ
src/Tables.jl 89.32% <100.00%> (+0.24%) ⬆️

... and 4 files with indirect coverage changes


@MilesCranmer
Contributor Author

I was just reading this in the docstring:

Encoding the names & types as type parameters allows convenient use of the type in generated functions and other optimization use-cases, but users should note that when names and/or types are the nothing value, the names and/or types are stored in the storednames and storedtypes fields. This is to account for extremely wide tables with columns in the 10s of thousands where encoding the names/types as type parameters becomes prohibitive to the compiler. So while optimizations can be written on the typed names/types type parameters, users should also consider handling the extremely wide tables by specializing on Tables.Schema{nothing, nothing}.

From which it sounds clear to me that users would not use thousands of columns in the type parameter anyways. (If they did, they would be screwed in other ways as well)
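(The split the docstring describes can be sketched with a toy type; `ToySchema` and `describe` below are illustrative stand-ins, not the actual Tables.jl definitions:)

```julia
# Toy version of the two schema flavors: narrow tables carry names/types
# as type parameters; extremely wide tables use `{nothing, nothing}` and
# store the names in a field instead.
struct ToySchema{names, types}
    storednames::Union{Nothing, Vector{Symbol}}
end

describe(::ToySchema{names, types}) where {names, types} =
    "typed schema: $names"
describe(sch::ToySchema{nothing, nothing}) =
    "wide schema: $(length(sch.storednames)) stored names"

narrow = ToySchema{(:x, :y), Tuple{Float64, Float64}}(nothing)
wide = ToySchema{nothing, nothing}([Symbol("x$(i)") for i = 1:20_000])

describe(narrow)  # "typed schema: (:x, :y)"
describe(wide)    # "wide schema: 20000 stored names"
```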

@MilesCranmer
Contributor Author

MilesCranmer commented Aug 6, 2023

If the compilation time is a big worry for you, I could do this instead:

n = fieldcount(types)
if n <= 512
    return ntuple(i -> fieldtype(types, i), Val(n))
else
    return Tuple(fieldtype(types, i) for i=1:n)
end

Since n is a compile-time constant, this would not affect the type stability for 512 or fewer columns.
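Wrapped as a standalone function (tuple_of_fieldtypes is a hypothetical name; the 512 cutoff is the one proposed above), the hybrid would look like:

```julia
# Hybrid accessor: compile-time unrolling up to 512 columns, and the old
# generator-based fallback beyond that to keep compile times bounded.
function tuple_of_fieldtypes(types::Type)
    n = fieldcount(types)
    if n <= 512
        # Type-stable whenever `types` (and hence `n`) is known at compile time.
        return ntuple(i -> fieldtype(types, i), Val(n))
    else
        # Type-unstable, but avoids generating enormous unrolled code.
        return Tuple(fieldtype(types, i) for i = 1:n)
    end
end

tuple_of_fieldtypes(Tuple{Int, Float64, String})  # (Int64, Float64, String)
```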

@MilesCranmer
Contributor Author

I added it. For more than 512 columns it falls back to the type-unstable method.

[plot: updated benchmarks with the 512-column fallback]

@MilesCranmer
Contributor Author

MilesCranmer commented Aug 6, 2023

I will be traveling for a bit, so please feel free to edit the branch directly and merge if you want.

@bkamins
Member

bkamins commented Aug 7, 2023

The decision maker is @quinnj here. I wanted to make sure we tested everything before making a decision. I agree that users probably do not use super wide tables in type-stable mode.

@MilesCranmer
Contributor Author

Thanks, no worries. In any case, with the change in 4d4c962, there are not really any regressions here, in compile time or run time, even for large tables. At worst it's about the same performance; at best you get a 100x-1000x runtime speedup, and downstream packages can do type inference.

@MilesCranmer
Contributor Author

Ping @quinnj: when you get a chance, could you take a look at potentially merging this? It would be a big help for downstream packages.

@MilesCranmer
Contributor Author

Ping @quinnj

@MilesCranmer
Contributor Author

Ping @quinnj 😅

@bkamins
Member

bkamins commented Sep 5, 2023

@quinnj is OOO for a few weeks, so we need to wait.

@MilesCranmer
Contributor Author

Gotcha, no worries!

@MilesCranmer
Contributor Author

Hey @quinnj, would you be free to take a look at merging this? It's blocking fixes for some type stability issues in MLJ. Thanks, Miles

@MilesCranmer
Contributor Author

Sending another ping. This is still a source of type instabilities throughout the MLJ ecosystem, but is a really simple fix.

@MilesCranmer
Contributor Author

Ping

@MilesCranmer
Contributor Author

Is there another way to reach @quinnj that he checks more frequently?

@bkamins
Member

bkamins commented Oct 19, 2023

I think @quinnj is off-line on all channels currently.

@bkamins merged commit b3ec016 into JuliaData:main Oct 20, 2023
8 checks passed
@MilesCranmer deleted the fix-schema-instability branch October 20, 2023 10:06
@MilesCranmer
Contributor Author

Awesome! Thanks!

@bkamins
Member

bkamins commented Oct 20, 2023

Do you need a release, or is the main branch enough for now?

@MilesCranmer
Contributor Author

A release would be necessary (to get type stability across the MLJ ecosystem).

Successfully merging this pull request may close these issues.

Type stability of Tables.matrix(::NamedTuple)