
Fix type instability in getproperty(::Schema, ::Symbol) #340

Merged
merged 7 commits on Oct 20, 2023

Conversation

MilesCranmer
Contributor

@MilesCranmer commented Aug 5, 2023

Fixes #339. The issue was that inside getproperty, the generator for table types:

Tuple(fieldtype(types, i) for i = 1:fieldcount(types))

is type unstable.

This PR changes it to:

ntuple(i -> fieldtype(types, i), Val(fieldcount(types)))

which is stable.
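The effect is easy to reproduce outside the package with a small stand-in type (`MySchema`, `types_unstable`, and `types_stable` below are hypothetical names for illustration, not Tables.jl API):

```julia
using Test

# Hypothetical stand-in for Tables.Schema: the column types live in a type
# parameter, so `fieldcount(T)` is a compile-time constant.
struct MySchema{T} end

# Generator version: the compiler cannot infer the tuple's length.
types_unstable(::MySchema{T}) where {T} =
    Tuple(fieldtype(T, i) for i = 1:fieldcount(T))

# ntuple + Val version: unrolled at compile time, fully inferred.
types_stable(::MySchema{T}) where {T} =
    ntuple(i -> fieldtype(T, i), Val(fieldcount(T)))

s = MySchema{Tuple{Int, Float64}}()
@test types_unstable(s) == (Int64, Float64)  # same value either way
# `@inferred` errors if the runtime type differs from the inferred type;
# it passes for the ntuple/Val version:
@test @inferred(types_stable(s)) == (Int64, Float64)
```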

Edit: 4d4c962 changes it so that getproperty falls back to the type-unstable Tuple construction when the number of columns goes above 512. This prevents any major slowdowns in compile time.

This gives some major speed ups in both runtime and compile time:

[plot: runtime and compile-time benchmarks, old vs. new]

e.g., here is an inspection of @code_warntype on a function that returns .types for a schema:

julia> X = (x=randn(32), y=randn(32));

julia> f(X) = Tables.schema(X).types
f (generic function with 1 method)

julia> @code_warntype f(X)
MethodInstance for f(::@NamedTuple{x::Vector{Float64}, y::Vector{Float64}})
  from f(X) @ Main REPL[5]:1
Arguments
  #self#::Core.Const(f)
  X::@NamedTuple{x::Vector{Float64}, y::Vector{Float64}}
Body::Tuple{DataType, DataType}
1 ─ %1 = Tables.schema::Core.Const(Tables.schema)
│   %2 = (%1)(X)::Core.Const(Tables.Schema:
 :x  Float64
 :y  Float64)
│   %3 = Base.getproperty(%2, :types)::Core.Const((Float64, Float64))
└──      return %3

whereas before this change, it was inferred as Vararg{DataType} (bad).

@quinnj @bkamins could you please review and merge when you get a chance? This is necessary to fix a variety of type instabilities in MLJ interfaces.

cc @ablaom @OkonSamuel. This should speed up calls to fit and predict across the MLJ ecosystem as it removes a key type instability.

Thanks!
Miles

@bkamins
Member

bkamins commented Aug 5, 2023

I have two comments:

  1. Have you benchmarked how Val scales for wide tables (like 1000+ columns)?
  2. Could you please add a test that checks for type stability of the result?

@MilesCranmer
Contributor Author

  1. Have you benchmarked how Val scales for wide tables (like 1000+ columns)?

Keep in mind that this PR does not modify the code's behavior at all. It just switches from Tuple(... for i = 1:n) to ntuple(..., Val(n)), taking advantage of the fact that n is a compile-time constant (if types is not nothing). The return type is the same; the change just helps the compiler infer it.

If types is nothing, as I guess would happen for a large table like that (?), the Val branch would not even get hit:

return types === nothing ? (T !== nothing ? T : nothing) : ntuple(i -> fieldtype(types, i), Val(fieldcount(types)))

  2. Could you please add a test that checks for type stability of the result?

Any tips on how to do this? I've never unit-tested compiler type inference before. It feels like something that would be closely tied to a specific Julia version too.

@MilesCranmer
Contributor Author

Oh it looks like there is Test.@inferred. Is that what you mean?
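(For reference: Test.@inferred runs the call, compares the compiler's inferred return type against the actual runtime type, and throws on a mismatch. A toy example, with made-up functions:)

```julia
using Test

stable_f(x::Int) = x + 1               # always returns Int
unstable_f(x::Int) = x > 0 ? x : 0.0   # inferred as Union{Float64, Int64}

@inferred stable_f(1)                  # passes, returns 2
# For the type-unstable function, @inferred throws an ErrorException:
@test_throws ErrorException @inferred(unstable_f(1))
```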

@bkamins
Member

bkamins commented Aug 5, 2023

The return type is the same, it just helps the compiler infer it.

The point of my question is that I want to make sure that if the compiler tries to infer it, the inference is not too costly. (Maybe it does not make a difference, but I want to be sure; in general, passing Val to ntuple leads to code generation.)

Oh it looks like there is Test.@inferred. Is that what you mean?

Yes


@MilesCranmer
Contributor Author

MilesCranmer commented Aug 6, 2023

Happened to have some free time this evening, so I ran some detailed benchmarks; see below. Both TTFX and runtime are shown.

[plot: TTFX and compiled runtime of `.types`, old vs. new]

The key takeaways are:

  • Runtime:
    • 100-1000x speedup for fewer than 40 columns
    • 100x speedup for more than 40 columns
  • Time to first .types:
    • 1-3x speedup for fewer than 150 columns
    • increasingly slower beyond 150 columns, with a 7x slowdown at 1000 columns

So, up to you. I personally think it's worth it, especially because large numbers of columns are already dealt with in the package.

Furthermore you'd get type stability in downstream packages which will bring even more speedups.

Some of the code used is below:

Benchmarks:

using Tables
using BenchmarkTools
using Plots

compiled = BenchmarkGroup()
ttfx = Dict{Int,Float64}()

all_n = map(i -> round(Int, 2^i), 1.0:0.5:12.0);

function get_types(X)
    return Tables.schema(X).types
end

# Test run:
test_X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:1])
@timed get_types(test_X)

for n in all_n
    local X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:n])
    ttfx[n] = (@timed get_types(X)).time
end

for n in all_n
    compiled[n] = @benchmarkable(get_types(X), setup = (X = NamedTuple([Symbol("x$(j)") => [1.0] for j = 1:$n])))
end

tune!(compiled)
compiled_results = run(compiled, verbose=true)

times = [minimum(compiled_results[n]).time for n in all_n]

# (Save to CSV file)

Plots:

using Plots
using CSV

# mm for plots:
using Plots.PlotMeasures

ttfx = (
    old=CSV.read("ttfx_old.csv", NamedTuple),
    new=CSV.read("ttfx_new.csv", NamedTuple),
)

ttx = (
    old=CSV.read("ttx_old.csv", NamedTuple),
    new=CSV.read("ttx_new.csv", NamedTuple),
)

theme(:bright)

p = plot(
    plot(
        [ttfx.old.n, ttfx.new.n],
        [ttfx.old.times, ttfx.new.times],
        xaxis=(:log, "Columns in Table"),
        yaxis=(:log, "Time [s]"),
        title="Time to first `.types`",
        label=["old" "new"],
    ),
    plot(
        [ttx.old.n, ttx.new.n],
        [ttx.old.times, ttx.new.times],
        xaxis=(:log, "Columns in Table"),
        yaxis=(:log, "Time [ns]"),
        title="Compiled runtime of `.types`",
        label=["old" "new"],
    ),
    layout=(1,2),
    size=(1000,500),
    # Prevent xlabel being cut off:
    margin=5mm,
    # Darker major grid lines:
    gridalpha=0.5,
)
savefig(p, "plot.png")

@bkamins
Member

bkamins commented Aug 6, 2023

especially because large number of columns are already dealt with in the package.

You mean that if one uses a large number of columns then a type-unstable table (like DataFrame) would be used anyway?

@codecov

codecov bot commented Aug 6, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (dac28d9) 94.46% compared to head (fc33a5a) 94.58%.
Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #340      +/-   ##
==========================================
+ Coverage   94.46%   94.58%   +0.11%     
==========================================
  Files           7        7              
  Lines         723      739      +16     
==========================================
+ Hits          683      699      +16     
  Misses         40       40              
Files Coverage Δ
src/Tables.jl 89.32% <100.00%> (+0.24%) ⬆️

... and 4 files with indirect coverage changes


@MilesCranmer
Contributor Author

I was just reading this in the docstring:

Encoding the names & types as type parameters allows convenient use of the type in generated functions and other optimization use-cases, but users should note that when names and/or types are the nothing value, the names and/or types are stored in the storednames and storedtypes fields. This is to account for extremely wide tables with columns in the 10s of thousands where encoding the names/types as type parameters becomes prohibitive to the compiler. So while optimizations can be written on the typed names/types type parameters, users should also consider handling the extremely wide tables by specializing on Tables.Schema{nothing, nothing}.

From which it sounds clear to me that users would not use thousands of columns in the type parameter anyways. (If they did, they would be screwed in other ways as well)
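(The split the docstring describes can be sketched with a toy type; `ToySchema` and `describe` below are illustrative stand-ins, not the actual Tables.jl definitions:)

```julia
# Toy version of the two schema flavors: narrow tables carry names/types
# as type parameters; extremely wide tables use `{nothing, nothing}` and
# store the names in a field instead.
struct ToySchema{names, types}
    storednames::Union{Nothing, Vector{Symbol}}
end

describe(::ToySchema{names, types}) where {names, types} =
    "typed schema: $names"
describe(sch::ToySchema{nothing, nothing}) =
    "wide schema: $(length(sch.storednames)) stored names"

narrow = ToySchema{(:x, :y), Tuple{Float64, Float64}}(nothing)
wide = ToySchema{nothing, nothing}([Symbol("x$(i)") for i = 1:20_000])

describe(narrow)  # "typed schema: (:x, :y)"
describe(wide)    # "wide schema: 20000 stored names"
```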

@MilesCranmer
Contributor Author

MilesCranmer commented Aug 6, 2023

If the compilation time is a big worry for you, I could do this instead:

n = fieldcount(types)
if n <= 512
    return ntuple(i -> fieldtype(types, i), Val(n))
else
    return Tuple(fieldtype(types, i) for i=1:n)
end

Since n is a compile-time constant, this would not affect the type stability for 512 or fewer columns.
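Wrapped as a standalone function (tuple_of_fieldtypes is a hypothetical name; the 512 cutoff is the one proposed above), the hybrid would look like:

```julia
# Hybrid accessor: compile-time unrolling up to 512 columns, and the old
# generator-based fallback beyond that to keep compile times bounded.
function tuple_of_fieldtypes(types::Type)
    n = fieldcount(types)
    if n <= 512
        # Type-stable whenever `types` (and hence `n`) is known at compile time.
        return ntuple(i -> fieldtype(types, i), Val(n))
    else
        # Type-unstable, but avoids generating enormous unrolled code.
        return Tuple(fieldtype(types, i) for i = 1:n)
    end
end

tuple_of_fieldtypes(Tuple{Int, Float64, String})  # (Int64, Float64, String)
```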

@MilesCranmer
Contributor Author

I added it. For more than 512 columns it falls back to the type-unstable method.

[plot: updated benchmarks with the 512-column fallback]

@MilesCranmer
Contributor Author

MilesCranmer commented Aug 6, 2023

I will be traveling for a bit, so please feel free to edit the branch directly and merge if you want.

@bkamins
Member

bkamins commented Aug 7, 2023

The decision maker is @quinnj here. I wanted to make sure we tested everything before making a decision. I agree that users probably do not use super wide tables in type-stable mode.

@MilesCranmer
Contributor Author

Thanks, no worries. In any case, with the change in 4d4c962, there are not really any regressions here, in compile time or run time, even for large tables. At worst it's about the same performance; at best you get a 100x-1000x runtime speedup, and downstream packages can do type inference.

@MilesCranmer
Contributor Author

Ping @quinnj: when you get a chance, could you take a look at potentially merging this? It would be a big help for downstream packages.

@MilesCranmer
Contributor Author

Ping @quinnj

@MilesCranmer
Contributor Author

Ping @quinnj 😅

@bkamins
Member

bkamins commented Sep 5, 2023

@quinnj is OOO for a few weeks, so we need to wait.

@MilesCranmer
Contributor Author

Gotcha, no worries!

@MilesCranmer
Contributor Author

Hey @quinnj, would you be free to take a look at merging this? It's blocking fixes for some type stability issues in MLJ. Thanks, Miles

@MilesCranmer
Contributor Author

Sending another ping. This is still a source of type instabilities throughout the MLJ ecosystem, but is a really simple fix.

@MilesCranmer
Contributor Author

Ping

@MilesCranmer
Contributor Author

Is there another way to reach @quinnj that he checks more frequently?

@bkamins
Member

bkamins commented Oct 19, 2023

I think @quinnj is off-line on all channels currently.

@bkamins merged commit b3ec016 into JuliaData:main Oct 20, 2023
8 checks passed
@MilesCranmer deleted the fix-schema-instability branch October 20, 2023 10:06
@MilesCranmer
Contributor Author

Awesome! Thanks!

@bkamins
Member

bkamins commented Oct 20, 2023

Do you need a release, or is the main branch enough for now?

@MilesCranmer
Contributor Author

A release would be necessary (to get type stability across the MLJ ecosystem).

Successfully merging this pull request may close these issues.

Type stability of Tables.matrix(::NamedTuple)