performance of median and iqr compared to python libraries #3462

Open
lampretl opened this issue Sep 5, 2024 · 5 comments

Comments

@lampretl

lampretl commented Sep 5, 2024

I'd like to compute the median (q_0.5) and the IQR (q_0.75 - q_0.25) of each column of a dataframe, efficiently and in parallel. Let's compare the three most widely used libraries:

pandas:

import numpy as np, pandas as pd, scipy
n,m=10**8,10;   df = pd.DataFrame(np.random.rand(n,m))
%time df.median(axis=0)
%time df.quantile(0.5)
%time df.quantile(0.75)-df.quantile(0.25)
%time scipy.stats.iqr(df,axis=0)
CPU times: user 23.4 s, sys: 921 ms, total: 24.4 s
Wall time: 24.4 s
CPU times: user 20.3 s, sys: 830 ms, total: 21.1 s
Wall time: 21.2 s
CPU times: user 39.9 s, sys: 1.71 s, total: 41.6 s
Wall time: 41.6 s
CPU times: user 25.6 s, sys: 5.28 s, total: 30.9 s
Wall time: 31 s

polars:

import numpy as np, polars as pl
n,m=10**8,10;   df = pl.DataFrame(np.random.rand(n,m), schema=[f"x{k}" for k in range(m)])
%time df.median()
%time df.quantile(0.75,interpolation='linear')
%time df.quantile(0.75,interpolation='linear') - df.quantile(0.25,interpolation='linear')
CPU times: user 21.4 s, sys: 3.51 s, total: 24.9 s
Wall time: 2.95 s
CPU times: user 19.2 s, sys: 3.86 s, total: 23.1 s
Wall time: 2.95 s
CPU times: user 43.8 s, sys: 11.4 s, total: 55.2 s
Wall time: 6.44 s

DataFrames.jl + Julia:

using DataFrames, StatsBase
n,m=10^8,10;   df = DataFrame(rand(n,m), :auto);
function f1(df::DataFrame) ::Vector{Float64}  return map(median, eachcol(df)) end
function f2(df::DataFrame) ::Vector{Float64}  return map(iqr, eachcol(df)) end
function f3(df::DataFrame) ::Vector{Float64}  m=size(df,2);  res=fill(NaN,m);  Threads.@threads for j=1:m res[j] = median(df[:,j]) end; return res end
function f4(df::DataFrame) ::Vector{Float64}  m=size(df,2);  res=fill(NaN,m);  Threads.@threads for j=1:m res[j] = iqr(df[:,j]) end; return res end
@time f1(df);
@time f2(df);
@time f3(df);
@time f4(df);
14.686185 seconds (53 allocations: 14.901 GiB, 4.56% gc time)
86.758428 seconds (53 allocations: 7.451 GiB, 0.36% gc time)
8.259288 seconds (146 allocations: 22.352 GiB, 9.15% gc time)
50.395623 seconds (144 allocations: 14.901 GiB, 0.47% gc time)

Is there a better, more efficient way to compute medians and IQRs in Julia?

@stensmo

stensmo commented Sep 6, 2024

This has nothing to do with DataFrames. You want a median algorithm that runs in O(n), i.e. a selection algorithm built on the median-of-medians concept. The implementation you are using from Statistics.jl does not seem to be O(n), but I could be wrong.
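To illustrate what selecting an order statistic without a full sort looks like, here is a minimal sketch in plain Python. It is not what Statistics.jl, pandas, or polars do internally. It uses randomized quickselect, which is O(n) only in expectation, rather than the worst-case-linear median-of-medians variant. The names `quickselect`, `quantile`, and `median_and_iqr` are hypothetical helpers made up for this example, and the input is assumed to be non-empty and free of NaN.

```python
import math
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in expected O(n) time.

    Assumes `values` is non-empty, contains no NaN, and 0 <= k < len(values).
    """
    items = list(values)  # work on a copy so the caller's data is untouched
    while True:
        pivot = random.choice(items)
        smaller = [v for v in items if v < pivot]
        equal_count = sum(1 for v in items if v == pivot)
        if k < len(smaller):
            # The target lies among the elements below the pivot.
            items = smaller
        elif k < len(smaller) + equal_count:
            # The target is the pivot itself (or a duplicate of it).
            return pivot
        else:
            # The target lies above the pivot; shift the rank accordingly.
            k -= len(smaller) + equal_count
            items = [v for v in items if v > pivot]


def quantile(values, q):
    """Linearly interpolated quantile at sorted position q * (n - 1)."""
    pos = q * (len(values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    lower = quickselect(values, lo)
    if hi == lo:
        return lower
    upper = quickselect(values, hi)
    return lower + (pos - lo) * (upper - lower)


def median_and_iqr(values):
    """Return (median, IQR) where IQR = q_0.75 - q_0.25."""
    return quantile(values, 0.5), quantile(values, 0.75) - quantile(values, 0.25)
```

Calling `median_and_iqr(column)` once per column gives the two statistics discussed in this issue. A production version would partition in place and reuse the partially ordered array across the three quantiles instead of restarting from a fresh copy each time.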

@lampretl
Author

lampretl commented Sep 6, 2024

@stensmo I was hoping for a function from DataFrames.jl that would be comparable in performance to the polars one. For a new user migrating from Python to Julia, what is the equivalent or recommended way of obtaining quantiles?

@stensmo

stensmo commented Sep 6, 2024

In Julia DataFrames, you can apply (almost) any function, including your own. The median function does not belong to DataFrames; it is a standard function in Julia. Writing your own functions and applying them to a DataFrame is super easy in Julia. That is why you will love it, but it takes some time to get used to. You can apply the standard Julia median function to a DataFrame, use a superfast implementation that someone else has written, or write one yourself.

@bkamins
Member

bkamins commented Sep 6, 2024

@nalimilan - this issue should be migrated to Statistics.jl but I do not have privileges to do so. Could you please do it? Thank you!

@nalimilan
Member

nalimilan commented Sep 6, 2024

I can't either. Apparently that's only possible between repos of the same org.

Anyway it seems this is already a known problem, and you had even made a PR for it? JuliaStats/Statistics.jl#91

EDIT: specifically, computing the IQR seems very similar to JuliaStats/Statistics.jl#84
