-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mutable DataFrameRow for view method that takes a function arg? #1352
Comments
|
Indeed you are right. #=
Contents: Comparison of implementations of view(tbl::DataFrame, f::Function), where f: row -> Bool.
Result: Methods 4 and 5 seem equally fastest. Choose method 4 because it uses the existing DataFrameRow type, whereas method 5 requires a new type.
=#
using DataFrames
mutable struct dfrow{T <: AbstractDataFrame} # Mutable version of DataFrameRow
df::T
rowidx::Int
end
function Base.getindex(r::dfrow, colname::Symbol)
r.df[r.rowidx, colname]
end
function view1(tbl::AbstractDataFrame, f::Function)
n = size(tbl, 1)
inds = fill(false, n)
r = dfrow(tbl, 0)
for i = 1:n
r.rowidx += 1
inds[i] = f(r)
end
view(tbl, inds)
end
function view2(tbl::AbstractDataFrame, f::Function)
inds = [f(r) for r in eachrow(tbl)]
view(tbl, inds)
end
function view3(tbl::AbstractDataFrame, f::Function)
n = size(tbl, 1)
inds = fill(false, n)
for i = 1:n
inds[i] = f(DataFrameRow(tbl, i))
end
view(tbl, inds)
end
function view4(tbl::AbstractDataFrame, f::Function)
n = size(tbl, 1)
inds = Int[]
for i = 1:n
!f(DataFrameRow(tbl, i)) && continue
push!(inds, i)
end
view(tbl, inds)
end
function view5(tbl::AbstractDataFrame, f::Function)
n = size(tbl, 1)
inds = Int[]
r = dfrow(tbl, 0)
for i = 1:n
r.rowidx += 1
!f(r) && continue
push!(inds, i)
end
view(tbl, inds)
end
function view6(tbl::AbstractDataFrame, f::Function)
inds = Int[]
for r in eachrow(tbl)
!f(r) && continue
push!(inds, r.row)
end
view(tbl, inds)
end
tbl = DataFrame(a = [i for i = 1:10_000], b = rand(10_000));
# Compile
v1 = view1(tbl, (r) -> r[:a] <= 5_000);
v2 = view2(tbl, (r) -> r[:a] <= 5_000);
v3 = view3(tbl, (r) -> r[:a] <= 5_000);
v4 = view4(tbl, (r) -> r[:a] <= 5_000);
v5 = view5(tbl, (r) -> r[:a] <= 5_000);
v6 = view6(tbl, (r) -> r[:a] <= 5_000);
# Wrap timings in function
function test1(tbl)
@time view1(tbl, (r) -> r[:a] <= 5_000);
@time view2(tbl, (r) -> r[:a] <= 5_000);
@time view3(tbl, (r) -> r[:a] <= 5_000);
@time view4(tbl, (r) -> r[:a] <= 5_000);
@time view5(tbl, (r) -> r[:a] <= 5_000);
@time view6(tbl, (r) -> r[:a] <= 5_000);
""
end
for i = 1:10
println("Trial $i #################################################")
test1(tbl)
end |
The performance difference between these approaches is negligible compared to the impact of type-instability of the columns: Closing since there's nothing to do in DataFrames for this particular issue. |
OK great. |
Hi there,
More of a question than an issue.
Often I need to filter a DataFrame based on several conditions.
To make this a little easier I've hacked together a
view
method that takes a function from aDataFrame
row toBool
:Then to use it (for example):
To get this happening I made a type similar to
DataFrameRow
, but mutable:(I should make
dfrow
follow the iterator protocol)I have found this to be faster than using the immutable
DataFrameRow
.My questions are:
DataFrameRow
s are immutable?Thanks,
Jock
The text was updated successfully, but these errors were encountered: