[BREAKING] Multicolumn transformations for GroupedDataFrame #2481

Merged (28 commits) on Nov 1, 2020
Changes from 9 commits
f409948
part 1 of implementation
bkamins Oct 12, 2020
208a79b
add tests of reordered GroupedDataFrame
bkamins Oct 12, 2020
610dce7
add @inbounds comment
bkamins Oct 12, 2020
96ebb09
support AsTable and multicolumn return values
bkamins Oct 13, 2020
07d7dd3
improve handling of corner cases when mixing transformation
bkamins Oct 14, 2020
0527e77
start rewriting tests
bkamins Oct 14, 2020
c38a33a
finish first round of tests
bkamins Oct 15, 2020
98bd976
update string tests
bkamins Oct 15, 2020
360aee3
update the manual entry
bkamins Oct 15, 2020
27ea3bd
Apply suggestions from code review
bkamins Oct 18, 2020
7ba3204
additional documentation fixes
bkamins Oct 18, 2020
5be5a29
documentation updates - part 2
bkamins Oct 18, 2020
bb02685
add more tests of the new functionality
bkamins Oct 19, 2020
3e05877
refactor split-apply-combine code
bkamins Oct 20, 2020
4014fb4
explain what happens when `sort=false`
bkamins Oct 28, 2020
3a50774
Merge branch 'master' into multicolumn_grouped_transform
bkamins Oct 28, 2020
928463a
Apply suggestions from code review
bkamins Oct 28, 2020
fcdf44e
make sure fun is always Base.Callable
bkamins Oct 28, 2020
fc58ee2
partial application of review comments
bkamins Oct 28, 2020
f119d51
use TransformationResult
bkamins Oct 28, 2020
4fdd34b
move includes to src/DataFrames.jl
bkamins Oct 28, 2020
2fb222d
docstring fixes - round 1
bkamins Oct 28, 2020
06c2906
test cleanup
bkamins Oct 28, 2020
98afe0e
Apply suggestions from code review
bkamins Oct 30, 2020
123bd9f
comments after the review
bkamins Oct 30, 2020
91a2108
Apply suggestions from code review
bkamins Nov 1, 2020
eca8c27
apply comments, sync docstring and manual, update NEWS.md
bkamins Nov 1, 2020
df78848
Apply suggestions from code review
bkamins Nov 1, 2020
57 changes: 33 additions & 24 deletions docs/src/man/split_apply_combine.md
@@ -1,26 +1,34 @@
# The Split-Apply-Combine Strategy
# Transforming data frames

Many data analysis tasks involve splitting a data set into groups, applying some
functions to each of the groups and then combining the results. A standardized
framework for handling this sort of computation is described in the paper
"[The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01)",
written by Hadley Wickham.
Many data analysis tasks involve three steps:
1. splitting a data set into groups,
2. applying some functions to each of the groups,
3. combining the results.

Note that steps 1 and 3 of this general procedure can be dropped,
in which case we just transform a data frame without grouping it first
or combining the results afterwards.

A standardized framework for handling this sort of computation is described in
the paper "[The Split-Apply-Combine Strategy for Data
Analysis](http://www.jstatsoft.org/v40/i01)", written by Hadley Wickham.

The DataFrames package supports the split-apply-combine strategy through the
`groupby` function followed by `combine`, `select`/`select!` or `transform`/`transform!`.
`groupby` function that creates a `GroupedDataFrame`,
followed by `combine`, `select`/`select!` or `transform`/`transform!`.

All operations described in this section of the manual are supported both for
`AbstractDataFrame` (when split and combine steps are skipped) and
`GroupedDataFrame`. Technically, `AbstractDataFrame` is just considered as being
grouped on no columns (meaning it has a single group, or zero groups if it is
empty). The only difference is that in this case the `keepkeys` and `ungroup`
keyword arguments (described below) are not supported and a data frame is always
returned, as there are no split and combine steps in this case.

In order to perform operations by groups you first need to create a `GroupedDataFrame`
object from your data frame using the `groupby` function that takes two arguments:
(1) a data frame to be grouped, and (2) a set of columns to group by.
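
The workflow described above can be sketched as follows. This is a minimal illustration, not taken from the PR; the data frame `df` and the column names `a`, `b`, `b_sum` and `b_share` are made up for the example.

```julia
using DataFrames

# Illustrative data (not from the PR): `a` is the grouping column.
df = DataFrame(a = [1, 2, 1, 2], b = [1, 2, 3, 4])
gd = groupby(df, :a)   # a GroupedDataFrame with two groups

# combine: here one summary row per group
totals = combine(gd, :b => sum => :b_sum)

# select: same number and order of rows as `df`, keeps only the
# grouping column and the requested columns
shares = select(gd, :b => (x -> x ./ sum(x)) => :b_share)

# transform: like select, but all columns of `df` are kept
wide = transform(gd, :b => (x -> x ./ sum(x)) => :b_share)
```

Here `totals` has one row per group, while `shares` and `wide` have as many rows as `df`.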

!!! note

All operations described for `GroupedDataFrame` in this section of the manual
are also supported for `AbstractDataFrame` in which case it is considered as
being grouped on no columns (meaning it has a single group, or zero groups if it is empty).
The only difference is that in this case the `keepkeys` and `ungroup` keyword
arguments are not supported and a data frame is always returned.

Operations can then be applied on each group using one of the following functions:
* `combine`: does not put restrictions on number of rows returned, the order of rows
is specified by the order of groups in `GroupedDataFrame`; it is typically used
@@ -34,20 +42,21 @@ Operations can then be applied on each group using one of the following function

All these functions take a specification of one or more functions to apply to
each subset of the `DataFrame`. This specification can be of the following forms:
1. standard column selectors (integers, `Symbol`s, vectors of integers, vectors of
`Symbol`s, vectors of strings, `:`, `All`, `Between`, `Not` and regular expressions).
1. standard column selectors (integers, `Symbol`s, strings, vectors of integers,
vectors of `Symbol`s, vectors of strings,
`All`, `Cols`, `:`, `Between`, `Not` and regular expressions)
2. a `cols => function` pair indicating that `function` should be called with
positional arguments holding columns `cols`, which can be any valid column selector;
in this case target column name is automatically generated and it is assumed that
`function` returns a single value or a vector; the generated name is created by
concatenating source column name and `function` name by default (see examples below).
3. a `cols => function => target_cols` form additionally explicitly specifying
the target column or columns.
4. a `col => target_cols` pair, which renames the column `col` to `target_cols` which
must be single column (a `Symbol` or a string).
4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
must be a single name (as a `Symbol` or a string).
5. a `nrow` or `nrow => target_cols` form which efficiently computes the number of rows
in a group; without `target_cols` the new column is called `:nrow`, otherwise
it must be single column (a `Symbol` or a string).
it must be a single name (as a `Symbol` or a string).
6. vectors or matrices containing transformations specified by the `Pair` syntax
described in points 2 to 5
8. a function which will be called with a `SubDataFrame` corresponding to each group;
@@ -56,10 +65,10 @@ each subset of the `DataFrame`. This specification can be of the following forms
compilation)
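
The specification forms listed above might look like this in practice. The snippet is a sketch on made-up data; the names `g`, `x`, `y`, `xy`, `x_renamed` and `n` are assumptions chosen for the example.

```julia
using DataFrames

# Illustrative data (not from the PR)
df = DataFrame(g = [1, 1, 2, 2], x = [1, 2, 3, 4], y = [10, 20, 30, 40])
gd = groupby(df, :g)

# form 2: `cols => function`, target name is generated as `x_sum`
r2 = combine(gd, :x => sum)

# form 3: `cols => function => target_cols`, the function gets two columns
r3 = combine(gd, [:x, :y] => ((x, y) -> sum(x .* y)) => :xy)

# form 4: `col => target_cols`, a plain rename
r4 = select(gd, :x => :x_renamed)

# form 5: `nrow => target_cols`, number of rows per group
r5 = combine(gd, nrow => :n)

# form 6: a vector of pairs, built here by broadcasting `=>`
r6 = combine(gd, [:x, :y] .=> sum)
```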

All functions have two types of signatures. One of them takes a `GroupedDataFrame`
as a first argument and an arbitrary number of transfomations described above
as following arguments. The second type of signature is when `Function` or `Type`
is passed as a first argument and `GroupedDataFrame` is the second argument
(similar to how it is passed to `map`).
as the first argument and an arbitrary number of transformations described above
as following arguments. The second type of signature is when a `Function` or a `Type`
is passed as the first argument and a `GroupedDataFrame` as the second argument
(similar to `map`).
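
The two signature types can be compared side by side. The example below is a sketch on made-up data (`g`, `x` and `x_max` are illustrative names); the function-first form is written with `do` syntax and receives a `SubDataFrame` for each group.

```julia
using DataFrames

# Illustrative data (not from the PR)
df = DataFrame(g = [1, 1, 2, 2], x = [1, 2, 3, 4])
gd = groupby(df, :g)

# Signature 1: GroupedDataFrame first, transformations after it
by_pairs = combine(gd, :x => maximum => :x_max)

# Signature 2: function first, GroupedDataFrame second (via `do`);
# returning a NamedTuple names the resulting column
by_fun = combine(gd) do sdf
    (x_max = maximum(sdf.x),)
end
```

Both calls produce the same data frame here; the first form is usually preferable because, as noted above, the `SubDataFrame` form tends to be slower.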

As a special rule, with the `cols => function` and `cols => function =>
target_cols` syntaxes, if `cols` is wrapped in an `AsTable`
3 changes: 3 additions & 0 deletions src/DataFrames.jl
@@ -107,6 +107,9 @@ include("abstractdataframe/join.jl")
include("abstractdataframe/reshape.jl")

include("groupeddataframe/splitapplycombine.jl")
include("groupeddataframe/callprocessing.jl")
include("groupeddataframe/fastaggregates.jl")
include("groupeddataframe/complextransforms.jl")

include("abstractdataframe/show.jl")
include("groupeddataframe/show.jl")
55 changes: 34 additions & 21 deletions src/abstractdataframe/selection.jl
@@ -12,16 +12,16 @@

const TRANSFORMATION_COMMON_RULES =
"""
Below detailed common rules for all transformation functions provided in
Below detailed common rules for all transformation functions supported by
DataFrames.jl are explained and compared.

All operations described below are supported both for `GroupedDataFrame` and
`AbstractDataFrame` in which case it is considered as being grouped on no
`AbstractDataFrame`. In the latter case, the data frame is considered as being grouped on no
columns (meaning it has a single group, or zero groups if it is empty). The
only difference is that in this case the `keepkeys` and `ungroup` keyword
arguments are not supported and a data frame is always returned.

Operations on can be applied on each group using one of the following functions:
Operations can be applied on each group using one of the following functions:
* `combine`: does not put restrictions on number of rows returned, the order of rows
is specified by the order of groups in `GroupedDataFrame`; it is typically used
to compute summary statistics by group;
Expand Down Expand Up @@ -51,7 +51,8 @@ const TRANSFORMATION_COMMON_RULES =
6. vectors or matrices containing transformations specified by the `Pair` syntax
described in points 2 to 5
8. a function which will be called with a `SubDataFrame` corresponding to each group;
this form should be avoided due to its poor performance unless a very large
this form should be avoided due to its poor performance unless
the number of groups is small or a very large
number of columns are processed (in which case `SubDataFrame` avoids excessive
compilation)

Expand Down Expand Up @@ -129,8 +130,10 @@ const TRANSFORMATION_COMMON_RULES =
transformation and single column selection operations must be unique, so e.g.
`select!(df, :a, :a => :a)` or `select!(df, :a, :a => ByRow(sin) => :a)` are not allowed.

Note that including the same column several times in the data frame via renaming
or transformations that return the same object without copying may create
As a general rule if `copycols=true` columns are copied and when
`copycols=false` columns are reused if possible. Note, however, that
including the same column several times in the data frame via renaming or
transformations that return the same object without copying may create
column aliases even if `copycols=true`. An example of such a situation is
`select!(df, :a, :a => :b, :a => identity => :c)`.

@@ -141,8 +144,8 @@ const TRANSFORMATION_COMMON_RULES =

The following keyword arguments are supported by the transformation functions
(not all keyword arguments are supported in all cases; in general they are allowed
in situations when they are meaningful, see the documentation of the specific functions
for details):
in situations when they are meaningful, see the signatures of the specific functions
in the documentation strings to get the exact information):
- `keepkeys` : whether grouping columns should be kept in the returned data frame.
- `ungroup` : whether the return value of the operation should be a data frame or a
`GroupedDataFrame`.
Expand Down Expand Up @@ -582,7 +585,7 @@ end
select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)

Mutate `df` or `gd` in place to retain only columns specified by `args...` and
Mutate `df` or `gd` in place to retain only columns or transformations specified by `args...` and
return it. The result is guaranteed to have the same number of rows as `df` or
parent of `gd`, except when no columns are selected (in which case the result
has zero rows).
Expand Down Expand Up @@ -615,7 +618,7 @@ end

Mutate `df` or `gd` in place to add columns specified by `args...` and return it.
The result is guaranteed to have the same number of rows as `df`.
Equivalent to `select!(df, :, args...)` and `select!(gd, :, args...)`.
Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`.

$TRANSFORMATION_COMMON_RULES

Expand Down Expand Up @@ -784,7 +787,8 @@ Last Group (3 rows): a = 2
│ 2 │ 2 │ 17 │ 3 │
│ 3 │ 2 │ 17 │ 3 │

julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
# specifying a name for target column
julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c)
8×2 DataFrame
│ Row │ a │ sum_log_c │
│ │ Int64 │ Float64 │
Expand Down Expand Up @@ -812,8 +816,8 @@ julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
│ 7 │ 1 │ 8 │ 19 │
│ 8 │ 2 │ 4 │ 17 │

julia> select(gd, :b => :b1, :c => :c1,
[:b, :c] => +, keepkeys=false) # multiple arguments, renaming and keepkeys
# multiple arguments, renaming and keepkeys
julia> select(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
│ Row │ b1 │ c1 │ b_c_+ │
│ │ Int64 │ Int64 │ Int64 │
@@ -827,7 +831,8 @@ julia> select(gd, :b => :b1, :c => :c1,
│ 7 │ 2 │ 7 │ 9 │
│ 8 │ 1 │ 8 │ 9 │

julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max]) # broadcasting and column expansion
# broadcasting and column expansion
julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
│ Row │ a │ b │ min │ max │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
Expand Down Expand Up @@ -875,9 +880,9 @@ end
transform(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true,
keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)

Create a new data frame that contains columns from `df` or `gd` and adds columns
Create a new data frame that contains columns from `df` or `gd` plus columns
specified by `args` and return it. The result is guaranteed to have the same
number of rows as `df`. Equivalent to `select(df, :, args...)`.
number of rows as `df`. Equivalent to `select(df, :, args...)` or `select(gd, :, args...)`.

$TRANSFORMATION_COMMON_RULES

Expand Down Expand Up @@ -1029,7 +1034,8 @@ julia> combine(gd) do d # do syntax for the slower variant
│ 3 │ 3 │ 10 │
│ 4 │ 4 │ 12 │

julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
# specifying a name for target column
julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c)
4×2 DataFrame
│ Row │ a │ sum_log_c │
│ │ Int64 │ Float64 │
Expand Down Expand Up @@ -1063,8 +1069,8 @@ julia> combine(gd) do sdf # dropping group when DataFrame() is returned
│ 5 │ 4 │ 1 │ 4 │
│ 6 │ 4 │ 1 │ 8 │

julia> combine(gd, :b => :b1, :c => :c1,
[:b, :c] => +, keepkeys=false) # auto-splatting, renaming and keepkeys
# auto-splatting, renaming and keepkeys
julia> combine(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
8×3 DataFrame
│ Row │ b1 │ c1 │ b_c_+ │
│ │ Int64 │ Int64 │ Int64 │
@@ -1078,7 +1084,8 @@ julia> combine(gd, :b => :b1, :c => :c1,
│ 7 │ 1 │ 4 │ 5 │
│ 8 │ 1 │ 8 │ 9 │

julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max]) # broadcasting and column expansion
# broadcasting and column expansion
julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
8×4 DataFrame
│ Row │ a │ b │ min │ max │
│ │ Int64 │ Int64 │ Int64 │ Int64 │
@@ -1092,7 +1099,8 @@ julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max]) # br
│ 7 │ 4 │ 1 │ 1 │ 4 │
│ 8 │ 4 │ 1 │ 1 │ 8 │

julia> combine(gd, [:b, :c] .=> Ref) # protecting result
# preventing vector from being spread across multiple rows
julia> combine(gd, [:b, :c] .=> Ref)
4×3 DataFrame
│ Row │ a │ b_Ref │ c_Ref │
│ │ Int64 │ SubArra… │ SubArra… │
Expand Down Expand Up @@ -1137,6 +1145,11 @@ function combine(arg::Base.Callable, df::AbstractDataFrame; renamecols::Bool=tru
return combine(df, arg)
end

combine(f::Pair, gd::AbstractDataFrame; renamecols::Bool=true) =
throw(ArgumentError("First argument must be a transformation if the second argument is a data frame. " *
"You can pass a `Pair` as a second argument of the transformation. If you want the return " *
"value to be processed as having multiple columns add `=> AsTable` suffix to the pair."))

manipulate(df::DataFrame, args::AbstractVector{Int}; copycols::Bool, keeprows::Bool,
renamecols::Bool) =
DataFrame(_columns(df)[args], Index(_names(df)[args]), copycols=copycols)
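
The effect of the new `combine(f::Pair, gd::AbstractDataFrame)` method can be illustrated as follows. This is a sketch that assumes a DataFrames.jl version containing the method shown above; the data frame and the column name `total` are made up.

```julia
using DataFrames

# Illustrative data (not from the PR)
df = DataFrame(x = [1, 2, 3])

# Supported order: data frame first, then the `Pair` transformation
ok = combine(df, :x => sum => :total)

# A `Pair` in the first position is expected to hit the method above
# and raise an ArgumentError rather than a MethodError
threw = try
    combine(:x => sum => :total, df)
    false
catch err
    err isa ArgumentError
end
```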
16 changes: 8 additions & 8 deletions src/groupeddataframe/callprocessing.jl
@@ -78,7 +78,7 @@ end
# For more than 4 columns `map` is slower than @generated
# but this case is probably rare and if a huge number of columns is passed @generated
# has a very high compilation cost
function do_call(f::Any, idx::AbstractVector{<:Integer},
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::Tuple{}, i::Integer)
if f isa ByRow
@@ -88,43 +88,43 @@ function do_call(f::Any, idx::AbstractVector{<:Integer},
end
end

function do_call(f::Any, idx::AbstractVector{<:Integer},
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer)
idx = idx[starts[i]:ends[i]]
return f(view(incols[1], idx))
end

function do_call(f::Any, idx::AbstractVector{<:Integer},
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::NTuple{2, AbstractVector}, i::Integer)
idx = idx[starts[i]:ends[i]]
return f(view(incols[1], idx), view(incols[2], idx))
end

function do_call(f::Any, idx::AbstractVector{<:Integer},
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::NTuple{3, AbstractVector}, i::Integer)
idx = idx[starts[i]:ends[i]]
return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx))
end

function do_call(f::Any, idx::AbstractVector{<:Integer},
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::NTuple{4, AbstractVector}, i::Integer)
idx = idx[starts[i]:ends[i]]
return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx),
view(incols[4], idx))
end

function do_call(f::Any, idx::AbstractVector{<:Integer},
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::Tuple, i::Integer)
idx = idx[starts[i]:ends[i]]
return f(map(c -> view(c, idx), incols)...)
end

function do_call(f::Any, idx::AbstractVector{<:Integer},
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::NamedTuple, i::Integer)
if f isa ByRow && isempty(incols)
@@ -135,7 +135,7 @@ function do_call(f::Any, idx::AbstractVector{<:Integer},
end
end

function do_call(f::Any, idx::AbstractVector{<:Integer},
function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
gd::GroupedDataFrame, incols::Nothing, i::Integer)
idx = idx[starts[i]:ends[i]]
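
The indexing pattern shared by the `do_call` methods can be tried in isolation. The helper below is hypothetical (the name `call_on_group` and the sample vectors are not part of DataFrames.jl) and only mirrors the generic `incols::Tuple` method; it assumes, as the code above suggests, that `idx` lists row numbers so that each group is contiguous and that `starts[i]:ends[i]` is the range of group `i` within `idx`.

```julia
# Hypothetical stand-alone helper mirroring the generic `do_call` method;
# it is not part of DataFrames.jl.
function call_on_group(f, idx, starts, ends, incols::Tuple, i)
    rows = idx[starts[i]:ends[i]]                  # row numbers of group i
    return f(map(c -> view(c, rows), incols)...)   # one view per input column
end

b = [1, 2, 3, 4]
c = [10, 20, 30, 40]
idx = [1, 3, 2, 4]   # rows reordered so that each group is contiguous
starts = [1, 3]      # group 1 is idx[1:2], group 2 is idx[3:4]
ends = [2, 4]

total(x, y) = sum(x) + sum(y)
g1 = call_on_group(total, idx, starts, ends, (b, c), 1)
g2 = call_on_group(total, idx, starts, ends, (b, c), 2)
```

The function receives views rather than copies of the columns, so only the `rows` index vector is allocated for each group.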