JuliaData · bkamins · Nov 1, 2020 · Oct 12, 2020 · Oct 12, 2020 · Oct 12, 2020
diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md
@@ -1,26 +1,34 @@
-# The Split-Apply-Combine Strategy
+# Transforming data frames
 
-Many data analysis tasks involve splitting a data set into groups, applying some
-functions to each of the groups and then combining the results. A standardized
-framework for handling this sort of computation is described in the paper
-"[The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01)",
-written by Hadley Wickham.
+Many data analysis tasks involve three steps:
+1. splitting a data set into groups,
+2. applying some functions to each of the groups,
+3. combining the results.
+
+Note that steps 1 and 3 of this general procedure can be dropped,
+in which case we just transform a data frame without grouping it first and
+combining the results afterwards.
+
+A standardized framework for handling this sort of computation is described in
+the paper "[The Split-Apply-Combine Strategy for Data
+Analysis](http://www.jstatsoft.org/v40/i01)", written by Hadley Wickham.
 
 The DataFrames package supports the split-apply-combine strategy through the
-`groupby` function followed by `combine`, `select`/`select!` or `transform`/`transform!`.
+`groupby` function that creates a `GroupedDataFrame`,
+followed by `combine`, `select`/`select!` or `transform`/`transform!`.
+
+All operations described in this section of the manual are supported both for
+`AbstractDataFrame` (when split and combine steps are skipped) and
+`GroupedDataFrame`. Technically, `AbstractDataFrame` is just considered as being
+grouped on no columns (meaning it has a single group, or zero groups if it is
+empty). The only difference is that in this case the `keepkeys` and `ungroup`
+keyword arguments (described below) are not supported and a data frame is always
+returned, as there are no split and combine steps in this case.
 
 In order to perform operations by groups you first need to create a `GroupedDataFrame`
 object from your data frame using the `groupby` function that takes two arguments:
 (1) a data frame to be grouped, and (2) a set of columns to group by.
 
-!!! note
-
-    All operations described for `GroupedDataFrame` in this section of the manual
-    are also supported for `AbstractDataFrame` in which case it is considered as
-    being grouped on no columns (meaning it has a single group, or zero groups if it is empty).
-    The only difference is that in this case the `keepkeys` and `ungroup` keyword
-    arguments are not supported and a data frame is always returned.
-
 Operations can then be applied on each group using one of the following functions:
 * `combine`: does not put restrictions on number of rows returned, the order of rows
   is specified by the order of groups in `GroupedDataFrame`; it is typically used
@@ -34,20 +42,21 @@ Operations can then be applied on each group using one of the following function
 
 All these functions take a specification of one or more functions to apply to
 each subset of the `DataFrame`. This specification can be of the following forms:
-1. standard column selectors (integers, `Symbol`s, vectors of integers, vectors of
-   `Symbol`s, vectors of strings, `:`, `All`, `Between`, `Not` and regular expressions).
+1. standard column selectors (integers, `Symbol`s, strings, vectors of integers,
+   vectors of `Symbol`s, vectors of strings,
+   `All`, `Cols`, `:`, `Between`, `Not` and regular expressions).
 2. a `cols => function` pair indicating that `function` should be called with
    positional arguments holding columns `cols`, which can be a any valid column selector;
    in this case target column name is automatically generated and it is assumed that
    `function` returns a single value or a vector; the generated name is created by
    concatenating source column name and `function` name by default (see examples below).
 3. a `cols => function => target_cols` form additionally explicitly specifying
    the target column or columns.
-4. a `col => target_cols` pair, which renames the column `col` to `target_cols` which
-   must be single column (a `Symbol` or a string).
+4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which
+   must be a single name (as a `Symbol` or a string).
 5. a `nrow` or `nrow => target_cols` form which efficiently computes the number of rows
    in a group; without `target_cols` the new column is called `:nrow`, otherwise
-   it must be single column (a `Symbol` or a string).
+   it must be a single name (as a `Symbol` or a string).
 6. vectors or matrices containing transformations specified by the `Pair` syntax
    described in points 2 to 5
 8. a function which will be called with a `SubDataFrame` corresponding to each group;
@@ -56,10 +65,10 @@ each subset of the `DataFrame`. This specification can be of the following forms
    compilation)
 
 All functions have two types of signatures. One of them takes a `GroupedDataFrame`
-as a first argument and an arbitrary number of transfomations described above
-as following arguments. The second type of signature is when `Function` or `Type`
-is passed as a first argument and `GroupedDataFrame` is the second argument
-(similar to how it is passed to `map`).
+as the first argument and an arbitrary number of transformations described above
+as following arguments. The second type of signature is when a `Function` or a `Type`
+is passed as the first argument and a `GroupedDataFrame` as the second argument
+(similar to `map`).
 
 As a special rule, with the `cols => function` and `cols => function =>
 target_cols` syntaxes, if `cols` is wrapped in an `AsTable`
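
The forms listed in the manual text above can be illustrated with a short sketch. It assumes a small made-up data frame with columns `a` and `b`; the variable names (`per_group`, `expanded`, `total`, `by_function`) are only for illustration and are not part of the patch.

```julia
using DataFrames

# Made-up example data: two groups defined by column `a`
df = DataFrame(a = [1, 1, 2, 2], b = [1, 2, 3, 4])
gd = groupby(df, :a)

# `cols => function => target_cols` plus the special `nrow` form;
# each group is reduced to one row here
per_group = combine(gd, :b => sum => :b_sum, nrow)

# `transform` keeps all rows and columns of the parent and adds the new column
expanded = transform(gd, :b => sum => :b_sum)

# No grouping: the data frame is treated as a single group,
# and the target name is generated as `b_sum`
total = combine(df, :b => sum)

# Second signature: a function first, the `GroupedDataFrame` second.
# The function receives a `SubDataFrame` for each group (the slower variant).
by_function = combine(sdf -> sum(sdf.b), gd)
```

The last call is the form the documentation advises against for performance reasons; the `Pair`-based forms are usually preferable when they can express the same computation.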

diff --git a/src/DataFrames.jl b/src/DataFrames.jl
@@ -107,6 +107,9 @@ include("abstractdataframe/join.jl")
 include("abstractdataframe/reshape.jl")
 
 include("groupeddataframe/splitapplycombine.jl")
+include("groupeddataframe/callprocessing.jl")
+include("groupeddataframe/fastaggregates.jl")
+include("groupeddataframe/complextransforms.jl")
 
 include("abstractdataframe/show.jl")
 include("groupeddataframe/show.jl")

diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl
@@ -12,16 +12,16 @@
 
 const TRANSFORMATION_COMMON_RULES =
     """
-    Below detailed common rules for all transformation functions provided in
+    Below detailed common rules for all transformation functions supported by
     DataFrames.jl are explained and compared.
 
     All operations described below are supported both for `GroupedDataFrame` and
-    `AbstractDataFrame` in which case it is considered as being grouped on no
+    `AbstractDataFrame`. In the latter case, the data frame is considered as being grouped on no
     columns (meaning it has a single group, or zero groups if it is empty). The
     only difference is that in this case the `keepkeys` and `ungroup` keyword
     arguments are not supported and a data frame is always returned.
 
-    Operations on can be applied on each group using one of the following functions:
+    Operations can be applied on each group using one of the following functions:
     * `combine`: does not put restrictions on number of rows returned, the order of rows
       is specified by the order of groups in `GroupedDataFrame`; it is typically used
       to compute summary statistics by group;
@@ -51,7 +51,8 @@ const TRANSFORMATION_COMMON_RULES =
     6. vectors or matrices containing transformations specified by the `Pair` syntax
        described in points 2 to 5
     8. a function which will be called with a `SubDataFrame` corresponding to each group;
-       this form should be avoided due to its poor performance unless a very large
+       this form should be avoided due to its poor performance unless
+       the number of groups is small or a very large
        number of columns are processed (in which case `SubDataFrame` avoids excessive
        compilation)
 
@@ -129,8 +130,10 @@ const TRANSFORMATION_COMMON_RULES =
     transformation and single column selection operations must be unique, so e.g.
     `select!(df, :a, :a => :a)` or `select!(df, :a, :a => ByRow(sin) => :a)` are not allowed.
 
-    Note that including the same column several times in the data frame via renaming
-    or transformations that return the same object without copying may create
+    As a general rule, if `copycols=true` columns are copied, and when
+    `copycols=false` columns are reused if possible. Note, however, that
+    including the same column several times in the data frame via renaming or
+    transformations that return the same object without copying may create
     column aliases even if `copycols=true`. An example of such a situation is
     `select!(df, :a, :a => :b, :a => identity => :c)`.
 
@@ -141,8 +144,8 @@ const TRANSFORMATION_COMMON_RULES =
 
     There the following keyword arguments are supported by the transformation functions
     (not all keyword arguments are supported in all cases; in general they are allowed
-    in situations when they are meaningful, see the documentation of the specific functions
-    for details):
+    in situations when they are meaningful; see the signatures of the specific functions
+    in their documentation strings for the exact information):
     - `keepkeys` : whether grouping columns should be kept in the returned data frame.
     - `ungroup` : whether the return value of the operation should be a data frame or a
       `GroupedDataFrame`.
@@ -582,7 +585,7 @@ end
     select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true)
     select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true)
 
-Mutate `df` or `gd` in place to retain only columns specified by `args...` and
+Mutate `df` or `gd` in place to retain only columns or transformations specified by `args...` and
 return it. The result is guaranteed to have the same number of rows as `df` or
 parent of `gd`, except when no columns are selected (in which case the result
 has zero rows).
@@ -615,7 +618,7 @@ end
 
 Mutate `df` or `gd` in place to add columns specified by `args...` and return it.
 The result is guaranteed to have the same number of rows as `df`.
-Equivalent to `select!(df, :, args...)` and `select!(gd, :, args...)`.
+Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`.
 
 $TRANSFORMATION_COMMON_RULES
 
@@ -784,7 +787,8 @@ Last Group (3 rows): a = 2
 │ 2   │ 2     │ 17    │ 3     │
 │ 3   │ 2     │ 17    │ 3     │
 
-julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
+# specifying a name for target column
+julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c)
 8×2 DataFrame
 │ Row │ a     │ sum_log_c │
 │     │ Int64 │ Float64   │
@@ -812,8 +816,8 @@ julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs
 │ 7   │ 1     │ 8     │ 19    │
 │ 8   │ 2     │ 4     │ 17    │
 
-julia> select(gd, :b => :b1, :c => :c1,
-              [:b, :c] => +, keepkeys=false) # multiple arguments, renaming and keepkeys
+# multiple arguments, renaming and keepkeys
+julia> select(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
 8×3 DataFrame
 │ Row │ b1    │ c1    │ b_c_+ │
 │     │ Int64 │ Int64 │ Int64 │
@@ -827,7 +831,8 @@ julia> select(gd, :b => :b1, :c => :c1,
 │ 7   │ 2     │ 7     │ 9     │
 │ 8   │ 1     │ 8     │ 9     │
 
-julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max]) # broadcasting and column expansion
+# broadcasting and column expansion
+julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
 8×4 DataFrame
 │ Row │ a     │ b     │ min   │ max   │
 │     │ Int64 │ Int64 │ Int64 │ Int64 │
@@ -875,9 +880,9 @@ end
     transform(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true,
               keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true)
 
-Create a new data frame that contains columns from `df` or `gd` and adds columns
+Create a new data frame that contains columns from `df` or `gd` plus columns
 specified by `args` and return it. The result is guaranteed to have the same
-number of rows as `df`. Equivalent to `select(df, :, args...)`.
+number of rows as `df`. Equivalent to `select(df, :, args...)` or `select(gd, :, args...)`.
 
 $TRANSFORMATION_COMMON_RULES
 
@@ -1029,7 +1034,8 @@ julia> combine(gd) do d # do syntax for the slower variant
 │ 3   │ 3     │ 10    │
 │ 4   │ 4     │ 12    │
 
-julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column
+# specifying a name for target column
+julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c)
 4×2 DataFrame
 │ Row │ a     │ sum_log_c │
 │     │ Int64 │ Float64   │
@@ -1063,8 +1069,8 @@ julia> combine(gd) do sdf # dropping group when DataFrame() is returned
 │ 5   │ 4     │ 1     │ 4     │
 │ 6   │ 4     │ 1     │ 8     │
 
-julia> combine(gd, :b => :b1, :c => :c1,
-               [:b, :c] => +, keepkeys=false) # auto-splatting, renaming and keepkeys
+# auto-splatting, renaming and keepkeys
+julia> combine(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false)
 8×3 DataFrame
 │ Row │ b1    │ c1    │ b_c_+ │
 │     │ Int64 │ Int64 │ Int64 │
@@ -1078,7 +1084,8 @@ julia> combine(gd, :b => :b1, :c => :c1,
 │ 7   │ 1     │ 4     │ 5     │
 │ 8   │ 1     │ 8     │ 9     │
 
-julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max]) # broadcasting and column expansion
+# broadcasting and column expansion
+julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max])
 8×4 DataFrame
 │ Row │ a     │ b     │ min   │ max   │
 │     │ Int64 │ Int64 │ Int64 │ Int64 │
@@ -1092,7 +1099,8 @@ julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max]) # br
 │ 7   │ 4     │ 1     │ 1     │ 4     │
 │ 8   │ 4     │ 1     │ 1     │ 8     │
 
-julia> combine(gd, [:b, :c] .=> Ref) # protecting result
+# preventing vector from being spread across multiple rows
+julia> combine(gd, [:b, :c] .=> Ref)
 4×3 DataFrame
 │ Row │ a     │ b_Ref    │ c_Ref    │
 │     │ Int64 │ SubArra… │ SubArra… │
@@ -1137,6 +1145,11 @@ function combine(arg::Base.Callable, df::AbstractDataFrame; renamecols::Bool=tru
     return combine(df, arg)
 end
 
+combine(f::Pair, gd::AbstractDataFrame; renamecols::Bool=true) =
+    throw(ArgumentError("First argument must be a transformation if the second argument is a data frame. " *
+                        "You can pass a `Pair` as a second argument of the transformation. If you want the return " *
+                        "value to be processed as having multiple columns add `=> AsTable` suffix to the pair."))
+
 manipulate(df::DataFrame, args::AbstractVector{Int}; copycols::Bool, keeprows::Bool,
            renamecols::Bool) =
     DataFrame(_columns(df)[args], Index(_names(df)[args]), copycols=copycols)
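
The general `copycols` rule added to the docstring above can be checked with a minimal sketch. It covers only plain column selection on a made-up one-column data frame; the aliasing cases mentioned in the docstring (renaming, `identity`) are deliberately left out because their behavior may depend on the DataFrames.jl version.

```julia
using DataFrames

# Made-up example data
df = DataFrame(a = [1, 2, 3])

# Default `copycols=true`: the selected column is a fresh copy
copied = select(df, :a)

# `copycols=false`: the column vector of `df` is reused if possible
reused = select(df, :a, copycols=false)

# `===` tests object identity, so this shows whether the vector is shared
is_copy  = copied.a !== df.a
is_alias = reused.a === df.a
```

With `copycols=false`, mutating `reused.a` also changes `df.a`, which is why copying is the default.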

diff --git a/src/groupeddataframe/callprocessing.jl b/src/groupeddataframe/callprocessing.jl
@@ -78,7 +78,7 @@ end
 # For more than 4 columns `map` is slower than @generated
 # but this case is probably rare and if huge number of columns is passed @generated
 # has very high compilation cost
-function do_call(f::Any, idx::AbstractVector{<:Integer},
+function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::Tuple{}, i::Integer)
     if f isa ByRow
@@ -88,43 +88,43 @@ function do_call(f::Any, idx::AbstractVector{<:Integer},
     end
 end
 
-function do_call(f::Any, idx::AbstractVector{<:Integer},
+function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer)
     idx = idx[starts[i]:ends[i]]
     return f(view(incols[1], idx))
 end
 
-function do_call(f::Any, idx::AbstractVector{<:Integer},
+function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::NTuple{2, AbstractVector}, i::Integer)
     idx = idx[starts[i]:ends[i]]
     return f(view(incols[1], idx), view(incols[2], idx))
 end
 
-function do_call(f::Any, idx::AbstractVector{<:Integer},
+function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::NTuple{3, AbstractVector}, i::Integer)
     idx = idx[starts[i]:ends[i]]
     return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx))
 end
 
-function do_call(f::Any, idx::AbstractVector{<:Integer},
+function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::NTuple{4, AbstractVector}, i::Integer)
     idx = idx[starts[i]:ends[i]]
     return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx),
              view(incols[4], idx))
 end
 
-function do_call(f::Any, idx::AbstractVector{<:Integer},
+function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::Tuple, i::Integer)
     idx = idx[starts[i]:ends[i]]
     return f(map(c -> view(c, idx), incols)...)
 end
 
-function do_call(f::Any, idx::AbstractVector{<:Integer},
+function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::NamedTuple, i::Integer)
     if f isa ByRow && isempty(incols)
@@ -135,7 +135,7 @@ function do_call(f::Any, idx::AbstractVector{<:Integer},
     end
 end
 
-function do_call(f::Any, idx::AbstractVector{<:Integer},
+function do_call(f::Base.Callable, idx::AbstractVector{<:Integer},
                  starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer},
                  gd::GroupedDataFrame, incols::Nothing, i::Integer)
     idx = idx[starts[i]:ends[i]]
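
To make the indexing pattern shared by the `do_call` methods easier to follow, here is a simplified, hypothetical stand-in. The name `call_on_group` is not part of DataFrames.jl; the sketch drops the `GroupedDataFrame` argument and the `ByRow`, `NamedTuple` and `Nothing` cases, and keeps only the generic `Tuple` path. It assumes that `idx` holds row numbers ordered by group and that `starts[i]:ends[i]` is the range of positions in `idx` belonging to group `i`.

```julia
# Hypothetical simplification of the `do_call` methods: call `f` on views of
# the input columns restricted to the rows of group `i`.
function call_on_group(f::Base.Callable, idx::AbstractVector{<:Integer},
                       starts::AbstractVector{<:Integer},
                       ends::AbstractVector{<:Integer},
                       incols::Tuple, i::Integer)
    rows = idx[starts[i]:ends[i]]                 # row numbers of group i
    return f(map(c -> view(c, rows), incols)...)  # views avoid copying column data
end

# Made-up columns and grouping: group 1 holds rows 1 and 3, group 2 rows 2 and 4
b = [10, 20, 30, 40]
c = [1, 2, 3, 4]
idx = [1, 3, 2, 4]
starts = [1, 3]
ends = [2, 4]

one_col = call_on_group(sum, idx, starts, ends, (b,), 1)
two_cols = call_on_group((x, y) -> sum(x .+ y), idx, starts, ends, (b, c), 2)
```

`Base.Callable` is `Union{Function, Type}`, so after the change from `f::Any` only functions and types dispatch to these methods; the anonymous function and `sum` used here both qualify.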