diff --git a/NEWS.md b/NEWS.md index 55a7340090..659ebbf119 100644 --- a/NEWS.md +++ b/NEWS.md @@ -5,7 +5,8 @@ * the rules for transformations passed to `select`/`select!`, `transform`/`transform!`, and `combine` have been made more flexible; in particular now it is allowed to return multiple columns from a transformation function - [#2461](https://github.com/JuliaData/DataFrames.jl/pull/2461) + ([#2461](https://github.com/JuliaData/DataFrames.jl/pull/2461) and + [#2481](https://github.com/JuliaData/DataFrames.jl/pull/2481)) * CategoricalArrays.jl is no longer reexported: call `using CategoricalArrays` to use it [#2404]((https://github.com/JuliaData/DataFrames.jl/pull/2404)). In the same vein, the `categorical` and `categorical!` functions diff --git a/docs/src/man/split_apply_combine.md b/docs/src/man/split_apply_combine.md index 3bdc94bf62..e7a129009f 100644 --- a/docs/src/man/split_apply_combine.md +++ b/docs/src/man/split_apply_combine.md @@ -1,13 +1,29 @@ # The Split-Apply-Combine Strategy -Many data analysis tasks involve splitting a data set into groups, applying some -functions to each of the groups and then combining the results. A standardized -framework for handling this sort of computation is described in the paper -"[The Split-Apply-Combine Strategy for Data Analysis](http://www.jstatsoft.org/v40/i01)", -written by Hadley Wickham. +Many data analysis tasks involve three steps: +1. splitting a data set into groups, +2. applying some functions to each of the groups, +3. combining the results. + +Note that any of the steps 1 and 3 of this general procedure can be dropped, +in which case we just transform a data frame without grouping it and later +combining the result. + +A standardized framework for handling this sort of computation is described in +the paper "[The Split-Apply-Combine Strategy for Data +Analysis](http://www.jstatsoft.org/v40/i01)", written by Hadley Wickham. 
The DataFrames package supports the split-apply-combine strategy through the -`groupby` function followed by `combine`, `select`/`select!` or `transform`/`transform!`. +`groupby` function that creates a `GroupedDataFrame`, +followed by `combine`, `select`/`select!` or `transform`/`transform!`. + +All operations described in this section of the manual are supported both for +`AbstractDataFrame` (when split and combine steps are skipped) and +`GroupedDataFrame`. Technically, `AbstractDataFrame` is just considered as being +grouped on no columns (meaning it has a single group, or zero groups if it is +empty). The only difference is that in this case the `keepkeys` and `ungroup` +keyword arguments (described below) are not supported and a data frame is always +returned, as there are no split and combine steps in this case. In order to perform operations by groups you first need to create a `GroupedDataFrame` object from your data frame using the `groupby` function that takes two arguments: @@ -26,59 +42,107 @@ Operations can then be applied on each group using one of the following function All these functions take a specification of one or more functions to apply to each subset of the `DataFrame`. This specification can be of the following forms: -1. standard column selectors (integers, symbols, vectors of integers, vectors of symbols, +1. standard column selectors (integers, `Symbol`s, strings, vectors of integers, + vectors of `Symbol`s, vectors of strings, `All`, `Cols`, `:`, `Between`, `Not` and regular expressions) 2. a `cols => function` pair indicating that `function` should be called with - positional arguments holding columns `cols`, which can be a any valid column selector -3. a `cols => function => target_col` form additionally - specifying the name of the target column (this assumes that `function` returns a single - value or a vector) -4. a `col => target_col` pair, which renames the column `col` to `target_col` -5. 
a `nrow` or `nrow => target_col` form which efficiently computes the number of rows - in a group (without `target_col` the new column is called `:nrow`) -6. several arguments of the forms given above, or vectors thereof -7. a function which will be called with a `SubDataFrame` corresponding to each group; + positional arguments holding columns `cols`, which can be any valid column selector; + in this case target column name is automatically generated and it is assumed that + `function` returns a single value or a vector; the generated name is created by + concatenating source column name and `function` name by default (see examples below). +3. a `cols => function => target_cols` form additionally explicitly specifying + the target column or columns. +4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which + must be a single name (as a `Symbol` or a string). +5. a `nrow` or `nrow => target_cols` form which efficiently computes the number of rows + in a group; without `target_cols` the new column is called `:nrow`, otherwise + it must be a single name (as a `Symbol` or a string). +6. vectors or matrices containing transformations specified by the `Pair` syntax + described in points 2 to 5 +7. a function which will be called with a `SubDataFrame` corresponding to each group; this form should be avoided due to its poor performance unless a very large number of columns are processed (in which case `SubDataFrame` avoids excessive compilation) -As a special rule that applies to `cols => function` syntax, if `cols` is wrapped -in an `AsTable` object then a `NamedTuple` containing columns selected by `cols` is -passed to `function`. - -In all of these cases, `function` can return either a single row or multiple rows. -`function` can always generate a single column by returning a single value or a vector. 
-Additionally, if `combine` is passed exactly one `function`, `cols => function`, -or `cols => function => outcol` as a first argument -and `target_col` is not specified, -`function` can return multiple columns in the form of an `AbstractDataFrame`, -`AbstractMatrix`, `NamedTuple` or `DataFrameRow`. +All functions have two types of signatures. One of them takes a `GroupedDataFrame` +as the first argument and an arbitrary number of transformations described above +as following arguments. The second type of signature is when a `Function` or a `Type` +is passed as the first argument and a `GroupedDataFrame` as the second argument +(similar to `map`). + +As a special rule, with the `cols => function` and `cols => function => +target_cols` syntaxes, if `cols` is wrapped in an `AsTable` +object then a `NamedTuple` containing columns selected by `cols` is passed to +`function`. + +What is allowed for `function` to return is determined by the `target_cols` value: +1. If both `cols` and `target_cols` are omitted (so only a `function` is passed), + then returning a data frame, a matrix, a `NamedTuple`, or a `DataFrameRow` will + produce multiple columns in the result. Returning any other value produces + a single column. +2. If `target_cols` is a `Symbol` or a string then the function is assumed to return + a single column. In this case returning a data frame, a matrix, a `NamedTuple`, + or a `DataFrameRow` raises an error. +3. If `target_cols` is a vector of `Symbol`s or strings or `AsTable` it is assumed + that `function` returns multiple columns. + If `function` returns one of `AbstractDataFrame`, `NamedTuple`, `DataFrameRow`, + `AbstractMatrix` then rules described in point 1 above apply. + If `function` returns an `AbstractVector` then each element of this vector must + support the `keys` function, which must return a collection of `Symbol`s, strings + or integers; the return value of `keys` must be identical for all elements. 
+ Then as many columns are created as there are elements in the return value + of the `keys` function. If `target_cols` is `AsTable` then their names + are set to be equal to the key names except if `keys` returns integers, in + which case they are prefixed by `x` (so the column names are e.g. `x1`, + `x2`, ...). If `target_cols` is a vector of `Symbol`s or strings then + column names produced using the rules above are ignored and replaced by + `target_cols` (the number of columns must be the same as the length of + `target_cols` in this case). + If `function` returns a value of any other type then it is assumed that it is a + table conforming to the Tables.jl API and the `Tables.columntable` function + is called on it to get the resulting columns and their names. The names are + retained when `target_cols` is `AsTable` and are replaced if + `target_cols` is a vector of `Symbol`s or strings. + +In all of these cases, `function` can return either a single row or multiple +rows. As a particular rule, values wrapped in a `Ref` or a `0`-dimensional +`AbstractArray` are unwrapped and then treated as a single row. `select`/`select!` and `transform`/`transform!` always return a `DataFrame` -with the same number of rows as the source. -For `combine`, the shape of the resulting `DataFrame` is determined -according to the following rules: -- a single value produces a single row and column per group -- a named tuple or `DataFrameRow` produces a single row and one column per field -- a vector produces a single column with one row per entry -- a named tuple of vectors produces one column per field with one row per entry in the vectors -- a `DataFrame` or a matrix produces as many rows and columns as it contains; - note that this option should be avoided due to its poor performance when the number - of groups is large - -The kind of return value and the number and names of columns must be the same for all groups. 
+with the same number and order of rows as the source (even if `GroupedDataFrame` +had its groups reordered). + +For `combine`, rows in the returned object appear in the order of groups in the +`GroupedDataFrame`. The functions can return an arbitrary number of rows for +each group, but the kind of returned object and the number and names of columns +must be the same for all groups, except when a `DataFrame()` or `NamedTuple()` +is returned, in which case a given group is skipped. It is allowed to mix single values and vectors if multiple transformations -are requested. In this case single value will be broadcasted to match the length +are requested. In this case single value will be repeated to match the length of columns specified by returned vectors. -As a particular rule, values wrapped in a `Ref` or a `0`-dimensional `AbstractArray` -are unwrapped and then broadcasted. - -If a single value or a vector is returned by the `function` and `target_col` is not -provided, it is generated automatically, by concatenating source column name and -`function` name where possible (see examples below). -We show several examples of the `by` function applied to the `iris` dataset below: +To apply `function` to each row instead of whole columns, it can be wrapped in a +`ByRow` struct. `cols` can be any column indexing syntax, in which case +`function` will be passed one argument for each of the columns specified by +`cols` or a `NamedTuple` of them if specified columns are wrapped in `AsTable`. +If `ByRow` is used it is allowed for `cols` to select an empty set of columns, +in which case `function` is called for each row without any arguments and an +empty `NamedTuple` is passed if empty set of columns is wrapped in `AsTable`. 
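The `ByRow` rules above can be illustrated with a minimal sketch. The data frame `df` and the column names `a`, `b`, `sum` and `prod` are made up for illustration and are not part of this change; the sketch assumes a DataFrames.jl version that includes the behavior described here.

```julia
using DataFrames

# Hypothetical toy data, used only for this illustration.
df = DataFrame(a = 1:3, b = 4:6)

# Plain column selector: the wrapped function receives one positional
# argument per selected column and is called once for every row.
by_args = select(df, [:a, :b] => ByRow((a, b) -> a + b) => :sum)

# `AsTable` selector: the wrapped function receives each row as a
# `NamedTuple` whose fields are the selected columns.
by_table = select(df, AsTable([:a, :b]) => ByRow(r -> r.a * r.b) => :prod)
```

Without `ByRow`, the same functions would be called once with whole column vectors, so `(a, b) -> a + b` would have to be written with broadcasting (`a .+ b`) to give the same result.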
+The following keyword arguments are supported by the transformation functions +(not all keyword arguments are supported in all cases; in general they are allowed +in situations when they are meaningful, see the documentation of the specific functions +for details): +- `keepkeys` : whether grouping columns should be kept in the returned data frame. +- `ungroup` : whether the return value of the operation should be a data frame or a + `GroupedDataFrame`. +- `copycols` : whether columns of the source data frame should be copied if no + transformation is applied to them. +- `renamecols` : whether in the `cols => function` form automatically generated + column names should include the name of transformation functions or not. + +We show several examples of these functions applied to the `iris` dataset below: ```jldoctest sac julia> using DataFrames, CSV, Statistics @@ -176,8 +240,8 @@ julia> combine(gdf, nrow, :PetalLength => mean => :mean) │ 2 │ Iris-versicolor │ 50 │ 4.26 │ │ 3 │ Iris-virginica │ 50 │ 5.552 │ -julia> combine([:PetalLength, :SepalLength] => (p, s) -> (a=mean(p)/mean(s), b=sum(p)), - gdf) # multiple columns are passed as arguments +julia> combine(gdf, [:PetalLength, :SepalLength] => ((p, s) -> (a=mean(p)/mean(s), b=sum(p))) => + AsTable) # multiple columns are passed as arguments 3×3 DataFrame │ Row │ Species │ a │ b │ │ │ String │ Float64 │ Float64 │ @@ -215,6 +279,14 @@ julia> combine(gdf, 1:2 => cor, nrow) │ 2 │ Iris-versicolor │ 0.525911 │ 50 │ │ 3 │ Iris-virginica │ 0.457228 │ 50 │ +julia> combine(gdf, :PetalLength => (x -> [extrema(x)]) => [:min, :max]) +3×3 DataFrame +│ Row │ Species │ min │ max │ +│ │ String │ Float64 │ Float64 │ +├─────┼─────────────────┼─────────┼─────────┤ +│ 1 │ Iris-setosa │ 1.0 │ 1.9 │ +│ 2 │ Iris-versicolor │ 3.0 │ 5.1 │ +│ 3 │ Iris-virginica │ 4.5 │ 6.9 │ ``` Contrary to `combine`, the `select` and `transform` functions always return @@ -268,7 +340,7 @@ julia> transform(gdf, :Species => x -> chop.(x, head=5, 
tail=0)) │ 150 │ Iris-virginica │ 5.9 │ 3.0 │ 5.1 │ 1.8 │ virginica │ ``` -The `combine` function also supports the `do` block form. However, as noted above, +All functions also support the `do` block form. However, as noted above, this form is slow and should therefore be avoided when performance matters. ```jldoctest sac @@ -385,7 +457,7 @@ julia> combine(gd, valuecols(gd) .=> mean) │ 2 │ Iris-versicolor │ 5.936 │ 2.77 │ 4.26 │ 1.326 │ │ 3 │ Iris-virginica │ 6.588 │ 2.974 │ 5.552 │ 2.026 │ -julia> combine(gd, valuecols(gd) .=> (x -> (x .- mean(x)) ./ std(x)) .=> valuecols(gd)) +julia> combine(gd, valuecols(gd) .=> (x -> (x .- mean(x)) ./ std(x)), renamecols=false) 150×5 DataFrame │ Row │ Species │ SepalLength │ SepalWidth │ PetalLength │ PetalWidth │ │ │ String │ Float64 │ Float64 │ Float64 │ Float64 │ diff --git a/src/DataFrames.jl b/src/DataFrames.jl index 408561bece..f3722f964d 100644 --- a/src/DataFrames.jl +++ b/src/DataFrames.jl @@ -107,6 +107,9 @@ include("abstractdataframe/join.jl") include("abstractdataframe/reshape.jl") include("groupeddataframe/splitapplycombine.jl") +include("groupeddataframe/callprocessing.jl") +include("groupeddataframe/fastaggregates.jl") +include("groupeddataframe/complextransforms.jl") include("abstractdataframe/show.jl") include("groupeddataframe/show.jl") diff --git a/src/abstractdataframe/selection.jl b/src/abstractdataframe/selection.jl index ea66b52898..ba772786e4 100644 --- a/src/abstractdataframe/selection.jl +++ b/src/abstractdataframe/selection.jl @@ -10,6 +10,145 @@ # 4) Pair{AsTable, <:Pair{<:Base.Callable, <:Union{Symbol, Vector{Symbol}, Type{AsTable}}}} # 5) Callable +const TRANSFORMATION_COMMON_RULES = + """ + Below detailed common rules for all transformation functions supported by + DataFrames.jl are explained and compared. + + All these operations are supported both for + `AbstractDataFrame` (when split and combine steps are skipped) and + `GroupedDataFrame`. 
Technically, `AbstractDataFrame` is just considered as being + grouped on no columns (meaning it has a single group, or zero groups if it is + empty). The only difference is that in this case the `keepkeys` and `ungroup` + keyword arguments (described below) are not supported and a data frame is always + returned, as there are no split and combine steps in this case. + + In order to perform operations by groups you first need to create a `GroupedDataFrame` + object from your data frame using the `groupby` function that takes two arguments: + (1) a data frame to be grouped, and (2) a set of columns to group by. + + Operations can then be applied on each group using one of the following functions: + * `combine`: does not put restrictions on number of rows returned, the order of rows + is specified by the order of groups in `GroupedDataFrame`; it is typically used + to compute summary statistics by group; + * `select`: return a data frame with the number and order of rows exactly the same + as the source data frame, including only new calculated columns; + `select!` is an in-place version of `select`; + * `transform`: return a data frame with the number and order of rows exactly the same + as the source data frame, including all columns from the source and new calculated columns; + `transform!` is an in-place version of `transform`. + + All these functions take a specification of one or more functions to apply to + each subset of the `DataFrame`. This specification can be of the following forms: + 1. standard column selectors (integers, `Symbol`s, strings, vectors of integers, + vectors of `Symbol`s, vectors of strings, + `All`, `Cols`, `:`, `Between`, `Not` and regular expressions) + 2. 
a `cols => function` pair indicating that `function` should be called with + positional arguments holding columns `cols`, which can be any valid column selector; + in this case target column name is automatically generated and it is assumed that + `function` returns a single value or a vector; the generated name is created by + concatenating source column name and `function` name by default (see examples below). + 3. a `cols => function => target_cols` form additionally explicitly specifying + the target column or columns. + 4. a `col => target_cols` pair, which renames the column `col` to `target_cols`, which + must be a single name (as a `Symbol` or a string). + 5. a `nrow` or `nrow => target_cols` form which efficiently computes the number of rows + in a group; without `target_cols` the new column is called `:nrow`, otherwise + it must be a single name (as a `Symbol` or a string). + 6. vectors or matrices containing transformations specified by the `Pair` syntax + described in points 2 to 5 + 7. a function which will be called with a `SubDataFrame` corresponding to each group; + this form should be avoided due to its poor performance unless a very large + number of columns are processed (in which case `SubDataFrame` avoids excessive + compilation) + + All functions have two types of signatures. One of them takes a `GroupedDataFrame` + as the first argument and an arbitrary number of transformations described above + as following arguments. The second type of signature is when a `Function` or a `Type` + is passed as the first argument and a `GroupedDataFrame` as the second argument + (similar to `map`). + + As a special rule, with the `cols => function` and `cols => function => + target_cols` syntaxes, if `cols` is wrapped in an `AsTable` + object then a `NamedTuple` containing columns selected by `cols` is passed to + `function`. + + What is allowed for `function` to return is determined by the `target_cols` value: + 1. 
If both `cols` and `target_cols` are omitted (so only a `function` is passed), + then returning a data frame, a matrix, a `NamedTuple`, or a `DataFrameRow` will + produce multiple columns in the result. Returning any other value produces + a single column. + 2. If `target_cols` is a `Symbol` or a string then the function is assumed to return + a single column. In this case returning a data frame, a matrix, a `NamedTuple`, + or a `DataFrameRow` raises an error. + 3. If `target_cols` is a vector of `Symbol`s or strings or `AsTable` it is assumed + that `function` returns multiple columns. + If `function` returns one of `AbstractDataFrame`, `NamedTuple`, `DataFrameRow`, + `AbstractMatrix` then rules described in point 1 above apply. + If `function` returns an `AbstractVector` then each element of this vector must + support the `keys` function, which must return a collection of `Symbol`s, strings + or integers; the return value of `keys` must be identical for all elements. + Then as many columns are created as there are elements in the return value + of the `keys` function. If `target_cols` is `AsTable` then their names + are set to be equal to the key names except if `keys` returns integers, in + which case they are prefixed by `x` (so the column names are e.g. `x1`, + `x2`, ...). If `target_cols` is a vector of `Symbol`s or strings then + column names produced using the rules above are ignored and replaced by + `target_cols` (the number of columns must be the same as the length of + `target_cols` in this case). + If `function` returns a value of any other type then it is assumed that it is a + table conforming to the Tables.jl API and the `Tables.columntable` function + is called on it to get the resulting columns and their names. The names are + retained when `target_cols` is `AsTable` and are replaced if + `target_cols` is a vector of `Symbol`s or strings. + + In all of these cases, `function` can return either a single row or multiple + rows. 
As a particular rule, values wrapped in a `Ref` or a `0`-dimensional + `AbstractArray` are unwrapped and then treated as a single row. + + `select`/`select!` and `transform`/`transform!` always return a `DataFrame` + with the same number and order of rows as the source (even if `GroupedDataFrame` + had its groups reordered). + + For `combine`, rows in the returned object appear in the order of groups in the + `GroupedDataFrame`. The functions can return an arbitrary number of rows for + each group, but the kind of returned object and the number and names of columns + must be the same for all groups, except when a `DataFrame()` or `NamedTuple()` + is returned, in which case a given group is skipped. + + It is allowed to mix single values and vectors if multiple transformations + are requested. In this case single value will be repeated to match the length + of columns specified by returned vectors. + + To apply `function` to each row instead of whole columns, it can be wrapped in a + `ByRow` struct. `cols` can be any column indexing syntax, in which case + `function` will be passed one argument for each of the columns specified by + `cols` or a `NamedTuple` of them if specified columns are wrapped in `AsTable`. + If `ByRow` is used it is allowed for `cols` to select an empty set of columns, + in which case `function` is called for each row without any arguments and an + empty `NamedTuple` is passed if empty set of columns is wrapped in `AsTable`. + + If a collection of column names is passed then requesting duplicate column + names in target data frame are accepted (e.g. `select!(df, [:a], :, r"a")` + is allowed) and only the first occurrence is used. In particular a syntax to + move column `:col` to the first position in the data frame is + `select!(df, :col, :)`. On the contrary, output column names of renaming, + transformation and single column selection operations must be unique, so e.g. 
+ `select!(df, :a, :a => :a)` or `select!(df, :a, :a => ByRow(sin) => :a)` are not allowed. + + As a general rule if `copycols=true` columns are copied and when + `copycols=false` columns are reused if possible. Note, however, that + including the same column several times in the data frame via renaming or + transformations that return the same object without copying may create + column aliases even if `copycols=true`. An example of such a situation is + `select!(df, :a, :a => :b, :a => identity => :c)`. + + If `df` is a `SubDataFrame` and `copycols=true` then a `DataFrame` is + returned and the same copying rules apply as for a `DataFrame` input: this + means in particular that selected columns will be copied. If + `copycols=false`, a `SubDataFrame` is returned without copying columns. + """ + """ ByRow @@ -434,233 +573,30 @@ function select_transform!(@nospecialize(nc::Union{Base.Callable, Pair{<:Union{I end end -SELECT_ARG_RULES = - """ - Arguments passed as `args...` can be: - - * Any index that is allowed for column indexing - ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR). - * A function or a type - * Column transformation operations using the `Pair` notation that is - described below and vectors or matrices of such pairs. - - Columns can be renamed using the `old_column => new_column_name` syntax, and - transformed using the `old_column => fun => new_column_name` syntax. - `new_column_name` must be a `Symbol` or a string, a vector of `Symbol`s or - strings, or `AsTable`. `fun` must be a function or a type. If `old_column` is a - `Symbol`, a string, or an integer then `fun` is applied to the corresponding - column vector. Otherwise `old_column` can be any column indexing syntax, in - which case `fun` will be passed the column vectors specified by `old_column` - as separate arguments. The only exception is when `old_column` is an - `AsTable` type wrapping a selector, in which case `fun` is passed a - `NamedTuple` containing the selected columns. 
- - Column renaming and transformation operations can be passed wrapped in - vectors or matrices (this is useful when combined with broadcasting). - - # Rules when `new_column_name` is a `Symbol` or a string or is absent - - If `fun` returns a value of type other than `AbstractVector` then it will be - repeated in a vector matching the target number of rows in the data - frame, unless its type is one of `AbstractDataFrame`, `NamedTuple`, - `DataFrameRow`, `AbstractMatrix`, in which case an error is thrown. As a - particular rule, values wrapped in a `Ref` or a `0`-dimensional - `AbstractArray` are unwrapped and then repeated. - - To apply `fun` to each row instead of whole columns, it can be wrapped in a - `ByRow` struct. In this case if `old_column` is a `Symbol`, a string, or an - integer then `fun` is applied to each element (row) of `old_column` using - broadcasting. Otherwise `old_column` can be any column indexing syntax, in - which case `fun` will be passed one argument for each of the columns - specified by `old_column`. If `ByRow` is used it is allowed for - `old_column` to select an empty set of columns, in which case `fun` - is called for each row without any arguments. - - Column transformation can also be specified using the short `old_column => - fun` form. In this case, `new_column_name` is automatically generated as - `\$(old_column)_\$(fun)` if `renamecols=true` and `\$(old_column)` if - `renamecols=false`. Up to three column names are used for multiple input - columns and they are joined using `_`; if more than three columns are passed - then the name consists of the first two names and `etc` suffix then, e.g. - `[:a,:b,:c,:d] => fun` produces the new column name `:a_b_etc_fun` if - `renamecols=true` and ``:a_b_etc` if `renamecols=false`. - It is not allowed to pass `renamecols=false` if `old_column` is empty - as it would generate an empty column name. 
- - # Rules when `new_column_name` is a vector of `Symbol`s or strings or is `AsTable` - - In this case it is assumed that `fun` returns multiple columns. - - If `fun` returns one of `AbstractDataFrame`, `NamedTuple`, `DataFrameRow`, - `AbstractMatrix` then rules described in the section describing the case - when `args` is a function or a type apply. - - If `fun` returns an `AbstractVector` then each element of this vector must - support the `keys` function, which must return a collection of `Symbol`s, strings - or integers; the return value of `keys` must be identical for all elements. - Then as many columns are created as there are elements in the return value - of the `keys` function. If `new_column_name` is `AsTable` then their names - are set to be equal to the key names except if `keys` returns integers, in - which case they are prefixed by `x` (so the column names are e.g. `x1`, - `x2`, ...). If `new_column_name` is a vector of `Symbol`s or strings then - column names produced using the rules above are ignored and replaced by - `new_column_name` (the number of columns must be the same as the length of - `new_column_name` in this case). - - If `fun` returns a value of any other type then it is assumed that it is a - table conforming to the Tables.jl API and the `Tables.columntable` function - is called on it to get the resulting columns and their names. The names are - retained when `new_column_name` is `AsTable` and are replaced if - `new_column_name` is a vector of `Symbol`s or strings. - - # Rules when element of `args` is a function or a type - - In this case the function or type is called with `df` as a single argument. - - If the return value of the transformation is one of `AbstractDataFrame`, - `NamedTuple`, `DataFrameRow` or `AbstractMatrix` then it is treated as - containing multiple columns. For `AbstractMatrix` column names are generated - as `x1`, `x2`, etc. 
For `AbstractDataFrame`, `NamedTuple` of vectors and - `AbstractMatrix` the columns are taken as is from the returned value. For - `DataFrameRow` and` NamedTuple` not containing any vectors the returned - value is broadcasted to a vector matching the target number of rows in the data - frame. - - If the return value is an `AbstractVector` then it is used as-is. The resulting - column gets the name `x1`. - - In all other cases the return value is repeated in a vector matching - the target number of rows in the data frame. As a particular rule, values - wrapped in a `Ref` or a `0`-dimensional `AbstractArray` are unwrapped and - then repeated. The resulting column gets the name `x1`. - - # Special rules - - As a special rule passing `nrow` without specifying `old_column` creates a - column named `:nrow` containing a number of rows in a source data frame, and - passing `nrow => new_column_name` stores the number of rows in source data - frame in `new_column_name` column. - - If a collection of column names is passed to `select!` or `select` then - requesting duplicate column names in target data frame are accepted (e.g. - `select!(df, [:a], :, r"a")` is allowed) and only the first occurrence is - used. In particular a syntax to move column `:col` to the first position in - the data frame is `select!(df, :col, :)`. On the contrary, output column - names of renaming, transformation and single column selection operations - must be unique, so e.g. `select!(df, :a, :a => :a)` or - `select!(df, :a, :a => ByRow(sin) => :a)` are not allowed. - """ - """ select!(df::DataFrame, args...; renamecols::Bool=true) - select!(args::Callable, df::DataFrame; renamecols::Bool=true) - -Mutate `df` in place to retain only columns specified by `args...` and return it. -The result is guaranteed to have the same number of rows as `df`, except when no -columns are selected (in which case the result has zero rows). 
- -$SELECT_ARG_RULES - -Note that including the same column several times in the data frame via renaming -or transformations that return the same object without copying will create -column aliases. An example of such a situation is -`select!(df, :a, :a => :b, :a => identity => :c)`. - -# Examples -```jldoctest -julia> df = DataFrame(a=1:3, b=4:6) -3×2 DataFrame -│ Row │ a │ b │ -│ │ Int64 │ Int64 │ -├─────┼───────┼───────┤ -│ 1 │ 1 │ 4 │ -│ 2 │ 2 │ 5 │ -│ 3 │ 3 │ 6 │ - -julia> select!(df, 2) -3×1 DataFrame -│ Row │ b │ -│ │ Int64 │ -├─────┼───────┤ -│ 1 │ 4 │ -│ 2 │ 5 │ -│ 3 │ 6 │ - -julia> df = DataFrame(a=1:3, b=4:6); - -julia> select!(df, :a => ByRow(sin) => :c, :b) -3×2 DataFrame -│ Row │ c │ b │ -│ │ Float64 │ Int64 │ -├─────┼──────────┼───────┤ -│ 1 │ 0.841471 │ 4 │ -│ 2 │ 0.909297 │ 5 │ -│ 3 │ 0.14112 │ 6 │ + select!(args::Base.Callable, df::DataFrame; renamecols::Bool=true) + select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true) + select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true) -julia> select!(df, :, [:c, :b] => (c,b) -> c .+ b .- sum(b)/length(b)) -3×3 DataFrame -│ Row │ c │ b │ c_b_function │ -│ │ Float64 │ Int64 │ Float64 │ -├─────┼──────────┼───────┼──────────────┤ -│ 1 │ 0.841471 │ 4 │ -0.158529 │ -│ 2 │ 0.909297 │ 5 │ 0.909297 │ -│ 3 │ 0.14112 │ 6 │ 1.14112 │ +Mutate `df` or `gd` in place to retain only columns or transformations specified by `args...` and +return it. The result is guaranteed to have the same number of rows as `df` or +parent of `gd`, except when no columns are selected (in which case the result +has zero rows). 
-julia> df = DataFrame(a=1:3, b=4:6); - -julia> select!(df, names(df) .=> [minimum maximum]); - -julia> df -3×4 DataFrame -│ Row │ a_minimum │ b_minimum │ a_maximum │ b_maximum │ -│ │ Int64 │ Int64 │ Int64 │ Int64 │ -├─────┼───────────┼───────────┼───────────┼───────────┤ -│ 1 │ 1 │ 4 │ 3 │ 6 │ -│ 2 │ 1 │ 4 │ 3 │ 6 │ -│ 3 │ 1 │ 4 │ 3 │ 6 │ - -julia> df = DataFrame(a=1:3, b=4:6); - -julia> using Statistics +If `gd` is passed then it is updated to reflect the new rows of its updated +parent. If there are independent `GroupedDataFrame` objects constructed using +the same parent data frame they might get corrupted. -julia> select!(df, AsTable(:) => ByRow(mean), renamecols=false) -3×1 DataFrame -│ Row │ a_b │ -│ │ Float64 │ -├─────┼─────────┤ -│ 1 │ 2.5 │ -│ 2 │ 3.5 │ -│ 3 │ 4.5 │ - -julia> df = DataFrame(a=1:3, b=4:6); - -julia> select!(first, df) -3×2 DataFrame -│ Row │ a │ b │ -│ │ Int64 │ Int64 │ -├─────┼───────┼───────┤ -│ 1 │ 1 │ 4 │ -│ 2 │ 1 │ 4 │ -│ 3 │ 1 │ 4 │ +$TRANSFORMATION_COMMON_RULES -julia> df = DataFrame(a=1:3, b=4:6, c=7:9) -3×3 DataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 4 │ 7 │ -│ 2 │ 2 │ 5 │ 8 │ -│ 3 │ 3 │ 6 │ 9 │ +# Keyword arguments +- `renamecols::Bool=true` : whether in the `cols => function` form automatically generated + column names should include the name of transformation functions or not. +- `ungroup::Bool=true` : whether the return value of the operation on `gd` should be a data + frame or a `GroupedDataFrame`. -julia> select!(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats, - AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => AsTable) -3×3 DataFrame -│ Row │ stats │ mean │ std │ -│ │ NamedTuple… │ Float64 │ Float64 │ -├─────┼─────────────────────────┼─────────┼─────────┤ -│ 1 │ (mean = 4.0, std = 3.0) │ 4.0 │ 3.0 │ -│ 2 │ (mean = 5.0, std = 3.0) │ 5.0 │ 3.0 │ -│ 3 │ (mean = 6.0, std = 3.0) │ 6.0 │ 3.0 │ +See [`select`](@ref) for examples. 
``` """ @@ -677,12 +613,22 @@ end """ transform!(df::DataFrame, args...; renamecols::Bool=true) transform!(args::Callable, df::DataFrame; renamecols::Bool=true) + transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true) + transform!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true) -Mutate `df` in place to add columns specified by `args...` and return it. +Mutate `df` or `gd` in place to add columns specified by `args...` and return it. The result is guaranteed to have the same number of rows as `df`. -Equivalent to `select!(df, :, args...)`. +Equivalent to `select!(df, :, args...)` or `select!(gd, :, args...)`. + +$TRANSFORMATION_COMMON_RULES -See [`select!`](@ref) for detailed rules regarding accepted values for `args`. +# Keyword arguments +- `renamecols::Bool=true` : whether in the `cols => function` form automatically generated + column names should include the name of transformation functions or not. +- `ungroup::Bool=true` : whether the return value of the operation on `gd` should be a data + frame or a `GroupedDataFrame`. + +See [`select`](@ref) for examples. """ transform!(df::DataFrame, @nospecialize(args...); renamecols::Bool=true) = select!(df, :, args..., renamecols=renamecols) @@ -697,38 +643,27 @@ end """ select(df::AbstractDataFrame, args...; copycols::Bool=true, renamecols::Bool=true) select(args::Callable, df::DataFrame; renamecols::Bool=true) - -Create a new data frame that contains columns from `df` specified by `args` and -return it. The result is guaranteed to have the same number of rows as `df`, -except when no columns are selected (in which case the result has zero rows).. - -If `df` is a `DataFrame` or `copycols=true` then column renaming and transformations -are supported. - -$SELECT_ARG_RULES - -If `df` is a `DataFrame` a new `DataFrame` is returned. -If `copycols=false`, then the returned `DataFrame` shares column vectors with `df` -where possible. 
-If `copycols=true` (the default), then the returned `DataFrame` will not share -columns with `df`. -The only exception for this rule is the `old_column => fun => new_column` -transformation when `fun` returns a vector that is not allocated by `fun` but is -neither a `SubArray` nor one of the input vectors. -In such a case a new `DataFrame` might contain aliases. Such a situation can -only happen with transformations which returns vectors other than their inputs, -e.g. with `select(df, :a => (x -> c) => :c1, :b => (x -> c) => :c2)` when `c` -is a vector object or with `select(df, :a => (x -> df.c) => :c2)`. - -If `df` is a `SubDataFrame` and `copycols=true` then a `DataFrame` is returned -and the same copying rules apply as for a `DataFrame` input: -this means in particular that selected columns will be copied. -If `copycols=false`, a `SubDataFrame` is returned without copying columns. - -Note that including the same column several times in the data frame via renaming -or transformations that return the same object when `copycols=false` will create -column aliases. An example of such a situation is -`select(df, :a, :a => :b, :a => identity => :c, copycols=false)`. + select(gd::GroupedDataFrame, args...; copycols::Bool=true, keepkeys::Bool=true, + ungroup::Bool=true, renamecols::Bool=true) + select(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true, + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) + +Create a new data frame that contains columns from `df` or `gd` specified by +`args` and return it. The result is guaranteed to have the same number of rows +as `df`, except when no columns are selected (in which case the result has zero +rows). + +$TRANSFORMATION_COMMON_RULES + +# Keyword arguments +- `copycols::Bool=true` : whether columns of the source data frame should be copied if + no transformation is applied to them. 
+- `renamecols::Bool=true` : whether in the `cols => function` form automatically generated + column names should include the name of transformation functions or not. +- `keepkeys::Bool=true` : whether grouping columns of `gd` should be kept in the returned + data frame. +- `ungroup::Bool=true` : whether the return value of the operation on `gd` should be a data + frame or a `GroupedDataFrame`. # Examples ```jldoctest @@ -815,6 +750,131 @@ julia> select(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stats │ 1 │ (mean = 4.0, std = 3.0) │ 4.0 │ 3.0 │ │ 2 │ (mean = 5.0, std = 3.0) │ 5.0 │ 3.0 │ │ 3 │ (mean = 6.0, std = 3.0) │ 6.0 │ 3.0 │ + +julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2], + b = repeat([2, 1], outer=[4]), + c = 1:8) +8×3 DataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 2 │ 1 │ +│ 2 │ 1 │ 1 │ 2 │ +│ 3 │ 1 │ 2 │ 3 │ +│ 4 │ 2 │ 1 │ 4 │ +│ 5 │ 2 │ 2 │ 5 │ +│ 6 │ 1 │ 1 │ 6 │ +│ 7 │ 1 │ 2 │ 7 │ +│ 8 │ 2 │ 1 │ 8 │ + +julia> gd = groupby(df, :a); + +julia> select(gd, :c => sum, nrow) +8×3 DataFrame +│ Row │ a │ c_sum │ nrow │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 19 │ 5 │ +│ 2 │ 1 │ 19 │ 5 │ +│ 3 │ 1 │ 19 │ 5 │ +│ 4 │ 2 │ 17 │ 3 │ +│ 5 │ 2 │ 17 │ 3 │ +│ 6 │ 1 │ 19 │ 5 │ +│ 7 │ 1 │ 19 │ 5 │ +│ 8 │ 2 │ 17 │ 3 │ + +julia> select(gd, :c => sum, nrow, ungroup=false) +GroupedDataFrame with 2 groups based on key: a +First Group (5 rows): a = 1 +│ Row │ a │ c_sum │ nrow │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 19 │ 5 │ +│ 2 │ 1 │ 19 │ 5 │ +│ 3 │ 1 │ 19 │ 5 │ +│ 4 │ 1 │ 19 │ 5 │ +│ 5 │ 1 │ 19 │ 5 │ +⋮ +Last Group (3 rows): a = 2 +│ Row │ a │ c_sum │ nrow │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 2 │ 17 │ 3 │ +│ 2 │ 2 │ 17 │ 3 │ +│ 3 │ 2 │ 17 │ 3 │ + +# specifying a name for target column +julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c) +8×2 DataFrame +│ Row │ a │ sum_log_c │ +│ │ Int64 │ 
Float64 │ +├─────┼───────┼───────────┤ +│ 1 │ 1 │ 5.52943 │ +│ 2 │ 1 │ 5.52943 │ +│ 3 │ 1 │ 5.52943 │ +│ 4 │ 2 │ 5.07517 │ +│ 5 │ 2 │ 5.07517 │ +│ 6 │ 1 │ 5.52943 │ +│ 7 │ 1 │ 5.52943 │ +│ 8 │ 2 │ 5.07517 │ + +julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs +8×3 DataFrame +│ Row │ a │ b_sum │ c_sum │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 8 │ 19 │ +│ 2 │ 1 │ 8 │ 19 │ +│ 3 │ 1 │ 8 │ 19 │ +│ 4 │ 2 │ 4 │ 17 │ +│ 5 │ 2 │ 4 │ 17 │ +│ 6 │ 1 │ 8 │ 19 │ +│ 7 │ 1 │ 8 │ 19 │ +│ 8 │ 2 │ 4 │ 17 │ + + # multiple arguments, renaming and keepkeys +julia> select(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false) +8×3 DataFrame +│ Row │ b1 │ c1 │ b_c_+ │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 2 │ 1 │ 3 │ +│ 2 │ 1 │ 2 │ 3 │ +│ 3 │ 2 │ 3 │ 5 │ +│ 4 │ 1 │ 4 │ 5 │ +│ 5 │ 2 │ 5 │ 7 │ +│ 6 │ 1 │ 6 │ 7 │ +│ 7 │ 2 │ 7 │ 9 │ +│ 8 │ 1 │ 8 │ 9 │ + +# broadcasting and column expansion +julia> select(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max]) +8×4 DataFrame +│ Row │ a │ b │ min │ max │ +│ │ Int64 │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 2 │ 1 │ 2 │ +│ 2 │ 1 │ 1 │ 1 │ 2 │ +│ 3 │ 1 │ 2 │ 2 │ 3 │ +│ 4 │ 2 │ 1 │ 1 │ 4 │ +│ 5 │ 2 │ 2 │ 2 │ 5 │ +│ 6 │ 1 │ 1 │ 1 │ 6 │ +│ 7 │ 1 │ 2 │ 2 │ 7 │ +│ 8 │ 2 │ 1 │ 1 │ 8 │ + +julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false) +8×4 DataFrame +│ Row │ a │ b │ c │ b_c │ +│ │ Int64 │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 2 │ 1 │ 3 │ +│ 2 │ 1 │ 1 │ 2 │ 3 │ +│ 3 │ 1 │ 2 │ 3 │ 5 │ +│ 4 │ 2 │ 1 │ 4 │ 5 │ +│ 5 │ 2 │ 2 │ 5 │ 7 │ +│ 6 │ 1 │ 1 │ 6 │ 7 │ +│ 7 │ 1 │ 2 │ 7 │ 9 │ +│ 8 │ 2 │ 1 │ 8 │ 9 │ ``` """ @@ -830,14 +890,59 @@ end """ transform(df::AbstractDataFrame, args...; copycols::Bool=true, renamecols::Bool=true) - transform(args::Callable, df::DataFrame; renamecols::Bool=true) + transform(f::Callable, df::DataFrame; renamecols::Bool=true) + transform(gd::GroupedDataFrame, 
args...; copycols::Bool=true, + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) + transform(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true, + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) + +Create a new data frame that contains columns from `df` or `gd` plus columns +specified by `args` and return it. The result is guaranteed to have the same +number of rows as `df`. Equivalent to `select(df, :, args...)` or `select(gd, :, args...)`. + +$TRANSFORMATION_COMMON_RULES + +# Keyword arguments +- `copycols::Bool=true` : whether columns of the source data frame should be copied if + no transformation is applied to them. +- `renamecols::Bool=true` : whether in the `cols => function` form automatically generated + column names should include the name of transformation functions or not. +- `keepkeys::Bool=true` : whether grouping columns of `gd` should be kept in the returned + data frame. +- `ungroup::Bool=true` : whether the return value of the operation on `gd` should be a data + frame or a `GroupedDataFrame`. + +Note that when the first argument is a `GroupedDataFrame`, `keepkeys=false` +is needed to be able to return a different value for the grouping column: -Create a new data frame that contains columns from `df` and adds columns -specified by `args` and return it. -The result is guaranteed to have the same number of rows as `df`. -Equivalent to `select(df, :, args..., copycols=copycols)`. +``` +julia> gdf = groupby(DataFrame(x=1:2), :x) +GroupedDataFrame with 2 groups based on key: x +First Group (1 row): x = 1 +│ Row │ x │ +│ │ Int64 │ +├─────┼───────┤ +│ 1 │ 1 │ +⋮ +Last Group (1 row): x = 2 +│ Row │ x │ +│ │ Int64 │ +├─────┼───────┤ +│ 1 │ 2 │ + +julia> transform(gdf, x -> (x=10,), keepkeys=false) +2×1 DataFrame +│ Row │ x │ +│ │ Int64 │ +├─────┼───────┤ +│ 1 │ 10 │ +│ 2 │ 10 │ -See [`select`](@ref) for detailed rules regarding accepted values for `args`. 
+julia> transform(gdf, x -> (x=10,), keepkeys=true) +ERROR: ArgumentError: column :x in returned data frame is not equal to grouping key :x +``` + +See [`select`](@ref) for more examples. """ transform(df::AbstractDataFrame, @nospecialize(args...); copycols::Bool=true, renamecols::Bool=true) = select(df, :, args..., copycols=copycols, renamecols=renamecols) @@ -851,16 +956,25 @@ end """ combine(df::AbstractDataFrame, args...; renamecols::Bool=true) - combine(args::Callable, df::AbstractDataFrame; renamecols::Bool=true) - -Create a new data frame that contains columns from `df` specified by `args` and -return it. The result can have any number of rows that is determined by the -values returned by passed transformations. - -See [`select`](@ref) for detailed rules regarding accepted values for `args` in -`combine(df, args...)` form. For `combine(arg, df)` the same rules as for -`combine` on `GroupedDataFrame` apply except that a `df` with zero rows is -currently not allowed. + combine(f::Callable, df::AbstractDataFrame; renamecols::Bool=true) + combine(gd::GroupedDataFrame, args...; + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) + combine(f::Base.Callable, gd::GroupedDataFrame; + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) + +Create a new data frame that contains columns from `df` or `gd` specified by +`args` and return it. The result can have any number of rows that is determined +by the values returned by passed transformations. + +$TRANSFORMATION_COMMON_RULES + +# Keyword arguments +- `renamecols::Bool=true` : whether in the `cols => function` form automatically generated + column names should include the name of transformation functions or not. +- `keepkeys::Bool=true` : whether grouping columns of `gd` should be kept in the returned + data frame. +- `ungroup::Bool=true` : whether the return value of the operation on `gd` should be a data + frame or a `GroupedDataFrame`. 
# Examples ```jldoctest @@ -941,6 +1055,148 @@ julia> combine(df, AsTable(:) => ByRow(x -> (mean=mean(x), std=std(x))) => :stat │ 1 │ (mean = 4.0, std = 3.0) │ 4.0 │ 3.0 │ │ 2 │ (mean = 5.0, std = 3.0) │ 5.0 │ 3.0 │ │ 3 │ (mean = 6.0, std = 3.0) │ 6.0 │ 3.0 │ + +julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]), + b = repeat([2, 1], outer=[4]), + c = 1:8); + +julia> gd = groupby(df, :a); + +julia> combine(gd, :c => sum, nrow) +4×3 DataFrame +│ Row │ a │ c_sum │ nrow │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 6 │ 2 │ +│ 2 │ 2 │ 8 │ 2 │ +│ 3 │ 3 │ 10 │ 2 │ +│ 4 │ 4 │ 12 │ 2 │ + +julia> combine(gd, :c => sum, nrow, ungroup=false) +GroupedDataFrame with 4 groups based on key: a +First Group (1 row): a = 1 +│ Row │ a │ c_sum │ nrow │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 6 │ 2 │ +⋮ +Last Group (1 row): a = 4 +│ Row │ a │ c_sum │ nrow │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 4 │ 12 │ 2 │ + +julia> combine(gd) do d # do syntax for the slower variant + sum(d.c) + end +4×2 DataFrame +│ Row │ a │ x1 │ +│ │ Int64 │ Int64 │ +├─────┼───────┼───────┤ +│ 1 │ 1 │ 6 │ +│ 2 │ 2 │ 8 │ +│ 3 │ 3 │ 10 │ +│ 4 │ 4 │ 12 │ + +# specifying a name for target column +julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) +4×2 DataFrame +│ Row │ a │ sum_log_c │ +│ │ Int64 │ Float64 │ +├─────┼───────┼───────────┤ +│ 1 │ 1 │ 1.60944 │ +│ 2 │ 2 │ 2.48491 │ +│ 3 │ 3 │ 3.04452 │ +│ 4 │ 4 │ 3.46574 │ + +julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs +4×3 DataFrame +│ Row │ a │ b_sum │ c_sum │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 4 │ 6 │ +│ 2 │ 2 │ 2 │ 8 │ +│ 3 │ 3 │ 4 │ 10 │ +│ 4 │ 4 │ 2 │ 12 │ + +julia> combine(gd) do sdf # dropping group when DataFrame() is returned + sdf.c[1] != 1 ? 
sdf : DataFrame() + end +6×3 DataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 2 │ 1 │ 2 │ +│ 2 │ 2 │ 1 │ 6 │ +│ 3 │ 3 │ 2 │ 3 │ +│ 4 │ 3 │ 2 │ 7 │ +│ 5 │ 4 │ 1 │ 4 │ +│ 6 │ 4 │ 1 │ 8 │ + +# auto-splatting, renaming and keepkeys +julia> combine(gd, :b => :b1, :c => :c1, [:b, :c] => +, keepkeys=false) +8×3 DataFrame +│ Row │ b1 │ c1 │ b_c_+ │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 2 │ 1 │ 3 │ +│ 2 │ 2 │ 5 │ 7 │ +│ 3 │ 1 │ 2 │ 3 │ +│ 4 │ 1 │ 6 │ 7 │ +│ 5 │ 2 │ 3 │ 5 │ +│ 6 │ 2 │ 7 │ 9 │ +│ 7 │ 1 │ 4 │ 5 │ +│ 8 │ 1 │ 8 │ 9 │ + +# broadcasting and column expansion +julia> combine(gd, :b, AsTable([:b, :c]) => ByRow(extrema) => [:min, :max]) +8×4 DataFrame +│ Row │ a │ b │ min │ max │ +│ │ Int64 │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 2 │ 1 │ 2 │ +│ 2 │ 1 │ 2 │ 2 │ 5 │ +│ 3 │ 2 │ 1 │ 1 │ 2 │ +│ 4 │ 2 │ 1 │ 1 │ 6 │ +│ 5 │ 3 │ 2 │ 2 │ 3 │ +│ 6 │ 3 │ 2 │ 2 │ 7 │ +│ 7 │ 4 │ 1 │ 1 │ 4 │ +│ 8 │ 4 │ 1 │ 1 │ 8 │ + +# preventing vector from being spread across multiple rows +julia> combine(gd, [:b, :c] .=> Ref) +4×3 DataFrame +│ Row │ a │ b_Ref │ c_Ref │ +│ │ Int64 │ SubArra… │ SubArra… │ +├─────┼───────┼──────────┼──────────┤ +│ 1 │ 1 │ [2, 2] │ [1, 5] │ +│ 2 │ 2 │ [1, 1] │ [2, 6] │ +│ 3 │ 3 │ [2, 2] │ [3, 7] │ +│ 4 │ 4 │ [1, 1] │ [4, 8] │ + +julia> combine(gd, AsTable(:) => Ref) # protecting result +4×2 DataFrame +│ Row │ a │ a_b_c_Ref │ +│ │ Int64 │ NamedTuple… │ +├─────┼───────┼──────────────────────────────────────┤ +│ 1 │ 1 │ (a = [1, 1], b = [2, 2], c = [1, 5]) │ +│ 2 │ 2 │ (a = [2, 2], b = [1, 1], c = [2, 6]) │ +│ 3 │ 3 │ (a = [3, 3], b = [2, 2], c = [3, 7]) │ +│ 4 │ 4 │ (a = [4, 4], b = [1, 1], c = [4, 8]) │ + +julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false) +8×4 DataFrame +│ Row │ a │ b │ c │ b_c │ +│ │ Int64 │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 2 │ 1 │ 3 │ +│ 2 │ 1 │ 2 │ 5 │ 7 │ +│ 3 │ 2 │ 1 │ 2 
│ 3 │ +│ 4 │ 2 │ 1 │ 6 │ 7 │ +│ 5 │ 3 │ 2 │ 3 │ 5 │ +│ 6 │ 3 │ 2 │ 7 │ 9 │ +│ 7 │ 4 │ 1 │ 4 │ 5 │ +│ 8 │ 4 │ 1 │ 8 │ 9 │ ``` """ combine(df::AbstractDataFrame, @nospecialize(args...); renamecols::Bool=true) = @@ -953,6 +1209,11 @@ function combine(arg::Base.Callable, df::AbstractDataFrame; renamecols::Bool=tru return combine(df, arg) end +combine(f::Pair, gd::AbstractDataFrame; renamecols::Bool=true) = + throw(ArgumentError("First argument must be a transformation if the second argument is a data frame. " * + "You can pass a `Pair` as the second argument of the transformation. If you want the return " * + "value to be processed as having multiple columns add `=> AsTable` suffix to the pair.")) + manipulate(df::DataFrame, args::AbstractVector{Int}; copycols::Bool, keeprows::Bool, renamecols::Bool) = DataFrame(_columns(df)[args], Index(_names(df)[args]), copycols=copycols) diff --git a/src/groupeddataframe/callprocessing.jl b/src/groupeddataframe/callprocessing.jl new file mode 100644 index 0000000000..859987d83d --- /dev/null +++ b/src/groupeddataframe/callprocessing.jl @@ -0,0 +1,143 @@ +# Wrapping automatically adds column names when the value returned +# by the user-provided function lacks them +wrap(x::Union{AbstractDataFrame, DataFrameRow}) = x +wrap(x::NamedTuple) = x +function wrap(x::NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}) + if !isempty(x) + len1 = length(x[1]) + for i in 2:length(x) + length(x[i]) == len1 || throw(DimensionMismatch("all vectors returned in a " * + "NamedTuple must have the same length")) + end + end + return x +end +wrap(x::AbstractMatrix) = + NamedTuple{Tuple(gennames(size(x, 2)))}(Tuple(view(x, :, i) for i in 1:size(x, 2))) +wrap(x::Any) = (x1=x,) + +const ERROR_ROW_COUNT = "return value must not change its kind " * + "(single row or variable number of rows) across groups" + +const ERROR_COL_COUNT = "function must return only single-column values, " * + "or only multiple-column values" + +wrap_table(x::Any, ::Val) = + 
throw(ArgumentError(ERROR_ROW_COUNT)) +function wrap_table(x::Union{NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}, + AbstractDataFrame, AbstractMatrix}, + ::Val{firstmulticol}) where firstmulticol + if !firstmulticol + throw(ArgumentError(ERROR_COL_COUNT)) + end + return wrap(x) +end + +function wrap_table(x::AbstractVector, ::Val{firstmulticol}) where firstmulticol + if firstmulticol + throw(ArgumentError(ERROR_COL_COUNT)) + end + return wrap(x) +end + +function wrap_row(x::Any, ::Val{firstmulticol}) where firstmulticol + # NamedTuple is not possible in this branch + if (x isa DataFrameRow) ⊻ firstmulticol + throw(ArgumentError(ERROR_COL_COUNT)) + end + return wrap(x) +end + +function wrap_row(x::Union{AbstractArray{<:Any, 0}, Ref}, + ::Val{firstmulticol}) where firstmulticol + if firstmulticol + throw(ArgumentError(ERROR_COL_COUNT)) + end + return (x1 = x[],) +end + +# note that also NamedTuple() is correctly captured by this definition +# as it is more specific than the one below +wrap_row(::Union{AbstractVecOrMat, AbstractDataFrame, + NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, ::Val) = + throw(ArgumentError(ERROR_ROW_COUNT)) + +function wrap_row(x::NamedTuple, ::Val{firstmulticol}) where firstmulticol + if any(v -> v isa AbstractVector, x) + throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed")) + end + if !firstmulticol + throw(ArgumentError(ERROR_COL_COUNT)) + end + return x +end + +# idx, starts and ends are passed separately to avoid cost of field access in tight loop +# Manual unrolling of Tuple is used as it turned out more efficient than @generated +# for small number of columns passed. 
+# For more than 4 columns `map` is slower than @generated +# but this case is probably rare and if huge number of columns is passed @generated +# has very high compilation cost +function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, + starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, + gd::GroupedDataFrame, incols::Tuple{}, i::Integer) + if f isa ByRow + return [f.fun() for _ in 1:(ends[i] - starts[i] + 1)] + else + return f() + end +end + +function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, + starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, + gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer) + idx = idx[starts[i]:ends[i]] + return f(view(incols[1], idx)) +end + +function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, + starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, + gd::GroupedDataFrame, incols::NTuple{2, AbstractVector}, i::Integer) + idx = idx[starts[i]:ends[i]] + return f(view(incols[1], idx), view(incols[2], idx)) +end + +function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, + starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, + gd::GroupedDataFrame, incols::NTuple{3, AbstractVector}, i::Integer) + idx = idx[starts[i]:ends[i]] + return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx)) +end + +function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, + starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, + gd::GroupedDataFrame, incols::NTuple{4, AbstractVector}, i::Integer) + idx = idx[starts[i]:ends[i]] + return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx), + view(incols[4], idx)) +end + +function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, + starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, + gd::GroupedDataFrame, incols::Tuple, i::Integer) + idx = idx[starts[i]:ends[i]] + return f(map(c -> view(c, idx), incols)...) 
+end + +function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, + starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, + gd::GroupedDataFrame, incols::NamedTuple, i::Integer) + if f isa ByRow && isempty(incols) + return [f.fun(NamedTuple()) for _ in 1:(ends[i] - starts[i] + 1)] + else + idx = idx[starts[i]:ends[i]] + return f(map(c -> view(c, idx), incols)) + end +end + +function do_call(f::Base.Callable, idx::AbstractVector{<:Integer}, + starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, + gd::GroupedDataFrame, incols::Nothing, i::Integer) + idx = idx[starts[i]:ends[i]] + return f(view(parent(gd), idx, :)) +end diff --git a/src/groupeddataframe/complextransforms.jl b/src/groupeddataframe/complextransforms.jl new file mode 100644 index 0000000000..8db068c398 --- /dev/null +++ b/src/groupeddataframe/complextransforms.jl @@ -0,0 +1,236 @@ +_nrow(df::AbstractDataFrame) = nrow(df) +_nrow(x::NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}) = + isempty(x) ? 
0 : length(x[1]) +_ncol(df::AbstractDataFrame) = ncol(df) +_ncol(x::Union{NamedTuple, DataFrameRow}) = length(x) + +function _combine_multicol(firstres, fun::Base.Callable, gd::GroupedDataFrame, + incols::Union{Nothing, AbstractVector, Tuple, NamedTuple}) + firstmulticol = firstres isa MULTI_COLS_TYPE + if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame, + NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}) + idx_agg = Vector{Int}(undef, length(gd)) + fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd) + else + idx_agg = nothing + end + return _combine_with_first(wrap(firstres), fun, gd, incols, + Val(firstmulticol), idx_agg) +end + +function _combine_with_first(first::Union{NamedTuple, DataFrameRow, AbstractDataFrame}, + f::Base.Callable, gd::GroupedDataFrame, + incols::Union{Nothing, AbstractVector, Tuple, NamedTuple}, + firstmulticol::Val, idx_agg::Union{Nothing, AbstractVector{<:Integer}}) + extrude = false + + if first isa AbstractDataFrame + n = 0 + eltys = eltype.(eachcol(first)) + elseif first isa NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}} + n = 0 + eltys = map(eltype, first) + elseif first isa DataFrameRow + n = length(gd) + eltys = [eltype(parent(first)[!, i]) for i in parentcols(index(first))] + elseif firstmulticol == Val(false) && first[1] isa Union{AbstractArray{<:Any, 0}, Ref} + extrude = true + first = wrap_row(first[1], firstmulticol) + n = length(gd) + eltys = (typeof(first[1]),) + else # other NamedTuple giving a single row + n = length(gd) + eltys = map(typeof, first) + if any(x -> x <: AbstractVector, eltys) + throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed")) + end + end + idx = isnothing(idx_agg) ? Vector{Int}(undef, n) : idx_agg + local initialcols + let eltys=eltys, n=n # Workaround for julia#15276 + initialcols = ntuple(i -> Tables.allocatecolumn(eltys[i], n), _ncol(first)) + end + targetcolnames = tuple(propertynames(first)...) 
+ if !extrude && first isa Union{AbstractDataFrame, + NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}} + outcols, finalcolnames = _combine_tables_with_first!(first, initialcols, idx, 1, 1, + f, gd, incols, targetcolnames, + firstmulticol) + else + outcols, finalcolnames = _combine_rows_with_first!(first, initialcols, 1, 1, + f, gd, incols, targetcolnames, + firstmulticol) + end + return idx, outcols, collect(Symbol, finalcolnames) +end + +function fill_row!(row, outcols::NTuple{N, AbstractVector}, + i::Integer, colstart::Integer, + colnames::NTuple{N, Symbol}) where N + if _ncol(row) != N + throw(ArgumentError("return value must have the same number of columns " * + "for all groups (got $N and $(length(row)))")) + end + @inbounds for j in colstart:length(outcols) + col = outcols[j] + cn = colnames[j] + local val + try + val = row[cn] + catch + throw(ArgumentError("return value must have the same column names " * + "for all groups (got $colnames and $(propertynames(row)))")) + end + S = typeof(val) + T = eltype(col) + if S <: T || promote_type(S, T) <: T + col[i] = val + else + return j + end + end + return nothing +end + +function _combine_rows_with_first!(first::Union{NamedTuple, DataFrameRow}, + outcols::NTuple{N, AbstractVector}, + rowstart::Integer, colstart::Integer, + f::Base.Callable, gd::GroupedDataFrame, + incols::Union{Nothing, AbstractVector, Tuple, NamedTuple}, + colnames::NTuple{N, Symbol}, + firstmulticol::Val) where N + len = length(gd) + gdidx = gd.idx + starts = gd.starts + ends = gd.ends + + # handle empty GroupedDataFrame + len == 0 && return outcols, colnames + + # Handle first group + j = fill_row!(first, outcols, rowstart, colstart, colnames) + @assert j === nothing # eltype is guaranteed to match + # Handle remaining groups + @inbounds for i in rowstart+1:len + row = wrap_row(do_call(f, gdidx, starts, ends, gd, incols, i), firstmulticol) + j = fill_row!(row, outcols, i, 1, colnames) + if j !== nothing # Need to widen column type + local 
newcols + let i = i, j = j, outcols=outcols, row=row # Workaround for julia#15276 + newcols = ntuple(length(outcols)) do k + S = typeof(row[k]) + T = eltype(outcols[k]) + U = promote_type(S, T) + if S <: T || U <: T + outcols[k] + else + copyto!(Tables.allocatecolumn(U, length(outcols[k])), + 1, outcols[k], 1, k >= j ? i-1 : i) + end + end + end + return _combine_rows_with_first!(row, newcols, i, j, + f, gd, incols, colnames, firstmulticol) + end + end + return outcols, colnames +end + +# This needs to be in a separate function +# to work around a crash due to JuliaLang/julia#29430 +if VERSION >= v"1.1.0-DEV.723" + @inline function do_append!(do_it, col, vals) + do_it && append!(col, vals) + return do_it + end +else + @noinline function do_append!(do_it, col, vals) + do_it && append!(col, vals) + return do_it + end +end + +function append_rows!(rows, outcols::NTuple{N, AbstractVector}, + colstart::Integer, colnames::NTuple{N, Symbol}) where N + if !isa(rows, Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}) + throw(ArgumentError(ERROR_ROW_COUNT)) + elseif _ncol(rows) != N + throw(ArgumentError("return value must have the same number of columns " * + "for all groups (got $N and $(_ncol(rows)))")) + end + @inbounds for j in colstart:length(outcols) + col = outcols[j] + cn = colnames[j] + local vals + try + vals = getproperty(rows, cn) + catch + throw(ArgumentError("return value must have the same column names " * + "for all groups (got $colnames and $(propertynames(rows)))")) + end + S = eltype(vals) + T = eltype(col) + if !do_append!(S <: T || promote_type(S, T) <: T, col, vals) + return j + end + end + return nothing +end + +function _combine_tables_with_first!(first::Union{AbstractDataFrame, + NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, + outcols::NTuple{N, AbstractVector}, + idx::Vector{Int}, rowstart::Integer, colstart::Integer, + f::Base.Callable, gd::GroupedDataFrame, + incols::Union{Nothing, AbstractVector, Tuple, 
NamedTuple}, + colnames::NTuple{N, Symbol}, + firstmulticol::Val) where N + len = length(gd) + gdidx = gd.idx + starts = gd.starts + ends = gd.ends + # Handle first group + + @assert _ncol(first) == N + if !isempty(colnames) && length(gd) > 0 + j = append_rows!(first, outcols, colstart, colnames) + @assert j === nothing # eltype is guaranteed to match + append!(idx, Iterators.repeated(gdidx[starts[rowstart]], _nrow(first))) + end + # Handle remaining groups + @inbounds for i in rowstart+1:len + rows = wrap_table(do_call(f, gdidx, starts, ends, gd, incols, i), firstmulticol) + _ncol(rows) == 0 && continue + if isempty(colnames) + newcolnames = tuple(propertynames(rows)...) + if rows isa AbstractDataFrame + eltys = eltype.(eachcol(rows)) + else + eltys = map(eltype, rows) + end + initialcols = ntuple(i -> Tables.allocatecolumn(eltys[i], 0), _ncol(rows)) + return _combine_tables_with_first!(rows, initialcols, idx, i, 1, + f, gd, incols, newcolnames, firstmulticol) + end + j = append_rows!(rows, outcols, 1, colnames) + if j !== nothing # Need to widen column type + local newcols + let i = i, j = j, outcols=outcols, rows=rows # Workaround for julia#15276 + newcols = ntuple(length(outcols)) do k + S = eltype(rows isa AbstractDataFrame ? 
rows[!, k] : rows[k]) + T = eltype(outcols[k]) + U = promote_type(S, T) + if S <: T || U <: T + outcols[k] + else + copyto!(Tables.allocatecolumn(U, length(outcols[k])), outcols[k]) + end + end + end + return _combine_tables_with_first!(rows, newcols, idx, i, j, + f, gd, incols, colnames, firstmulticol) + end + append!(idx, Iterators.repeated(gdidx[starts[i]], _nrow(rows))) + end + return outcols, colnames +end diff --git a/src/groupeddataframe/fastaggregates.jl b/src/groupeddataframe/fastaggregates.jl new file mode 100644 index 0000000000..9d0d6e8cd4 --- /dev/null +++ b/src/groupeddataframe/fastaggregates.jl @@ -0,0 +1,284 @@ +abstract type AbstractAggregate end + +struct Reduce{O, C, A} <: AbstractAggregate + op::O + condf::C + adjust::A + checkempty::Bool +end +Reduce(f, condf=nothing, adjust=nothing) = Reduce(f, condf, adjust, false) + +check_aggregate(f::Any, ::AbstractVector) = f +check_aggregate(f::typeof(sum), ::AbstractVector{<:Union{Missing, Number}}) = + Reduce(Base.add_sum) +check_aggregate(f::typeof(sum∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = + Reduce(Base.add_sum, !ismissing) +check_aggregate(f::typeof(prod), ::AbstractVector{<:Union{Missing, Number}}) = + Reduce(Base.mul_prod) +check_aggregate(f::typeof(prod∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = + Reduce(Base.mul_prod, !ismissing) +check_aggregate(f::typeof(maximum), + ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f +check_aggregate(f::typeof(maximum), v::AbstractVector{<:Union{Missing, Real}}) = + eltype(v) === Any ? f : Reduce(max) +check_aggregate(f::typeof(maximum∘skipmissing), + ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f +check_aggregate(f::typeof(maximum∘skipmissing), v::AbstractVector{<:Union{Missing, Real}}) = + eltype(v) === Any ? 
f : Reduce(max, !ismissing, nothing, true) +check_aggregate(f::typeof(minimum), + ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f +check_aggregate(f::typeof(minimum), v::AbstractVector{<:Union{Missing, Real}}) = + eltype(v) === Any ? f : Reduce(min) +check_aggregate(f::typeof(minimum∘skipmissing), + ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f +check_aggregate(f::typeof(minimum∘skipmissing), v::AbstractVector{<:Union{Missing, Real}}) = + eltype(v) === Any ? f : Reduce(min, !ismissing, nothing, true) +check_aggregate(f::typeof(mean), ::AbstractVector{<:Union{Missing, Number}}) = + Reduce(Base.add_sum, nothing, /) +check_aggregate(f::typeof(mean∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = + Reduce(Base.add_sum, !ismissing, /) + +# Other aggregate functions which are not strictly reductions +struct Aggregate{F, C} <: AbstractAggregate + f::F + condf::C +end +Aggregate(f) = Aggregate(f, nothing) + +check_aggregate(f::typeof(var), ::AbstractVector{<:Union{Missing, Number}}) = + Aggregate(var) +check_aggregate(f::typeof(var∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = + Aggregate(var, !ismissing) +check_aggregate(f::typeof(std), ::AbstractVector{<:Union{Missing, Number}}) = + Aggregate(std) +check_aggregate(f::typeof(std∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = + Aggregate(std, !ismissing) +check_aggregate(f::typeof(first), v::AbstractVector) = + eltype(v) === Any ? f : Aggregate(first) +check_aggregate(f::typeof(first), + ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f +check_aggregate(f::typeof(first∘skipmissing), v::AbstractVector) = + eltype(v) === Any ? f : Aggregate(first, !ismissing) +check_aggregate(f::typeof(first∘skipmissing), + ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f +check_aggregate(f::typeof(last), v::AbstractVector) = + eltype(v) === Any ? 
f : Aggregate(last) +check_aggregate(f::typeof(last), + ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f +check_aggregate(f::typeof(last∘skipmissing), v::AbstractVector) = + eltype(v) === Any ? f : Aggregate(last, !ismissing) +check_aggregate(f::typeof(last∘skipmissing), + ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f +check_aggregate(f::typeof(length), ::AbstractVector) = Aggregate(length) + +# SkipMissing does not support length + +# Use a strategy similar to reducedim_init from Base to get the vector of the right type +function groupreduce_init(op, condf, adjust, + incol::AbstractVector{U}, gd::GroupedDataFrame) where U + T = Base.promote_union(U) + + if op === Base.add_sum + initf = zero + elseif op === Base.mul_prod + initf = one + else + throw(ErrorException("Unrecognized op $op")) + end + + Tnm = nonmissingtype(T) + if isconcretetype(Tnm) && applicable(initf, Tnm) + tmpv = initf(Tnm) + initv = op(tmpv, tmpv) + if adjust isa Nothing + x = Tnm <: AbstractIrrational ? float(initv) : initv + else + x = adjust(initv, 1) + end + if condf === !ismissing + V = typeof(x) + else + V = U >: Missing ? 
Union{typeof(x), Missing} : typeof(x) + end + v = similar(incol, V, length(gd)) + fill!(v, x) + return v + else + # do not try to determine the narrowest possible type nor starting value + # as this is not possible to do correctly in general without processing + # groups; it will get fixed later in groupreduce!; later we + # will make use of the fact that this vector is filled with #undef + # while above the vector is filled with a concrete value + return Vector{Any}(undef, length(gd)) + end +end + +for (op, initf) in ((:max, :typemin), (:min, :typemax)) + @eval begin + function groupreduce_init(::typeof($op), condf, adjust, + incol::AbstractVector{T}, gd::GroupedDataFrame) where T + @assert isnothing(adjust) + S = nonmissingtype(T) + # !ismissing check is purely an optimization to avoid a copy later + outcol = similar(incol, condf === !ismissing ? S : T, length(gd)) + # Comparison is possible only between CatValues from the same pool + if incol isa CategoricalVector + U = Union{CategoricalArrays.leveltype(outcol), + eltype(outcol) >: Missing ? 
Missing : Union{}} + outcol = CategoricalArray{U, 1}(outcol.refs, incol.pool) + end + # It is safe to use a non-missing init value + # since missing will poison the result if present + # we assume here that groups are non-empty (current design assures this) + # + workaround for https://github.com/JuliaLang/julia/issues/36978 + if isconcretetype(S) && hasmethod($initf, Tuple{S}) && !(S <: Irrational) + fill!(outcol, $initf(S)) + else + fillfirst!(condf, outcol, incol, gd) + end + return outcol + end + end +end + +function copyto_widen!(res::AbstractVector{T}, x::AbstractVector) where T + @inbounds for i in eachindex(res, x) + val = x[i] + S = typeof(val) + if S <: T || promote_type(S, T) <: T + res[i] = val + else + newres = Tables.allocatecolumn(promote_type(S, T), length(x)) + return copyto_widen!(newres, x) + end + end + return res +end + +function groupreduce!(res::AbstractVector, f, op, condf, adjust, checkempty::Bool, + incol::AbstractVector, gd::GroupedDataFrame) + n = length(gd) + if adjust !== nothing || checkempty + counts = zeros(Int, n) + end + groups = gd.groups + @inbounds for i in eachindex(incol, groups) + gix = groups[i] + x = incol[i] + if gix > 0 && (condf === nothing || condf(x)) + # this check should be optimized out if U is not Any + if eltype(res) === Any && !isassigned(res, gix) + res[gix] = f(x, gix) + else + res[gix] = op(res[gix], f(x, gix)) + end + if adjust !== nothing || checkempty + counts[gix] += 1 + end + end + end + # handle the case of an uninitialized reduction + if eltype(res) === Any + if op === Base.add_sum + initf = zero + elseif op === Base.mul_prod + initf = one + else + initf = x -> throw(ErrorException("Unrecognized op $op")) + end + @inbounds for gix in eachindex(res) + if !isassigned(res, gix) + res[gix] = initf(nonmissingtype(eltype(incol))) + end + end + end + if adjust !== nothing + res .= adjust.(res, counts) + end + if checkempty && any(iszero, counts) + throw(ArgumentError("some groups contain only missing values")) 
+ end + # Undo pool sharing done by groupreduce_init + if res isa CategoricalVector && res.pool === incol.pool + V = Union{CategoricalArrays.leveltype(res), + eltype(res) >: Missing ? Missing : Union{}} + res = CategoricalArray{V, 1}(res.refs, copy(res.pool)) + end + if isconcretetype(eltype(res)) + return res + else + return copyto_widen!(Tables.allocatecolumn(typeof(first(res)), n), res) + end +end + +# function barrier works around type instability of groupreduce_init due to applicable +groupreduce(f, op, condf, adjust, checkempty::Bool, + incol::AbstractVector, gd::GroupedDataFrame) = + groupreduce!(groupreduce_init(op, condf, adjust, incol, gd), + f, op, condf, adjust, checkempty, incol, gd) +# Avoids the overhead due to Missing when computing reduction +groupreduce(f, op, condf::typeof(!ismissing), adjust, checkempty::Bool, + incol::AbstractVector, gd::GroupedDataFrame) = + groupreduce!(disallowmissing(groupreduce_init(op, condf, adjust, incol, gd)), + f, op, condf, adjust, checkempty, incol, gd) + +(r::Reduce)(incol::AbstractVector, gd::GroupedDataFrame) = + groupreduce((x, i) -> x, r.op, r.condf, r.adjust, r.checkempty, incol, gd) + +# this definition is missing in Julia 1.0 LTS and is required by aggregation for var +# TODO: remove this when we drop 1.0 support +if VERSION < v"1.1" + Base.zero(::Type{Missing}) = missing +end + +function (agg::Aggregate{typeof(var)})(incol::AbstractVector, gd::GroupedDataFrame) + means = groupreduce((x, i) -> x, Base.add_sum, agg.condf, /, false, incol, gd) + # !ismissing check is purely an optimization to avoid a copy later + if eltype(means) >: Missing && agg.condf !== !ismissing + T = Union{Missing, real(eltype(means))} + else + T = real(eltype(means)) + end + res = zeros(T, length(gd)) + return groupreduce!(res, (x, i) -> @inbounds(abs2(x - means[i])), +, agg.condf, + (x, l) -> l <= 1 ? 
oftype(x / (l-1), NaN) : x / (l-1), + false, incol, gd) +end + +function (agg::Aggregate{typeof(std)})(incol::AbstractVector, gd::GroupedDataFrame) + outcol = Aggregate(var, agg.condf)(incol, gd) + if eltype(outcol) <: Union{Missing, Rational} + return sqrt.(outcol) + else + return map!(sqrt, outcol, outcol) + end +end + +for f in (first, last) + function (agg::Aggregate{typeof(f)})(incol::AbstractVector, gd::GroupedDataFrame) + n = length(gd) + outcol = similar(incol, n) + fillfirst!(agg.condf, outcol, incol, gd, rev=agg.f === last) + if isconcretetype(eltype(outcol)) + return outcol + else + return copyto_widen!(Tables.allocatecolumn(typeof(first(outcol)), n), outcol) + end + end +end + +function (agg::Aggregate{typeof(length)})(incol::AbstractVector, gd::GroupedDataFrame) + if getfield(gd, :idx) === nothing + lens = zeros(Int, length(gd)) + @inbounds for gix in gd.groups + gix > 0 && (lens[gix] += 1) + end + return lens + else + return gd.ends .- gd.starts .+ 1 + end +end + +isagg((col, (fun, outcol))::Pair{<:ColumnIndex, <:Pair{<:Any, <:SymbolOrString}}, gdf::GroupedDataFrame) = + check_aggregate(fun, parent(gdf)[!, col]) isa AbstractAggregate +isagg(::Any, gdf::GroupedDataFrame) = false diff --git a/src/groupeddataframe/groupeddataframe.jl b/src/groupeddataframe/groupeddataframe.jl index 6b97bcae3d..b46338293e 100644 --- a/src/groupeddataframe/groupeddataframe.jl +++ b/src/groupeddataframe/groupeddataframe.jl @@ -33,6 +33,192 @@ mutable struct GroupedDataFrame{T<:AbstractDataFrame} # thread safe end +""" + groupby(d::AbstractDataFrame, cols; sort=false, skipmissing=false) + +Return a `GroupedDataFrame` representing a view of an `AbstractDataFrame` split +into row groups. + +# Arguments +- `df` : an `AbstractDataFrame` to split +- `cols` : data frame columns to group by. Can be any column selector + ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR). 
+- `sort` : whether to sort groups according to the values of the grouping columns + `cols`; if `sort=false` then the order of groups in the result is undefined + and may change in future releases. In the current implementation + groups are ordered following the order of appearance of values in the grouping + columns, except when all grouping columns provide non-`nothing` + `DataAPI.refpool` in which case the order of groups follows the order of + values returned by `DataAPI.refpool`. As a particular application of this rule + if all `cols` are `CategoricalVector`s then groups are always sorted + irrespective of the value of `sort`. +- `skipmissing` : whether to skip groups with `missing` values in one of the + grouping columns `cols` + +# Details +An iterator over a `GroupedDataFrame` returns a `SubDataFrame` view +for each grouping into `df`. +Within each group, the order of rows in `df` is preserved. + +`cols` can be any valid data frame indexing expression. +In particular if it is an empty vector then a single-group `GroupedDataFrame` +is created. + +A `GroupedDataFrame` also supports +indexing by groups, `map` (which applies a function to each group) +and `combine` (which applies a function to each group +and combines the result into a data frame). + +`GroupedDataFrame` also supports the dictionary interface. The keys are +[`GroupKey`](@ref) objects returned by [`keys(::GroupedDataFrame)`](@ref), +which can also be used to get the values of the grouping columns for each group. +`Tuples` and `NamedTuple`s containing the values of the grouping columns (in the +same order as the `cols` argument) are also accepted as indices. Finally, +an `AbstractDict` can be used to index into a grouped data frame where +the keys are column names of the data frame. The order of the keys does +not matter in this case. 
+ +# See also + +[`combine`](@ref), [`select`](@ref), [`select!`](@ref), [`transform`](@ref), [`transform!`](@ref) + +# Examples +```julia +julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]), + b = repeat([2, 1], outer=[4]), + c = 1:8); + +julia> gd = groupby(df, :a) +GroupedDataFrame with 4 groups based on key: a +First Group (2 rows): a = 1 +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 2 │ 1 │ +│ 2 │ 1 │ 2 │ 5 │ +⋮ +Last Group (2 rows): a = 4 +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 4 │ 1 │ 4 │ +│ 2 │ 4 │ 1 │ 8 │ + +julia> gd[1] +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 2 │ 1 │ +│ 2 │ 1 │ 2 │ 5 │ + +julia> last(gd) +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 4 │ 1 │ 4 │ +│ 2 │ 4 │ 1 │ 8 │ + +julia> gd[(a=3,)] +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 3 │ 2 │ 3 │ +│ 2 │ 3 │ 2 │ 7 │ + +julia> gd[Dict("a" => 3)] +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 3 │ 2 │ 3 │ +│ 2 │ 3 │ 2 │ 7 │ + +julia> gd[(3,)] +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 3 │ 2 │ 3 │ +│ 2 │ 3 │ 2 │ 7 │ + +julia> k = first(keys(gd)) +GroupKey: (a = 3) + +julia> gd[k] +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 3 │ 2 │ 3 │ +│ 2 │ 3 │ 2 │ 7 │ + +julia> for g in gd + println(g) + end +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 1 │ 2 │ 1 │ +│ 2 │ 1 │ 2 │ 5 │ +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 2 │ 1 │ 2 │ +│ 2 │ 2 │ 1 │ 6 │ +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ 
Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 3 │ 2 │ 3 │ +│ 2 │ 3 │ 2 │ 7 │ +2×3 SubDataFrame +│ Row │ a │ b │ c │ +│ │ Int64 │ Int64 │ Int64 │ +├─────┼───────┼───────┼───────┤ +│ 1 │ 4 │ 1 │ 4 │ +│ 2 │ 4 │ 1 │ 8 │ +``` +""" +function groupby(df::AbstractDataFrame, cols; + sort::Bool=false, skipmissing::Bool=false) + _check_consistency(df) + idxcols = index(df)[cols] + if isempty(idxcols) + return GroupedDataFrame(df, Symbol[], ones(Int, nrow(df)), + nothing, nothing, nothing, nrow(df) == 0 ? 0 : 1, + nothing, Threads.ReentrantLock()) + end + sdf = select(df, idxcols, copycols=false) + + groups = Vector{Int}(undef, nrow(df)) + ngroups, rhashes, gslots, sorted = + row_group_slots(ntuple(i -> sdf[!, i], ncol(sdf)), Val(false), + groups, skipmissing, sort) + + gd = GroupedDataFrame(df, copy(_names(sdf)), groups, nothing, nothing, nothing, ngroups, nothing, + Threads.ReentrantLock()) + + # sort groups if row_group_slots hasn't already done that + if sort && !sorted + # Find index of representative row for each group + idx = Vector{Int}(undef, length(gd)) + fillfirst!(nothing, idx, 1:nrow(parent(gd)), gd) + group_invperm = invperm(sortperm(view(parent(gd)[!, gd.cols], idx, :))) + groups = gd.groups + @inbounds for i in eachindex(groups) + gix = groups[i] + groups[i] = gix == 0 ? 
0 : group_invperm[gix] + end + end + + return gd +end + function genkeymap(gd, cols) # currently we use Dict{Any,Int} because then field :keymap in GroupedDataFrame # has a concrete type which makes the access to it faster as we do not have a dynamic diff --git a/src/groupeddataframe/splitapplycombine.jl b/src/groupeddataframe/splitapplycombine.jl index 5664e7449e..0b848b3d81 100644 --- a/src/groupeddataframe/splitapplycombine.jl +++ b/src/groupeddataframe/splitapplycombine.jl @@ -1,504 +1,32 @@ +# in this file we use cs and cs_i variable names that mean "target columns specification" + # this constant defines which types of values returned by aggregation function # in combine are considered to produce multiple columns in the resulting data frame const MULTI_COLS_TYPE = Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix} -""" - groupby(d::AbstractDataFrame, cols; sort=false, skipmissing=false) - -Return a `GroupedDataFrame` representing a view of an `AbstractDataFrame` split -into row groups. - -# Arguments -- `df` : an `AbstractDataFrame` to split -- `cols` : data frame columns to group by. Can be any column selector - ($COLUMNINDEX_STR; $MULTICOLUMNINDEX_STR). -- `sort` : whether to sort groups according to the values of the grouping columns - `cols`; if all `cols` are `CategoricalVector`s then groups are always sorted - irrespective of the value of `sort` -- `skipmissing` : whether to skip groups with `missing` values in one of the - grouping columns `cols` - -# Details -An iterator over a `GroupedDataFrame` returns a `SubDataFrame` view -for each grouping into `df`. -Within each group, the order of rows in `df` is preserved. - -`cols` can be any valid data frame indexing expression. -In particular if it is an empty vector then a single-group `GroupedDataFrame` -is created. 
- -A `GroupedDataFrame` also supports -indexing by groups, `map` (which applies a function to each group) -and `combine` (which applies a function to each group -and combines the result into a data frame). - -`GroupedDataFrame` also supports the dictionary interface. The keys are -[`GroupKey`](@ref) objects returned by [`keys(::GroupedDataFrame)`](@ref), -which can also be used to get the values of the grouping columns for each group. -`Tuples` and `NamedTuple`s containing the values of the grouping columns (in the -same order as the `cols` argument) are also accepted as indices. Finally, -an `AbstractDict` can be used to index into a grouped data frame where -the keys are column names of the data frame. The order of the keys does -not matter in this case. - -# See also - -[`combine`](@ref), [`select`](@ref), [`select!`](@ref), [`transform`](@ref), [`transform!`](@ref) - -# Examples -```julia -julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]), - b = repeat([2, 1], outer=[4]), - c = 1:8); - -julia> gd = groupby(df, :a) -GroupedDataFrame with 4 groups based on key: a -First Group (2 rows): a = 1 -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 2 │ 1 │ -│ 2 │ 1 │ 2 │ 5 │ -⋮ -Last Group (2 rows): a = 4 -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 4 │ 1 │ 4 │ -│ 2 │ 4 │ 1 │ 8 │ - -julia> gd[1] -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 2 │ 1 │ -│ 2 │ 1 │ 2 │ 5 │ - -julia> last(gd) -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 4 │ 1 │ 4 │ -│ 2 │ 4 │ 1 │ 8 │ - -julia> gd[(a=3,)] -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 3 │ 2 │ 3 │ -│ 2 │ 3 │ 2 │ 7 │ - -julia> gd[Dict("a" => 3)] -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 
│ 3 │ 2 │ 3 │ -│ 2 │ 3 │ 2 │ 7 │ - -julia> gd[(3,)] -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 3 │ 2 │ 3 │ -│ 2 │ 3 │ 2 │ 7 │ - -julia> k = first(keys(gd)) -GroupKey: (a = 3) - -julia> gd[k] -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 3 │ 2 │ 3 │ -│ 2 │ 3 │ 2 │ 7 │ - -julia> for g in gd - println(g) - end -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 2 │ 1 │ -│ 2 │ 1 │ 2 │ 5 │ -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 2 │ 1 │ 2 │ -│ 2 │ 2 │ 1 │ 6 │ -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 3 │ 2 │ 3 │ -│ 2 │ 3 │ 2 │ 7 │ -2×3 SubDataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 4 │ 1 │ 4 │ -│ 2 │ 4 │ 1 │ 8 │ -``` -""" -function groupby(df::AbstractDataFrame, cols; - sort::Bool=false, skipmissing::Bool=false) - _check_consistency(df) - idxcols = index(df)[cols] - if isempty(idxcols) - return GroupedDataFrame(df, Symbol[], ones(Int, nrow(df)), - nothing, nothing, nothing, nrow(df) == 0 ? 
0 : 1, - nothing, Threads.ReentrantLock()) - end - sdf = select(df, idxcols, copycols=false) - - groups = Vector{Int}(undef, nrow(df)) - ngroups, rhashes, gslots, sorted = - row_group_slots(ntuple(i -> sdf[!, i], ncol(sdf)), Val(false), - groups, skipmissing, sort) - - gd = GroupedDataFrame(df, copy(_names(sdf)), groups, nothing, nothing, nothing, ngroups, nothing, - Threads.ReentrantLock()) - - # sort groups if row_group_slots hasn't already done that - if sort && !sorted - # Find index of representative row for each group - idx = Vector{Int}(undef, length(gd)) - fillfirst!(nothing, idx, 1:nrow(parent(gd)), gd) - group_invperm = invperm(sortperm(view(parent(gd)[!, gd.cols], idx, :))) - groups = gd.groups - @inbounds for i in eachindex(groups) - gix = groups[i] - groups[i] = gix == 0 ? 0 : group_invperm[gix] - end - end - - return gd -end - -const F_TYPE_RULES = - """ - `fun` can return a single value, a row, a vector, or multiple rows. - The type of the returned value determines the shape of the resulting `DataFrame`. - There are four kind of return values allowed: - - A single value gives a `DataFrame` with a single additional column and one row - per group. - - A named tuple of single values or a [`DataFrameRow`](@ref) gives a `DataFrame` - with one additional column for each field and one row per group (returning a - named tuple will be faster). It is not allowed to mix single values and vectors - if a named tuple is returned. - - A vector gives a `DataFrame` with a single additional column and as many rows - for each group as the length of the returned vector for that group. - - A data frame, a named tuple of vectors or a matrix gives a `DataFrame` with - the same additional columns and as many rows for each group as the rows - returned for that group (returning a named tuple is the fastest option). - Returning a table with zero columns is allowed, whatever the number of columns - returned for other groups. 
- - `fun` must always return the same kind of object (out of four - kinds defined above) for all groups, and with the same column names. - - Optimized methods are used when standard summary functions (`sum`, `prod`, - `minimum`, `maximum`, `mean`, `var`, `std`, `first`, `last` and `length`) - are specified using the `Pair` syntax (e.g. `:col => sum`). - When computing the `sum` or `mean` over floating point columns, results will be - less accurate than the standard `sum` function (which uses pairwise - summation). Use `col => x -> sum(x)` to avoid the optimized method and use the - slower, more accurate one. - - Column names are automatically generated when necessary using the rules defined - in [`select`](@ref) if the `Pair` syntax is used and `fun` returns a single - value or a vector (e.g. for `:col => sum` the column name is `col_sum`); otherwise - (if `fun` is a function or a return value is an `AbstractMatrix`) columns are - named `x1`, `x2` and so on. - """ - -const F_ARGUMENT_RULES = - """ - - Arguments passed as `args...` can be: - - * Any index that is allowed for column indexing ($COLUMNINDEX_STR, $MULTICOLUMNINDEX_STR). - * Column transformation operations using the `Pair` notation that is described below - and vectors of such pairs. - - Transformations allowed using `Pair`s follow the rules specified for - [`select`](@ref) and have the form `source_cols => fun`, `source_cols => fun - => target_col`, or `source_col => target_col`. Function `fun` is passed - `SubArray` views as positional arguments for each column specified to be - selected, or a `NamedTuple` containing these `SubArray`s if `source_cols` is - an `AsTable` selector. It can return a vector or a single value (defined - precisely below). If automatic generation of target column - name is required it respects the `renamecols` keyword argument following the - rules described in [`select`](@ref). 
- - As a special case `nrow` or `nrow => target_col` can be passed without specifying - input columns to efficiently calculate number of rows in each group. - If `nrow` is passed the resulting column name is `:nrow`. - - If multiple `args` are passed then return values of different `fun`s are allowed - to mix single values and vectors. In this case single values will be - broadcasted to match the length of columns specified by returned vectors. - As a particular rule, values wrapped in a `Ref` or a `0`-dimensional `AbstractArray` - are unwrapped and then broadcasted. - - If the first or last argument is `pair` then it must be a `Pair` following the - rules for pairs described above, except that in this case function defined - by `fun` can return any return value defined below. - - If the first or last argument is a function `fun`, it is passed a [`SubDataFrame`](@ref) - view for each group and can return any return value defined below. - Note that this form is slower than `pair` or `args` due to type instability. - - If `gd` has zero groups then no transformations are applied. - """ - -const KWARG_PROCESSING_RULES = - """ - If `keepkeys=true`, the resulting `DataFrame` contains all the grouping columns - in addition to those generated. In this case if the returned - value contains columns with the same names as the grouping columns, they are - required to be equal. - If `keepkeys=false` and some generated columns have the same name as grouping columns, - they are kept and are not required to be equal to grouping columns. - - If `ungroup=true` (the default) a `DataFrame` is returned. - If `ungroup=false` a `GroupedDataFrame` grouped using `keycols(gdf)` is returned. - - If `gd` has zero groups then transformations are applied to vectors of zero length. 
- """ - -""" - combine(gd::GroupedDataFrame, args...; keepkeys::Bool=true, ungroup::Bool=true, - renamecols::Bool=true) - combine(fun::Union{Function, Type}, gd::GroupedDataFrame; - keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) - combine(pair::Pair, gd::GroupedDataFrame; keepkeys::Bool=true, ungroup::Bool=true, - renamecols::Bool=true) - -Apply operations to each group in a [`GroupedDataFrame`](@ref) and return the combined -result as a `DataFrame` if `ungroup=true` or `GroupedDataFrame` if `ungroup=false`. - -If an `AbstractDataFrame` is passed, apply operations to the data frame as a whole -and a `DataFrame` is always returend. - -$F_ARGUMENT_RULES - -$F_TYPE_RULES - -$KWARG_PROCESSING_RULES - -Ordering of rows follows the order of groups in `gdf`. - -# See also - -[`groupby`](@ref), [`select`](@ref), [`select!`](@ref), [`transform`](@ref), [`transform!`](@ref) - -# Examples -```jldoctest -julia> df = DataFrame(a = repeat([1, 2, 3, 4], outer=[2]), - b = repeat([2, 1], outer=[4]), - c = 1:8); - -julia> gd = groupby(df, :a); - -julia> combine(gd, :c => sum, nrow) -4×3 DataFrame -│ Row │ a │ c_sum │ nrow │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 6 │ 2 │ -│ 2 │ 2 │ 8 │ 2 │ -│ 3 │ 3 │ 10 │ 2 │ -│ 4 │ 4 │ 12 │ 2 │ - -julia> combine(gd, :c => sum, nrow, ungroup=false) -GroupedDataFrame with 4 groups based on key: a -First Group (1 row): a = 1 -│ Row │ a │ c_sum │ nrow │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 6 │ 2 │ -⋮ -Last Group (1 row): a = 4 -│ Row │ a │ c_sum │ nrow │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 4 │ 12 │ 2 │ - -julia> combine(sdf -> sum(sdf.c), gd) # Slower variant -4×2 DataFrame -│ Row │ a │ x1 │ -│ │ Int64 │ Int64 │ -├─────┼───────┼───────┤ -│ 1 │ 1 │ 6 │ -│ 2 │ 2 │ 8 │ -│ 3 │ 3 │ 10 │ -│ 4 │ 4 │ 12 │ - -julia> combine(gdf) do d # do syntax for the slower variant - sum(d.c) - end -4×2 DataFrame -│ Row │ a │ x1 │ -│ │ Int64 │ Int64 │ 
-├─────┼───────┼───────┤ -│ 1 │ 1 │ 6 │ -│ 2 │ 2 │ 8 │ -│ 3 │ 3 │ 10 │ -│ 4 │ 4 │ 12 │ - -julia> combine(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column -4×2 DataFrame -│ Row │ a │ sum_log_c │ -│ │ Int64 │ Float64 │ -├─────┼───────┼───────────┤ -│ 1 │ 1 │ 1.60944 │ -│ 2 │ 2 │ 2.48491 │ -│ 3 │ 3 │ 3.04452 │ -│ 4 │ 4 │ 3.46574 │ - - -julia> combine(gd, [:b, :c] .=> sum) # passing a vector of pairs -4×3 DataFrame -│ Row │ a │ b_sum │ c_sum │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 4 │ 6 │ -│ 2 │ 2 │ 2 │ 8 │ -│ 3 │ 3 │ 4 │ 10 │ -│ 4 │ 4 │ 2 │ 12 │ - -julia> combine(gd) do sdf # dropping group when DataFrame() is returned - sdf.c[1] != 1 ? sdf : DataFrame() - end -6×3 DataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 2 │ 1 │ 2 │ -│ 2 │ 2 │ 1 │ 6 │ -│ 3 │ 3 │ 2 │ 3 │ -│ 4 │ 3 │ 2 │ 7 │ -│ 5 │ 4 │ 1 │ 4 │ -│ 6 │ 4 │ 1 │ 8 │ - -julia> combine(gd, :b => :b1, :c => :c1, - [:b, :c] => +, keepkeys=false) # auto-splatting, renaming and keepkeys -8×3 DataFrame -│ Row │ b1 │ c1 │ b_c_+ │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 2 │ 1 │ 3 │ -│ 2 │ 2 │ 5 │ 7 │ -│ 3 │ 1 │ 2 │ 3 │ -│ 4 │ 1 │ 6 │ 7 │ -│ 5 │ 2 │ 3 │ 5 │ -│ 6 │ 2 │ 7 │ 9 │ -│ 7 │ 1 │ 4 │ 5 │ -│ 8 │ 1 │ 8 │ 9 │ - -julia> combine(gd, :b, :c => sum) # passing columns and broadcasting -8×3 DataFrame -│ Row │ a │ b │ c_sum │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 2 │ 6 │ -│ 2 │ 1 │ 2 │ 6 │ -│ 3 │ 2 │ 1 │ 8 │ -│ 4 │ 2 │ 1 │ 8 │ -│ 5 │ 3 │ 2 │ 10 │ -│ 6 │ 3 │ 2 │ 10 │ -│ 7 │ 4 │ 1 │ 12 │ -│ 8 │ 4 │ 1 │ 12 │ - -julia> combine(gd, [:b, :c] .=> Ref) -4×3 DataFrame -│ Row │ a │ b_Ref │ c_Ref │ -│ │ Int64 │ SubArra… │ SubArra… │ -├─────┼───────┼──────────┼──────────┤ -│ 1 │ 1 │ [2, 2] │ [1, 5] │ -│ 2 │ 2 │ [1, 1] │ [2, 6] │ -│ 3 │ 3 │ [2, 2] │ [3, 7] │ -│ 4 │ 4 │ [1, 1] │ [4, 8] │ - -julia> combine(gd, AsTable(:) => Ref) -4×2 DataFrame -│ Row │ a │ 
a_b_c_Ref │ -│ │ Int64 │ NamedTuple… │ -├─────┼───────┼──────────────────────────────────────┤ -│ 1 │ 1 │ (a = [1, 1], b = [2, 2], c = [1, 5]) │ -│ 2 │ 2 │ (a = [2, 2], b = [1, 1], c = [2, 6]) │ -│ 3 │ 3 │ (a = [3, 3], b = [2, 2], c = [3, 7]) │ -│ 4 │ 4 │ (a = [4, 4], b = [1, 1], c = [4, 8]) │ - -julia> combine(gd, :, AsTable(Not(:a)) => sum, renamecols=false) -8×4 DataFrame -│ Row │ a │ b │ c │ b_c │ -│ │ Int64 │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 2 │ 1 │ 3 │ -│ 2 │ 1 │ 2 │ 5 │ 7 │ -│ 3 │ 2 │ 1 │ 2 │ 3 │ -│ 4 │ 2 │ 1 │ 6 │ 7 │ -│ 5 │ 3 │ 2 │ 3 │ 5 │ -│ 6 │ 3 │ 2 │ 7 │ 9 │ -│ 7 │ 4 │ 1 │ 4 │ 5 │ -│ 8 │ 4 │ 1 │ 8 │ 9 │ -``` -""" -function combine(f::Base.Callable, gd::GroupedDataFrame; - keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) - return combine_helper(f, gd, keepkeys=keepkeys, ungroup=ungroup, - copycols=true, keeprows=false, renamecols=renamecols) -end - -combine(f::typeof(nrow), gd::GroupedDataFrame; - keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) = - combine(gd, [nrow => :nrow], keepkeys=keepkeys, ungroup=ungroup, - renamecols=renamecols) - -function combine(p::Pair, gd::GroupedDataFrame; - keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) - # move handling of aggregate to specialized combine - p_from, p_to = p - - # verify if it is not better to use a fast path, which we achieve - # by moving to combine(::GroupedDataFrame, ::AbstractVector) method - # note that even if length(gd) == 0 we can do this step - if isagg(p_from => (p_to isa Pair ? 
first(p_to) : p_to), gd) || p_from === nrow - return combine(gd, [p], keepkeys=keepkeys, ungroup=ungroup, renamecols=renamecols) - end - - if p_from isa Tuple - cs = collect(p_from) - # an explicit error is thrown as this was allowed in the past - throw(ArgumentError("passing a Tuple $p_from as column selector is not supported" * - ", use a vector $cs instead")) - else - cs = p_from +function gen_groups(idx::Vector{Int}) + groups = zeros(Int, length(idx)) + groups[1] = 1 + j = 1 + last_idx = idx[1] + @inbounds for i in 2:length(idx) + cur_idx = idx[i] + j += cur_idx != last_idx + last_idx = cur_idx + groups[i] = j end - return combine_helper(cs => p_to, gd, keepkeys=keepkeys, ungroup=ungroup, - copycols=true, keeprows=false, renamecols=renamecols) + return groups end -combine(gd::GroupedDataFrame, - cs::Union{Pair, typeof(nrow), ColumnIndex, MultiColumnIndex}...; - keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) = - _combine_prepare(gd, cs..., keepkeys=keepkeys, ungroup=ungroup, - copycols=true, keeprows=false, renamecols=renamecols) - function _combine_prepare(gd::GroupedDataFrame, - @nospecialize(cs::Union{Pair, typeof(nrow), + @nospecialize(cs::Union{Pair, Base.Callable, ColumnIndex, MultiColumnIndex}...); keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool) + if !ungroup && !keepkeys + throw(ArgumentError("keepkeys=false when ungroup=false is not allowed")) + end + cs_vec = [] for p in cs if p === nrow @@ -514,91 +42,33 @@ function _combine_prepare(gd::GroupedDataFrame, # an explicit error is thrown as this was allowed in the past throw(ArgumentError("passing a Tuple $(first(x)) as column selector is not supported" * ", use a vector $(collect(first(x))) instead")) - for (i, v) in enumerate(cs_vec) - if first(v) isa Tuple - cs_vec[i] = collect(first(v)) => last(v) - end - end end - cs_norm_pre = [normalize_selection(index(parent(gd)), c, renamecols) for c in cs_vec] - seen_cols = Set{Symbol}() - process_vectors = false 
- for v in cs_norm_pre - if v isa Pair - out_col = last(last(v)) - if out_col in seen_cols - throw(ArgumentError("Duplicate output column name $out_col requested")) + + cs_norm = [] + optional_transform = Bool[] + for c in cs_vec + arg = normalize_selection(index(parent(gd)), c, renamecols) + if arg isa AbstractVector{Int} + for col_idx in arg + push!(cs_norm, col_idx => identity => _names(gd)[col_idx]) + push!(optional_transform, true) end - push!(seen_cols, out_col) else - @assert v isa AbstractVector{Int} - process_vectors = true - end - end - processed_cols = Set{Symbol}() - if process_vectors - cs_norm = Pair[] - for (i, v) in enumerate(cs_norm_pre) - if v isa Pair - push!(cs_norm, v) - push!(processed_cols, last(last(v))) - else - @assert v isa AbstractVector{Int} - for col_idx in v - col_name = _names(gd)[col_idx] - if !(col_name in processed_cols) - push!(processed_cols, col_name) - if col_name in seen_cols - trans_idx = findfirst(cs_norm_pre) do p - p isa Pair || return false - last(last(p)) == col_name - end - @assert !isnothing(trans_idx) && trans_idx > i - push!(cs_norm, cs_norm_pre[trans_idx]) - # it is safe to delete from cs_norm_pre - # as we have not reached trans_idx index yet - deleteat!(cs_norm_pre, trans_idx) - else - push!(cs_norm, col_idx => identity => col_name) - end - end - end - end + push!(cs_norm, arg) + push!(optional_transform, false) end - else - cs_norm = collect(Pair, cs_norm_pre) end - f = Pair[first(x) => first(last(x)) for x in cs_norm] - nms = Symbol[last(last(x)) for x in cs_norm] - return combine_helper(f, gd, nms, keepkeys=keepkeys, ungroup=ungroup, - copycols=copycols, keeprows=keeprows, renamecols=renamecols) -end -function gen_groups(idx::Vector{Int}) - groups = zeros(Int, length(idx)) - groups[1] = 1 - j = 1 - last_idx = idx[1] - @inbounds for i in 2:length(idx) - cur_idx = idx[i] - j += cur_idx != last_idx - last_idx = cur_idx - groups[i] = j - end - return groups -end + # cs_norm holds now either src => fun => dst or 
just fun + # if optional_transform[i] is true then the transformation will be skipped + # if a column with the same name was created earlier + + idx, valscat = _combine(gd, cs_norm, optional_transform, copycols, keeprows, renamecols) -function combine_helper(f, gd::GroupedDataFrame, - nms::Union{AbstractVector{Symbol},Nothing}=nothing; - keepkeys::Bool, ungroup::Bool, - copycols::Bool, keeprows::Bool, renamecols::Bool) - if !ungroup && !keepkeys - throw(ArgumentError("keepkeys=false when ungroup=false is not allowed")) - end - idx, valscat = _combine(f, gd, nms, copycols, keeprows, renamecols) !keepkeys && ungroup && return valscat - keys = groupcols(gd) - for key in keys + + gd_keys = groupcols(gd) + for key in gd_keys if hasproperty(valscat, key) if (keeprows && !isequal(valscat[!, key], parent(gd)[!, key])) || (!keeprows && !isequal(valscat[!, key], view(parent(gd)[!, key], idx))) @@ -612,17 +82,17 @@ function combine_helper(f, gd::GroupedDataFrame, else newparent = length(gd) > 0 ? parent(gd)[idx, gd.cols] : parent(gd)[1:0, gd.cols] end - added_cols = select(valscat, Not(intersect(keys, _names(valscat))), copycols=false) + added_cols = select(valscat, Not(intersect(gd_keys, _names(valscat))), copycols=false) hcat!(newparent, length(gd) > 0 ? 
added_cols : similar(added_cols, 0), copycols=false) ungroup && return newparent - if length(idx) == 0 && !(keeprows && length(keys) > 0) + if length(idx) == 0 && !(keeprows && length(gd_keys) > 0) @assert nrow(newparent) == 0 return GroupedDataFrame(newparent, copy(gd.cols), Int[], Int[], Int[], Int[], 0, Dict{Any,Int}(), Threads.ReentrantLock()) elseif keeprows - @assert length(keys) > 0 || idx == gd.idx + @assert length(gd_keys) > 0 || idx == gd.idx # in this case we are sure that the result GroupedDataFrame has the # same structure as the source except that grouping columns are at the start return Threads.lock(gd.lazy_lock) do @@ -640,220 +110,6 @@ function combine_helper(f, gd::GroupedDataFrame, end end -# Wrapping automatically adds column names when the value returned -# by the user-provided function lacks them -wrap(x::Union{AbstractDataFrame, NamedTuple, DataFrameRow}) = x -wrap(x::AbstractMatrix) = - NamedTuple{Tuple(gennames(size(x, 2)))}(Tuple(view(x, :, i) for i in 1:size(x, 2))) -wrap(x::Any) = (x1=x,) - -const ERROR_ROW_COUNT = "return value must not change its kind " * - "(single row or variable number of rows) across groups" - -const ERROR_COL_COUNT = "function must return only single-column values, " * - "or only multiple-column values" - -wrap_table(x::Any, ::Val) = - throw(ArgumentError(ERROR_ROW_COUNT)) -function wrap_table(x::Union{NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}, - AbstractDataFrame, AbstractMatrix}, - ::Val{firstmulticol}) where firstmulticol - if !firstmulticol - throw(ArgumentError(ERROR_COL_COUNT)) - end - return wrap(x) -end - -function wrap_table(x::AbstractVector, ::Val{firstmulticol}) where firstmulticol - if firstmulticol - throw(ArgumentError(ERROR_COL_COUNT)) - end - return wrap(x) -end - -function wrap_row(x::Any, ::Val{firstmulticol}) where firstmulticol - # NamedTuple is not possible in this branch - if (x isa DataFrameRow) ⊻ firstmulticol - throw(ArgumentError(ERROR_COL_COUNT)) - end - return wrap(x) -end - 
-function wrap_row(x::Union{AbstractArray{<:Any, 0}, Ref}, - ::Val{firstmulticol}) where firstmulticol - if firstmulticol - throw(ArgumentError(ERROR_COL_COUNT)) - end - return (x1 = x[],) -end - -# note that also NamedTuple() is correctly captured by this definition -# as it is more specific than the one below -wrap_row(::Union{AbstractVecOrMat, AbstractDataFrame, - NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, ::Val) = - throw(ArgumentError(ERROR_ROW_COUNT)) - -function wrap_row(x::NamedTuple, ::Val{firstmulticol}) where firstmulticol - if any(v -> v isa AbstractVector, x) - throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed")) - end - if !firstmulticol - throw(ArgumentError(ERROR_COL_COUNT)) - end - return x -end - -# idx, starts and ends are passed separately to avoid cost of field access in tight loop -# Manual unrolling of Tuple is used as it turned out more efficient than @generated -# for small number of columns passed. -# For more than 4 columns `map` is slower than @generated -# but this case is probably rare and if huge number of columns is passed @generated -# has very high compilation cost -function do_call(f::Any, idx::AbstractVector{<:Integer}, - starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, - gd::GroupedDataFrame, incols::Tuple{}, i::Integer) - if f isa ByRow - return [f.fun() for _ in 1:(ends[i] - starts[i] + 1)] - else - return f() - end -end - -function do_call(f::Any, idx::AbstractVector{<:Integer}, - starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, - gd::GroupedDataFrame, incols::Tuple{AbstractVector}, i::Integer) - idx = idx[starts[i]:ends[i]] - return f(view(incols[1], idx)) -end - -function do_call(f::Any, idx::AbstractVector{<:Integer}, - starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, - gd::GroupedDataFrame, incols::NTuple{2, AbstractVector}, i::Integer) - idx = idx[starts[i]:ends[i]] - return f(view(incols[1], idx), view(incols[2], idx)) 
-end - -function do_call(f::Any, idx::AbstractVector{<:Integer}, - starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, - gd::GroupedDataFrame, incols::NTuple{3, AbstractVector}, i::Integer) - idx = idx[starts[i]:ends[i]] - return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx)) -end - -function do_call(f::Any, idx::AbstractVector{<:Integer}, - starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, - gd::GroupedDataFrame, incols::NTuple{4, AbstractVector}, i::Integer) - idx = idx[starts[i]:ends[i]] - return f(view(incols[1], idx), view(incols[2], idx), view(incols[3], idx), - view(incols[4], idx)) -end - -function do_call(f::Any, idx::AbstractVector{<:Integer}, - starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, - gd::GroupedDataFrame, incols::Tuple, i::Integer) - idx = idx[starts[i]:ends[i]] - return f(map(c -> view(c, idx), incols)...) -end - -function do_call(f::Any, idx::AbstractVector{<:Integer}, - starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, - gd::GroupedDataFrame, incols::NamedTuple, i::Integer) - if f isa ByRow && isempty(incols) - return [f.fun(NamedTuple()) for _ in 1:(ends[i] - starts[i] + 1)] - else - idx = idx[starts[i]:ends[i]] - return f(map(c -> view(c, idx), incols)) - end -end - -function do_call(f::Any, idx::AbstractVector{<:Integer}, - starts::AbstractVector{<:Integer}, ends::AbstractVector{<:Integer}, - gd::GroupedDataFrame, incols::Nothing, i::Integer) - idx = idx[starts[i]:ends[i]] - return f(view(parent(gd), idx, :)) -end - -_nrow(df::AbstractDataFrame) = nrow(df) -_nrow(x::NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}) = - isempty(x) ? 
0 : length(x[1]) -_ncol(df::AbstractDataFrame) = ncol(df) -_ncol(x::Union{NamedTuple, DataFrameRow}) = length(x) - -abstract type AbstractAggregate end - -struct Reduce{O, C, A} <: AbstractAggregate - op::O - condf::C - adjust::A - checkempty::Bool -end -Reduce(f, condf=nothing, adjust=nothing) = Reduce(f, condf, adjust, false) - -check_aggregate(f::Any, ::AbstractVector) = f -check_aggregate(f::typeof(sum), ::AbstractVector{<:Union{Missing, Number}}) = - Reduce(Base.add_sum) -check_aggregate(f::typeof(sum∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = - Reduce(Base.add_sum, !ismissing) -check_aggregate(f::typeof(prod), ::AbstractVector{<:Union{Missing, Number}}) = - Reduce(Base.mul_prod) -check_aggregate(f::typeof(prod∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = - Reduce(Base.mul_prod, !ismissing) -check_aggregate(f::typeof(maximum), - ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f -check_aggregate(f::typeof(maximum), v::AbstractVector{<:Union{Missing, Real}}) = - eltype(v) === Any ? f : Reduce(max) -check_aggregate(f::typeof(maximum∘skipmissing), - ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f -check_aggregate(f::typeof(maximum∘skipmissing), v::AbstractVector{<:Union{Missing, Real}}) = - eltype(v) === Any ? f : Reduce(max, !ismissing, nothing, true) -check_aggregate(f::typeof(minimum), - ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f -check_aggregate(f::typeof(minimum), v::AbstractVector{<:Union{Missing, Real}}) = - eltype(v) === Any ? f : Reduce(min) -check_aggregate(f::typeof(minimum∘skipmissing), - ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f -check_aggregate(f::typeof(minimum∘skipmissing), v::AbstractVector{<:Union{Missing, Real}}) = - eltype(v) === Any ? 
f : Reduce(min, !ismissing, nothing, true) -check_aggregate(f::typeof(mean), ::AbstractVector{<:Union{Missing, Number}}) = - Reduce(Base.add_sum, nothing, /) -check_aggregate(f::typeof(mean∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = - Reduce(Base.add_sum, !ismissing, /) - -# Other aggregate functions which are not strictly reductions -struct Aggregate{F, C} <: AbstractAggregate - f::F - condf::C -end -Aggregate(f) = Aggregate(f, nothing) - -check_aggregate(f::typeof(var), ::AbstractVector{<:Union{Missing, Number}}) = - Aggregate(var) -check_aggregate(f::typeof(var∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = - Aggregate(var, !ismissing) -check_aggregate(f::typeof(std), ::AbstractVector{<:Union{Missing, Number}}) = - Aggregate(std) -check_aggregate(f::typeof(std∘skipmissing), ::AbstractVector{<:Union{Missing, Number}}) = - Aggregate(std, !ismissing) -check_aggregate(f::typeof(first), v::AbstractVector) = - eltype(v) === Any ? f : Aggregate(first) -check_aggregate(f::typeof(first), - ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f -check_aggregate(f::typeof(first∘skipmissing), v::AbstractVector) = - eltype(v) === Any ? f : Aggregate(first, !ismissing) -check_aggregate(f::typeof(first∘skipmissing), - ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f -check_aggregate(f::typeof(last), v::AbstractVector) = - eltype(v) === Any ? f : Aggregate(last) -check_aggregate(f::typeof(last), - ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f -check_aggregate(f::typeof(last∘skipmissing), v::AbstractVector) = - eltype(v) === Any ? 
f : Aggregate(last, !ismissing) -check_aggregate(f::typeof(last∘skipmissing), - ::AbstractVector{<:Union{Missing, MULTI_COLS_TYPE, AbstractVector}}) = f -check_aggregate(f::typeof(length), ::AbstractVector) = Aggregate(length) - -# SkipMissing does not support length - # Find first value matching condition for each group # Optimized for situations where a matching value is typically encountered # among the first rows for each group @@ -911,226 +167,303 @@ function fillfirst!(condf, outcol::AbstractVector, incol::AbstractVector, outcol end -# Use a strategy similar to reducedim_init from Base to get the vector of the right type -function groupreduce_init(op, condf, adjust, - incol::AbstractVector{U}, gd::GroupedDataFrame) where U - T = Base.promote_union(U) - - if op === Base.add_sum - initf = zero - elseif op === Base.mul_prod - initf = one - else - throw(ErrorException("Unrecognized op $op")) +function _agg2idx_map_helper(idx::AbstractVector, idx_agg::AbstractVector) + agg2idx_map = fill(-1, length(idx)) + aggj = 1 + @inbounds for (j, idxj) in enumerate(idx) + while idx_agg[aggj] != idxj + aggj += 1 + @assert aggj <= length(idx_agg) + end + agg2idx_map[j] = aggj end + return agg2idx_map +end - Tnm = nonmissingtype(T) - if isconcretetype(Tnm) && applicable(initf, Tnm) - tmpv = initf(Tnm) - initv = op(tmpv, tmpv) - if adjust isa Nothing - x = Tnm <: AbstractIrrational ? 
float(initv) : initv - else - x = adjust(initv, 1) - end - if condf === !ismissing - V = typeof(x) +struct TransformationResult + col_idx::Vector{Int} # index for a column + col::AbstractVector # computed value of a column + name::Symbol # name of a column + optional::Bool # whether a column is allowed to be replaced in the future +end + +# the transformation is an aggregation for which we have the fast path +function _combine_process_agg(@nospecialize(cs_i::Pair{Int, <:Pair{<:Function, Symbol}}), + optional_i::Bool, + parentdf::AbstractDataFrame, + gd::GroupedDataFrame, + seen_cols::Dict{Symbol, Tuple{Bool, Int}}, + trans_res::Vector{TransformationResult}, + idx_agg::Union{Nothing, AbstractVector{Int}}) + @assert isagg(cs_i, gd) + @assert !optional_i + out_col_name = last(last(cs_i)) + incol = parentdf[!, first(cs_i)] + agg = check_aggregate(first(last(cs_i)), incol) + outcol = agg(incol, gd) + + if haskey(seen_cols, out_col_name) + optional, loc = seen_cols[out_col_name] + # we have seen this col but it is not allowed to replace it + optional || throw(ArgumentError("duplicate output column name: :$out_col_name")) + @assert trans_res[loc].optional && trans_res[loc].name == out_col_name + trans_res[loc] = TransformationResult(idx_agg, outcol, out_col_name, optional_i) + seen_cols[out_col_name] = (optional_i, loc) + else + push!(trans_res, TransformationResult(idx_agg, outcol, out_col_name, optional_i)) + seen_cols[out_col_name] = (optional_i, length(trans_res)) + end +end + +# move one column without transforming it +function _combine_process_noop(cs_i::Pair{<:Union{Int, AbstractVector{Int}}, Pair{typeof(identity), Symbol}}, + optional_i::Bool, + parentdf::AbstractDataFrame, + seen_cols::Dict{Symbol, Tuple{Bool, Int}}, + trans_res::Vector{TransformationResult}, + idx_keeprows::AbstractVector{Int}, + copycols::Bool) + source_cols = first(cs_i) + out_col_name = last(last(cs_i)) + if length(source_cols) != 1 + throw(ArgumentError("Exactly one column can be transformed 
to one output column" * + " when using identity transformation")) + end + outcol = parentdf[!, first(source_cols)] + + if haskey(seen_cols, out_col_name) + optional, loc = seen_cols[out_col_name] + @assert trans_res[loc].name == out_col_name + if optional + if !optional_i + @assert trans_res[loc].optional + trans_res[loc] = TransformationResult(idx_keeprows, copycols ? copy(outcol) : outcol, + out_col_name, optional_i) + seen_cols[out_col_name] = (optional_i, loc) + end else - V = U >: Missing ? Union{typeof(x), Missing} : typeof(x) + # if optional_i is true, then we ignore processing this column + optional_i || throw(ArgumentError("duplicate output column name: :$out_col_name")) end - v = similar(incol, V, length(gd)) - fill!(v, x) - return v else - # do not try to determine the narrowest possible type nor starting value - # as this is not possible to do correctly in general without processing - # groups; it will get fixed later in groupreduce!; later we - # will make use of the fact that this vector is filled with #undef - # while above the vector is filled with a concrete value - return Vector{Any}(undef, length(gd)) + push!(trans_res, TransformationResult(idx_keeprows, copycols ? copy(outcol) : outcol, + out_col_name, optional_i)) + seen_cols[out_col_name] = (optional_i, length(trans_res)) end end -for (op, initf) in ((:max, :typemin), (:min, :typemax)) - @eval begin - function groupreduce_init(::typeof($op), condf, adjust, - incol::AbstractVector{T}, gd::GroupedDataFrame) where T - @assert isnothing(adjust) - S = nonmissingtype(T) - # !ismissing check is purely an optimization to avoid a copy later - outcol = similar(incol, condf === !ismissing ? S : T, length(gd)) - # Comparison is possible only between CatValues from the same pool - if incol isa CategoricalVector - U = Union{CategoricalArrays.leveltype(outcol), - eltype(outcol) >: Missing ? 
Missing : Union{}} - outcol = CategoricalArray{U, 1}(outcol.refs, incol.pool) - end - # It is safe to use a non-missing init value - # since missing will poison the result if present - # we assume here that groups are non-empty (current design assures this) - # + workaround for https://github.com/JuliaLang/julia/issues/36978 - if isconcretetype(S) && hasmethod($initf, Tuple{S}) && !(S <: Irrational) - fill!(outcol, $initf(S)) - else - fillfirst!(condf, outcol, incol, gd) - end - return outcol +# perform a transformation taking SubDataFrame as an input +function _combine_process_callable(@nospecialize(cs_i::Base.Callable), + optional_i::Bool, + parentdf::AbstractDataFrame, + gd::GroupedDataFrame, + seen_cols::Dict{Symbol, Tuple{Bool, Int}}, + trans_res::Vector{TransformationResult}, + idx_agg::Union{Nothing, AbstractVector{Int}}) + firstres = length(gd) > 0 ? cs_i(gd[1]) : cs_i(similar(parentdf, 0)) + idx, outcols, nms = _combine_multicol(firstres, cs_i, gd, nothing) + + if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame, + NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}) + # if idx_agg was not computed yet it is nothing + # in this case if we are not passed a vector compute it. 
+ if isnothing(idx_agg) + idx_agg = Vector{Int}(undef, length(gd)) + fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd) end + @assert idx == idx_agg + idx = idx_agg end -end - -function copyto_widen!(res::AbstractVector{T}, x::AbstractVector) where T - @inbounds for i in eachindex(res, x) - val = x[i] - S = typeof(val) - if S <: T || promote_type(S, T) <: T - res[i] = val + @assert length(outcols) == length(nms) + for j in eachindex(outcols) + outcol = outcols[j] + out_col_name = nms[j] + if haskey(seen_cols, out_col_name) + optional, loc = seen_cols[out_col_name] + # if column was seen and it is optional now ignore it + if !optional_i + optional, loc = seen_cols[out_col_name] + # we have seen this col but it is not allowed to replace it + optional || throw(ArgumentError("duplicate output column name: :$out_col_name")) + @assert trans_res[loc].optional && trans_res[loc].name == out_col_name + trans_res[loc] = TransformationResult(idx, outcol, out_col_name, optional_i) + seen_cols[out_col_name] = (optional_i, loc) + end else - newres = Tables.allocatecolumn(promote_type(S, T), length(x)) - return copyto_widen!(newres, x) + push!(trans_res, TransformationResult(idx, outcol, out_col_name, optional_i)) + seen_cols[out_col_name] = (optional_i, length(trans_res)) end end - return res + return idx_agg end -function groupreduce!(res::AbstractVector, f, op, condf, adjust, checkempty::Bool, - incol::AbstractVector, gd::GroupedDataFrame) - n = length(gd) - if adjust !== nothing || checkempty - counts = zeros(Int, n) +# perform a transformation specified using the Pair notation with a single output column +function _combine_process_pair_symbol(optional_i::Bool, + gd::GroupedDataFrame, + seen_cols::Dict{Symbol, Tuple{Bool, Int}}, + trans_res::Vector{TransformationResult}, + idx_agg::Union{Nothing, AbstractVector{Int}}, + out_col_name::Symbol, + firstmulticol::Bool, + firstres::Any, + @nospecialize(fun::Base.Callable), + incols::Union{Tuple, NamedTuple}) + if firstmulticol + 
throw(ArgumentError("a single value or vector result is required (got $(typeof(firstres)))")) end - groups = gd.groups - @inbounds for i in eachindex(incol, groups) - gix = groups[i] - x = incol[i] - if gix > 0 && (condf === nothing || condf(x)) - # this check should be optimized out if U is not Any - if eltype(res) === Any && !isassigned(res, gix) - res[gix] = f(x, gix) - else - res[gix] = op(res[gix], f(x, gix)) - end - if adjust !== nothing || checkempty - counts[gix] += 1 - end - end + # if idx_agg was not computed yet it is nothing + # in this case if we are not passed a vector compute it. + if !(firstres isa AbstractVector) && isnothing(idx_agg) + idx_agg = Vector{Int}(undef, length(gd)) + fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd) end - # handle the case of an unitialized reduction - if eltype(res) === Any - if op === Base.add_sum - initf = zero - elseif op === Base.mul_prod - initf = one - else - initf = x -> throw(ErrorException("Unrecognized op $op")) + # TODO: if firstres is a vector we recompute idx for every function + # this could be avoided - it could be computed only the first time + # and later we could just check if lengths of groups match this first idx + + # the last argument passed to _combine_with_first informs it about precomputed + # idx. Currently we do it only for single-row return values otherwise we pass + # nothing to signal that idx has to be computed in _combine_with_first + idx, outcols, _ = _combine_with_first(wrap(firstres), fun, gd, incols, + Val(firstmulticol), + firstres isa AbstractVector ? 
nothing : idx_agg) + @assert length(outcols) == 1 + outcol = outcols[1] + + if haskey(seen_cols, out_col_name) + # if column was seen and it is optional now ignore it + if !optional_i + optional, loc = seen_cols[out_col_name] + # we have seen this col but it is not allowed to replace it + optional || throw(ArgumentError("duplicate output column name: :$out_col_name")) + @assert trans_res[loc].optional && trans_res[loc].name == out_col_name + trans_res[loc] = TransformationResult(idx, outcol, out_col_name, optional_i) + seen_cols[out_col_name] = (optional_i, loc) end - @inbounds for gix in eachindex(res) - if !isassigned(res, gix) - res[gix] = initf(nonmissingtype(eltype(incol))) - end - end - end - if adjust !== nothing - res .= adjust.(res, counts) - end - if checkempty && any(iszero, counts) - throw(ArgumentError("some groups contain only missing values")) - end - # Undo pool sharing done by groupreduce_init - if res isa CategoricalVector && res.pool === incol.pool - V = Union{CategoricalArrays.leveltype(res), - eltype(res) >: Missing ? 
Missing : Union{}} - res = CategoricalArray{V, 1}(res.refs, copy(res.pool)) - end - if isconcretetype(eltype(res)) - return res else - return copyto_widen!(Tables.allocatecolumn(typeof(first(res)), n), res) - end -end - -# function barrier works around type instability of groupreduce_init due to applicable -groupreduce(f, op, condf, adjust, checkempty::Bool, - incol::AbstractVector, gd::GroupedDataFrame) = - groupreduce!(groupreduce_init(op, condf, adjust, incol, gd), - f, op, condf, adjust, checkempty, incol, gd) -# Avoids the overhead due to Missing when computing reduction -groupreduce(f, op, condf::typeof(!ismissing), adjust, checkempty::Bool, - incol::AbstractVector, gd::GroupedDataFrame) = - groupreduce!(disallowmissing(groupreduce_init(op, condf, adjust, incol, gd)), - f, op, condf, adjust, checkempty, incol, gd) - -(r::Reduce)(incol::AbstractVector, gd::GroupedDataFrame) = - groupreduce((x, i) -> x, r.op, r.condf, r.adjust, r.checkempty, incol, gd) - -# this definition is missing in Julia 1.0 LTS and is required by aggregation for var -# TODO: remove this when we drop 1.0 support -if VERSION < v"1.1" - Base.zero(::Type{Missing}) = missing -end - -function (agg::Aggregate{typeof(var)})(incol::AbstractVector, gd::GroupedDataFrame) - means = groupreduce((x, i) -> x, Base.add_sum, agg.condf, /, false, incol, gd) - # !ismissing check is purely an optimization to avoid a copy later - if eltype(means) >: Missing && agg.condf !== !ismissing - T = Union{Missing, real(eltype(means))} + push!(trans_res, TransformationResult(idx, outcol, out_col_name, optional_i)) + seen_cols[out_col_name] = (optional_i, length(trans_res)) + end + return idx_agg +end + +# perform a transformation specified using the Pair notation with multiple output columns +function _combine_process_pair_astable(optional_i::Bool, + gd::GroupedDataFrame, + seen_cols::Dict{Symbol, Tuple{Bool, Int}}, + trans_res::Vector{TransformationResult}, + idx_agg::Union{Nothing, AbstractVector{Int}}, + 
out_col_name::Union{Type{AsTable}, AbstractVector{Symbol}}, + firstmulticol::Bool, + firstres::Any, + @nospecialize(fun::Base.Callable), + incols::Union{Tuple, NamedTuple}) + if firstres isa AbstractVector + idx, outcol_vec, _ = _combine_with_first(wrap(firstres), fun, gd, incols, + Val(firstmulticol), nothing) + @assert length(outcol_vec) == 1 + res = outcol_vec[1] + @assert length(res) > 0 + + kp1 = keys(res[1]) + prepend = all(x -> x isa Integer, kp1) + if !(prepend || all(x -> x isa Symbol, kp1) || all(x -> x isa AbstractString, kp1)) + throw(ArgumentError("keys of the returned elements must be " * + "`Symbol`s, strings or integers")) + end + if any(x -> !isequal(keys(x), kp1), res) + throw(ArgumentError("keys of the returned elements must be identical")) + end + outcols = [[x[n] for x in res] for n in kp1] + nms = [prepend ? Symbol("x", n) : Symbol(n) for n in kp1] else - T = real(eltype(means)) - end - res = zeros(T, length(gd)) - return groupreduce!(res, (x, i) -> @inbounds(abs2(x - means[i])), +, agg.condf, - (x, l) -> l <= 1 ? oftype(x / (l-1), NaN) : x / (l-1), - false, incol, gd) -end + if !firstmulticol + firstres = Tables.columntable(firstres) + oldfun = fun + fun = (x...) -> Tables.columntable(oldfun(x...)) + end + idx, outcols, nms = _combine_multicol(firstres, fun, gd, incols) -function (agg::Aggregate{typeof(std)})(incol::AbstractVector, gd::GroupedDataFrame) - outcol = Aggregate(var, agg.condf)(incol, gd) - if eltype(outcol) <: Union{Missing, Rational} - return sqrt.(outcol) - else - return map!(sqrt, outcol, outcol) + if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame, + NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}) + # if idx_agg was not computed yet it is nothing + # in this case if we are not passed a vector compute it. 
+ if isnothing(idx_agg) + idx_agg = Vector{Int}(undef, length(gd)) + fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd) + end + @assert idx == idx_agg + idx = idx_agg + end + @assert length(outcols) == length(nms) end -end - -for f in (first, last) - function (agg::Aggregate{typeof(f)})(incol::AbstractVector, gd::GroupedDataFrame) - n = length(gd) - outcol = similar(incol, n) - fillfirst!(agg.condf, outcol, incol, gd, rev=agg.f === last) - if isconcretetype(eltype(outcol)) - return outcol + if out_col_name isa AbstractVector{Symbol} + if length(out_col_name) != length(nms) + throw(ArgumentError("Number of returned columns does not " * + "match the length of requested output")) + else + nms = out_col_name + end + end + for j in eachindex(outcols) + outcol = outcols[j] + out_col_name = nms[j] + if haskey(seen_cols, out_col_name) + optional, loc = seen_cols[out_col_name] + # if column was seen and it is optional now ignore it + if !optional_i + optional, loc = seen_cols[out_col_name] + # we have seen this col but it is not allowed to replace it + optional || throw(ArgumentError("duplicate output column name: :$out_col_name")) + @assert trans_res[loc].optional && trans_res[loc].name == out_col_name + trans_res[loc] = TransformationResult(idx, outcol, out_col_name, optional_i) + seen_cols[out_col_name] = (optional_i, loc) + end else - return copyto_widen!(Tables.allocatecolumn(typeof(first(outcol)), n), outcol) + push!(trans_res, TransformationResult(idx, outcol, out_col_name, optional_i)) + seen_cols[out_col_name] = (optional_i, length(trans_res)) end end + return idx_agg end -function (agg::Aggregate{typeof(length)})(incol::AbstractVector, gd::GroupedDataFrame) - if getfield(gd, :idx) === nothing - lens = zeros(Int, length(gd)) - @inbounds for gix in gd.groups - gix > 0 && (lens[gix] += 1) - end - return lens +# perform a transformation specified using the Pair notation +# cs_i is a Pair that has many possible forms so this function is used to dispatch +# to an 
appropriate more specialized function +function _combine_process_pair(@nospecialize(cs_i::Pair), + optional_i::Bool, + parentdf::AbstractDataFrame, + gd::GroupedDataFrame, + seen_cols::Dict{Symbol, Tuple{Bool, Int}}, + trans_res::Vector{TransformationResult}, + idx_agg::Union{Nothing, AbstractVector{Int}}) + source_cols, (fun, out_col_name) = cs_i + + if source_cols isa Int + incols = (parentdf[!, source_cols],) + elseif source_cols isa AsTable + incols = Tables.columntable(select(parentdf, + source_cols.cols, + copycols=false)) else - return gd.ends .- gd.starts .+ 1 + @assert source_cols isa AbstractVector{Int} + incols = ntuple(i -> parentdf[!, source_cols[i]], length(source_cols)) end -end -isagg((col, fun)::Pair, gdf::GroupedDataFrame) = - col isa ColumnIndex && check_aggregate(fun, parent(gdf)[!, col]) isa AbstractAggregate + firstres = length(gd) > 0 ? + do_call(fun, gd.idx, gd.starts, gd.ends, gd, incols, 1) : + do_call(fun, Int[], 1:1, 0:0, gd, incols, 1) + firstmulticol = firstres isa MULTI_COLS_TYPE -function _agg2idx_map_helper(idx, idx_agg) - agg2idx_map = fill(-1, length(idx)) - aggj = 1 - @inbounds for (j, idxj) in enumerate(idx) - while idx_agg[aggj] != idxj - aggj += 1 - @assert aggj <= length(idx_agg) - end - agg2idx_map[j] = aggj + if out_col_name isa Symbol + return _combine_process_pair_symbol(optional_i, gd, seen_cols, trans_res, idx_agg, + out_col_name, firstmulticol, firstres, fun, incols) end - return agg2idx_map + if out_col_name == AsTable || out_col_name isa AbstractVector{Symbol} + return _combine_process_pair_astable(optional_i, gd, seen_cols, trans_res, idx_agg, + out_col_name, firstmulticol, firstres, fun, incols) + end + throw(ArgumentError("unsupported target column name specifier $out_col_name")) end function prepare_idx_keeprows(idx::AbstractVector{<:Integer}, @@ -1150,14 +483,10 @@ function prepare_idx_keeprows(idx::AbstractVector{<:Integer}, return idx_keeprows end -function _combine(f::AbstractVector{<:Pair}, - 
gd::GroupedDataFrame, nms::AbstractVector{Symbol}, +function _combine(gd::GroupedDataFrame, + @nospecialize(cs_norm::Vector{Any}), optional_transform::Vector{Bool}, copycols::Bool, keeprows::Bool, renamecols::Bool) - # here f should be normalized and in a form of source_cols => fun - @assert all(x -> first(x) isa Union{Int, AbstractVector{Int}, AsTable}, f) - @assert all(x -> last(x) isa Base.Callable, f) - - if isempty(f) + if isempty(cs_norm) if keeprows && nrow(parent(gd)) > 0 && minimum(gd.groups) == 0 throw(ArgumentError("select and transform do not support " * "`GroupedDataFrame`s from which some groups have "* @@ -1178,87 +507,76 @@ function _combine(f::AbstractVector{<:Pair}, end idx_agg = nothing - if length(gd) > 0 && any(x -> isagg(x, gd), f) + if length(gd) > 0 && any(x -> isagg(x, gd), cs_norm) # Compute indices of representative rows only once for all AbstractAggregates idx_agg = Vector{Int}(undef, length(gd)) fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd) - elseif length(gd) == 0 || !all(x -> isagg(x, gd), f) + elseif length(gd) == 0 || !all(x -> isagg(x, gd), cs_norm) # Trigger computation of indices # This can speed up some aggregates that would not trigger this on their own @assert gd.idx !== nothing end - res = Vector{Any}(undef, length(f)) + + trans_res = Vector{TransformationResult}() + + # seen_cols keeps information about the location of columns already processed + # and whether a given column can be replaced in the future + seen_cols = Dict{Symbol, Tuple{Bool, Int}}() + parentdf = parent(gd) - for (i, p) in enumerate(f) - source_cols, fun = p - if length(gd) > 0 && isagg(p, gd) - incol = parentdf[!, source_cols] - agg = check_aggregate(last(p), incol) - outcol = agg(incol, gd) - res[i] = idx_agg, outcol - elseif keeprows && fun === identity && !(source_cols isa AsTable) - @assert source_cols isa Union{Int, AbstractVector{Int}} - @assert length(source_cols) == 1 - outcol = parentdf[!, first(source_cols)] - res[i] = idx_keeprows, copycols ? 
copy(outcol) : outcol - else - if source_cols isa Int - incols = (parentdf[!, source_cols],) - elseif source_cols isa AsTable - incols = Tables.columntable(select(parentdf, - source_cols.cols, - copycols=false)) - else - @assert source_cols isa AbstractVector{Int} - incols = ntuple(i -> parentdf[!, source_cols[i]], length(source_cols)) - end - firstres = length(gd) > 0 ? - do_call(fun, gd.idx, gd.starts, gd.ends, gd, incols, 1) : - do_call(fun, Int[], 1:1, 0:0, gd, incols, 1) - firstmulticol = firstres isa MULTI_COLS_TYPE - if firstmulticol - throw(ArgumentError("a single value or vector result is required when " * - "passing multiple functions (got $(typeof(res)))")) + for i in eachindex(cs_norm, optional_transform) + cs_i = cs_norm[i] + optional_i = optional_transform[i] + + if length(gd) > 0 && isagg(cs_i, gd) + _combine_process_agg(cs_i, optional_i, parentdf, gd, seen_cols, trans_res, idx_agg) + elseif keeprows && cs_i isa Pair && first(last(cs_i)) === identity && + !(first(cs_i) isa AsTable) && (last(last(cs_i)) isa Symbol) + # this is a fast path used when we pass a column or rename a column in select or transform + _combine_process_noop(cs_i, optional_i, parentdf, seen_cols, trans_res, idx_keeprows, copycols) + elseif cs_i isa Base.Callable + idx_callable = _combine_process_callable(cs_i, optional_i, parentdf, gd, seen_cols, trans_res, idx_agg) + if idx_callable !== nothing + if idx_agg === nothing + idx_agg = idx_callable + else + @assert idx_agg === idx_callable + end end - # if idx_agg was not computed yet it is nothing - # in this case if we are not passed a vector compute it. 
- if !(firstres isa AbstractVector) && isnothing(idx_agg) - idx_agg = Vector{Int}(undef, length(gd)) - fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd) + else + @assert cs_i isa Pair + idx_pair = _combine_process_pair(cs_i, optional_i, parentdf, gd, seen_cols, trans_res, idx_agg) + if idx_pair !== nothing + if idx_agg === nothing + idx_agg = idx_pair + else + @assert idx_agg === idx_pair + end end - # TODO: if firstres is a vector we recompute idx for every function - # this could be avoided - it could be computed only the first time - # and later we could just check if lengths of groups match this first idx - - # the last argument passed to _combine_with_first informs it about precomputed - # idx. Currently we do it only for single-row return values otherwise we pass - # nothing to signal that idx has to be computed in _combine_with_first - idx, outcols, _ = _combine_with_first(wrap(firstres), fun, gd, incols, - Val(firstmulticol), - firstres isa AbstractVector ? nothing : idx_agg) - @assert length(outcols) == 1 - res[i] = idx, outcols[1] end end + + isempty(trans_res) && return Int[], DataFrame() # idx_agg === nothing then we have only functions that # returned multiple rows and idx_loc = 1 - idx_loc = findfirst(x -> x[1] !== idx_agg, res) + idx_loc = findfirst(x -> x.col_idx !== idx_agg, trans_res) if !keeprows && isnothing(idx_loc) @assert !isnothing(idx_agg) idx = idx_agg else - idx = keeprows ? idx_keeprows : res[idx_loc][1] + idx = keeprows ? 
idx_keeprows : trans_res[idx_loc].col_idx agg2idx_map = nothing - for i in 1:length(res) - if res[i][1] !== idx && res[i][1] != idx - if res[i][1] === idx_agg + for i in 1:length(trans_res) + if trans_res[i].col_idx !== idx + if trans_res[i].col_idx === idx_agg # we perform pseudo broadcasting here # keep -1 as a sentinel for errors if isnothing(agg2idx_map) agg2idx_map = _agg2idx_map_helper(idx, idx_agg) end - res[i] = idx_agg, res[i][2][agg2idx_map] - elseif idx != res[i][1] + trans_res[i] = TransformationResult(idx_agg, trans_res[i].col[agg2idx_map], + trans_res[i].name, trans_res[i].optional) + elseif idx != trans_res[i].col_idx if keeprows throw(ArgumentError("all functions must return vectors with " * "as many values as rows in each group")) @@ -1270,469 +588,79 @@ function _combine(f::AbstractVector{<:Pair}, end end - # here first field in res[i] is used to keep track how the column was generated + # here first field in trans_res[i] is used to keep track how the column was generated # a correct index is stored in idx variable - for (i, (col_idx, col)) in enumerate(res) - if keeprows && res[i][1] !== idx_keeprows # we need to reorder the column + for i in eachindex(trans_res) + col_idx = trans_res[i].col_idx + col = trans_res[i].col + if keeprows && col_idx !== idx_keeprows # we need to reorder the column newcol = similar(col) # we can probably make it more efficient, but I leave it as an optimization for the future gd_idx = gd.idx - for j in eachindex(gd.idx, col) - newcol[gd_idx[j]] = col[j] + k = 0 + # consider adding @inbounds later + for (s, e) in zip(gd.starts, gd.ends) + for j in s:e + k += 1 + newcol[gd_idx[j]] = col[k] + end end - res[i] = (col_idx, newcol) + @assert k == length(gd_idx) + trans_res[i] = TransformationResult(col_idx, newcol, trans_res[i].name, trans_res[i].optional) end end - outcols = map(x -> x[2], res) + + outcols = AbstractVector[x.col for x in trans_res] + nms = Symbol[x.name for x in trans_res] # this check is redundant given we 
check idx above # but it is safer to double check and it is cheap @assert all(x -> length(x) == length(outcols[1]), outcols) - return idx, DataFrame(collect(AbstractVector, outcols), nms, copycols=false) -end - -function _combine(fun::Base.Callable, gd::GroupedDataFrame, ::Nothing, - copycols::Bool, keeprows::Bool, renamecols::Bool) - @assert copycols && !keeprows - # use `similar` as `gd` might have been subsetted - firstres = length(gd) > 0 ? fun(gd[1]) : fun(similar(parent(gd), 0)) - idx, outcols, nms = _combine_multicol(firstres, fun, gd, nothing) - valscat = DataFrame(collect(AbstractVector, outcols), nms) - return idx, valscat -end - -function _combine(p::Pair, gd::GroupedDataFrame, ::Nothing, - copycols::Bool, keeprows::Bool, renamecols::Bool) - # here p should not be normalized as we allow tabular return value from fun - # map and combine should not dispatch here if p is isagg - @assert copycols && !keeprows - source_cols, (fun, out_col) = normalize_selection(index(parent(gd)), p, renamecols) - parentdf = parent(gd) - if source_cols isa Int - incols = (parent(gd)[!, source_cols],) - elseif source_cols isa AsTable - incols = Tables.columntable(select(parentdf, - source_cols.cols, - copycols=false)) - else - @assert source_cols isa AbstractVector{Int} - incols = ntuple(i -> parent(gd)[!, source_cols[i]], length(source_cols)) - end - firstres = length(gd) > 0 ? 
- do_call(fun, gd.idx, gd.starts, gd.ends, gd, incols, 1) : - do_call(fun, Int[], 1:1, 0:0, gd, incols, 1) - idx, outcols, nms = _combine_multicol(firstres, fun, gd, incols) - # disallow passing target column name to genuine tables - if firstres isa MULTI_COLS_TYPE - if p isa Pair{<:Any, <:Pair{<:Any, <:SymbolOrString}} - throw(ArgumentError("setting column name for tabular return value is disallowed")) - end - else - # fetch auto generated or passed target column name to nms overwritting - # what _combine_with_first produced - nms = [out_col] - end - valscat = DataFrame(collect(AbstractVector, outcols), nms) - return idx, valscat -end - -function _combine_multicol(firstres, fun::Any, gd::GroupedDataFrame, - incols::Union{Nothing, AbstractVector, Tuple, NamedTuple}) - firstmulticol = firstres isa MULTI_COLS_TYPE - if !(firstres isa Union{AbstractVecOrMat, AbstractDataFrame, - NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}) - idx_agg = Vector{Int}(undef, length(gd)) - fillfirst!(nothing, idx_agg, 1:length(gd.groups), gd) - else - idx_agg = nothing - end - return _combine_with_first(wrap(firstres), fun, gd, incols, - Val(firstmulticol), idx_agg) -end - -function _combine_with_first(first::Union{NamedTuple, DataFrameRow, AbstractDataFrame}, - f::Any, gd::GroupedDataFrame, - incols::Union{Nothing, AbstractVector, Tuple, NamedTuple}, - firstmulticol::Val, idx_agg::Union{Nothing, AbstractVector{<:Integer}}) - extrude = false - - if first isa AbstractDataFrame - n = 0 - eltys = eltype.(eachcol(first)) - elseif first isa NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}} - n = 0 - eltys = map(eltype, first) - elseif first isa DataFrameRow - n = length(gd) - eltys = [eltype(parent(first)[!, i]) for i in parentcols(index(first))] - elseif firstmulticol == Val(false) && first[1] isa Union{AbstractArray{<:Any, 0}, Ref} - extrude = true - first = wrap_row(first[1], firstmulticol) - n = length(gd) - eltys = (typeof(first[1]),) - else # other NamedTuple giving a single row - 
n = length(gd) - eltys = map(typeof, first) - if any(x -> x <: AbstractVector, eltys) - throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed")) - end - end - idx = isnothing(idx_agg) ? Vector{Int}(undef, n) : idx_agg - local initialcols - let eltys=eltys, n=n # Workaround for julia#15276 - initialcols = ntuple(i -> Tables.allocatecolumn(eltys[i], n), _ncol(first)) - end - targetcolnames = tuple(propertynames(first)...) - if !extrude && first isa Union{AbstractDataFrame, - NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}} - outcols, finalcolnames = _combine_tables_with_first!(first, initialcols, idx, 1, 1, - f, gd, incols, targetcolnames, - firstmulticol) - else - outcols, finalcolnames = _combine_rows_with_first!(first, initialcols, 1, 1, - f, gd, incols, targetcolnames, - firstmulticol) - end - return idx, outcols, collect(Symbol, finalcolnames) -end - -function fill_row!(row, outcols::NTuple{N, AbstractVector}, - i::Integer, colstart::Integer, - colnames::NTuple{N, Symbol}) where N - if _ncol(row) != N - throw(ArgumentError("return value must have the same number of columns " * - "for all groups (got $N and $(length(row)))")) - end - @inbounds for j in colstart:length(outcols) - col = outcols[j] - cn = colnames[j] - local val - try - val = row[cn] - catch - throw(ArgumentError("return value must have the same column names " * - "for all groups (got $colnames and $(propertynames(row)))")) - end - S = typeof(val) - T = eltype(col) - if S <: T || promote_type(S, T) <: T - col[i] = val - else - return j - end - end - return nothing -end - -function _combine_rows_with_first!(first::Union{NamedTuple, DataFrameRow}, - outcols::NTuple{N, AbstractVector}, - rowstart::Integer, colstart::Integer, - f::Any, gd::GroupedDataFrame, - incols::Union{Nothing, AbstractVector, Tuple, NamedTuple}, - colnames::NTuple{N, Symbol}, - firstmulticol::Val) where N - len = length(gd) - gdidx = gd.idx - starts = gd.starts - ends = gd.ends - - # handle empty 
GroupedDataFrame - len == 0 && return outcols, colnames - - # Handle first group - j = fill_row!(first, outcols, rowstart, colstart, colnames) - @assert j === nothing # eltype is guaranteed to match - # Handle remaining groups - @inbounds for i in rowstart+1:len - row = wrap_row(do_call(f, gdidx, starts, ends, gd, incols, i), firstmulticol) - j = fill_row!(row, outcols, i, 1, colnames) - if j !== nothing # Need to widen column type - local newcols - let i = i, j = j, outcols=outcols, row=row # Workaround for julia#15276 - newcols = ntuple(length(outcols)) do k - S = typeof(row[k]) - T = eltype(outcols[k]) - U = promote_type(S, T) - if S <: T || U <: T - outcols[k] - else - copyto!(Tables.allocatecolumn(U, length(outcols[k])), - 1, outcols[k], 1, k >= j ? i-1 : i) - end - end - end - return _combine_rows_with_first!(row, newcols, i, j, - f, gd, incols, colnames, firstmulticol) - end - end - return outcols, colnames + return idx, DataFrame(outcols, nms, copycols=false) end -# This needs to be in a separate function -# to work around a crash due to JuliaLang/julia#29430 -if VERSION >= v"1.1.0-DEV.723" - @inline function do_append!(do_it, col, vals) - do_it && append!(col, vals) - return do_it - end -else - @noinline function do_append!(do_it, col, vals) - do_it && append!(col, vals) - return do_it +function combine(f::Base.Callable, gd::GroupedDataFrame; + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) + if f isa Colon + throw(ArgumentError("First argument must be a transformation if the second argument is a GroupedDataFrame")) end + return combine(gd, f, keepkeys=keepkeys, ungroup=ungroup, renamecols=renamecols) end -function append_rows!(rows, outcols::NTuple{N, AbstractVector}, - colstart::Integer, colnames::NTuple{N, Symbol}) where N - if !isa(rows, Union{AbstractDataFrame, NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}) - throw(ArgumentError(ERROR_ROW_COUNT)) - elseif _ncol(rows) != N - throw(ArgumentError("return value must have the same 
number of columns " * - "for all groups (got $N and $(_ncol(rows)))")) - end - @inbounds for j in colstart:length(outcols) - col = outcols[j] - cn = colnames[j] - local vals - try - vals = getproperty(rows, cn) - catch - throw(ArgumentError("return value must have the same column names " * - "for all groups (got $colnames and $(propertynames(rows)))")) - end - S = eltype(vals) - T = eltype(col) - if !do_append!(S <: T || promote_type(S, T) <: T, col, vals) - return j - end - end - return nothing -end +combine(f::Pair, gd::GroupedDataFrame; + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) = + throw(ArgumentError("First argument must be a transformation if the second argument is a GroupedDataFrame. " * + "You can pass a `Pair` as the second argument of the transformation. If you want the return " * + "value to be processed as having multiple columns add `=> AsTable` suffix to the pair.")) -function _combine_tables_with_first!(first::Union{AbstractDataFrame, - NamedTuple{<:Any, <:Tuple{Vararg{AbstractVector}}}}, - outcols::NTuple{N, AbstractVector}, - idx::Vector{Int}, rowstart::Integer, colstart::Integer, - f::Any, gd::GroupedDataFrame, - incols::Union{Nothing, AbstractVector, Tuple, NamedTuple}, - colnames::NTuple{N, Symbol}, - firstmulticol::Val) where N - len = length(gd) - gdidx = gd.idx - starts = gd.starts - ends = gd.ends - # Handle first group +combine(gd::GroupedDataFrame, + cs::Union{Pair, Base.Callable, ColumnIndex, MultiColumnIndex}...; + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) = + _combine_prepare(gd, cs..., keepkeys=keepkeys, ungroup=ungroup, + copycols=true, keeprows=false, renamecols=renamecols) - @assert _ncol(first) == N - if !isempty(colnames) && length(gd) > 0 - j = append_rows!(first, outcols, colstart, colnames) - @assert j === nothing # eltype is guaranteed to match - append!(idx, Iterators.repeated(gdidx[starts[rowstart]], _nrow(first))) +function select(f::Base.Callable, gd::GroupedDataFrame; 
copycols::Bool=true, + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) + if f isa Colon + throw(ArgumentError("First argument must be a transformation if the second argument is a grouped data frame")) end - # Handle remaining groups - @inbounds for i in rowstart+1:len - rows = wrap_table(do_call(f, gdidx, starts, ends, gd, incols, i), firstmulticol) - _ncol(rows) == 0 && continue - if isempty(colnames) - newcolnames = tuple(propertynames(rows)...) - if rows isa AbstractDataFrame - eltys = eltype.(eachcol(rows)) - else - eltys = map(eltype, rows) - end - initialcols = ntuple(i -> Tables.allocatecolumn(eltys[i], 0), _ncol(rows)) - return _combine_tables_with_first!(rows, initialcols, idx, i, 1, - f, gd, incols, newcolnames, firstmulticol) - end - j = append_rows!(rows, outcols, 1, colnames) - if j !== nothing # Need to widen column type - local newcols - let i = i, j = j, outcols=outcols, rows=rows # Workaround for julia#15276 - newcols = ntuple(length(outcols)) do k - S = eltype(rows isa AbstractDataFrame ? rows[!, k] : rows[k]) - T = eltype(outcols[k]) - U = promote_type(S, T) - if S <: T || U <: T - outcols[k] - else - copyto!(Tables.allocatecolumn(U, length(outcols[k])), outcols[k]) - end - end - end - return _combine_tables_with_first!(rows, newcols, idx, i, j, - f, gd, incols, colnames, firstmulticol) - end - append!(idx, Iterators.repeated(gdidx[starts[i]], _nrow(rows))) - end - return outcols, colnames + return select(gd, f, copycols=copycols, keepkeys=keepkeys, ungroup=ungroup) end -""" - select(gd::GroupedDataFrame, args...; copycols::Bool=true, keepkeys::Bool=true, - ungroup::Bool=true, renamecols::Bool=true) - -Apply `args` to `gd` following the rules described in [`combine`](@ref). - -If `ungroup=true` the result is a `DataFrame`. -If `ungroup=false` the result is a `GroupedDataFrame` -(in this case the returned value retains the order of groups of `gd`). 
- -The `parent` of the returned value has as many rows as `parent(gd)` and -in the same order, except when the returned value has no columns -(in which case it has zero rows). If an operation in `args` returns -a single value it is always broadcasted to have this number of rows. - -If `copycols=false` then do not perform copying of columns that are not transformed. -$KWARG_PROCESSING_RULES - -# See also - -[`groupby`](@ref), [`combine`](@ref), [`select!`](@ref), [`transform`](@ref), [`transform!`](@ref) - -# Examples -```jldoctest -julia> df = DataFrame(a = [1, 1, 1, 2, 2, 1, 1, 2], - b = repeat([2, 1], outer=[4]), - c = 1:8) -8×3 DataFrame -│ Row │ a │ b │ c │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 2 │ 1 │ -│ 2 │ 1 │ 1 │ 2 │ -│ 3 │ 1 │ 2 │ 3 │ -│ 4 │ 2 │ 1 │ 4 │ -│ 5 │ 2 │ 2 │ 5 │ -│ 6 │ 1 │ 1 │ 6 │ -│ 7 │ 1 │ 2 │ 7 │ -│ 8 │ 2 │ 1 │ 8 │ - -julia> gd = groupby(df, :a); - -julia> select(gd, :c => sum, nrow) -8×3 DataFrame -│ Row │ a │ c_sum │ nrow │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 19 │ 5 │ -│ 2 │ 1 │ 19 │ 5 │ -│ 3 │ 1 │ 19 │ 5 │ -│ 4 │ 2 │ 17 │ 3 │ -│ 5 │ 2 │ 17 │ 3 │ -│ 6 │ 1 │ 19 │ 5 │ -│ 7 │ 1 │ 19 │ 5 │ -│ 8 │ 2 │ 17 │ 3 │ - -julia> select(gd, :c => sum, nrow, ungroup=false) -GroupedDataFrame with 2 groups based on key: a -First Group (5 rows): a = 1 -│ Row │ a │ c_sum │ nrow │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 19 │ 5 │ -│ 2 │ 1 │ 19 │ 5 │ -│ 3 │ 1 │ 19 │ 5 │ -│ 4 │ 1 │ 19 │ 5 │ -│ 5 │ 1 │ 19 │ 5 │ -⋮ -Last Group (3 rows): a = 2 -│ Row │ a │ c_sum │ nrow │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 2 │ 17 │ 3 │ -│ 2 │ 2 │ 17 │ 3 │ -│ 3 │ 2 │ 17 │ 3 │ - -julia> select(gd, :c => (x -> sum(log, x)) => :sum_log_c) # specifying a name for target column -8×2 DataFrame -│ Row │ a │ sum_log_c │ -│ │ Int64 │ Float64 │ -├─────┼───────┼───────────┤ -│ 1 │ 1 │ 5.52943 │ -│ 2 │ 1 │ 5.52943 │ -│ 3 │ 1 │ 5.52943 │ -│ 4 │ 2 │ 5.07517 │ -│ 5 
│ 2 │ 5.07517 │ -│ 6 │ 1 │ 5.52943 │ -│ 7 │ 1 │ 5.52943 │ -│ 8 │ 2 │ 5.07517 │ - -julia> select(gd, [:b, :c] .=> sum) # passing a vector of pairs -8×3 DataFrame -│ Row │ a │ b_sum │ c_sum │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 8 │ 19 │ -│ 2 │ 1 │ 8 │ 19 │ -│ 3 │ 1 │ 8 │ 19 │ -│ 4 │ 2 │ 4 │ 17 │ -│ 5 │ 2 │ 4 │ 17 │ -│ 6 │ 1 │ 8 │ 19 │ -│ 7 │ 1 │ 8 │ 19 │ -│ 8 │ 2 │ 4 │ 17 │ - -julia> select(gd, :b => :b1, :c => :c1, - [:b, :c] => +, keepkeys=false) # multiple arguments, renaming and keepkeys -8×3 DataFrame -│ Row │ b1 │ c1 │ b_c_+ │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 2 │ 1 │ 3 │ -│ 2 │ 1 │ 2 │ 3 │ -│ 3 │ 2 │ 3 │ 5 │ -│ 4 │ 1 │ 4 │ 5 │ -│ 5 │ 2 │ 5 │ 7 │ -│ 6 │ 1 │ 6 │ 7 │ -│ 7 │ 2 │ 7 │ 9 │ -│ 8 │ 1 │ 8 │ 9 │ - -julia> select(gd, :b, :c => sum) # passing columns and broadcasting -8×3 DataFrame -│ Row │ a │ b │ c_sum │ -│ │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 2 │ 19 │ -│ 2 │ 1 │ 1 │ 19 │ -│ 3 │ 1 │ 2 │ 19 │ -│ 4 │ 2 │ 1 │ 17 │ -│ 5 │ 2 │ 2 │ 17 │ -│ 6 │ 1 │ 1 │ 19 │ -│ 7 │ 1 │ 2 │ 19 │ -│ 8 │ 2 │ 1 │ 17 │ - -julia> select(gd, :, AsTable(Not(:a)) => sum, renamecols=false) -8×4 DataFrame -│ Row │ a │ b │ c │ b_c │ -│ │ Int64 │ Int64 │ Int64 │ Int64 │ -├─────┼───────┼───────┼───────┼───────┤ -│ 1 │ 1 │ 2 │ 1 │ 3 │ -│ 2 │ 1 │ 1 │ 2 │ 3 │ -│ 3 │ 1 │ 2 │ 3 │ 5 │ -│ 4 │ 2 │ 1 │ 4 │ 5 │ -│ 5 │ 2 │ 2 │ 5 │ 7 │ -│ 6 │ 1 │ 1 │ 6 │ 7 │ -│ 7 │ 1 │ 2 │ 7 │ 9 │ -│ 8 │ 2 │ 1 │ 8 │ 9 │ -``` -""" select(gd::GroupedDataFrame, args...; copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) = _combine_prepare(gd, args..., copycols=copycols, keepkeys=keepkeys, ungroup=ungroup, keeprows=true, renamecols=renamecols) -""" - transform(gd::GroupedDataFrame, args...; - copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true) - -An equivalent of -`select(gd, :, args..., copycols=copycols, keepkeys=keepkeys, ungroup=ungroup, renamecols=renamecols)` -but keeps 
the columns of `parent(gd)` in their original order. - -# See also +function transform(f::Base.Callable, gd::GroupedDataFrame; copycols::Bool=true, + keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) + if f isa Colon + throw(ArgumentError("First argument must be a transformation if the second argument is a grouped data frame")) + end + return transform(gd, f, copycols=copycols, keepkeys=keepkeys, ungroup=ungroup) +end -[`groupby`](@ref), [`combine`](@ref), [`select`](@ref), [`select!`](@ref), [`transform!`](@ref) -""" function transform(gd::GroupedDataFrame, args...; copycols::Bool=true, keepkeys::Bool=true, ungroup::Bool=true, renamecols::Bool=true) res = select(gd, :, args..., copycols=copycols, keepkeys=keepkeys, @@ -1743,21 +671,13 @@ function transform(gd::GroupedDataFrame, args...; copycols::Bool=true, return res end -""" - select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true) - -An equivalent of -`select(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup, renamecols=renamecols)` -but updates `parent(gd)` in place. - -`gd` is updated to reflect the new rows of its updated parent. -If there are independent `GroupedDataFrame` objects constructed -using the same parent data frame they might get corrupt. - -# See also +function select!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true) + if f isa Colon + throw(ArgumentError("First argument must be a transformation if the second argument is a grouped data frame")) + end + return select!(gd, f, ungroup=ungroup) +end -[`groupby`](@ref), [`combine`](@ref), [`select`](@ref), [`transform`](@ref), [`transform!`](@ref) -""" function select!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true) newdf = select(gd, args..., copycols=false, renamecols=renamecols) @@ -1766,18 +686,13 @@ function select!(gd::GroupedDataFrame{DataFrame}, args...; return ungroup ? 
df : gd end -""" - transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true) - -An equivalent of -`transform(gd, args..., copycols=false, keepkeys=true, ungroup=ungroup, renamecols=renamecols)` -but updates `parent(gd)` in place -and keeps the columns of `parent(gd)` in their original order. - -# See also +function transform!(f::Base.Callable, gd::GroupedDataFrame; ungroup::Bool=true, renamecols::Bool=true) + if f isa Colon + throw(ArgumentError("First argument must be a transformation if the second argument is a grouped data frame")) + end + return transform!(gd, f, ungroup=ungroup) +end -[`groupby`](@ref), [`combine`](@ref), [`select`](@ref), [`select!`](@ref), [`transform`](@ref) -""" function transform!(gd::GroupedDataFrame{DataFrame}, args...; ungroup::Bool=true, renamecols::Bool=true) newdf = select(gd, :, args..., copycols=false, renamecols=renamecols) diff --git a/test/grouping.jl b/test/grouping.jl index 12d71e3068..ef7de6fb7a 100644 --- a/test/grouping.jl +++ b/test/grouping.jl @@ -492,6 +492,12 @@ end @test isempty(gd2.starts) @test isempty(gd2.ends) @test isequal_typed(parent(gd2), DataFrame(A=Int[], X=Int[])) + + @test_throws ArgumentError combine(:x => identity, groupby_checked(DataFrame(x=[1,2,3]), :x)) + @test_throws ArgumentError select(groupby_checked(DataFrame(x=[1,2,3], y=1), :x), [] => identity) + @test_throws ArgumentError select(groupby_checked(DataFrame(x=[1,2,3], y=1), :x), [:x, :y] => identity) + @test_throws ArgumentError select(groupby_checked(DataFrame(x=[1,2,3], y=1), :x), [] => identity => :z) + @test_throws ArgumentError select(groupby_checked(DataFrame(x=[1,2,3], y=1), :x), [:x, :y] => identity => :z) end @testset "grouping with missings" begin @@ -770,61 +776,66 @@ end # Only test that different combine syntaxes work, # and rely on tests below for deeper checks @test combine(gd, :c => sum) == - combine(:c => sum, gd) == combine(gd, :c => sum => :c_sum) == - combine(:c => sum => :c_sum, gd) == 
combine(gd, [:c => sum]) == combine(gd, [:c => sum => :c_sum]) == - combine(d -> (c_sum=sum(d.c),), gd) - @test_throws MethodError combine(gd, d -> (c_sum=sum(d.c),)) + combine(d -> (c_sum=sum(d.c),), gd) == + combine(gd, d -> (c_sum=sum(d.c),)) == + combine(gd, d -> (c_sum=[sum(d.c)],)) == + combine(gd, d -> DataFrame(c_sum=sum(d.c))) == + combine(gd, :c => (x -> [sum(x)]) => [:c_sum]) == + combine(gd, :c => (x -> [(c_sum=sum(x),)]) => AsTable) == + combine(gd, :c => (x -> fill(sum(x),1,1)) => [:c_sum]) == + combine(gd, :c => (x -> [Dict(:c_sum => sum(x))]) => AsTable) + @test_throws ArgumentError combine(:c => sum, gd) + @test_throws ArgumentError combine(:, gd) @test combine(gd, :c => vexp) == - combine(:c => vexp, gd) == combine(gd, :c => vexp => :c_function) == - combine(:c => vexp => :c_function, gd) == - combine(:c => c -> (c_function = vexp(c),), gd) == combine(gd, [:c => vexp]) == combine(gd, [:c => vexp => :c_function]) == - combine(d -> (c_function=exp.(d.c),), gd) + combine(d -> (c_function=exp.(d.c),), gd) == + combine(gd, d -> (c_function=exp.(d.c),)) == + combine(gd, :c => (x -> (c_function=exp.(x),)) => AsTable) == + combine(gd, :c => ByRow(exp) => :c_function) == + combine(gd, :c => ByRow(x -> [exp(x)]) => [:c_function]) @test_throws ArgumentError combine(gd, :c => c -> (c_function = vexp(c),)) - @test_throws MethodError combine(gd, d -> (c_function=exp.(d.c),)) @test combine(gd, :b => sum, :c => sum) == combine(gd, :b => sum => :b_sum, :c => sum => :c_sum) == combine(gd, [:b => sum, :c => sum]) == combine(gd, [:b => sum => :b_sum, :c => sum => :c_sum]) == - combine(d -> (b_sum=sum(d.b), c_sum=sum(d.c)), gd) - @test_throws MethodError combine(gd, d -> (b_sum=sum(d.b), c_sum=sum(d.c))) + combine(d -> (b_sum=sum(d.b), c_sum=sum(d.c)), gd) == + combine(gd, d -> (b_sum=sum(d.b), c_sum=sum(d.c))) == + combine(gd, d -> (b_sum=sum(d.b),), d -> (c_sum=sum(d.c),)) @test combine(gd, :b => vexp, :c => identity) == combine(gd, :b => vexp => :b_function, :c => 
identity => :c_identity) == combine(gd, [:b => vexp, :c => identity]) == combine(gd, [:b => vexp => :b_function, :c => identity => :c_identity]) == combine(d -> (b_function=vexp(d.b), c_identity=d.c), gd) == - combine([:b, :c] => (b, c) -> (b_function=vexp(b), c_identity=c), gd) - @test_throws MethodError combine(gd, d -> (b_function=vexp(d.b), c_identity=d.c)) + combine(gd, [:b, :c] => ((b, c) -> (b_function=vexp(b), c_identity=c)) => AsTable) == + combine(gd, d -> (b_function=vexp(d.b), c_identity=d.c)) @test_throws ArgumentError combine(gd, [:b, :c] => (b, c) -> (b_function=vexp(b), c_identity=c)) - @test combine(x -> extrema(x.c), gd) == combine(:c => (x -> extrema(x)) => :x1, gd) - @test combine(x -> x.b+x.c, gd) == combine([:b,:c] => (+) => :x1, gd) - @test combine(x -> (p=x.b, q=x.c), gd) == - combine([:b,:c] => (b,c) -> (p=b,q=c), gd) - @test_throws MethodError combine(gd, x -> (p=x.b, q=x.c)) + @test combine(x -> extrema(x.c), gd) == combine(gd, :c => (x -> extrema(x)) => :x1) + @test combine(x -> hcat(extrema(x.c)...), gd) == combine(gd, :c => (x -> [extrema(x)]) => AsTable) + @test combine(x -> x.b+x.c, gd) == combine(gd, [:b,:c] => (+) => :x1) + @test combine(x -> (p=x.b, q=x.c), gd) == combine(gd, [:b,:c] => ((b,c) -> (p=b,q=c)) => AsTable) @test_throws ArgumentError combine(gd, [:b,:c] => (b,c) -> (p=b,q=c)) @test combine(x -> DataFrame(p=x.b, q=x.c), gd) == - combine([:b,:c] => (b,c) -> DataFrame(p=b,q=c), gd) - @test_throws MethodError combine(gd, x -> DataFrame(p=x.b, q=x.c)) + combine(gd, [:b,:c] => ((b,c) -> DataFrame(p=b,q=c)) => AsTable) == + combine(gd, x -> DataFrame(p=x.b, q=x.c)) @test_throws ArgumentError combine(gd, [:b,:c] => (b,c) -> DataFrame(p=b,q=c)) @test combine(x -> [1 2; 3 4], gd) == - combine([:b,:c] => (b,c) -> [1 2; 3 4], gd) - @test_throws MethodError combine(gd, x -> [1 2; 3 4]) + combine(gd, [:b,:c] => ((b,c) -> [1 2; 3 4]) => AsTable) @test_throws ArgumentError combine(gd, [:b,:c] => (b,c) -> [1 2; 3 4]) @test 
combine(nrow, gd) == combine(gd, nrow) == combine(gd, [nrow => :nrow]) == combine(gd, 1 => length => :nrow) - @test combine(nrow => :res, gd) == combine(gd, nrow => :res) == + @test combine(gd, nrow => :res) == combine(gd, [nrow => :res]) == combine(gd, 1 => length => :res) @test combine(gd, nrow => :res, nrow, [nrow => :res2]) == combine(gd, 1 => length => :res, 1 => length => :nrow, 1 => length => :res2) @@ -834,64 +845,54 @@ end @test_throws ArgumentError combine(gd, [nrow]) for col in (:c, 3) - @test combine(col => sum, gd) == combine(d -> (c_sum=sum(d.c),), gd) - @test combine(col => x -> sum(x), gd) == combine(d -> (c_function=sum(d.c),), gd) - @test combine(col => x -> (z=sum(x),), gd) == combine(d -> (z=sum(d.c),), gd) - @test combine(col => x -> DataFrame(z=sum(x),), gd) == combine(d -> (z=sum(d.c),), gd) - @test combine(col => identity, gd) == combine(d -> (c_identity=d.c,), gd) - @test combine(col => x -> (z=x,), gd) == combine(d -> (z=d.c,), gd) - - @test combine(col => sum => :xyz, gd) == - combine(d -> (xyz=sum(d.c),), gd) - @test combine(col => (x -> sum(x)) => :xyz, gd) == - combine(d -> (xyz=sum(d.c),), gd) - @test combine(col => (x -> (sum(x),)) => :xyz, gd) == - combine(d -> (xyz=(sum(d.c),),), gd) + @test combine(gd, col => sum) == combine(d -> (c_sum=sum(d.c),), gd) + @test combine(gd, col => x -> sum(x)) == combine(d -> (c_function=sum(d.c),), gd) + @test combine(gd, col => (x -> (z=sum(x),)) => AsTable) == combine(d -> (z=sum(d.c),), gd) + @test combine(gd, col => (x -> DataFrame(z=sum(x),)) => AsTable) == combine(d -> (z=sum(d.c),), gd) + @test combine(gd, col => identity) == combine(d -> (c_identity=d.c,), gd) + @test combine(gd, col => (x -> (z=x,)) => AsTable) == combine(d -> (z=d.c,), gd) + + @test combine(gd, col => sum => :xyz) == combine(d -> (xyz=sum(d.c),), gd) + @test combine(gd, col => (x -> sum(x)) => :xyz) == combine(d -> (xyz=sum(d.c),), gd) + @test combine(gd, col => (x -> (sum(x),)) => :xyz) == combine(d -> 
(xyz=(sum(d.c),),), gd) @test combine(nrow, gd) == combine(d -> (nrow=length(d.c),), gd) - @test combine(nrow => :res, gd) == combine(d -> (res=length(d.c),), gd) - @test combine(col => sum => :res, gd) == combine(d -> (res=sum(d.c),), gd) - @test combine(col => (x -> sum(x)) => :res, gd) == combine(d -> (res=sum(d.c),), gd) - @test_throws ArgumentError combine(col => (x -> (z=sum(x),)) => :xyz, gd) - @test_throws ArgumentError combine(col => (x -> DataFrame(z=sum(x),)) => :xyz, gd) - @test_throws ArgumentError combine(col => (x -> (z=x,)) => :xyz, gd) - @test_throws ArgumentError combine(col => x -> (z=1, xzz=[1]), gd) + @test combine(gd, nrow => :res) == combine(d -> (res=length(d.c),), gd) + @test combine(gd, col => sum => :res) == combine(d -> (res=sum(d.c),), gd) + @test combine(gd, col => (x -> sum(x)) => :res) == combine(d -> (res=sum(d.c),), gd) + + @test_throws ArgumentError combine(gd, col => (x -> (z=sum(x),)) => :xyz) + @test_throws ArgumentError combine(gd, col => (x -> DataFrame(z=sum(x),)) => :xyz) + @test_throws ArgumentError combine(gd, col => (x -> (z=x,)) => :xyz) + @test_throws ArgumentError combine(gd, col => x -> (z=1, xzz=[1])) end + for cols in ([:b, :c], 2:3, [2, 3], [false, true, true]), ungroup in (true, false) - @test combine(cols => (b,c) -> (y=exp.(b), z=c), gd, ungroup=ungroup) == - combine(d -> (y=exp.(d.b), z=d.c), gd, ungroup=ungroup) - @test combine(cols => (b,c) -> [exp.(b) c], gd, ungroup=ungroup) == + @test combine(gd, cols => ((b,c) -> (y=exp.(b), z=c)) => AsTable, ungroup=ungroup) == + combine(gd, d -> (y=exp.(d.b), z=d.c), ungroup=ungroup) + @test combine(gd, cols => ((b,c) -> [exp.(b) c]) => AsTable, ungroup=ungroup) == combine(d -> [exp.(d.b) d.c], gd, ungroup=ungroup) - @test combine(cols => ((b,c) -> sum(b) + sum(c)) => :xyz, gd, ungroup=ungroup) == + @test combine(gd, cols => ((b,c) -> sum(b) + sum(c)) => :xyz, ungroup=ungroup) == combine(d -> (xyz=sum(d.b) + sum(d.c),), gd, ungroup=ungroup) - if eltype(cols) === Bool - 
cols2 = [[false, true, false], [false, false, true]] - @test_throws MethodError combine((xyz = cols[1] => sum, xzz = cols2[2] => sum), - gd, ungroup=ungroup) - @test_throws MethodError combine((xyz = cols[1] => sum, xzz = cols2[1] => sum), - gd, ungroup=ungroup) - @test_throws MethodError combine((xyz = cols[1] => sum, xzz = cols2[2] => x -> first(x)), - gd, ungroup=ungroup) - else - cols2 = cols - @test combine(gd, cols2[1] => sum => :xyz, cols2[2] => sum => :xzz, ungroup=ungroup) == + if eltype(cols) !== Bool + @test combine(gd, cols[1] => sum => :xyz, cols[2] => sum => :xzz, ungroup=ungroup) == combine(d -> (xyz=sum(d.b), xzz=sum(d.c)), gd, ungroup=ungroup) - @test combine(gd, cols2[1] => sum => :xyz, cols2[1] => sum => :xzz, ungroup=ungroup) == + @test combine(gd, cols[1] => sum => :xyz, cols[1] => sum => :xzz, ungroup=ungroup) == combine(d -> (xyz=sum(d.b), xzz=sum(d.b)), gd, ungroup=ungroup) - @test combine(gd, cols2[1] => sum => :xyz, - cols2[2] => (x -> first(x)) => :xzz, ungroup=ungroup) == + @test combine(gd, cols[1] => sum => :xyz, + cols[2] => (x -> first(x)) => :xzz, ungroup=ungroup) == combine(d -> (xyz=sum(d.b), xzz=first(d.c)), gd, ungroup=ungroup) - @test combine(gd, cols2[1] => vexp => :xyz, - cols2[2] => sum => :xzz, ungroup=ungroup) == + @test combine(gd, cols[1] => vexp => :xyz, + cols[2] => sum => :xzz, ungroup=ungroup) == combine(d -> (xyz=vexp(d.b), xzz=fill(sum(d.c), length(vexp(d.b)))), gd, ungroup=ungroup) end - @test_throws ArgumentError combine(cols => (b,c) -> (y=exp.(b), z=sum(c)), - gd, ungroup=ungroup) - @test_throws ArgumentError combine(cols2 => ((b,c) -> DataFrame(y=exp.(b), - z=sum(c))) => :xyz, gd, ungroup=ungroup) - @test_throws ArgumentError combine(cols2 => ((b,c) -> [exp.(b) c]) => :xyz, - gd, ungroup=ungroup) + @test_throws ArgumentError combine(gd, cols => (b,c) -> (y=exp.(b), z=sum(c)), + ungroup=ungroup) + @test_throws ArgumentError combine(gd, cols => ((b,c) -> DataFrame(y=exp.(b), + z=sum(c))) => :xyz, 
ungroup=ungroup) + @test_throws ArgumentError combine(gd, cols => ((b,c) -> [exp.(b) c]) => :xyz, + ungroup=ungroup) end end @@ -1441,9 +1442,9 @@ end @test gdf[:] == gdf @test gdf[1:1] == gdf - @test validate_gdf(combine(nrow => :x1, gdf, ungroup=false)) == + @test validate_gdf(combine(gdf, nrow => :x1, ungroup=false)) == groupby_checked(DataFrame(x1=3), []) - @test validate_gdf(combine(:x2 => identity => :x2_identity, gdf, ungroup=false)) == + @test validate_gdf(combine(gdf, :x2 => identity => :x2_identity, ungroup=false)) == groupby_checked(DataFrame(x2_identity=[1,1,2]), []) @test isequal_typed(DataFrame(gdf), df) @@ -1838,9 +1839,9 @@ end @test res == DataFrame(validate_gdf(combine(sdf -> sdf.x1[1] ? fr : er, groupby_checked(df, :a), ungroup=false))) if fr isa AbstractVector && df.x1[1] - @test res == combine(:x1 => (x1 -> x1[1] ? fr : er) => :x1, gdf) + @test res == combine(gdf, :x1 => (x1 -> x1[1] ? fr : er) => :x1) else - @test res == combine(:x1 => x1 -> x1[1] ? fr : er, gdf) + @test res == combine(gdf, :x1 => (x1 -> x1[1] ? 
fr : er) => AsTable) end if nrow(res) == 0 && length(propertynames(er)) == 0 && er != rand(0, 1) @test res == DataFrame(a=[]) @@ -1867,9 +1868,8 @@ end @test combine(gdf, r"x" => cor) == DataFrame(g=[1,2], x1_x2_cor = [1.0, 1.0]) @test combine(gdf, Not(:g) => ByRow(/)) == DataFrame(:g => [1,1,1,2,2,2], Symbol("x1_x2_/") => 1.0) @test combine(gdf, Between(:x2, :x1) => () -> 1) == DataFrame(:g => 1:2, Symbol("function") => 1) - @test combine(gdf, :x1 => :z) == combine(gdf, [:x1 => :z]) == combine(:x1 => :z, gdf) == - DataFrame(g=[1,1,1,2,2,2], z=1:6) - @test validate_gdf(combine(:x1 => :z, groupby_checked(df, :g), ungroup=false)) == + @test combine(gdf, :x1 => :z) == combine(gdf, [:x1 => :z]) == DataFrame(g=[1,1,1,2,2,2], z=1:6) + @test validate_gdf(combine(groupby_checked(df, :g), :x1 => :z, ungroup=false)) == groupby_checked(DataFrame(g=[1,1,1,2,2,2], z=1:6), :g) end @@ -1879,10 +1879,10 @@ end gdf = groupby_checked(df, :b) res = combine(sdf -> sdf.x[1:2], gdf) @test names(res) == ["b", "x1"] - res2 = combine(:x => x -> x[1:2], gdf) + res2 = combine(gdf, :x => x -> x[1:2]) @test names(res2) == ["b", "x_function"] @test Matrix(res) == Matrix(res2) - res2 = combine(:x => (x -> x[1:2]) => :z, gdf) + res2 = combine(gdf, :x => (x -> x[1:2]) => :z) @test names(res2) == ["b", "z"] @test Matrix(res) == Matrix(res2) @@ -1916,8 +1916,8 @@ end end for i in 1:2, v1 in [1, 1:2], v2 in [1, 1:2] - @test_throws ArgumentError combine([:b, :x] => ((b,x) -> b[1] == i ? x[v1] : (c=x[v2],)) => :v, gdf) - @test_throws ArgumentError combine([:b, :x] => ((b,x) -> b[1] == i ? x[v1] : (v=x[v2],)) => :v, gdf) + @test_throws ArgumentError combine(gdf, [:b, :x] => ((b,x) -> b[1] == i ? x[v1] : (c=x[v2],)) => :v) + @test_throws ArgumentError combine(gdf, [:b, :x] => ((b,x) -> b[1] == i ? 
x[v1] : (v=x[v2],)) => :v) end end @@ -1927,8 +1927,8 @@ end @test_throws ArgumentError combine(gdf, :x1 => x -> DataFrame()) @test_throws ArgumentError combine(gdf, :x1 => x -> (x=1, y=2)) @test_throws ArgumentError combine(gdf, :x1 => x -> (x=[1], y=[2])) - @test_throws ArgumentError combine(gdf, :x1 => x -> (x=[1],y=2)) - @test_throws ArgumentError combine(:x1 => x -> (x=[1], y=2), gdf) + @test_throws ArgumentError combine(gdf, :x1 => (x -> (x=[1],y=2)) => AsTable) + @test_throws ArgumentError combine(gdf, :x1 => x -> (x=[1], y=2)) @test_throws ArgumentError combine(gdf, :x1 => x -> ones(2, 2)) @test_throws ArgumentError combine(gdf, :x1 => x -> df[1, Not(:g)]) end @@ -2070,9 +2070,9 @@ end # whole column 4 options of single pair passed @test combine(gdf , AsTable([:x, :y]) => Ref) == - combine(AsTable([:x, :y]) => Ref, gdf) == + combine(gdf, AsTable([:x, :y]) => Ref) == DataFrame(g=1:2, x_y_Ref=[(x=[1,2,3], y=[6,7,8]), (x=[4,5], y=[9,10])]) - @test validate_gdf(combine(AsTable([:x, :y]) => Ref, gdf, ungroup=false)) == + @test validate_gdf(combine(gdf, AsTable([:x, :y]) => Ref, ungroup=false)) == groupby_checked(combine(gdf, AsTable([:x, :y]) => Ref), :g) @test combine(gdf, AsTable(1) => Ref) == @@ -2081,10 +2081,10 @@ end # ByRow 4 options of single pair passed @test combine(gdf, AsTable([:x, :y]) => ByRow(x -> [x])) == - combine(AsTable([:x, :y]) => ByRow(x -> [x]), gdf) == + combine(gdf, AsTable([:x, :y]) => ByRow(x -> [x])) == DataFrame(g=[1,1,1,2,2], x_y_function=[[(x=1,y=6)], [(x=2,y=7)], [(x=3,y=8)], [(x=4,y=9)], [(x=5,y=10)]]) - @test validate_gdf(combine(AsTable([:x, :y]) => ByRow(x -> [x]), gdf, ungroup=false)) == + @test validate_gdf(combine(gdf, AsTable([:x, :y]) => ByRow(x -> [x]), ungroup=false)) == groupby_checked(combine(gdf, AsTable([:x, :y]) => ByRow(x -> [x])), :g) # whole column and ByRow test for multiple pairs passed @@ -2967,7 +2967,7 @@ end DataFrame(a=1:3, b=4:6, c=7:9, d=10:12, a_b=5:2:9, a_b_etc=22:4:30) @test combine(gdf, :a => +, 
[:a, :b] => +, All() => +, renamecols=false) == DataFrame(a=1:3, a_b=5:2:9, a_b_etc=22:4:30) - @test combine([:a, :b] => +, gdf, renamecols=false) == DataFrame(a=1:3, a_b=5:2:9) + @test combine(gdf, [:a, :b] => +, renamecols=false) == DataFrame(a=1:3, a_b=5:2:9) @test combine(identity, gdf, renamecols=false) == df df = DataFrame(a=1:3, b=4:6, c=7:9, d=10:12) @@ -3022,4 +3022,154 @@ end @test_throws MethodError select(gdf, AsTable([]) => ByRow(inc0) => :bin) end +@testset "aggregation of reordered groups" begin + df = DataFrame(id=[1, 2, 3, 1, 3, 2], x=1:6) + gdf = groupby(df, :id) + @test select(df, :id, :x => x -> 2x) == select(gdf, :x => x -> 2x) + @test select(df, identity) == select(gdf, identity) + @test select(df, :id, x -> (a=x.x, b=x.x)) == select(gdf, x -> (a=x.x, b=x.x)) + @test transform(df, :x => x -> 2x) == transform(gdf, :x => x -> 2x) + @test transform(df, identity) == transform(gdf, identity) + @test transform(df, x -> (a=x.x, b=x.x)) == transform(gdf, x -> (a=x.x, b=x.x)) + @test combine(gdf, :x => x -> 2x) == + DataFrame(id=[1, 1, 2, 2, 3, 3], x_function=[2, 8, 4, 12, 6, 10]) + @test combine(gdf, identity) == DataFrame(gdf) + @test combine(gdf, x -> (a=x.x, b=x.x)) == + DataFrame(id=[1, 1, 2, 2, 3, 3], a=[1, 4, 2, 6, 3, 5], b=[1, 4, 2, 6, 3, 5]) + gdf = groupby(df, :id)[[3, 1, 2]] + @test select(df, :id, :x => x -> 2x) == select(gdf, :x => x -> 2x) + @test select(df, identity) == select(gdf, identity) + @test select(df, :id, x -> (a=x.x, b=x.x)) == select(gdf, x -> (a=x.x, b=x.x)) + @test transform(df, :x => x -> 2x) == transform(gdf, :x => x -> 2x) + @test transform(df, identity) == transform(gdf, identity) + @test transform(df, x -> (a=x.x, b=x.x)) == transform(gdf, x -> (a=x.x, b=x.x)) + @test combine(gdf, :x => x -> 2x) == + DataFrame(id=[3, 3, 1, 1, 2, 2], x_function=[6, 10, 2, 8, 4, 12]) + @test combine(gdf, identity) == df[[3, 5, 1, 4, 2, 6], :] + @test combine(gdf, x -> (a=x.x, b=x.x)) == + DataFrame(id=[3, 3, 1, 1, 2, 2], a=[3, 5, 1, 4, 
2, 6], b=[3, 5, 1, 4, 2, 6]) + + df = DataFrame(id = [3, 2, 1, 3, 1, 2], x=1:6) + gdf = groupby(df, :id, sort=true) + @test select(df, :id, :x => x -> 2x) == select(gdf, :x => x -> 2x) + @test select(df, identity) == select(gdf, identity) + @test select(df, :id, x -> (a=x.x, b=x.x)) == select(gdf, x -> (a=x.x, b=x.x)) + @test transform(df, :x => x -> 2x) == transform(gdf, :x => x -> 2x) + @test transform(df, identity) == transform(gdf, identity) + @test transform(df, x -> (a=x.x, b=x.x)) == transform(gdf, x -> (a=x.x, b=x.x)) + @test combine(gdf, :x => x -> 2x) == + DataFrame(id=[1, 1, 2, 2, 3, 3], x_function=[6, 10, 4, 12, 2, 8]) + @test combine(gdf, identity) == DataFrame(id=[1, 1, 2, 2, 3, 3], x=[3, 5, 2, 6, 1, 4]) + @test combine(gdf, x -> (a=x.x, b=x.x)) == + DataFrame(id=[1, 1, 2, 2, 3, 3], a=[3, 5, 2, 6, 1, 4], b=[3, 5, 2, 6, 1, 4]) + + gdf = groupby(df, :id)[[3, 1, 2]] + @test select(df, :id, :x => x -> 2x) == select(gdf, :x => x -> 2x) + @test select(df, identity) == select(gdf, identity) + @test select(df, :id, x -> (a=x.x, b=x.x)) == select(gdf, x -> (a=x.x, b=x.x)) + @test transform(df, :x => x -> 2x) == transform(gdf, :x => x -> 2x) + @test transform(df, identity) == transform(gdf, identity) + @test transform(df, x -> (a=x.x, b=x.x)) == transform(gdf, x -> (a=x.x, b=x.x)) + @test combine(gdf, :x => x -> 2x) == + DataFrame(id=[1, 1, 3, 3, 2, 2], x_function=[6, 10, 2, 8, 4, 12]) + @test combine(gdf, identity) == DataFrame(id=[1, 1, 3, 3, 2, 2], x=[3, 5, 1, 4, 2, 6]) + @test combine(gdf, x -> (a=x.x, b=x.x)) == + DataFrame(id=[1, 1, 3, 3, 2, 2], a=[3, 5, 1, 4, 2, 6], b=[3, 5, 1, 4, 2, 6]) +end + +@testset "basic tests of advanced rules with multicolumn output" begin + df = DataFrame(id=[1, 2, 3, 1, 3, 2], x=1:6) + gdf = groupby(df, :id) + + @test combine(gdf, x -> reshape(1:4, 2, 2)) == + DataFrame(id=[1,1,2,2,3,3], x1=[1,2,1,2,1,2], x2=[3,4,3,4,3,4]) + @test combine(gdf, x -> DataFrame(a=1:2, b=3:4)) == + DataFrame(id=[1,1,2,2,3,3], a=[1,2,1,2,1,2], 
b=[3,4,3,4,3,4]) + @test combine(gdf, x -> DataFrame(a=1:2, b=3:4)[1, :]) == + DataFrame(id=[1,2,3], a=[1,1,1], b=[3,3,3]) + @test combine(gdf, x -> (a=1, b=3)) == + DataFrame(id=[1,2,3], a=[1,1,1], b=[3,3,3]) + @test combine(gdf, x -> (a=1:2, b=3:4)) == + DataFrame(id=[1,1,2,2,3,3], a=[1,2,1,2,1,2], b=[3,4,3,4,3,4]) + @test combine(gdf, :x => (x -> Dict(:a => 1:2, :b => 3:4)) => AsTable) == + DataFrame(id=[1,1,2,2,3,3], a=[1,2,1,2,1,2], b=[3,4,3,4,3,4]) + @test combine(gdf, :x => ByRow(x -> [x,x+1,x+2]) => AsTable) == + DataFrame(id=[1,1,2,2,3,3], x1=[1,4,2,6,3,5], x2=[2,5,3,7,4,6], x3=[3,6,4,8,5,7]) + @test combine(gdf, :x => ByRow(x -> (x,x+1,x+2)) => AsTable) == + DataFrame(id=[1,1,2,2,3,3], x1=[1,4,2,6,3,5], x2=[2,5,3,7,4,6], x3=[3,6,4,8,5,7]) + @test combine(gdf, :x => ByRow(x -> (a=x,b=x+1,c=x+2)) => AsTable) == + DataFrame(id=[1,1,2,2,3,3], a=[1,4,2,6,3,5], b=[2,5,3,7,4,6], c=[3,6,4,8,5,7]) + @test combine(gdf, :x => ByRow(x -> [x,x+1,x+2]) => [:p, :q, :r]) == + DataFrame(id=[1,1,2,2,3,3], p=[1,4,2,6,3,5], q=[2,5,3,7,4,6], r=[3,6,4,8,5,7]) + @test combine(gdf, :x => ByRow(x -> (x,x+1,x+2)) => [:p, :q, :r]) == + DataFrame(id=[1,1,2,2,3,3], p=[1,4,2,6,3,5], q=[2,5,3,7,4,6], r=[3,6,4,8,5,7]) + @test combine(gdf, :x => ByRow(x -> (a=x,b=x+1,c=x+2)) => [:p, :q, :r]) == + DataFrame(id=[1,1,2,2,3,3], p=[1,4,2,6,3,5], q=[2,5,3,7,4,6], r=[3,6,4,8,5,7]) + @test combine(gdf, :x => ByRow(x -> 1) => [:p]) == DataFrame(id=[1,1,2,2,3,3], p=1) + @test_throws ArgumentError combine(gdf, :x => (x -> 1) => [:p]) + + @test select(gdf, x -> reshape(1:4, 2, 2)) == + DataFrame(id=[1,2,3,1,3,2], x1=[1,1,1,2,2,2], x2=[3,3,3,4,4,4]) + @test select(gdf, x -> DataFrame(a=1:2, b=3:4)) == + DataFrame(id=[1,2,3,1,3,2], a=[1,1,1,2,2,2], b=[3,3,3,4,4,4]) + @test select(gdf, x -> DataFrame(a=1:2, b=3:4)[1, :]) == + DataFrame(id=[1,2,3,1,3,2], a=[1,1,1,1,1,1], b=[3,3,3,3,3,3]) + @test select(gdf, x -> (a=1, b=3)) == + DataFrame(id=[1,2,3,1,3,2], a=[1,1,1,1,1,1], b=[3,3,3,3,3,3]) + @test 
select(gdf, x -> (a=1:2, b=3:4)) == + DataFrame(id=[1,2,3,1,3,2], a=[1,1,1,2,2,2], b=[3,3,3,4,4,4]) + @test select(gdf, :x => (x -> Dict(:a => 1:2, :b => 3:4)) => AsTable) == + DataFrame(id=[1,2,3,1,3,2], a=[1,1,1,2,2,2], b=[3,3,3,4,4,4]) + @test select(gdf, :x => ByRow(x -> [x,x+1,x+2]) => AsTable) == + DataFrame(id=[1,2,3,1,3,2], x1=[1,2,3,4,5,6], x2=[2,3,4,5,6,7], x3=[3,4,5,6,7,8]) + @test select(gdf, :x => ByRow(x -> (x,x+1,x+2)) => AsTable) == + DataFrame(id=[1,2,3,1,3,2], x1=[1,2,3,4,5,6], x2=[2,3,4,5,6,7], x3=[3,4,5,6,7,8]) + @test select(gdf, :x => ByRow(x -> (a=x,b=x+1,c=x+2)) => AsTable) == + DataFrame(id=[1,2,3,1,3,2], a=[1,2,3,4,5,6], b=[2,3,4,5,6,7], c=[3,4,5,6,7,8]) + @test select(gdf, :x => ByRow(x -> [x,x+1,x+2]) => [:p, :q, :r]) == + DataFrame(id=[1,2,3,1,3,2], p=[1,2,3,4,5,6], q=[2,3,4,5,6,7], r=[3,4,5,6,7,8]) + @test select(gdf, :x => ByRow(x -> (x,x+1,x+2)) => [:p, :q, :r]) == + DataFrame(id=[1,2,3,1,3,2], p=[1,2,3,4,5,6], q=[2,3,4,5,6,7], r=[3,4,5,6,7,8]) + @test select(gdf, :x => ByRow(x -> (a=x,b=x+1,c=x+2)) => [:p, :q, :r]) == + DataFrame(id=[1,2,3,1,3,2], p=[1,2,3,4,5,6], q=[2,3,4,5,6,7], r=[3,4,5,6,7,8]) + @test select(gdf, :x => ByRow(x -> 1) => [:p]) == DataFrame(id=[1,2,3,1,3,2], p=1) + @test_throws ArgumentError select(gdf, :x => (x -> 1) => [:p]) +end + +@testset "tests of invariants of transformation functions" begin + Random.seed!(1234) + df = DataFrame(x=rand(1000), id=rand(1:20, 1000), y=rand(1000), z=rand(1000)) + gdf = groupby_checked(df, :id) + gdf2 = gdf[20:-1:1] + @test transform(df, x -> sum(df.x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y) == + transform(gdf, x -> sum(parent(x).x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y) == + transform(gdf2, x -> sum(parent(x).x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y) == + 
DataFrame(:x => df.z, :id => df.id, :y => df.y, :z => df.z, :x1 => sum(df.x), + :p => 2df.x, :q => 2df.y, :id2 => df.id, Symbol("x_y_z_+") => df.x+df.y+df.z, + :min => min.(df.y, df.z), :max => max.(df.y, df.z)) + + @test select(df, x -> sum(df.x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y) == + select(gdf, x -> sum(parent(x).x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y, keepkeys=false) == + select(gdf2, x -> sum(parent(x).x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y, keepkeys=false) == + DataFrame(:x1 => sum(df.x), :p => 2df.x, :q => 2df.y, :id2 => df.id, + :x => df.z, Symbol("x_y_z_+") => df.x+df.y+df.z, + :min => min.(df.y, df.z), :max => max.(df.y, df.z), :y => df.y) + + @test combine(df, x -> sum(df.x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y) |> sort == + combine(gdf, x -> sum(parent(x).x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y, keepkeys=false) |> sort == + combine(gdf2, x -> sum(parent(x).x), x -> (p=2x.x, q=2x.y), :id => :id2, :z => :x, + [:x, :y, :z] => +, [:y, :z] => ByRow(minmax) => [:min, :max], :y, keepkeys=false) |> sort == + DataFrame(:x1 => sum(df.x), :p => 2df.x, :q => 2df.y, :id2 => df.id, + :x => df.z, Symbol("x_y_z_+") => df.x+df.y+df.z, + :min => min.(df.y, df.z), :max => max.(df.y, df.z), :y => df.y) |> sort +end + end # module diff --git a/test/select.jl b/test/select.jl index fa9b4143e6..fe612794de 100644 --- a/test/select.jl +++ b/test/select.jl @@ -1342,178 +1342,178 @@ end @test df == DataFrame(a=1:3, b=4:6, c=7:9, d=10:12, a_b=5:2:9, a_b_etc=22:4:30) end -@testset "additional tests for new rules" begin - @testset "transformation function with a function as first argument" begin - 
for df in (DataFrame(a=1:2, b=3:4, c=5:6), view(DataFrame(a=1:3, b=3:5, c=5:7, d=11:13), 1:2, 1:3)) - @test select(sdf -> sdf.b, df) == DataFrame(x1=3:4) - @test select(sdf -> (b = 2sdf.b,), df) == DataFrame(b=[6,8]) - @test select(sdf -> (b = 1,), df) == DataFrame(b=[1, 1]) - @test_throws ArgumentError select(sdf -> (b = [1],), df) - @test select(sdf -> (b = [1, 5],), df) == DataFrame(b=[1, 5]) - @test select(sdf -> 1, df) == DataFrame(x1=[1, 1]) - @test select(sdf -> fill([1]), df) == DataFrame(x1=[[1], [1]]) - @test select(sdf -> Ref([1]), df) == DataFrame(x1=[[1], [1]]) - @test select(sdf -> "x", df) == DataFrame(x1=["x", "x"]) - @test select(sdf -> [[1,2],[3,4]], df) == DataFrame(x1=[[1,2],[3,4]]) - for ret in (DataFrame(), NamedTuple(), zeros(0,0), DataFrame(t=1)[1, 1:0]) - @test select(sdf -> ret, df) == DataFrame() - end - @test_throws ArgumentError select(sdf -> DataFrame(a=10), df) - @test_throws ArgumentError select(sdf -> zeros(1, 2), df) - @test select(sdf -> DataFrame(a=[10, 11]), df) == DataFrame(a=[10, 11]) - @test select(sdf -> [10 11; 12 13], df) == DataFrame(x1=[10, 12], x2=[11, 13]) - @test select(sdf -> DataFrame(a=10)[1, :], df) == DataFrame(a=[10, 10]) - - @test transform(sdf -> sdf.b, df) == [df DataFrame(x1=3:4)] - @test transform(sdf -> (b = 2sdf.b,), df) == DataFrame(a=1:2, b=[6,8], c=5:6) - @test transform(sdf -> (b = 1,), df) == DataFrame(a=[1,2], b=[1, 1], c=[5,6]) - @test_throws ArgumentError transform(sdf -> (b = [1],), df) - @test transform(sdf -> (b = [1, 5],), df) == DataFrame(a=[1,2], b=[1, 5], c=[5,6]) - @test transform(sdf -> 1, df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=1) - @test transform(sdf -> fill([1]), df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1],[1]]) - @test transform(sdf -> Ref([1]), df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1],[1]]) - @test transform(sdf -> "x", df) == DataFrame(a=1:2, b=3:4, c=5:6, x1="x") - @test transform(sdf -> [[1,2],[3,4]], df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1,2],[3,4]]) - for ret in 
(DataFrame(), NamedTuple(), zeros(0,0), DataFrame(t=1)[1, 1:0]) - @test transform(sdf -> ret, df) == df - end - @test_throws ArgumentError transform(sdf -> DataFrame(a=10), df) - @test_throws ArgumentError transform(sdf -> zeros(1, 2), df) - @test transform(sdf -> DataFrame(a=[10, 11]), df) == DataFrame(a=[10, 11], b=3:4, c=5:6) - @test transform(sdf -> [10 11; 12 13], df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[10, 12], x2=[11, 13]) - @test transform(sdf -> DataFrame(a=10)[1, :], df) == DataFrame(a=[10, 10], b=3:4, c=5:6) - - @test combine(sdf -> sdf.b, df) == DataFrame(x1=3:4) - @test combine(sdf -> (b = 2sdf.b,), df) == DataFrame(b=[6,8]) - @test combine(sdf -> (b = 1,), df) == DataFrame(b=[1]) - @test combine(sdf -> (b = [1],), df) == DataFrame(b=[1]) - @test combine(sdf -> (b = [1, 5],), df) == DataFrame(b=[1, 5]) - @test combine(sdf -> 1, df) == DataFrame(x1=[1]) - @test combine(sdf -> fill([1]), df) == DataFrame(x1=[[1]]) - @test combine(sdf -> Ref([1]), df) == DataFrame(x1=[[1]]) - @test combine(sdf -> "x", df) == DataFrame(x1=["x"]) - @test combine(sdf -> [[1,2],[3,4]], df) == DataFrame(x1=[[1,2],[3,4]]) - for ret in (DataFrame(), NamedTuple(), zeros(0,0), DataFrame(t=1)[1, 1:0]) - @test combine(sdf -> ret, df) == DataFrame() - end - @test combine(sdf -> DataFrame(a=10), df) == DataFrame(a=10) - @test combine(sdf -> zeros(1, 2), df) == DataFrame(x1=0, x2=0) - @test combine(sdf -> DataFrame(a=[10, 11]), df) == DataFrame(a=[10, 11]) - @test combine(sdf -> [10 11; 12 13], df) == DataFrame(x1=[10, 12], x2=[11, 13]) - @test combine(sdf -> DataFrame(a=10)[1, :], df) == DataFrame(a=[10]) +@testset "transformation function with a function as first argument" begin + for df in (DataFrame(a=1:2, b=3:4, c=5:6), view(DataFrame(a=1:3, b=3:5, c=5:7, d=11:13), 1:2, 1:3)) + @test select(sdf -> sdf.b, df) == DataFrame(x1=3:4) + @test select(sdf -> (b = 2sdf.b,), df) == DataFrame(b=[6,8]) + @test select(sdf -> (b = 1,), df) == DataFrame(b=[1, 1]) + @test_throws ArgumentError 
select(sdf -> (b = [1],), df) + @test select(sdf -> (b = [1, 5],), df) == DataFrame(b=[1, 5]) + @test select(sdf -> 1, df) == DataFrame(x1=[1, 1]) + @test select(sdf -> fill([1]), df) == DataFrame(x1=[[1], [1]]) + @test select(sdf -> Ref([1]), df) == DataFrame(x1=[[1], [1]]) + @test select(sdf -> "x", df) == DataFrame(x1=["x", "x"]) + @test select(sdf -> [[1,2],[3,4]], df) == DataFrame(x1=[[1,2],[3,4]]) + for ret in (DataFrame(), NamedTuple(), zeros(0,0), DataFrame(t=1)[1, 1:0]) + @test select(sdf -> ret, df) == DataFrame() end - - df = DataFrame(a=1:2, b=3:4, c=5:6) - @test select!(sdf -> sdf.b, copy(df)) == DataFrame(x1=3:4) - @test select!(sdf -> (b = 2sdf.b,), copy(df)) == DataFrame(b=[6,8]) - @test select!(sdf -> (b = 1,), copy(df)) == DataFrame(b=[1, 1]) - @test_throws ArgumentError select!(sdf -> (b = [1],), copy(df)) - @test select!(sdf -> (b = [1, 5],), copy(df)) == DataFrame(b=[1, 5]) - @test select!(sdf -> 1, copy(df)) == DataFrame(x1=[1, 1]) - @test select!(sdf -> fill([1]), copy(df)) == DataFrame(x1=[[1], [1]]) - @test select!(sdf -> Ref([1]), copy(df)) == DataFrame(x1=[[1], [1]]) - @test select!(sdf -> "x", copy(df)) == DataFrame(x1=["x", "x"]) - @test select!(sdf -> [[1,2],[3,4]], copy(df)) == DataFrame(x1=[[1,2],[3,4]]) + @test_throws ArgumentError select(sdf -> DataFrame(a=10), df) + @test_throws ArgumentError select(sdf -> zeros(1, 2), df) + @test select(sdf -> DataFrame(a=[10, 11]), df) == DataFrame(a=[10, 11]) + @test select(sdf -> [10 11; 12 13], df) == DataFrame(x1=[10, 12], x2=[11, 13]) + @test select(sdf -> DataFrame(a=10)[1, :], df) == DataFrame(a=[10, 10]) + + @test transform(sdf -> sdf.b, df) == [df DataFrame(x1=3:4)] + @test transform(sdf -> (b = 2sdf.b,), df) == DataFrame(a=1:2, b=[6,8], c=5:6) + @test transform(sdf -> (b = 1,), df) == DataFrame(a=[1,2], b=[1, 1], c=[5,6]) + @test_throws ArgumentError transform(sdf -> (b = [1],), df) + @test transform(sdf -> (b = [1, 5],), df) == DataFrame(a=[1,2], b=[1, 5], c=[5,6]) + @test 
transform(sdf -> 1, df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=1) + @test transform(sdf -> fill([1]), df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1],[1]]) + @test transform(sdf -> Ref([1]), df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1],[1]]) + @test transform(sdf -> "x", df) == DataFrame(a=1:2, b=3:4, c=5:6, x1="x") + @test transform(sdf -> [[1,2],[3,4]], df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1,2],[3,4]]) for ret in (DataFrame(), NamedTuple(), zeros(0,0), DataFrame(t=1)[1, 1:0]) - @test select!(sdf -> ret, copy(df)) == DataFrame() + @test transform(sdf -> ret, df) == df end - @test_throws ArgumentError select!(sdf -> DataFrame(a=10), copy(df)) - @test_throws ArgumentError select!(sdf -> zeros(1, 2), copy(df)) - @test select!(sdf -> DataFrame(a=[10, 11]), copy(df)) == DataFrame(a=[10, 11]) - @test select!(sdf -> [10 11; 12 13], copy(df)) == DataFrame(x1=[10, 12], x2=[11, 13]) - @test select!(sdf -> DataFrame(a=10)[1, :], copy(df)) == DataFrame(a=[10, 10]) - - @test transform!(sdf -> sdf.b, copy(df)) == [df DataFrame(x1=3:4)] - @test transform!(sdf -> (b = 2sdf.b,), copy(df)) == DataFrame(a=1:2, b=[6,8], c=5:6) - @test transform!(sdf -> (b = 1,), copy(df)) == DataFrame(a=[1,2], b=[1, 1], c=[5,6]) - @test_throws ArgumentError transform!(sdf -> (b = [1],), copy(df)) - @test transform!(sdf -> (b = [1, 5],), copy(df)) == DataFrame(a=[1,2], b=[1, 5], c=[5,6]) - @test transform!(sdf -> 1, copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=1) - @test transform!(sdf -> fill([1]), copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1],[1]]) - @test transform!(sdf -> Ref([1]), copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1],[1]]) - @test transform!(sdf -> "x", copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1="x") - @test transform!(sdf -> [[1,2],[3,4]], copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1,2],[3,4]]) + @test_throws ArgumentError transform(sdf -> DataFrame(a=10), df) + @test_throws ArgumentError transform(sdf -> zeros(1, 2), df) + @test transform(sdf -> 
DataFrame(a=[10, 11]), df) == DataFrame(a=[10, 11], b=3:4, c=5:6) + @test transform(sdf -> [10 11; 12 13], df) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[10, 12], x2=[11, 13]) + @test transform(sdf -> DataFrame(a=10)[1, :], df) == DataFrame(a=[10, 10], b=3:4, c=5:6) + + @test combine(sdf -> sdf.b, df) == DataFrame(x1=3:4) + @test combine(sdf -> (b = 2sdf.b,), df) == DataFrame(b=[6,8]) + @test combine(sdf -> (b = 1,), df) == DataFrame(b=[1]) + @test combine(sdf -> (b = [1],), df) == DataFrame(b=[1]) + @test combine(sdf -> (b = [1, 5],), df) == DataFrame(b=[1, 5]) + @test combine(sdf -> 1, df) == DataFrame(x1=[1]) + @test combine(sdf -> fill([1]), df) == DataFrame(x1=[[1]]) + @test combine(sdf -> Ref([1]), df) == DataFrame(x1=[[1]]) + @test combine(sdf -> "x", df) == DataFrame(x1=["x"]) + @test combine(sdf -> [[1,2],[3,4]], df) == DataFrame(x1=[[1,2],[3,4]]) for ret in (DataFrame(), NamedTuple(), zeros(0,0), DataFrame(t=1)[1, 1:0]) - @test transform!(sdf -> ret, copy(df)) == df + @test combine(sdf -> ret, df) == DataFrame() end - @test_throws ArgumentError transform!(sdf -> DataFrame(a=10), copy(df)) - @test_throws ArgumentError transform!(sdf -> zeros(1, 2), copy(df)) - @test transform!(sdf -> DataFrame(a=[10, 11]), copy(df)) == DataFrame(a=[10, 11], b=3:4, c=5:6) - @test transform!(sdf -> [10 11; 12 13], copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[10, 12], x2=[11, 13]) - @test transform!(sdf -> DataFrame(a=10)[1, :], copy(df)) == DataFrame(a=[10, 10], b=3:4, c=5:6) + @test combine(sdf -> DataFrame(a=10), df) == DataFrame(a=10) + @test combine(sdf -> zeros(1, 2), df) == DataFrame(x1=0, x2=0) + @test combine(sdf -> DataFrame(a=[10, 11]), df) == DataFrame(a=[10, 11]) + @test combine(sdf -> [10 11; 12 13], df) == DataFrame(x1=[10, 12], x2=[11, 13]) + @test combine(sdf -> DataFrame(a=10)[1, :], df) == DataFrame(a=[10]) end - @testset "transformation function with multiple columns as destination" begin - for df in (DataFrame(a=1:2, b=3:4, c=5:6), view(DataFrame(a=1:3, 
b=3:5, c=5:7, d=11:13), 1:2, 1:3)) - for fun in (select, combine, transform), - res in (DataFrame(), DataFrame(a=1,b=2)[1, :], ones(1,1), - (a=1,b=2), (a=[1], b=[2]), (a=1, b=[2])) - @test_throws ArgumentError fun(df, :a => x -> res) - @test_throws ArgumentError fun(df, :a => (x -> res) => :z) - end - for res in (DataFrame(x1=1, x2=2)[1, :], (x1=1,x2=2)) - @test select(df, :a => (x -> res) => AsTable) == DataFrame(x1=[1,1], x2=[2,2]) - @test transform(df, :a => (x -> res) => AsTable) == [df DataFrame(x1=[1,1], x2=[2,2])] - @test combine(df, :a => (x -> res) => AsTable) == DataFrame(x1=[1], x2=[2]) - @test select(df, :a => (x -> res) => [:p, :q]) == DataFrame(p=[1,1], q=[2,2]) - @test transform(df, :a => (x -> res) => [:p, :q]) == [df DataFrame(p=[1,1], q=[2,2])] - @test combine(df, :a => (x -> res) => [:p, :q]) == DataFrame(p=[1], q=[2]) - @test_throws ArgumentError select(df, :a => (x -> res) => [:p, :q, :r]) - @test_throws ArgumentError select(df, :a => (x -> res) => [:p]) - end - for res in (DataFrame(x1=1, x2=2), [1 2], Tables.table([1 2], header=[:x1, :x2]), - (x1=[1], x2=[2])) - @test combine(df, :a => (x -> res) => AsTable) == DataFrame(x1=1, x2=2) - @test combine(df, :a => (x -> res) => [:p, :q]) == DataFrame(p=1, q=2) - @test_throws ArgumentError combine(df, :a => (x -> res) => [:p]) - @test_throws ArgumentError select(df, :a => (x -> res) => AsTable) - @test_throws ArgumentError transform(df, :a => (x -> res) => AsTable) - end - @test combine(df, :a => ByRow(x -> [x,x+1]), - :a => ByRow(x -> [x, x+1]) => AsTable, - :a => ByRow(x -> [x, x+1]) => [:p, :q], - :a => ByRow(x -> (s=x, t=x+1)) => AsTable, - :a => (x -> (k=x, l=x.+1)) => AsTable, - :a => ByRow(x -> (s=x, t=x+1)) => :z) == - DataFrame(a_function=[[1, 2], [2, 3]], x1=[1, 2], x2=[2, 3], - p=[1, 2], q=[2, 3], s=[1, 2], t=[2, 3], k=[1, 2], l=[2, 3], - z=[(s=1, t=2), (s=2, t=3)]) - @test select(df, :a => ByRow(x -> [x,x+1]), - :a => ByRow(x -> [x, x+1]) => AsTable, - :a => ByRow(x -> [x, x+1]) => [:p, 
:q], - :a => ByRow(x -> (s=x, t=x+1)) => AsTable, - :a => (x -> (k=x, l=x.+1)) => AsTable, - :a => ByRow(x -> (s=x, t=x+1)) => :z) == - DataFrame(a_function=[[1, 2], [2, 3]], x1=[1, 2], x2=[2, 3], - p=[1, 2], q=[2, 3], s=[1, 2], t=[2, 3], k=[1, 2], l=[2, 3], - z=[(s=1, t=2), (s=2, t=3)]) - @test transform(df, :a => ByRow(x -> [x,x+1]), - :a => ByRow(x -> [x, x+1]) => AsTable, - :a => ByRow(x -> [x, x+1]) => [:p, :q], - :a => ByRow(x -> (s=x, t=x+1)) => AsTable, - :a => (x -> (k=x, l=x.+1)) => AsTable, - :a => ByRow(x -> (s=x, t=x+1)) => :z) == - [df DataFrame(a_function=[[1, 2], [2, 3]], x1=[1, 2], x2=[2, 3], - p=[1, 2], q=[2, 3], s=[1, 2], t=[2, 3], k=[1, 2], l=[2, 3], - z=[(s=1, t=2), (s=2, t=3)])] - @test_throws ArgumentError select(df, :a => (x -> [(a=1,b=2), (a=1, b=2, c=3)]) => AsTable) - @test_throws ArgumentError select(df, :a => (x -> [(a=1,b=2), (a=1, c=3)]) => AsTable) - @test_throws ArgumentError combine(df, :a => (x -> (a=1,b=2)) => :x) - end + df = DataFrame(a=1:2, b=3:4, c=5:6) + @test select!(sdf -> sdf.b, copy(df)) == DataFrame(x1=3:4) + @test select!(sdf -> (b = 2sdf.b,), copy(df)) == DataFrame(b=[6,8]) + @test select!(sdf -> (b = 1,), copy(df)) == DataFrame(b=[1, 1]) + @test_throws ArgumentError select!(sdf -> (b = [1],), copy(df)) + @test select!(sdf -> (b = [1, 5],), copy(df)) == DataFrame(b=[1, 5]) + @test select!(sdf -> 1, copy(df)) == DataFrame(x1=[1, 1]) + @test select!(sdf -> fill([1]), copy(df)) == DataFrame(x1=[[1], [1]]) + @test select!(sdf -> Ref([1]), copy(df)) == DataFrame(x1=[[1], [1]]) + @test select!(sdf -> "x", copy(df)) == DataFrame(x1=["x", "x"]) + @test select!(sdf -> [[1,2],[3,4]], copy(df)) == DataFrame(x1=[[1,2],[3,4]]) + for ret in (DataFrame(), NamedTuple(), zeros(0,0), DataFrame(t=1)[1, 1:0]) + @test select!(sdf -> ret, copy(df)) == DataFrame() + end + @test_throws ArgumentError select!(sdf -> DataFrame(a=10), copy(df)) + @test_throws ArgumentError select!(sdf -> zeros(1, 2), copy(df)) + @test select!(sdf -> 
DataFrame(a=[10, 11]), copy(df)) == DataFrame(a=[10, 11])
+    @test select!(sdf -> [10 11; 12 13], copy(df)) == DataFrame(x1=[10, 12], x2=[11, 13])
+    @test select!(sdf -> DataFrame(a=10)[1, :], copy(df)) == DataFrame(a=[10, 10])
+
+    @test transform!(sdf -> sdf.b, copy(df)) == [df DataFrame(x1=3:4)]
+    @test transform!(sdf -> (b = 2sdf.b,), copy(df)) == DataFrame(a=1:2, b=[6,8], c=5:6)
+    @test transform!(sdf -> (b = 1,), copy(df)) == DataFrame(a=[1,2], b=[1, 1], c=[5,6])
+    @test_throws ArgumentError transform!(sdf -> (b = [1],), copy(df))
+    @test transform!(sdf -> (b = [1, 5],), copy(df)) == DataFrame(a=[1,2], b=[1, 5], c=[5,6])
+    @test transform!(sdf -> 1, copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=1)
+    @test transform!(sdf -> fill([1]), copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1],[1]])
+    @test transform!(sdf -> Ref([1]), copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1],[1]])
+    @test transform!(sdf -> "x", copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1="x")
+    @test transform!(sdf -> [[1,2],[3,4]], copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[[1,2],[3,4]])
+    for ret in (DataFrame(), NamedTuple(), zeros(0,0), DataFrame(t=1)[1, 1:0])
+        @test transform!(sdf -> ret, copy(df)) == df
     end
+    @test_throws ArgumentError transform!(sdf -> DataFrame(a=10), copy(df))
+    @test_throws ArgumentError transform!(sdf -> zeros(1, 2), copy(df))
+    @test transform!(sdf -> DataFrame(a=[10, 11]), copy(df)) == DataFrame(a=[10, 11], b=3:4, c=5:6)
+    @test transform!(sdf -> [10 11; 12 13], copy(df)) == DataFrame(a=1:2, b=3:4, c=5:6, x1=[10, 12], x2=[11, 13])
+    @test transform!(sdf -> DataFrame(a=10)[1, :], copy(df)) == DataFrame(a=[10, 10], b=3:4, c=5:6)
+
+    @test_throws ArgumentError combine(:x => identity, DataFrame(x=[1,2,3]))
+end
-    @testset "check correctness of duplicate column names" begin
-        for df in (DataFrame(a=1:2, b=3:4, c=5:6), view(DataFrame(a=1:3, b=3:5, c=5:7, d=11:13), 1:2, 1:3))
-            @test select(df, :b, :) == DataFrame(b=3:4, a=1:2, c=5:6)
-            @test select(df, :b => :c, :) == DataFrame(c=3:4, a=1:2, b=3:4)
-            @test_throws ArgumentError select(df, :b => [:c, :d], :)
-            @test_throws ArgumentError select(df, :a, :a => x -> (a=[1,2], b=[3,4]))
-            @test_throws ArgumentError select(df, :a, :a => (x -> (a=[1,2], b=[3,4])) => AsTable)
-            @test select(df, [:b, :a], :a => (x -> (a=[11,12], b=[13,14])) => AsTable, :) ==
-                  DataFrame(b=[13, 14], a=[11, 12], c=[5, 6])
-            @test select(df, [:b, :a], :a => (x -> (a=[11,12], b=[13,14])) => [:b, :a], :) ==
-                  DataFrame(b=[11, 12], a=[13, 14], c=[5, 6])
+@testset "transformation function with multiple columns as destination" begin
+    for df in (DataFrame(a=1:2, b=3:4, c=5:6), view(DataFrame(a=1:3, b=3:5, c=5:7, d=11:13), 1:2, 1:3))
+        for fun in (select, combine, transform),
+            res in (DataFrame(), DataFrame(a=1,b=2)[1, :], ones(1,1),
+                    (a=1,b=2), (a=[1], b=[2]), (a=1, b=[2]))
+            @test_throws ArgumentError fun(df, :a => x -> res)
+            @test_throws ArgumentError fun(df, :a => (x -> res) => :z)
+        end
+        for res in (DataFrame(x1=1, x2=2)[1, :], (x1=1,x2=2))
+            @test select(df, :a => (x -> res) => AsTable) == DataFrame(x1=[1,1], x2=[2,2])
+            @test transform(df, :a => (x -> res) => AsTable) == [df DataFrame(x1=[1,1], x2=[2,2])]
+            @test combine(df, :a => (x -> res) => AsTable) == DataFrame(x1=[1], x2=[2])
+            @test select(df, :a => (x -> res) => [:p, :q]) == DataFrame(p=[1,1], q=[2,2])
+            @test transform(df, :a => (x -> res) => [:p, :q]) == [df DataFrame(p=[1,1], q=[2,2])]
+            @test combine(df, :a => (x -> res) => [:p, :q]) == DataFrame(p=[1], q=[2])
+            @test_throws ArgumentError select(df, :a => (x -> res) => [:p, :q, :r])
+            @test_throws ArgumentError select(df, :a => (x -> res) => [:p])
         end
+        for res in (DataFrame(x1=1, x2=2), [1 2], Tables.table([1 2], header=[:x1, :x2]),
+                    (x1=[1], x2=[2]))
+            @test combine(df, :a => (x -> res) => AsTable) == DataFrame(x1=1, x2=2)
+            @test combine(df, :a => (x -> res) => [:p, :q]) == DataFrame(p=1, q=2)
+            @test_throws ArgumentError combine(df, :a => (x -> res) => [:p])
+            @test_throws ArgumentError select(df, :a => (x -> res) => AsTable)
+            @test_throws ArgumentError transform(df, :a => (x -> res) => AsTable)
+        end
+        @test combine(df, :a => ByRow(x -> [x,x+1]),
+                      :a => ByRow(x -> [x, x+1]) => AsTable,
+                      :a => ByRow(x -> [x, x+1]) => [:p, :q],
+                      :a => ByRow(x -> (s=x, t=x+1)) => AsTable,
+                      :a => (x -> (k=x, l=x.+1)) => AsTable,
+                      :a => ByRow(x -> (s=x, t=x+1)) => :z) ==
+              DataFrame(a_function=[[1, 2], [2, 3]], x1=[1, 2], x2=[2, 3],
+                        p=[1, 2], q=[2, 3], s=[1, 2], t=[2, 3], k=[1, 2], l=[2, 3],
+                        z=[(s=1, t=2), (s=2, t=3)])
+        @test select(df, :a => ByRow(x -> [x,x+1]),
+                     :a => ByRow(x -> [x, x+1]) => AsTable,
+                     :a => ByRow(x -> [x, x+1]) => [:p, :q],
+                     :a => ByRow(x -> (s=x, t=x+1)) => AsTable,
+                     :a => (x -> (k=x, l=x.+1)) => AsTable,
+                     :a => ByRow(x -> (s=x, t=x+1)) => :z) ==
+              DataFrame(a_function=[[1, 2], [2, 3]], x1=[1, 2], x2=[2, 3],
+                        p=[1, 2], q=[2, 3], s=[1, 2], t=[2, 3], k=[1, 2], l=[2, 3],
+                        z=[(s=1, t=2), (s=2, t=3)])
+        @test transform(df, :a => ByRow(x -> [x,x+1]),
+                        :a => ByRow(x -> [x, x+1]) => AsTable,
+                        :a => ByRow(x -> [x, x+1]) => [:p, :q],
+                        :a => ByRow(x -> (s=x, t=x+1)) => AsTable,
+                        :a => (x -> (k=x, l=x.+1)) => AsTable,
+                        :a => ByRow(x -> (s=x, t=x+1)) => :z) ==
+              [df DataFrame(a_function=[[1, 2], [2, 3]], x1=[1, 2], x2=[2, 3],
+                            p=[1, 2], q=[2, 3], s=[1, 2], t=[2, 3], k=[1, 2], l=[2, 3],
+                            z=[(s=1, t=2), (s=2, t=3)])]
+        @test_throws ArgumentError select(df, :a => (x -> [(a=1,b=2), (a=1, b=2, c=3)]) => AsTable)
+        @test_throws ArgumentError select(df, :a => (x -> [(a=1,b=2), (a=1, c=3)]) => AsTable)
+        @test_throws ArgumentError combine(df, :a => (x -> (a=1,b=2)) => :x)
+    end
+end
+
+@testset "check correctness of duplicate column names" begin
+    for df in (DataFrame(a=1:2, b=3:4, c=5:6), view(DataFrame(a=1:3, b=3:5, c=5:7, d=11:13), 1:2, 1:3))
+        @test select(df, :b, :) == DataFrame(b=3:4, a=1:2, c=5:6)
+        @test select(df, :b => :c, :) == DataFrame(c=3:4, a=1:2, b=3:4)
+        @test_throws ArgumentError select(df, :b => [:c, :d], :)
+        @test_throws ArgumentError select(df, :a, :a => x -> (a=[1,2], b=[3,4]))
+        @test_throws ArgumentError select(df, :a, :a => (x -> (a=[1,2], b=[3,4])) => AsTable)
+        @test select(df, [:b, :a], :a => (x -> (a=[11,12], b=[13,14])) => AsTable, :) ==
+              DataFrame(b=[13, 14], a=[11, 12], c=[5, 6])
+        @test select(df, [:b, :a], :a => (x -> (a=[11,12], b=[13,14])) => [:b, :a], :) ==
+              DataFrame(b=[11, 12], a=[13, 14], c=[5, 6])
     end
 end
diff --git a/test/string.jl b/test/string.jl
index ea2e9b222a..2fd8b98dcc 100644
--- a/test/string.jl
+++ b/test/string.jl
@@ -169,19 +169,16 @@ end
     @test combine(gdf, :a) == combine(gdf, "a") == combine(gdf, [:a]) == combine(gdf, ["a"])
-    @test combine("a" => identity, gdf, ungroup=false) ==
-          combine(:a => identity, gdf, ungroup=false)
-    @test combine(["a"] => identity, gdf, ungroup=false) ==
-          combine([:a] => identity, gdf, ungroup=false)
-    @test combine(nrow => :n, gdf, ungroup=false) ==
-          combine(nrow => "n", gdf, ungroup=false)
-
-    @test combine("a" => identity, gdf) == combine(:a => identity, gdf) ==
-          combine(gdf, "a" => identity) == combine(gdf, :a => identity)
-    @test combine(["a"] => identity, gdf) == combine([:a] => identity, gdf) ==
-          combine(gdf, ["a"] => identity) == combine(gdf, [:a] => identity)
-    @test combine(nrow => :n, gdf) == combine(nrow => "n", gdf) ==
-          combine(gdf, nrow => :n) == combine(gdf, nrow => "n")
+    @test combine(gdf, "a" => identity, ungroup=false) ==
+          combine(gdf, :a => identity, ungroup=false)
+    @test combine(gdf, ["a"] => identity, ungroup=false) ==
+          combine(gdf, [:a] => identity, ungroup=false)
+    @test combine(gdf, nrow => :n, ungroup=false) ==
+          combine(gdf, nrow => "n", ungroup=false)
+
+    @test combine(gdf, "a" => identity) == combine(gdf, :a => identity)
+    @test combine(gdf, ["a"] => identity) == combine(gdf, [:a] => identity)
+    @test combine(gdf, nrow => :n) == combine(gdf, nrow => "n")
 end
 @testset "DataFrameRow" begin