diff --git a/NEWS.md b/NEWS.md index cb020d1e1..d751b1652 100644 --- a/NEWS.md +++ b/NEWS.md @@ -22,6 +22,8 @@ 5. `transpose` gains `list.cols=` argument, [#5639](https://github.com/Rdatatable/data.table/issues/5639). Use this to return output with list columns and avoids type promotion (an exception is `factor` columns which are promoted to `character` for consistency between `list.cols=TRUE` and `list.cols=FALSE`). This is convenient for creating a row-major representation of a table. Thanks to @MLopez-Ibanez for the request, and Benjamin Schwendinger for the PR. +6. Using `dt[, names(.SD) := lapply(.SD, fx)]` now works, [#795](https://github.com/Rdatatable/data.table/issues/795) -- one of our [most-requested issues (see #3189)](https://github.com/Rdatatable/data.table/issues/3189). Thanks to @brodieG for the report, 20 or so others for chiming in, and @ColeMiller1 for the PR. + ## BUG FIXES 1. `unique()` returns a copy the case when `nrows(x) <= 1` instead of a mutable alias, [#5932](https://github.com/Rdatatable/data.table/pull/5932). This is consistent with existing `unique()` behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix. diff --git a/R/data.table.R b/R/data.table.R index c80e89f88..f7b9b4192 100644 --- a/R/data.table.R +++ b/R/data.table.R @@ -1122,8 +1122,8 @@ replace_dot_alias = function(e) { if (is.name(lhs)) { lhs = as.character(lhs) } else { - # e.g. (MyVar):= or get("MyVar"):= - lhs = eval(lhs, parent.frame(), parent.frame()) + # lhs is e.g. (MyVar) or get("MyVar") or names(.SD) || setdiff(names(.SD), cols) + lhs = eval(lhs, list(.SD = setNames(logical(length(sdvars)), sdvars)), parent.frame()) } } else { # `:=`(c2=1L,c3=2L,...) 
diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 4370fb888..fe68cc5de 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -18359,3 +18359,33 @@ test(2249.2, indices(DT[, .SD]), 'x') setindex(DT, y) test(2249.3, indices(DT), c('x', 'y')) test(2249.4, indices(DT[, .SD]), c('x', 'y')) + +# make names(.SD) work - issue #795 +dt = data.table(a = 1:4, b = 5:8) +test(2250.01, dt[, names(.SD) := lapply(.SD, '*', 2), .SDcols = 1L], data.table(a = 1:4 * 2, b = 5:8)) +test(2250.02, dt[, names(.SD) := lapply(.SD, '*', 2), .SDcols = 2L], data.table(a = 1:4 * 2, b = 5:8 * 2)) +test(2250.03, dt[, names(.SD) := lapply(.SD, as.integer)], data.table(a = as.integer(1:4 * 2), b = as.integer(5:8 * 2))) +test(2250.04, dt[1L, names(.SD) := lapply(.SD, '+', 2L)], data.table(a = as.integer(c(4, 2:4 * 2)), b = as.integer(c(12, 6:8 * 2)))) +test(2250.05, dt[, setdiff(names(.SD), 'a') := NULL], data.table(a = as.integer(c(4, 2:4 * 2)))) +test(2250.06, dt[, c(names(.SD)) := NULL], null.data.table()) + +dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c')) +test(2250.07, dt[, names(.SD) := lapply(.SD, max), by = grp], data.table(a = c(2L, 2L, 3L, 4L), b = c(6L, 6L, 7L, 8L), grp = c('a', 'a', 'b', 'c'))) + +dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c')) +keep = c('a', 'b') +test(2250.08, dt[, names(.SD) := NULL, .SDcols = !keep], data.table(a = 1:4, b = 5:8)) + +dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c')) +test(2250.09, dt[, paste(names(.SD), 'max', sep = '_') := lapply(.SD, max), by = grp] , data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'), a_max = c(2L, 2L, 3L, 4L), b_max = c(6L, 6L, 7L, 8L))) + +dt = data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b')) +test(2250.10, dt[1:2, paste(names(.SD), 'max', sep = '_') := lapply(.SD, max), by = grp], data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b'), a_max = c(2L, 2L, NA_integer_), b_max = c(6L, 6L, NA_integer_))) +test(2250.11, dt[, names(.SD(2)) := lapply(.SD, .I)], 
error = 'could not find function ".SD"') + +dt = data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b')) +test(2250.12, dt[, names(.SD) := lapply(.SD, \(x) x + b), .SDcols = "a"], data.table(a = 1:3 + 5:7, b = 5:7, grp = c('a', 'a', 'b'))) + + +dt = data.table(a = 1L, b = 2L, c = 3L, d = 4L, e = 5L, f = 6L) +test(2250.13, dt[, names(.SD)[1:5] := sum(.SD)], data.table(a = 21L, b = 21L, c = 21L, d = 21L, e = 21L, f = 6L)) diff --git a/man/assign.Rd b/man/assign.Rd index df255d395..62c8d6142 100644 --- a/man/assign.Rd +++ b/man/assign.Rd @@ -26,6 +26,9 @@ # LHS2 = RHS2, # ...), by = ...] +# 3. Multiple columns in place +# DT[i, names(.SD) := lapply(.SD, fx), by = ..., .SDcols = ...] + set(x, i = NULL, j, value) } \arguments{ diff --git a/vignettes/datatable-reference-semantics.Rmd b/vignettes/datatable-reference-semantics.Rmd index 7a9990ba4..b678c390e 100644 --- a/vignettes/datatable-reference-semantics.Rmd +++ b/vignettes/datatable-reference-semantics.Rmd @@ -258,6 +258,23 @@ flights[, c("speed", "max_speed", "max_dep_delay", "max_arr_delay") := NULL] head(flights) ``` +#### -- How can we update multiple existing columns in place using `.SD`? + +```{r} +flights[, names(.SD) := lapply(.SD, as.factor), .SDcols = is.character] +``` +Let's clean up again and convert our newly-made factor columns back into character columns. This time we will make use of `.SDcols` accepting a function to decide which columns to include. In this case, `is.factor()` will return the columns which are factors. For more on the **S**ubset of the **D**ata, there is also an [SD Usage vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html). + +Sometimes, it is also nice to keep track of columns that we transform. That way, even after we convert our columns we would be able to call the specific columns we were updating. 
+```{r} +factor_cols <- sapply(flights, is.factor) +flights[, names(.SD) := lapply(.SD, as.character), .SDcols = factor_cols] +str(flights[, ..factor_cols]) +``` +#### {.bs-callout .bs-callout-info} + +* We also could have used `(factor_cols)` on the `LHS` instead of `names(.SD)`. + ## 3. `:=` and `copy()` `:=` modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. And at other times it may not be desirable to modify the original object, in which case we can use `copy()` function, as we will see in a moment. diff --git a/vignettes/datatable-sd-usage.Rmd b/vignettes/datatable-sd-usage.Rmd index 5f0348e4f..09243c820 100644 --- a/vignettes/datatable-sd-usage.Rmd +++ b/vignettes/datatable-sd-usage.Rmd @@ -77,7 +77,15 @@ The first way to impact what `.SD` is is to limit the _columns_ contained in `.S Pitching[ , .SD, .SDcols = c('W', 'L', 'G')] ``` -This is just for illustration and was pretty boring. But even this simply usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations: +This is just for illustration and was pretty boring. In addition to accepting a character vector, `.SDcols` also accepts: + +1. any function such as `is.character` to filter _columns_ +2. the function^{*} `patterns()` to filter _column names_ by regular expression +3. 
integer and logical vectors + +*see `?patterns` for more details + +This simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations: ## Column Type Conversion @@ -91,52 +99,40 @@ We notice that the following columns are stored as `character` in the `Teams` da # teamIDretro: Team ID used by Retrosheet fkt = c('teamIDBR', 'teamIDlahman45', 'teamIDretro') # confirm that they're stored as `character` -Teams[ , sapply(.SD, is.character), .SDcols = fkt] +str(Teams[ , ..fkt]) ``` -If you're confused by the use of `sapply` here, note that it's quite similar for base R `data.frames`: - -```{r identify_factors_as_df} -setDF(Teams) # convert to data.frame for illustration -sapply(Teams[ , fkt], is.character) -setDT(Teams) # convert back to data.table -``` - -The key to understanding this syntax is to recall that a `data.table` (as well as a `data.frame`) can be considered as a `list` where each element is a column -- thus, `sapply`/`lapply` applies the `FUN` argument (in this case, `is.character`) to each _column_ and returns the result as `sapply`/`lapply` usually would. - -The syntax to now convert these columns to `factor` is very similar -- simply add the `:=` assignment operator: +The syntax to now convert these columns to `factor` is simple: ```{r assign_factors} -Teams[ , (fkt) := lapply(.SD, factor), .SDcols = fkt] +Teams[ , names(.SD) := lapply(.SD, factor), .SDcols = patterns('teamID')] # print out the first column to demonstrate success head(unique(Teams[[fkt[1L]]])) ``` -Note that we must wrap `fkt` in parentheses `()` to force `data.table` to interpret this as column names, instead of trying to assign a column named `'fkt'`. +Note: -Actually, the `.SDcols` argument is quite flexible; above, we supplied a `character` vector of column names. In other situations, it is more convenient to supply an `integer` vector of column _positions_ or a `logical` vector dictating include/exclude for each column. 
`.SDcols` even accepts regular expression-based pattern matching. +1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See [reference semantics](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html) for more. +2. The LHS, `names(.SD)`, indicates which columns we are updating - in this case we update the entire `.SD`. +3. The RHS, `lapply()`, loops through each column of the `.SD` and converts the column to a factor. +4. We use `.SDcols` to select only the columns whose names match the pattern `teamID`. + +Again, the `.SDcols` argument is quite flexible; above, we supplied `patterns` but we could have also supplied `fkt` or any `character` vector of column names. In other situations, it is more convenient to supply an `integer` vector of column _positions_ or a `logical` vector dictating include/exclude for each column. Finally, the use of a function to filter columns is very helpful. For example, we could do the following to convert all `factor` columns to `character`: ```{r sd_as_logical} -# while .SDcols accepts a logical vector, -# := does not, so we need to convert to column -# positions with which() -fkt_idx = which(sapply(Teams, is.factor)) -Teams[ , (fkt_idx) := lapply(.SD, as.character), .SDcols = fkt_idx] -head(unique(Teams[[fkt_idx[1L]]])) +fct_idx = Teams[, which(sapply(.SD, is.factor))] # column numbers to show the class changing +str(Teams[[fct_idx[1L]]]) +Teams[ , names(.SD) := lapply(.SD, as.character), .SDcols = is.factor] +str(Teams[[fct_idx[1L]]]) ``` Lastly, we can do pattern-based matching of columns in `.SDcols` to select all columns which contain `team` back to `factor`: ```{r sd_patterns} Teams[ , .SD, .SDcols = patterns('team')] - -# now convert these columns to factor; -# value = TRUE in grep() is for the LHS of := to -# get column names instead of positions -team_idx = grep('team', names(Teams), value = TRUE) -Teams[ , (team_idx) := lapply(.SD, factor), .SDcols 
= team_idx] +Teams[ , names(.SD) := lapply(.SD, factor), .SDcols = patterns('team')] ``` ** A proviso to the above: _explicitly_ using column numbers (like `DT[ , (1) := rnorm(.N)]`) is bad practice and can lead to silently corrupted code over time if column positions change. Even implicitly using numbers can be dangerous if we don't keep smart/strict control over the ordering of when we create the numbered index and when we use it.