names(.SD) should work #4163

Merged
merged 49 commits into from
Mar 20, 2024
Changes from 44 commits
316c42b
Update data.table.R
ColeMiller1 Jan 8, 2020
e9ae7d3
Update tests.Rraw
ColeMiller1 Jan 8, 2020
17e80c4
Update data.table.R
ColeMiller1 Jan 8, 2020
d1c7a99
Update tests.Rraw
ColeMiller1 Jan 8, 2020
9d26aa7
Update datatable-reference-semantics.Rmd
ColeMiller1 Jan 8, 2020
54f35f6
Update assign.Rd
ColeMiller1 Jan 8, 2020
3c68d6e
Update NEWS.md
ColeMiller1 Jan 8, 2020
a009df0
Update NEWS.md
ColeMiller1 Jan 8, 2020
7bd494d
Merge branch 'master' into names_SD
ColeMiller1 Jan 8, 2020
18ccd2f
Update data.table.R
ColeMiller1 Jan 9, 2020
15e95f8
Update tests.Rraw
ColeMiller1 Jan 9, 2020
112f81d
Update tests.Rraw
ColeMiller1 Jan 9, 2020
2c39630
Update data.table.R
ColeMiller1 Jan 10, 2020
fcb270a
Update tests.Rraw
ColeMiller1 Jan 10, 2020
21d3a93
replace iris with raw dataset
ColeMiller1 Jan 10, 2020
10b36db
Update tests.Rraw
ColeMiller1 Jan 14, 2020
7993419
update replace_names_sd and made .SD := not work
ColeMiller1 Jan 19, 2020
269967e
change .SD to names(.SD)
ColeMiller1 Jan 19, 2020
76b5e64
update typo; change .SD to names(.SD)
ColeMiller1 Jan 19, 2020
ed879f6
update to names(.SD)
ColeMiller1 Jan 19, 2020
1fbd631
include names(.SD) and fx to .SD usage
ColeMiller1 Jan 21, 2020
8df7af5
Updates news to names(.SD)
ColeMiller1 Jan 21, 2020
8c2d273
Update typo.
ColeMiller1 Jan 30, 2020
7267766
tweak NEWS
MichaelChirico Feb 2, 2020
197cb54
minor grammar
MichaelChirico Feb 2, 2020
8d7f232
jans comment
MichaelChirico Feb 2, 2020
29cc659
jan's comment (ii)
MichaelChirico Feb 2, 2020
f7adef8
added "footnote"
MichaelChirico Feb 2, 2020
9469e4e
Add is.name(e[[2L]])
ColeMiller1 Feb 2, 2020
3ba5518
Put tests above Add new tests here
ColeMiller1 Feb 2, 2020
8e1c109
added test to test names(.SD(2))
ColeMiller1 Feb 2, 2020
2ef29e7
Merge branch 'master' into names_SD
ColeMiller1 Feb 2, 2020
c389b3c
include .SDcols in example for assign
ColeMiller1 Feb 2, 2020
2c3fb51
included .SDcols = function example
ColeMiller1 Feb 2, 2020
a2b568b
Merge branch 'master' into names_SD
ColeMiller1 Feb 17, 2020
82b7cfd
Merge branch 'master' into names_SD
ColeMiller1 Feb 27, 2020
f5ab271
test 2138 is greater than 2137
ColeMiller1 Feb 27, 2020
3be7e22
Merge branch 'master' into names_SD
MichaelChirico Feb 26, 2024
9d816d7
Merge branch 'master' into names_SD
MichaelChirico Feb 27, 2024
be720a3
bad merge
MichaelChirico Feb 27, 2024
7b0f8f1
Make updates per Michael's comments.
ColeMiller1 Mar 19, 2024
3635c3d
Added test where .SD is used as well as some columns not in .SD.
ColeMiller1 Mar 19, 2024
5fec7bc
Mention count of reactions in issue
MichaelChirico Mar 19, 2024
7ae1ea3
small copy-edit
MichaelChirico Mar 19, 2024
2cb48ea
more specific
MichaelChirico Mar 19, 2024
5a587e7
specify LHS/RHS
MichaelChirico Mar 19, 2024
212a774
Simplify implementation to probe for names(.SD) and new test
ColeMiller1 Mar 20, 2024
b91dab5
fine-tune comment
MichaelChirico Mar 20, 2024
8fe60ee
Merge branch 'master' into names_SD
MichaelChirico Mar 20, 2024
2 changes: 2 additions & 0 deletions NEWS.md
@@ -18,6 +18,8 @@

3. Namespace-qualifying `data.table::shift()`, `data.table::first()`, or `data.table::last()` will not deactivate GForce, [#5942](https://github.com/Rdatatable/data.table/issues/5942). Thanks @MichaelChirico for the proposal and fix. Namespace-qualifying other calls like `stats::sum()`, `base::prod()`, etc., continue to work as an escape valve to avoid GForce, e.g. to ensure S3 method dispatch.

4. Using `dt[, names(.SD) := lapply(.SD, fx)]` now works, [#795](https://github.com/Rdatatable/data.table/issues/795) -- one of our [most-requested issues (see #3189)](https://github.com/Rdatatable/data.table/issues/3189). Thanks to @brodieG for the report, 20 or so others for chiming in, and @ColeMiller1 for the PR.

## BUG FIXES

1. `unique()` returns a copy in the case when `nrow(x) <= 1` instead of a mutable alias, [#5932](https://github.com/Rdatatable/data.table/pull/5932). This is consistent with existing `unique()` behavior when the input has no duplicates but more than one row. Thanks to @brookslogan for the report and @dshemetov for the fix.
10 changes: 8 additions & 2 deletions R/data.table.R
@@ -1122,8 +1122,14 @@ replace_dot_alias = function(e) {
if (is.name(lhs)) {
lhs = as.character(lhs)
} else {
# e.g. (MyVar):= or get("MyVar"):=
lhs = eval(lhs, parent.frame(), parent.frame())
# i.e. lhs is names(.SD) || setdiff(names(.SD), cols) || (cols)
replace_names_sd = function(e, cols){
if (length(e) == 1L) return(e)
if (e %iscall% "names" && is.name(e2 <- e[[2L]]) && e2 == ".SD") return(cols)
for (i in 2:length(e)) if (!is.null(e[[i]])) e[[i]] = replace_names_sd(e[[i]], cols)
e
}
lhs = eval(replace_names_sd(lhs, sdvars), parent.frame(), parent.frame())
Review comment (Member):

A simpler implementation might be the following:

e <- copyenv(parent.frame()) # pseudocode
e$.SD <- setNames(logical(length(sdvars)), sdvars) # or vector("list"), or even vector("raw") to really scrimp on storage
lhs = eval(lhs, e, e)

WDYT?
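
For readers following this thread, here is a minimal base-R sketch of the suggestion above. It is an illustration under stated assumptions, not the code that was merged: `copyenv` is pseudocode in the comment, so the sketch uses a child environment of the caller's frame instead of a copy, and the helper name `eval_lhs_with_fake_sd` is hypothetical.

```r
# Hypothetical helper (not part of data.table): evaluate an unevaluated LHS
# expression against a stand-in .SD whose only useful property is its names.
eval_lhs_with_fake_sd = function(lhs, sdvars, caller = parent.frame()) {
  force(caller)
  # A child environment avoids copying the caller's frame; any symbol other
  # than .SD is still looked up in `caller` and its enclosures.
  e = new.env(parent = caller)
  e$.SD = setNames(logical(length(sdvars)), sdvars)
  eval(lhs, e)
}

sdvars = c("a", "b")
lhs_plain   = eval_lhs_with_fake_sd(quote(names(.SD)), sdvars)
lhs_setdiff = eval_lhs_with_fake_sd(quote(setdiff(names(.SD), "a")), sdvars)
lhs_paste   = eval_lhs_with_fake_sd(quote(paste(names(.SD), "max", sep = "_")), sdvars)
```

Because the stand-in is an ordinary named vector, any expression that only reads `names(.SD)` evaluates without special-casing, which is the appeal of this approach over walking the call.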

}
} else {
# `:=`(c2=1L,c3=2L,...)
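
The recursive rewrite in the hunk above can be tried outside data.table. The following is a stand-alone sketch, assuming base R only: the internal `%iscall%` operator is swapped for an explicit `is.call()`/`identical()` check, so this is an approximation of the diff's logic rather than the package's code.

```r
# Stand-alone approximation of replace_names_sd() from the diff.
replace_names_sd = function(e, cols) {
  # Symbols and constants have no sub-expressions to walk.
  if (!is.call(e)) return(e)
  # The call names(.SD), with .SD as a bare symbol, becomes the column names.
  if (identical(e[[1L]], quote(names)) && length(e) == 2L &&
      identical(e[[2L]], quote(.SD))) return(cols)
  # Otherwise recurse into the arguments; element 1 is the function itself.
  for (i in seq_along(e)[-1L]) {
    if (!is.null(e[[i]])) e[[i]] = replace_names_sd(e[[i]], cols)
  }
  e
}

cols = c("a", "b")
direct    = replace_names_sd(quote(names(.SD)), cols)
nested    = eval(replace_names_sd(quote(setdiff(names(.SD), "a")), cols))
suffixed  = eval(replace_names_sd(quote(paste(names(.SD), "max", sep = "_")), cols))
# .SD(2) is a call, not the symbol .SD, so nothing is replaced here.
untouched = replace_names_sd(quote(names(.SD(2))), cols)
```

The last case mirrors test 2247.11: the expression is left alone, so evaluating it later fails with the usual "could not find function" error.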
28 changes: 28 additions & 0 deletions inst/tests/tests.Rraw
@@ -18294,3 +18294,31 @@ test(2246.1, DT[, data.table::shift(b), by=a], DT[, shift(b), by=a], output="GFo
test(2246.2, DT[, data.table::first(b), by=a], DT[, first(b), by=a], output="GForce TRUE")
test(2246.3, DT[, data.table::last(b), by=a], DT[, last(b), by=a], output="GForce TRUE")
options(old)

# make names(.SD) work - issue #795
dt = data.table(a = 1:4, b = 5:8)

test(2247.01, dt[, names(.SD) := lapply(.SD, '*', 2), .SDcols = 1L], data.table(a = 1:4 * 2, b = 5:8))
test(2247.02, dt[, names(.SD) := lapply(.SD, '*', 2), .SDcols = 2L], data.table(a = 1:4 * 2, b = 5:8 * 2))
test(2247.03, dt[, names(.SD) := lapply(.SD, as.integer)], data.table(a = as.integer(1:4 * 2), b = as.integer(5:8 * 2)))
test(2247.04, dt[1L, names(.SD) := lapply(.SD, '+', 2L)], data.table(a = as.integer(c(4, 2:4 * 2)), b = as.integer(c(12, 6:8 * 2))))
test(2247.05, dt[, setdiff(names(.SD), 'a') := NULL], data.table(a = as.integer(c(4, 2:4 * 2))))
test(2247.06, dt[, c(names(.SD)) := NULL], null.data.table())

dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'))
test(2247.07, dt[, names(.SD) := lapply(.SD, max), by = grp], data.table(a = c(2L, 2L, 3L, 4L), b = c(6L, 6L, 7L, 8L), grp = c('a', 'a', 'b', 'c')))

dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'))
keep = c('a', 'b')
test(2247.08, dt[, names(.SD) := NULL, .SDcols = !keep], data.table(a = 1:4, b = 5:8))

dt = data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'))
test(2247.09, dt[, paste(names(.SD), 'max', sep = '_') := lapply(.SD, max), by = grp] , data.table(a = 1:4, b = 5:8, grp = c('a', 'a', 'b', 'c'), a_max = c(2L, 2L, 3L, 4L), b_max = c(6L, 6L, 7L, 8L)))

dt = data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b'))
test(2247.10, dt[1:2, paste(names(.SD), 'max', sep = '_') := lapply(.SD, max), by = grp], data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b'), a_max = c(2L, 2L, NA_integer_), b_max = c(6L, 6L, NA_integer_)))
test(2247.11, dt[, names(.SD(2)) := lapply(.SD, .I)], error = 'could not find function ".SD"')

dt = data.table(a = 1:3, b = 5:7, grp = c('a', 'a', 'b'))
test(2247.12, dt[, names(.SD) := lapply(.SD, \(x) x + b), .SDcols = "a"], data.table(a = 1:3 + 5:7, b = 5:7, grp = c('a', 'a', 'b')))

3 changes: 3 additions & 0 deletions man/assign.Rd
@@ -26,6 +26,9 @@
# LHS2 = RHS2,
# ...), by = ...]

# 3. Multiple columns in place
# DT[i, names(.SD) := lapply(.SD, fx), by = ..., .SDcols = ...]

set(x, i = NULL, j, value)
}
\arguments{
17 changes: 17 additions & 0 deletions vignettes/datatable-reference-semantics.Rmd
@@ -258,6 +258,23 @@ flights[, c("speed", "max_speed", "max_dep_delay", "max_arr_delay") := NULL]
head(flights)
```

#### -- How can we update multiple existing columns in place using `.SD`?

```{r}
flights[, names(.SD) := lapply(.SD, as.factor), .SDcols = is.character]
Review comment (Member):

Something I can imagine will happen soon is a user trying:

x[, names(.SD)[1:5] := ...]

Does that work already? If so, please add a test. If not, no need to handle it until it's requested later unless you see an easy fix.

```
Let's clean up again and convert our newly-made factor columns back into character columns. This time we will make use of `.SDcols` accepting a function to decide which columns to include. In this case, `is.factor()` will return the columns which are factors. For more on the **S**ubset of the **D**ata, there is also an [SD Usage vignette](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-sd-usage.html).

Sometimes, it is also nice to keep track of columns that we transform. That way, even after we convert our columns we would be able to call the specific columns we were updating.
```{r}
factor_cols <- sapply(flights, is.factor)
flights[, names(.SD) := lapply(.SD, as.character), .SDcols = factor_cols]
str(flights[, ..factor_cols])
```
#### {.bs-callout .bs-callout-info}

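* We could also have put a character vector of the column names on the `LHS` instead of `names(.SD)`, e.g. `(names(which(factor_cols)))`; `factor_cols` itself is a logical vector, which `:=` does not accept on the `LHS`.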

## 3. `:=` and `copy()`

`:=` modifies the input object by reference. Apart from the features we have discussed already, sometimes we might want to use the update by reference feature for its side effect. At other times it may not be desirable to modify the original object, in which case we can use the `copy()` function, as we will see in a moment.
52 changes: 24 additions & 28 deletions vignettes/datatable-sd-usage.Rmd
@@ -77,7 +77,15 @@ The first way to impact what `.SD` is is to limit the _columns_ contained in `.S
Pitching[ , .SD, .SDcols = c('W', 'L', 'G')]
```

This is just for illustration and was pretty boring. But even this simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:
This is just for illustration and was pretty boring. In addition to accepting a character vector, `.SDcols` also accepts:

1. any function such as `is.character` to filter _columns_
2. the function^{*} `patterns()` to filter by _column names_
Review comment (Member):

note to self to check the GL CI output to make sure the ^{*} looks as intended on output

3. integer and logical vectors

*see `?patterns` for more details
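
As a quick illustration of the forms listed above, the sketch below selects columns from a small made-up table (`DT` and its values are hypothetical, not the vignette's `Pitching` data). It assumes a data.table version in which `.SDcols` accepts a function, and returns `names(.SD)` so the selection is easy to inspect.

```r
library(data.table)

# Hypothetical toy table standing in for the vignette data.
DT = data.table(teamID = c("BS1", "CH1"), lgID = c("NL", "NL"),
                W = c(20L, 19L), L = c(10L, 9L))

by_name    = DT[, names(.SD), .SDcols = c("W", "L")]                 # character vector
by_fun     = DT[, names(.SD), .SDcols = is.character]                # function applied to each column
by_pattern = DT[, names(.SD), .SDcols = patterns("ID$")]             # regex on column names
by_int     = DT[, names(.SD), .SDcols = 3:4]                         # column positions
by_lgl     = DT[, names(.SD), .SDcols = c(FALSE, FALSE, TRUE, TRUE)] # include/exclude flags
```

Here the function and the pattern both pick `teamID` and `lgID`, while the other three forms pick `W` and `L`.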

This simple usage lends itself to a wide variety of highly beneficial / ubiquitous data manipulation operations:

## Column Type Conversion

@@ -91,52 +99,40 @@ We notice that the following columns are stored as `character` in the `Teams` da
# teamIDretro: Team ID used by Retrosheet
fkt = c('teamIDBR', 'teamIDlahman45', 'teamIDretro')
# confirm that they're stored as `character`
Teams[ , sapply(.SD, is.character), .SDcols = fkt]
str(Teams[ , ..fkt])
```

If you're confused by the use of `sapply` here, note that it's quite similar for base R `data.frames`:

```{r identify_factors_as_df}
setDF(Teams) # convert to data.frame for illustration
sapply(Teams[ , fkt], is.character)
setDT(Teams) # convert back to data.table
```

The key to understanding this syntax is to recall that a `data.table` (as well as a `data.frame`) can be considered as a `list` where each element is a column -- thus, `sapply`/`lapply` applies the `FUN` argument (in this case, `is.character`) to each _column_ and returns the result as `sapply`/`lapply` usually would.

The syntax to now convert these columns to `factor` is very similar -- simply add the `:=` assignment operator:
The syntax to now convert these columns to `factor` is simple:

```{r assign_factors}
Teams[ , (fkt) := lapply(.SD, factor), .SDcols = fkt]
Teams[ , names(.SD) := lapply(.SD, factor), .SDcols = patterns('teamID')]
# print out the first column to demonstrate success
head(unique(Teams[[fkt[1L]]]))
```

Note that we must wrap `fkt` in parentheses `()` to force `data.table` to interpret this as column names, instead of trying to assign a column named `'fkt'`.
Note:

Actually, the `.SDcols` argument is quite flexible; above, we supplied a `character` vector of column names. In other situations, it is more convenient to supply an `integer` vector of column _positions_ or a `logical` vector dictating include/exclude for each column. `.SDcols` even accepts regular expression-based pattern matching.
1. The `:=` is an assignment operator to update the `data.table` in place without making a copy. See [reference semantics](https://cran.r-project.org/web/packages/data.table/vignettes/datatable-reference-semantics.html) for more.
2. `names(.SD)` indicates which columns we are updating - in this case we update the entire `.SD`.
3. `lapply()` loops through each column of the `.SD` and converts the column to a factor.
4. We use `.SDcols` to select only the columns whose names match the pattern `teamID`.
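
The four notes above can be checked on a self-contained example. This is a sketch on a made-up table (`Teams_toy` is hypothetical and much smaller than the real `Teams` data), and it assumes a data.table release that includes this PR, since `names(.SD)` on the left-hand side of `:=` is the feature being added.

```r
library(data.table)  # assumes a version where names(.SD) := ... is supported

# Hypothetical stand-in for the Teams table.
Teams_toy = data.table(
  yearID      = c(1871L, 1871L),
  teamIDBR    = c("BOS", "CHI"),
  teamIDretro = c("BS1", "CH1")
)

# Only columns whose names match 'teamID' are in .SD, so only they are updated.
Teams_toy[, names(.SD) := lapply(.SD, factor), .SDcols = patterns("teamID")]

col_classes = sapply(Teams_toy, class)
```

`yearID` does not match the pattern, so it stays an integer column while the two `teamID*` columns become factors.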

Again, the `.SDcols` argument is quite flexible; above, we supplied `patterns` but we could have also supplied `fkt` or any `character` vector of column names. In other situations, it is more convenient to supply an `integer` vector of column _positions_ or a `logical` vector dictating include/exclude for each column. Finally, the use of a function to filter columns is very helpful.

For example, we could do the following to convert all `factor` columns to `character`:

```{r sd_as_logical}
# while .SDcols accepts a logical vector,
# := does not, so we need to convert to column
# positions with which()
fkt_idx = which(sapply(Teams, is.factor))
Teams[ , (fkt_idx) := lapply(.SD, as.character), .SDcols = fkt_idx]
head(unique(Teams[[fkt_idx[1L]]]))
fct_idx = Teams[, which(sapply(.SD, is.factor))] # column numbers to show the class changing
str(Teams[[fct_idx[1L]]])
Teams[ , names(.SD) := lapply(.SD, as.character), .SDcols = is.factor]
str(Teams[[fct_idx[1L]]])
```

Lastly, we can do pattern-based matching of columns in `.SDcols` to convert all columns whose names contain `team` back to `factor`:

```{r sd_patterns}
Teams[ , .SD, .SDcols = patterns('team')]

# now convert these columns to factor;
# value = TRUE in grep() is for the LHS of := to
# get column names instead of positions
team_idx = grep('team', names(Teams), value = TRUE)
Teams[ , (team_idx) := lapply(.SD, factor), .SDcols = team_idx]
Teams[ , names(.SD) := lapply(.SD, factor), .SDcols = patterns('team')]
```

** A proviso to the above: _explicitly_ using column numbers (like `DT[ , (1) := rnorm(.N)]`) is bad practice and can lead to silently corrupted code over time if column positions change. Even implicitly using numbers can be dangerous if we don't keep smart/strict control over the ordering of when we create the numbered index and when we use it.
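
To make the proviso concrete, here is a small sketch of how a stored column position can go stale. The table `DT` and the variable names are made up for illustration.

```r
library(data.table)

DT = data.table(id = 1:3, x = c(0.5, 1.5, 2.5), y = c(10, 20, 30))

# Index computed early on: 'x' is currently the second column.
x_pos = which(names(DT) == "x")

# Some later step reorders the columns by reference.
setcolorder(DT, c("id", "y", "x"))

# The stored position now points at 'y', which is overwritten without warning.
DT[, (x_pos) := 0]

# Referring to the column by name is unaffected by reordering.
x_name = "x"
DT[, (x_name) := x * 2]
```

After these steps `y` has been zeroed even though the intent was to modify `x`; the name-based assignment at the end updates the intended column regardless of its position.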