Return all duplicated rows #924

arunsrinivasan · 2014-10-30T21:09:41Z

One relevant SO question. Also @matthieugomez brought this up a while ago (don't remember where).

Suppose we have a data.table:

require(data.table)
DT = data.table(x=c(1,1,2,3,3,4), y=1:6)
duplicated(DT, by="x")
# [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

Edited to clarify: The new function implemented with this FR should instead return TRUE for every row belonging to a group that occurs more than once. That is, in this case:

# expected answer
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

Right now, we can do this by doing:

DT[, rep.int(.N>1L, .N), by=x][, V1]
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
# or alternatively
DT[, .I[.N>1L], by=x][, V1]
# [1] 1 2 4 5

But it might be useful to have this as a separate function. Filing it so that we don't forget.
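As a rough sketch of what such a separate function could look like: the helper below combines a forward and a backward `duplicated()` scan, using the `by` and `fromLast` arguments of the data.table method. The name `all_duplicated` is hypothetical and not part of data.table.

```r
library(data.table)

# Hypothetical helper (not data.table API): flags every row whose `by`
# key occurs more than once, including the first occurrence.
all_duplicated <- function(DT, by = names(DT)) {
  duplicated(DT, by = by) | duplicated(DT, by = by, fromLast = TRUE)
}

DT <- data.table(x = c(1, 1, 2, 3, 3, 4), y = 1:6)
all_duplicated(DT, by = "x")
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
```

Unlike the grouped `rep.int(.N>1L, .N)` form, which returns values in group order, this variant keeps the result aligned with the original row order even when the rows of a group are not adjacent.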

A more general (+useful) variant of this would be to subset only those groups whose .N > some_value. The example illustrated here has some_value = 1L.
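The general variant can already be written with the same `.I` idiom. A minimal sketch, with an illustrative data set and threshold (`some_value <- 2L` is only an example value):

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 3, 3, 4), y = 1:7)
some_value <- 2L  # illustrative threshold

# Row indices of every group with more than `some_value` rows
idx <- DT[, .I[.N > some_value], by = x]$V1

# Subset the original table to those rows (here, the x == 3 group)
big_groups <- DT[idx]
big_groups
```

An equivalent form is `DT[, if (.N > some_value) .SD, by = x]`, which returns the grouping column first, followed by the remaining columns.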


lianos · 2014-10-30T23:27:28Z

I agree that this is useful, but if duplicated is implemented this way by default it breaks what standard R does. Perhaps add a new argument to duplicated.data.table with a default value set so that duplicated continues to act as it does now, but one could explicitly set this argument so that duplicated works in the way you describe here.

arunsrinivasan · 2014-10-31T00:11:03Z

I intended it to be a new function, not a change to the default behavior of duplicated. I'll clarify this in the original post.

matthieugomez · 2014-11-04T16:12:21Z

I'd like that (I deleted my FR since it was a one-liner). Maybe with which = TRUE (the default) the function would return a boolean vector flagging these rows, while with which = FALSE it would return a data.table containing only these rows, with the variable "N" as the first column and rows sorted by (N, by). This is handy for examining duplicates.
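One possible reading of this proposal, sketched under assumptions: the function name `dup_rows` and the arguments `cols` and `which` are hypothetical, chosen only to mirror the description above, and nothing like this exists in data.table.

```r
library(data.table)

# Hypothetical sketch of the proposal (not data.table API).
# which = TRUE  -> logical vector flagging rows in groups of size > 1
# which = FALSE -> those rows only, with group size N as the first
#                  column, sorted by N and then by the grouping columns
dup_rows <- function(DT, cols, which = TRUE) {
  is_dup <- duplicated(DT, by = cols) |
    duplicated(DT, by = cols, fromLast = TRUE)
  if (which) return(is_dup)
  out <- DT[is_dup]                 # subsetting copies, DT is untouched
  out[, N := .N, by = cols]         # group size for each retained row
  setcolorder(out, "N")             # move N to the front
  setorderv(out, c("N", cols))      # sort by (N, cols)
  out[]
}

DT <- data.table(x = c(1, 1, 2, 3, 3, 4), y = 1:6)
dup_rows(DT, "x")
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
res <- dup_rows(DT, "x", which = FALSE)
res
```

In the `which = FALSE` case the result holds the four rows with x equal to 1 or 3, each with N = 2.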

MichaelChirico · 2024-04-07T05:35:06Z

This form looks pretty idiomatic to me: DT[, .I[.N>1L], by=x][, V1]. It will be nicer once #788 is fulfilled, too: DT[, .I, by=x, having=.N>1][, V1], which can probably be optimized pretty well too.

Not sure it's worth adding a new function, and it would be weird to make duplicated() do this since it's not how it works for the data.frame method.

Please open a new issue if there's strong disagreement with the above.

arunsrinivasan added the feature request label Nov 10, 2014

MichaelChirico mentioned this issue Dec 4, 2020

Allow duplicated() to show all duplicates #4828

Closed

MichaelChirico closed this as completed Apr 7, 2024
