Return all duplicated rows #924

Closed
arunsrinivasan opened this issue Oct 30, 2014 · 4 comments

Comments

@arunsrinivasan
Member

One relevant SO question. Also @matthieugomez brought this up a while ago (don't remember where).

Suppose we have a data.table:

require(data.table)
DT = data.table(x=c(1,1,2,3,3,4), y=1:6)
duplicated(DT, by="x")
# [1] FALSE  TRUE FALSE FALSE  TRUE FALSE

Edited to clarify: The new function implemented with this FR should instead return TRUE for every row that belongs to a group occurring more than once. That is, in this case:

# expected answer
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE

Right now, we can do this with:

DT[, rep.int(.N>1L, .N), by=x][, V1]
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
# or alternatively, as row numbers instead of a logical vector
DT[, .I[.N>1L], by=x][, V1]
# [1] 1 2 4 5

But it might be useful to have this as a separate function. Filing it so that we don't forget.
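
One caveat with the grouped one-liners above: results grouped with by= come back in order of each group's first appearance, so the logical vector lines up with the rows only when equal values of x sit next to each other, as they do in the example. A minimal sketch of an order-preserving alternative, assuming the fromLast argument of data.table's duplicated method (the name in_dup_group is just illustrative):

```r
library(data.table)

# Groups of x are deliberately not contiguous here.
DT <- data.table(x = c(1, 2, 1, 3, 3, 4), y = 1:6)

# A row belongs to a repeated group if it duplicates an earlier row
# or a later one. The result follows the original row order.
in_dup_group <- duplicated(DT, by = "x") |
  duplicated(DT, by = "x", fromLast = TRUE)

# The grouped form returns one block per group instead, so it no longer
# lines up with the rows of DT once the groups are interleaved.
grouped_form <- DT[, rep.int(.N > 1L, .N), by = x][, V1]
```

Subsetting with DT[in_dup_group] then keeps the rows in their original order.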


A more general (+useful) variant of this would be to subset only those groups whose .N > some_value. The example illustrated here has some_value = 1L.
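
As a rough sketch of that more general variant with existing syntax (the data and the threshold of 2L are made up for illustration):

```r
library(data.table)

DT <- data.table(x = c(1, 1, 2, 3, 3, 3, 4), y = 1:7)
some_value <- 2L

# Keep only the groups with more than some_value rows.
# Groups that fail the test return NULL and are dropped.
big_groups <- DT[, if (.N > some_value) .SD, by = x]

# Same rows, selected through their row numbers instead.
big_rows <- DT[DT[, .I[.N > some_value], by = x]$V1]
```

Setting some_value to 1L gives the "all duplicated rows" case from the example.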

@lianos
Contributor

lianos commented Oct 30, 2014

I agree that this is useful, but if duplicated is implemented this way by default it breaks what standard R does. Perhaps add a new argument to duplicated.data.table with a default value set so that duplicated continues to act as it does now, but one could explicitly set this argument so that duplicated works in the way you describe here.

@arunsrinivasan
Member Author

I intended it to be a new function, not a change to the default behavior of duplicated. I'll clarify this in the original post.

@matthieugomez
Contributor

I'd like that (I deleted my FR since it was a one-liner). Maybe with which = TRUE (the default) the function would return a boolean vector flagging these rows, while with which = FALSE it would return a data.table containing only these rows, with the variable "N" as the first column and rows sorted by (N, by). This is handy for examining duplicates.
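
To make the proposal above concrete, here is one possible sketch. The function name dup_rows and its cols argument are hypothetical and not part of data.table; it also assumes DT has no column already called N.

```r
library(data.table)

# Hypothetical helper, not a data.table function.
# cols: character vector of grouping columns.
dup_rows <- function(DT, cols, which = TRUE) {
  # Work on a copy so the caller's table is not modified by reference.
  out <- copy(DT)[, N := .N, by = cols]
  # := with by keeps the original row order, so this vector lines up with DT.
  if (which) return(out$N > 1L)
  out <- out[N > 1L]
  # Put N first, then sort by group size and the grouping columns.
  setcolorder(out, c("N", setdiff(names(out), "N")))
  setorderv(out, c("N", cols))
  out[]
}

DT <- data.table(x = c(1, 1, 2, 3, 3, 4), y = 1:6)
flags <- dup_rows(DT, "x")                 # logical vector, one entry per row
dups  <- dup_rows(DT, "x", which = FALSE)  # only the rows in repeated groups
```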

@MichaelChirico
Member

This form looks pretty idiomatic to me: DT[, .I[.N>1L], by=x][, V1]. It will be nicer once #788 is fulfilled, too: DT[, .I, by=x, having=.N>1][, V1], which can probably be optimized pretty well too.

Not sure it's worth adding a new function, and it would be weird to make duplicated() do this since it's not how it works for the data.frame method.

Please open a new issue if there's strong disagreement with the above.
