uniqueN() is very slow compared to length(unique()) #3739
Comments
Well, I see. For very large vectors, uniqueN() is faster. Maybe for small atomic vectors we could just use length(unique()) instead? (Uses 4 threads on macOS.)
library(data.table)
set.seed(1000)
# small vector, uniqueN() is slower
x <- sample(1:1e5, size = 1e2, replace = TRUE)
microbenchmark::microbenchmark(
uniqueN(x),
length(unique(x))
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> uniqueN(x) 20.744 21.3460 32.05373 21.7050 22.226 1010.481 100
#> length(unique(x)) 4.757 5.3565 6.18967 5.9215 6.698 18.199 100
# big vector, uniqueN() is faster
x <- sample(1:1e5, size = 1e6, replace = TRUE)
microbenchmark::microbenchmark(
uniqueN(x),
length(unique(x))
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> uniqueN(x) 9.176062 10.31124 11.55733 10.80676 12.44610 16.72396 100
#> length(unique(x)) 20.834950 22.74216 26.23369 25.77416 27.84571 43.23249 100
Created on 2019-08-02 by the reprex package (v0.2.1) |
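To make the suggestion above concrete, here is a minimal sketch of a size-based dispatch. The helper name count_distinct and the cutoff of 1e4 are hypothetical and not part of data.table; the cutoff is an unmeasured placeholder that would need benchmarking on the target machine and thread count.

```r
# Hypothetical wrapper (NOT a data.table function): choose a strategy by size.
# `cutoff` is an arbitrary placeholder, not a measured crossover point.
count_distinct <- function(x, cutoff = 1e4L) {
  if (is.atomic(x) && length(x) < cutoff) {
    # Small atomic vector: base R avoids uniqueN()'s fixed overhead.
    return(length(unique(x)))
  }
  # Large input (or non-atomic, e.g. a data.table): defer to uniqueN().
  data.table::uniqueN(x)
}

# Small input takes the base R path; NA counts as one distinct value.
count_distinct(c(1L, 2L, 2L, NA))
```

Both paths count NA as a distinct value by default, so the result should not depend on which branch is taken.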
It makes sense to include the number of cores in those timings. |
Thanks for the reminder. It uses 4 threads, and I've included the thread number in my original post above. |
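For anyone reproducing these timings, a rough sketch of how the thread count could be reported and varied with data.table's getDTthreads() and setDTthreads(). The choice of 1 and 2 threads and of 20 repetitions is arbitrary, and system.time() is used here only to avoid an extra dependency; it is coarser than microbenchmark.

```r
library(data.table)

set.seed(1000)
x <- sample(1:1e5, size = 1e6, replace = TRUE)

# Remember the current setting so it can be restored afterwards.
old_threads <- getDTthreads()

for (n in c(1L, 2L)) {
  setDTthreads(n)
  # Time 20 repeated calls; a single call is too short to time reliably.
  elapsed <- system.time(for (i in 1:20) uniqueN(x))[["elapsed"]]
  # Report the thread count data.table actually uses, which may be capped.
  cat(sprintf("threads = %d: %.3f s for 20 calls\n", getDTthreads(), elapsed))
}

# Restore the original thread setting.
setDTthreads(old_threads)
```

Printing getDTthreads() next to each timing keeps the result interpretable on machines with a different number of cores.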
Another example, perhaps also needing "n-cores" expansion: https://stackoverflow.com/questions/60623235/r-why-dplyr-counts-unique-values-n-distinct-by-groups-faster-than-data-table |
The same happens to me. DT[,if(.N==uniqueN(test)) .SD ,by=ID] But the former is 10 times slower. |
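For readers unfamiliar with the pattern, the expression above keeps only the groups whose test values are all distinct. The toy table below is made up for illustration. The anyDuplicated() variant is a swapped-in base R check that expresses the same condition; it is not necessarily the alternative the commenter timed, and no speed claim is made for it here.

```r
library(data.table)

# Made-up data: ID 2 has a repeated `test` value, IDs 1 and 3 do not.
DT <- data.table(
  ID   = c(1L, 1L, 2L, 2L, 3L),
  test = c("a", "b", "c", "c", "d")
)

# Pattern from the comment: keep a group only if every `test` value is unique.
res_uniqueN <- DT[, if (.N == uniqueN(test)) .SD, by = ID]

# Same condition written with base R: a group passes if it has no duplicates.
res_anydup <- DT[, if (anyDuplicated(test) == 0L) .SD, by = ID]

# Both forms should select the same rows (the groups with ID 1 and 3).
stopifnot(isTRUE(all.equal(res_uniqueN, res_anydup)))
print(res_uniqueN)
```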
@skanskan can you confirm that timing on current master? There was a related commit just yesterday. My understanding is the timing should be better (though probably not yet equivalent). |
Right now I'm using the stable version, because I'm doing calculations for an article and didn't want to mess with versions. But I'm going to try the master version. |
OK, I have just compared data.table 1.12.8 vs 1.12.9: DT[,if(.N==uniqueN(test)) .SD ,by=ID] improved from 44.3s to 17.5s. I have also tried
1.12.8
master
4 cores / 50% |
@jangorecki I think I was using the dev version of data.table at the time of writing, so it should be version 1.12.3 (2986736). BTW, I get basically the same result as you do using version 1.13.0: still slower (2-3 times), but no longer by as much as it was in 1.12.3 (10-30 times). |
I'm using the latest dev version of data.table. Still, uniqueN() is an order of magnitude slower than length(unique()). So slow I think it should be tagged as a bug ... See the reprex example below.
Note, the below code uses 4 threads (on a Win7 computer). If we set the thread number to 1, the time cost will be reduced by half but is still significantly slower than length(unique()).
(In fact, the reason I noticed this is that I have a daily routine script that takes maybe 20 minutes... and trying to improve the speed led me to the cause: uniqueN().)
Character
Created on 2019-08-02 by the reprex package (v0.2.1)
Double
Created on 2019-08-02 by the reprex package (v0.2.1)