uniqueN could be GForce optimised + GForce could be optimised for := too. #3725

phileas-condemine · 2019-07-24T11:27:36Z

This issue follows a discussion during useR!2019 after the presentation of data.table by Arun @arunsrinivasan

Hello,
Thanks for the amazing job, I love data.table !
I am using uniqueN to verify l-diversity for anonymization purposes.
The data I am working with is around 30M rows, easily ingested by data.table.
Unfortunately uniqueN is not as fast as other functions.
I tried to parallelize the grouping using setDTthreads as I can go up to 16 on my rstudio server instance.
First I get a benchmark using a simple sum over numeric.

Then I do basically the same thing but apply uniqueN over character (factor would give the same results).

Here is the code for a repex https://github.com/phileas-condemine/repex_slow_uniqueN/blob/master/repex_slow_uniqueN.R

Additional info :

> library(data.table)
data.table 1.12.2 using 8 threads (see ?getDTthreads).

Here is my session_info()

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)

Matrix products: default
BLAS: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /opt/microsoft/ropen/3.5.1/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8       
 [4] LC_COLLATE=fr_FR.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_3.0.0        data.table_1.12.2    RevoUtils_11.0.1     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1       rstudioapi_0.7   bindr_0.1.1      magrittr_1.5     tidyselect_0.2.4 munsell_0.5.0   
 [7] colorspace_1.3-2 R6_2.2.2         rlang_0.4.0      plyr_1.8.4       dplyr_0.7.6      tools_3.5.1     
[13] grid_3.5.1       gtable_0.2.0     withr_2.1.2      lazyeval_0.2.1   assertthat_0.2.0 tibble_1.4.2    
[19] crayon_1.3.4     bindrcpp_0.2.2   purrr_0.2.5      glue_1.3.0       compiler_3.5.1   pillar_1.3.0    
[25] scales_0.5.0     pkgconfig_2.0.1

also lscpu call

> system("lscpu")
Architecture :        x86_64
Mode(s) opératoire(s) des processeurs : 32-bit, 64-bit
Boutisme :            Little Endian
Processeur(s) :       16
Liste de processeur(s) en ligne : 0-15
Thread(s) par cœur : 1
Cœur(s) par socket : 8
Socket(s) :           2
Nœud(s) NUMA :       2
Identifiant constructeur : GenuineIntel
Famille de processeur : 6
Modèle :             62
Nom de modèle :      Intel(R) Xeon(R) CPU E5-2667 v2 @ 3.30GHz
Révision :           4
Vitesse du processeur en MHz : 3291.927
BogoMIPS :            6584.13
Constructeur d'hyperviseur : VMware
Type de virtualisation : complet
Cache L1d :           32K
Cache L1i :           32K
Cache L2 :            256K
Cache L3 :            25600K
Nœud NUMA 0 de processeur(s) : 0-7
Nœud NUMA 1 de processeur(s) : 8-15
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl tsc_reliable nonstop_tsc pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 popcnt aes xsave avx hypervisor lahf_lm kaiser arat

The text was updated successfully, but these errors were encountered:

arunsrinivasan · 2019-07-24T19:09:13Z

There are a couple of things here.

First point: uniqueN is parallelised. But it's usage in j is not optimised to use GForce.

When you use .Ninj`

require(data.table)
foo <- function(n=3e8) {
  card <- 3000
  chars <- substr(openssl::sha2(as.character(1:card)), 1L, 5L)
  dist <- runif(card)
  DT <- data.table(
    A=sample(chars, n, TRUE, dist), 
    B=sample(chars, n, TRUE, dist)
  )
  DT
}
set.seed(1L)
DT <- foo(5e7L)

DT[, .N, by=B, verbose=TRUE]
# Detected that j uses these columns: <none>
# Finding groups using forderv ... 0.687s elapsed (1.131s cpu)
# Finding group sizes from the positions (can be avoided to save RAM) ... 0.001s elapsed (0.001s cpu)
# Getting back original order ... 0.001s elapsed (0.001s cpu)
# lapply optimization is on, j unchanged as '.N'
# GForce optimized j to '.N' ### <~~~~~~~~
# Making each group and running j (GForce TRUE) ... 1.301s elapsed (1.651s cpu)
#           B     N
#    1: 71ee4 18647
#    2: b1718 31722
#    3: 2c1f3 33496
#    4: 13b3f 31041
#    5: 12132 19033
#   ---
# 2994: 46635    20
# 2995: 5787a    23
# 2996: 7611f    57
# 2997: c30c6    39
# 2998: a8a2c    23

You can see that the expression is optimised to use GForce. This basically avoids having to switch evaluating the expression in j between R and C which is quite inefficient.

Similarly, we need to optimise uniqueN to be used along with by efficiently.

Second point: Even then, GForce isn't implemented for use with := yet. It's only used for aggregation operations that return a length=1 vector. We'll need to extend GForce to also optimise :=.

When both these are done, things should speedup.

Until then, the best way to go about this (not benchmarked) would be:

unique(DT, by=c("A", "B"))[, .N, by=B]

I think this does what you want to do, but of course this returns an aggregated result which'd mean you'll have to join+update back to your original data.table. So roundabout. But if it's performant than your current solution, perhaps you could write a wrapper function temporarily until this is implemented.

franknarf1 · 2019-07-24T21:00:01Z

@arunsrinivasan Second part of the title looks like a dupe of #1414

jangorecki · 2019-07-25T11:55:47Z

First part is dupe of #1120

arunsrinivasan · 2019-08-31T11:37:11Z

I agree that both parts are dups. Closing this as it's clearly a dup. But would be nice to up the priority on this one since there seems to be some uniqueN performance related issues off late.

arunsrinivasan assigned arunsrinivasan and unassigned arunsrinivasan Jul 24, 2019

arunsrinivasan added enhancement High labels Jul 24, 2019

arunsrinivasan changed the title ~~uniqueN is increasingly slow with higher number of threads~~ uniqueN could be GForce optimised + GForce could be optimised for := too. Jul 24, 2019

jangorecki mentioned this issue Aug 3, 2019

uniqueN escape forder to base R for small atomic vectors, #1120 #3743

Draft

arunsrinivasan closed this as completed Aug 31, 2019

arunsrinivasan added duplicate and removed High labels Aug 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

uniqueN could be GForce optimised + GForce could be optimised for := too. #3725

uniqueN could be GForce optimised + GForce could be optimised for := too. #3725

phileas-condemine commented Jul 24, 2019 •

edited

Loading

arunsrinivasan commented Jul 24, 2019

franknarf1 commented Jul 24, 2019

jangorecki commented Jul 25, 2019

arunsrinivasan commented Aug 31, 2019

uniqueN could be GForce optimised + GForce could be optimised for := too. #3725

uniqueN could be GForce optimised + GForce could be optimised for := too. #3725

Comments

phileas-condemine commented Jul 24, 2019 • edited Loading

arunsrinivasan commented Jul 24, 2019

franknarf1 commented Jul 24, 2019

jangorecki commented Jul 25, 2019

arunsrinivasan commented Aug 31, 2019

phileas-condemine commented Jul 24, 2019 •

edited

Loading