uniqueN escape forder to base R for small atomic vectors, #1120 #3743

jangorecki · 2019-08-03T15:28:27Z

This addresses some of the slowest use cases of uniqueN: calling it by many groups.
Related: #3739, #1120, #3725, #3438

codecov · 2019-08-03T15:39:48Z

Codecov Report

Merging #3743 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #3743      +/-   ##
=========================================
- Coverage   99.41%   99.4%   -0.01%     
=========================================
  Files          71      71              
  Lines       13208   13214       +6     
=========================================
+ Hits        13131   13136       +5     
- Misses         77      78       +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| R/duplicated.R | `98.48% <100%> (-1.52%)` | ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d959107...410843a.

HughParsonage · 2019-08-03T16:40:11Z

R/duplicated.R

+    if (length(x) <= escape_n) {
+      #if (verbose) cat("uniqueN detected atomic type of length ", length(x), " which is small enough to use base R ", if (na.rm) "na.omit+" else "", "duplicated\n",sep="")
+      if (na.rm) x = na.omit(x)
+      return(length(x)-sum(duplicated(x)))


Probably slightly quicker to not na.omit but just do:

return(length(x) - sum(duplicated(x)) - {na.rm && anyNA(x)})

jangorecki · 2019-08-03

I don't follow how that would work.

HughParsonage · 2019-08-04

If na.rm = FALSE then my version is just length(x) - sum(duplicated(x)) - 0.

If na.rm = TRUE then it will be length(x) - sum(duplicated(x)) - 1 * anyNA(x).

(The key insight being that uniqueN can differ by at most 1 for na.rm vs !na.rm.)

MichaelChirico · 2019-08-04

Nice one!

jangorecki · 2019-08-04

It seems that in some cases it might differ by 2:

> x=c(1,2,NA,2,3,NaN,NaN,NA)
> length(unique(x))
[1] 5
> uniqueN(x)
[1] 5
> uniqueN(x, na.rm=T)
[1] 3
> length(unique(na.omit(x)))
[1] 3
> uniqueN(na.omit(x))
[1] 3

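The disagreement in this exchange can be reproduced in base R alone. The sketch below is illustrative only: the three function names are hypothetical and none of them is part of data.table. It compares the `na.omit` approach from the diff with the `anyNA` shortcut from the review, and adds one possible adjustment for double vectors, assuming (as the example above suggests) that `duplicated()` treats `NA` and `NaN` as two distinct values.

```r
# Hypothetical helpers for illustration; not data.table's implementation.

# Approach from the diff: drop missing values first, then count distinct values.
n_unique_omit <- function(x, na.rm = FALSE) {
  if (na.rm) x <- na.omit(x)
  length(x) - sum(duplicated(x))
}

# Shortcut from the review: subtract 1 whenever any missing value is present.
n_unique_shortcut <- function(x, na.rm = FALSE) {
  length(x) - sum(duplicated(x)) - (na.rm && anyNA(x))
}

# Assumed adjustment: for doubles, subtract one for NA and one for NaN,
# since each kind present is counted as its own distinct value.
n_unique_adjusted <- function(x, na.rm = FALSE) {
  n <- length(x) - sum(duplicated(x))
  if (!na.rm) return(n)
  if (is.double(x)) {
    n - any(is.na(x) & !is.nan(x)) - any(is.nan(x))
  } else {
    n - anyNA(x)
  }
}

x <- c(1, 2, NA, 2, 3, NaN, NaN, NA)
n_unique_omit(x, na.rm = TRUE)      # 3
n_unique_shortcut(x, na.rm = TRUE)  # 4, one too many: NA and NaN are both present
n_unique_adjusted(x, na.rm = TRUE)  # 3
```

With only `NA` present (for example `c(1, 2, NA, 2)`), all three functions agree, which is why the shortcut looks correct at first glance.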
mattdowle · 2019-08-26T20:41:28Z

This is good and clearly a problem that should be fixed. But could this switch for small sizes be in forder.c? If so, the problem in forder.c should really be fixed there rather than putting a fix into duplicated.R only. That way we fix the root cause at the root, with uniqueN being one use case.
A few quick candidates by eye:

- https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L476
  If there's overhead in using a parallel team, that's the first place and would always be done regardless of vector length.
- Insert sort (non-parallel) kicks in under 256 items: https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L804
- There's another threshold at UINT16_MAX (65,535): https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L917

If forder.c is more closely profiled (more TBEG, TEND needed) then we can see exactly where the root slowdown in forder.c is. It's also a good excuse to review forder.c.
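As a rough first step before instrumenting forder.c itself, per-call cost can be measured from R at vector lengths on either side of the thresholds mentioned above (256 and 65,535). This is only a sketch: `time_per_call` and `base_uniqueN` are hypothetical helper names, the repetition count is arbitrary, and R-level timing cannot show where inside forder.c the time is spent.

```r
# Hypothetical helper: mean elapsed seconds per call of `f` for each vector length.
time_per_call <- function(f, sizes, reps = 50L) {
  vapply(sizes, function(n) {
    # Integer vector with roughly n/2 distinct values.
    x <- sample.int(max(1L, n %/% 2L), n, replace = TRUE)
    elapsed <- system.time(for (i in seq_len(reps)) f(x))[["elapsed"]]
    elapsed / reps
  }, numeric(1))
}

# Base R count of distinct values, as used in the diff under review.
base_uniqueN <- function(x) length(x) - sum(duplicated(x))

# Lengths chosen around the 256 and UINT16_MAX thresholds.
sizes <- c(10L, 255L, 256L, 1000L, 65535L, 65536L)
res <- data.frame(n = sizes, base = time_per_call(base_uniqueN, sizes))

# Only compare against data.table when it is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  res$dt <- time_per_call(data.table::uniqueN, sizes)
}
print(res)
```

A jump in the data.table column at one of these lengths, without a matching jump in the base column, would point to the corresponding branch in forder.c as the place to add finer profiling.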

uniqueN escape forder to base R for small atomic vectors, #1120

f12f600

HughParsonage approved these changes Aug 3, 2019

Merge branch 'master' into uniquen-escape

410843a

mattdowle added this to the 1.12.4 milestone Aug 26, 2019

jangorecki added the WIP label Aug 27, 2019

jangorecki modified the milestones: 1.12.4, 1.13.0 Aug 27, 2019

This was referenced Sep 16, 2019

detect and redirect ifelse usage in j to fifelse #3800

Closed

Stress-test threading/batchSize is always on #3205

Closed

mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019

jangorecki mentioned this pull request Apr 7, 2020

Rerun repeated uniqueN test #3438

Closed

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

jangorecki removed this from the 1.16.0 milestone Nov 6, 2023

MichaelChirico removed the WIP label Feb 19, 2024

MichaelChirico marked this pull request as draft February 19, 2024 04:17
