uniqueN escape forder to base R for small atomic vectors, #1120 #3743
base: master
Codecov Report
@@ Coverage Diff @@
## master #3743 +/- ##
=========================================
- Coverage 99.41% 99.4% -0.01%
=========================================
Files 71 71
Lines 13208 13214 +6
=========================================
+ Hits 13131 13136 +5
- Misses 77 78 +1
if (length(x) <= escape_n) {
  #if (verbose) cat("uniqueN detected atomic type of length ", length(x), " which is small enough to use base R ", if (na.rm) "na.omit+" else "", "duplicated\n",sep="")
  if (na.rm) x = na.omit(x)
  return(length(x)-sum(duplicated(x)))
Probably slightly quicker to skip na.omit and just do:
return(length(x) - sum(duplicated(x)) - {na.rm && anyNA(x)})
I am not getting how that would work
If na.rm = FALSE then my version is just length(x) - sum(duplicated(x)) - 0.
If na.rm = TRUE then it will be length(x) - sum(duplicated(x)) - 1 * anyNA(x).
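A minimal sketch of the two variants side by side, to make the arithmetic concrete. The helper names n_unique_omit and n_unique_adjust are hypothetical and are not data.table code; the example assumes x holds only one kind of missing value (NA, no NaN).

```r
# Hypothetical helpers for illustration only; neither is data.table's code.

# Variant from the diff: drop missing values first, then count non-duplicates.
n_unique_omit = function(x, na.rm = FALSE) {
  if (na.rm) x = na.omit(x)
  length(x) - sum(duplicated(x))
}

# Suggested variant: count first, then subtract 1 if a missing value was counted.
# The logical result of (na.rm && anyNA(x)) is coerced to 0 or 1 in arithmetic.
n_unique_adjust = function(x, na.rm = FALSE) {
  length(x) - sum(duplicated(x)) - (na.rm && anyNA(x))
}

x = c(1, 2, NA, 2)  # assumption: only NA is present, no NaN

# Both variants agree on this input, with and without na.rm.
stopifnot(n_unique_omit(x) == n_unique_adjust(x))
stopifnot(n_unique_omit(x, na.rm = TRUE) == n_unique_adjust(x, na.rm = TRUE))
```

The second variant avoids the copy that na.omit makes, which is where the expected speed-up would come from.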
(key insight being that uniqueN can differ by at most 1 for na.rm vs !na.rm)
Nice one!
It seems the counts can actually differ by 2, when x contains both NA and NaN:
> x=c(1,2,NA,2,3,NaN,NaN,NA)
> length(unique(x))
[1] 5
> uniqueN(x)
[1] 5
> uniqueN(x, na.rm=T)
[1] 3
> length(unique(na.omit(x)))
[1] 3
> uniqueN(na.omit(x))
[1] 3
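One possible way to keep the no-copy approach while handling this case is to subtract NA and NaN separately. This is only a sketch for numeric vectors, not the fix adopted in the PR; the helper name n_unique_adjust2 is hypothetical.

```r
# Hypothetical helper for illustration only; not data.table's code.
# Assumes x is a numeric vector, where duplicated() treats NA and NaN as
# two distinct values (as the transcript above shows via unique()).
n_unique_adjust2 = function(x, na.rm = FALSE) {
  n = length(x) - sum(duplicated(x))
  if (na.rm) {
    is_nan = is.nan(x)             # TRUE only for NaN
    is_na  = is.na(x) & !is_nan    # TRUE only for NA proper
    # Each kind of missing value was counted at most once, so subtract 0, 1 or 2.
    n = n - any(is_na) - any(is_nan)
  }
  n
}

x = c(1, 2, NA, 2, 3, NaN, NaN, NA)

# Reference result: drop missing values first, then count distinct values.
stopifnot(n_unique_adjust2(x, na.rm = TRUE) == length(unique(na.omit(x))))
stopifnot(n_unique_adjust2(x) == length(unique(x)))
```

Whether the extra is.na and is.nan passes are still cheaper than na.omit on small vectors would need to be measured.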
This is good and clearly a problem that should be fixed. But could this switch for small sizes be in forder.c? If so, the problem in forder.c should really be fixed there rather than putting a fix into duplicated.R only. That way we fix the root cause at the root, with uniqueN being one use case.
If forder.c is more closely profiled (more TBEG, TEND needed) then we can see exactly where the root slowdown in forder.c is. It's also a good excuse to review forder.c.
This addresses some of the slowest use cases of uniqueN, when it is called by many groups.
related #3739, #1120, #3725, #3438