Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uniqueN escape forder to base R for small atomic vectors, #1120 #3743

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

jangorecki
Copy link
Member

this address some of the slowest uses cases of uniqueN by make groups
related #3739, #1120, #3725, #3438

@codecov
Copy link

codecov bot commented Aug 3, 2019

Codecov Report

Merging #3743 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #3743      +/-   ##
=========================================
- Coverage   99.41%   99.4%   -0.01%     
=========================================
  Files          71      71              
  Lines       13208   13214       +6     
=========================================
+ Hits        13131   13136       +5     
- Misses         77      78       +1
Impacted Files Coverage Δ
R/duplicated.R 98.48% <100%> (-1.52%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d959107...410843a. Read the comment docs.

if (length(x) <= escape_n) {
#if (verbose) cat("uniqueN detected atomic type of length ", length(x), " which is small enough to use base R ", if (na.rm) "na.omit+" else "", "duplicated\n",sep="")
if (na.rm) x = na.omit(x)
return(length(x)-sum(duplicated(x)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably slightly quicker to not na.omit but just do:

return(length(x) - sum(duplicated(x)) - {na.rm && anyNA(x)})

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not getting how that would work

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If na.rm = FALSE then my version is just length(x) - sum(duplicated(x)) - 0.

If na.rm = TRUE then it will be length(x) - sum(duplicated(x)) - 1 * anyNA(x).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(key insight being that uniqueN can differ by at most 1 for na.rm vs !na.rm)

Copy link
Member Author

@jangorecki jangorecki Aug 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice one!
it seems that it might eventually differ by 2

> x=c(1,2,NA,2,3,NaN,NaN,NA)
> length(unique(x))
[1] 5
> uniqueN(x)
[1] 5
> uniqueN(x, na.rm=T)
[1] 3
> length(unique(na.omit(x)))
[1] 3
> uniqueN(na.omit(x))
[1] 3

@mattdowle
Copy link
Member

mattdowle commented Aug 26, 2019

This is good and clearly a problem that should be fixed. But could this switch for small sizes be in forder.c? If so, the problem in forder.c should really be fixed there rather than putting a fix into duplicated.R only. That way we fix the root cause at the root, with uniqueN being one use case.
A few quick candidates by eye:

  1. https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L476
    If there's overhead in using a parallel team, that's the first place and would always be done regardless of vector length.

  2. insert sort (non-parallel) kicks in under 256 items: https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L804

  3. there's another threshold at UINT16_MAX (65,535) : https://github.com/Rdatatable/data.table/blob/master/src/forder.c#L917

If forder.c is more closely profiled (more TBEG, TEND needed) then we can see exactly where the root slowdown in forder.c is. It's also a good excuse to review forder.c.

@mattdowle mattdowle added this to the 1.12.4 milestone Aug 26, 2019
@jangorecki jangorecki added the WIP label Aug 27, 2019
@jangorecki jangorecki modified the milestones: 1.12.4, 1.13.0 Aug 27, 2019
@mattdowle mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki removed this from the 1.16.0 milestone Nov 6, 2023
@MichaelChirico MichaelChirico marked this pull request as draft February 19, 2024 04:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants