binary search extensions to <, <=, >, >= #1452
Comments
This would be awesome if implemented!
* non-equi-joins: non-equi joins update to NEWS. #1452. Patching another issue spotted by Jan. Thanks! Update ?data.table with current non-equi join functionality. Limit number of combinations for tests to max of 100. Closes #1257, on=.() syntax is now possible. Added test for join on char type with op other than '=='. Allow only '==' operator for joins on char type. Free allocated variable. Fix for the issue @jan spotted. Added tests. Thanks Jan. Finally, non-equi joins NAs/NaNs correctly in all cases, hopefully. Added a note to self comment to nestedid. Minor: fix code spacing. Adding tests for non-equi joins only for non-NA/NaN cases. Fixing logic for NAs in i. Various improvements and fixes to nestedid. better logic fixes edge cases, also removes for-loop = ~3x faster Just fixing indentation and minor code cleanup. No implementations. thinko! should be seq_len, not seq_along. First stab at non-equi joins
Any plans to support syntax like |
Very likely not in this release (depends on how fast I wrap up the rest), but definitely useful. Filing it as a new issue would be great.
Done, see #1639 |
@arunsrinivasan can you give a quick example of |
Here's an example where it'll fail:
require(data.table)
dt = data.table(id="x", a=as.integer(c(3,8,8,15,15,15,16,22,22,25,25)), b=as.integer(c(9,10,25,19,22,25,38,3,9,7,28)), c=as.integer(c(22,33,44,14,49,44,40,25,400,52,77)))
dt[.(a=c(12,20), b=20), sum(c), on=c("a>a", "b<=b"), by=.EACHI]
# Error in `[.data.table`(dt, .(a = c(12, 20), b = 20), sum(c), on = c("a>a", :
# by-joins are not yet implemented for multi-group non-equi-joins.
The idea for non-equi joins is to split data.table |
Ok, thanks, so failure is actually catastrophic and not silent. That's good - I thought it just works incorrectly for some cases and was looking for an incorrect result. Iirc the cases I've worked with all had a single column on LHS, which always satisfies the single group condition. |
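For contrast with the failing call above, here is a minimal sketch of the single-condition case, which the comment above suggests always forms a single group. It reuses the same dt and keeps only the `a>a` condition. The name `res` is introduced here for illustration, and behavior may differ across data.table versions.

```r
library(data.table)

# Same table as in the failing example above
dt <- data.table(
  id = "x",
  a = as.integer(c(3, 8, 8, 15, 15, 15, 16, 22, 22, 25, 25)),
  b = as.integer(c(9, 10, 25, 19, 22, 25, 38, 3, 9, 7, 28)),
  c = as.integer(c(22, 33, 44, 14, 49, 44, 40, 25, 400, 52, 77))
)

# Single inequality condition: for each value of a in i,
# sum c over the rows of dt whose a is strictly greater than it
res <- dt[.(a = c(12L, 20L)), sum(c), on = "a>a", by = .EACHI]
print(res)
```

Because j is unnamed, the aggregated column is called V1, with one row per value supplied in i.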
Updated all SO posts linked. |
great work!! |
It seems that for this to work, one has to specify
Here is a small example
what I want to get can be accomplished by
But I thought I could simply do this:
However, I got an error:
Why do I have to specify |
@ywhuofu if you want an inequality join like in SQL, then why do you use |
@jangorecki thanks for your response. So now I am a little confused. Based on my example, what should I do instead?
@ywhuofu Use
set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
dt1[, year_hlp:=year]
setkey(dt1, year, year_hlp)
setkey(dt2, start, end)
rf = foverlaps(dt1, dt2, by.x = c('year','year_hlp'), by.y = c('start','end'))[order(start)]
rf[, year_hlp:=NULL]
set.seed(1)
dt1 <- data.table(year=1991:2000, v=rnorm(10))
dt2 <- data.table(start=dt1$year-5L, end=dt1$year)
r = dt1[dt2, .(start, end, year=x.year, v), on=.(year>=start, year<=end), allow.cartesian=TRUE][order(start)]
all.equal(rf, r)
#[1] TRUE
Note the |
@jangorecki thanks a lot for the detailed observations. It seems the key is One more interesting observation, the inequality join is faster than |
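The non-equi join above selects `year=x.year` rather than plain `year`. A small sketch of why that matters, using hypothetical toy tables (`dt1`, `dt2`, `default_cols` and `explicit` here are illustrative, not the data from the thread): in a non-equi join, the result column named after x's join column carries the values from i, so the `x.` prefix is what recovers x's original values.

```r
library(data.table)

# Hypothetical toy data: three years and a single [start, end] window
dt1 <- data.table(year = 1991:1993, v = c(10, 20, 30))
dt2 <- data.table(start = 1990L, end = 1992L)

# Default columns: the join columns named "year" hold the bounds taken from dt2
default_cols <- dt1[dt2, on = .(year >= start, year <= end)]

# Explicit j: x.year refers to dt1's own year column
explicit <- dt1[dt2, .(start, end, year = x.year, v),
                on = .(year >= start, year <= end)]

print(default_cols)
print(explicit)
```

Comparing the two printed tables shows the bound values in the first and the matched years from dt1 in the second.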
Benchmark:
TODO: add
Data:
# sample data
require(data.table)
set.seed(1L)
ids = paste0("id", 1:30e3)
N = 40e6L
query = data.table(id=sample(ids, N, TRUE), range1=sample(1e2L, N, TRUE))
query[, range2 := range1 + as.integer(runif(N)*300L)]
query
subject = data.table(id=sample(ids), range1=sample(2e2L, 30e3L, TRUE))
subject[, range2 := range1 + as.integer(runif(30e3L)*10e3L)]
subject
Non-equi joins:
system.time(
nq_ans <- query[subject, .N, on=.(id, range1>=range1, range2<=range2), nomatch=0L, by=.EACHI]
)
# 19.8s
findOverlaps
require(GenomicRanges)
q.gr = GRanges(query$id, IRanges(query$range1, query$range2)) # 12.7s!!!
s.gr = GRanges(subject$id, IRanges(subject$range1, subject$range2))
system.time(gr_ans <- findOverlaps(q.gr, s.gr, type="within"))
# 16.4s
# note that we have not obtained the counts yet, just the overlaps
# the fact that q.gr takes ~13s is quite suspicious (i.e., makes me think that it does
# some preprocessing and therefore should be included in the total run time)
RSQLite
# Thanks @jangorecki
library(RSQLite)
conn = dbConnect(SQLite())
dbWriteTable(conn, "query", query)
dbWriteTable(conn, "subject", subject)
sql = 'SELECT subject.id, subject.range1, subject.range2, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 AND query.range2 <= subject.range2 GROUP BY subject.id, subject.range1, subject.range2;'
system.time(sql_ans <- dbGetQuery(conn, sql))
# 53.3s
foverlaps
system.time({
setkey(subject, id, range1, range2)
folaps_ans <- foverlaps(query, subject, type="within", nomatch=0L, which=TRUE)
})
# 12.9s
# note that we have not obtained the counts yet, just the overlaps
Another non-equi joins comparison with
non-equi joins
system.time(nq_ans <- query[subject, .N, on=.(id, range1>=range1), nomatch=0L, by=.EACHI])
# 4.3s
RSQLite
sql = 'SELECT subject.id, subject.range1, COUNT(*) AS n FROM query INNER JOIN subject ON query.id = subject.id AND query.range1 >= subject.range1 GROUP BY subject.id, subject.range1;'
system.time(sql_ans <- dbGetQuery(conn, sql))
# 50.7s
<, <=, >, >= and !=
<, <=, >, >=
Using on= as on=.(x == y, a <= b) -- as simple as that..
>=, >, <= and <.
mult="first" and mult="last"
?data.table
Then extend #1068
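The `on=.(x == y, a <= b)` form and the `mult="first"` / `mult="last"` options named above can be sketched as follows. The tables `X` and `Y` and their columns are hypothetical and chosen only for illustration; this is a sketch of the syntax under current data.table behavior, not a statement about how the feature is implemented.

```r
library(data.table)

# Hypothetical toy tables; X is already ordered by a within each g
X <- data.table(g = c("p", "p", "p", "q"),
                a = c(1L, 4L, 7L, 2L),
                val = c("p1", "p4", "p7", "q2"))
Y <- data.table(g = c("p", "q"), b = c(5L, 1L))

# Equality on g combined with an inequality on a (X's a <= Y's b)
all_hits <- X[Y, on = .(g == g, a <= b)]

# Keep only the first or the last matching row of X for each row of Y
first_hit <- X[Y, on = .(g == g, a <= b), mult = "first"]
last_hit  <- X[Y, on = .(g == g, a <= b), mult = "last"]

print(all_hits)
print(first_hit)
print(last_hit)
```

Rows of Y with no match in X are kept with NA in X's columns unless `nomatch=0L` is supplied, as in the benchmark calls above.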