-
Notifications
You must be signed in to change notification settings - Fork 993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rbindlist support fill=TRUE with use.names=FALSE and use it in merge.R ToDo of #678 #5263
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5263 +/- ##
==========================================
- Coverage 99.50% 99.50% -0.01%
==========================================
Files 77 77
Lines 14605 14599 -6
==========================================
- Hits 14533 14527 -6
Misses 72 72
Continue to review full report at Codecov.
|
R/merge.R
Outdated
yy = cbind(yy, x[tmp, othercolsx, with = FALSE]) | ||
nx = c(names(yy), names(x[tmp, othercolsx, with = FALSE])) | ||
nx = make.unique(nx) | ||
set(yy, NULL, tail(nx, -ncol(yy)), x[tmp, othercolsx, with = FALSE]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a noticeable memory advantage to looping over the RHS columns instead of creating the whole with=FALSE subset table?
it looks like the advantage is we can do the tmp subset once & apply it to all columns...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guessed that saving this tmp object and doing indexing twice is preferred over using additional memory. Rethinking it again, I might have figured out what the ToDo meant initially where using set
omits this object/indexing anyway.
@MichaelChirico whats the preferred process here: 1 PR per source file? Or 1 PR per todo? Or 1 megathread PR? And should I update the issue list of #678 or create new issues? |
I think generally one PR per TODO makes the most sense, but if you see a cluster of TODOs in closely related code, do combine them. basically TODOs that are logically independent should get their own PRs |
R/merge.R
Outdated
yy = cbind(yy, x[tmp, othercolsx, with = FALSE]) | ||
nx = c(names(yy), paste0("V",seq_len(length(othercolsx)))) | ||
nx = make.unique(nx) | ||
set(yy, NULL, tail(nx, -ncol(yy)), rep(list(NA), length(othercolsx))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be NA_integer_?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, NA
logical works fine here. rbindlist
coerces types so logical
columns of yy
will be coerced to the types of dt
. The reason behind the previous tmp
used NA_integer_
is that it was used for indexing the data.table and therefore was integer
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw. I recall cbindlist in a PR has option for controlling copy behavior.
Uh, sorry for late feedback but I realized nrow(y)=1 can be useful to test against as well. I can imagine that there could be different handling of a copy when you subset all rows and the index is scalar at the same time (based on DT[TRUE] doing shallow copy). Kind of edge case but worth to have it as well. |
Good point. I added test cases to check for accidental shallow copying. AFAIU current behavior of shallow copying of #3215 (and friends) leads only to problems when pre-copy cols are altered in the copy since these cols point to the same address? Since we only add cols here and don't alter other cols in |
@@ -1875,6 +1877,16 @@ test(630.1, merge(DT1,DT2,all.x=TRUE), setkey(adt(merge(adf(DT1),adf(DT2),by="a" | |||
|
|||
test(631, merge(DT1,DT2,all.y=TRUE), data.table(a=c(2,3,5),total.x=c(NA,1,1),total.y=c(5,1,2),key="a")) | |||
test(631.1, merge(DT1,DT2,all.y=TRUE), setkey(adt(merge(adf(DT1),adf(DT2),by="a",all.y=TRUE)),a)) | |||
# ensure merge(x,y,all.y) does not alter input y | |||
# merge containing idx 1:nrow(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is a little unclear on its own -- ideally we can read the comment without any more context and know the point of the test. I think 'idx' refers to the implementation? best to be more explicit about what it means
I tried to get to the root cause of this block of code. It took me a while to realize that all it was doing was adding on the right number of NA columns to |
Closes #5262
Closes #5037
merge.R
set
in merge.R instead ofcbind