-
Notifications
You must be signed in to change notification settings - Fork 11
adjust outer join behavior (types and right outer join bug) #44
Conversation
src/abstractdatatable/join.jl
Outdated
NullableCategoricalArray(T, dims) | ||
|
||
similar_nullable(dt::AbstractDataTable, dims::Int) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar_nullable
is only used by compose_joined_table
and it does not use this call form, so this is not needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So say it in the commit message.
src/abstractdatatable/join.jl
Outdated
NullableCategoricalArray(T, dims) | ||
|
||
similar_nullable(dt::AbstractDataTable, dims::Int) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So say it in the commit message.
src/abstractdatatable/join.jl
Outdated
NullableCategoricalArray(T, dims) | ||
|
||
similar_nullable(dt::AbstractDataTable, dims::Int) = | ||
DataTable(Any[similar_nullable(x, dims) for x in columns(dt)], copy(index(dt))) | ||
similar_nullable{T,R}(dv::NullableCategoricalArray{T,R}, dims::Union{Int, Tuple{Vararg{Int}}}) = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R
isn't needed.
src/abstractdatatable/join.jl
Outdated
similar_nullable(dt::AbstractDataTable, dims::Int) = | ||
DataTable(Any[similar_nullable(x, dims) for x in columns(dt)], copy(index(dt))) | ||
similar_nullable{T,R}(dv::NullableCategoricalArray{T,R}, dims::Union{Int, Tuple{Vararg{Int}}}) = | ||
NullableCategoricalArray(T, dims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the NullableCategoricalArray{T}(dims)
construct, this one is deprecated (or should be, if you didn't get a warning...).
src/abstractdatatable/join.jl
Outdated
cols = Vector{Any}(ncleft + ncol(dtr_noon)) | ||
for (i, col) in enumerate(columns(joiner.dtl)) | ||
cols[i] = kind == :inner ? col[all_orig_left_ixs] : | ||
copy!(similar_nullable(col, nrow), col[all_orig_left_ixs]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
col[all_orig_left_ixs]
creates a temporary copy. Here it is worth doing:
newcol = similar_nullable(col, nrow)
@inbounds for (j, k) in enumerate(all_orig_left_ixs)
newcol[j] = col[k]
end
Same below, handling right_perm
inside the loop to avoid a second allocation (even when kind == :inner
).
src/abstractdatatable/join.jl
Outdated
@assert nrow == length(all_orig_right_ixs) + loil | ||
ncleft = ncol(joiner.dtl) | ||
cols = Vector{Any}(ncleft + ncol(dtr_noon)) | ||
for (i, col) in enumerate(columns(joiner.dtl)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment explaining why we create nullable arrays when kind != :inner
?
src/abstractdatatable/join.jl
Outdated
@@ -207,7 +217,8 @@ join(dt1::AbstractDataTable, | |||
- `:cross` : a full Cartesian product of the key combinations; every | |||
row of `dt1` is matched with every row of `dt2` | |||
|
|||
Null values are filled in where needed to complete joins. | |||
For the three join operations that may introduce missing values, `:outer`, `:left`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Parentheses rather than comma before the list of join types.
test/join.jl
Outdated
@@ -111,10 +111,63 @@ module TestJoin | |||
Mass = [1.5, 2.2, 1.1, 1.5]) | |||
|
|||
# Test that join works when mixing Array and NullableArray (#1151) | |||
dt = DataTable([collect(1:10), collect(2:11)], [:x, :y]) | |||
dt = DataTable([NullableArray(1:10), NullableArray(2:11)], [:x, :y]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This goes against the comment above. Why change it?
Thanks. The commit message should explain what the fixed bug was. Also, I don't think this really "type-stabilizes" |
test/join.jl
Outdated
@@ -112,9 +112,62 @@ module TestJoin | |||
|
|||
# Test that join works when mixing Array and NullableArray (#1151) | |||
dt = DataTable([collect(1:10), collect(2:11)], [:x, :y]) | |||
dtnull = DataTable(x = 1:10, z = 3:12) | |||
dtnull = DataTable(x = NullableArray(1:10), z = NullableArray(3:12)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll be useful to make this explicit now so the test will still work when columns are not NullableArrays by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, this change isn't related to this PR so it should be made elsewhere (and anyway the comment should be updated if the code changes).
src/abstractdatatable/join.jl
Outdated
|
||
if length(rightonly_ixs.join) > 0 | ||
# some left rows are nulls, so the values of the "on" columns | ||
# need to be taken from the right | ||
for (on_col_ix, on_col) in enumerate(joiner.on_cols) | ||
# fix the result of the rightjoin by taking the nonnull values from the right table | ||
res[on_col][rightonly_ixs.join] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix] | ||
# end-length(rightonly_ixs.orig)+1:end was rightonly_ixs.join. Try and FIXME | ||
res[on_col][end-length(rightonly_ixs.orig)+1:end] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this change in this PR? Better make it separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bug was found as part of this PR, I don't see any substantial benefit to making a separate PR for it, and it's tested by the full set of joins included in this PR. I've added more detail about the bug to the commit message. Let me know if there's any additional information you'd like added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's annoying is that you're leaving a FIXME, which seems to mean it's going to hold back the PR until we're sure it's the right fix. But if you say this is tested, then the fix is probably correct and no FIXME should be left?
Also, end-length(rightonly_ixs.orig)+1:end
would be clearer with parentheses around end-length(rightonly_ixs.orig)+1
IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FIXME refers to rightonly_ixs.join
and that I would like to change this line back to using res[on_col][rightonly_ixs.join] = ...
when we get around to figuring out why rightonly_ixs.join
doesn't contain the expected values. I could remove the FIXME and open an issue? I could also remove the FIXME and just keep a personal note to get around to fixing this? Not sure what the preferred way of tracking this would be.
FIXME removed and determining the offset has been cleaned up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more comments, but looks mostly good.
src/abstractdatatable/join.jl
Outdated
@assert nrow == length(all_orig_right_ixs) + loil | ||
ncleft = ncol(joiner.dtl) | ||
cols = Vector{Any}(ncleft + ncol(dtr_noon)) | ||
_similar_type = kind == :inner ? similar : similar_nullable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just call this _similar
, "type" is more confusing than anything here.
src/abstractdatatable/join.jl
Outdated
@@ -210,7 +223,8 @@ join(dt1::AbstractDataTable, | |||
- `:cross` : a full Cartesian product of the key combinations; every | |||
row of `dt1` is matched with every row of `dt2` | |||
|
|||
Null values are filled in where needed to complete joins. | |||
For the three join operations that may introduce missing values (`:outer`, `:left`, | |||
and `:right`), all columns of the returned datatable will be nullable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"data table"
src/abstractdatatable/join.jl
Outdated
|
||
if length(rightonly_ixs.join) > 0 | ||
# some left rows are nulls, so the values of the "on" columns | ||
# need to be taken from the right | ||
for (on_col_ix, on_col) in enumerate(joiner.on_cols) | ||
# fix the result of the rightjoin by taking the nonnull values from the right table | ||
res[on_col][rightonly_ixs.join] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix] | ||
offset = nrow - length(rightonly_ixs.orig) + 1 | ||
res[on_col][offset:end] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using a loop as above, you will be able to avoid making a copy for each index in rightonly_ixs.orig
.
test/join.jl
Outdated
large = DataTable(Any[[0, 1, 2, 3, 4], [0.0, 1.0, 2.0, 3.0, 4.0]], [:id, :fid]) | ||
N = Nullable() | ||
|
||
@test join(small, large, kind=:cross) == DataTable(Any[repeat([1, 3, 5], inner=5), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Break the line after ==
to avoid going beyond 92 chars.
Also, it would be good to store the result of the join and check the column types to make sure we get NullableArray
and not Array{Nullable}
. Should there also be tests for categorical arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
Didn't see anything in the coverage drop that looks related. |
src/abstractdatatable/join.jl
Outdated
_similar = kind == :inner ? similar : similar_nullable | ||
for (i, col) in enumerate(columns(joiner.dtl)) | ||
cols[i] = _similar(col, nrow) | ||
@inbounds for (j, k) in enumerate(all_orig_left_ixs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just realized this is likely to be quite slow due to the type-instability. You should be able to move this loop to a separate kernel function which will take col
, cols[i]
and all_orig_left_ixs
, and which will be specialized on the column type. It could also take an optional offset, so that the same function could be used for the two similar loops below. I imagine this should make a noticeable speed difference on a simple benchmark.
test/join.jl
Outdated
DataTable([NullableArray(1:10), NullableArray(3:12), collect(2:11)], [:x, :z, :y]) | ||
|
||
@testset "all joins" begin | ||
small = DataTable(Any[[1, 3, 5], [1.0, 3.0, 5.0]], [:id, :fid]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"small" and "large" really don't apply now. How about "dt1" and "dt2"?
test/join.jl
Outdated
s([:id, :fid]) == DataTable(Any[[1, 3], [1, 3]], [:id, :fid]) | ||
@test typeof.(s(:id).columns) == | ||
typeof.(s(:fid).columns) == | ||
typeof.(s([:id, :fid]).columns) == [CategoricalVector{Int, UInt32}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in the other PR, leave UInt32
out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, or replace it with CategoricalArrays.DefaultRefType
since you need it here.
outer joins need to return nullable tables as they may introduce missing data. similar_nullable on DataTables has been removed (unused) and replaced with a similar_nullable that works on NullableCategoricalArrays, and this change is made to support the new changes to join. The 3 outer joins share a function with inner joins, and this shared function (compose_joined_table) now performs a check to see if the join type is :inner, and if so, it will return the same column type as the parent table rather than promoting to a nullable column. A bug was found in right-outer join behavior where the values unique to the right table were added to the table in the incorrect locations, overwriting data and leaving nulls where they shouldn't be. This bug, due to incorrect values in rightonly_ixs.join, was fixed by filling the last n-rows of the datatable where n = length(rightonly_ixs.join). Tests were checked for accuracy against pandas.
Thanks! Out of curiosity, have you checked whether the new code is measurably faster than the old one? |
Thanks here too! I don't remember benchmarking any of these changes, but let's save runtime testing for after #62. |
Separated from #30 as part of a PR refactor. Fixes a bug found in #34 (comment). Also fixes the issue that in the current implementation,
resize!
has the ability to introduce#undef
values whereNullable
s should be placed.