Skip to content
This repository has been archived by the owner on May 5, 2019. It is now read-only.

adjust outer join behavior (types and right outer join bug) #44

Merged
merged 1 commit into from
May 12, 2017
Merged

adjust outer join behavior (types and right outer join bug) #44

merged 1 commit into from
May 12, 2017

Conversation

cjprybol
Copy link
Contributor

@cjprybol cjprybol commented Mar 31, 2017

Separated from #30 as part of a PR refactor. Fixes a bug found in #34 (comment). Also fixes the issue that in the current implementation, resize! has the ability to introduce #undef values where Nullables should be placed.

NullableCategoricalArray(T, dims)

similar_nullable(dt::AbstractDataTable, dims::Int) =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar_nullable is only used by compose_joined_table and it does not use this call form, so this is not needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So say it in the commit message.

NullableCategoricalArray(T, dims)

similar_nullable(dt::AbstractDataTable, dims::Int) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So say it in the commit message.

NullableCategoricalArray(T, dims)

similar_nullable(dt::AbstractDataTable, dims::Int) =
DataTable(Any[similar_nullable(x, dims) for x in columns(dt)], copy(index(dt)))
similar_nullable{T,R}(dv::NullableCategoricalArray{T,R}, dims::Union{Int, Tuple{Vararg{Int}}}) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R isn't needed.

similar_nullable(dt::AbstractDataTable, dims::Int) =
DataTable(Any[similar_nullable(x, dims) for x in columns(dt)], copy(index(dt)))
similar_nullable{T,R}(dv::NullableCategoricalArray{T,R}, dims::Union{Int, Tuple{Vararg{Int}}}) =
NullableCategoricalArray(T, dims)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the NullableCategoricalArray{T}(dims) construct, this one is deprecated (or should be, if you didn't get a warning...).

cols = Vector{Any}(ncleft + ncol(dtr_noon))
for (i, col) in enumerate(columns(joiner.dtl))
cols[i] = kind == :inner ? col[all_orig_left_ixs] :
copy!(similar_nullable(col, nrow), col[all_orig_left_ixs])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

col[all_orig_left_ixs] creates a temporary copy. Here it is worth doing:

newcol = similar_nullable(col, nrow)
@inbounds for (j, k) in enumerate(all_orig_left_ixs)
    newcol[j] = col[k]
end

Same below, handling right_perm inside the loop to avoid a second allocation (even when kind == :inner).

@assert nrow == length(all_orig_right_ixs) + loil
ncleft = ncol(joiner.dtl)
cols = Vector{Any}(ncleft + ncol(dtr_noon))
for (i, col) in enumerate(columns(joiner.dtl))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining why we create nullable arrays when kind != :inner?

@@ -207,7 +217,8 @@ join(dt1::AbstractDataTable,
- `:cross` : a full Cartesian product of the key combinations; every
row of `dt1` is matched with every row of `dt2`

Null values are filled in where needed to complete joins.
For the three join operations that may introduce missing values, `:outer`, `:left`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parentheses rather than comma before the list of join types.

test/join.jl Outdated
@@ -111,10 +111,63 @@ module TestJoin
Mass = [1.5, 2.2, 1.1, 1.5])

# Test that join works when mixing Array and NullableArray (#1151)
dt = DataTable([collect(1:10), collect(2:11)], [:x, :y])
dt = DataTable([NullableArray(1:10), NullableArray(2:11)], [:x, :y])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes against the comment above. Why change it?

@nalimilan
Copy link
Member

Thanks. The commit message should explain what the fixed bug was. Also, I don't think this really "type-stabilizes" join, it just avoid returning #undef entries by making them null entries of a nullable array.

@cjprybol cjprybol changed the title type stabilize join, fix right outer join bug adjust outer join behavior (types and right outer join bug) Apr 1, 2017
test/join.jl Outdated
@@ -112,9 +112,62 @@ module TestJoin

# Test that join works when mixing Array and NullableArray (#1151)
dt = DataTable([collect(1:10), collect(2:11)], [:x, :y])
dtnull = DataTable(x = 1:10, z = 3:12)
dtnull = DataTable(x = NullableArray(1:10), z = NullableArray(3:12))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be useful to make this explicit now so the test will still work when columns are not NullableArrays by default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this change isn't related to this PR so it should be made elsewhere (and anyway the comment should be updated if the code changes).


if length(rightonly_ixs.join) > 0
# some left rows are nulls, so the values of the "on" columns
# need to be taken from the right
for (on_col_ix, on_col) in enumerate(joiner.on_cols)
# fix the result of the rightjoin by taking the nonnull values from the right table
res[on_col][rightonly_ixs.join] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix]
# end-length(rightonly_ixs.orig)+1:end was rightonly_ixs.join. Try and FIXME
res[on_col][end-length(rightonly_ixs.orig)+1:end] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need this change in this PR? Better make it separately.

Copy link
Contributor Author

@cjprybol cjprybol Apr 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bug was found as part of this PR, I don't see any substantial benefit to making a separate PR for it, and it's tested by the full set of joins included in this PR. I've added more detail about the bug to the commit message. Let me know if there's any additional information you'd like added.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's annoying is that you're leaving a FIXME, which seems to mean it's going to hold back the PR until we're sure it's the right fix. But if you say this is tested, then the fix is probably correct and no FIXME should be left?

Also, end-length(rightonly_ixs.orig)+1:end would be clearer with parentheses around end-length(rightonly_ixs.orig)+1 IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FIXME refers to rightonly_ixs.join and that I would like to change this line back to using res[on_col][rightonly_ixs.join] = ... when we get around to figuring out why rightonly_ixs.join doesn't contain the expected values. I could remove the FIXME and open an issue? I could also remove the FIXME and just keep a personal note to get around to fixing this? Not sure what the preferred way of tracking this would be.

@cjprybol
Copy link
Contributor Author

cjprybol commented Apr 4, 2017

FIXME removed and determining the offset has been cleaned up.

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments, but looks mostly good.

@assert nrow == length(all_orig_right_ixs) + loil
ncleft = ncol(joiner.dtl)
cols = Vector{Any}(ncleft + ncol(dtr_noon))
_similar_type = kind == :inner ? similar : similar_nullable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just call this _similar, "type" is more confusing than anything here.

@@ -210,7 +223,8 @@ join(dt1::AbstractDataTable,
- `:cross` : a full Cartesian product of the key combinations; every
row of `dt1` is matched with every row of `dt2`

Null values are filled in where needed to complete joins.
For the three join operations that may introduce missing values (`:outer`, `:left`,
and `:right`), all columns of the returned datatable will be nullable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"data table"


if length(rightonly_ixs.join) > 0
# some left rows are nulls, so the values of the "on" columns
# need to be taken from the right
for (on_col_ix, on_col) in enumerate(joiner.on_cols)
# fix the result of the rightjoin by taking the nonnull values from the right table
res[on_col][rightonly_ixs.join] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix]
offset = nrow - length(rightonly_ixs.orig) + 1
res[on_col][offset:end] = joiner.dtr_on[rightonly_ixs.orig, on_col_ix]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a loop as above, you will be able to avoid making a copy for each index in rightonly_ixs.orig.

test/join.jl Outdated
large = DataTable(Any[[0, 1, 2, 3, 4], [0.0, 1.0, 2.0, 3.0, 4.0]], [:id, :fid])
N = Nullable()

@test join(small, large, kind=:cross) == DataTable(Any[repeat([1, 3, 5], inner=5),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Break the line after == to avoid going beyond 92 chars.

Also, it would be good to store the result of the join and check the column types to make sure we get NullableArray and not Array{Nullable}. Should there also be tests for categorical arrays?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

@cjprybol
Copy link
Contributor Author

cjprybol commented May 3, 2017

Didn't see anything in the coverage drop that looks related.

_similar = kind == :inner ? similar : similar_nullable
for (i, col) in enumerate(columns(joiner.dtl))
cols[i] = _similar(col, nrow)
@inbounds for (j, k) in enumerate(all_orig_left_ixs)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just realized this is likely to be quite slow due to the type-instability. You should be able to move this loop to a separate kernel function which will take col, cols[i] and all_orig_left_ixs, and which will be specialized on the column type. It could also take an optional offset, so that the same function could be used for the two similar loops below. I imagine this should make a noticeable speed difference on a simple benchmark.

test/join.jl Outdated
DataTable([NullableArray(1:10), NullableArray(3:12), collect(2:11)], [:x, :z, :y])

@testset "all joins" begin
small = DataTable(Any[[1, 3, 5], [1.0, 3.0, 5.0]], [:id, :fid])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"small" and "large" really don't apply now. How about "dt1" and "dt2"?

test/join.jl Outdated
s([:id, :fid]) == DataTable(Any[[1, 3], [1, 3]], [:id, :fid])
@test typeof.(s(:id).columns) ==
typeof.(s(:fid).columns) ==
typeof.(s([:id, :fid]).columns) == [CategoricalVector{Int, UInt32},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in the other PR, leave UInt32 out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, or replace it with CategoricalArrays.DefaultRefType since you need it here.

outer joins need to return nullable tables as they may introduce missing
data. similar_nullable on DataTables has been removed (unused) and
replaced with a similar_nullable that works on
NullableCategoricalArrays, and this change is made to support the new
changes to join. The 3 outer joins share a function with inner joins,
and this shared function (compose_joined_table) now performs a check to
see if the join type is :inner, and if so, it will return the same
column type as the parent table rather than promoting to a nullable
column. A bug was found in right-outer join behavior where the values
unique to the right table were added to the table in the incorrect
locations, overwriting data and leaving nulls where they shouldn't be.
This bug, due to incorrect values in rightonly_ixs.join, was fixed by
filling the last n-rows of the datatable where n =
length(rightonly_ixs.join). Tests were checked for accuracy against
pandas.
@nalimilan nalimilan merged commit 12443f4 into JuliaData:master May 12, 2017
@nalimilan
Copy link
Member

Thanks! Out of curiosity, have you checked whether the new code is measurably faster than the old one?

@cjprybol
Copy link
Contributor Author

Thanks here too! I don't remember benchmarking any of these changes, but let's save runtime testing for after #62.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants