"correct" join results #34

cjprybol · 2017-03-16T06:08:06Z

We're making progress on removing the automatic promotion of columns to nullable arrays (NullableArrays & NullableCategoricalArrays) in #30 and I would like some feedback and discussion about the best way to handle outer joins (outer, left, right). Outer joins are particularly interesting because they all have the potential to introduce nulls, but do not necessarily need to do so. All of the other joins currently supported create subsets of the tables and shouldn't have the potential to introduce nulls. Please correct me if I'm wrong on this.

For the other joins that do not have the potential to introduce missing data, it seems that the "correct" result is pretty easy to decide on: The result should be in the same order as the input data and the column types should be preserved.

For outer joins, we have two inter-related "correct"-ness questions that both have to do with nulls. The first question is what is the "correct" ordering of the returned values when nulls are introduced. Do we try and retain the original ordering of the input datasets, or do we default to putting the null values at the end, or left then right table ordering? The second is what is the "correct" column type to return to the user. As all outer joins have the potential to introduce missingness, the "correct" result could be the one that is most column-type consistent. In this case, we would retain the column type and only promote to using nullable arrays when nulls are introduced. On the flip side of this is the behavioral consistency approach. Because all outer joins have the potential to introduce missing data, it could be argued that outer joins should always return nullable columns (only for the smaller of the two joined tables or just completely nullify the datatable). In that case, users can always expect to receive nullable columns to be output from outer joins and write nullable-tolerant code accordingly without anything unexpected.

Here is an example set of joins. I'll turn these into test cases to assert whatever behavior pattern we decide on, and if any other examples are put forth here I'll add those too.

the data

julia> small = DataTable(id = 1:2:5, fid = 1.0:2.0:5.0)
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │

julia> large = DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │

and here are the results I would expect from the joins, which are commented out in the line before the result I expected.

julia> N = Nullable()
Nullable{Union{}}()

julia> # join(small, large, kind=:cross)
       DataTable(id = repeat(1:2:5, inner=5),
                 fid = repeat(1.0:2.0:5.0, inner=5),
                 id_1 = repeat(0:4, outer=3),
                 fid_1 = repeat(0.0:4.0, outer=3))
15×4 DataTables.DataTable
│ Row │ id │ fid │ id_1 │ fid_1 │
├─────┼────┼─────┼──────┼───────┤
│ 1   │ 1  │ 1.0 │ 0    │ 0.0   │
│ 2   │ 1  │ 1.0 │ 1    │ 1.0   │
│ 3   │ 1  │ 1.0 │ 2    │ 2.0   │
│ 4   │ 1  │ 1.0 │ 3    │ 3.0   │
│ 5   │ 1  │ 1.0 │ 4    │ 4.0   │
│ 6   │ 3  │ 3.0 │ 0    │ 0.0   │
│ 7   │ 3  │ 3.0 │ 1    │ 1.0   │
│ 8   │ 3  │ 3.0 │ 2    │ 2.0   │
│ 9   │ 3  │ 3.0 │ 3    │ 3.0   │
│ 10  │ 3  │ 3.0 │ 4    │ 4.0   │
│ 11  │ 5  │ 5.0 │ 0    │ 0.0   │
│ 12  │ 5  │ 5.0 │ 1    │ 1.0   │
│ 13  │ 5  │ 5.0 │ 2    │ 2.0   │
│ 14  │ 5  │ 5.0 │ 3    │ 3.0   │
│ 15  │ 5  │ 5.0 │ 4    │ 4.0   │

julia> # id
       # join(small, large, on=:id, kind=:inner)
       DataTable(id = [1, 3], fid = [1.0, 3.0], fid_1 = [1.0, 3.0])
2×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1.0   │
│ 2   │ 3  │ 3.0 │ 3.0   │

julia> # join(small, large, on=:id, kind=:left)
       DataTable(id = [1, 3, 5], fid = [1.0, 3.0, 5.0], fid_1 = [1.0, 3.0, N])
3×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1.0   │
│ 2   │ 3  │ 3.0 │ 3.0   │
│ 3   │ 5  │ 5.0 │ #NULL │

julia> # join(small, large, on=:id, kind=:semi)
       DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

julia> # join(small, large, on=:id, kind=:anti)
       DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 5  │ 5.0 │

julia> # join(small, large, on=:id, kind=:outer)
       DataTable(id = 0:5, fid = [N, 1.0, N, 3.0, N, 5.0], fid_1 = [0.0, 1.0, 2.0, 3.0, 4.0, N])
6×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 0  │ #NULL │ 0.0   │
│ 2   │ 1  │ 1.0   │ 1.0   │
│ 3   │ 2  │ #NULL │ 2.0   │
│ 4   │ 3  │ 3.0   │ 3.0   │
│ 5   │ 4  │ #NULL │ 4.0   │
│ 6   │ 5  │ 5.0   │ #NULL │

julia> # join(small, large, on=:id, kind=:right)
       DataTable(id = 0:4, fid = [N, 1.0, N, 3.0, N], fid_1 = 0.0:4.0)
5×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 0  │ #NULL │ 0.0   │
│ 2   │ 1  │ 1.0   │ 1.0   │
│ 3   │ 2  │ #NULL │ 2.0   │
│ 4   │ 3  │ 3.0   │ 3.0   │
│ 5   │ 4  │ #NULL │ 4.0   │

julia> # fid
       # join(small, large, on=:fid, kind=:inner)
       DataTable(id = [1, 3], fid = [1.0, 3.0], id_1 = [1, 3])
2×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼────┼─────┼──────┤
│ 1   │ 1  │ 1.0 │ 1    │
│ 2   │ 3  │ 3.0 │ 3    │

julia> # join(small, large, on=:fid, kind=:left)
       DataTable(id = [1, 3, 5], fid = [1.0, 3.0, 5.0], id_1 = [1, 3, N])
3×3 DataTables.DataTable
│ Row │ id │ fid │ id_1  │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1     │
│ 2   │ 3  │ 3.0 │ 3     │
│ 3   │ 5  │ 5.0 │ #NULL │

julia> # join(small, large, on=:fid, kind=:semi)
       DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

julia> # join(small, large, on=:fid, kind=:anti)
       DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 5  │ 5.0 │

julia> # join(small, large, on=:fid, kind=:outer)
       DataTable(id = [N, 1, N, 3, N, 5], fid = 0.0:5.0, id_1 = [0, 1, 2, 3, 4, N])
6×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1  │
├─────┼───────┼─────┼───────┤
│ 1   │ #NULL │ 0.0 │ 0     │
│ 2   │ 1     │ 1.0 │ 1     │
│ 3   │ #NULL │ 2.0 │ 2     │
│ 4   │ 3     │ 3.0 │ 3     │
│ 5   │ #NULL │ 4.0 │ 4     │
│ 6   │ 5     │ 5.0 │ #NULL │

julia> # join(small, large, on=:fid, kind=:right)
       DataTable(id = [N, 1, N, 3, N], fid = 0.0:4.0, id_1 = 0:4)
5×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1 │
├─────┼───────┼─────┼──────┤
│ 1   │ #NULL │ 0.0 │ 0    │
│ 2   │ 1     │ 1.0 │ 1    │
│ 3   │ #NULL │ 2.0 │ 2    │
│ 4   │ 3     │ 3.0 │ 3    │
│ 5   │ #NULL │ 4.0 │ 4    │

julia> # both
       # join(small, large, on=[:id, :fid], kind=:inner)
       DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

julia> # join(small, large, on=[:id, :fid], kind=:left)
       DataTable(id = 1:2:5, fid = 1.0:2.0:5.0)
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │

julia> # join(small, large, on=[:id, :fid], kind=:semi)
       DataTable(id = [1, 3], fid = [1.0, 3.0])
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

julia> # join(small, large, on=[:id, :fid], kind=:anti)
       DataTable(id = 5, fid = 5.0)
1×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 5  │ 5.0 │

julia> # join(small, large, on=[:id, :fid], kind=:outer)
       DataTable(id = 0:5, fid = 0.0:5.0)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │
│ 6   │ 5  │ 5.0 │

julia> # join(small, large, on=[:id, :fid], kind=:right)
       DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │

The only ones that don't behave as expected are the right outer join and the full outer join, which both appear to order according to the left data table and then the right. The answer I expected immediately follows the join I expected to produce it.

julia> join(small, large, on=:id, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 1  │ 1.0   │ 1.0   │
│ 2   │ 3  │ 3.0   │ 3.0   │
│ 3   │ 5  │ 5.0   │ #NULL │
│ 4   │ 0  │ #NULL │ 0.0   │
│ 5   │ 2  │ #NULL │ 2.0   │
│ 6   │ 4  │ #NULL │ 4.0   │

julia> DataTable(id = 0:5, fid = [N, 1.0, N, 3.0, N, 5.0], fid_1 = [0.0, 1.0, 2.0, 3.0, 4.0, N])
6×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 0  │ #NULL │ 0.0   │
│ 2   │ 1  │ 1.0   │ 1.0   │
│ 3   │ 2  │ #NULL │ 2.0   │
│ 4   │ 3  │ 3.0   │ 3.0   │
│ 5   │ 4  │ #NULL │ 4.0   │
│ 6   │ 5  │ 5.0   │ #NULL │

julia> join(small, large, on=:id, kind=:right)
5×3 DataTables.DataTable
│ Row │ id    │ fid   │ fid_1 │
├─────┼───────┼───────┼───────┤
│ 1   │ 0     │ 1.0   │ 1.0   │
│ 2   │ 3     │ 3.0   │ 3.0   │
│ 3   │ 2     │ #NULL │ 0.0   │
│ 4   │ #NULL │ #NULL │ 2.0   │
│ 5   │ 4     │ #NULL │ 4.0   │

julia> DataTable(id = 0:4, fid = [N, 1.0, N, 3.0, N], fid_1 = 0.0:4.0)
5×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 0  │ #NULL │ 0.0   │
│ 2   │ 1  │ 1.0   │ 1.0   │
│ 3   │ 2  │ #NULL │ 2.0   │
│ 4   │ 3  │ 3.0   │ 3.0   │
│ 5   │ 4  │ #NULL │ 4.0   │

julia> join(small, large, on=:fid, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1  │
├─────┼───────┼─────┼───────┤
│ 1   │ 1     │ 1.0 │ 1     │
│ 2   │ 3     │ 3.0 │ 3     │
│ 3   │ 5     │ 5.0 │ #NULL │
│ 4   │ #NULL │ 0.0 │ 0     │
│ 5   │ #NULL │ 2.0 │ 2     │
│ 6   │ #NULL │ 4.0 │ 4     │

julia> DataTable(id = [N, 1, N, 3, N, 5], fid = 0.0:5.0, id_1 = [0, 1, 2, 3, 4, N])
6×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1  │
├─────┼───────┼─────┼───────┤
│ 1   │ #NULL │ 0.0 │ 0     │
│ 2   │ 1     │ 1.0 │ 1     │
│ 3   │ #NULL │ 2.0 │ 2     │
│ 4   │ 3     │ 3.0 │ 3     │
│ 5   │ #NULL │ 4.0 │ 4     │
│ 6   │ 5     │ 5.0 │ #NULL │

julia> join(small, large, on=:fid, kind=:right)
5×3 DataTables.DataTable
│ Row │ id    │ fid   │ id_1 │
├─────┼───────┼───────┼──────┤
│ 1   │ 1     │ 0.0   │ 1    │
│ 2   │ 3     │ 3.0   │ 3    │
│ 3   │ #NULL │ 2.0   │ 0    │
│ 4   │ #NULL │ #NULL │ 2    │
│ 5   │ #NULL │ 4.0   │ 4    │

julia> DataTable(id = [N, 1, N, 3, N], fid = 0.0:4.0, id_1 = 0:4)
5×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1 │
├─────┼───────┼─────┼──────┤
│ 1   │ #NULL │ 0.0 │ 0    │
│ 2   │ 1     │ 1.0 │ 1    │
│ 3   │ #NULL │ 2.0 │ 2    │
│ 4   │ 3     │ 3.0 │ 3    │
│ 5   │ #NULL │ 4.0 │ 4    │

julia> join(small, large, on=[:id, :fid], kind=:outer)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │
│ 4   │ 0  │ 0.0 │
│ 5   │ 2  │ 2.0 │
│ 6   │ 4  │ 4.0 │

julia> DataTable(id = 0:5, fid = 0.0:5.0)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │
│ 6   │ 5  │ 5.0 │

julia> join(small, large, on=[:id, :fid], kind=:right)
5×2 DataTables.DataTable
│ Row │ id    │ fid   │
├─────┼───────┼───────┤
│ 1   │ 0     │ 0.0   │
│ 2   │ 3     │ 3.0   │
│ 3   │ 2     │ 2.0   │
│ 4   │ #NULL │ #NULL │
│ 5   │ 4     │ 4.0   │

julia> DataTable(id = 0:4, fid = 0.0:4.0)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │

We recently merged a PR that modified joining in #17, and at first I thought it was a behavioral bug I had missed, but I ran these tests against DataFrames and got the same results so it seems this is how outer joins have behaved in Julia in the past few years. I personally would prefer that both

we never promote to nullable arrays unless required
we order as shown in the examples rather than what the current behavior yields.

I tried to find examples of this but I came up with very few guiding examples to inform what would be the "correct" behavior here. Turns out everyone on the internet explains joins with venn diagrams. Here's one example that would suggest our current ordering is wrong.
https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/

If anyone has examples from other languages (R, Python, Matlab, others) or examples from reference material those would be very helpful.

For quick polling, let's do
👍 = change both (REMOVE/CHANGE autopromotion and CHANGE the ordering)
😕 = KEEP autopromotion of outer joins, but the ordering is confusing so CHANGE ordering
👎 = REMOVE/CHANGE autopromotion, but KEEP order as is
❤️ = I love it the way it is, KEEP autopromotion of outer joins and KEEP order as is

Thanks for reading!

The text was updated successfully, but these errors were encountered:

nalimilan · 2017-03-16T13:44:01Z

Please separate the nullability issue from the ordering issues, as these are completely orthogonal AFAICT. Let's keep this issue for nullability, which is the most important one IMHO. (Also SQL tables have no ordering guarantees, so that's not a reference in that regard.)

I've already given my opinion on the PR: the column type shouldn't change depending on whether there are missing values or not. It should only be determined by the types of original columns and the type of join. So that means promoting to nullable arrays except for inner joins. This is what #30 does currently.

In the long term, if Nullable{T} ends up being replaced with Union{T, Null}, then that distinction will be less of an issue as both Vector{T} and Vector{Nullable{T}}/NullableVector{T} will behave the same, except that the latter will support setting some values to null. But for now this makes a big difference.

cjprybol · 2017-03-16T17:14:44Z

Yes, they are pretty orthogonal, but I imagine knowing the stance of the community on both of these behaviors will help inform the best way to implement changes to join (if any are desired at all).

(Also SQL tables have no ordering guarantees, so that's not a reference in that regard.)

^ this is very helpful to know, Thanks!

I also didn't mean to convey that these changes have to be implemented in the open PR, just that I started to think about them while exploring the behavioral changes introduced by the open PR. Right now the current implementation fails (in master and the PR) because it expects nullable arrays. If we bypass autopromotion with the one constructor that allows it, we can induce the error. Here's Julia master using DataTables master.

julia> small = DataTable(Any[[1, 3, 5], [1.0, 3.0, 5.0]], [:id, :fid])
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │

julia> large = DataTable(Any[[0, 1, 2, 3, 4], [0.0, 1.0, 2.0, 3.0, 4.0]], [:id, :fid])
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │

julia> join(small, large, on=:id, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid          │ fid_1 │
├─────┼────┼──────────────┼───────┤
│ 1   │ 1  │ 1.0          │ 1.0   │
│ 2   │ 3  │ 3.0          │ 3.0   │
│ 3   │ 5  │ 5.0          │ 0.0   │
│ 4   │ 0  │ 2.32209e-314 │ 0.0   │
│ 5   │ 2  │ 2.32208e-314 │ 2.0   │
│ 6   │ 4  │ 2.32209e-314 │ 4.0   │

Missing values are filled with uninitialized floats. We'll have to autopromote the entire DataTable to using NullableArrays to patch this issue in the short term.

In the long term, if Nullable{T} ends up being replaced with Union{T, Null}, then that distinction will be less of an issue as both Vector{T} and Vector{Nullable{T}}/NullableVector{T} will behave the same, except that the latter will support setting some values to null. But for now this makes a big difference.

^ That sounds pretty great.

nalimilan · 2017-03-16T18:44:09Z

Yes, they are pretty orthogonal, but I imagine knowing the stance of the community on both of these behaviors will help inform the best way to implement changes to join (if any are desired at all).

It's a good idea to keep a link between issues, but mixing things in discussions/PRs is the recipe for long and inconclusive debates. So really better have two separate issues (with links from one to the other).

Missing values are filled with uninitialized floats. We'll have to autopromote the entire DataTable to using NullableArrays to patch this issue in the short term.

I'm confused. Isn't that what #30 is doing now?

cjprybol · 2017-03-16T19:26:12Z

It will, I'll explain over there when I push the fix

cjprybol · 2017-03-16T21:55:30Z

Here is a comparison with Pandas. Pandas includes the on columns from both tables, while DataTables merges the on columns into one. I'll ignore that difference when comparing results. I also notice cases where Pandas promotes the Ints of id to Floats, which I don't see us doing anywhere and is a plus for DataTables.

julia> using DataTables

julia> small = DataTable(id = [1, 3, 5], fid = [1.0, 3.0, 5.0])
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │

julia> large = DataTable(id = [0, 1, 2, 3, 4], fid = [0.0, 1.0, 2.0, 3.0, 4.0])
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 0  │ 0.0 │
│ 2   │ 1  │ 1.0 │
│ 3   │ 2  │ 2.0 │
│ 4   │ 3  │ 3.0 │
│ 5   │ 4  │ 4.0 │

In [1]: import pandas as pd

In [2]: from collections import OrderedDict as od

In [3]: small = pd.DataFrame(od((("id", [1, 3, 5]), ("fid", [1.0, 3.0, 5.0]))))

In [4]: large = pd.DataFrame(od((("id", [0, 1, 2, 3, 4]), ("fid", [0.0, 1.0, 2.0, 3.0, 4.0]))))

correct?

julia> join(small, large, on=:id, kind=:inner)
2×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1.0   │
│ 2   │ 3  │ 3.0 │ 3.0   │

In [5]: small.join(large, on="id", how='inner', rsuffix="_1")
Out[5]:
   id  fid  id_1  fid_1
0   1  1.0     1    1.0
1   3  3.0     3    3.0

correct?

julia> join(small, large, on=:id, kind=:left)
3×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1.0   │
│ 2   │ 3  │ 3.0 │ 3.0   │
│ 3   │ 5  │ 5.0 │ #NULL │

In [6]: small.join(large, on="id", how='left', rsuffix="_1")
Out[6]:
   id  fid  id_1  fid_1
0   1  1.0   1.0    1.0
1   3  3.0   3.0    3.0
2   5  5.0   NaN    NaN

correct?

julia> join(small, large, on=:id, kind=:right)
5×3 DataTables.DataTable
│ Row │ id    │ fid   │ fid_1 │
├─────┼───────┼───────┼───────┤
│ 1   │ 0     │ 1.0   │ 1.0   │
│ 2   │ 3     │ 3.0   │ 3.0   │
│ 3   │ 2     │ #NULL │ 0.0   │
│ 4   │ #NULL │ #NULL │ 2.0   │
│ 5   │ 4     │ #NULL │ 4.0   │

In [7]: small.join(large, on="id", how='right', rsuffix="_1")
Out[7]:
    id  fid  id_1  fid_1
0  1.0  1.0     1    1.0
1  3.0  3.0     3    3.0
2  0.0  NaN     0    0.0
2  2.0  NaN     2    2.0
2  4.0  NaN     4    4.0

correct?

julia> join(small, large, on=:id, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid   │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1   │ 1  │ 1.0   │ 1.0   │
│ 2   │ 3  │ 3.0   │ 3.0   │
│ 3   │ 5  │ 5.0   │ #NULL │
│ 4   │ 0  │ #NULL │ 0.0   │
│ 5   │ 2  │ #NULL │ 2.0   │
│ 6   │ 4  │ #NULL │ 4.0   │

In [8]: small.join(large, on="id", how='outer', rsuffix="_1")
Out[8]:
    id  fid  id_1  fid_1
0  1.0  1.0   1.0    1.0
1  3.0  3.0   3.0    3.0
2  5.0  5.0   NaN    NaN
2  0.0  NaN   0.0    0.0
2  2.0  NaN   2.0    2.0
2  4.0  NaN   4.0    4.0

correct?

julia> join(small, large, on=:fid, kind=:inner)
2×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼────┼─────┼──────┤
│ 1   │ 1  │ 1.0 │ 1    │
│ 2   │ 3  │ 3.0 │ 3    │

In [9]: small.join(large, on="fid", how='inner', rsuffix="_1")
Out[9]:
   id  fid  id_1  fid_1
0   1  1.0     1    1.0
1   3  3.0     3    3.0

correct?

julia> join(small, large, on=:fid, kind=:left)
3×3 DataTables.DataTable
│ Row │ id │ fid │ id_1  │
├─────┼────┼─────┼───────┤
│ 1   │ 1  │ 1.0 │ 1     │
│ 2   │ 3  │ 3.0 │ 3     │
│ 3   │ 5  │ 5.0 │ #NULL │

In [10]: small.join(large, on="fid", how='left', rsuffix="_1")
Out[10]:
   id  fid  id_1  fid_1
0   1  1.0   1.0    1.0
1   3  3.0   3.0    3.0
2   5  5.0   NaN    NaN

correct?

julia> join(small, large, on=:fid, kind=:right)
5×3 DataTables.DataTable
│ Row │ id    │ fid   │ id_1 │
├─────┼───────┼───────┼──────┤
│ 1   │ 1     │ 0.0   │ 1    │
│ 2   │ 3     │ 3.0   │ 3    │
│ 3   │ #NULL │ 2.0   │ 0    │
│ 4   │ #NULL │ #NULL │ 2    │
│ 5   │ #NULL │ 4.0   │ 4    │

In [11]: small.join(large, on="fid", how='right', rsuffix="_1")
Out[11]:
    id  fid  id_1  fid_1
0  1.0  1.0     1    1.0
1  3.0  3.0     3    3.0
2  NaN  0.0     0    0.0
2  NaN  2.0     2    2.0
2  NaN  4.0     4    4.0

correct?

julia> join(small, large, on=:fid, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id    │ fid │ id_1  │
├─────┼───────┼─────┼───────┤
│ 1   │ 1     │ 1.0 │ 1     │
│ 2   │ 3     │ 3.0 │ 3     │
│ 3   │ 5     │ 5.0 │ #NULL │
│ 4   │ #NULL │ 0.0 │ 0     │
│ 5   │ #NULL │ 2.0 │ 2     │
│ 6   │ #NULL │ 4.0 │ 4     │

In [12]: small.join(large, on="fid", how='outer', rsuffix="_1")
Out[12]:
    id  fid  id_1  fid_1
0  1.0  1.0   1.0    1.0
1  3.0  3.0   3.0    3.0
2  5.0  5.0   NaN    NaN
2  NaN  0.0   0.0    0.0
2  NaN  2.0   2.0    2.0
2  NaN  4.0   4.0    4.0

and none of these worked in pandas

julia> join(small, large, on=[:id, :fid], kind=:inner)
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │

small.join(large, on=["id", "fid"], how='inner', rsuffix="_1")
**error**

julia> join(small, large, on=[:id, :fid], kind=:left)
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │

small.join(large, on=["id", "fid"], how='left', rsuffix="_1")
**error**

julia> join(small, large, on=[:id, :fid], kind=:right)
5×2 DataTables.DataTable
│ Row │ id    │ fid   │
├─────┼───────┼───────┤
│ 1   │ 0     │ 0.0   │
│ 2   │ 3     │ 3.0   │
│ 3   │ 2     │ 2.0   │
│ 4   │ #NULL │ #NULL │
│ 5   │ 4     │ 4.0   │

small.join(large, on=["id", "fid"], how='right', rsuffix="_1")
**error**

julia> join(small, large, on=[:id, :fid], kind=:outer)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1   │ 1  │ 1.0 │
│ 2   │ 3  │ 3.0 │
│ 3   │ 5  │ 5.0 │
│ 4   │ 0  │ 0.0 │
│ 5   │ 2  │ 2.0 │
│ 6   │ 4  │ 4.0 │

small.join(large, on=["id", "fid"], how='outer', rsuffix="_1")
**error**

In summary:

Right joins don't match
DataTables doesn't include the columns from both tables, Pandas does
DataTables also does left then right ordering, so that would be a vote against reordering the joins as I described

cjprybol · 2017-03-17T16:19:40Z

I think the most useful purpose this issue will serve has already been completed: helping me learn how joins are supposed to work by publicly embarrassing myself. Sorry for the noise everyone

nalimilan · 2017-03-17T16:29:18Z

I've not yet looked at ordering issues (I'd like to get things working first), but why do the two non-"correct" examples include null IDs? Is that a bug in our implementation?

cjprybol · 2017-03-17T16:31:59Z

Yea, that's a bug. The patch in #30 fixes it in all examples/tests I tried 3f2cd63#diff-b1b6de17647445a5c08f65fd851680caR96

ararslan · 2017-03-17T17:56:41Z

@cjprybol No need to be embarrassed at all. This has been instructive for me as well. Sometimes having these conversations can be a good sanity check, making us question our assumptions so that we can be sure we're on the right track.

cjprybol mentioned this issue Mar 17, 2017

Stop auto-promoting column-types #30

Closed

cjprybol closed this as completed Mar 17, 2017

cjprybol mentioned this issue Mar 31, 2017

adjust outer join behavior (types and right outer join bug) #44

Merged

cjprybol mentioned this issue Oct 26, 2017

Ensure levels are maintained when joining df's with Categorical cols JuliaData/DataFrames.jl#1266

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"correct" join results #34

"correct" join results #34

cjprybol commented Mar 16, 2017

nalimilan commented Mar 16, 2017

cjprybol commented Mar 16, 2017

nalimilan commented Mar 16, 2017

cjprybol commented Mar 16, 2017

cjprybol commented Mar 16, 2017 •

edited

Loading

cjprybol commented Mar 17, 2017

nalimilan commented Mar 17, 2017

cjprybol commented Mar 17, 2017

ararslan commented Mar 17, 2017

"correct" join results #34

"correct" join results #34

Comments

cjprybol commented Mar 16, 2017

nalimilan commented Mar 16, 2017

cjprybol commented Mar 16, 2017

nalimilan commented Mar 16, 2017

cjprybol commented Mar 16, 2017

cjprybol commented Mar 16, 2017 • edited Loading

cjprybol commented Mar 17, 2017

nalimilan commented Mar 17, 2017

cjprybol commented Mar 17, 2017

ararslan commented Mar 17, 2017

cjprybol commented Mar 16, 2017 •

edited

Loading