-
Notifications
You must be signed in to change notification settings - Fork 11
"correct" join results #34
Comments
Please separate the nullability issue from the ordering issues, as these are completely orthogonal AFAICT. Let's keep this issue for nullability, which is the most important one IMHO. (Also SQL tables have no ordering guarantees, so that's not a reference in that regard.) I've already given my opinion on the PR: the column type shouldn't change depending on whether there are missing values or not. It should only be determined by the types of original columns and the type of join. So that means promoting to nullable arrays except for inner joins. This is what #30 does currently. In the long term, if |
Yes, they are pretty orthogonal, but I imagine knowing the stance of the community on both of these behaviors will help inform the best way to implement changes to join (if any are desired at all).
^ this is very helpful to know, Thanks! I also didn't mean to convey that these changes have to be implemented in the open PR, just that I started to think about them while exploring the behavioral changes introduced by the open PR. Right now the current implementation fails (in master and the PR) because it expects nullable arrays. If we bypass autopromotion with the one constructor that allows it, we can induce the error. Here's Julia master using DataTables master. julia> small = DataTable(Any[[1, 3, 5], [1.0, 3.0, 5.0]], [:id, :fid])
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 5 │ 5.0 │
julia> large = DataTable(Any[[0, 1, 2, 3, 4], [0.0, 1.0, 2.0, 3.0, 4.0]], [:id, :fid])
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 1 │ 1.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ 3 │ 3.0 │
│ 5 │ 4 │ 4.0 │
julia> join(small, large, on=:id, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼──────────────┼───────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │
│ 3 │ 5 │ 5.0 │ 0.0 │
│ 4 │ 0 │ 2.32209e-314 │ 0.0 │
│ 5 │ 2 │ 2.32208e-314 │ 2.0 │
│ 6 │ 4 │ 2.32209e-314 │ 4.0 │ Missing values are filled with uninitialized floats. We'll have to autopromote the entire DataTable to using NullableArrays to patch this issue in the short term.
^ That sounds pretty great. |
It's a good idea to keep a link between issues, but mixing things in discussions/PRs is the recipe for long and inconclusive debates. So really better have two separate issues (with links from one to the other).
I'm confused. Isn't that what #30 is doing now? |
It will, I'll explain over there when I push the fix |
Here is a comparison with Pandas. Pandas includes the julia> using DataTables
julia> small = DataTable(id = [1, 3, 5], fid = [1.0, 3.0, 5.0])
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 5 │ 5.0 │
julia> large = DataTable(id = [0, 1, 2, 3, 4], fid = [0.0, 1.0, 2.0, 3.0, 4.0])
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 1 │ 1.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ 3 │ 3.0 │
│ 5 │ 4 │ 4.0 │ In [1]: import pandas as pd
In [2]: from collections import OrderedDict as od
In [3]: small = pd.DataFrame(od((("id", [1, 3, 5]), ("fid", [1.0, 3.0, 5.0]))))
In [4]: large = pd.DataFrame(od((("id", [0, 1, 2, 3, 4]), ("fid", [0.0, 1.0, 2.0, 3.0, 4.0]))))
julia> join(small, large, on=:id, kind=:inner)
2×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │ In [5]: small.join(large, on="id", how='inner', rsuffix="_1")
Out[5]:
id fid id_1 fid_1
0 1 1.0 1 1.0
1 3 3.0 3 3.0
julia> join(small, large, on=:id, kind=:left)
3×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼─────┼───────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │
│ 3 │ 5 │ 5.0 │ #NULL │ In [6]: small.join(large, on="id", how='left', rsuffix="_1")
Out[6]:
id fid id_1 fid_1
0 1 1.0 1.0 1.0
1 3 3.0 3.0 3.0
2 5 5.0 NaN NaN
julia> join(small, large, on=:id, kind=:right)
5×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼───────┼───────┼───────┤
│ 1 │ 0 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │
│ 3 │ 2 │ #NULL │ 0.0 │
│ 4 │ #NULL │ #NULL │ 2.0 │
│ 5 │ 4 │ #NULL │ 4.0 │ In [7]: small.join(large, on="id", how='right', rsuffix="_1")
Out[7]:
id fid id_1 fid_1
0 1.0 1.0 1 1.0
1 3.0 3.0 3 3.0
2 0.0 NaN 0 0.0
2 2.0 NaN 2 2.0
2 4.0 NaN 4 4.0
julia> join(small, large, on=:id, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid │ fid_1 │
├─────┼────┼───────┼───────┤
│ 1 │ 1 │ 1.0 │ 1.0 │
│ 2 │ 3 │ 3.0 │ 3.0 │
│ 3 │ 5 │ 5.0 │ #NULL │
│ 4 │ 0 │ #NULL │ 0.0 │
│ 5 │ 2 │ #NULL │ 2.0 │
│ 6 │ 4 │ #NULL │ 4.0 │ In [8]: small.join(large, on="id", how='outer', rsuffix="_1")
Out[8]:
id fid id_1 fid_1
0 1.0 1.0 1.0 1.0
1 3.0 3.0 3.0 3.0
2 5.0 5.0 NaN NaN
2 0.0 NaN 0.0 0.0
2 2.0 NaN 2.0 2.0
2 4.0 NaN 4.0 4.0
julia> join(small, large, on=:fid, kind=:inner)
2×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼────┼─────┼──────┤
│ 1 │ 1 │ 1.0 │ 1 │
│ 2 │ 3 │ 3.0 │ 3 │ In [9]: small.join(large, on="fid", how='inner', rsuffix="_1")
Out[9]:
id fid id_1 fid_1
0 1 1.0 1 1.0
1 3 3.0 3 3.0
julia> join(small, large, on=:fid, kind=:left)
3×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼────┼─────┼───────┤
│ 1 │ 1 │ 1.0 │ 1 │
│ 2 │ 3 │ 3.0 │ 3 │
│ 3 │ 5 │ 5.0 │ #NULL │ In [10]: small.join(large, on="fid", how='left', rsuffix="_1")
Out[10]:
id fid id_1 fid_1
0 1 1.0 1.0 1.0
1 3 3.0 3.0 3.0
2 5 5.0 NaN NaN
julia> join(small, large, on=:fid, kind=:right)
5×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼───────┼───────┼──────┤
│ 1 │ 1 │ 0.0 │ 1 │
│ 2 │ 3 │ 3.0 │ 3 │
│ 3 │ #NULL │ 2.0 │ 0 │
│ 4 │ #NULL │ #NULL │ 2 │
│ 5 │ #NULL │ 4.0 │ 4 │ In [11]: small.join(large, on="fid", how='right', rsuffix="_1")
Out[11]:
id fid id_1 fid_1
0 1.0 1.0 1 1.0
1 3.0 3.0 3 3.0
2 NaN 0.0 0 0.0
2 NaN 2.0 2 2.0
2 NaN 4.0 4 4.0
julia> join(small, large, on=:fid, kind=:outer)
6×3 DataTables.DataTable
│ Row │ id │ fid │ id_1 │
├─────┼───────┼─────┼───────┤
│ 1 │ 1 │ 1.0 │ 1 │
│ 2 │ 3 │ 3.0 │ 3 │
│ 3 │ 5 │ 5.0 │ #NULL │
│ 4 │ #NULL │ 0.0 │ 0 │
│ 5 │ #NULL │ 2.0 │ 2 │
│ 6 │ #NULL │ 4.0 │ 4 │ In [12]: small.join(large, on="fid", how='outer', rsuffix="_1")
Out[12]:
id fid id_1 fid_1
0 1.0 1.0 1.0 1.0
1 3.0 3.0 3.0 3.0
2 5.0 5.0 NaN NaN
2 NaN 0.0 0.0 0.0
2 NaN 2.0 2.0 2.0
2 NaN 4.0 4.0 4.0 and none of these worked in pandas julia> join(small, large, on=[:id, :fid], kind=:inner)
2×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │ small.join(large, on=["id", "fid"], how='inner', rsuffix="_1")
**error** julia> join(small, large, on=[:id, :fid], kind=:left)
3×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 5 │ 5.0 │ small.join(large, on=["id", "fid"], how='left', rsuffix="_1")
**error** julia> join(small, large, on=[:id, :fid], kind=:right)
5×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼───────┼───────┤
│ 1 │ 0 │ 0.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 2 │ 2.0 │
│ 4 │ #NULL │ #NULL │
│ 5 │ 4 │ 4.0 │ small.join(large, on=["id", "fid"], how='right', rsuffix="_1")
**error** julia> join(small, large, on=[:id, :fid], kind=:outer)
6×2 DataTables.DataTable
│ Row │ id │ fid │
├─────┼────┼─────┤
│ 1 │ 1 │ 1.0 │
│ 2 │ 3 │ 3.0 │
│ 3 │ 5 │ 5.0 │
│ 4 │ 0 │ 0.0 │
│ 5 │ 2 │ 2.0 │
│ 6 │ 4 │ 4.0 │ small.join(large, on=["id", "fid"], how='outer', rsuffix="_1")
**error** In summary:
|
I think the most useful purpose this issue will serve has already been completed: helping me learn how joins are supposed to work by publicly embarrassing myself. Sorry for the noise everyone |
I've not yet looked at ordering issues (I'd like to get things working first), but why do the two non-"correct" examples include null IDs? Is that a bug in our implementation? |
Yea, that's a bug. The patch in #30 fixes it in all examples/tests I tried 3f2cd63#diff-b1b6de17647445a5c08f65fd851680caR96 |
@cjprybol No need to be embarrassed at all. This has been instructive for me as well. Sometimes having these conversations can be a good sanity check, making us question our assumptions so that we can be sure we're on the right track. |
We're making progress on removing the automatic promotion of columns to nullable arrays (NullableArrays & NullableCategoricalArrays) in #30 and I would like some feedback and discussion about the best way to handle outer joins (outer, left, right). Outer joins are particularly interesting because they all have the potential to introduce nulls, but do not necessarily need to do so. All of the other joins currently supported create subsets of the tables and shouldn't have the potential to introduce nulls. Please correct me if I'm wrong on this.
For the other joins that do not have the potential to introduce missing data, it seems that the "correct" result is pretty easy to decide on: The result should be in the same order as the input data and the column types should be preserved.
For outer joins, we have two inter-related "correct"-ness questions that both have to do with nulls. The first question is what is the "correct" ordering of the returned values when nulls are introduced. Do we try and retain the original ordering of the input datasets, or do we default to putting the null values at the end, or left then right table ordering? The second is what is the "correct" column type to return to the user. As all outer joins have the potential to introduce missingness, the "correct" result could be the one that is most column-type consistent. In this case, we would retain the column type and only promote to using nullable arrays when nulls are introduced. On the flip side of this is the behavioral consistency approach. Because all outer joins have the potential to introduce missing data, it could be argued that outer joins should always return nullable columns (only for the smaller of the two joined tables or just completely nullify the datatable). In that case, users can always expect to receive nullable columns to be output from outer joins and write nullable-tolerant code accordingly without anything unexpected.
Here is an example set of joins. I'll turn these into test cases to assert whatever behavior pattern we decide on, and if any other examples are put forth here I'll add those too.
the data
and here are the results I would expect from the joins, which are commented out in the line before the result I expected.
The only ones that don't behave as expected are the right outer join and the full outer join, which both appear to order according to the left data table and then the right. The answer I expected immediately follows the
join
I expected to produce it.We recently merged a PR that modified joining in #17, and at first I thought it was a behavioral bug I had missed, but I ran these tests against DataFrames and got the same results so it seems this is how outer joins have behaved in Julia in the past few years. I personally would prefer that both
I tried to find examples of this but I came up with very few guiding examples to inform what would be the "correct" behavior here. Turns out everyone on the internet explains joins with venn diagrams. Here's one example that would suggest our current ordering is wrong.
https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
If anyone has examples from other languages (R, Python, Matlab, others) or examples from reference material those would be very helpful.
For quick polling, let's do
👍 = change both (REMOVE/CHANGE autopromotion and CHANGE the ordering)
😕 = KEEP autopromotion of outer joins, but the ordering is confusing so CHANGE ordering
👎 = REMOVE/CHANGE autopromotion, but KEEP order as is
❤️ = I love it the way it is, KEEP autopromotion of outer joins and KEEP order as is
Thanks for reading!
The text was updated successfully, but these errors were encountered: