[BUG] full_join considers nulls as equal #5563
Comments
Oldie but a goodie: #542. The behavior definitely used to be So I think we should just switch back to
FYI @revans2, the way things are currently implemented, this doesn't look like it's unique to full join. Inner and left outer joins would have the same issue.
From our tests, other join operators are not showing issues with nulls in the key. It is just full_join. I would need to dig into the details to see whether something else is masking it in Spark.
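To make the inner join case discussed above concrete, here is a minimal, library-free sketch. It does not use cudf; `Key`, `keys_match`, and `inner_join` are hypothetical names invented for illustration, and `std::nullopt` stands in for a null key. It only shows how a "nulls are equal" comparison changes an inner join's output, not how cudf implements it.

```cpp
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a nullable join key; std::nullopt means null.
using Key = std::optional<int>;

// Hypothetical helper (not a cudf API): decides whether two keys match.
// With nulls_equal == false, a null never matches anything (SQL-style).
bool keys_match(const Key& a, const Key& b, bool nulls_equal) {
  if (!a.has_value() || !b.has_value()) {
    return nulls_equal && !a.has_value() && !b.has_value();
  }
  return *a == *b;
}

// Nested-loop inner join returning (left index, right index) pairs.
std::vector<std::pair<std::size_t, std::size_t>> inner_join(
    const std::vector<Key>& left, const std::vector<Key>& right,
    bool nulls_equal) {
  std::vector<std::pair<std::size_t, std::size_t>> out;
  for (std::size_t i = 0; i < left.size(); ++i) {
    for (std::size_t j = 0; j < right.size(); ++j) {
      if (keys_match(left[i], right[j], nulls_equal)) {
        out.emplace_back(i, j);
      }
    }
  }
  return out;
}
```

With left keys `{null, 1}` and right keys `{null, 1}`, calling `inner_join(left, right, true)` pairs the two null rows as well as the two `1` rows, while `inner_join(left, right, false)` keeps only the `1` match.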
I think this is a test that reproduces the issue:
TEST_F(JoinTest, FullJoinWithNullKeys)
{
  // Each wrapper takes {values}, {validity}; a validity of 0 marks a null row.
  column_wrapper<int32_t> col0_0{{0, 1}, {0, 1}};   // left keys:  null, 1
  column_wrapper<int32_t> col1_0{{-1, 2}, {0, 1}};  // right keys: null, 2
  CVector cols0, cols1;
  cols0.push_back(col0_0.release());
  cols1.push_back(col1_0.release());
  Table t0(std::move(cols0));
  Table t1(std::move(cols1));
  // Full join on column 0 of both tables, with that column treated as common.
  auto result = cudf::full_join(t0, t1, {0}, {0}, {{0, 0}});
  // Sort the result so the comparison does not depend on row order.
  auto result_sort_order = cudf::sorted_order(result->view());
  auto sorted_result = cudf::gather(result->view(), *result_sort_order);
  // Expected: nulls do not match each other, so all four rows survive.
  column_wrapper<int32_t> col_gold_0{{0, 1, -1, 2}, {0, 1, 0, 1}};
  CVector cols_gold;
  cols_gold.push_back(col_gold_0.release());
  Table gold(std::move(cols_gold));
  auto gold_sort_order = cudf::sorted_order(gold.view());
  auto sorted_gold = cudf::gather(gold.view(), *gold_sort_order);
  cudf::test::expect_tables_equal(*sorted_gold, *sorted_result);
}
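As a rough illustration of what the test expects, the sketch below runs a full outer join over the same key values (left `null, 1`; right `null, 2`) without cudf. `Key`, `keys_match`, `full_join_keys`, and `count_nulls` are hypothetical names, and the single output column mimics the "common column" result only loosely. It is a nested-loop sketch under those assumptions, not the library's algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a nullable join key; std::nullopt means null.
using Key = std::optional<int>;

// Hypothetical helper (not a cudf API): null keys match only if nulls_equal.
bool keys_match(const Key& a, const Key& b, bool nulls_equal) {
  if (!a.has_value() || !b.has_value()) {
    return nulls_equal && !a.has_value() && !b.has_value();
  }
  return *a == *b;
}

// Full outer join that returns only the shared key column.
std::vector<Key> full_join_keys(const std::vector<Key>& left,
                                const std::vector<Key>& right,
                                bool nulls_equal) {
  std::vector<Key> out;
  std::vector<bool> right_matched(right.size(), false);
  for (const Key& l : left) {
    bool matched = false;
    for (std::size_t j = 0; j < right.size(); ++j) {
      if (keys_match(l, right[j], nulls_equal)) {
        out.push_back(l);  // one output row per matching pair
        matched = true;
        right_matched[j] = true;
      }
    }
    if (!matched) out.push_back(l);  // unmatched left row is kept
  }
  for (std::size_t j = 0; j < right.size(); ++j) {
    if (!right_matched[j]) out.push_back(right[j]);  // unmatched right row is kept
  }
  return out;
}

// Counts null keys in a join result.
std::size_t count_nulls(const std::vector<Key>& keys) {
  return static_cast<std::size_t>(std::count_if(
      keys.begin(), keys.end(), [](const Key& k) { return !k.has_value(); }));
}
```

With `nulls_equal` set to `false`, the two null keys stay on separate rows, giving the four rows in the gold column. With `nulls_equal` set to `true`, the null rows are merged and only three rows come back, which is the mismatch the test is meant to catch.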
I still don't know why this is not showing up in Spark for a regular left join, but while trying to add a right outer join, using left join, I am seeing it there too. I need to dig a bit into Spark to try to understand why it is not showing up on a left join.
I suspect that this might not show up in a right join either (had we one). This might be happening because SparkSQL likely pushes down an implicit With a I'll try to verify this tomorrow. Very interesting bug, this.
This was fixed a while ago by the patch from @mythrocks |
Describe the bug
When I do a full_join in Spark, if the key has nulls in it I get the wrong answer because full_join in cudf considers null to be equal, but the rest of the world does not.
Steps/Code to reproduce bug
Pseudocode. I'll try to come up with an actual unit test to reproduce this, but for now...
In Spark on the CPU this returns
From cudf it returns