Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] full_join considers nulls as equal #5563

Closed
revans2 opened this issue Jun 23, 2020 · 7 comments
Closed

[BUG] full_join considers nulls as equal #5563

revans2 opened this issue Jun 23, 2020 · 7 comments
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Jun 23, 2020

Describe the bug
When I do a full_join in spark if the key has nulls in it I get the wrong answer because full_join in cudf considers null to be equal, but in the rest of the world it does not.

Steps/Code to reproduce bug
psudo code. I'll try to come up with an actual unit test to reproduce this, but for now...

a = {{null, 1}}
b = {{null, 2}}
full_join(a, b)

in spark on the CPU the returns

{{null, 1, null, null},
 {null, null, null, 2}}

from cudf it returns

{{null, 1, null, 2}}
@revans2 revans2 added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Jun 23, 2020
@jrhemstad
Copy link
Contributor

jrhemstad commented Jun 23, 2020

Oldie but a goodie: #542

The behavior definitely used to be NULL != NULL, but that may have just been an oversight in porting since row_equality_comparator defaults to NULL == NULL.

So I think we should just switch back to NULL != NULL by specifying it in the row comparator.

@jrhemstad
Copy link
Contributor

fyi, @revans2 the way things are currently implemented, this doesn't look like it's unique to full join. Inner and left outer would have this same issue.

@jrhemstad jrhemstad added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jun 23, 2020
@revans2
Copy link
Contributor Author

revans2 commented Jun 23, 2020

From our tests other join operators are not showing issues with nulls in the key. It is just full_join. I would need to dig into the details to see if something else is masking it in spark or not.

@revans2
Copy link
Contributor Author

revans2 commented Jun 23, 2020

I think this is a test the reproduces the issue

TEST_F(JoinTest, FullJoinWithNullKeys)
{
  column_wrapper<int32_t> col0_0{{0, 1}, {0, 1}};
  column_wrapper<int32_t> col1_0{{-1, 2}, {0, 1}};

  CVector cols0, cols1;
  cols0.push_back(col0_0.release());
  cols1.push_back(col1_0.release());

  Table t0(std::move(cols0));
  Table t1(std::move(cols1));

  auto result            = cudf::full_join(t0, t1, {0}, {0}, {{0, 0}});
  auto result_sort_order = cudf::sorted_order(result->view());
  auto sorted_result     = cudf::gather(result->view(), *result_sort_order);

  column_wrapper<int32_t> col_gold_0{{0, 1, -1, 2}, {0, 1, 0, 1}};
  CVector cols_gold;
  cols_gold.push_back(col_gold_0.release());
  Table gold(std::move(cols_gold));

  auto gold_sort_order = cudf::sorted_order(gold.view());
  auto sorted_gold     = cudf::gather(gold.view(), *gold_sort_order);
  cudf::test::expect_tables_equal(*sorted_gold, *sorted_result);
}

@revans2
Copy link
Contributor Author

revans2 commented Jun 23, 2020

I still don't know why this is not showing up in spark for a regular left join, but while trying to add in a right outer join, using left join I am seeing it there too. I need to dig a bit into spark to try and understand why it is not showing up on a left join.

@mythrocks
Copy link
Contributor

mythrocks commented Jun 24, 2020

I need to dig a bit into spark to try and understand why it is not showing up on a left join.

I suspect that this might not show up in a right join either (had we one).

This might be happening because SparkSQL likely pushes down an implicit right_table.join_key IS NOT NULL filter, for LEFT OUTER JOIN. So the right-side null records are pre-filtered, and aren't available to produce the wrong results.

With a FULL OUTER JOIN, neither side can be pre-filtered. Without the join predicate pushed down, null-keys on both sides make it to the cudf::full_join(), and we get the incorrect results.

I'll try verify this tomorrow. Very interesting bug, this.

@revans2
Copy link
Contributor Author

revans2 commented Jul 14, 2020

This was fixed a while ago by the patch from @mythrocks

@revans2 revans2 closed this as completed Jul 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

No branches or pull requests

3 participants