GH-43712: [C++][Parquet] Dataset: Handle num-nulls in Parquet correctly when !HasNullCount() #43726

Merged
merged 10 commits into from
Sep 6, 2024

Conversation

mapleFU
Member

@mapleFU mapleFU commented Aug 16, 2024

Rationale for this change

See the issue. When !HasNullCount(), we cannot guarantee whether nulls exist.

What changes are included in this PR?

Handle HasNullCount in dataset expr
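
As a rough illustration of the idea (a minimal sketch, not the code in this PR), the check below treats a missing null count as "nulls may be present" instead of reading it as zero. `ColumnStats` and `MayHaveNulls` are hypothetical names made up for this example; they only mimic the shape of `HasNullCount()` / `null_count()` and are not the Parquet statistics API.

```cpp
#include <cstdint>

// Hypothetical stand-in for a column chunk's statistics (illustration only).
struct ColumnStats {
  bool has_null_count;  // whether the writer recorded a null count at all
  int64_t null_count;   // only meaningful when has_null_count is true
  int64_t num_values;   // number of non-null values
};

// Nulls can be ruled out only when a null count was recorded and it is zero.
// A missing null count must be treated as "nulls may be present".
bool MayHaveNulls(const ColumnStats& stats) {
  return !stats.has_null_count || stats.null_count > 0;
}
```

Under this assumption, a guarantee derived from min/max statistics would only drop the "or is null" part when `MayHaveNulls` returns false.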

Are these changes tested?

Yes

Are there any user-facing changes?

Merely

⚠️ GitHub issue #43712 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label Aug 16, 2024
@mapleFU
Member Author

mapleFU commented Aug 22, 2024

@pitrou @bkietz Would you mind taking a look? I've added a test for this case now.

@mapleFU mapleFU requested a review from bkietz August 28, 2024 03:59
Member

@pitrou pitrou left a comment

Sorry for the delay @mapleFU .

cpp/src/arrow/dataset/file_parquet.cc Outdated Show resolved Hide resolved
Comment on lines 376 to 378
// If there are no values and no nulls, it might be empty or contains
// only null.
return std::nullopt;
Member

This seems a bit weird (though it's certainly safe to return std::nullopt).
If num_values() == 0, there can be only nulls, so we can just return is_null(std::move(field_expr)) too?

Member Author

@mapleFU mapleFU Sep 3, 2024

I don't know. This might mean "no values", like an empty page. I'm not sure whether an empty page should return is_null; it might be OK, but it seems a bit weird to me (is_null for both null and empty data).

Member

@bkietz What do you think?

Member

I think it's simpler and more consistent to return is_null(std::move(field_expr)) if we can; in general I think it usually makes the most sense to construct the most explicit/precise guarantees which are easily available (and therefore to avoid making this special case which will result in less specific guarantees).

cpp/src/arrow/dataset/file_parquet.cc Outdated Show resolved Hide resolved
}
{
// Special case: when num_value is 0, if has_null, it would return
// "is_null", otherwise it cannot guarantee anything
Member

I don't understand why it cannot guarantee anything. It knows that there are no non-null values, so is_null is true for all values (even if there are no values at all).
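
To make the "even if there are no values at all" point concrete, here is a small sketch (assuming C++17 for `std::optional`; `AllNull` is a hypothetical helper written for this example, not part of Arrow). A universally quantified predicate such as "every value is null" is vacuously true over an empty column, which is also how `std::all_of` behaves on an empty range.

```cpp
#include <algorithm>
#include <optional>
#include <vector>

// Hypothetical helper: true when every slot in the column is null.
// For an empty column, std::all_of returns true, so the "is_null"
// guarantee holds vacuously.
bool AllNull(const std::vector<std::optional<int>>& column) {
  return std::all_of(column.begin(), column.end(),
                     [](const std::optional<int>& v) { return !v.has_value(); });
}
```

So a column with zero non-null values satisfies "is_null for all values" whether it holds only nulls or no rows at all.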

Member Author

Replied in https://github.com/apache/arrow/pull/43726/files#r1742087940 . I can also change it to is_null. I have no strong preference here.

@mapleFU mapleFU requested a review from pitrou September 3, 2024 14:55
@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label Sep 4, 2024
@github-actions github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label Sep 5, 2024
@mapleFU
Member Author

mapleFU commented Sep 5, 2024

@pitrou @bkietz I've changed it to return is_null when the value count is 0.

@github-actions github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label Sep 5, 2024
Comment on lines 372 to 375
// If `statistics.HasNullCount()`, it means the all the values are nulls.
//
// If there are no values and no nulls, it might be empty or all values
// are nulls. In this case, we also return a null expression.
Member

Can you update this comment?

Member Author

Oh, it might be a mistake.

Member Author

    // If there are no values and `!statistics.HasNullCount()`, it might be
    // empty or all values are nulls. In this case, we also return a null
    // expression.

Changed to this.

@pitrou
Member

pitrou commented Sep 5, 2024

Thanks for the fix and tests @mapleFU !

@mapleFU
Member Author

mapleFU commented Sep 6, 2024

Enter primary author in the format of "name <email>" [mwish <[email protected]>]: 
Traceback (most recent call last):
  File "/Users/mwish/workspace/CMakeLibs/arrow/dev/merge_arrow_pr.py", line 790, in <module>
    cli()
  File "/Users/mwish/workspace/CMakeLibs/arrow/dev/merge_arrow_pr.py", line 771, in cli
    pr.merge()
  File "/Users/mwish/workspace/CMakeLibs/arrow/dev/merge_arrow_pr.py", line 644, in merge
    self.cmd.fail(f'Failed to merge pull request: {message}')
  File "/Users/mwish/workspace/CMakeLibs/arrow/dev/merge_arrow_pr.py", line 458, in fail
    raise Exception(msg)
Exception: Failed to merge pull request: Not Found: https://api.github.com/repos/apache/arrow/pulls/43726/merge

@kou Have you run into this problem when merging?

@kou
Member

kou commented Sep 6, 2024

I haven't seen it. Could you try it again? Or can I try it?

@kou
Member

kou commented Sep 6, 2024

Is your GitHub Personal Access Token valid?

@mapleFU
Member Author

mapleFU commented Sep 6, 2024

I'll check this.

@mapleFU mapleFU merged commit ab0a40e into apache:main Sep 6, 2024
40 of 41 checks passed
@mapleFU mapleFU removed the "awaiting merge" label Sep 6, 2024
@mapleFU
Member Author

mapleFU commented Sep 6, 2024

Aha, merged. Sorry for the disturbance.

@mapleFU mapleFU deleted the handle-num-nulls-correctly branch September 6, 2024 13:41
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ab0a40e.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 17 possible false positives for unstable benchmarks that are known to sometimes produce them.

khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
…orrectly when !HasNullCount() (apache#43726)

* GitHub Issue: apache#43712

Lead-authored-by: mwish <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: mwish <[email protected]>