Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39562: [C++][Parquet] Fix crash in test_parquet_dataset_lazy_filtering #39632

Merged
merged 1 commit into from
Jan 16, 2024

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Jan 16, 2024

Rationale for this change

ParquetFileFragment stores a SchemaManifest that has a raw pointer to a SchemaDescriptor. The SchemaDescriptor is originally provided by a FileMetadata instance but, in some cases, the FileMetadata instance can be destroyed while the ParquetFileFragment is still in use. This can typically lead to bugs or crashes.

What changes are included in this PR?

Ensure that ParquetFileFragment keeps an owning pointer to the FileMetadata instance that provides its SchemaManifest's schema descriptor.

Are these changes tested?

An assertion is added that would fail deterministically in the Python test suite.

Are there any user-facing changes?

No.

TODO complete this description
Copy link

⚠️ GitHub issue #39562 has been automatically assigned in GitHub to PR creator.

@pitrou
Copy link
Member Author

pitrou commented Jan 16, 2024

@github-actions crossbow submit wheel-macos*

@pitrou pitrou requested review from felipecrv and bkietz January 16, 2024 11:19
Copy link

Only contributors can submit requests to this bot. Please ask someone from the community for help with getting the first commit in.
The Archery job run can be found at: https://github.com/apache/arrow/actions/runs/7540837592

DCHECK(row_groups_.has_value());

metadata_ = std::move(metadata);
manifest_ = std::move(manifest);
original_metadata_ = original_metadata ? std::move(original_metadata) : metadata_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ok for be but should we check Subset for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean FileMetaData::Subset?

Copy link
Member

@mapleFU mapleFU Jan 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I misunderstand this previously, metadata_ is from Subset, but manifest is shared and having lifetime problem here, this looks ok to me

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 16, 2024
@raulcd
Copy link
Member

raulcd commented Jan 16, 2024

This b59082a might have broken the crossbow submissions for committers that don't have their Apache membership set to Public (https://github.com/orgs/apache/people?query=pitrou)

@raulcd
Copy link
Member

raulcd commented Jan 16, 2024

@github-actions crossbow submit wheel-macos*

Copy link

Revision: 8978cef

Submitted crossbow builds: ursacomputing/crossbow @ actions-2f9bd68461

Task Status
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions

@pitrou
Copy link
Member Author

pitrou commented Jan 16, 2024

@github-actions crossbow submit wheel-macos*

Copy link

Revision: 8978cef

Submitted crossbow builds: ursacomputing/crossbow @ actions-b28f7d7852

Task Status
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions

@pitrou pitrou changed the title GH-39562: [C++][Parquet] Fix crash GH-39562: [C++][Parquet] Fix crash in test_parquet_dataset_lazy_filtering Jan 16, 2024
@pitrou pitrou marked this pull request as ready for review January 16, 2024 12:48
@pitrou pitrou requested a review from westonpace as a code owner January 16, 2024 12:48
Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@pitrou
Copy link
Member Author

pitrou commented Jan 16, 2024

@jorisvandenbossche This seems to have addressed the issue. Should we run the tests another time?

@pitrou
Copy link
Member Author

pitrou commented Jan 16, 2024

@github-actions crossbow submit -g wheel -g python

@jorisvandenbossche
Copy link
Member

Nice! Yes, running once more to be sure is best (it happened before there were no failures in a single run of all wheel builds)

Copy link

Revision: 8978cef

Submitted crossbow builds: ursacomputing/crossbow @ actions-8a0ca7497c

Task Status
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest GitHub Actions
test-conda-python-3.10-pandas-nightly GitHub Actions
test-conda-python-3.10-spark-v3.5.0 GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0 GitHub Actions
test-conda-python-3.8-spark-v3.5.0 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest GitHub Actions
test-cuda-python GitHub Actions
test-debian-11-python-3 Azure
test-fedora-38-python-3 Azure
test-ubuntu-20.04-python-3 Azure
test-ubuntu-22.04-python-3 GitHub Actions
wheel-macos-big-sur-cp310-arm64 GitHub Actions
wheel-macos-big-sur-cp311-arm64 GitHub Actions
wheel-macos-big-sur-cp312-arm64 GitHub Actions
wheel-macos-big-sur-cp38-arm64 GitHub Actions
wheel-macos-big-sur-cp39-arm64 GitHub Actions
wheel-macos-catalina-cp310-amd64 GitHub Actions
wheel-macos-catalina-cp311-amd64 GitHub Actions
wheel-macos-catalina-cp312-amd64 GitHub Actions
wheel-macos-catalina-cp38-amd64 GitHub Actions
wheel-macos-catalina-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

@pitrou pitrou added this to the 15.0.0 milestone Jan 16, 2024
@pitrou pitrou merged commit 4b3de81 into apache:main Jan 16, 2024
38 of 39 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Jan 16, 2024
@pitrou pitrou deleted the GH-39562-pq-metadata branch January 16, 2024 14:24
@pitrou
Copy link
Member Author

pitrou commented Jan 16, 2024

@raulcd I marked this for 15.0.0.

raulcd pushed a commit that referenced this pull request Jan 16, 2024
…ring (#39632)

### Rationale for this change

`ParquetFileFragment` stores a `SchemaManifest` that has a raw pointer to a `SchemaDescriptor`. The `SchemaDescriptor` is originally provided by a `FileMetadata` instance but, in some cases, the `FileMetadata` instance can be destroyed while the `ParquetFileFragment` is still in use. This can typically lead to bugs or crashes.

### What changes are included in this PR?

Ensure that `ParquetFileFragment` keeps an owning pointer to the `FileMetadata` instance that provides its `SchemaManifest`'s schema descriptor.

### Are these changes tested?

An assertion is added that would fail deterministically in the Python test suite.

### Are there any user-facing changes?

No.

* Closes: #39562

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Copy link

After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 4b3de81.

There was 1 benchmark result with an error:

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

idailylife pushed a commit to idailylife/arrow that referenced this pull request Jan 18, 2024
…_filtering (apache#39632)

### Rationale for this change

`ParquetFileFragment` stores a `SchemaManifest` that has a raw pointer to a `SchemaDescriptor`. The `SchemaDescriptor` is originally provided by a `FileMetadata` instance but, in some cases, the `FileMetadata` instance can be destroyed while the `ParquetFileFragment` is still in use. This can typically lead to bugs or crashes.

### What changes are included in this PR?

Ensure that `ParquetFileFragment` keeps an owning pointer to the `FileMetadata` instance that provides its `SchemaManifest`'s schema descriptor.

### Are these changes tested?

An assertion is added that would fail deterministically in the Python test suite.

### Are there any user-facing changes?

No.

* Closes: apache#39562

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
clayburn pushed a commit to clayburn/arrow that referenced this pull request Jan 23, 2024
…_filtering (apache#39632)

### Rationale for this change

`ParquetFileFragment` stores a `SchemaManifest` that has a raw pointer to a `SchemaDescriptor`. The `SchemaDescriptor` is originally provided by a `FileMetadata` instance but, in some cases, the `FileMetadata` instance can be destroyed while the `ParquetFileFragment` is still in use. This can typically lead to bugs or crashes.

### What changes are included in this PR?

Ensure that `ParquetFileFragment` keeps an owning pointer to the `FileMetadata` instance that provides its `SchemaManifest`'s schema descriptor.

### Are these changes tested?

An assertion is added that would fail deterministically in the Python test suite.

### Are there any user-facing changes?

No.

* Closes: apache#39562

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…_filtering (apache#39632)

### Rationale for this change

`ParquetFileFragment` stores a `SchemaManifest` that has a raw pointer to a `SchemaDescriptor`. The `SchemaDescriptor` is originally provided by a `FileMetadata` instance but, in some cases, the `FileMetadata` instance can be destroyed while the `ParquetFileFragment` is still in use. This can typically lead to bugs or crashes.

### What changes are included in this PR?

Ensure that `ParquetFileFragment` keeps an owning pointer to the `FileMetadata` instance that provides its `SchemaManifest`'s schema descriptor.

### Are these changes tested?

An assertion is added that would fail deterministically in the Python test suite.

### Are there any user-facing changes?

No.

* Closes: apache#39562

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…_filtering (apache#39632)

### Rationale for this change

`ParquetFileFragment` stores a `SchemaManifest` that has a raw pointer to a `SchemaDescriptor`. The `SchemaDescriptor` is originally provided by a `FileMetadata` instance but, in some cases, the `FileMetadata` instance can be destroyed while the `ParquetFileFragment` is still in use. This can typically lead to bugs or crashes.

### What changes are included in this PR?

Ensure that `ParquetFileFragment` keeps an owning pointer to the `FileMetadata` instance that provides its `SchemaManifest`'s schema descriptor.

### Are these changes tested?

An assertion is added that would fail deterministically in the Python test suite.

### Are there any user-facing changes?

No.

* Closes: apache#39562

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
…_filtering (apache#39632)

### Rationale for this change

`ParquetFileFragment` stores a `SchemaManifest` that has a raw pointer to a `SchemaDescriptor`. The `SchemaDescriptor` is originally provided by a `FileMetadata` instance but, in some cases, the `FileMetadata` instance can be destroyed while the `ParquetFileFragment` is still in use. This can typically lead to bugs or crashes.

### What changes are included in this PR?

Ensure that `ParquetFileFragment` keeps an owning pointer to the `FileMetadata` instance that provides its `SchemaManifest`'s schema descriptor.

### Are these changes tested?

An assertion is added that would fail deterministically in the Python test suite.

### Are there any user-facing changes?

No.

* Closes: apache#39562

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants