GH-40068: [C++] Possible data race when reading metadata of a parquet file #40111

Merged
merged 2 commits into from
Feb 26, 2024

westonpace
Member

@westonpace westonpace commented Feb 17, 2024

Rationale for this change

The ParquetFileFragment caches the parquet metadata when loading it. The metadata() method reads this metadata (a shared_ptr) but does not take the lock that is held when the shared_ptr is set. It is therefore possible for one thread to read the shared_ptr while another thread is setting it, which is technically (I think) undefined behavior.

What changes are included in this PR?

Guard access to the metadata by grabbing the mutex first.
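
As a rough illustration of the pattern described above (not the actual patch), the getter can take the same mutex as the setter and return a copy of the shared_ptr. The names `FakeMetadata`, `FragmentSketch`, `SetMetadata` and `PublishAndRead` are hypothetical and exist only for this sketch; Arrow's real `ParquetFileFragment` has a different interface.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for the cached Parquet metadata object.
struct FakeMetadata {
  int num_row_groups = 0;
};

// Hypothetical fragment-like class; NOT Arrow's actual ParquetFileFragment.
class FragmentSketch {
 public:
  // Writer: replaces the cached pointer while holding the mutex.
  void SetMetadata(std::shared_ptr<FakeMetadata> md) {
    std::lock_guard<std::mutex> lock(mutex_);
    metadata_ = std::move(md);
  }

  // Reader: takes the same mutex, then returns a copy of the shared_ptr.
  // Without the lock, this read could race with SetMetadata().
  std::shared_ptr<FakeMetadata> metadata() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return metadata_;
  }

 private:
  mutable std::mutex mutex_;  // mutable so the const getter can lock it
  std::shared_ptr<FakeMetadata> metadata_;
};

// Starts several reader threads and one writer thread, joins them all,
// and returns the row-group count visible afterwards.
int PublishAndRead(int num_row_groups, int num_readers) {
  FragmentSketch fragment;
  std::vector<std::thread> readers;
  for (int i = 0; i < num_readers; ++i) {
    readers.emplace_back([&fragment] {
      // A reader may see nullptr or the published object, never a torn value.
      std::shared_ptr<FakeMetadata> seen = fragment.metadata();
      if (seen) {
        (void)seen->num_row_groups;
      }
    });
  }
  std::thread writer([&fragment, num_row_groups] {
    auto md = std::make_shared<FakeMetadata>();
    md->num_row_groups = num_row_groups;
    fragment.SetMetadata(std::move(md));
  });
  writer.join();
  for (auto& t : readers) {
    t.join();
  }
  std::shared_ptr<FakeMetadata> final_md = fragment.metadata();
  assert(final_md != nullptr);
  return final_md->num_row_groups;
}
```

Returning the shared_ptr by value means the caller keeps its own reference after the lock is released, so a later call to the setter cannot invalidate what the caller holds.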

Are these changes tested?

Existing tests should catch any regression from this change.

Are there any user-facing changes?

No


⚠️ GitHub issue #40068 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Feb 23, 2024
@westonpace westonpace requested a review from pitrou February 23, 2024 19:14
@pitrou
Member

pitrou commented Feb 26, 2024

@github-actions crossbow submit -g cpp

@pitrou
Member

pitrou commented Feb 26, 2024

@raulcd This would be good for 15.0.1, if not too late.


Revision: a3fe9ec

Submitted crossbow builds: ursacomputing/crossbow @ actions-f5521fde34

Task Status
test-alpine-linux-cpp GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind Azure
test-cuda-cpp GitHub Actions
test-debian-11-cpp-amd64 GitHub Actions
test-debian-11-cpp-i386 GitHub Actions
test-fedora-39-cpp GitHub Actions
test-ubuntu-20.04-cpp GitHub Actions
test-ubuntu-20.04-cpp-bundled GitHub Actions
test-ubuntu-20.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-20.04-cpp-thread-sanitizer GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions

@pitrou pitrou merged commit a7ac7e0 into apache:main Feb 26, 2024
34 of 36 checks passed
@pitrou pitrou removed the awaiting changes label Feb 26, 2024

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a7ac7e0.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Feb 28, 2024
…arquet file (apache#40111)

### Rationale for this change

The `ParquetFileFragment` caches the parquet metadata when loading it. The `metadata()` method reads this metadata (a shared_ptr) but does not take the lock that is held when the shared_ptr is set. It is therefore possible for one thread to read the shared_ptr while another thread is setting it, which is technically (I think) undefined behavior.

### What changes are included in this PR?

Guard access to the metadata by grabbing the mutex first.

### Are these changes tested?

Existing tests should catch any regression from this change.

### Are there any user-facing changes?

No
* Closes: apache#40068

Authored-by: Weston Pace <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
thisisnic pushed a commit to thisisnic/arrow that referenced this pull request Mar 8, 2024
raulcd pushed a commit that referenced this pull request Mar 13, 2024