Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] FileFormat::GetReaderAsync hangs when ARROW_IO_THREADS is set to 1 #37487

Closed
yiteng-guo opened this issue Aug 31, 2023 · 5 comments · Fixed by #37514
Closed

[C++] FileFormat::GetReaderAsync hangs when ARROW_IO_THREADS is set to 1 #37487

yiteng-guo opened this issue Aug 31, 2023 · 5 comments · Fixed by #37514

Comments

@yiteng-guo
Copy link

Describe the bug, including details regarding any error messages, version, and platform.

# test.py
import pyarrow.dataset as pads

ds = pads.dataset("/tmp/test.parq", format="parquet") # can be any parquet
ds.count_rows(use_threads=False)

ARROW_IO_THREADS=1 python test.py hangs forever. I traced this and it's stuck in GetReader.

arrow::FutureImpl::Wait
> arrow::dataset::ParquetFileFormat::GetReader
> arrow::dataset::ParquetFileFragment::EnsureCompleteMetadata

It works normally if I use any ARROW_IO_THREADS larger than 1. I reproduced this on mac with version 13.0.0 and linux with version 12.0.0.

Component(s)

Python

@yiteng-guo
Copy link
Author

It's worth noting that this only happens on count_rows but not on to_table and other pyarrow functions.

@mapleFU
Copy link
Member

mapleFU commented Aug 31, 2023

Hmm let me take a look. Perhaps something is already use ioExecutor and re-allocating sub-task in ioExecutor, which might causing deadlock and hang forever

@mapleFU
Copy link
Member

mapleFU commented Aug 31, 2023

I've found out the reason. But I'm too tired today, I'd like to fix it tomorrow. Or you can try to fix this and at me for review.

Change BufferReader::ReadAsync, and set ARROW_IO_THREADS == 1 would reproduce the problem in dataset C++ unittest.

Future<std::shared_ptr<Buffer>> BufferReader::ReadAsync(const IOContext& ctx,
                                                        int64_t position,
                                                        int64_t nbytes) {
  return DeferNotOk(ctx.executor()->Submit([this, position, nbytes]() {
    return DoReadAt(position, nbytes);
  }));
}

(The origin impl uses a blocking read so test cannot find problem here, it's just a hacking to reproduce the problem)

The IO Thread would wait for it:

Future<std::optional<int64_t>> ParquetFileFormat::CountRows(
    const std::shared_ptr<FileFragment>& file, compute::Expression predicate,
    const std::shared_ptr<ScanOptions>& options) {
  auto parquet_file = checked_pointer_cast<ParquetFileFragment>(file);
  if (parquet_file->metadata()) {
    ARROW_ASSIGN_OR_RAISE(auto maybe_count,
                          parquet_file->TryCountRows(std::move(predicate)));
    return Future<std::optional<int64_t>>::MakeFinished(maybe_count);
  } else {
    return DeferNotOk(options->io_context.executor()->Submit(
        [parquet_file, predicate]() -> Result<std::optional<int64_t>> {
          RETURN_NOT_OK(parquet_file->EnsureCompleteMetadata()); // <- here, it submit a task, and will wait for it in parquet_file->EnsureCompleteMetadata, causing deadlock
          return parquet_file->TryCountRows(predicate);
        }));
  }
}

@mapleFU
Copy link
Member

mapleFU commented Aug 31, 2023

Also cc @westonpace @pitrou , I'd like to implement a blocking read in EnsureCompleteMetadata or async thread, would this be ok? Or do we have better solutions here? (Or IO_THREADS = 1 just a bad idea?)

@mapleFU
Copy link
Member

mapleFU commented Sep 1, 2023

I've implement a draft fixing here: #37514

I'm not sure it can solve all deadlock problem, and I'm a bit tried these days so test might be added later, you can try it yourself.

bkietz pushed a commit that referenced this issue Sep 19, 2023
…GetReader` (#37514)

### Rationale for this change

As #37487 says. When thread cnt == 1, the thread might blocking in `ParquetFileFormat::GetReaderAsync`, that's because:

1. `ParquetFileFormat::CountRows` would call `EnsureCompleteMetadata` in `io_executor`
2. `EnsureCompleteMetadata` call `ParquetFileFormat::GetReader`, which dispatch real request to async mode
3. `async` is executed in `io_executor`.

1/3 in same fix-sized executor, causing deadlock.

### What changes are included in this PR?

Implement sync `ParquetFileFormat::GetReader`.

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Bugfix

* Closes: #37487

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
@bkietz bkietz added this to the 14.0.0 milestone Sep 19, 2023
@AlenkaF AlenkaF changed the title [Python] FileFormat::GetReaderAsync hangs when ARROW_IO_THREADS is set to 1 [C++] FileFormat::GetReaderAsync hangs when ARROW_IO_THREADS is set to 1 Oct 13, 2023
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…rmat::GetReader` (apache#37514)

### Rationale for this change

As apache#37487 says. When thread cnt == 1, the thread might blocking in `ParquetFileFormat::GetReaderAsync`, that's because:

1. `ParquetFileFormat::CountRows` would call `EnsureCompleteMetadata` in `io_executor`
2. `EnsureCompleteMetadata` call `ParquetFileFormat::GetReader`, which dispatch real request to async mode
3. `async` is executed in `io_executor`.

1/3 in same fix-sized executor, causing deadlock.

### What changes are included in this PR?

Implement sync `ParquetFileFormat::GetReader`.

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Bugfix

* Closes: apache#37487

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…rmat::GetReader` (apache#37514)

### Rationale for this change

As apache#37487 says. When thread cnt == 1, the thread might blocking in `ParquetFileFormat::GetReaderAsync`, that's because:

1. `ParquetFileFormat::CountRows` would call `EnsureCompleteMetadata` in `io_executor`
2. `EnsureCompleteMetadata` call `ParquetFileFormat::GetReader`, which dispatch real request to async mode
3. `async` is executed in `io_executor`.

1/3 in same fix-sized executor, causing deadlock.

### What changes are included in this PR?

Implement sync `ParquetFileFormat::GetReader`.

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Bugfix

* Closes: apache#37487

Authored-by: mwish <[email protected]>
Signed-off-by: Benjamin Kietzman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants