Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] run_end_encode segfaults on arrays with an all-set validity buffer #36708

Closed
coady opened this issue Jul 16, 2023 · 8 comments · Fixed by #36740
Closed

[C++] run_end_encode segfaults on arrays with an all-set validity buffer #36708

coady opened this issue Jul 16, 2023 · 8 comments · Fixed by #36740
Assignees
Labels
Component: C++ Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Milestone

Comments

@coady
Copy link

coady commented Jul 16, 2023

Describe the bug, including details regarding any error messages, version, and platform.

import pyarrow as pa, pyarrow.compute as pc, pyarrow.dataset as ds

ds.write_dataset(pa.table({'': pa.repeat(1, 10)}), 'temp', format='parquet')
(column,) = ds.dataset('temp').to_table()

assert pc.run_end_encode(column.combine_chunks())
pc.run_end_encode(column)  # <- segfaults

Seems to only reproduce when the array is read. Tested on the latest nightly build: 13.0.0.dev516.

Component(s)

C++, Python

@mapleFU
Copy link
Member

mapleFU commented Jul 16, 2023

cc @felipecrv

@felipecrv felipecrv self-assigned this Jul 17, 2023
@felipecrv
Copy link
Contributor

Chunked arrays are handled by the kernel and even part of the unit tests, but the repro code above does indeed trigger the SIGSEGV.

I isolated the cause of the SIGSEGV to some expectations in the allocation code not being preserved.

Result<std::shared_ptr<ArrayData>> PreallocateValuesArray(
    const std::shared_ptr<DataType>& value_type, bool has_validity_buffer, int64_t length,
    int64_t null_count, MemoryPool* pool, int64_t data_buffer_size) {
  std::vector<std::shared_ptr<Buffer>> values_data_buffers;
  std::shared_ptr<Buffer> validity_buffer = NULLPTR;
  if (has_validity_buffer) {
    ARROW_ASSIGN_OR_RAISE(validity_buffer, AllocateEmptyBitmap(length, pool));
    DCHECK(validity_buffer);
  }
  ARROW_ASSIGN_OR_RAISE(auto values_buffer, AllocateValuesBuffer(length, *value_type,
                                                                 pool, data_buffer_size));
  if (is_base_binary_like(value_type->id())) {
    const int offset_byte_width = offset_bit_width(value_type->id()) / 8;
    ARROW_ASSIGN_OR_RAISE(auto offsets_buffer,
                          AllocateBuffer((length + 1) * offset_byte_width, pool));
    // Ensure the first offset is zero
    memset(offsets_buffer->mutable_data(), 0, offset_byte_width);
    offsets_buffer->ZeroPadding();
    values_data_buffers = {validity_buffer, std::move(offsets_buffer),
                           std::move(values_buffer)};
  } else {
    values_data_buffers = {validity_buffer, std::move(values_buffer)};
  }
  auto data = ArrayData::Make(value_type, length, values_data_buffers, null_count);
  DCHECK(!has_validity_buffer || validity_buffer.use_count() == 2);
  DCHECK(!has_validity_buffer || validity_buffer != NULLPTR);
  DCHECK(!has_validity_buffer || data->buffers[0] != NULLPTR);
  return data;
}

DCHECK(!has_validity_buffer || data->buffers[0] != NULLPTR); fails after all the previous checks pass and I still don't understand why.

@felipecrv
Copy link
Contributor

Found the reason: AdjustNonNullable zeroes the validity buffer if null_count is passed as 0.

static inline void AdjustNonNullable(Type::type type_id, int64_t length,
                                     std::vector<std::shared_ptr<Buffer>>* buffers,
                                     int64_t* null_count) {
  if (type_id == Type::NA) {
    *null_count = length;
    (*buffers)[0] = nullptr;
  } else if (internal::HasValidityBitmap(type_id)) {
    if (*null_count == 0) {
      // In case there are no nulls, don't keep an allocated null bitmap around
      (*buffers)[0] = nullptr;
    } else if (*null_count == kUnknownNullCount && buffers->at(0) == nullptr) {
      // Conversely, if no null bitmap is provided, set the null count to 0
      *null_count = 0;
    }
  } else {
    *null_count = 0;
  }
}

@felipecrv
Copy link
Contributor

@coady the issue is triggered by giving an input array without nulls that has a non-null validity bitmap buffer. The PR I've sent fixes that.

@mapleFU it might be nice to investigate if the reading of Parquet files is unnecessarily producing validity buffers when null_count is zero. Setting those buffers to NULLPTR when null_count is known to be zero can avoid this kind of confusion and save some cost of calculating null_count by scanning all the bits down the road.

@felipecrv
Copy link
Contributor

But to be clear: validity bitmap set when null_count is zero is totally valid. Sometimes it's hard to know the null_count is zero upfront (e.g. slicing an array producing a range that has no nulls).

@mapleFU
Copy link
Member

mapleFU commented Jul 18, 2023

I guess not:

std::shared_ptr<Array> TransferZeroCopy(RecordReader* reader,
                                        const std::shared_ptr<Field>& field) {
  std::shared_ptr<::arrow::ArrayData> data;
  if (field->nullable()) {
    std::vector<std::shared_ptr<Buffer>> buffers = {reader->ReleaseIsValid(),
                                                    reader->ReleaseValues()};
    data = std::make_shared<::arrow::ArrayData>(field->type(), reader->values_written(),
                                                std::move(buffers), reader->null_count());
  } else {
    std::vector<std::shared_ptr<Buffer>> buffers = {nullptr, reader->ReleaseValues()};
    data = std::make_shared<::arrow::ArrayData>(field->type(), reader->values_written(),
                                                std::move(buffers), /*null_count=*/0);
  }
  return ::arrow::MakeArray(data);
}

parquet/arrow has considering these cases.

felipecrv added a commit to felipecrv/arrow that referenced this issue Jul 19, 2023
@westonpace westonpace changed the title run_end_encode segfaults on chunked arrays. [C++] run_end_encode segfaults on chunked arrays. Jul 19, 2023
@felipecrv
Copy link
Contributor

@westonpace for reference, the title of the issue should probably be: "run_end_encode segfaults on arrays with an all-set validity buffer"

@kou kou changed the title [C++] run_end_encode segfaults on chunked arrays. [C++] run_end_encode segfaults on arrays with an all-set validity buffer Jul 20, 2023
@kou
Copy link
Member

kou commented Jul 20, 2023

Updated the title.

felipecrv added a commit to felipecrv/arrow that referenced this issue Jul 21, 2023
kou pushed a commit that referenced this issue Jul 22, 2023
…ke sense (#36740)

### Rationale for this change

When `has_validity_buffer` is true, we expect validity buffers to be allocated, but if null_count is calculated and ends up being 0, `ArrayData::Make()` will sneakily remove the validity buffer from the physical array for us and the assumption that it exists stops holding and causes a crash.

Forcing `null_count` calculation with `input.GetNullCount()` ensures `has_validity_buffer` won't be `true` if the `null_count` on the input ends up being 0.

### What changes are included in this PR?

The fix and tests to reproduce it.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: #36708

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@kou kou added this to the 14.0.0 milestone Jul 22, 2023
R-JunmingChen pushed a commit to R-JunmingChen/arrow that referenced this issue Aug 20, 2023
…ons make sense (apache#36740)

### Rationale for this change

When `has_validity_buffer` is true, we expect validity buffers to be allocated, but if null_count is calculated and ends up being 0, `ArrayData::Make()` will sneakily remove the validity buffer from the physical array for us and the assumption that it exists stops holding and causes a crash.

Forcing `null_count` calculation with `input.GetNullCount()` ensures `has_validity_buffer` won't be `true` if the `null_count` on the input ends up being 0.

### What changes are included in this PR?

The fix and tests to reproduce it.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#36708

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…ons make sense (apache#36740)

### Rationale for this change

When `has_validity_buffer` is true, we expect validity buffers to be allocated, but if null_count is calculated and ends up being 0, `ArrayData::Make()` will sneakily remove the validity buffer from the physical array for us and the assumption that it exists stops holding and causes a crash.

Forcing `null_count` calculation with `input.GetNullCount()` ensures `has_validity_buffer` won't be `true` if the `null_count` on the input ends up being 0.

### What changes are included in this PR?

The fix and tests to reproduce it.

### Are these changes tested?

Yes.

### Are there any user-facing changes?

No.
* Closes: apache#36708

Authored-by: Felipe Oliveira Carvalho <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@amoeba amoeba added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: C++ Component: Python Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. Type: bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants