Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Improve performance of sequential access of ChunkResolver #44641

Open
anjakefala opened this issue Nov 5, 2024 · 1 comment
Open

[C++] Improve performance of sequential access of ChunkResolver #44641

anjakefala opened this issue Nov 5, 2024 · 1 comment

Comments

@anjakefala
Copy link
Collaborator

Describe the enhancement requested

There has been recent work to move the ChunkResolver to public API.

ChunkResolver uses O(log(num_chunks)) binary search to identify chunks, which is optimised for random access. For sequential row-by-row access, using ChunkResolver would be inefficient.

Sometimes a user needs to be able to do row-major processing of the data. To that note, the proposal is to add these helper methods to the ChunKResolver API for more efficient sequential access traversal.

These helper methods were written by @felipecrv:

  /// \pre loc.chunk_index >= 0
  /// \pre loc.index_in_chunk is assumed valid if chunk_index is not the last one
  inline bool Valid(ChunkLocation loc) const {
    const int64_t last_chunk_index = static_cast<int64_t>(offsets_.size()) - 1;
    return loc.chunk_index + 1 < last_chunk_index ||
           (loc.chunk_index + 1 == last_chunk_index &&
            loc.index_in_chunk < offsets_[last_chunk_index]);
  }

  /// \pre Valid(loc)
  inline ChunkLocation Next(ChunkLocation loc) const {
    const int64_t next_index_in_chunk = loc.index_in_chunk + 1;
    return (next_index_in_chunk < offsets_[loc.chunk_index + 1])
               ? ChunkLocation{loc.chunk_index, next_index_in_chunk}
               : ChunkLocation{loc.chunk_index + 1, 0};
  }

with the resulting loops:

ChunkResolver resolver(batches);
for (ChunkLocation loc; resolver.Valid(loc); loc = resolved.Next(loc)) {
  // re-use loc for all the typed columns since they are split on the same offsets
}

Component(s)

C++

@anjakefala anjakefala changed the title [C++] Improve sequential access use of ChunkResolver [C++] Improve performance of sequential access of ChunkResolver Nov 5, 2024
@felipecrv
Copy link
Contributor

+1 I think these would be nice additions to the API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants