Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce code duplication in VectorizedParquetDefinitionLevelReader #11661

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

wypoon
Copy link
Contributor

@wypoon wypoon commented Nov 27, 2024

In VectorizedParquetDefinitionLevelReader, there are a number of nested reader classes; these extend one of two nested base classes, NumericBaseReader and BaseReader. Each of these base classes define nextBatch and nextDictEncodedBatch methods. These 4 methods (NumericBaseReader::nextBatch, NumericBaseReader::nextDictEncodedBatch, BaseReader::nextBatch, BaseReader::nextDictEncodedBatch) are actually structurally similar:

      int idx = startOffset;
      int left = numValsToRead;
      while (left > 0) {
        if (currentCount == 0) {
          readNextGroup();
        }
        int numValues = Math.min(left, currentCount);
        ...
        switch (mode) {
          case RLE:
            ...
          case PACKED:
            ...
        }
        left -= numValues;
        currentCount -= numValues;
      }

We can separate out the differences into what gets called in the RLE and PACKED cases for each of the two base classes, and even there, for dict encoded batches, there is no difference in logic between NumericBaseReader and BaseReader. Thus there is quite a bit of duplication that can be reduced.

This PR simply refactors the common logic into a superclass (CommonBaseReader) of NumericBaseReader and BaseReader, in a nextCommonBatch method that nextBatch and nextDictEncodedBatch delegate to.

No tests are added as there is no functionality change. The methods are exercised by existing tests.

Additional notes:
NumericBaseReader and BaseReader both define abstract nextDictEncodedVal methods with the same set of parameters but in different order! In this refactor, we define a common nextDictEncodedVal for them in CommonBaseReader.
NumericBaseReader and BaseReader each define abstract nextVal methods as well, but with different sets of parameters, and these we have left alone.

@github-actions github-actions bot added the arrow label Nov 27, 2024
int idx,
int numValues,
ArrowBuf validityBuffer) {
int bufferIdx = idx;
Copy link
Contributor Author

@wypoon wypoon Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incremented in the method body and incrementing a parameter (idx) is a checkstyle violation. Hence this variable.
The same applies to 3 other methods below.

@wypoon
Copy link
Contributor Author

wypoon commented Nov 27, 2024

@nastra can you please review this?
My impetus for the refactor is that I want to implement Parquet page-skipping, and with the refactor, a core change in logic can be applied in one method instead of 4 similarly structured methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant