Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-38399: [Go][Parquet] DeltaBitPack decoder reset usedFirst after SetData #38413

Merged
merged 3 commits into from
Oct 24, 2023

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Oct 23, 2023

Rationale for this change

As #38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
d.usedFirst haven't been reset, which cause the bug.

What changes are included in this PR?

  1. Some style enhancement
  2. Bug fix

Are these changes tested?

Currently not

Are there any user-facing changes?

bugfix

@mapleFU mapleFU requested a review from zeroshade as a code owner October 23, 2023 16:48
@github-actions
Copy link

⚠️ GitHub issue #38399 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member Author

mapleFU commented Oct 23, 2023

@zeroshade I've investigate and submit a fixing. It's a bit late in UTC-8. I'll add tests tomorrow.

return xerrors.New("parquet: eof exception")
}
if d.miniBlocksPerBlock == 0 {
return xerrors.New("parquet: cannot have zero miniblock per block")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xerrors is no longer necessary as it has been absorbed into the stdlib, so for new code please just use errors instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I replace the xerrors in this file? Since mixing xerror and standard lib is so weird :-(

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Oct 23, 2023
if err != nil {
return 0, err
}
d.currentMiniBlockVals = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this redundant due to the above if statement?

Copy link
Member Author

@mapleFU mapleFU Oct 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. Previously I though it might raise from data on last block( already-read + d.currentBlockVals would greater than max). However finally I found that here it will limit by out and max. So this line can be removed.

@zeroshade
Copy link
Member

This looks great to me @mapleFU thanks! I'll review again once you add the tests

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Oct 24, 2023
@mapleFU mapleFU requested a review from zeroshade October 24, 2023 02:41
@mapleFU
Copy link
Member Author

mapleFU commented Oct 24, 2023

I've finished testing and resolve the comments @zeroshade . I'm not familiar with go enough, so don't know why CI failed...

@zeroshade
Copy link
Member

@mapleFU The failing appveyor test doesn't appear related at all so you're fine there.

Are either of the tests you added able to reproduce the initial failure that was reported in the issue (without your fix)? If not, can you add a test that does reproduce the initial failure?

@mapleFU
Copy link
Member Author

mapleFU commented Oct 24, 2023

		// Using same Decoder to decode the data.
		dec := encoding.NewDecoder(parquet.Types.Int32, parquet.Encodings.DeltaBinaryPacked, column, memory.DefaultAllocator)
		for i := 0; i < 5; i += 1 {
			dec.(encoding.Int32Decoder).SetData(len(values), buf.Bytes())

			valueBuf := make([]int32, 100)
			for i, j := 0, len(valueBuf); j <= len(values); i, j = i+len(valueBuf), j+len(valueBuf) {
				dec.(encoding.Int32Decoder).Decode(valueBuf)
				assert.Equalf(t, values[i:j], valueBuf, "indexes %d:%d", i, j)
			}
		}

I've add the same test for int32 and int64, please refer to "test decoding multiple pages". @zeroshade

In the case, dec is created once but call SetData multiple times. Before this fixing, SetData will not reset usedFirst, which will cause the bug.

@zeroshade zeroshade merged commit 7b1281a into apache:main Oct 24, 2023
24 checks passed
@zeroshade zeroshade removed the awaiting change review Awaiting change review label Oct 24, 2023
@github-actions github-actions bot added the awaiting merge Awaiting merge label Oct 24, 2023
@conbench-apache-arrow
Copy link

After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 7b1281a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Oct 25, 2023
…ter SetData (apache#38413)

### Rationale for this change

As apache#38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
`d.usedFirst` haven't been reset, which cause the bug.

### What changes are included in this PR?

1. Some style enhancement
2. Bug fix

### Are these changes tested?

Currently not

### Are there any user-facing changes?

bugfix

* Closes: apache#38399

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
@mapleFU mapleFU deleted the parquet-go/fix-delta-bit-pack-init branch October 30, 2023 09:25
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…ter SetData (apache#38413)

### Rationale for this change

As apache#38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
`d.usedFirst` haven't been reset, which cause the bug.

### What changes are included in this PR?

1. Some style enhancement
2. Bug fix

### Are these changes tested?

Currently not

### Are there any user-facing changes?

bugfix

* Closes: apache#38399

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…ter SetData (apache#38413)

### Rationale for this change

As apache#38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
`d.usedFirst` haven't been reset, which cause the bug.

### What changes are included in this PR?

1. Some style enhancement
2. Bug fix

### Are these changes tested?

Currently not

### Are there any user-facing changes?

bugfix

* Closes: apache#38399

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Go] DeltaBitPack deltaBitWidths overflow
2 participants