Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Go] DeltaBitPack deltaBitWidths overflow #38399

Closed
Hattonuri opened this issue Oct 23, 2023 · 7 comments · Fixed by #38413
Closed

[Go] DeltaBitPack deltaBitWidths overflow #38399

Hattonuri opened this issue Oct 23, 2023 · 7 comments · Fixed by #38413
Assignees
Milestone

Comments

@Hattonuri
Copy link
Contributor

Describe the bug, including details regarding any error messages, version, and platform.

I wrote some int64 using c++ ::parquet::Encoding::DELTA_BINARY_PACKED
and then tried to read this with go but got:

panic: runtime error: index out of range [4] with length 4

goroutine 199 [running]:
github.com/apache/arrow/go/v14/parquet/internal/encoding.(*DeltaBitPackInt64Decoder).unpackNextMini(0xc1aee302a0)
        /home/dstasenko/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/internal/encoding/delta_bit_packing.go:230 +0x2da
github.com/apache/arrow/go/v14/parquet/internal/encoding.(*DeltaBitPackInt64Decoder).Decode(0xc1aee302a0, {0xc1cc73b008?, 0xa7d50d?, 0xc1adb08360?})
        /home/dstasenko/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/internal/encoding/delta_bit_packing.go:267 +0x118
github.com/apache/arrow/go/v14/parquet/file.(*Int64ColumnChunkReader).ReadBatch.func1(0x15a201, 0x167d01)
        /home/dstasenko/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_reader_types.gen.go:93 +0xa3
github.com/apache/arrow/go/v14/parquet/file.(*columnChunkReader).readBatch(0xc1adb08360, 0x3c9c00, {0xc1d3f26000, 0x3c9c00, 0x3c9c00}, {0x0, 0x0, 0x0}, 0xc1aeb73c80)
        /home/dstasenko/go/pkg/mod/github.com/apache/arrow/go/[email protected]/parquet/file/column_reader.go:508 +0xc2
github.com/apache/arrow/go/v14/parquet/file.(*Int64ColumnChunkReader).ReadBatch(0xc1aee2e3d0?, 0xc1aee36960?, {0xc1cbc6a000?, 0x2b21d60?, 0xc0001a9d58?}, {0xc1d3f26000?, 0x1a49d55?, 0xa?}, {0x0, 0x0, ...})

I assume that this line lacks boundaries check

d.deltaBitWidth = d.deltaBitWidths.Bytes()[int(d.miniBlockIdx)]

Component(s)

Go

@mapleFU
Copy link
Member

mapleFU commented Oct 23, 2023

Would you mind provide the test data which might trigger the bug here?

@Hattonuri
Copy link
Contributor Author

storage_000000000.parquet.txt
added .txt ending because github does not allow to upload parquet files

@Hattonuri
Copy link
Contributor Author

}
if d.currentMiniBlockVals == 0 {

probably you forgot "if err != nil return"

@mapleFU
Copy link
Member

mapleFU commented Oct 23, 2023

I've found if err != nil return, but it's still propogate an error

C++ is able to read file like this, let me investigate it.

@mapleFU
Copy link
Member

mapleFU commented Oct 23, 2023

I've find out the reason. Let me submit a fixing

@mapleFU
Copy link
Member

mapleFU commented Oct 23, 2023

@Hattonuri I've submit a fixing here: #38413

You can have a try at the new code.

@Hattonuri
Copy link
Contributor Author

Now it works! Thank you

zeroshade pushed a commit that referenced this issue Oct 24, 2023
…tData (#38413)

### Rationale for this change

As #38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
`d.usedFirst` haven't been reset, which cause the bug.

### What changes are included in this PR?

1. Some style enhancement
2. Bug fix

### Are these changes tested?

Currently not

### Are there any user-facing changes?

bugfix

* Closes: #38399

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
@zeroshade zeroshade added this to the 15.0.0 milestone Oct 24, 2023
JerAguilon pushed a commit to JerAguilon/arrow that referenced this issue Oct 25, 2023
…ter SetData (apache#38413)

### Rationale for this change

As apache#38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
`d.usedFirst` haven't been reset, which cause the bug.

### What changes are included in this PR?

1. Some style enhancement
2. Bug fix

### Are these changes tested?

Currently not

### Are there any user-facing changes?

bugfix

* Closes: apache#38399

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this issue Nov 13, 2023
…ter SetData (apache#38413)

### Rationale for this change

As apache#38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
`d.usedFirst` haven't been reset, which cause the bug.

### What changes are included in this PR?

1. Some style enhancement
2. Bug fix

### Are these changes tested?

Currently not

### Are there any user-facing changes?

bugfix

* Closes: apache#38399

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…ter SetData (apache#38413)

### Rationale for this change

As apache#38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
`d.usedFirst` haven't been reset, which cause the bug.

### What changes are included in this PR?

1. Some style enhancement
2. Bug fix

### Are these changes tested?

Currently not

### Are there any user-facing changes?

bugfix

* Closes: apache#38399

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
kou pushed a commit to apache/arrow-go that referenced this issue Aug 30, 2024
…tData (#38413)

### Rationale for this change

As apache/arrow#38399 says. DeltaBitPack will corrupt when we meet a column chunk
with more than one page. During first page decoding, it works well. But when the second page comes, the
`d.usedFirst` haven't been reset, which cause the bug.

### What changes are included in this PR?

1. Some style enhancement
2. Bug fix

### Are these changes tested?

Currently not

### Are there any user-facing changes?

bugfix

* Closes: #38399

Authored-by: mwish <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants