Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-45484][SQL][3.5] Deprecated the incorrect parquet compression codec lz4raw #43330

Closed
wants to merge 1 commit into from

Conversation

beliefer
Copy link
Contributor

What changes were proposed in this pull request?

According to the discussion at #43310 (comment), this PR want deprecates the incorrect parquet compression codec lz4raw at Spark 3.5.1 and adds a warning log.

The warning log prompts users that lz4raw will be removed it at Apache Spark 4.0.0.

Why are the changes needed?

Deprecated the incorrect parquet compression codec lz4raw.

Does this PR introduce any user-facing change?

'Yes'.
Users will see the waring log below.
Parquet compression codec 'lz4raw' is deprecated, please use 'lz4_raw'

How was this patch tested?

Exists test cases and new test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

@github-actions github-actions bot added the SQL label Oct 11, 2023
@srowen
Copy link
Member

srowen commented Oct 11, 2023

So this change is only needed in 3.5, and we already fixed it differently in 4.0?

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this activity is about the return value of Parquet code. IIUC, there is no Spark user-facing issue here. He just want to be consistent with Apache Parquet library. In Apache Spark 4.0.0, I suggested to delete the old one lz4raw which is added and 3.5.0 and deprecated by 3.5.1 (by this PR) with the migration guide. Since lz4raw is a relatively new, I believe the deletion is fine.

footer.getParquetMetadata.getBlocks(0).column(0).getCodec.name()

@dongjoon-hyun
Copy link
Member

cc @wangyum

@@ -94,18 +108,22 @@ class ParquetCompressionCodecPrecedenceSuite extends ParquetTest with SharedSpar
withTempDir { tmpDir =>
val tempTableName = "TempParquetTable"
withTable(tempTableName) {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. This looks like a mistake. Let's remove this empty line.

@beliefer
Copy link
Contributor Author

So this change is only needed in 3.5, and we already fixed it differently in 4.0?

Yes. This PR only used for 3.5.1. and #43310 used to fix it in 4.0.0

@beliefer
Copy link
Contributor Author

The GA failure is unrelated to this PR.
Linters, licenses, dependencies and documentation generation

/usr/lib/ruby/2.7.0/fileutils.rb:1415:in `copy_stream': No space left on device - copy_file_range (Errno::ENOSPC)
[12872](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12872)
	from /usr/lib/ruby/2.7.0/fileutils.rb:1415:in `block (2 levels) in copy_file'
[12873](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12873)
	from /usr/lib/ruby/2.7.0/fileutils.rb:1414:in `open'
[12874](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12874)
	from /usr/lib/ruby/2.7.0/fileutils.rb:1414:in `block in copy_file'
[12875](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12875)
	from /usr/lib/ruby/2.7.0/fileutils.rb:1413:in `open'
[12876](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12876)
	from /usr/lib/ruby/2.7.0/fileutils.rb:1413:in `copy_file'
[12877](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12877)
	from /usr/lib/ruby/2.7.0/fileutils.rb:1378:in `copy'
[12879](https://github.com/beliefer/spark/actions/runs/6492096910/job/17696356466#step:24:12879)
Error: No space left on device : '/home/runner/runners/2.309.0/_diag/pages/b7db6e67-b9c1-4011-a807-e832ba0cf437_fef911b0-771d-54c9-1f3c-e23e2c04b8fb_1.log'

@beliefer
Copy link
Contributor Author

cc @dongjoon-hyun @wangyum

@LuciferYang
Copy link
Contributor

Details

cc @zhengruifeng Could we possibly backport free_disk_space_container to branc-3.5?

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

@LuciferYang
Copy link
Contributor

Details

cc @zhengruifeng Could we possibly backport free_disk_space_container to branc-3.5?

backport SPARK-44619 to branch-3.5 to avoid No space left on device #43381

beliefer added a commit that referenced this pull request Oct 17, 2023
…codec lz4raw

### What changes were proposed in this pull request?
According to the discussion at #43310 (comment), this PR want deprecates the incorrect parquet compression codec `lz4raw` at Spark 3.5.1 and adds a warning log.

The warning log prompts users that `lz4raw` will be removed it at Apache Spark 4.0.0.

### Why are the changes needed?
Deprecated the incorrect parquet compression codec `lz4raw`.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Users will see the waring log below.
`Parquet compression codec 'lz4raw' is deprecated, please use 'lz4_raw'`

### How was this patch tested?
Exists test cases and new test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43330 from beliefer/SPARK-45484_3.5.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Jiaan Geng <[email protected]>
@beliefer
Copy link
Contributor Author

@srowen @dongjoon-hyun @LuciferYang Merged! Thank you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants