[SPARK-13528][SQL] Make the short names of compression codecs consistent in ParquetRelation #11408
Conversation
Test build #52101 has finished for PR 11408 at commit
Test build #52102 has finished for PR 11408 at commit
```scala
@@ -60,6 +60,51 @@ private[spark] object CallSite {
  val empty = CallSite("", "")
}

/** An utility class to map short compression codec names to qualified ones. */
private[spark] class ShortCompressionCodecNameMapper {
```
I agree with standardizing names, but this seems like a lot of over-engineering, with abstract classes and whatnot just to attempt to make some keys consistent. I think this makes it harder to understand, and would stick to standardizing the keys.
Yeah, the original idea was that we should have consistent short names for compression codecs everywhere in Spark. Is there a simpler way to do it?
+1, agree with Sean.
I also agree - this abstract class is too much. I think just having lz4/bzip2 etc in different places isn't that big of a deal.
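The simpler alternative the reviewers lean toward (standardized lowercase keys in a plain map rather than an abstract mapper class) might look like the sketch below. This is hypothetical, not Spark's actual code: the object name is an assumption, and the qualified class names follow Hadoop's compression codecs.

```scala
// Hypothetical sketch: a plain lowercase-keyed map instead of an
// abstract mapper class. "none" maps to no codec at all.
object ShortCodecNames {
  private val codecMap: Map[String, Option[String]] = Map(
    "none"    -> None,
    "bzip2"   -> Some("org.apache.hadoop.io.compress.BZip2Codec"),
    "deflate" -> Some("org.apache.hadoop.io.compress.DeflateCodec"),
    "gzip"    -> Some("org.apache.hadoop.io.compress.GzipCodec"),
    "lz4"     -> Some("org.apache.hadoop.io.compress.Lz4Codec"),
    "snappy"  -> Some("org.apache.hadoop.io.compress.SnappyCodec"))

  // Resolve a user-supplied name case-insensitively; unknown names fail
  // with the list of supported short names.
  def resolve(shortName: String): Option[String] =
    codecMap.getOrElse(shortName.toLowerCase,
      throw new IllegalArgumentException(
        s"Codec [$shortName] is not available. " +
        s"Known codecs are ${codecMap.keys.mkString(", ")}."))
}
```

Keeping the keys lowercase and resolving input case-insensitively gives the consistency the JIRA asks for without any class hierarchy.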
@rxin ping
Test build #52125 has finished for PR 11408 at commit
Test build #52230 has finished for PR 11408 at commit
```scala
val compressedFiles = tempFile.listFiles()
assert(compressedFiles.exists(_.getName.endsWith(".gz")))
verifyFrame(sqlContext.read.text(tempFile.getCanonicalPath))
Seq("bzip2", "deflate", "gzip").map { codecName =>
```
Nit: `foreach`, not `map`. It doesn't matter in practice though.
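To illustrate the nit: `map` over a side-effecting block builds a `Seq[Unit]` that is immediately thrown away, while `foreach` returns `Unit` and states the intent. A small self-contained sketch (the codec list mirrors the test above):

```scala
import scala.collection.mutable.ArrayBuffer

val codecs = Seq("bzip2", "deflate", "gzip")
val seen = ArrayBuffer.empty[String]

// Discouraged for side-effect-only loops: allocates and discards a Seq[Unit].
val discarded: Seq[Unit] = codecs.map { name => seen += name; () }

// Preferred: no intermediate collection, intent is clearly side effects.
seen.clear()
codecs.foreach { name => seen += name }
```

Both loops visit every codec name; only `foreach` avoids the useless result collection.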
Test build #52276 has finished for PR 11408 at commit
Test build #52297 has finished for PR 11408 at commit
@HyukjinKwon can you help review this one? It looks OK to me. Maybe you can do yours on top of this, which adds Python support and better error messages.
@rxin Sure. (I will do it by tomorrow.)
One thing I am not sure about here: might this be better to make them (I am adding that support here, #11464, for ORC/Parquet)?
One more thing: I am not sure if we then need Would you give me some feedback, please?
If the consistent short names imply only lower-case names, then I think we can just leave them as corrected here.
Yes, you're right.
Setting no compression codec explicitly is important for users to understand the behaviour, though; I'm not sure we need both names...
I think "none" is enough.
Thanks. Then
Sorry, I don't understand what you mean...
Let me list the possible compression options for each data source. For JSON, CSV and TEXT data sources, For Parquet, For ORC,
Why do we need to support uncompressed in Parquet? Is it for backward compatibility?
Yes; wouldn't users set 'uncompressed', since it was possible to set it via Spark configuration?
OK, sounds great. I'd have it as an undocumented way for backward compatibility.
deflate is pretty confusing. I'd just say zlib? -- actually never mind, let's just say deflate.
As I said, we might then need to consider handling the extensions for JSON, TEXT and CSV, which are Do you think we can just leave the extensions as they are?
We don't specify any extensions right now, do we?
I remember I saw some tests for compression codecs which check the extension when they are compressed. Then, let me correct them (or simply check them) and then create a new PR based on this.
Let's talk more in the new PR. I will try to deal with this myself first.
OK, I thought about this a little bit more -- I'd just have uncompressed as an undocumented option for all data sources. That way, it is very consistent. @HyukjinKwon should I merge this PR now?
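The convention settled on here ("none" documented, "uncompressed" kept as an undocumented alias for backward compatibility) could be normalized in one place. A hedged sketch; the function name is an assumption, not Spark's API:

```scala
// Hypothetical helper: treat "uncompressed" as an undocumented alias
// of "none", case-insensitively, for every data source.
def normalizeCodecName(name: String): String = name.toLowerCase match {
  case "none" | "uncompressed" => "none"
  case other                   => other
}
```

Normalizing before the codec lookup means each data source only ever sees one spelling for "no compression".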
Yes, please. Let me make a follow-up.
Thanks - merging this in master.
…ent in ParquetRelation

## What changes were proposed in this pull request?

This PR makes the short names of compression codecs in `ParquetRelation` consistent with the other ones. This PR comes from apache#11324.

## How was this patch tested?

Added more tests in `TextSuite`.

Author: Takeshi YAMAMURO <[email protected]>

Closes apache#11408 from maropu/SPARK-13528.