[SPARK-8093] [SQL] Remove empty structs inferred from JSON documents #6799
Conversation
ok to test
@NathanHowell Thank you for working on it! I am wondering if we can keep the new behavior and introduce a flag to let users switch back to the old behavior? Here are my thoughts on it. Our old behavior ignores the existence of empty JSON objects, so after we get the DataFrame, we actually lose a small piece of information about the dataset. Also, because we have already released 1.4, this change will go into 1.5; if we just revert to our old behavior, we will effectively change the behavior again. I feel it may be better to keep the new behavior (so we do not discard information) and introduce a flag to let users switch back to Spark 1.3's behavior. What do you think?
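To illustrate the difference being discussed, here is a minimal, hypothetical sketch (not Spark's actual implementation, which operates on Catalyst `StructType`s) of the Spark 1.3 behavior: recursively pruning empty structs out of a schema inferred from JSON, where a schema is modeled as a nested dict.

```python
# Hypothetical schema representation for illustration only:
#   {"type": "struct", "fields": {name: child_schema, ...}}
#   {"type": "long"}, {"type": "string"}, etc. for leaf types.

def prune_empty_structs(schema):
    """Return the schema with empty structs removed, or None if the whole
    schema collapses to an empty struct (sketch of the 1.3 behavior)."""
    if schema["type"] != "struct":
        return schema
    pruned = {}
    for name, child in schema["fields"].items():
        kept = prune_empty_structs(child)
        if kept is not None:  # drop children that collapsed to nothing
            pruned[name] = kept
    if not pruned:
        return None  # an empty struct disappears entirely
    return {"type": "struct", "fields": pruned}
```

For a document like `{"a": {}, "b": 1}`, the inferred field `a` (an empty struct) is dropped and only `b` survives, which is the small loss of information mentioned above.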
Test build #34855 has finished for PR 6799 at commit
@yhuai I agree. I think a better default approach might be to fail in the Parquet writer (instead of writing a file it cannot read)... and add a flag to enable this patch.
@NathanHowell Yeah, sounds good. In the error message, we can ask the user to drop that column. @liancheng Where would be a good place to add this check?
On second thought, I feel it is better to just drop those empty structs and their corresponding values when we write data to Parquet, and log a warning message. @NathanHowell How about we split the work? Can you add the flag we talked about in this PR (the flag that lets us fall back to the 1.3 schema-inference behavior)? I can create another PR to add a projection that removes those empty structs in the Parquet write path.
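As a rough sketch of the write-path idea described here (a hypothetical illustration, not Spark's Parquet writer): project away any column whose type collapses to an empty struct, logging a warning for each dropped column. The schema representation is the same nested-dict model assumed above.

```python
import logging

log = logging.getLogger("parquet-writer-sketch")

def is_empty_struct(schema):
    """True if the schema is a struct with no fields, or whose fields are
    all themselves (recursively) empty structs."""
    if schema["type"] != "struct":
        return False
    return all(is_empty_struct(c) for c in schema["fields"].values())

def project_writable_columns(columns):
    """Drop columns Parquet could not round-trip, warning for each.
    `columns` is a list of (name, schema) pairs."""
    kept = []
    for name, schema in columns:
        if is_empty_struct(schema):
            log.warning("Dropping column %r: empty structs cannot be "
                        "written to Parquet", name)
        else:
            kept.append((name, schema))
    return kept
```

Note that a struct containing only empty structs is treated as empty too, so the pruning matches the recursive inference-side behavior.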
Sounds good to me.
@NathanHowell Actually, do you think we should just fix it on the Parquet side instead of introducing the flag? Since it is Parquet's issue, maybe it is not worth adding a flag (also, this flag may not be used in most cases).
I feel we can just fix the Parquet part and do not need to touch the JSON-related code.
I'm fine with that too.
Found it is hard to drop those columns in Parquet's write path... Let's check this one in to make JSON have the same behavior as 1.3. I will merge it to both master and branch-1.4 once it passes the tests.
LGTM pending Jenkins tests.
Test build #938 timed out for PR 6799 at commit
test this please |
Test build #940 has finished for PR 6799 at commit
Test build #35324 has finished for PR 6799 at commit
I am merging it to master and branch-1.4.
Author: Nathan Howell <[email protected]>

Closes #6799 from NathanHowell/spark-8093 and squashes the following commits:

76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents

(cherry picked from commit 9814b97)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala
Merged. I manually fixed a small conflict for the 1.4 branch. Thanks @NathanHowell!