[SPARK-8093] [SQL] Remove empty structs inferred from JSON documents #6799
Conversation
ok to test
@NathanHowell Thank you for working on it! I am wondering if we can keep the new behavior and introduce a flag to let users switch back to the old behavior? Here are my thoughts on it. Our old behavior ignores the existence of empty JSON objects, so after we get the DataFrame, we actually lose a small piece of information about the dataset. Also, because we have already released 1.4, this change will go into 1.5; if we just revert to our old behavior, we will effectively change the behavior again. I feel it may be better to keep the new behavior (so we do not discard information) and introduce a flag to let users switch back to Spark 1.3's behavior. What do you think?
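To illustrate the difference being discussed, here is a minimal, hypothetical sketch (not Spark's actual implementation, which operates on Catalyst `StructType`s) of the Spark 1.3 behavior: recursively pruning empty structs out of a schema inferred from JSON, where a schema is modeled as a nested dict.

```python
# Hypothetical schema representation for illustration only:
#   {"type": "struct", "fields": {name: child_schema, ...}}
#   {"type": "long"}, {"type": "string"}, etc. for leaf types.

def prune_empty_structs(schema):
    """Return the schema with empty structs removed, or None if the whole
    schema collapses to an empty struct (sketch of the 1.3 behavior)."""
    if schema["type"] != "struct":
        return schema
    pruned = {}
    for name, child in schema["fields"].items():
        kept = prune_empty_structs(child)
        if kept is not None:  # drop children that collapsed to nothing
            pruned[name] = kept
    if not pruned:
        return None  # an empty struct disappears entirely
    return {"type": "struct", "fields": pruned}
```

For a document like `{"a": {}, "b": 1}`, the inferred field `a` (an empty struct) is dropped and only `b` survives, which is the small loss of information mentioned above.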
Test build #34855 has finished for PR 6799 at commit
@yhuai I agree. I think a better default approach might be to fail in the Parquet writer (instead of writing a file it cannot read)... and add a flag to enable this patch.
@NathanHowell Yeah, sounds good. In the error message, we can ask the user to drop that column. @liancheng Where would be a good place to add this check?
On second thought, I feel it is better to just drop those empty structs and their corresponding values when we write data to Parquet, and log a warning message. @NathanHowell How about we split the work? Can you add the flag we talked about in this PR (the flag that lets us fall back to the 1.3 schema-inference behavior)? I can create another PR to add a projection that removes those empty structs in the Parquet write path.
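As a rough sketch of the write-path idea described here (a hypothetical illustration, not Spark's Parquet writer): project away any column whose type collapses to an empty struct, logging a warning for each dropped column. The schema representation is the same nested-dict model assumed above.

```python
import logging

log = logging.getLogger("parquet-writer-sketch")

def is_empty_struct(schema):
    """True if the schema is a struct with no fields, or whose fields are
    all themselves (recursively) empty structs."""
    if schema["type"] != "struct":
        return False
    return all(is_empty_struct(c) for c in schema["fields"].values())

def project_writable_columns(columns):
    """Drop columns Parquet could not round-trip, warning for each.
    `columns` is a list of (name, schema) pairs."""
    kept = []
    for name, schema in columns:
        if is_empty_struct(schema):
            log.warning("Dropping column %r: empty structs cannot be "
                        "written to Parquet", name)
        else:
            kept.append((name, schema))
    return kept
```

Note that a struct containing only empty structs is treated as empty too, so the pruning matches the recursive inference-side behavior.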
Sounds good to me.
@NathanHowell Actually, do you think we should just fix it on the Parquet side instead of introducing the flag? Since it is Parquet's issue, maybe it is not worth adding a flag (also, this flag may not be used in most cases).
I feel we can just fix the Parquet part and do not need to touch the JSON-related code.
I'm fine with that too.
Found it is hard to drop those columns in Parquet's write path... Let's check this one in to make JSON have the same behavior as 1.3. I will merge it to both master and branch-1.4 once it passes the tests.
LGTM pending Jenkins tests.
Test build #938 timed out for PR 6799 at commit
test this please |
Test build #940 has finished for PR 6799 at commit
Test build #35324 has finished for PR 6799 at commit
I am merging it to master and branch-1.4.
Author: Nathan Howell <[email protected]>

Closes #6799 from NathanHowell/spark-8093 and squashes the following commits:

76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents

(cherry picked from commit 9814b97)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala
Merged. I manually fixed a small conflict for the 1.4 branch. Thanks @NathanHowell!