Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8093] [SQL] Remove empty structs inferred from JSON documents #6799

Closed
wants to merge 1 commit into from

Conversation

NathanHowell
Copy link

No description provided.

@yhuai
Copy link
Contributor

yhuai commented Jun 13, 2015

ok to test

@yhuai
Copy link
Contributor

yhuai commented Jun 13, 2015

@NathanHowell Thank you for working on it! I am wondering if we can keep the new behavior and introduce a flag to let users switch back to the old behavior? Here is my thoughts on it. Our old behavior ignores the existence of empty JSON objects. So, after we get the DataFrame, we actually lose a small piece of information about the dataset. Also, if we just change the behavior to our old behavior, because we have already released 1.4 this change will go in 1.5, we actually will change the behavior again. I feel maybe it is good to keep the new behavior (we will not discard information) and introduce a flag to let user switch back to our spark 1.3's behavior. What do you think?

@SparkQA
Copy link

SparkQA commented Jun 14, 2015

Test build #34855 has finished for PR 6799 at commit 76ac3e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@NathanHowell
Copy link
Author

@yhuai I agree, I think a better default approach might be to fail in the Parquet writer (instead of writing a fail it cannot read)... and add a flag to enable this patch.

@yhuai
Copy link
Contributor

yhuai commented Jun 18, 2015

@NathanHowell Yeah, sounds good. In the error message, we can ask user to drop that column. @liancheng Where will be the good place to add this check?

@yhuai
Copy link
Contributor

yhuai commented Jun 18, 2015

After second thought, I feel it is better to just drop those empty structs and their corresponding values when we write data to parquet and log a warning message. @NathanHowell How about we split the work. Can you add the flag we talked about in this pr (the flag let us fall back to the 1.3 behavior of inferring schema)? I can create another PR to add a project to remove those empty structs in the parquet write path.

@NathanHowell
Copy link
Author

Sounds good to me.
On Jun 18, 2015 4:58 PM, "Yin Huai" [email protected] wrote:

After second thought, I feel it is better to just drop those empty structs
and their corresponding values when we write data to parquet and log a
warning message. @NathanHowell https://github.com/NathanHowell How
about we split the work. Can you add the flag we talked about in this pr
(the flag let us fall back to the 1.3 behavior of inferring schema)? I can
create another PR to add a project to remove those empty structs in the
parquet write path.


Reply to this email directly or view it on GitHub
#6799 (comment).

@yhuai
Copy link
Contributor

yhuai commented Jun 19, 2015

@NathanHowell Actually, do you think we should just fix it in the parquet side instead of introducing the flag? Since it is parquet's issue, maybe it is not worth adding a flag (also, this flag may not be used in most of the cases).

@yhuai
Copy link
Contributor

yhuai commented Jun 19, 2015

I feel we can just fix the parquet part and do not need to touch code related to json.

@NathanHowell
Copy link
Author

I'm fine with that too.

On Thu, Jun 18, 2015 at 6:23 PM, Yin Huai [email protected] wrote:

I feel we can just fix the parquet part and do not need to touch code
related to json.


Reply to this email directly or view it on GitHub
#6799 (comment).

@yhuai
Copy link
Contributor

yhuai commented Jun 19, 2015

Found it is hard to drop those columns in parquet's write path... Let's check this one in to make JSON has the same behavior with 1.3. I will merge it to both master and branch 1.4 once it passes the test.

@yhuai
Copy link
Contributor

yhuai commented Jun 19, 2015

LGTM pending Jenkins tests.

@SparkQA
Copy link

SparkQA commented Jun 19, 2015

Test build #938 timed out for PR 6799 at commit 76ac3e8 after a configured wait of 175m.

@yhuai
Copy link
Contributor

yhuai commented Jun 19, 2015

test this please

@SparkQA
Copy link

SparkQA commented Jun 19, 2015

Test build #940 has finished for PR 6799 at commit 76ac3e8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 19, 2015

Test build #35324 has finished for PR 6799 at commit 76ac3e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yhuai
Copy link
Contributor

yhuai commented Jun 19, 2015

I am merging it to master and branch 1.4.

@asfgit asfgit closed this in 9814b97 Jun 19, 2015
asfgit pushed a commit that referenced this pull request Jun 19, 2015
Author: Nathan Howell <[email protected]>

Closes #6799 from NathanHowell/spark-8093 and squashes the following commits:

76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents

(cherry picked from commit 9814b97)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala
@yhuai
Copy link
Contributor

yhuai commented Jun 19, 2015

Merged. I manually fixed a small conflict for 1.4 branch. Thanks @NathanHowell !

nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 22, 2015
Author: Nathan Howell <[email protected]>

Closes apache#6799 from NathanHowell/spark-8093 and squashes the following commits:

76ac3e8 [Nathan Howell] [SPARK-8093] [SQL] Remove empty structs inferred from JSON documents

(cherry picked from commit 9814b97)
Signed-off-by: Yin Huai <[email protected]>

Conflicts:
	sql/core/src/test/scala/org/apache/spark/sql/json/TestJsonData.scala
@NathanHowell NathanHowell deleted the spark-8093 branch December 8, 2016 00:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants