
Spark SQL: merge small files into big files (update InsertIntoHiveTable.scala) #18609

Closed
wants to merge 3 commits

Conversation

@wuzhilon commented Jul 12, 2017

Merge small Hive files into large files; supports the ORC and text table storage formats.

## What changes were proposed in this pull request?

We have many Spark SQL partitioned tables, and each table partition contains many small files. This puts a lot of pressure on the cluster's HDFS. We use this feature to merge the small files, reducing the HDFS pressure on the cluster to about 1/10 of what it was.
## How to test this function

You can test it with SQL of the form `INSERT INTO TABLE ... SELECT XXXX`.
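For illustration, a minimal sketch of driving that path from the Spark shell (where `spark` is predefined); the table names and schema below are hypothetical placeholders, not taken from the patch:

```scala
// Hypothetical tables: names and schema are placeholders, not from the patch.
spark.sql("CREATE TABLE merged_tbl (id INT, name STRING) STORED AS ORC")
// A plain INSERT INTO ... SELECT against a Hive table goes through the
// InsertIntoHiveTable path that this patch modifies.
spark.sql("INSERT INTO TABLE merged_tbl SELECT id, name FROM source_tbl")
```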

@AmplabJenkins

Can one of the admins verify this patch?

@gatorsmile (Member)

Can you just repartition your data before writing the files?
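For reference, a minimal sketch of that workaround, assuming `df` is the DataFrame about to be written and `merged_tbl` is a hypothetical target table; `numPartitions` is a hand-picked value, not something Spark derives from the data size:

```scala
// Workaround sketch: produce fewer, larger output files by repartitioning
// before the write. numPartitions is chosen by hand; Spark does not size
// it automatically from the data volume.
val numPartitions = 16 // assumption: tuned manually per table
df.repartition(numPartitions)
  .write
  .mode("append")
  .insertInto("merged_tbl") // hypothetical target table
```

Note that `repartition` triggers a full shuffle; `coalesce` avoids the shuffle but can leave the output files unevenly sized.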

@wuzhilon (Author)

I am trying to. The problem is that I cannot know the size of the data in MB; I can only get the number of rows, and then repartition in a coarse-grained way.
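A minimal sketch of what that coarse-grained sizing looks like, assuming `df` and `merged_tbl` as above; `rowsPerFile` here is a hand-tuned guess standing in for the unknown byte size:

```scala
// Coarse-grained sizing: only the row count is available, not the byte
// size, so rows-per-file is a heuristic chosen by hand rather than a
// target file size in MB.
val rowCount = df.count()
val rowsPerFile = 1000000L // assumption: hand-tuned heuristic
val numPartitions = math.max(1L, rowCount / rowsPerFile).toInt
df.repartition(numPartitions).write.insertInto("merged_tbl")
```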

@HyukjinKwon (Member)

@wuzhilon, could you explain why it is problematic if we just repartition? I didn't understand this:

> The problem is that I cannot know the size of the data in MB; I can only get the number of rows, and then repartition in a coarse-grained way.

I think the approach here is quite poorly implemented, and we should close this. At the very least, it appears to list files, which is quite costly on S3.
