
Spark SQL: merge small files into big files (update InsertIntoHiveTable.scala) #18609

Closed
wants to merge 3 commits

Conversation

@wuzhilon commented Jul 12, 2017

Merge small Hive files into large files; supports the ORC and text table storage formats.

## What changes were proposed in this pull request?

We have many Spark SQL partitioned tables, and each table partition contains many small files. This puts a lot of pressure on the cluster's HDFS. We use this feature to merge the small files, reducing the HDFS pressure on the cluster to about 1/10 of what it was.
## How to test this function

You can test it with SQL of the form `INSERT INTO TABLE ... SELECT XXXX`.
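For illustration, a minimal sketch of driving that path from the Spark shell (where `spark` is predefined); the table names and schema below are hypothetical placeholders, not taken from the patch:

```scala
// Hypothetical tables: names and schema are placeholders, not from the patch.
spark.sql("CREATE TABLE merged_tbl (id INT, name STRING) STORED AS ORC")
// A plain INSERT INTO ... SELECT against a Hive table goes through the
// InsertIntoHiveTable path that this patch modifies.
spark.sql("INSERT INTO TABLE merged_tbl SELECT id, name FROM source_tbl")
```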

@AmplabJenkins

Can one of the admins verify this patch?

@gatorsmile (Member)

Can you just repartition your data before writing the files?
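For reference, a minimal sketch of that workaround, assuming `df` is the DataFrame about to be written and `merged_tbl` is a hypothetical target table; `numPartitions` is a hand-picked value, not something Spark derives from the data size:

```scala
// Workaround sketch: produce fewer, larger output files by repartitioning
// before the write. numPartitions is chosen by hand; Spark does not size
// it automatically from the data volume.
val numPartitions = 16 // assumption: tuned manually per table
df.repartition(numPartitions)
  .write
  .mode("append")
  .insertInto("merged_tbl") // hypothetical target table
```

Note that `repartition` triggers a full shuffle; `coalesce` avoids the shuffle but can leave the output files unevenly sized.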

@wuzhilon (Author)

I am trying to. The problem is that I cannot know the size of the data in MB; I can only get the number of rows, and then repartition in a coarse-grained way.
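A minimal sketch of what that coarse-grained sizing looks like, assuming `df` and `merged_tbl` as above; `rowsPerFile` here is a hand-tuned guess standing in for the unknown byte size:

```scala
// Coarse-grained sizing: only the row count is available, not the byte
// size, so rows-per-file is a heuristic chosen by hand rather than a
// target file size in MB.
val rowCount = df.count()
val rowsPerFile = 1000000L // assumption: hand-tuned heuristic
val numPartitions = math.max(1L, rowCount / rowsPerFile).toInt
df.repartition(numPartitions).write.insertInto("merged_tbl")
```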

@HyukjinKwon (Member)

@wuzhilon, could you explain why it is problematic if we just repartition? I didn't understand this:

> The problem is that I cannot know the size of the data in MB; I can only get the number of rows, and then repartition in a coarse-grained way.

I think the approach here is quite poorly implemented, and we should close this. At the very least, it appears to list files, which is quite costly on S3.
