
[SPARK-6190][core] create LargeByteBuffer for eliminating 2GB block limit #5400

Closed · wants to merge 33 commits

Conversation

@squito
Contributor

squito commented Apr 7, 2015

} else {
  val r = buffer.get() & 0xFF
  if (buffer.remaining() == 0) {
    cleanUp()

Contributor

Isn't this better done as part of close()?

Contributor

(In fact you do need close() in case the stream is closed before EOF is reached.)

Contributor Author

If anyone is watching on the sidelines -- Marcelo and I chatted about this for a while and realized there is an issue with the existing use of ByteBufferInputStream (where this code was copied from) that prevents it from getting properly disposed in all cases. I've opened https://issues.apache.org/jira/browse/SPARK-6839
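
(For readers following along, here is a minimal sketch of the close()-based cleanup the two review comments above are asking for. It is illustrative only, not the PR's actual code; the class name and the dispose callback are assumptions standing in for whatever actually frees the buffer.)

import java.io.InputStream
import java.nio.ByteBuffer

// Sketch only: an InputStream over a ByteBuffer whose close() always releases
// the underlying buffer, so cleanup happens even if the caller closes the
// stream before reaching EOF.
class DisposingByteBufferInputStream(
    private var buffer: ByteBuffer,
    dispose: ByteBuffer => Unit) extends InputStream {

  override def read(): Int = {
    if (buffer == null || buffer.remaining() == 0) {
      close()  // nothing left to read; release the buffer
      -1
    } else {
      buffer.get() & 0xFF
    }
  }

  // Idempotent, so it is safe to call both from read() at EOF and from user code.
  override def close(): Unit = {
    if (buffer != null) {
      dispose(buffer)
      buffer = null
    }
  }
}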

@SparkQA

SparkQA commented Apr 7, 2015

Test build #29808 has finished for PR 5400 at commit 9f53203.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LargeByteBufferOutputStream(chunkSize: Int = 65536)
    • public class BufferTooLargeException extends IOException
    • public class LargeByteBufferHelper
    • public class WrappedLargeByteBuffer implements LargeByteBuffer
  • This patch does not change any dependencies.
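
(The class names in the report above come from the patch itself. As a rough illustration of the underlying idea -- not the PR's implementation -- a buffer larger than 2GB can be represented by several smaller ByteBuffer chunks addressed through a Long position:)

import java.nio.ByteBuffer

// Sketch only: translate a Long position into (chunk index, offset within chunk)
// so the total size can exceed Integer.MAX_VALUE even though each chunk cannot.
class ChunkedBuffer(chunks: Array[ByteBuffer]) {
  val size: Long = chunks.map(_.remaining().toLong).sum

  def get(pos: Long): Byte = {
    require(pos >= 0 && pos < size, s"position $pos out of range")
    var idx = 0
    var offset = pos
    while (offset >= chunks(idx).remaining()) {
      offset -= chunks(idx).remaining()
      idx += 1
    }
    // absolute read, relative to the chunk's current position
    chunks(idx).get(chunks(idx).position() + offset.toInt)
  }
}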

private var _pos = 0

override def write(b: Int): Unit = {
  output.write(b)

Contributor

Don't you need to update _pos?

Contributor

In fact, it seems like _pos is not really used.
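
(A minimal sketch of what the two comments above are pointing at, assuming _pos is meant to track the number of bytes written: either every write path keeps it in sync, or the field should be dropped. Names are illustrative, not the PR's code.)

import java.io.OutputStream

// Sketch only: an OutputStream wrapper whose position stays in sync with
// every write path.
class CountingOutputStream(output: OutputStream) extends OutputStream {
  private var _pos = 0L

  override def write(b: Int): Unit = {
    output.write(b)
    _pos += 1
  }

  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    output.write(b, off, len)
    _pos += len
  }

  def pos: Long = _pos
}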

@vanzin
Contributor

vanzin commented Apr 7, 2015

I didn't look at the tests in detail; I found some discrepancies between the code and the LargeByteBuffer interface that should probably be fixed one way or another (either the interface needs updating, or the code needs fixing).

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29942 has started for PR 5400 at commit a759242.

@shaneknapp
Contributor

jenkins, test this please

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29944 has finished for PR 5400 at commit a759242.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LargeByteBufferOutputStream(chunkSize: Int = 65536)
    • public class BufferTooLargeException extends IOException
    • public class LargeByteBufferHelper
    • public class WrappedLargeByteBuffer implements LargeByteBuffer
  • This patch does not change any dependencies.

@squito
Contributor Author

squito commented Apr 9, 2015

jenkins, test this please

@SparkQA

SparkQA commented Apr 9, 2015

Test build #29972 has finished for PR 5400 at commit a759242.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LargeByteBufferOutputStream(chunkSize: Int = 65536)
    • public class BufferTooLargeException extends IOException
    • public class LargeByteBufferHelper
    • public class WrappedLargeByteBuffer implements LargeByteBuffer
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 13, 2015

Test build #30190 has finished for PR 5400 at commit c3efa4c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LargeByteBufferOutputStream(chunkSize: Int = 65536)
    • public class BufferTooLargeException extends IOException
    • public class LargeByteBufferHelper
    • public class WrappedLargeByteBuffer implements LargeByteBuffer
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 14, 2015

Test build #30249 has finished for PR 5400 at commit e1d8fa8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LargeByteBufferOutputStream(chunkSize: Int = 65536)
    • public class BufferTooLargeException extends IOException
    • public class LargeByteBufferHelper
    • public class WrappedLargeByteBuffer implements LargeByteBuffer
  • This patch adds the following new dependencies:
    • commons-math3-3.4.1.jar
  • This patch removes the following dependencies:
    • commons-math3-3.1.1.jar

@SparkQA

SparkQA commented Aug 19, 2015

Test build #41274 timed out for PR 5400 at commit 80c4032 after a configured wait of 175m.

@tgravescs
Contributor

How long does the >2GB test take to run?

@snnn

snnn commented Oct 19, 2015

How is it going? Is it still WIP? I can help to test.

}
}

// only for testing

Contributor

nit: comment is redundant.

@vanzin
Contributor

vanzin commented Nov 2, 2015

I had already reviewed this and I don't see any changes, so my only worry is still the same as Tom's: how long the large file test takes. I guess it's not horrible if it's a single test taking 10s, but if we could avoid it and still be reasonably sure that things work, that would be better.

@squito
Contributor Author

squito commented Nov 4, 2015

Thanks for taking another look @vanzin, fixed those last issues. Sorry I never responded earlier @tgravescs -- that one test takes about 10s on my laptop; it looks like it took 15s on the last Jenkins run. Personally I feel better with the test in there, but I agree it's not adding a ton of value -- happy to scrap it if you prefer.

@squito
Contributor Author

squito commented Nov 4, 2015

jenkins, retest this please

@SparkQA

SparkQA commented Nov 4, 2015

Test build #45032 has finished for PR 5400 at commit 3447bb9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class LargeByteBufferInputStream extends InputStream
    • public class LargeByteBufferOutputStream extends OutputStream
    • public class JavaAssociationRulesExample
    • public class JavaPrefixSpanExample
    • public class JavaSimpleFPGrowth
    • public class BufferTooLargeException extends IOException
    • public class LargeByteBufferHelper
    • public class WrappedLargeByteBuffer implements LargeByteBuffer
    • class StreamInterceptor implements TransportFrameDecoder.Interceptor
    • public final class ChunkFetchSuccess extends ResponseWithBody
    • public abstract class ResponseWithBody implements ResponseMessage
    • public final class StreamFailure implements ResponseMessage
    • public final class StreamRequest implements RequestMessage
    • public final class StreamResponse extends ResponseWithBody
    • public class TransportFrameDecoder extends ChannelInboundHandlerAdapter

@rxin
Contributor

rxin commented Dec 31, 2015

I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!

@asfgit asfgit closed this in 93b52ab Dec 31, 2015
@scwf
Contributor

scwf commented Jan 7, 2016

Hi @squito, can you explain in which situations users will hit the 2GB limit? Will a job processing very large data (such as PB-scale data) reach this limit?

@snnn

snnn commented Jan 8, 2016

@scwf ,

  1. The shuffle output from one mapper to one reducer cannot be more than 2GB.
  2. Partitions of an RDD cannot exceed 2GB.

@rxin
Contributor

rxin commented Jan 8, 2016

@snnn 2 is not true. Partitions can be arbitrarily large. The cached size cannot be greater than 2GB.
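
(Background, summarized here rather than quoted from the thread: the 2GB ceiling exists because JVM arrays and ByteBuffers are indexed with an Int, so anything that must live in a single buffer is capped at Integer.MAX_VALUE bytes. That is also where the "Size exceeds Integer.MAX_VALUE" error reported further down comes from.)

import java.nio.ByteBuffer

// Sketch only: the hard upper bound for any block held in one ByteBuffer.
object TwoGbCeiling {
  def main(args: Array[String]): Unit = {
    println(Int.MaxValue)  // 2147483647 bytes, i.e. just under 2 GiB
    // ByteBuffer.allocate takes an Int capacity, so this is the largest
    // single buffer that can ever be requested:
    // ByteBuffer.allocate(Int.MaxValue)
  }
}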

@scwf
Contributor

scwf commented Jan 8, 2016

The cached size cannot be greater than 2GB.

@rxin How should I understand the cached size? Is it the partition size of a cached RDD?

@vijay1106

Hey, does this address the issue that spark.sql.autoBroadcastJoinThreshold cannot be more than 2GB?

@jkhalid

jkhalid commented Feb 12, 2019

@squito @SparkQA @vanzin @shaneknapp @tgravescs

I am using spark.sql on AWS Glue to generate a single large gzip-compressed CSV file (it is the client's requirement to have a single file), which is definitely greater than 2GB. I am running into this issue:

write(transformed_feed)
  File "script_2019-02-12-15-57-55.py", line 161, in write
    output_path_premium, header=True, compression="gzip")
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 766, in csv
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/mnt/yarn/usercache/root/appcache/application_1549986900582_0001/container_1549986900582_0001_01_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o210.csv.
...
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ip-172-32-189-222.ec2.internal, executor 1): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

Below is the Python code used to write the file (the part-file glob in the hdfs commands appears to have been mangled by formatting; reconstructed here as '/part-*'):

def write(dataframe):
    # write two files, premium listings and non-premium listings (criteria: listing_priority > 30 = premium)
    dataframe.filter(dataframe["listing_priority"] >= 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_premium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_premium + '/part-*' + ' ' + output_path_premium + output_file_premium
    os.system(shell_command)
    dataframe.filter(dataframe["listing_priority"] < 30).drop('listing_priority').drop('image_count').write.csv(
        output_path_nonpremium, header=True, compression="gzip")
    shell_command = "hdfs dfs -mv " + output_path_nonpremium + '/part-*' + ' ' + output_path_nonpremium + output_file_nonpremium
    os.system(shell_command)

I am assuming it's because the file is greater than 2GB. Has this issue been fixed?

@squito
Contributor Author

squito commented Feb 12, 2019

@jkhalid What Spark version were you on? Many fixes were not available until Spark 2.4.

Can you share the entire stack trace, particularly the Java portion?

The limit still exists for single records which are over 2GB. It shouldn't be a problem for reading a whole file which is over 2GB, but there may be some problem when going back and forth from Python; I'm not very familiar with that part.

@jkhalid

jkhalid commented Feb 12, 2019

@squito
:( My bad, I guess AWS is using 2.2.1:
https://aws.amazon.com/about-aws/whats-new/2018/04/aws-glue-now-supports-apache-spark-221/

and I was encountering the problem while writing the file, not reading it.

I am attaching the stack trace. Please let me know if you see something I missed.
stacktrace.txt

@jkhalid

jkhalid commented Feb 13, 2019

@squito
Hey, any idea from the stack trace what I might be doing wrong?

@vanzin
Contributor

vanzin commented Feb 13, 2019

Can we move discussions unrelated to a long-closed PR to either jira or the mailing list? Thanks.
