[SPARK-6190][core] create LargeByteBuffer for eliminating 2GB block limit #5400
Conversation
} else {
  val r = buffer.get() & 0xFF
  if (buffer.remaining() == 0) {
    cleanUp()
Isn't this better done as part of close()?
(In fact you do need close() in case the stream is closed before EOF is reached.)
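For illustration only, here is a minimal sketch of what moving the cleanup into close() could look like. This is not the PR's actual code; the class name and the dispose callback are hypothetical stand-ins for whatever releases the underlying buffer.

import java.io.InputStream
import java.nio.ByteBuffer

// Sketch: a buffer-backed stream whose disposal lives in close(), so the
// buffer is released even when the caller stops reading before EOF.
class BufferBackedInputStream(buffer: ByteBuffer, dispose: ByteBuffer => Unit)
    extends InputStream {

  private var closed = false

  override def read(): Int = {
    if (closed || buffer.remaining() == 0) -1
    else buffer.get() & 0xFF
  }

  override def close(): Unit = {
    if (!closed) {
      closed = true
      dispose(buffer) // cleanup happens here, whether or not EOF was reached
    }
  }
}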
If anyone is watching on the sidelines -- Marcelo and I chatted about this for a while and realized there is an issue with the existing use of ByteBufferInputStream (where this code was copied from) that prevents it from getting properly disposed in all cases. I've opened https://issues.apache.org/jira/browse/SPARK-6839
Test build #29808 has finished for PR 5400 at commit
private var _pos = 0

override def write(b: Int): Unit = {
  output.write(b)
Don't you need to update _pos?
In fact, it seems like _pos is not really used.
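As a sketch of the point being made here (illustrative only, not this PR's code; the class name is hypothetical and output stands in for the wrapped stream): if _pos is meant to track the number of bytes written, every write variant needs to advance it.

import java.io.OutputStream

// Sketch: an OutputStream wrapper that keeps _pos in sync with what was written.
class PositionTrackingOutputStream(output: OutputStream) extends OutputStream {
  private var _pos = 0L

  override def write(b: Int): Unit = {
    output.write(b)
    _pos += 1
  }

  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    output.write(b, off, len)
    _pos += len
  }

  def position: Long = _pos
}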
I didn't look at the tests in detail; I found some discrepancies between the code and the
Test build #29942 has started for PR 5400 at commit
jenkins, test this please
Test build #29944 has finished for PR 5400 at commit
jenkins, test this please
Test build #29972 has finished for PR 5400 at commit
Test build #30190 has finished for PR 5400 at commit
Test build #30249 has finished for PR 5400 at commit
Test build #41274 timed out for PR 5400 at commit
How long does the >2GB test take to run?
How is it going? Is it still WIP? I can help test.
  }
}

// only for testing
nit: comment is redundant.
I had already reviewed this and I don't see any changes, so my only worry is still the same as Tom's: how long the large file test takes. I guess it's not horrible if it's a single test taking 10s, but if we could avoid it and still be reasonably sure that things work, that would be better.
Thanks for taking another look @vanzin, fixed those last issues. Sorry I never responded earlier @tgravescs -- that one test takes about 10s on my laptop, and it looks like it took 15s on the last jenkins run. Personally I feel better w/ the test in there, but I agree it's not adding a ton of value -- happy to scrap it if you prefer.
jenkins, retest this please
Test build #45032 has finished for PR 5400 at commit
I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!
hi @squito, can you explain in which situations users will hit the 2GB limit? Will a job processing very large data (such as PB-level data) reach this limit?
@scwf,
@snnn 2 is not true. Partitions can be arbitrarily large. The cached size cannot be greater than 2GB.
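For readers finding this thread later, that ceiling ultimately comes from the JVM's Int-indexed arrays and buffers rather than from anything Spark-specific. A small illustrative sketch (not code from this PR):

import java.nio.ByteBuffer

// ByteBuffer.allocate(capacity: Int) takes an Int, so a single buffer (and
// therefore a single block backed by one buffer) cannot exceed
// Integer.MAX_VALUE bytes, i.e. just under 2GB.
val maxSingleBufferBytes: Int = Integer.MAX_VALUE // ~2GB minus one byte

// The largest buffer the API can express (memory permitting):
// ByteBuffer.allocate(maxSingleBufferBytes)
// Anything larger simply cannot be passed as an Int capacity.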
@rxin how to understand the
Hey, does this address the issue that spark.sql.autoBroadcastJoinThreshold cannot be more than 2GB?
@squito @SparkQA @vanzin @shaneknapp @tgravescs I am using spark.sql on AWS Glue to generate a single large compressed CSV file (it is the client's requirement to have a single file), which is definitely greater than 2GB. I am running into this issue at write(transformed_feed). Below is the Python code used to write the file. I am assuming it's because the file is greater than 2GB. Has this issue been fixed?
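The Python snippet referenced above did not make it into this thread. As a rough, hypothetical sketch of the kind of single-file compressed CSV write being described (written in Scala rather than the original Python/Glue code; the function and parameter names are made up):

import org.apache.spark.sql.DataFrame

// Hypothetical sketch, not the reporter's actual code: coalescing to a single
// partition forces one output file, which is what can push a single task's
// output past 2GB.
def writeSingleCsv(transformedFeed: DataFrame, outputPath: String): Unit = {
  transformedFeed
    .coalesce(1)                   // one partition => one output file
    .write
    .option("header", "true")
    .option("compression", "gzip") // compressed output, as in the report
    .csv(outputPath)
}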
@jkhalid What Spark version were you on? Many fixes were not available until Spark 2.4. Can you share the entire stack trace, particularly the Java portion? The limit still exists for single records which are over 2GB. It shouldn't be a problem for reading a whole file that is over 2GB, but there may be some problem when going back and forth from Python; I'm not very familiar w/ that part.
@squito And I was encountering the problem while writing the file, not reading it. I am attaching the stack trace. Please let me know if you see something I missed.
@squito
Can we move discussions unrelated to a long-closed PR to either JIRA or the mailing list? Thanks.
https://issues.apache.org/jira/browse/SPARK-6190