[#596] feat(netty): Use off heap memory to read HDFS data #806
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master     #806      +/-   ##
============================================
+ Coverage     57.63%   58.87%   +1.23%
- Complexity     2058     2062       +4
============================================
  Files           306      292      -14
  Lines         14871    12976    -1895
  Branches       1221     1232      +11
============================================
- Hits           8571     7639     -932
+ Misses         5808     4900     -908
+ Partials        492      437      -55

... and 16 files with indirect coverage changes
Before:

    byteBufInputStream = new ByteBufInputStream(Unpooled.wrappedBuffer(data.array(), data.position(), size), true);

After:

    // Uncompressed data is released in this class; compressed data is released in ShuffleReadClientImpl.
    // So if codec is null, we don't release the data when the stream is closed.
    byteBufInputStream = new ByteBufInputStream(Unpooled.wrappedBuffer(data), codec != null);
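For context, a minimal standalone sketch of the `releaseOnClose` ownership rule applied above; the demo class and values are illustrative, not part of the PR:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufInputStream;
import io.netty.buffer.Unpooled;

import java.io.IOException;
import java.io.InputStream;

public class ReleaseOnCloseDemo {
  public static void main(String[] args) throws IOException {
    // releaseOnClose = true: the stream owns the buffer and releases it on close().
    ByteBuf owned = Unpooled.wrappedBuffer(new byte[] {1, 2, 3});
    try (InputStream in = new ByteBufInputStream(owned, true)) {
      in.read();
    }
    System.out.println(owned.refCnt()); // 0 -- released by the stream

    // releaseOnClose = false: the caller keeps ownership, as when codec == null
    // and ShuffleReadClientImpl still has to release the shared buffer.
    ByteBuf shared = Unpooled.wrappedBuffer(new byte[] {4, 5, 6});
    try (InputStream in = new ByteBufInputStream(shared, false)) {
      in.read();
    }
    System.out.println(shared.refCnt()); // 1 -- still owned by the caller
    shared.release();
  }
}
```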
Is it possible to unify where the buffer is released?
It seems difficult. I don't have a good idea.
I believe the off-heap read should be optional and configurable. The read happens on the client side, which mostly means Spark clients, and Spark applications don't enable off-heap memory management by default. If this is mandatory, users would have to modify their Spark configurations to avoid direct memory OOM.
We already limit the size of the data we read: it is usually 32 MB, so it won't occupy too much off-heap memory. If we add a config option for this feature, we will suffer more GC pressure under the default configuration, and we would have to maintain a heap-memory mode and an off-heap-memory mode at the same time, which adds to the code-maintenance burden.
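For illustration only, here is what such an opt-in switch could look like; the config key and the `ReadBufferFactory` class are hypothetical, since the PR deliberately avoids adding this option:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.Unpooled;

// Hypothetical illustration of an opt-in switch between heap and off-heap read buffers.
public class ReadBufferFactory {
  // Hypothetical key; not something this PR adds.
  public static final String OFF_HEAP_READ_KEY = "rss.client.offHeap.read.enabled";

  private final boolean offHeapEnabled;

  public ReadBufferFactory(boolean offHeapEnabled) {
    this.offHeapEnabled = offHeapEnabled;
  }

  /** Allocates the buffer that a single HDFS read (typically ~32 MB) is staged into. */
  public ByteBuf allocate(int size) {
    return offHeapEnabled
        ? PooledByteBufAllocator.DEFAULT.directBuffer(size)  // off-heap, must be released
        : Unpooled.buffer(size);                             // on-heap, GC-managed
  }
}
```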
Do you have any cases where the client suffers from GC problems specifically on the HDFS data-read code path? Normally, no other system goes out of its way to read HDFS through off-heap ByteBuffers. As for code maintenance, that's the bill we have to pay.
I guess this is caused by too many small objects. Could we use Spark's resident shareable memory to avoid the repeated allocations?
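As a hedged sketch of that idea, one long-lived direct buffer could be reused across reads instead of allocating many short-lived objects; the class name and API below are hypothetical:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a resident, shared read buffer: one direct buffer,
// allocated once and reused for every read, instead of many short-lived objects.
public class ResidentReadBuffer {
  private final ByteBuffer buffer;

  public ResidentReadBuffer(int capacity) {
    // One off-heap allocation for the lifetime of the reader.
    this.buffer = ByteBuffer.allocateDirect(capacity);
  }

  /** Returns the shared buffer, cleared and ready for the next read. */
  public ByteBuffer acquire() {
    buffer.clear();
    return buffer;
  }
}
```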
…HdfsShuffleReadHandler.java Co-authored-by: advancedxy <[email protected]>
@advancedxy All comments are addressed.
Generally lgtm, left minor comments
LGTM, thanks for your work.
What changes were proposed in this pull request?
Use off-heap memory to read HDFS shuffle data. (To do: use off-heap memory to read HDFS index data as well.)
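A minimal sketch of the technique behind the change, assuming Hadoop's `ByteBufferReadable` read path; the helper below is illustrative, not the PR's actual code:

```java
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: read one shuffle block from HDFS into a direct (off-heap) buffer.
public class OffHeapHdfsRead {
  public static ByteBuffer read(String file, long offset, int length) throws IOException {
    Path path = new Path(file);
    FileSystem fs = path.getFileSystem(new Configuration());
    try (FSDataInputStream in = fs.open(path)) {
      ByteBuffer buffer = ByteBuffer.allocateDirect(length); // off-heap destination
      in.seek(offset);
      // read(ByteBuffer) comes from ByteBufferReadable (supported by DFSInputStream);
      // it may read fewer bytes than requested, so loop until full or EOF.
      while (buffer.hasRemaining() && in.read(buffer) >= 0) {
        // keep reading
      }
      buffer.flip();
      return buffer;
    }
  }
}
```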
Why are the changes needed?
Fix: #596
Does this PR introduce any user-facing change?
Yes, documentation is added.
How was this patch tested?
Passed the existing tests.