Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#596] feat(netty): Use off heap memory to read HDFS data #806

Merged
merged 38 commits into from
Apr 13, 2023

Conversation

jerqi
Copy link
Contributor

@jerqi jerqi commented Apr 10, 2023

What changes were proposed in this pull request?

  1. Use off heap memory to read HDFS data
  2. remove some unused code
    (to do: use off heap memory to read HDFS index data)

Why are the changes needed?

Fix: #596

Does this PR introduce any user-facing change?

Yes, add the document.

How was this patch tested?

Pass origin tests.

@jerqi jerqi marked this pull request as draft April 10, 2023 06:31
@jerqi jerqi changed the title [#596] Use off heap memory to read HDFS data [#596] feat(netty): Use off heap memory to read HDFS data Apr 10, 2023
@codecov-commenter
Copy link

codecov-commenter commented Apr 10, 2023

Codecov Report

Merging #806 (d198fb5) into master (c9abe9a) will increase coverage by 1.23%.
The diff coverage is 35.23%.

@@             Coverage Diff              @@
##             master     #806      +/-   ##
============================================
+ Coverage     57.63%   58.87%   +1.23%     
- Complexity     2058     2062       +4     
============================================
  Files           306      292      -14     
  Lines         14871    12976    -1895     
  Branches       1221     1232      +11     
============================================
- Hits           8571     7639     -932     
+ Misses         5808     4900     -908     
+ Partials        492      437      -55     
Impacted Files Coverage Δ
...pache/hadoop/mapreduce/task/reduce/RssShuffle.java 0.00% <ø> (ø)
...e/uniffle/client/factory/ShuffleClientFactory.java 0.00% <0.00%> (ø)
...client/request/CreateShuffleReadClientRequest.java 0.00% <0.00%> (ø)
.../java/org/apache/uniffle/common/util/RssUtils.java 57.77% <0.00%> (-1.32%) ⬇️
...uniffle/storage/factory/ShuffleHandlerFactory.java 0.00% <0.00%> (ø)
...e/uniffle/storage/handler/impl/HdfsFileReader.java 48.07% <0.00%> (-35.26%) ⬇️
.../uniffle/storage/handler/impl/LocalFileReader.java 50.00% <0.00%> (-3.34%) ⬇️
...orage/request/CreateShuffleReadHandlerRequest.java 0.00% <0.00%> (ø)
...ache/uniffle/storage/util/ShuffleStorageUtils.java 71.08% <ø> (+3.26%) ⬆️
...e/storage/handler/impl/HdfsShuffleReadHandler.java 51.35% <25.00%> (-4.90%) ⬇️
... and 6 more

... and 16 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@jerqi jerqi marked this pull request as ready for review April 10, 2023 08:25
@jerqi jerqi requested review from advancedxy, kaijchen and zuston and removed request for kaijchen April 10, 2023 08:26
byteBufInputStream = new ByteBufInputStream(Unpooled.wrappedBuffer(data.array(), data.position(), size), true);
// Uncompressed data is released in this class, Compressed data is release in the class ShuffleReadClientImpl
// So if codec is null, we don't release the data when the stream is closed
byteBufInputStream = new ByteBufInputStream(Unpooled.wrappedBuffer(data), codec != null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to unify where the buffer is released?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems difficult. I don't have good idea.

@advancedxy
Copy link
Contributor

I believe the off-heap read should be optional and configurable.

The reader happens in the client side, which mostly are spark clients. Spark applications doesn't enable off-heap management by default. If this is mandatory, it would require users to modify spark configurations to avoid direct memory out of memory.

@jerqi
Copy link
Contributor Author

jerqi commented Apr 11, 2023

I believe the off-heap read should be optional and configurable.

The reader happens in the client side, which mostly are spark clients. Spark applications doesn't enable off-heap management by default. If this is mandatory, it would require users to modify spark configurations to avoid direct memory out of memory.

We have controlled the size of data which we read. It is usually 32MB, it won't occupy too much off heap memory. If we add a config option for this feature. We will suffered the more GC problems when we use default config option and we need to mantain heap memory and off heap memory mode at the same time. It will burden the pressure of code maintain.

@advancedxy
Copy link
Contributor

We have controlled the size of data which we read. It is usually 32MB, it won't occupy too much off heap memory. If we add a config option for this feature. We will suffered the more GC problems when we use default config option and we need to mantain heap memory and off heap memory mode at the same time. It will burden the pressure of code maintain.

Do you have any cases that the client is suffered from GC problems and is especially related to HDFS data read code path?

It's just that normally no other system would to support read hdfs via off-heap bytebuffer specially?

As for code maintenance, it's the bill that we have to pay.

@jerqi
Copy link
Contributor Author

jerqi commented Apr 11, 2023

We have controlled the size of data which we read. It is usually 32MB, it won't occupy too much off heap memory. If we add a config option for this feature. We will suffered the more GC problems when we use default config option and we need to mantain heap memory and off heap memory mode at the same time. It will burden the pressure of code maintain.

Do you have any cases that the client is suffered from GC problems and is especially related to HDFS data read code path?

It's just that normally no other system would to support read hdfs via off-heap bytebuffer specially?

As for code maintenance, it's the bill that we have to pay.

  1. Our client's GC time is longer than origin Spark when we run the TPCDS
  2. Some new systems will be used direct buffer because they want to use vector read.
  3. It may be hard to use if we need modify the configuration. In origin code, this code used direct memory, it's changed by @zuston . It's ok for us to use direct memory. And the memory manager is the concept of Spark, We don't have memory manager in the shuffle client.

@zuston
Copy link
Member

zuston commented Apr 11, 2023

Our client's GC time is longer than origin Spark when we run the TPCDS

I guess this is caused by the too many small objects. Could we use the resident shareable memory of spark to avoid memory allocation?

@jerqi
Copy link
Contributor Author

jerqi commented Apr 12, 2023

@advancedxy All comments are addressed.

@jerqi jerqi requested a review from advancedxy April 12, 2023 06:11
Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally lgtm, left minor comments

@jerqi jerqi requested a review from advancedxy April 12, 2023 11:22
@jerqi jerqi requested a review from advancedxy April 13, 2023 01:38
Copy link
Contributor

@advancedxy advancedxy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for your working

@jerqi jerqi merged commit c6cde5d into apache:master Apr 13, 2023
jerqi added a commit that referenced this pull request May 10, 2023
### What changes were proposed in this pull request?
We support to read off heap data in the #806, we support to read index in this pr.

### Why are the changes needed?
#596 follow up pr

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
GA passed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Subtask] [Netty] Use off heap memory to read HDFS data
6 participants