
[optimize][refactor] Optimizing memory usage for writing data #140

Merged: 10 commits merged into apache:master on Sep 13, 2023

Conversation

@gnehil (Contributor) commented Sep 8, 2023

Proposed changes

Optimizations

  1. Optimize data transmission
  • Before optimization: after the batch data was formatted, the whole batch was packaged into a StringEntity and sent in one piece.
  • After optimization: the Stream Load request transmits data with HTTP chunked transfer through an InputStream, which avoids the memory overhead of converting the whole batch into a string when building the entity.
  2. Optimize partitioned data iteration
  • Before optimization: before batch splitting, iterator.grouped was used to group the partition iterator by batch size. Each group materializes a batch-size collection of records in memory, so a large batch setting drives memory usage up and can easily lead to OOM.
  • After optimization: iterate the partition iterator directly and wrap it in an InputStream over the iterator. The InputStream reads one row at a time and maintains a counter; once the number of rows read reaches the batch size, the stream ends and the Stream Load request is submitted. During the whole iteration the source side only needs to hold a minimal amount of data, with no need to cache an entire batch, which reduces memory usage. (A minimal sketch of both ideas follows this list.)
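
To make the two changes concrete, here is a minimal Scala sketch of both ideas. It does not use the connector's actual classes: `RowBatchInputStream`, `loadPartition`, the use of plain newline-delimited `String` rows, and the omission of Stream Load headers, authentication, redirect handling, and result checking are all simplifications for illustration.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets

import org.apache.http.client.methods.HttpPut
import org.apache.http.entity.InputStreamEntity
import org.apache.http.impl.client.HttpClients

object StreamLoadSketch {

  // Hypothetical iterator-backed stream: serializes one row at a time, keeps a
  // row counter, and signals end-of-stream once `batchSize` rows have been read.
  // Only the bytes of the current row are ever buffered in memory.
  class RowBatchInputStream(rows: Iterator[String], batchSize: Int) extends InputStream {
    private var rowCount = 0
    private var buffer: Array[Byte] = Array.emptyByteArray
    private var pos = 0

    override def read(): Int = {
      if (pos >= buffer.length) {
        if (rowCount >= batchSize || !rows.hasNext) return -1 // end this batch
        buffer = (rows.next() + "\n").getBytes(StandardCharsets.UTF_8)
        pos = 0
        rowCount += 1
      }
      val b = buffer(pos) & 0xff
      pos += 1
      b
    }
  }

  // Hypothetical per-partition loop: each Stream Load request streams at most one
  // batch. Passing -1 as the entity length makes HttpClient send the body with
  // Transfer-Encoding: chunked instead of materializing it as a single String.
  def loadPartition(rows: Iterator[String], streamLoadUrl: String, batchSize: Int): Unit = {
    val client = HttpClients.createDefault()
    try {
      while (rows.hasNext) {
        val put = new HttpPut(streamLoadUrl)
        put.setEntity(new InputStreamEntity(new RowBatchInputStream(rows, batchSize), -1))
        client.execute(put).close()
      }
    } finally {
      client.close()
    }
  }
}
```

With this shape, `iterator.grouped(batchSize)` is no longer needed: the partition iterator is consumed row by row, and the batch boundary is enforced by the counter inside the stream rather than by a pre-materialized collection.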

Test result

Environment information

  • Size of a single record: about 8 KB
  • Spark resource:
    • executor instance: 1
    • executor cores: 1
    • executor memory: test variable
  • Job configuration
    • read
      • doris.batch.size: 10000
      • doris.request.tablet.size: 4
    • write
      • sink.properties.parallelism: 5
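
For reference, this is roughly how a job with the options above might be wired up. This is a hedged sketch: the `doris` data source name and option keys beyond those listed above (`doris.fenodes`, `doris.table.identifier`, and the auth options) follow the connector's documented usage, while the FE address, credentials, and table names are placeholders, not values from this PR.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("doris-write-memory-test").getOrCreate()

// Read side: options from the "read" section above.
val df = spark.read
  .format("doris")
  .option("doris.fenodes", "fe_host:8030")             // placeholder FE address
  .option("doris.table.identifier", "db.source_table") // placeholder table
  .option("doris.request.auth.user", "user")
  .option("doris.request.auth.password", "password")
  .option("doris.batch.size", "10000")
  .option("doris.request.tablet.size", "4")
  .load()

// Write side: sink.batch.size varies per test (100000 in Test 1).
df.write
  .format("doris")
  .option("doris.fenodes", "fe_host:8030")
  .option("doris.table.identifier", "db.sink_table")   // placeholder table
  .option("doris.request.auth.user", "user")
  .option("doris.request.auth.password", "password")
  .option("sink.batch.size", "100000")
  .save()
```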

Test 1

  • Spark Executor memory: 1GB
  • sink.batch.size: 100000
  • Before optimization: (memory usage screenshots)
  • After optimization: (memory usage screenshot)

Test 2

  • Spark Executor memory: 1GB
  • sink.batch.size: 200000
  • Before optimization: not performed
  • After optimization: (memory usage screenshot)

Test 3

  • Spark Executor memory: 1GB
  • sink.batch.size: 500000
  • Before optimization: not performed
  • After optimization: (memory usage screenshot)

Test 4

  • Spark Executor memory: 2GB
  • sink.batch.size: 100000
  • Before optimization: (memory usage screenshots)
  • After optimization: not performed

Test 5

  • Spark Executor memory: 4GB
  • sink.batch.size: 100000
  • Before optimization: (memory usage screenshots)
  • After optimization: not performed

Test 6

  • Spark Executor memory: 16GB
  • sink.batch.size: 100000
  • Before optimization: (memory usage screenshots)
  • After optimization: not performed

Test summary

According to the test results, when the source-side batch read size is held constant, the memory usage of the optimized connector stays relatively stable and is only slightly affected by the write batch size. The optimization also avoids the slowdown in data processing caused by high GC CPU usage when memory is insufficient.

Checklist (Required)

  1. Does it affect the original behavior: (Yes/No/I Don't know)
  2. Have unit tests been added: (Yes/No/No Need)
  3. Has documentation been added or modified: (Yes/No/No Need)
  4. Does it need to update dependencies: (Yes/No)
  5. Are there any changes that cannot be rolled back: (Yes/No)

Further comments

If this is a relatively large or complex change, kick off the discussion at [email protected] by explaining why you chose the solution you did and what alternatives you considered, etc...

@JNSimba (Member) commented Sep 8, 2023

add license header, thanks

@gnehil (Contributor, Author) commented Sep 8, 2023

> add license header, thanks

done

@JNSimba (Member) left a comment

LGTM

@JNSimba merged commit 0daf6c4 into apache:master on Sep 13, 2023
3 checks passed