The current TextSource implementation spends a lot of time copying byte[]s.
Hadoop's LineReader.java is significantly faster (~2x) on typical files because its implementation reduces how many byte[]s are copied. In a simple benchmark reading 10 million lines (60-120 characters each), the Hadoop LineReader takes about 2.05 seconds to process the file, while the Apache Beam TextSource takes about 4.03 seconds.
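To illustrate the kind of copy reduction described above, here is a minimal sketch. It is not the Hadoop LineReader or the Beam TextSource code. It assumes `\n`-only delimiters and UTF-8 input, and the class and method names (`LineSplitSketch`, `readLines`) are hypothetical. The idea is to scan the read buffer in place and decode a line straight from it when the whole line is already there, falling back to an accumulating buffer only when a line straddles two reads.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the actual TextSource or LineReader code.
class LineSplitSketch {

  /** Splits the stream on '\n', copying line bytes as few times as possible. */
  static List<String> readLines(InputStream in, int bufferSize) throws IOException {
    List<String> lines = new ArrayList<>();
    byte[] buffer = new byte[bufferSize];
    // Holds the start of a line only when that line spans more than one read.
    ByteArrayOutputStream partial = new ByteArrayOutputStream();
    int read;
    while ((read = in.read(buffer)) != -1) {
      int lineStart = 0;
      for (int i = 0; i < read; i++) {
        if (buffer[i] != '\n') {
          continue;
        }
        if (partial.size() == 0) {
          // Fast path: the whole line is in the read buffer, so decode it
          // directly without first copying it into an intermediate byte[].
          lines.add(new String(buffer, lineStart, i - lineStart, StandardCharsets.UTF_8));
        } else {
          // Slow path: finish a line that began in an earlier read.
          partial.write(buffer, lineStart, i - lineStart);
          lines.add(partial.toString("UTF-8"));
          partial.reset();
        }
        lineStart = i + 1;
      }
      if (lineStart < read) {
        // Keep the unterminated tail for the next read.
        partial.write(buffer, lineStart, read - lineStart);
      }
    }
    if (partial.size() > 0) {
      // Final line without a trailing newline.
      lines.add(partial.toString("UTF-8"));
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = "alpha\nbeta\n\ngamma".getBytes(StandardCharsets.UTF_8);
    // A deliberately tiny buffer forces lines to straddle reads.
    System.out.println(readLines(new ByteArrayInputStream(data), 4));
  }
}
```

With a realistically sized buffer (for example 64 KiB, also an assumption), almost every 60-120 character line takes the fast path, so the per-line cost is a single decode from the shared buffer.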
Issue Priority
Priority: 2
Issue Component
Component: io-java-text
…e copied (fixes apache#23193)
This makes TextSource take about 2.3x less CPU resources during decoding.
Before this change:
```
TextSourceBenchmark.benchmarkTextSource thrpt 5 0.248 ± 0.029 ops/s
```
After this change:
```
TextSourceBenchmark.benchmarkHadoopLineReader thrpt 5 0.465 ± 0.064 ops/s
TextSourceBenchmark.benchmarkTextSource thrpt 5 0.575 ± 0.059 ops/s
```
…e copied (fixes #23193) (#23196)
* Improve the performance of TextSource by reducing how many byte[]s are copied (fixes #23193)
* Write file in pieces instead of pre-allocating entire buffer
* Address PR comments