Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request]: Reduce number of byte[] copies in TextSource #23193

Closed
lukecwik opened this issue Sep 12, 2022 · 1 comment · Fixed by #23196
Closed

[Feature Request]: Reduce number of byte[] copies in TextSource #23193

lukecwik opened this issue Sep 12, 2022 · 1 comment · Fixed by #23196

Comments

@lukecwik
Copy link
Member

lukecwik commented Sep 12, 2022

What would you like to happen?

The current TextSource implementation is spending a lot of time during byte[] copying:
TextSource old implementation performance in pipeline

Hadoop LineReader.java implementation is signficantly faster (~2x) when handling typical files due to an implementation that reduces how many byte[]s are copied. A simple benchmark reading 10 million lines (60-120 characters long) shows that it takes about ~2.05 seconds to process such a file while the Apache Beam TextSource takes ~4.03 seconds.

Issue Priority

Priority: 2

Issue Component

Component: io-java-text

@lukecwik
Copy link
Member Author

CC: @bhisevishal

lukecwik added a commit to lukecwik/incubator-beam that referenced this issue Sep 12, 2022
…e copied (fixes apache#23193)

This makes TextSource take about 2.3x less CPU resources during decoding.

Before this change:
```
TextSourceBenchmark.benchmarkTextSource        thrpt    5  0.248 ± 0.029  ops/s
```

After this change:
```
TextSourceBenchmark.benchmarkHadoopLineReader  thrpt    5  0.465 ± 0.064  ops/s
TextSourceBenchmark.benchmarkTextSource        thrpt    5  0.575 ± 0.059  ops/s
```
lukecwik added a commit that referenced this issue Sep 15, 2022
…e copied (fixes #23193) (#23196)

* Improve the performance of TextSource by reducing how many byte[]s are copied (fixes #23193)

This makes TextSource take about 2.3x less CPU resources during decoding.

Before this change:
```
TextSourceBenchmark.benchmarkTextSource        thrpt    5  0.248 ± 0.029  ops/s
```

After this change:
```
TextSourceBenchmark.benchmarkHadoopLineReader  thrpt    5  0.465 ± 0.064  ops/s
TextSourceBenchmark.benchmarkTextSource        thrpt    5  0.575 ± 0.059  ops/s
```

* Write file in pieces instead of pre-allocating entire buffer

* Address PR comments
johnjcasey added a commit to johnjcasey/beam that referenced this issue Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant