Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Override default implementation of skip(long) method #18977

Merged

Conversation

findinpath
Copy link
Contributor

@findinpath findinpath commented Sep 8, 2023

Description

Delegate to stream the implementation of skip(long) method. By doing this, it is ensured that not the default implementation from java.io.InputStream is being used, because it calls read() causing actual reads from the underlying input stream although skip is just a logical operation.

Additional context and related issues

Fixes #18976

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive
* Avoid reading unrelated data to the text line reader splits. ({issue}`issuenumber`)

@findinpath findinpath force-pushed the findinpath/hdfs-trino-input-stream-skip branch from 4dbfc5c to 2473ea3 Compare September 8, 2023 20:39
@findinpath findinpath force-pushed the findinpath/hdfs-trino-input-stream-skip branch from 2473ea3 to 822102c Compare September 8, 2023 21:11
@findinpath findinpath force-pushed the findinpath/hdfs-trino-input-stream-skip branch 2 times, most recently from cc8fb2d to 8db9715 Compare September 8, 2023 21:41
@electrum
Copy link
Member

electrum commented Sep 8, 2023

Where are we using skip() rather than seek() for large sizes? The new S3 implementation (and possibly Azure) will also need to implement this efficiently.

@electrum
Copy link
Member

electrum commented Sep 8, 2023

Changes requested for the S3InputStream change.

To answer my previous question: I found that TextLineReader and SequenceFileReader use skipNBytes(), which we implement for S3InputStream and AzureInputStream. For HdfsTrinoInputStream, it uses the default implementation, which calls skip(), hence why we need to make the change here to delegate correctly.

@findinpath
Copy link
Contributor Author

I found that TextLineReader and SequenceFileReader use skipNBytes()

See here the hierarchy of calls

@findinpath findinpath force-pushed the findinpath/hdfs-trino-input-stream-skip branch from 8db9715 to 3862260 Compare September 9, 2023 04:53
@findinpath findinpath force-pushed the findinpath/hdfs-trino-input-stream-skip branch from 3862260 to 87ddc4e Compare September 9, 2023 04:57
Delegate to `stream` the implementation of `skip(long)` method.
By doing this, it is ensured that not the default implementation
from `java.io.InputStream` is being used, because it calls
`read()` causing actual reads from the underlying input stream
although `skip` is just a logical operation.

Co-authored-by: James Petty <[email protected]>
@findinpath findinpath force-pushed the findinpath/hdfs-trino-input-stream-skip branch from 87ddc4e to 3af6109 Compare September 11, 2023 08:31
@findinpath
Copy link
Contributor Author

Removed the changes in S3InputStream because they were unrelated to this change.
Thank you @electrum for raising this aspect.

@findepi
Copy link
Member

findepi commented Sep 11, 2023

/test-with-secrets sha=3af6109e7dc6a9f286f84c6c84b66bcf342c4967

@github-actions
Copy link

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/6145626334

@electrum electrum merged commit d06a3e0 into trinodb:master Sep 11, 2023
59 checks passed
@github-actions github-actions bot added this to the 427 milestone Sep 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

Reading with native CSV reader on Hive is much slower on AWS S3
4 participants