
Datalake inputstream #21322

Merged (28 commits) on Jun 29, 2021
Conversation

jaschrep-msft (Member):

Added openInputStream() to DataLakeFileClient

@ghost added the Storage (Storage Service: Queues, Blobs, Files) label on May 12, 2021
Fixed docstrings and parameters that used "blob" instead of "file"
terminology.
Fixed a docstring link that referenced cut blob-only functionality.
DataLakeFileInputStream now uses logger.logThrowableAsError.
Header and docstring fixes.
@check-enforcer

This pull request is protected by Check Enforcer.

What is Check Enforcer?

Check Enforcer helps ensure all pull requests are covered by at least one check-run (typically an Azure Pipeline). When all check-runs associated with this pull request pass then Check Enforcer itself will pass.

Why am I getting this message?

You are getting this message because Check Enforcer did not detect any check-runs associated with this pull request within five minutes. This may indicate that your pull request is not covered by any pipelines, in which case Check Enforcer is correctly blocking the pull request from being merged.

What should I do now?

If the check-enforcer check-run is not passing and all other check-runs associated with this PR are passing (excluding license-cla) then you could try telling Check Enforcer to evaluate your pull request again. You can do this by adding a comment to this pull request as follows:
/check-enforcer evaluate
Typically evaluation only takes a few seconds. If you know that your pull request is not covered by a pipeline and this is expected, you can override Check Enforcer using the following command:
/check-enforcer override
Note that using the override command triggers alerts so that follow-up investigations can occur (PRs still need to be approved as normal).

What if I am onboarding a new service?

Often, new services do not have validation pipelines associated with them. To bootstrap pipelines for a new service, you can issue the following command as a pull request comment:
/azp run prepare-pipelines
This will run a pipeline that analyzes the source tree and creates the pipelines necessary to build and validate your pull request. Once the pipeline has been created you can trigger the pipeline using the following comment:
/azp run java - [service] - ci

added a suppression to deal with a checkstyle bug
minor fixes
StorageInputStream now has an implementation of dispatchRead() and only
delegates out the implementation of the client read operation itself.
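The commit message above describes a template-method style refactor. A minimal sketch of that shape, with illustrative names rather than the actual azure-storage-common API:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: the base class owns dispatchRead() and subclasses
// supply only the raw client read operation. Names are illustrative,
// not the real azure-storage-common types.
abstract class StorageInputStreamSketch {
    long lastReadOffset = -1; // shared bookkeeping lives in the base class

    ByteBuffer dispatchRead(int readLength, long offset) {
        ByteBuffer buffer = executeRead(readLength, offset);
        lastReadOffset = offset; // position/buffer tracking stays here
        return buffer;
    }

    // Subclasses (blob, datalake) implement only the service call itself.
    abstract ByteBuffer executeRead(int readLength, long offset);
}

class DataLakeReadSketch extends StorageInputStreamSketch {
    private final byte[] content; // stands in for the service-side file

    DataLakeReadSketch(byte[] content) {
        this.content = content;
    }

    @Override
    ByteBuffer executeRead(int readLength, long offset) {
        // The real implementation would issue
        // fileClient.readWithResponse(new FileRange(offset, readLength), ...).
        int end = Math.min(content.length, (int) offset + readLength);
        return ByteBuffer.wrap(content, (int) offset, end - (int) offset);
    }
}
```

With this split, each client only reimplements the single network call instead of the whole buffered-read loop.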
sdk/storage/azure-storage-file-datalake/CHANGELOG.md
*
* @return {@link PathProperties}
*/
public PathProperties getProperties() {
Member:

Does this need to be part of the public interface?

Member Author:

Yes we're kinda locked into this based on the equivalent API in blobs.

* {@link #NONE}
* {@link #ETAG}
*/
public enum ConsistentReadControl {
Member:

Should this consistency control graduate into azure-storage-common? It should be re-usable in Blobs, and maybe Files if we choose to add open read/write functionality there as well.

Member Author:

The issue is that the blob package already has their own copy of this. So if we put one in common then there will be confusion in the blobs package. Something we'd need to have caught before that API GA'd.

Contributor:

Datalake takes a dependency on blobs directly? Don't we end up having both types on the class path anyway?

Member Author:

I may be wrong but I think we go out of our way to avoid using blob types in the datalake public API.
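The thread above concerns where ConsistentReadControl should live; for context on what the ETAG option guards against, here is a purely illustrative sketch (FakeFile and EtagPinnedReader are stand-ins, not SDK types): the reader pins the ETag observed when the stream is opened, so a concurrent overwrite surfaces as an error rather than as mixed old/new bytes.

```java
// Illustrative only: models a service object whose ETag changes on every write.
class FakeFile {
    byte[] content;
    String etag;

    FakeFile(byte[] content, String etag) {
        this.content = content;
        this.etag = etag;
    }

    void overwrite(byte[] newContent, String newEtag) {
        this.content = newContent;
        this.etag = newEtag;
    }
}

class EtagPinnedReader {
    private final String pinnedEtag; // ETAG mode: lock onto the ETag seen at open time

    EtagPinnedReader(FakeFile file) {
        this.pinnedEtag = file.etag;
    }

    byte[] read(FakeFile file, int offset, int length) {
        // Mirrors sending If-Match on each range read: a concurrent overwrite
        // yields a precondition failure instead of silently mixed bytes.
        if (!pinnedEtag.equals(file.etag)) {
            throw new IllegalStateException("412 Precondition Failed: file changed mid-read");
        }
        byte[] out = new byte[length];
        System.arraycopy(file.content, offset, out, 0, length);
        return out;
    }
}
```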

Comment on lines 67 to 70
ByteBuffer currentBuffer = this.fileClient.readWithResponse(
new FileRange(offset, (long) readLength), null, this.accessCondition, false)
.flatMap(response -> FluxUtil.collectBytesInByteBufferStream(response.getValue()).map(ByteBuffer::wrap))
.block();
Contributor:

The fileClient delegates reads to the block blob client internally.
Maybe we should delegate opening a stream to the block blob client as well and just make this class an adapter, i.e. keep a reference to the stream from the block blob and proxy calls there. If we were returning a plain InputStream from the API we could just return the stream from the blob client, but since we expose some extra properties, an adapter would be needed.
I'd consider this - writing an adapter is easier than maintaining two versions of logic that work on bytebuffers and offsets.

This is what dotnet does https://github.com/Azure/azure-sdk-for-net/blob/3f38e290bfc1b1579baa4abf329a3861355796f1/sdk/storage/Azure.Storage.Files.DataLake/src/DataLakeFileClient.cs#L3869-L3873

Or maybe we don't look at blobs and just return a plain InputStream?
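The adapter the commenter describes might look roughly like this sketch (names are hypothetical; the real SDK types differ):

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical adapter: the datalake stream holds the blob stream and proxies
// every call to it, so the bytebuffer/offset logic lives in exactly one place.
class DataLakeFileInputStreamAdapter extends InputStream {
    private final InputStream blobStream; // a BlockBlobInputStream in the real SDK

    DataLakeFileInputStreamAdapter(InputStream blobStream) {
        this.blobStream = blobStream;
    }

    @Override
    public int read() throws IOException {
        return blobStream.read(); // pure delegation
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return blobStream.read(b, off, len);
    }

    @Override
    public void close() throws IOException {
        blobStream.close();
    }
}
```

Any datalake-specific surface (e.g. PathProperties) would hang off the adapter without touching the read path.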

Member Author:

So most of the decisions were made with the intent to mirror the blob API shape. It's been expressed multiple times throughout this PR that the blob API for this is undesirable. If we make the decision to break from that design, we can return a class that holds a plain InputStream and the datalake properties separately, then use the block blob inputstream as the base implementation to wrap. Is this acceptable to people? @kasobol-msft @alzimmermsft @gapra-msft @rickle-msft

Member Author:

Example of what would be returned:

public class DataLakeFileInputStream {
    public InputStream getInputStream(); // literally returns a BlockBlobInputStream instance
    public PathProperties getProperties();
}

Contributor:

I like that idea. This should be called DataLakeFileInputStreamResult (or whatever matches Result pattern).

openInputStream returns a result class containing InputStream and
PathProperties members.
The returned InputStream is a BlobInputStream instance.
Ported blob inputStream test class
/**
* Result of opening an {@link InputStream} to a datalake file.
*/
public interface FileOpenInputStreamResult {
Contributor:

should this be Closeable since it wraps closeable stream?

Member Author:

I'm not sure what we gain from this since they have to get the InputStream anyway. I guess someone who only got the properties would benefit, but they might be writing some weird code if that's where they find themselves. Most people working with streams are probably used to the stream being closable and adding a second hook to close the same thing may be confusing.

Contributor:

We do that in dotnet, see https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/storage/Azure.Storage.Blobs/src/Models/BlobDownloadStreamingResult.cs , i.e. a model holding reference to Disposable type should be Disposable itself.

@alzimmermsft is there a guideline around models that have a Closeable property inside?

Member:

As far as I know there is no Azure SDK for Java guideline about implementing Closeable/AutoCloseable when underlying properties implement those interfaces. That being said, if the expectation is for this type to be used alongside other stream types, I'd implement AutoCloseable to allow for try-with-resources functionality.

cc: @srnagar
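A minimal sketch of the try-with-resources behavior being suggested, assuming a hypothetical result type (not the shipped class):

```java
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: a result type implementing AutoCloseable can sit in a
// try-with-resources block, so the wrapped stream is released even when the
// caller never retrieves it.
class FileOpenInputStreamResultSketch implements AutoCloseable {
    private final InputStream inputStream;
    boolean closed = false; // exposed for demonstration

    FileOpenInputStreamResultSketch(InputStream inputStream) {
        this.inputStream = inputStream;
    }

    InputStream getInputStream() {
        return inputStream;
    }

    @Override
    public void close() throws IOException {
        inputStream.close();
        closed = true;
    }
}
```

Usage would be `try (FileOpenInputStreamResultSketch r = open(...)) { ... }`, which closes the inner stream even if only the properties were read inside the block.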

Member Author:

I'll add Closeable for now and leave it to the final API review to catch this if it's undesirable.

Member:

Closing a stream more than once is not always idempotent - https://bugs.openjdk.java.net/browse/JDK-8054565

Users are more likely to use try-with-resources when they call the get method - try (InputStream stream = result.getInputStream()) { }.

I don't expect users to use try-with-resources on a type named *Result. So I am okay with this type not implementing Closeable/AutoCloseable and instead documenting that the stream returned by getInputStream() should be closed, precisely because not all streams have an idempotent close() method.

The only implementation I see for FileOpenInputStreamResult is InternalFileOpenInputStreamResult, which overrides the close() method and is safe to call twice, but we may add more implementations later.

Contributor:

@srnagar What if they never call getInputStream()? With the current implementation the stream sitting inside the wrapper is "opened". I.e. if somebody writes code

{
   FileOpenInputStreamResult result = client.openInputStream();
   System.out.println(result.getProperties().getSomeProperty());
}

That snippet will leave an "opened" stream in memory until the GC collects it, and there's no indication that the user should close the stream sitting inside.
The idempotency issue you mention is classified as a bug - so under normal circumstances, should we expect close to be idempotent?

Member:

Making FileOpenInputStreamResult implement Closeable is still not going to fix the issue if the code is written as above.

The name *Result doesn't clearly indicate that this is a stream that needs closing.

Couple of options here:

  • Rename FileOpenInputStreamResult to FileOpenInputStream and implement AutoCloseable
  • Rename FileOpenInputStreamResult to FileOpenInputStream and extend BlobInputStream

Also, just curious - why is this type an interface? Do we expect more implementations to be added later?

Contributor:

@srnagar I understand that making it Closeable won't make that code snippet work, but it will help people who pay attention, or who use static analyzers, notice that they should add try-with-resources - without Closeable they won't get that hint.

I'm not a big fan of either alternative. This type isn't a stream and we don't want to mix datalake and blob models. Nor do we want to create a stream derivative just to smuggle a few extra properties.

If we're to make a trade-off here then I'd say we drop Closeable, because the stream implementation we return has a no-op close anyway and that's highly unlikely to change. If it does change, we can make getInputStream lazy-load or create a new stream each time it's called.

The reason this type is an interface is that we want to hide the concrete implementation inside an "internal" package. With ever-expanding models it's easier to maintain existing types if they're not exposed publicly. This is a lesson learnt from here - we don't want to add a new ctor overload every time we add a property to an immutable type.
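The interface-plus-internal-implementation pattern described above can be sketched as follows (names are illustrative, not the actual SDK types):

```java
import java.io.InputStream;

// Sketch only. In the real SDK the interface is the public surface and the
// implementation sits in an *.implementation package, so adding a property
// later means adding a getter and changing only the hidden class - no new
// public constructor overloads.
interface OpenInputStreamResultSketch {
    InputStream getInputStream();
}

final class InternalOpenInputStreamResult implements OpenInputStreamResultSketch {
    private final InputStream stream;

    InternalOpenInputStreamResult(InputStream stream) {
        this.stream = stream;
    }

    @Override
    public InputStream getInputStream() {
        return stream;
    }
}
```

Callers only ever see the interface, so the constructor of the internal class is free to grow as properties are added.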

Member (@gapra-msft):

new API changes look good
