Datalake inputstream #21322
Conversation
Added tests; removed unnecessary openInputStream overload
Fixed docstrings and parameters that used "blob" instead of "file" terminology. Fixed a docstring link that referenced cut blob-only functionality.
DataLakeFileInputStream now uses logger.logThrowableAsError. Header and docstring fixes.
…into datalake-inputstream
Added a suppression to deal with a Checkstyle bug; minor fixes
…into datalake-inputstream
StorageInputStream now has an implementation of dispatchRead() and only delegates out the implementation of the client read operation itself.
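The delegation described above can be sketched as a template method. This is a hedged sketch with simplified stand-in types (not the actual azure-storage-common classes): the base class owns dispatchRead() once, and concrete streams supply only the client read operation.

```java
import java.nio.ByteBuffer;

abstract class StorageInputStreamSketch {
    // The only piece a concrete stream must provide: the raw service read.
    protected abstract ByteBuffer executeRead(long offset, int length);

    // Shared logic lives here once, instead of being duplicated per subclass.
    protected final ByteBuffer dispatchRead(long offset, int length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be positive");
        }
        return executeRead(offset, length);
    }
}

class DataLakeReadSketch extends StorageInputStreamSketch {
    private final byte[] serviceData; // stand-in for the remote file contents

    DataLakeReadSketch(byte[] serviceData) {
        this.serviceData = serviceData;
    }

    @Override
    protected ByteBuffer executeRead(long offset, int length) {
        // Stand-in for the real fileClient.readWithResponse(...) call.
        return ByteBuffer.wrap(serviceData, (int) offset, length);
    }
}
```

This keeps range validation and buffering policy in one place while each storage client only implements its own wire call.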
 *
 * @return {@link PathProperties}
 */
public PathProperties getProperties() {
Does this need to be part of the public interface?
Yes we're kinda locked into this based on the equivalent API in blobs.
 * {@link #NONE}
 * {@link #ETAG}
 */
public enum ConsistentReadControl {
Should this consistency control graduate into azure-storage-common, since it should be re-usable in Blobs, and maybe Files if we choose to add open read/write functionality there as well?
The issue is that the blob package already has their own copy of this. So if we put one in common then there will be confusion in the blobs package. Something we'd need to have caught before that API GA'd.
Datalake takes a dependency on blobs directly? Don't we end up having both types on the classpath anyway?
I may be wrong but I think we go out of our way to avoid using blob types in the datalake public API.
…into datalake-inputstream
ByteBuffer currentBuffer = this.fileClient.readWithResponse(
    new FileRange(offset, (long) readLength), null, this.accessCondition, false)
    .flatMap(response -> FluxUtil.collectBytesInByteBufferStream(response.getValue()).map(ByteBuffer::wrap))
    .block();
The fileClient delegates reads to the blockblob client internally.
Maybe we should delegate opening a stream to the blockblob client as well and just make this class an adapter, i.e. keep a reference to the stream from the block blob and just proxy calls there? If we were returning a plain InputStream from the API we could just return the stream from the blob client, but since we expose some extra properties an adapter would be needed.
I'd consider this - writing an adapter is easier than maintaining two versions of logic that works on bytebuffers and offsets.
This is what dotnet does https://github.com/Azure/azure-sdk-for-net/blob/3f38e290bfc1b1579baa4abf329a3861355796f1/sdk/storage/Azure.Storage.Files.DataLake/src/DataLakeFileClient.cs#L3869-L3873
Or maybe we don't look at blobs and just return InputStream ?
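The adapter idea above could look roughly like this. A hedged sketch with a plain InputStream standing in for BlockBlobInputStream (not the real SDK type): the adapter holds a reference to the wrapped stream and proxies every call to it, so no byte-buffer/offset logic is duplicated.

```java
import java.io.IOException;
import java.io.InputStream;

class DataLakeStreamAdapterSketch extends InputStream {
    private final InputStream blockBlobStream; // the delegate stream

    DataLakeStreamAdapterSketch(InputStream blockBlobStream) {
        this.blockBlobStream = blockBlobStream;
    }

    @Override
    public int read() throws IOException {
        return blockBlobStream.read(); // pure proxying
    }

    @Override
    public void close() throws IOException {
        blockBlobStream.close();
    }
}
```

Extra datalake-specific properties would then be exposed as additional accessors on the adapter rather than reimplemented read logic.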
So most of the decisions were made with the intent to mirror the blob API shape. It's been expressed multiple times throughout this PR that the blob API for this is undesirable. If we make the decision to break from that design, we can return a class that holds a plain InputStream and the datalake properties separately, then use the block blob inputstream as the base implementation to wrap. Is this acceptable to people? @kasobol-msft @alzimmermsft @gapra-msft @rickle-msft
Example of what would be returned:
public class DataLakeFileInputStream {
public InputStream getInputStream(); // literally returns a BlockBlobInputStream instance
public PathProperties getProperties();
}
I like that idea. This should be called DataLakeFileInputStreamResult (or whatever matches the Result pattern).
openInputStream returns a result class containing InputStream and PathProperties members. The returned InputStream is a BlobInputStream instance. Ported the blob InputStream test class.
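The result-class shape described above might look like the following. A hedged sketch with simplified stand-in types (the real API pairs PathProperties with a BlobInputStream):

```java
import java.io.InputStream;

class PathPropertiesSketch {
    private final long fileSize;

    PathPropertiesSketch(long fileSize) {
        this.fileSize = fileSize;
    }

    long getFileSize() {
        return fileSize;
    }
}

class OpenInputStreamResultSketch {
    private final InputStream inputStream;        // a BlobInputStream in the PR
    private final PathPropertiesSketch properties;

    OpenInputStreamResultSketch(InputStream inputStream, PathPropertiesSketch properties) {
        this.inputStream = inputStream;
        this.properties = properties;
    }

    InputStream getInputStream() {
        return inputStream;
    }

    PathPropertiesSketch getProperties() {
        return properties;
    }
}
```

Callers would typically close the inner stream with try-with-resources, e.g. `try (InputStream in = result.getInputStream()) { ... }`.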
…to datalake-inputstream
/**
 * Result of opening an {@link InputStream} to a datalake file.
 */
public interface FileOpenInputStreamResult {
Should this be Closeable, since it wraps a closeable stream?
I'm not sure what we gain from this since they have to get the InputStream anyway. I guess someone who only got the properties would benefit, but they might be writing some weird code if that's where they find themselves. Most people working with streams are probably used to the stream being closable and adding a second hook to close the same thing may be confusing.
We do that in dotnet, see https://github.com/Azure/azure-sdk-for-net/blob/main/sdk/storage/Azure.Storage.Blobs/src/Models/BlobDownloadStreamingResult.cs, i.e. a model holding a reference to a Disposable type should be Disposable itself.
@alzimmermsft is there a guideline around models that have a Closeable property inside?
As far as I know there is no Azure SDK for Java guideline about implementing Closeable/AutoCloseable when underlying properties implement those interfaces. That being said, if the expectation is for this type to be used alongside other stream types, I'd implement AutoCloseable to allow for try-with-resources functionality.
cc: @srnagar
I'll add Closeable for now and leave it to the final API review to catch this if it's undesirable.
Closing a stream more than once is not always idempotent - https://bugs.openjdk.java.net/browse/JDK-8054565
Users are more likely to use try-with-resources when they call the get method - try (InputStream stream = result.getInputStream()) { }.
I don't expect users to use try-with-resources on a type named *Result. So, I am okay with this type not implementing Closeable/AutoCloseable and documenting that the stream returned by getInputStream() should be closed, precisely because not all streams have an idempotent close() method.
The only implementation I see for FileOpenInputStreamResult is the InternalFileOpenInputStreamResult, which overrides the close() method and is safe to call twice, but we may add more implementations later.
@srnagar What if they never call getInputStream()? With the current implementation the stream sitting inside the wrapper is "opened". I.e. if somebody writes code
{
    FileOpenInputStreamResult result = client.openInputStream();
    System.out.println(result.getProperties().getSomeProperty());
}
That snippet will leave an "opened" stream in memory until GC collects it, and there's no indication that the user should close the stream sitting inside.
The idempotency issue you mention is classified as a bug - so under normal circumstances should we expect close to be idempotent?
Making FileOpenInputStreamResult implement Closeable is still not going to fix the issue if the code is written as above. The name *Result doesn't clearly indicate that this is a stream that needs closing.
Couple of options here:
- Rename FileOpenInputStreamResult to FileOpenInputStream and implement AutoCloseable
- Rename FileOpenInputStreamResult to FileOpenInputStream and extend BlobInputStream
Also, just curious why this type is an interface? Do we expect more implementations to be added later?
@srnagar I understand that making it Closeable won't make the code snippet work, but it will make people who pay attention, or who use static analyzers, notice that they should add try-with-resources - without Closeable they won't get that hint.
I'm not a big fan of either alternative. This type isn't a stream, and we don't want to mix datalake and blob models. Nor do we want to create a stream derivative just to smuggle in a few extra properties.
If we're to make a trade-off here then I'd say we remove Closeable, because the stream implementation we return has a no-op close anyway and it's highly unlikely that's going to change. If that changes, we can make getInputStream lazy-load or create a new stream each time it's called.
The reason this type is an interface is that we want to hide the concrete implementation inside an "internal" package. With ever-expanding models it's easier to maintain existing types if they're not exposed publicly. This is a lesson learnt from here - we don't want to add a new ctor overload every time we add a property to the immutable type.
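The lazy-load alternative mentioned above could be sketched as follows (hypothetical simplified types, not the real SDK API): the stream is only opened on the first getInputStream() call, so callers who read just the properties never leave an open stream behind.

```java
import java.io.InputStream;
import java.util.function.Supplier;

class LazyOpenResultSketch {
    private final Supplier<InputStream> opener; // stand-in for the client's open call
    private InputStream inputStream;            // created on first access

    LazyOpenResultSketch(Supplier<InputStream> opener) {
        this.opener = opener;
    }

    synchronized InputStream getInputStream() {
        if (inputStream == null) {
            inputStream = opener.get(); // open lazily, exactly once
        }
        return inputStream;
    }
}
```

With this shape, the no-close-needed guarantee for properties-only callers holds regardless of how the underlying stream's close() behaves.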
new API changes look good
Added openInputStream() to DataLakeFileClient