
StorageClient.UploadObject is difficult to use effectively #9250

Closed
masonwheeler opened this issue Oct 27, 2022 · 13 comments · Fixed by #10122
Assignees
Labels
api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@masonwheeler

Describe the solution you'd like
The documentation claims that StorageClient.UploadObject[Async] is useful for situations where "You want to upload data but don't know the final size at the start of the upload, such as when generating the upload data from a process, or when compressing an object on-the-fly." But the API it provides is not suited to either scenario. Its working is the polar opposite of the standard .NET idiom for generating and storing data: the API gives you a writeable Stream, and you write the data to it.

Instead, StorageClient.UploadObject[Async] asks you for a readable Stream, and then uploads the data from it. The example given in the linked doc page even involves using FileStream to upload an existing file from the file system, which is the precise opposite of generating data in-process where you don't know the final size at the start of the upload. The scenario of "compressing an object on-the-fly" that the documentation claims to support does not work at all here; compressors such as GZipStream or BrotliStream require a writeable stream for their output.
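
For illustration, the standard idiom looks like this: a compressor wraps a writable destination stream and you push data into it, leaving no readable stream to hand to UploadObject (a minimal sketch; the file name and payload are placeholders):

    using System.IO;
    using System.IO.Compression;
    using System.Text;

    byte[] generated = Encoding.UTF8.GetBytes("data produced on the fly");

    // GZipStream wants a *writable* output stream; you write into it.
    await using var destination = File.Create("output.gz");
    await using var gzip = new GZipStream(destination, CompressionMode.Compress);
    await gzip.WriteAsync(generated);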

While this API can certainly be useful for pre-existing data, it is clearly not fit for its advertised purpose. A better API would look like this:

        public (Stream uploader, Task<Object> receipt) BeginUploadObject(
            string bucket,
            string objectName,
            string contentType,
            UploadObjectOptions options = null)

(plus a comparable Async variant)

This would return a writable Stream to write data to. The API would know that data production is complete when Close or Dispose is called on the stream. Generation of the Object metadata that serves as a receipt for your object storage would be the responsibility of an async Task, even for the synchronous API.
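
A minimal sketch of how the proposed API might be used (hypothetical; neither BeginUploadObject nor this usage exists in the library, and usings are as in the sketch above):

    // "client" is a StorageClient; names and payload are placeholders.
    var (uploader, receipt) = client.BeginUploadObject(
        "my-bucket", "generated.gz", "application/gzip");
    await using (var gzip = new GZipStream(uploader, CompressionMode.Compress))
    {
        await gzip.WriteAsync(generated);
    }
    // Disposing the compressor closes the underlying stream, signalling
    // that data production is complete; the receipt task then finishes.
    var storageObject = await receipt;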

Describe alternatives you've considered
Right now, the only way to upload generated data without buffering it either in memory or in a temp file is to come up with some sort of hack to adapt a writable Stream to a readable Stream. System.IO.Pipelines.Pipe seems to work (see the sketch below), but it imposes the additional overhead of moving the uploading code onto a thread pool thread. It would be nice to instead have an API that's actually designed to fit the advertised workflow.
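
A minimal sketch of that Pipe-based workaround, assuming the Google.Cloud.Storage.V1 and System.IO.Pipelines packages are referenced (bucket and object names are placeholders):

    using System.IO.Compression;
    using System.IO.Pipelines;
    using System.Text;
    using System.Threading.Tasks;
    using Google.Cloud.Storage.V1;

    var client = await StorageClient.CreateAsync();
    var pipe = new Pipe();

    // Start the upload on a thread pool thread, reading from the pipe.
    var uploadTask = Task.Run(() => client.UploadObjectAsync(
        "my-bucket", "generated.gz", "application/gzip", pipe.Reader.AsStream()));

    // Meanwhile, generate (here, compress) data into the writable side.
    await using (var gzip = new GZipStream(pipe.Writer.AsStream(), CompressionMode.Compress))
    {
        await gzip.WriteAsync(Encoding.UTF8.GetBytes("data produced on the fly"));
    }
    await pipe.Writer.CompleteAsync(); // signal end of data

    var storageObject = await uploadTask;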

@shivgautam

shivgautam commented Oct 28, 2022

@anshuldavid13 - Can you look into this?

@jskeet jskeet added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: storage Issues related to the Cloud Storage API. labels Oct 28, 2022
@jskeet
Collaborator

jskeet commented Oct 28, 2022

First off, @masonwheeler, thanks for raising this - and I do completely agree that the library API surface is clunky here. Likewise it would be nice if DownloadObject could provide a stream to read from. I'm not sure that we'd want to expose a ValueTuple-based return type, but other than that your proposed API looks about right to me.

Unfortunately, implementing that API presents significant challenges. The existing implementation is based on ResumableUpload from the Google.Apis package - which only supports the current mode, and is a hairy piece of code to work with. (It's worth understanding that Google.Apis is effectively support for HTTP/1.1+JSON API calls. It's in maintenance mode - we're actively maintaining it, but not putting effort into adding features unless we absolutely need to. The newer generation of gRPC-based libraries is where most of our effort goes at the moment.)

There are three options we could use to move forward:

  • Continue to use ResumableUpload but use it with the kind of buffer you describe at the end. (With the overhead you describe as well. Note that there's likely to be some memory overhead, as in order to provide retry we'd need to keep the data locally until we've made a successful request.)
  • Add the functionality to ResumableUpload: this would be really complex, unfortunately.
  • Create an entirely separate path that doesn't use ResumableUpload at all - but that needs to implement a lot of the same functionality.

None of these is appealing, as you can imagine.

We'll need to discuss internally how important it is to fix this, estimate timelines, and see how it fits in with any other future plans. I don't want to make any promises that we'll ever address this - but I definitely acknowledge that the issue is a real one.

@masonwheeler
Author

@jskeet

I'm not sure that we'd want to expose a ValueTuple-based return type, but other than that your proposed API looks about right to me.

Fair enough. The other alternative is to use an out parameter, but that gets awkward when you also have parameters with default values.

It's worth understanding that Google.Apis is effectively support for HTTP/1.1+JSON API calls. ... The newer generation of gRPC-based libraries is where most of our effort goes at the moment.

Huh. I'm actually a bit surprised to hear that, well over a decade after they were introduced, Google hasn't provided a WebSocket-based upload API. You'd think that would be the obvious implementation for this particular use case.

None of these is appealing, as you can imagine.

True. Well, of those three, I have to ask: does the buffer you describe as necessary for making the upload resumable already exist? (Or does your implementation depend on Stream.Seek and explode when it's not available? Because I think that's the only other alternative, and that is not documented as being the case.) If the necessary buffer is already there, option 1 easily looks like the lesser of three evils.

@jskeet
Collaborator

jskeet commented Oct 28, 2022

On a quick investigation, it looks like we always buffer - I had thought we might only buffer if Stream.CanSeek returns false, but it looks like that's not currently the case. (We definitely don't explode when seeking isn't available.)

So option 1 would be no worse than the current code in terms of overhead - it just wouldn't reduce the overhead either.

(In terms of the API, there are various options here; we'll think about that after deciding whether or not to actually go ahead at all.)

@masonwheeler
Author

We definitely don't explode when seeking isn't available.

Well congrats on that; that puts you a step ahead of your AWS S3 counterpart!

@jskeet
Collaborator

jskeet commented Jan 25, 2023

I'm looking at this now, with a tentative design that's basically a stream to perform the buffering - so you might have something like:

var mediaStream = new MediaStream(/* buffer size maybe? */);
// Important - we *don't* await this task yet...
var uploadTask = client.UploadObjectAsync("my-bucket", "my-object", "text/plain", mediaStream);
// We'd probably *only* support async code for this. It makes things simpler.
await mediaStream.WriteAsync(...);
mediaStream.Close(); // Or Dispose, or a different method entirely...

// This would only finish when the buffered data had all been pulled by the uploader
await uploadTask;

We might be able to use the same class (or at least bits of it) for a cleaner download as well (where you read from the download stream). Or we may be able to use an entirely different design for that. (That gets tricky in terms of hash checking.)

It's still a bit clunky - more so now that I'm focused on the user code rather than the implementation. Hmm.

@jskeet jskeet self-assigned this Jan 25, 2023
@masonwheeler
Author

Honestly that setup looks like a footgun, if a minor one. "await SomeApiAsync()" is so ingrained into .NET developers' heads by this point that I can see a lot of people simply doing that automatically and not understanding why it didn't work, which seems at odds with the "pit of success" philosophy underlying .NET design.

What happens if you do await the call to UploadObjectAsync? I'm guessing it will simply never complete?

@jskeet
Collaborator

jskeet commented Jan 25, 2023

@masonwheeler: Yup, that's my big concern. And yes, it would just never complete because it would be waiting for the stream to be completed. I'm starting to think about different options (which may basically use that underneath)...
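
To illustrate the trap (using the hypothetical MediaStream from the sketch above):

    var mediaStream = new MediaStream();
    // Deadlock: the upload waits for data that the code below
    // never gets a chance to write.
    await client.UploadObjectAsync("my-bucket", "my-object", "text/plain", mediaStream);
    await mediaStream.WriteAsync(...); // never reached
    mediaStream.Close();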

@masonwheeler
Author

@jskeet Aside from the ValueTuple, which could probably be worked around in a few different ways, is there any reason why the API I suggested in the first post wouldn't be viable?

@jskeet
Collaborator

jskeet commented Jan 25, 2023

@masonwheeler: I think I'd probably want to wrap it into a separate type. There's still the concern that someone will see a task and just await it though. I don't want to leap to one suggested API rather than trying to think of multiple options - and ones that will work for other media upload/download in Google.Apis as well.

@jskeet
Collaborator

jskeet commented Mar 30, 2023

Apologies for the silence on this for a couple of months - and apologies that you're not going to like the result, which is that I'm going to close this issue with a new PR that adds this to our backlog file. I've looked into it, and I just don't think there's anything nice we can do in the current library.

If we end up with a new Storage library based on gRPC, we can look at it from scratch at that point, as we'll have a clean slate in terms of API surface - we won't need to worry about how the existing REST-based support library works, and we won't need to make changes to the existing API of Google.Cloud.Storage.V1. You have my word that we'll at least take another close look - and hopefully with that added freedom, we'll be able to come up with something appropriate. I think it's unlikely that we'll ever end up back-porting to Google.Cloud.Storage.V1 though.

I know this isn't the news you want to hear, but I'm afraid we have limited resources in the team, and implementing a robust and "clean" API (avoiding traps) here would just be too expensive in terms of time.

jskeet added a commit to jskeet/google-cloud-dotnet that referenced this issue Mar 30, 2023
Adds googleapis#9250 to the backlog, and removes a couple of items that have
been addressed.

Closes googleapis#9250

jskeet added a commit that referenced this issue Mar 30, 2023
Adds #9250 to the backlog, and removes a couple of items that have
been addressed.

Closes #9250
@masonwheeler
Author

Wow. "We have limited resources" is kind of the last thing you expect to ever hear from Google... 😮

@groone

groone commented May 22, 2024

I know this isn't the news you want to hear, but I'm afraid we have limited resources in the team, and implementing a robust and "clean" API (avoiding traps) here would just be too expensive in terms of time.

A possible way to avoid traps is to add another method that accepts a System.IO.Pipelines.PipeReader and continuously uploads chunks from the pipe until the pipe is completed by another thread via its Complete() method.
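
A sketch of the shape such an overload might take (hypothetical; not an existing API member):

    // Reads chunks from the pipe and uploads them until the producer
    // completes the pipe with writer.Complete(); awaiting this task is
    // always safe, since end-of-data is signalled out-of-band.
    public Task<Object> UploadObjectAsync(
        string bucket,
        string objectName,
        string contentType,
        System.IO.Pipelines.PipeReader source,
        UploadObjectOptions options = null,
        CancellationToken cancellationToken = default)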
