
MultiPart Upload #11588

Closed
suleymanbyzt opened this issue Jan 25, 2024 · 6 comments
Assignees: jskeet
Labels: api: storage (Issues related to the Cloud Storage API.), type: question (Request for information or clarification. Not an issue.)

Comments

@suleymanbyzt

Hello, everyone,

I want to use your library to upload large files without having to write them to my disk. For example, I have a 1 GB bucket. I want to download files from this bucket in 10 MB chunks, write them to a stream, and then upload them back in zip format.

I tried this with CreateObjectUploader, but it won't let me overwrite the file: when the upload is done, I only have a 10 MB file in the bucket.

https://cloud.google.com/storage/docs/performing-resumable-uploads

Once Cloud Storage persists bytes in a resumable upload, those bytes cannot be overwritten, and Cloud Storage ignores attempts to do so. Because of this, you should not send different data when rewinding to an offset that you sent previously.

For example, say you're uploading a 100,000 byte object, and your connection is interrupted. When you check the status, you find that 50,000 bytes were successfully uploaded and persisted. If you attempt to restart the upload at byte 40,000, Cloud Storage ignores the bytes you send from 40,000 to 50,000. Cloud Storage begins persisting the data you send at byte 50,001.
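The status-check behaviour described in those docs can be sketched against the raw resumable-upload protocol. This is a hedged sketch, not the client library's API: `sessionUri` is assumed to be a session URI such as the one returned by `InitiateSessionAsync`, and the helpers below are hypothetical names.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

static class ResumableStatus
{
    // Given the "Range: bytes=0-49999" header from a 308 status response,
    // the next byte to send is the upper bound plus one.
    public static long NextOffset(string rangeHeader)
    {
        // No Range header means the server has persisted nothing yet.
        if (string.IsNullOrEmpty(rangeHeader)) return 0;
        var upper = rangeHeader.Split('-')[1];
        return long.Parse(upper) + 1;
    }

    // Ask the server how many bytes it has persisted for an interrupted upload.
    public static async Task<long> QueryPersistedOffsetAsync(
        HttpClient client, Uri sessionUri, long totalSize)
    {
        var request = new HttpRequestMessage(HttpMethod.Put, sessionUri)
        {
            Content = new ByteArrayContent(Array.Empty<byte>())
        };
        // "Content-Range: bytes */N" requests status without sending data.
        request.Content.Headers.ContentRange = new ContentRangeHeaderValue(totalSize);
        var response = await client.SendAsync(request);
        if ((int)response.StatusCode == 308) // Resume Incomplete
        {
            response.Headers.TryGetValues("Range", out var values);
            return NextOffset(values is null ? null : string.Join(",", values));
        }
        return totalSize; // 200/201: the upload already completed
    }
}
```

In the 100,000-byte example above, a 308 response carrying `Range: bytes=0-49999` means resumption should start at byte 50,000; any bytes re-sent below that offset are ignored.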

The documentation says this is not possible, but I wanted to ask in case there is a different method. Is there a way to make an upload continue from a certain point, like the Range option in DownloadObjectOptions?

@jskeet
Collaborator

jskeet commented Jan 25, 2024

Please provide a complete example - I don't really understand exactly what you're trying to achieve, and a concrete code example will make it a lot clearer.

@jskeet jskeet added the api: storage Issues related to the Cloud Storage API. label Jan 25, 2024
@jskeet
Collaborator

jskeet commented Jan 25, 2024

(You can definitely overwrite the contents of an object later, but you can't overwrite them within that upload session. That's what the docs are trying to say.)
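For illustration of that distinction, a minimal sketch of overwriting in a later request, assuming a bucket named my-bucket and the Google.Cloud.Storage.V1 StorageClient (bucket and object names here are hypothetical):

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Google.Cloud.Storage.V1;

class OverwriteExample
{
    static async Task Main()
    {
        var client = await StorageClient.CreateAsync();

        // First upload creates the object.
        using var first = new MemoryStream(Encoding.UTF8.GetBytes("version 1"));
        await client.UploadObjectAsync("my-bucket", "notes.txt", "text/plain", first);

        // A second upload to the same object name replaces the contents entirely;
        // it is a new upload session, so the persisted-bytes rule does not apply.
        using var second = new MemoryStream(Encoding.UTF8.GetBytes("version 2"));
        await client.UploadObjectAsync("my-bucket", "notes.txt", "text/plain", second);
    }
}
```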

@jskeet jskeet self-assigned this Jan 25, 2024
@jskeet jskeet added the type: question Request for information or clarification. Not an issue. label Jan 25, 2024
@suleymanbyzt
Author

Download process:

        const int chunkSizeInBytes = 10 * 1024 * 1024;
        ulong offset = 0;
        while (offset < fileSize)
        {
            int bytesRead;
            using (var downloadStream = new MemoryStream())
            {
                await _cloudStorage.Download(fileName, downloadStream, new DownloadObjectOptions
                {
                    Range = new RangeHeaderValue((long?)offset, (long?)(offset + chunkSizeInBytes - 1))
                });

                bytesRead = (int)downloadStream.Length;
                var endOfPart = bytesRead < chunkSizeInBytes;

                // Rewind before the stream is read again for upload.
                downloadStream.Position = 0;
                await _cloudStorage.UploadMultiPart(uploadPath, downloadStream, endOfPart);
            }

            offset += (ulong)bytesRead;
        }

Upload process:

    public async Task UploadMultiPart(string uploadPath, Stream memoryStream, bool endOfPart)
    {
        UploadObjectOptions options = new UploadObjectOptions
        {
            PredefinedAcl = PredefinedObjectAcl.PublicRead
        };

        if (!_sessions.TryGetValue(uploadPath, out Uri uploadUri))
        {
            ObjectsResource.InsertMediaUpload tempUploader = _storageClient.CreateObjectUploader(
                _bucketName, uploadPath, "application/octet-stream", memoryStream, options);
            uploadUri = await tempUploader.InitiateSessionAsync();

            _sessions.TryAdd(uploadPath, uploadUri);
        }

        IProgress<IUploadProgress> progress = new Progress<IUploadProgress>(
            p => Console.WriteLine($"bytes: {p.BytesSent}, status: {p.Status}")
        );

        ResumableUpload actualUploader = ResumableUpload.CreateFromUploadUri(uploadUri, memoryStream);
        actualUploader.ProgressChanged += progress.Report;
        await actualUploader.UploadAsync();

        if (endOfPart)
        {
            _ = _sessions.TryRemove(uploadPath, out _);
        }
    }

For example, here I want to download a 200 MB file in 10 MB parts and upload each downloaded part. I cannot build the file up this way; each part just overwrites the previous one.

@jskeet
Collaborator

jskeet commented Jan 25, 2024

It's still not clear to me what you want the result to be, though, or why you want to upload it in parts at all.
There are also aspects around sessions that make this sample incomplete. If you could give me a complete console application that I can run, state the precondition (e.g. "a 1GB object"), and say what you want the end result to be, that would be really helpful.

At the moment I think you're effectively trying to "not complete" the upload after each part, in which case this issue is just a duplicate of googleapis/google-api-dotnet-client#2480 - please could you check whether that describes what you're trying to do?

@pkese

pkese commented Jan 27, 2024

@jskeet I'm having a similar problem.

I'm streaming large data-lake files (~1 GB each), reading them from Amazon S3, doing some processing and filtering, and then writing them to Google Cloud Storage.

Currently I need to store those files on disk before uploading them to Google Cloud Storage.
If streaming uploads worked, I could use cheaper compute instances without local disk storage.


I think what is needed in the code is:

content.Headers.ContentRange =
    isLastChunk
        ? new ContentRangeHeaderValue(totalWritten, totalWritten + chunk.Length - 1, totalWritten + chunk.Length)
        : new ContentRangeHeaderValue(totalWritten, totalWritten + chunk.Length - 1);
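A sketch of what such a change would involve, using the standard System.Net.Http.Headers.ContentRangeHeaderValue type (the session URI and helper names are hypothetical; byte ranges in Content-Range are inclusive, hence the `- 1`):

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

static class ChunkedUpload
{
    // Build the Content-Range header for one chunk of a resumable upload.
    // Only the final chunk states the total size, which is what tells the
    // server the upload is complete.
    public static ContentRangeHeaderValue RangeFor(long offset, int chunkLength, bool isLastChunk)
    {
        long from = offset;
        long to = offset + chunkLength - 1;
        return isLastChunk
            ? new ContentRangeHeaderValue(from, to, to + 1) // "bytes from-to/total"
            : new ContentRangeHeaderValue(from, to);        // "bytes from-to/*"
    }

    // Send one chunk to an already-initiated session URI (hypothetical helper).
    public static async Task SendChunkAsync(HttpClient client, Uri sessionUri,
                                            byte[] chunk, long offset, bool isLastChunk)
    {
        var content = new ByteArrayContent(chunk);
        content.Headers.ContentRange = RangeFor(offset, chunk.Length, isLastChunk);
        var response = await client.PutAsync(sessionUri, content);
        // Intermediate chunks get 308 (Resume Incomplete); the final one 200/201.
    }
}
```

With this shape, intermediate chunks advertise an unknown total ("bytes 0-9/*"), and the last chunk finalizes the object by supplying it ("bytes 90-99/100").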

@jskeet
Collaborator

jskeet commented Jan 29, 2024

@pkese: "I think what is needed in the code" - which code, exactly? (I very much doubt that implementing this is just a single statement change.) Please note the final comment in googleapis/google-api-dotnet-client#2480 - we'd like to get to this at some point, but it's not high on our priority list at the moment.

I'm going to close this issue as I believe it's a duplicate of the linked one, and I'd really prefer to avoid multiple issues getting separate comment threads. If you believe it's not a duplicate of that, please let me know and I can reopen this one.

@jskeet jskeet closed this as completed Jan 29, 2024