
S3 uploader docs question #2228

Closed
iselind opened this issue Oct 25, 2018 · 19 comments
Labels
guidance Question that needs advice or information.

Comments

iselind commented Oct 25, 2018

Please fill out the sections below to help us address your issue.

Version of AWS SDK for Go?

Latest

Version of Go (go version)?

go version go1.11 linux/amd64

What issue did you see?

While using the s3manager to upload a file, it seems the entire file is read into RAM before it is sent. The docs give no clue as to how we can limit this behaviour. Our code base runs in a Kubernetes pod with a restricted amount of available memory.

Our file sizes can vary from 1 MB to 1 GB. We need a way to control the RAM consumption while uploading these files.

We'd require a way to stream the file to the S3 bucket without having to read the entire file into RAM before sending it to S3.

Steps to reproduce

Create a Kubernetes pod with a RAM limit of, say, 10 MB and try to upload a 100 MB file to any S3 bucket. The pod will be killed by the OOM killer.

Example code from our code base:

// UploadCoreFile uploads body combined with meta to the appropriate bucket
func UploadCoreFile(body io.Reader, meta map[string]string) (string, error) {
    svc := getSession()

    uploader := s3manager.NewUploaderWithClient(svc)

    fname := fmt.Sprintf("%s-%s-%s-%d.core", meta["SerialNumber"], meta["ProdNbr"], meta["Version"], time.Now().Unix())

    // Upload input parameters
    upParams := &s3manager.UploadInput{
        Bucket: aws.String(coreFileBucket),
        Key:    aws.String(fname),
        Body:   body,
    } 

    // Perform an upload.
    resp, err := uploader.Upload(upParams)

    if err != nil { 
        return "", err
    } 
    return resp.Location, err
}
diehlaws self-assigned this Oct 25, 2018
diehlaws added the guidance (Question that needs advice or information.) label Oct 26, 2018
@diehlaws
Contributor

Hi @iselind, thanks for reaching out to us about this. Unfortunately s3manager is not very light when it comes to memory usage at this time; there have been discussions about this in issues #142, #232, and #272, among others. We currently have an open feature request (#2036) to re-use buffer pools across requests issued by s3manager.Uploader, which should reduce memory usage when using s3manager. You are welcome to +1 that feature request and include additional details for your use case, or suggestions for the implementation of this enhancement, in a comment on that issue.

Until this is implemented, s3manager unfortunately won't work for your use case without increasing the memory available to your Kubernetes containers, so you'll have to perform your own multi-part uploads without s3manager. You could create a custom uploader that, for example, creates an io.ReadSeeker from []byte data streamed from the file you want to upload and uses it as the Body for the input of an UploadPart API call.

Issue #232 includes a link to a custom uploader written by @cespare that may work for your use case, or that you could use as a starting point to implement your own custom uploader.
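
For illustration, a rough sketch of that approach (not an official SDK helper; the client, bucket, and key names are placeholders, and error paths should really abort the multipart upload too) could read the incoming io.Reader in fixed-size chunks into one reusable buffer and send each chunk with UploadPart, so only a single part is held in memory at a time:

// Streaming multipart upload against the low-level S3 API.
// Assumed imports: bytes, io, github.com/aws/aws-sdk-go/aws, github.com/aws/aws-sdk-go/service/s3.
func streamUpload(svc *s3.S3, bucket, key string, body io.Reader) error {
    mpu, err := svc.CreateMultipartUpload(&s3.CreateMultipartUploadInput{
        Bucket: aws.String(bucket),
        Key:    aws.String(key),
    })
    if err != nil {
        return err
    }

    const partSize = 5 * 1024 * 1024 // S3's minimum part size is 5 MB
    buf := make([]byte, partSize)    // the only large buffer held in memory
    var parts []*s3.CompletedPart
    for partNum := int64(1); ; partNum++ {
        n, readErr := io.ReadFull(body, buf)
        if n > 0 {
            out, err := svc.UploadPart(&s3.UploadPartInput{
                Bucket:     aws.String(bucket),
                Key:        aws.String(key),
                UploadId:   mpu.UploadId,
                PartNumber: aws.Int64(partNum),
                Body:       bytes.NewReader(buf[:n]), // io.ReadSeeker over this chunk only
            })
            if err != nil {
                return err
            }
            parts = append(parts, &s3.CompletedPart{
                ETag:       out.ETag,
                PartNumber: aws.Int64(partNum),
            })
        }
        if readErr == io.EOF || readErr == io.ErrUnexpectedEOF {
            break // end of input; the final part may be smaller than 5 MB
        }
        if readErr != nil {
            return readErr
        }
    }

    _, err = svc.CompleteMultipartUpload(&s3.CompleteMultipartUploadInput{
        Bucket:          aws.String(bucket),
        Key:             aws.String(key),
        UploadId:        mpu.UploadId,
        MultipartUpload: &s3.CompletedMultipartUpload{Parts: parts},
    })
    return err
}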

iselind commented Oct 28, 2018

Thanks a lot for looking into my issue, much appreciated!

I'll take a look at the other issues you mention and get appropriately involved in those.

Your suggestion that I write my own multipart uploader seems strange to me. In order for an io.Seeker to work, you need the data in a seekable storage, i.e. the data needs to be in RAM. The requirement for io.Seeker is, as I understand it, to a large extent the reason why s3manager requires so much memory. Perhaps I misunderstood your suggestion?

cespare commented Oct 28, 2018

In order for an io.Seeker to work, you need the data in a seekable storage, i.e. the data needs to be in RAM.

Seeking within files and reading sections of them should generally only pull the relevant pages into the buffer cache. It's definitely not the case that only data in RAM is seekable.

iselind commented Oct 29, 2018

Firstly, how can you be confident in any way that an io.Reader is in fact a file? In our case we get our data from a socket.

Secondly, even if the io.Reader is a file, it's still not page-based until the file has been either memory-mapped or read into RAM. Either way you end up in RAM again.

Perhaps I'm missing something vital here?

cespare commented Oct 29, 2018

Firstly, how can you be confident in any way that an io.Reader is in fact a file? In our case we get our data from a socket.

We can't. That's why my uploader code takes a file-like interface instead of a plain io.Reader.

Secondly, even if the io.Reader is a file, it's still not page-based until the file has been either memory-mapped or read into RAM. Either way you end up in RAM again.

Hmm, I'm not quite sure what you're getting at here. The pages of the file may be on disk only or they may be on disk and in the page cache (memory). It's true that if you seek to the middle of the file and then read some data (to upload it) then those pages are now in page cache, but the important things are (1) the kernel may drop those pages from the cache later after we're done reading that chunk of the file, and (2) the data for a chunk doesn't need to all be copied into the Go process's heap at once. That's why my uploading code has much lower memory usage than s3manager.Uploader.
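
To make that concrete, here is a minimal sketch (not the linked uploader) of sending one part of an on-disk file without first copying it into a []byte; svc, bucket, key, and uploadID are assumed to come from an earlier CreateMultipartUpload call:

    // io.SectionReader satisfies io.ReadSeeker, so the request body is read
    // on demand from the file (via the page cache) as it is sent.
    const partSize = 5 * 1024 * 1024
    partNum := int64(3) // example: the third 5 MB part

    f, err := os.Open("large.core") // hypothetical file name
    if err != nil {
        return err
    }
    defer f.Close()

    // For the final part, pass the remaining length instead of partSize.
    part := io.NewSectionReader(f, (partNum-1)*partSize, partSize)
    _, err = svc.UploadPart(&s3.UploadPartInput{
        Bucket:     aws.String(bucket),
        Key:        aws.String(key),
        UploadId:   uploadID,
        PartNumber: aws.Int64(partNum),
        Body:       part,
    })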

jasdel commented Oct 29, 2018

Thanks for the update. The issue you're running into seems to be directly related to using an io.Reader with the S3 Upload Manager. The S3 Upload Manager will attempt to optimize for an io.Reader that is also an io.ReadSeeker; if it can make this optimization, very little additional memory is allocated by the uploader. For a plain io.Reader the uploader cannot make this optimization, and will instead allocate a new byte slice of PartSize (5 MB unless changed) for each chunk uploaded to S3.
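
As a point of comparison, here is a minimal sketch of that ReadSeeker path, assuming the data can first be written to a local file (which may not fit the socket-fed case in this issue) and reusing the uploader/bucket/key names from the original snippet:

    f, err := os.Open("/tmp/upload.core") // hypothetical local copy of the data
    if err != nil {
        return "", err
    }
    defer f.Close()

    // *os.File satisfies io.ReadSeeker, so the uploader can seek over the file
    // instead of allocating a PartSize byte slice for every part.
    resp, err := uploader.Upload(&s3manager.UploadInput{
        Bucket: aws.String(coreFileBucket),
        Key:    aws.String(fname),
        Body:   f,
    })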

Most likely the memory error you're running into is due to the Uploader's default number of concurrent uploads (5) combined with the minimum part size (5 MB): a 100 MB file will consume at least 25 MB of memory just for buffering the 5 MB parts to be uploaded.

One potential way to limit the impact of this is to set the Uploader's Concurrency parameter to 1. I'm not positive this will stay within the 10 MB window, though.

    // Perform an upload.
    resp, err := uploader.Upload(upParams, func(u *s3manager.Uploader) {
        u.Concurrency = 1
    })

jasdel commented Oct 29, 2018

Another way you could solve this issue with such restricted memory is to use the SDK to generate a presigned URL for the request, then use the stdlib http.Client to stream the object to S3. Presigned URLs support streaming. Currently the AWS SDK for Go does not support streaming uploads with the PutObject S3 API operation. The SDK's S3 Upload Manager does support streaming, but its memory usage needs to be improved.

Here is an example of using a presigned URL to upload an S3 object: https://github.com/aws/aws-sdk-go/tree/master/example/service/s3/presignURL. This pattern will consume the least amount of memory.

We have #2036 tracking refactoring of the S3 Upload Manager's performance, but either reducing the concurrency or using presigned URLs should work for your memory threshold.

iselind commented Oct 30, 2018

Thanks @jasdel, I'll look into those options and get back to you. I think the presigned URL solution looks the most promising.

Would you mind keeping this issue open until I've tested those options? It will probably not take more than a week for me to test this properly...

jasdel commented Oct 30, 2018

Thanks for the feedback, sure, we can leave the issue open. I'll mark it as pending feedback. Let us know how the change goes.

jasdel added the response-requested (Waiting on additional info and feedback. Will move to "closing-soon" in 7 days.) label Oct 30, 2018
iselind commented Oct 31, 2018

It seems we've hit a snag; perhaps you can see what we're missing.
We're able to presign the request and make the request. We get

<?xml version="1.0" encoding="UTF-8"?>
<Error>
	<Code>NotImplemented</Code>
	<Message>A header you provided implies functionality that is not implemented</Message>
	<Header>Transfer-Encoding</Header>
	<RequestId>4FC2B7F1A122D0C6</RequestId>
	<HostId>fKG9On8k+o7mGCpu/FlslvghnsTx9deoMCaFTs9nIDNmUucvX3esKq7Ih53649+M99qK/PnehQ8=</HostId>
</Error>

back as a response.

The code looks like this now:

	svc := getSession()

	fname := fmt.Sprintf("%s-%s-%s-%d.core", meta["SerialNumber"], meta["ProdNbr"], meta["Version"], time.Now().Unix())

	putObj, _ := svc.PutObjectRequest(&s3.PutObjectInput{
		Bucket:        aws.String(coreFileBucket),
		Key:           aws.String(fname),
		ContentLength: aws.Int64(0), // We cannot know the length in advance
	})

	url, err := putObj.Presign(15 * time.Minute)
	if err != nil {
		return "", err
	}

	req, err := http.NewRequest("PUT", url, body)
	if err != nil {
		return "", err
	}
	resp, err := http.DefaultClient.Do(req)

I think we've followed the linked example code. I cannot see what we're doing wrong.

We get the same issue if we exclude the line setting content-length to 0 in the PutObjectRequest.

jasdel commented Oct 31, 2018

@iselind leave the ContentLength value unset in your PutObjectRequest. In addition, your http.Request needs to set the Content-Length header to the length of the content you'll be putting. S3 does not support chunked transfer-encoding PutObject requests.

I also suggest using PresignRequest instead of Presign. PresignRequest will return an http.Header value with all the headers that were signed, which must be included when the presigned URL is used. Generally this value is empty, but depending on what parameters you use for PutObjectInput there may be headers that need to be populated, e.g. for SSE.
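
For reference, a minimal sketch of the adjusted call, assuming the object's size is known up front as size (which turned out to be the sticking point in this thread) and reusing the bucket/key names from the snippet above:

    putObj, _ := svc.PutObjectRequest(&s3.PutObjectInput{
        Bucket: aws.String(coreFileBucket),
        Key:    aws.String(fname),
        // ContentLength left unset; the length is supplied on the HTTP request instead.
    })

    url, signedHeaders, err := putObj.PresignRequest(15 * time.Minute)
    if err != nil {
        return "", err
    }

    req, err := http.NewRequest("PUT", url, body)
    if err != nil {
        return "", err
    }
    req.ContentLength = size // sends a Content-Length header instead of chunked encoding
    for k, v := range signedHeaders {
        req.Header[k] = v // replay any headers that were part of the signature
    }

    resp, err := http.DefaultClient.Do(req)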

iselind commented Nov 1, 2018

@jasdel Thanks a lot for your insights. We set the content length to 0 because we cannot possibly know the length of the body before receiving it all. As we cannot store it temporarily on disk, we are left with RAM, so we haven't gained anything by using Presign(). I assume PresignRequest() has the same limitation, right?

I found this issue, which seems very similar to mine. They seem to have solved the content-length problem there. Can the same be done in the Go AWS SDK? I cannot interpret this in any other way than that there is a way to do chunked transfers to S3. I know that issue is for the JavaScript version of the AWS SDK, but the language can hardly be the culprit here, right?

Based on the previous discussion, it seems to me that this issue stems from limitations in the AWS REST API rather than in the SDK. If that's the case, I'd say switching to a language like C, Python, or Java wouldn't change a thing in this regard. So setting the uploader's concurrency to 1 seems to be our only way forward for reducing its memory consumption. Would you agree?

iselind commented Nov 7, 2018

@diehlaws @jasdel Any updates on this?

ncw commented Nov 26, 2018

@jasdel

Currently the AWS SDK for Go does not support streaming uploads with the PutObject S3 API operation.

Would it be possible to bring back streaming uploads for PutObject? The SDK used to support this (many iterations ago ;-) and the underlying API supports it just fine.

Supplying the content MD5 is optional, so if the user is willing to live without this protection it would be great if there were a PutObject variant that could stream.

I'd like to use this in rclone for uploading files with the minimum memory usage. I know the MD5SUM in advance so I'd like to set that and stream the content.

I'll try to see if I can get the presigned URL uploader to work.

ncw commented Nov 26, 2018

I'll try to see if I can get the presigned URL uploader to work.

I did get it to work - thanks for the hints above :-)

jasdel commented Nov 26, 2018

@ncw I think for the SDK to support streaming we need to add additional functionality to the SDK's internals to allow an API parameter to be an io.Reader. Currently, a significant portion of the SDK's internal behavior expects an io.ReadSeeker. The ReadSeeker is used by both the protocol marshaler for serialization and the signer for computing the request signature.

S3's PutObject and similar methods are special, and support the magic value UNSIGNED-PAYLOAD for the request body's digest. We should investigate how viable it is to update the SDK to allow S3 PutObject to take an io.Reader instead of an io.ReadSeeker.

ncw commented Nov 26, 2018

@jasdel

S3's PutObject and similar methods are special, and support the magic value UNSIGNED-PAYLOAD for the request body's digest. We should investigate how viable it is to update the SDK to allow S3 PutObject to take an io.Reader instead of an io.ReadSeeker.

That would be nice. In the meantime, the code above plus your hints makes for a working uploader, provided you know the size of the object in advance. For rclone's use this is fine. For objects whose size isn't known I'll just use s3uploader as before.

I note that Backblaze have solved the hash problem in a different way to allow streaming: they allow the last 20 bytes of the upload stream to be the SHA-1 hash. One could also supply hashes in HTTP trailers.

Thanks for your help :-)

@vitaly-zdanevich
Contributor

I am not an expert in either Go or S3, so please recommend the most correct way to stream an HTTP response to S3 without touching the SSD and without waiting for the remote file to be fetched completely. I also need to use the received HTTP body after uploading it to S3, without fetching the data again. I found that I can use io.TeeReader to reuse the streamed data; as for streaming from an http.Response, do I understand the messages above correctly that I still need to use S3's HTTP PUT API instead of the Go SDK?
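
Not an authoritative answer, but a minimal sketch of the io.TeeReader idea combined with the presigned-URL approach discussed above; presignedURL is assumed to come from a PutObjectRequest(...).Presign(...) call, the source must report a Content-Length, and note that the bytes.Buffer tee target keeps the whole body in memory (for large objects you would tee into a hash or a pipe instead):

    src, err := http.Get("https://example.com/data.bin") // hypothetical source
    if err != nil {
        return err
    }
    defer src.Body.Close()

    var captured bytes.Buffer
    tee := io.TeeReader(src.Body, &captured) // everything streamed to S3 is also captured

    req, err := http.NewRequest("PUT", presignedURL, tee)
    if err != nil {
        return err
    }
    req.ContentLength = src.ContentLength // S3 rejects chunked transfer encoding

    if _, err := http.DefaultClient.Do(req); err != nil {
        return err
    }
    // captured.Bytes() now holds the streamed body for further processing, with no re-fetch.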

iselind commented May 18, 2019

As I see it, this issue can be closed; it doesn't seem to be going anywhere. It's unfortunate that the S3 API doesn't support streaming.

I'll close the issue. If anyone is against this, please re-open.
