s3 client hangs when uploading an object during streaming download of another object #1235

Closed
5 of 6 tasks
javitonino opened this issue Dec 16, 2024 · 9 comments

@javitonino

Describe the bug

We've got some code that downloads a large archive file from S3, extracts it, and uploads each file to another key in the same bucket, all using iterators to work on small chunks of data. We noticed that the process sometimes got stuck while doing so (maybe ~30% of the time). No timeout or exception was raised, but the process never finished. After debugging for a while we got to this minimal reproducing example:

import asyncio

from aiobotocore.session import get_session

async def main():
    async with get_session().create_client("s3") as client:
        i = 0
        downloader = await client.get_object(Bucket="some_bucket", Key="input/7_mb_file")
        async for chunk in downloader["Body"]:
            i += 1
            print(i, len(chunk))

            # The hang only shows up near the end of the download; this condition just
            # speeds up testing by skipping earlier uploads. The bug reproduces without
            # it, it only takes more time. Adjust it to trigger on roughly the final
            # dozen chunks of the download.
            if i > 6990:
                print("Streamed, time to upload")
                # It hangs awaiting this
                await client.put_object(
                    Bucket="output",
                    Key="output/some_file",
                    Body=b"",
                    ContentType="application/octet-stream",
                )
                print("Uploaded")

asyncio.run(main())

Some important observations:

  • It's a non-deterministic bug; it does not always happen, but with this script I managed to reproduce it in roughly 30% of tries.
  • When it hangs, it always does so near the end of the download, at exactly the same place in the code. In the original code, downloading an archive of 123 files, it either failed when trying to upload the 110th file, or it didn't fail at all. After some debugging, I'm pretty sure that's the point where the network download has completed but we still have buffered data pending to process.
  • The upload seems stuck waiting here: https://github.com/aio-libs/aiohttp/blob/6200513d8fd34a820a4d10d238ca92d9f73ce7ee/aiohttp/client_reqrep.py#L590
  • I attempted to reproduce against MinIO to test locally but was unable to. I can only reproduce against the real S3 service, from my machine or an EC2 instance.
  • This feels like an aiohttp issue, but I was unable to reproduce it using raw aiohttp. I might have missed something important, though, since I made a bunch of simplifications (e.g. I skipped all authentication code from aiobotocore and used public buckets).
  • As a workaround, I'm using two separate clients, one for the download and one for the uploads (see the sketch after this list). This works fine and I'm not experiencing any problems this way.
  • Putting breakpoints through the code makes it much harder to reproduce. I'm assuming it's some kind of timing bug and that pausing in the debugger makes it go away.
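
A minimal sketch of that two-client workaround, with placeholder bucket/key names and the archive-extraction step elided; this is just how I'm working around it, not a fix:

import asyncio

from aiobotocore.session import get_session

async def main():
    session = get_session()
    # Downloads and uploads go through separate clients, so an upload is
    # never scheduled onto the connection that is still streaming.
    async with session.create_client("s3") as download_client, \
               session.create_client("s3") as upload_client:
        downloader = await download_client.get_object(Bucket="some_bucket", Key="input/7_mb_file")
        async for chunk in downloader["Body"]:
            # ... extract files from the archive chunk by chunk ...
            await upload_client.put_object(
                Bucket="some_bucket",
                Key="output/some_file",
                Body=b"",  # placeholder body, as in the repro above
                ContentType="application/octet-stream",
            )

asyncio.run(main())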

This is pure speculation based on the previous observations, but it feels like there's a race condition when the downloading connection is returned to the pool and then gets reused by one of the uploads.

Checklist

  • I have reproduced in environment where pip check passes without errors
  • I have provided pip freeze results
  • I have provided sample code or detailed way to reproduce
  • I have tried the same code in botocore to ensure this is an aiobotocore specific issue
  • I have tried similar code in aiohttp to ensure this is an aiobotocore specific issue
  • I have checked the latest and older versions of aiobotocore/aiohttp/python to see if this is a regression / injection

pip freeze results

Local:

aiobotocore==2.15.2
aiohappyeyeballs==2.4.4
aiohttp==3.11.10
aioitertools==0.12.0
aiosignal==1.3.2
attrs==24.3.0
botocore==1.35.36
frozenlist==1.5.0
idna==3.10
jmespath==1.0.1
multidict==6.1.0
propcache==0.2.1
python-dateutil==2.9.0.post0
six==1.17.0
urllib3==2.2.3
wrapt==1.17.0
yarl==1.18.3

Environments:

Local:

  • Python Version: 3.13
  • OS name and version: Arch Linux

EC2 instance:

  • Python Version: 3.12
  • OS name and version: Ubuntu 24.04
@thehesiod
Collaborator

great info thanks! ya almost certainly an aiohttp/asyncio bug. have you tried uvloop and/or different versions of python/asyncio?
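
A quick sketch of that suggestion, assuming main() is the reproduction coroutine from the report above (uvloop is a separate package, pip install uvloop):

import asyncio
import uvloop

uvloop.install()     # swap in uvloop's event loop policy
asyncio.run(main())  # re-run the repro on top of uvloop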

@thehesiod
Collaborator

oh I see you're on the latest aiohttp/python, nm.

@thehesiod
Collaborator

perhaps try running with debug logging for some hints. may need to add extra logging to aiohttp
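
A minimal sketch of that suggestion, assuming main() is the reproduction coroutine; these are stock stdlib/aiohttp logger names, and aiohttp may still need extra logging added by hand, as noted above:

import asyncio
import logging

# Verbose logging for everything, plus the aiohttp client logger explicitly.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger("aiohttp.client").setLevel(logging.DEBUG)

# asyncio's debug mode also reports slow callbacks and never-awaited coroutines.
asyncio.run(main(), debug=True)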

@javitonino
Author

javitonino commented Dec 16, 2024

Some progress on this.

I managed to reproduce locally using LocalStack (the Docker version), which greatly improved the debugging experience. I had tried before with MinIO but couldn't reproduce there. I'm just doing

docker run \
  --rm -it \
  -p 127.0.0.1:4566:4566 \
  -p 127.0.0.1:4510-4559:4510-4559 \
  -v /var/run/docker.sock:/var/run/docker.sock \
  localstack/localstack

dd if=/dev/zero of=ae4bbfcdc47f450aa8557abefeba4a5ct bs=1M count=1
aws --endpoint-url=http://localhost:4566 s3api create-bucket --bucket bucket --region local
aws --endpoint-url=http://localhost:4566 s3 cp ae4bbfcdc47f450aa8557abefeba4a5ct s3://bucket

and running the script below.

I started removing aiobotocore parts of the code to reduce the problem as much as possible, and arrived at the following minimal example:

import asyncio
import aiohttp
from aiobotocore.response import StreamingBody

async def run():
    for _ in range(10):
        async with aiohttp.ClientSession() as session:
            response = await session.get(
                "http://localhost:4566/bucket/ae4bbfcdc47f450aa8557abefeba4a5ct",
            )
            i = 0
            async for chunk in StreamingBody(response, response.headers["content-length"]):
                i += 1

                if i >= 900:
                    print("Streamed, time to upload")
                    # It hangs awaiting this
                    await session.put(
                        "http://localhost:4566/bucket/output/some_file",
                        data=b""
                    )
                    print("Uploaded")

asyncio.run(run())

Removing StreamingBody and iterating directly on the aiohttp response.content solves the problem. So the bug must be in StreamingBody or triggered by something that StreamingBody does.

That's it for now, I'll try to get some more time to continue debugging later/tomorrow.

P.S.: I also tried uvloop and some other Python/aiohttp versions. Nothing changed; the problem exists with all combinations I tried.

@javitonino
Author

Well, I did something wrong earlier, because this can definitely be reproduced without any aiobotocore. It just depends on the chunk size; aiobotocore's default of 1024 makes it happen pretty often in this example, though I'm not sure why.

Anyway, the following reproduces the hang with aiohttp alone:

import asyncio
import aiohttp

async def run():
    for _ in range(10):
        async with aiohttp.ClientSession() as session:
            response = await session.get(
                "http://localhost:4566/bucket/ae4bbfcdc47f450aa8557abefeba4a5ct",
            )
            i = 0
            async for chunk in response.content.iter_chunked(1024):
                i += 1

                print(f"chunk number {i}")
                if i >= 900:
                    print("Streamed, time to upload")
                    # It hangs awaiting this
                    await session.put(
                        "http://localhost:4566/bucket/output/some_file",
                        data=b""
                    )
                    print("Uploaded")

asyncio.run(run())

I guess I should move this issue to aiohttp's repo. Unless you want to keep this around here as well?

@jakob-keller
Collaborator

I guess I should move this issue to aiohttp's repo.

Yes, please. And thank you for the detailed analysis and follow-up!

@javitonino
Author

Linking the aiohttp issue in case someone needs to follow the trace.

Thanks for your help getting this triaged! ❤️ And sorry for the original report here, it took a while to be able to reproduce without aiobotocore involvement.

@Dreamsorcerer
Member

I guess I should move this issue to aiohttp's repo.

Yes, please. And thank you for the detailed analysis and follow-up!

FYI, you can just transfer the issue next time, from the sidebar on the right.

@thehesiod
Collaborator

rockstar status! no apologies needed, this was an awesome one to follow. Definitely will be following that bug
