Shared buckets dropping frequently? #759

Closed
lauraduncanson opened this issue Jul 21, 2023 · 3 comments
@lauraduncanson

I'm testing out some code that works perfectly in the old ADE, and I'm periodically getting the following error:

Error: [classify] incorrect number of values (too many) for writing
In addition: Warning messages:
1: TIFFFillTile:Read error at row 2816, col 2816, tile 60; got 104317 bytes, expected 994809 (GDAL error 1)
2: TIFFReadEncodedTile() failed. (GDAL error 1)
3: /projects/shared-buckets/nathanmthomas/dem_all/03/21/14/06/57/437547/Copernicus_24902_covars_cog_topo_stack.tif, band 2: IReadBlock failed at X offset 0, Y offset 5: TIFFReadEncodedTile() failed. (GDAL error 1)
Execution halted

Re-running the exact same code over the same files does not give this error every time, so it seems to be an issue with reading from the shared buckets, which are maybe dropping and reloading quickly? I'm looping over a bunch of files, so it crashes whenever the shared bucket disconnects even for a second.
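
A retry wrapper around the per-file step might keep the loop alive through a brief disconnect. This is only a rough sketch, not code from this workflow; with_retries, process_one, and tif_files are placeholder names:

# Hedged workaround sketch: retry a file's read/classify step if the mounted
# bucket drops mid-read, instead of halting the whole loop.
with_retries <- function(fn, tries = 3, wait = 10) {
  for (attempt in seq_len(tries)) {
    result <- tryCatch(fn(), error = function(e) {
      message("Attempt ", attempt, " failed: ", conditionMessage(e))
      NULL
    })
    if (!is.null(result)) return(result)
    Sys.sleep(wait)  # give the s3fs mount a moment to recover before retrying
  }
  stop("Giving up after ", tries, " failed attempts")
}

# Hypothetical usage inside the existing loop over input rasters:
for (f in tif_files) {
  with_retries(function() process_one(f))
}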

gchang added the ADE Algorithm Development Environment Subsystem label on Nov 29, 2023
@gchang (Collaborator) commented Nov 29, 2023

UWG is still having issues with public and private buckets being dropped.

@gchang (Collaborator) commented Jan 18, 2024

No remedies yet, still looking into it.

@bsatoriu (Collaborator)

After testing numerous s3fs settings, I found that the biggest driver of the bucket disconnects is an under-sized memory limit. Out-of-memory errors show up in the dmesg log inside the s3fs sidecar container when copying large files (~100 MB) from the local file system into an s3-mounted bucket.

To resolve the memory issue, I updated the memory limit setting in the workspace devfiles from 256Mi to 1024Mi. I couldn't find any formal guidance on recommended s3fs minimum memory requirements, but I based the change on testing with a workspace script that copies large files in and out of mounted buckets.
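
Roughly, the change looks like this in a devfile container component (illustration only; the component name and image below are placeholders, not the actual workspace devfile contents):

components:
  - name: s3fs-sidecar              # placeholder name for the s3fs sidecar component
    container:
      image: <s3fs-sidecar-image>   # placeholder image reference
      memoryLimit: 1024Mi           # raised from 256Mi to avoid OOM kills during large copies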

I also added a -o endpoint="us-west-2" s3fs setting to the sidecar image to eliminate 404 errors during initial s3fs mounting.
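
For reference, the option is passed at mount time, along the lines of the following (the bucket name and mount point are placeholders, and the sidecar's other mount flags are omitted):

s3fs my-shared-bucket /projects/shared-buckets -o endpoint="us-west-2"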
