
Write to S3 fails with operation timed out #1878

Closed
wjones127 opened this issue Jan 29, 2024 · 2 comments · Fixed by #1921
Labels
bug (Something isn't working)
priority: high (Issues that are high priority for LanceDb, the organization)

Comments

@wjones127
Contributor

It does not appear to retry:

OSError: LanceError(IO): Generic S3 error: Error after 0 retries in 67.379407043s, max_retries:10, retry_timeout:180s, source:error sending request for url (s3://...): operation timed out, /home/runner/work/lance/lance/rust/lance-core/src/encodings/binary.rs:80:13
wjones127 added the bug and priority: high labels on Jan 29, 2024
wjones127 self-assigned this on Jan 29, 2024
@wjones127
Contributor Author

I cannot reproduce this when writing from an EC2 instance to S3, whether with images, large vectors, or large tensors. I suspect there are specific network or instance states that trigger this.

@wjones127
Contributor Author

It turns out the trick to reproduce this is:

  1. Make each batch > 10MB, so the multipart upload initiates the upload request immediately.
  2. Make max_rows_per_group smaller than the size of the first batch. (Not necessary, but it speeds things up.)
  3. Sleep for 2 minutes between each batch.
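As a rough sanity check on step 1, with the values used in the script below (nrows = 128, ndims = 4096, and a middle dimension of 120), a single float64 batch is about 480 MiB, far above the 10 MB multipart threshold:

```python
# Back-of-the-envelope size of one batch from the repro script:
# 128 rows x 120 x 4096 float64 values, 8 bytes each.
nrows, middle, ndims = 128, 120, 4096
batch_bytes = nrows * middle * ndims * 8
print(f"{batch_bytes / 1024**2:.0f} MiB")  # 480 MiB
assert batch_bytes > 10 * 1024 * 1024
```

So even the very first batch initiates the multipart upload, and the connection then sits idle during each sleep.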

If we add a time.sleep() call to the Python generator, we can reproduce the error:

import time

import numpy as np
import pyarrow as pa

import lance

nrows = 128
ndims = 4096

def make_batch():
    # Each batch is ~480 MiB of float64 data, well over the 10 MB
    # multipart upload threshold.
    tensor = np.random.rand(nrows, 120, ndims).astype("float64")
    tensor = pa.FixedShapeTensorArray.from_numpy_ndarray(tensor)
    return pa.table({"tensor1": tensor}).to_batches()[0]

schema = make_batch().schema

def data():
    for i in range(10):
        yield make_batch()
        print("yielded a batch")
        # Leave the in-flight multipart upload idle between batches.
        time.sleep(30)

lance.write_dataset(
    data(),
    "s3://lance-performance-testing/will-test",
    schema=schema,
    mode="overwrite",
    max_rows_per_group=128,
)
