Google Cloud Storage failing when using threads #3501

Closed
Alexis-Jacob opened this issue Jun 14, 2017 · 12 comments
Labels: api: core, api: storage (Issues related to the Cloud Storage API.)


Alexis-Jacob commented Jun 14, 2017

  1. Ubuntu 16.04
  2. Python 2.7.6
  3. google-api-python-client>=1.6.2 and google-cloud-storage>=1.1.1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
    raise self._value
ssl.SSLError: [Errno 1] _ssl.c:1429: error:1408F10B:SSL routines:SSL3_GET_RECORD:wrong version number
The client is not thread-safe (I think). Minimal reproduction:

from multiprocessing.pool import ThreadPool
from google.cloud import storage
from functools import partial

def upload(bucket, i):
    blob = bucket.blob("file{}.png".format(i))
    blob.upload_from_string("blabla")
    blob.make_public()
    return blob.public_url

# A single client (and its underlying HTTP connection) is shared
# across all worker threads here.
bucket = storage.Client().get_bucket("deepo-test")
pool = ThreadPool()
fct = partial(upload, bucket)
pool.map(fct, range(2))

dhermes commented Jun 14, 2017

@Alexis-Jacob That's correct, the error you are seeing is caused by the lack of thread-safety in httplib2. We recommend (for now) creating an instance of Client that is local to your thread / process.
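
One way to follow that advice without constructing a client on every request is to cache one per thread; a minimal sketch (bucket and file names taken from the repro above):

import threading

from google.cloud import storage

_local = threading.local()

def get_client():
    # Lazily create exactly one storage.Client per thread.
    if not hasattr(_local, "client"):
        _local.client = storage.Client()
    return _local.client

def upload(i):
    bucket = get_client().get_bucket("deepo-test")
    blob = bucket.blob("file{}.png".format(i))
    blob.upload_from_string("blabla")
    blob.make_public()
    return blob.public_url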

dhermes added the api: storage and api: core labels Jun 14, 2017

dhermes commented Jun 14, 2017

@Alexis-Jacob I am going to pre-emptively close this issue because it is "known" and something we are working on. If you'd like a thread-safe transport, I recommend looking into https://github.com/GoogleCloudPlatform/httplib2shim

@jonparrott Is there a "better" recommendation to make?
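
For reference, wiring in httplib2shim is a one-time monkey-patch; a sketch based on that project's README (the patch must run before any httplib2.Http instances are created):

import httplib2shim

# Swap httplib2.Http for a urllib3-backed, thread-safe implementation.
httplib2shim.patch()

from google.cloud import storage

client = storage.Client()  # this client's transport is now the patched one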


theacodes commented Jun 14, 2017 via email

dhermes closed this as completed Jun 14, 2017

evanj commented Jun 27, 2017

Wow, can we get this added to the top level README.md, and added to the documentation for each of the clients? This causes a significant change in how I can use this library. I believe I'm running into this, and now I need to restructure my app to avoid passing the client or any created sub-objects around.

Additionally, httplib2shim doesn't seem to work with the recent updates: the google-auth library no longer uses httplib2, and when using it with google.cloud.storage I get the following exception:

    File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 891, in upload_from_file
    client, file_obj, content_type, size, num_retries)
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 818, in _do_upload
    client, stream, content_type, size, num_retries)
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 768, in _do_resumable_upload
    client, stream, content_type, size, num_retries)
  File "/usr/local/lib/python2.7/site-packages/google/cloud/storage/blob.py", line 727, in _initiate_resumable_upload
    total_bytes=size, stream_final=False)
  File "/usr/local/lib/python2.7/site-packages/google/resumable_media/requests/upload.py", line 323, in initiate
    total_bytes=total_bytes, stream_final=stream_final)
  File "/usr/local/lib/python2.7/site-packages/google/resumable_media/_upload.py", line 410, in _prepare_initiate_request
    if stream.tell() != 0:
AttributeError: addinfourl instance has no attribute 'tell'

theacodes commented Jun 27, 2017

@evanj it's our hope that soon we'll be able to wholesale migrate to requests so that this actually won't be an issue. @dhermes where are we on #1998?

That error you posted is curious. Storage is already using our new non-httplib2 transport, so @dhermes might be able to shed some light there.

@evanj
Copy link
Contributor

evanj commented Jun 27, 2017

Thanks for the instant response! I get that this is going to be fixed "soon", but it was a bit of a surprise to discover that this known issue with the current release only seems to be documented in GitHub Issues. I would love it if the following page said "Client is not thread-safe; do not use it between threads": https://googlecloudplatform.github.io/google-cloud-python/stable/storage-client.html

Once the bug is fixed, then the docs could be fixed :)

Also don't worry about the exception, I'm probably doing something weird or have some library version mismatch. I'm just going to fix my code to not re-use clients, since that seems like the more sane, documented solution at the moment.


dhermes commented Jun 27, 2017

Indeed, thanks for the patience @evanj.

RomHartmann commented Jul 27, 2018

I would like to add (for googling purposes) that I got the following errors inconsistently:

python 3.6.5
google-api-core==1.2.1
google-auth==1.5.0
google-cloud-core==0.28.1
google-cloud-storage==1.10.0
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /download/storage/v1/b/xxx?alt=media (Caused by SSLError(SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2273)'),))

ssl.SSLError: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2273)
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='www.googleapis.com', port=443): Max retries exceeded with url: /download/storage/v1/b/xxx?alt=media (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2273)'),))

ssl.SSLError: [SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:2273)

Passing a new client to each new process solved this.


tseaver commented Jul 27, 2018

@RomHartmann

Passing a new client to each new process solved this.

Are you using multiprocessing, or threading?


RomHartmann commented Jul 27, 2018

@tseaver Both.

I think it's best if I quickly describe what I was doing as well.
The google.storage Python API does not seem to have a better way to download a bunch of small files than to create a blob for each one individually (blob = bucket.blob(name); blob.download_as_string()), which is super slow. In contrast, gsutil -m cp x y is really quick, but all metadata is lost, and I needed that metadata.

So as a workaround I list all the blobs I want to download with gsutil ls -l and create batches based on file size. Each batch is then sent to a new process (using Python's multiprocessing.Pool), and within that process each blob is downloaded in its own thread (using threading.Thread), combining blob.download_as_string() with the blob.metadata dict.

When only multithreading, I got no errors passing a single Client/Bucket (storage.Client().get_bucket(name)) to each thread.
When I sent each batch to a new process and then multithreaded each blob in that batch, I got the above errors. Additionally, that whole batch would fail about 50% of the time, with subsequent batches usually succeeding to connect (I think the process exited after an error was raised and the script terminated).
The problem was solved by creating and passing a new Bucket object to each process.
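
A rough sketch of that scheme (bucket and blob names here are placeholders):

import multiprocessing
import threading

from google.cloud import storage

def fetch_blob(bucket, name, results):
    blob = bucket.get_blob(name)  # loads properties, including metadata
    results[name] = (blob.download_as_string(), blob.metadata)

def process_batch(blob_names):
    # A new client/bucket per process, created after the fork.
    bucket = storage.Client().get_bucket("my-bucket")
    results = {}
    threads = [threading.Thread(target=fetch_blob, args=(bucket, name, results))
               for name in blob_names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    batches = [["a.png", "b.png"], ["c.png", "d.png"]]
    pool = multiprocessing.Pool(processes=len(batches))
    all_results = pool.map(process_batch, batches)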


tseaver commented Jul 27, 2018

@RomHartmann I'm not too surprised that the requests session pool, etc., might not function well across an os.fork() call. multiprocessing can paper over some of the differences between forking and threads, but not all of them.
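
One way to guarantee the client is created after the fork (a sketch, not code from this thread) is to build it in a Pool initializer, so each worker process constructs its own client and never inherits the parent's connections:

import multiprocessing

from google.cloud import storage

_client = None

def init_worker():
    # Runs once inside each child process.
    global _client
    _client = storage.Client()

def download(blob_name):
    # "my-bucket" is a placeholder bucket name.
    return _client.get_bucket("my-bucket").blob(blob_name).download_as_string()

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4, initializer=init_worker)
    data = pool.map(download, ["a.png", "b.png"])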


kumudraj commented Feb 21, 2019

You need to create a new client connection for every pool/thread, inside def upload(i). That will work:

from multiprocessing.pool import ThreadPool
from google.cloud import storage

def upload(i):
    # A fresh client per call, so no connection is shared between threads.
    bucket = storage.Client().get_bucket("deepo-test")
    blob = bucket.blob("file{}.png".format(i))
    blob.upload_from_string("blabla")
    blob.make_public()
    return blob.public_url

pool = ThreadPool()
pool.map(upload, range(2))
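
Note that this creates (and authenticates) a brand-new Client on every call, which adds overhead to each upload; caching one client per thread, e.g. with threading.local as sketched earlier in the thread, gives the same safety without the repeated setup.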
