Read h5 file using AWS S3 s3fs/boto3 #144

Closed
BoPengGit opened this issue Aug 9, 2018 · 13 comments

@BoPengGit

I am trying to read an h5 file from AWS S3, but I am getting the following errors using s3fs/boto3. Can you help? Thanks!

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False, key='key', secret='secret')

with fs.open('file', mode='rb') as f:
    h5 = pd.read_hdf(f)

TypeError: expected str, bytes or os.PathLike object, not S3File

import h5py

fs = s3fs.S3FileSystem(anon=False, key='key', secret='secret')
with fs.open('file', mode='rb') as f:
    hf = h5py.File(f)

TypeError: expected str, bytes or os.PathLike object, not S3File

import boto3

client = boto3.client('s3', aws_access_key_id='key', aws_secret_access_key='secret')
result = client.get_object(Bucket='bucket', Key='file')
with h5py.File(result['Body'], 'r') as f:
    data = f

TypeError: expected str, bytes or os.PathLike object, not StreamingBody

@mrocklin
Collaborator

mrocklin commented Aug 9, 2018

The h5py library doesn't accept a Python file-like object; it expects a string pathname to a local file. The HDF5 library does not work well with cloud-based data. See http://matthewrocklin.com/blog/work/2018/02/06/hdf-in-the-cloud for further discussion.

This is a good question, thank you for raising it, but solving it is out of scope for s3fs, so I'm going to close this issue.
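
For anyone stuck on h5py < 2.9, a minimal sketch of the download-first workaround this implies (assuming boto3 credentials are configured; the bucket and key names are placeholders):

    import tempfile

    import boto3
    import h5py

    # Download the object to a local temporary file, then let h5py open it by path.
    s3 = boto3.client("s3")
    with tempfile.NamedTemporaryFile(suffix=".h5") as tmp:
        s3.download_file("my-bucket", "my-file.h5", tmp.name)
        with h5py.File(tmp.name, "r") as f:
            print(list(f.keys()))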

@jreadey

jreadey commented Nov 3, 2019

Support for file-like objects has been added to h5py v 2.9. See "Python file-like objects" in http://docs.h5py.org/en/stable/high/file.html.

You can open HDF5 files with s3fs like so:

    s3 = s3fs.S3FileSystem()
    f = h5py.File(s3.open("s3://my-bucket/my-file.h5", "rb"), "r")

Performance will vary depending on how the file is structured and latency between where your code is running and the S3 bucket where the file is stored (running in the same AWS region is best), but if you have some existing Python h5py code, this is easy enough to try out.
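
If throughput matters, the fsspec open() call also accepts buffering knobs. A rough sketch with made-up sizes and the same placeholder names (block_size and cache_type are standard fsspec open options; rdcc_nbytes is h5py's chunk-cache size, available since h5py 2.9):

    import h5py
    import s3fs

    s3 = s3fs.S3FileSystem()

    # Bigger read blocks plus readahead caching mean fewer S3 range requests;
    # a larger chunk cache lets h5py reuse decompressed chunks.
    f = h5py.File(
        s3.open("s3://my-bucket/my-file.h5", "rb",
                block_size=16 * 1024 * 1024,   # 16 MiB per range request
                cache_type="readahead"),
        "r",
        rdcc_nbytes=64 * 1024 * 1024,          # 64 MiB chunk cache
    )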

@jrbourbeau
Contributor

That's great, thanks for sharing @jreadey!

@mrocklin
Collaborator

mrocklin commented Nov 4, 2019 via email

@martindurant
Member

(should work with any file system backend)
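
As a quick illustration of that claim, a round-trip sketch against fsspec's in-memory backend; h5py (>= 2.9) only needs a binary file-like object supporting read/write/seek/tell, so the same pattern works regardless of where the bytes live:

    import fsspec
    import h5py
    import numpy as np

    # write a small HDF5 file into the in-memory filesystem
    fs = fsspec.filesystem("memory")
    with fs.open("/demo.h5", "wb") as fh:
        with h5py.File(fh, "w") as f:
            f.create_dataset("x", data=np.arange(10))

    # read it back the same way the s3fs snippet does
    with fs.open("/demo.h5", "rb") as fh:
        with h5py.File(fh, "r") as f:
            print(f["x"][()])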

@AlexVaith

> Support for file-like objects has been added to h5py v 2.9. You can open HDF5 files with s3fs like so: […] (quoting @jreadey above)

Thanks for the code snippet. I tried it and it works fine until I want to close the file and leave the function. Is there any way to get around this?

from os.path import join  # bucket_name is defined elsewhere

import h5py
import numpy as np

def _read_data(file_system, key):
    data = dict()
    with h5py.File(file_system.open(join('s3://', bucket_name, key), "rb")) as f:
        file_keys = list(f.keys())
        if 'meta_data_json' in file_keys:
            file_keys.remove('meta_data_json')
        for k in file_keys:
            try:
                obj = f[k]
                # .value is deprecated; in h5py >= 2.9 use obj['semantic'][()] instead
                sem = obj.get('semantic').value[0]
                acc = obj.get('acc').value
                gyr = obj.get('gyr').value
                data[str(sem)] = np.concatenate((acc, gyr), axis=1)
            except Exception as e:
                print(e)
    return data

This is the error that comes up as soon as it leaves the with statement:

 File "/Users/alex/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/files.py", line 442, in __exit__
    self.close()
  File "/Users/alex/opt/anaconda3/lib/python3.7/site-packages/h5py/_hl/files.py", line 424, in close
    h5i.dec_ref(id_)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5i.pyx", line 150, in h5py.h5i.dec_ref
  File "h5py/defs.pyx", line 1187, in h5py.defs.H5Idec_ref
  File "h5py/h5fd.pyx", line 169, in h5py.h5fd.H5FD_fileobj_write
AttributeError: 'S3File' object has no attribute 'seek'

Any help is appreciated

@martindurant
Member

But S3File does have a seek attribute.
This may have something to do with the order in which objects are being deleted; you could do

of = file_system.open(join('s3://', bucket_name, key), "rb")
with of as fh:
    with h5py.File(fh, "r") as f:
        ...  # work with f here; both handles now close in a defined order
to be explicit.

@AlexVaith

> But S3File does have a seek attribute. […] (quoting @martindurant's suggestion above)

Thanks. I solved it with the following solution:

data = dict()
with h5py.File(s3.open(join('s3://', bucket_name, key), 'rb'), 'r', libver='latest') as f:
    file_keys = list(f.keys())
    if 'meta_data_json' in file_keys:
        file_keys.remove('meta_data_json')
    for k in file_keys:
        try:
            obj = f[k]
            sem = obj.get('semantic').value[0]
            if sem in (100, 601, 901):
                acc = obj.get('acc').value
                gyr = obj.get('gyr').value
                data[str(sem)] = np.concatenate((acc, gyr), axis=1)
        except Exception as e:
            print(e)
    f.close()  # explicit close inside the block; oddly needed here (see below)

I don't really know why I had to give different read modes to HDF5 ('r') and s3fs ('rb'), but this way it does work.

@martindurant
Member

HDF5 will assume binary in every case, but fsspec follows the Python convention that 'r' means text mode.
So it seems the key was to close the HDF5 file explicitly, which is odd, since the with block should have done that.
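
You can see the convention directly (same placeholder names as above):

    import io
    import s3fs

    s3 = s3fs.S3FileSystem()

    ft = s3.open("s3://my-bucket/my-file.h5", "r")   # text mode: wrapped in io.TextIOWrapper
    fb = s3.open("s3://my-bucket/my-file.h5", "rb")  # binary mode: raw S3File

    print(isinstance(ft, io.TextIOWrapper))  # True; h5py cannot use this one
    print(hasattr(fb, "seek"))               # True; this is what h5py.File needs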

@AlexVaith

> HDF5 will assume binary in every case, but fsspec follows the Python convention that 'r' means text mode. […]

Thanks for the explanation.

@mstewart141

For the interested reader directed here by search engines (like me) -- another effective workaround:

import tempfile

import pandas as pd
import s3fs

s3 = s3fs.S3FileSystem()

# download the object to a local temporary file, then read it by path
with tempfile.NamedTemporaryFile() as f:
    s3.get(remote_s3_path, f.name)  # remote_s3_path: 'bucket/key' of the HDF5 file
    df = pd.read_hdf(f.name)
df.shape

@martindurant
Member

The above is essentially equivalent to:

import fsspec
import pandas as pd

# open_local downloads through the simplecache filesystem and returns a local path
fn = fsspec.open_local(f'simplecache::s3://{remote_s3_path}')
df = pd.read_hdf(fn)

or, since open_local returns a plain path rather than a context manager:

with fsspec.open(f'simplecache::s3://{remote_s3_path}') as f:
    df = pd.read_hdf(f.name)  # f is the locally cached copy

which may or may not seem simpler.
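
By default simplecache writes to a temporary directory; if you want the cached copy to survive across runs, chained-URL options can point it somewhere stable (cache_storage is the standard simplecache option; the path is just an example):

    import fsspec
    import pandas as pd

    # keep the cached copy in a known directory so repeat reads skip the download
    fn = fsspec.open_local(
        f'simplecache::s3://{remote_s3_path}',
        simplecache={'cache_storage': '/tmp/hdf-cache'},
    )
    df = pd.read_hdf(fn)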

@firobeid

firobeid commented Oct 8, 2020

> Thanks. I solved it with the following solution: […] (quoting @AlexVaith's solution above)

I am trying your solution but I get an empty dictionary, as if the file is only opened in view mode. Any suggestions?
