Implement smart_open instead of .open() to allow efficient streaming (saving/loading) of large files to cloud bucket. #264

hugolytics · 2022-08-19T13:27:59Z

The current implementation of the .open method reads and writes through a local cache, which is then synchronized with the cloud.

This could be replaced by smart_open, which streams data to and from the bucket and so handles large files more efficiently.

One can take inspiration from aws' S3PathLib (however, that library handles the boto session in a way that is not thread-safe, which made me switch to this library).

Currently, I subclass S3Path and reuse the aforementioned S3Pathlib's implementation as follows:

from cloudpathlib import S3Path as BaseS3Path
import smart_open

# Replace the cache-based .open method of S3Path with smart_open streaming.
class S3Path(BaseS3Path):
    def open(
        self,
        mode="r",
        buffering=-1,
        encoding=None,
        errors=None,
        newline=None,
        closefd=True,
        opener=None,
        ignore_ext=False,  # accepted for signature compatibility; not forwarded
        compression=None,  # accepted for signature compatibility; not forwarded
        api_kwargs: dict = None,  # type: ignore  # not forwarded
    ):
        """
        Open S3Path as a file-like object.
        :return: a file-like object.
        See https://github.com/RaRe-Technologies/smart_open for more info.
        """
        kwargs = dict(
            uri=self.as_uri(),
            mode=mode,
            buffering=buffering,
            encoding=encoding,
            errors=errors,
            newline=newline,
            closefd=closefd,
            opener=opener,
            # smart_open expects a boto3 client here. self.client is cloudpathlib's
            # S3Client wrapper; this assumes its .client attribute is the boto3 client.
            transport_params={"client": self.client.client},
        )
        return smart_open.open(**kwargs)

    def read_text(
        self,
        encoding="utf-8",
        errors=None,
    ) -> str:
        # Stream the object as text through the smart_open-backed open().
        with self.open(mode="r", encoding=encoding, errors=errors) as f:
            return f.read()

    def read_bytes(self) -> bytes:
        with self.open(mode="rb") as f:
            return f.read()

    def write_text(
        self,
        data: str,
        encoding="utf-8",
        errors=None,
        newline=None,
    ):
        with self.open(
            mode="w",
            encoding=encoding,
            errors=errors,
            newline=newline,
        ) as f:
            f.write(data)

    def write_bytes(self, data: bytes):
        with self.open(mode="wb") as f:
            f.write(data)

However, I could also open a pull request to merge this into the CloudPath definition, since smart_open is cloud-agnostic.

pjbull · 2022-08-19T19:07:46Z

Thanks @hugolytics for your thoughts here. It's definitely interesting to think about ways to leverage smart_open, especially given the breadth of backends it supports. We have a number of features we are considering that it could help with (#9, #10, #29).

To me, this issue is similar to the discussion in #96 and #109. There are certainly backend packages like this that handle the operational side of things that we should consider, since our primary purpose is supporting the pathlib API.

That said, we won't merge the PR (#265) as is for a number of reasons, so I'm going to close it:

- smart_open is designed explicitly for streaming, but there are workflows that benefit from the local cache architecture instead. Ideally we'd support both.
- We've been wary of taking additional dependencies beyond the "official" SDKs for cloud providers.
- I think we'd favor an implementation that leverages smart_open but keeps the consistency that our current *Client/*Path APIs support.

Note that it is worth bumping #92, which lists these alternatives.
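
To make the trade-off between the two strategies concrete, here is a minimal, stdlib-only sketch of one opener that supports both a local cache and streaming. Everything in it is hypothetical and for illustration only: `FakeRemote`, `StreamingReader`, `open_remote`, and the `streaming` flag are not part of cloudpathlib or smart_open, and an in-memory dictionary stands in for a bucket.

```python
import io
import tempfile
from pathlib import Path


class FakeRemote:
    """Hypothetical in-memory object store that counts bytes transferred."""

    def __init__(self, objects: dict):
        self.objects = objects
        self.bytes_transferred = 0

    def size(self, key: str) -> int:
        return len(self.objects[key])

    def get_range(self, key: str, start: int, size: int) -> bytes:
        data = self.objects[key][start:start + size]
        self.bytes_transferred += len(data)
        return data


class StreamingReader(io.RawIOBase):
    """Fetches bytes from the remote on demand, one range per read call."""

    def __init__(self, remote: FakeRemote, key: str):
        self.remote, self.key, self.pos = remote, key, 0

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        chunk = self.remote.get_range(self.key, self.pos, len(b))
        b[:len(chunk)] = chunk
        self.pos += len(chunk)
        return len(chunk)


def open_remote(remote: FakeRemote, key: str, cache_dir: str, streaming: bool = False):
    """Hypothetical opener: one call signature, two transfer strategies."""
    if streaming:
        return StreamingReader(remote, key)
    # Cache strategy: download the whole object once, then open the local copy.
    local = Path(cache_dir) / key
    if not local.exists():
        local.write_bytes(remote.get_range(key, 0, remote.size(key)))
    return open(local, "rb")


remote = FakeRemote({"big.bin": b"x" * 1_000_000})
with tempfile.TemporaryDirectory() as cache_dir:
    # Streaming: only the requested bytes cross the wire.
    with open_remote(remote, "big.bin", cache_dir, streaming=True) as f:
        header = f.read(16)
    streamed = remote.bytes_transferred

    # Cache: the whole object is transferred before the first read.
    with open_remote(remote, "big.bin", cache_dir) as f:
        header_cached = f.read(16)
    cached = remote.bytes_transferred - streamed
```

Both paths return the same 16 bytes, but the cache path pays for the full object up front and then serves repeated reads locally, which is the kind of workflow the local cache architecture favors.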

msmitherdc · 2023-04-22T11:15:00Z

I’d love to see smart_open added here also, as an alternative to the cache concept. We have to open large zip files (10s-100s of GB) just to read some content in place, and this all works well with cloudpathlib and smart_open (in my fork).

pjbull · 2023-04-22T19:58:52Z

@msmitherdc Thanks for the comment. To better understand your use case, what are the specific things that you want smart_open for? Is it streaming/partial reads/writes or something beyond that case?

msmitherdc · 2023-04-22T20:37:19Z

We are opening 3dtiles and i3s (slpk) mesh files. These are large zip files that we read JSON files out of to get info about the mesh. To serve them to cesium, we read byte ranges and serve them out to the client. So streaming and partial reads.
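
The access pattern described above can be sketched with the standard library alone. Python's `zipfile` only needs a seekable binary stream: it reads the central directory at the end of the archive and then the requested member, not the whole file. In this sketch `CountingReader`, `build_archive`, `read_json_member`, and `read_range` are hypothetical helpers, and the in-memory reader is a stand-in for a remote stream; it assumes that whatever stream is used against a real bucket supports `seek`.

```python
import io
import json
import zipfile


class CountingReader(io.BytesIO):
    """In-memory stand-in for a seekable remote stream; counts bytes read."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive() -> bytes:
    """Build a zip with one large binary member and one small JSON member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("big_mesh.bin", b"\0" * 5_000_000)
        zf.writestr("metadata.json", json.dumps({"layers": 3}))
    return buf.getvalue()


def read_json_member(stream, member: str) -> dict:
    """Parse one JSON member without extracting the rest of the archive."""
    with zipfile.ZipFile(stream) as zf:
        with zf.open(member) as f:
            return json.load(f)


def read_range(stream, start: int, length: int) -> bytes:
    """Return `length` bytes starting at `start`, e.g. to serve a byte range."""
    stream.seek(start)
    return stream.read(length)


archive = build_archive()
stream = CountingReader(archive)

metadata = read_json_member(stream, "metadata.json")
zip_bytes_read = stream.bytes_read  # far smaller than len(archive)

# The first four bytes of a zip file are the local file header signature.
signature = read_range(stream, 0, 4)
```

Reading the JSON member touches only the end-of-archive records, the central directory, and that member's data, so the large mesh member is never read.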

pjbull mentioned this issue Aug 19, 2022

implemented smart_open #265

Closed

mjkanji mentioned this issue Nov 23, 2022

Add caching to download_to and upload_from #292

Open

pjbull mentioned this issue Jul 29, 2024

Read from http - httppathlib? #455

Open

msmitherdc mentioned this issue Oct 10, 2024

Implement partial or streaming reads/writes (CloudFile abstraction) #9

Open
