ObjectStore Directory Semantics #2445

Closed
tustvold opened this issue May 4, 2022 · 20 comments
Labels
enhancement New feature or request question Further information is requested

Comments

@tustvold
Contributor

tustvold commented May 4, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

LocalFileSystem interprets the prefix passed to ObjectStore::list_file as the path to a directory, and then proceeds to enumerate this directory recursively. S3FileSystem, however, interprets the prefix as a string prefix.

The distinction arises if you consider a file structure like

foo/a.txt
foo/b.txt

If called with a prefix of fo, LocalFileSystem will return an error, whereas S3FileSystem will return both files.
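To make the difference concrete, here is a minimal sketch of the two interpretations over an in-memory list of keys. The function names are hypothetical and only model the behaviour described above; they are not the actual LocalFileSystem or S3FileSystem code.

```rust
// Hypothetical helpers for illustration only; not DataFusion's API.

/// Object-store style: a key matches if it starts with the prefix string.
fn list_by_string_prefix<'a>(keys: &[&'a str], prefix: &str) -> Vec<&'a str> {
    keys.iter().copied().filter(|k| k.starts_with(prefix)).collect()
}

/// Filesystem style: `dir` must name a directory, modelled here as "at
/// least one key lives under `dir/`"; otherwise the call is an error.
fn list_by_directory<'a>(keys: &[&'a str], dir: &str) -> Result<Vec<&'a str>, String> {
    let dir_prefix = format!("{}/", dir.trim_end_matches('/'));
    let found: Vec<&'a str> = keys
        .iter()
        .copied()
        .filter(|k| k.starts_with(&dir_prefix))
        .collect();
    if found.is_empty() {
        Err(format!("no such directory: {}", dir))
    } else {
        Ok(found)
    }
}
```

With the keys foo/a.txt and foo/b.txt, `list_by_string_prefix(&keys, "fo")` yields both keys, while `list_by_directory(&keys, "fo")` is an error.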

Describe the solution you'd like

I personally would expect something called ObjectStore to behave like an object store, and not a filesystem. In particular I would expect it to behave like a KV store without any notion of directories.

I would therefore suggest:

  • Remove FileSystem from the naming of the implementations
  • Map object storage semantics to the local filesystem, as opposed to mapping filesystem semantics to all object storages

Describe alternatives you've considered

We could instead call the trait something like FileSystem and give it file system like semantics.

Additional context

I noticed this whilst reviewing #2394 - it seems off to me that we should need to split based on path delimiters given object stores don't have such a concept.

Thoughts @matthewmturner @alamb @timvw ?

@tustvold tustvold added enhancement New feature or request question Further information is requested labels May 4, 2022
@timvw
Contributor

timvw commented May 5, 2022

Regardless of the chosen approach (ObjectStore vs FileSystem), I would consider making the trait (and its methods) consistent:

Currently the trait is named ObjectStore but it only has methods related to files. Either rename the methods (and datatypes), e.g. fn list_object(s) -> ObjectMetadata, or rename the trait to FileSystem.

@timvw
Contributor

timvw commented May 6, 2022

Also consider Issue-2465.

The objectstore is requested to list files that match prefix "/Users/blah//".
LocalFileSystem returns items such as "/Users/blah/test.txt" .

One could claim that this path does not match the prefix.
One could also claim that an object can have multiple keys and that this file has an alternative key "/Users/blah//test.txt" which does match the prefix.

For globbing to work, within a key/prefix concept, the returned objects/files should carry the "key" that matches the prefix. (Currently my fix in the mentioned issue works the other way round -> globbing is adapted to filesystem implementation).

@wjones127
Member

wjones127 commented May 6, 2022

Also relevant:

As part of that PR, I plan on creating a generic suite of tests to validate an ObjectStore implementation, and that could enforce these behavior expectations for each implementation.

For FileSystem vs ObjectStore, I'm only familiar with implementations of the first in the context of query engines (such as Arrow C++'s FileSystem or Python's fsspec). Are there examples of ObjectStore implementations?

My preference is for a "FileSystem" approach since that's more familiar, but I am open to the ObjectStore approach as long as it can be used to read and write in a way compatible with other systems that may use a FileSystem approach. (For example, the current implementation doesn't force delimiting paths with /, but I think that is expected by other systems.)

@tustvold
Contributor Author

tustvold commented May 6, 2022

Are there examples of ObjectStore implementations

I'm not sure what you mean by this, but object stores are really just key value stores with a vaguely RESTful API, i.e.

  • PutObject - associate an object (set of bytes) with a string key, replacing any existing value
  • GetObject - get the object associated with a key
  • CopyObject - copy the object associated with one key, to another
  • ListObjects - list the keys with a given prefix
  • DeleteObject - delete the value with a given key
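As a rough illustration of how small this interface is, the five operations can be sketched over an in-memory map. The type and method names below are hypothetical (they mirror the operation names in the list, not any real crate's API), and the sketch is synchronous for brevity.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory object store: a plain key-value map with no
/// notion of directories.
#[derive(Default)]
struct InMemoryStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl InMemoryStore {
    /// PutObject: associate bytes with a key, replacing any existing value.
    fn put(&mut self, key: &str, bytes: Vec<u8>) {
        self.objects.insert(key.to_string(), bytes);
    }

    /// GetObject: fetch the bytes stored under a key, if any.
    fn get(&self, key: &str) -> Option<&[u8]> {
        self.objects.get(key).map(|v| v.as_slice())
    }

    /// CopyObject: copy the value under `from` to `to`; false if `from` is missing.
    fn copy(&mut self, from: &str, to: &str) -> bool {
        match self.objects.get(from).cloned() {
            Some(bytes) => {
                self.objects.insert(to.to_string(), bytes);
                true
            }
            None => false,
        }
    }

    /// ListObjects: keys that start with `prefix`, in sorted order.
    fn list(&self, prefix: &str) -> Vec<&str> {
        self.objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .map(|k| k.as_str())
            .collect()
    }

    /// DeleteObject: remove the value under a key; true if it existed.
    fn delete(&mut self, key: &str) -> bool {
        self.objects.remove(key).is_some()
    }
}
```

Note that `list` is a pure string-prefix match, so a prefix of "fo" matches "foo/a.txt" without "fo" having to be a directory.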

There are more complex APIs for things like multipart uploads, bucket creation, etc... but in terms of what a client would be interested in that is the entirety of the API. To put it another way, the interface of object storage is significantly less expressive than that of a filesystem. This is why object storage is scalable, and things like NFS, EFS, are... not 😅

Trying to make object storage behave exactly like a filesystem is impossible (e.g. S3 doesn't support CreateIfNotExists), however, my thesis is that no query engine actually wants filesystem semantics, and this is why these linked abstractions kind of work (#2205 (comment)).

My suggestion is that by instead implementing the less expressive object storage semantics, we can avoid a whole host of funky edge-cases around directories, paths, buffering, read-ahead etc...

in a way compatible with other systems that may use a FileSystem approach

Could you expand on what you mean by this, do you mean being able to read data written by another system which should be trivial, or are you talking about some sort of API-level integration like FFI?

@alamb
Contributor

alamb commented May 6, 2022

Are there examples of ObjectStore implementations?

The canonical example of ObjectStore is AWS's S3: https://aws.amazon.com/s3/ and then there are many distributed storage systems that present a similar interface, as @tustvold describes in #2445 (comment)

The idea of the "ObjectStore" interface in DataFusion was to provide API access to the lowest common denominator feature set across several storage implementations. For example, here are three implementations for S3, HDFS, and Azure specifically:

In terms of "glob"ing, that is typically not a feature provided by object stores (e.g. there is no such thing in S3, which instead offers a much more restricted notion of prefixes). Thus, it seems to me if we want to support globbing for DataFusion when running on local files, it will have to be a special case somehow.

You can see another example of a Rust API to object storage in IOx: https://github.com/influxdata/influxdb_iox/blob/main/object_store

@alamb
Contributor

alamb commented May 6, 2022

It would help me significantly, to understand the globbing usecase more -- like when exactly are you selecting a subset of files in a directory via a glob? Most analytic systems I have seen tend to assume data has been pre-grouped into directories (or equivalent)

AWS Redshift does offer the ability to specify a subset of files that are not all in the same directory, but it does so by taking a manifest file: https://docs.aws.amazon.com/redshift/latest/dg/loading-data-files-using-manifest.html

@alamb
Contributor

alamb commented May 6, 2022

Also, @carols10cents spent considerable time sorting out consistent directory semantics for object stores and local files in https://github.com/influxdata/influxdb_iox/blob/main/object_store -- maybe we can just use those semantics (or maybe even the code?)

@wjones127
Member

I'm not sure what you mean by this

Sorry that wasn't clear. I pointed out two implementations of an abstraction over object stores (S3, GCS, etc.) that are like filesystems (in that they have a notion of directories, not that they make any guarantees about atomicity). These are used by analytics systems like Dask and PyArrow, so there's some evidence we can build useful query engines on top of such an abstraction.

Thanks @alamb for the IOx example.

Trying to make object storage behave exactly like a filesystem is impossible (e.g. S3 doesn't support CreateIfNotExists), however, my thesis is that no query engine actually wants filesystem semantics,

I largely agree. I think the main thing these "FileSystem" abstractions provide is a notion of "directory", which is important in directory-partitioned datasets. The existing API can handle that fine with delimiter, but it does seem a little funny you can provide whatever delimiter you want.
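For reference, the way a delimiter gives a "directory" view over flat keys can be sketched as follows. S3's ListObjects accepts a delimiter and returns the grouped "common prefixes" alongside the objects; the function below is a hypothetical in-memory model of that behaviour, not the existing DataFusion API.

```rust
use std::collections::BTreeSet;

/// Hypothetical sketch of delimiter-based listing: returns the keys directly
/// under `prefix`, plus the "common prefixes" one level below it.
fn list_with_delimiter<'a>(
    keys: &[&'a str],
    prefix: &str,
    delimiter: char,
) -> (Vec<&'a str>, Vec<String>) {
    let mut objects = Vec::new();
    let mut common_prefixes = BTreeSet::new();
    for &key in keys {
        // Skip keys that do not start with the requested prefix.
        let rest = match key.strip_prefix(prefix) {
            Some(rest) => rest,
            None => continue,
        };
        match rest.find(delimiter) {
            // A delimiter after the prefix: group the key under its "directory".
            Some(idx) => {
                let end = prefix.len() + idx + delimiter.len_utf8();
                common_prefixes.insert(key[..end].to_string());
            }
            // No further delimiter: the key is a direct child.
            None => objects.push(key),
        }
    }
    (objects, common_prefixes.into_iter().collect())
}
```

Listing a directory-partitioned dataset with prefix "data/" and delimiter '/' would then return partition "directories" such as "data/year=2022/" without the store having any real directory concept.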

Could you expand on what you mean by this, do you mean being able to read data written by another system which should be trivial, or are you talking about some sort of API-level integration like FFI?

Yeah I think as long as you could do the expected filesystem operations on top of the API, then that seems fine. For context, I plan to wrap the ObjectStore API in a PyArrow-compatible filesystem for use in delta-rs. Hence #2246.

But I think I'll scale back my changes in #2246 and remove the create_dir(), remove_dir() methods if we want to just think of this as an object store abstraction with no awareness of directories.

Also, @carols10cents spent considerable time sorting out consistent directory semantics for object stores and local files in https://github.com/influxdata/influxdb_iox/blob/main/object_store -- maybe we can just use those semantics (or maybe even the code?)

That sounds very promising @alamb. Thanks for pointing that out!

@tustvold
Contributor Author

tustvold commented May 6, 2022

If we like the IOx object store interface and want to reuse the implementation, I can probably see about getting it published to crates.io, just let me know. It wasn't my intent with this issue, rather I just wanted clarity on what I should be reviewing 😅, but I would be happy to help make it happen if there is consensus on it being a good idea

@wjones127
Member

If we like the IOx object store interface and want to reuse the implementation, I can probably see about getting it published to crates.io, just let me know.

I would be supportive of that, but we probably would need to discuss what that means for
https://github.com/datafusion-contrib/datafusion-objectstore-s3
https://github.com/datafusion-contrib/datafusion-objectstore-hdfs
https://github.com/datafusion-contrib/datafusion-objectstore-azure

Do we want to create a new issue to discuss that?

@timvw
Contributor

timvw commented May 6, 2022

@alamb The globbing is mainly relevant in raw/ingestion folders...

E.g. we end up with a structure such as:
/nyc-taxidata/input/yellow_tripdata_2021-11.csv
/nyc-taxidata/input/yellow_tripdata_2021-12.csv
/nyc-taxidata/input/yellow_tripdata_2022-01.csv
/nyc-taxidata/input/green_tripdata_2021-12.csv
/nyc-taxidata/input/green_tripdata_2022-01.csv
/nyc-taxidata/input/green_tripdata_2022-02.csv

In a typical job we would then process and prepare the data for consumption:
/nyc-taxidata/accepted/yellow_tripdata/year=2022/month=1/blah.parquet
/nyc-taxidata/accepted/green_tripdata/year=2022/month=1/blah.parquet

I don't need access to all sorts of key filters (compared to all the key filters in a system such as HBase), but globbing is not something I would push back to the end-user. (In Hadoop this is also supported by the alternative (S3, Azure) Hadoop filesystem implementations.)

@timvw
Contributor

timvw commented May 7, 2022

In summary, I agree with the ObjectStore semantics being sufficient.

I also do want to point out that globbing is nothing more than making the suffix filter more powerful (instead of matching against a static suffix (eg: ".parquet") it allows matching against a pattern).

/// Calls `list_file` with a suffix filter
async fn list_file_with_suffix(
    &self,
    prefix: &str,
    suffix: &str,
) -> Result<FileMetaStream>
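
A minimal sketch of that idea: list by string prefix first, then apply an arbitrary predicate to each key, so that a static suffix and a glob pattern are just two different filters. Both function names are hypothetical, and the matcher only supports `*` (which here also matches `/`); real glob syntax has more features.

```rust
/// Hypothetical matcher supporting only `*`, which matches any run of
/// bytes (including `/`).
fn wildcard_match(pattern: &str, text: &str) -> bool {
    let (p, t) = (pattern.as_bytes(), text.as_bytes());
    let (mut pi, mut ti) = (0, 0);
    // (pattern index after the last '*', text index where it started matching)
    let mut backtrack: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            backtrack = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((bp, bt)) = backtrack {
            // Let the last '*' absorb one more byte and retry from there.
            pi = bp;
            ti = bt + 1;
            backtrack = Some((bp, bt + 1));
        } else {
            return false;
        }
    }
    // Only trailing '*' may remain in the pattern.
    p[pi..].iter().all(|&b| b == b'*')
}

/// List by string prefix, then keep only the keys accepted by `filter`.
fn list_filtered<'a>(
    keys: &[&'a str],
    prefix: &str,
    filter: impl Fn(&str) -> bool,
) -> Vec<&'a str> {
    keys.iter()
        .copied()
        .filter(|k| k.starts_with(prefix) && filter(*k))
        .collect()
}
```

A static suffix filter is then `|k| k.ends_with(".parquet")`, and a glob is `|k| wildcard_match("*green_tripdata_2022-*.csv", k)`.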

@timvw
Contributor

timvw commented May 7, 2022

Currently the globbing implementation in datafusion is somewhat blurry, because it tries to work around a limitation of the LocalFileSystem object store implementation.

As we all seem to agree, the proper solution would be to fix the LocalFileSystem implementation such that it does not err on a prefix which does not represent a file/directory.
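
One possible shape for such a fix, as a sketch only: walk the local tree, turn each file into a '/'-separated key relative to a root, and filter those keys by string prefix. The functions and the `root` parameter are hypothetical, and this version walks the whole tree for simplicity; a real implementation would likely start from the deepest directory contained in the prefix.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively collect every file under `dir` as a '/'-separated key
/// relative to `root`.
fn collect_keys(root: &Path, dir: &Path, out: &mut Vec<String>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_keys(root, &path, out)?;
        } else if let Ok(rel) = path.strip_prefix(root) {
            let parts: Vec<String> = rel
                .components()
                .map(|c| c.as_os_str().to_string_lossy().into_owned())
                .collect();
            out.push(parts.join("/"));
        }
    }
    Ok(())
}

/// Hypothetical prefix listing over a local directory: a prefix that is not
/// a file or directory is not an error, it just filters keys as a string.
fn list_local(root: &Path, prefix: &str) -> io::Result<Vec<String>> {
    let mut keys = Vec::new();
    collect_keys(root, root, &mut keys)?;
    keys.retain(|k| k.starts_with(prefix));
    keys.sort();
    Ok(keys)
}
```

With foo/a.txt and foo/b.txt under the root, `list_local(root, "fo")` returns both keys instead of failing.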

@timvw
Contributor

timvw commented May 7, 2022

Apologies for the going back and forth; next time I'll save you from my out-loud thinking and only post a coherent answer...

Last realisation: By having an ObjectStore that can only filter/scan on prefix, we take away the possibility for object stores to optimise eventual suffix filters (predicate pushdown for file searching, if you will).

@Cheappie
Contributor

Cheappie commented May 8, 2022

In my case the existing design of the ObjectStore interface forced me to re-engineer ListingTable in order to provide yet another way of listing a data source.

From my perspective It might be beneficial to push information about data source from TableProvider to ObjectStore. Then ObjectStore for a local file system, would combine data(table) location and strategy for listing that kind of storage. As a result listing methods present in ObjectStore could drop the concept of path as a way to access data.

Then ObjectStore could offer more generic interface with two methods:

  • list(filters)
    • query filters should be available in ObjectStore list method, to let anyone provide their own predicate pushdown algorithm
  • file_reader(sized_file)

Such an interface should allow us to provide any kind of listing approach (dir, glob, etc.). What do you think?

It's not a necessity, but the last component bound to a path is SizedFile, which outside of ObjectStore should be treated as an abstract blob with characteristics such as size, because only ObjectStore should know how to access it via file_reader.
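
A rough sketch of what such a two-method interface might look like. Everything here is hypothetical: the trait name, the associated `Filter` type standing in for "query filters", the opaque `handle` field, and the toy in-memory store are assumptions for illustration, and the sketch is synchronous for brevity.

```rust
use std::collections::BTreeMap;
use std::io::{Cursor, Read};

/// Hypothetical opaque handle: callers see only the size; how to locate
/// the bytes is known only to the store that produced it.
struct SizedFile {
    size: u64,
    handle: String, // store-specific locator, not meant as a public path
}

/// Hypothetical two-method interface: list by filters, then read a file.
trait ListingStore {
    type Filter;
    fn list(&self, filters: &[Self::Filter]) -> Vec<SizedFile>;
    fn file_reader(&self, file: &SizedFile) -> Option<Box<dyn Read + '_>>;
}

/// Toy filter type for the in-memory example.
enum KeyFilter {
    Prefix(String),
    Suffix(String),
}

/// Toy store holding objects in memory.
struct MemoryStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ListingStore for MemoryStore {
    type Filter = KeyFilter;

    fn list(&self, filters: &[KeyFilter]) -> Vec<SizedFile> {
        self.objects
            .iter()
            // The store decides how to evaluate (or push down) each filter.
            .filter(|(key, _)| {
                filters.iter().all(|f| match f {
                    KeyFilter::Prefix(p) => key.starts_with(p.as_str()),
                    KeyFilter::Suffix(s) => key.ends_with(s.as_str()),
                })
            })
            .map(|(key, bytes)| SizedFile {
                size: bytes.len() as u64,
                handle: key.clone(),
            })
            .collect()
    }

    fn file_reader(&self, file: &SizedFile) -> Option<Box<dyn Read + '_>> {
        let bytes = self.objects.get(&file.handle)?;
        Some(Box::new(Cursor::new(bytes.as_slice())))
    }
}
```

Because the filter type belongs to the store, a local-filesystem store could accept glob or directory filters while an S3-backed store could accept only what it can evaluate efficiently.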

@alamb
Contributor

alamb commented May 8, 2022

From my perspective It might be beneficial to push information about data source from TableProvider to ObjectStore. Then ObjectStore for a local file system, would combine data(table) location and strategy for listing that kind of storage. As a result listing methods present in ObjectStore could drop the concept of path as a way to access data.

I really like the idea of providing an extensible storage interface that allows APIs such as suggested by @Cheappie and @timvw.

Given these APIs seem to be adding semantics to the list of files on ObjectStorage, perhaps we could add an extra layer specifically for this in the APIs, rather than trying to extend ObjectStore or adding more logic to ListingTable. Perhaps something like the StorageCatalog in:

┌───────────────────────────────────┐
│                                   │
│           ListingTable            │
│                                   │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│          StorageCatalog           │
│  (e.g figure out which files on   │
│     object store to process)      │
└───────────────────────────────────┘
┌────────────────┐ ┌────────────────┐
│  ObjectStore   │ │  File Format   │
│(e.g. S3, HDFS) │ │ (e.g. parquet) │
│                │ │                │
└────────────────┘ └────────────────┘

@tustvold
Contributor Author

tustvold commented May 8, 2022

I think it is important to keep a separation between:

  • Catalog: what data files are where, what schema they have, what encoding they are, etc...
  • Data Access: how to get the data of a specific file

In particular, there is a very common use case where an additional catalog is used to improve query performance, because listing files, performing schema inference, etc. is not cheap. By keeping the concerns separate we can ensure this remains well supported.
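
To illustrate the separation, here is a small sketch in which the catalog question ("which files make up this table?") can be answered either by listing the store or from a precomputed manifest, without the data access layer changing. All trait and type names are hypothetical and are not DataFusion's SchemaProvider, TableProvider or ObjectStore.

```rust
use std::collections::HashMap;

/// Data access: how to reach the stored objects (hypothetical trait).
trait DataAccess {
    /// Keys that start with `prefix`.
    fn list(&self, prefix: &str) -> Vec<String>;
}

/// Catalog: which files belong to which table (hypothetical trait).
trait Catalog {
    fn files_for_table(&self, table: &str) -> Vec<String>;
}

/// Discovers files by listing the store on every call (potentially slow).
struct ListingCatalog<S: DataAccess> {
    store: S,
}

impl<S: DataAccess> Catalog for ListingCatalog<S> {
    fn files_for_table(&self, table: &str) -> Vec<String> {
        self.store.list(&format!("{}/", table))
    }
}

/// Answers from a precomputed manifest and never touches the store.
struct ManifestCatalog {
    entries: HashMap<String, Vec<String>>,
}

impl Catalog for ManifestCatalog {
    fn files_for_table(&self, table: &str) -> Vec<String> {
        self.entries.get(table).cloned().unwrap_or_default()
    }
}

/// Toy store holding a fixed set of keys.
struct StaticStore(Vec<String>);

impl DataAccess for StaticStore {
    fn list(&self, prefix: &str) -> Vec<String> {
        self.0.iter().filter(|k| k.starts_with(prefix)).cloned().collect()
    }
}
```

Either catalog can feed the same file readers, which is the point of keeping the two concerns apart.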

Currently I would view the catalog abstraction as SchemaProvider/TableProvider, and the data access as ObjectStore, but there is definitely potential to extract common catalog logic as suggested by @alamb 👍 Lots of systems will have some notion of data partitioning for instance.

FWIW I created some tickets a while back on supporting external catalogs (e.g. #2206, #2208 and #2209) which may be relevant here. I also created tickets to make the file operators themselves less coupled with the catalog - #2291 and #2293.

@matthewmturner
Contributor

@tustvold thank you very much for driving these efforts. I apologize I have not been able to contribute much to the conversation or code on these. Based on my current capacity I will likely be limited in what I can contribute on most of these in the foreseeable future - the one exception being #2206 which would actually be very helpful on my side. Perhaps I could work with @timvw to get a first cut of this created in datafusion-contrib / published to crates.

@timvw
Contributor

timvw commented May 15, 2022

@tustvold thank you very much for driving these efforts. I apologize I have not been able to contribute much to the conversation or code on these. Based on my current capacity I will likely be limited in what I can contribute on most of these in the foreseeable future - the one exception being #2206 which would actually be very helpful on my side. Perhaps I could work with @timvw to get a first cut of this created in datafusion-contrib / published to crates.

Getting there (I still want to test some things and change some signatures, e.g. return Vec instead of Result when adding multiple tables at once) -> https://github.com/timvw/datafusion-catalogprovider-glue

@tustvold
Contributor Author

tustvold commented May 18, 2022

I'm going to close this as I think it is superseded by #2504

Thank you all for helping move this forward 👍


6 participants