Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should S3Path stop using the FilePathsBase interface? #226

Open
ericphanson opened this issue Nov 16, 2021 · 24 comments
Open

Should S3Path stop using the FilePathsBase interface? #226

ericphanson opened this issue Nov 16, 2021 · 24 comments

Comments

@ericphanson
Copy link
Member

ericphanson commented Nov 16, 2021

It provides a lot of methods for free, but an S3Path is not like a file system path in a lot of ways, and the abstractions break down in frustrating ways. I think it might be better for AWSS3 to define it's own methods for most things, and have it such that things that are not explicitly supported by AWSS3 are not supported at all, rather than supported in a somewhat broken way (which you might not notice at first but can bite you later-- for example cp, which works until you have nested directories, ref #225).

inspired by rofinn/FilePathsBase.jl#128 (comment)

@ericphanson
Copy link
Member Author

I think it would make sense for an S3Path to roughly mimic the AWS console commands, aws s3 cp ..., aws s3 ls ..., etc. So, following https://awscli.amazonaws.com/v2/documentation/api/latest/reference/s3/index.html, I think these operations make definitely sense:

  • cp
    • write and read, which are like aws s3 cp - s3://... and aws s3 cp s3://... -
  • mv
  • rm
  • readdir (analogous to ls)
  • sync

(I don't think we need rb or mb, at least currently).

I think they should match Base's semantics though, not the AWS console ones, e.g. cp is recursive by default.

And then some level of path-manipulation operations could make sense too (manipulating S3Path objects, i.e. not as an API to interact with s3):

  • joinpath
  • splitext
  • relpath
  • splitdir

But they need to be pretty careful to follow s3's semantics for prefixes.

And I think these functions don't really make sense:

And these are the tricky ones:

  • isfile
  • isdir
  • ispath

which I don't know how to deal with.

@hannahilea
Copy link
Contributor

...our team has lost a lot of time collectively to [diagnosing/working around/bug fixing/knowledge sharing] issues that boil down to

the abstractions break down in frustrating ways

and

supported in a somewhat broken way (which you might not notice at first but can bite you later [...)]

so I'm strongly in favor of this proposal.

@rofinn
Copy link
Member

rofinn commented Dec 8, 2021

If I'm understanding this correctly, the problem isn't with trying to have an overlapping API for various filesystem operations, but rather with having fallbacks that don't work with the special cases introduced in this package and S3? Back when the API was initially designed around a type hierarchy the assumption is that you'd just overload whatever didn't work for your particular type.

If you think isfile, isdir, etc doesn't/can't work then you can always choose to throw a method error for your particular type just can't support it. Perhaps the interface shouldn't implement anything and force users to implement every behaviour themselves? That seemed overly verbose previously, but maybe that would reduce the maintenance overhead in the long run? That would also fit into my desire to move from a type hierarchy to a more Tables.jl like interface.

That being said, even if FilePathsBase.jl didn't make any fallback assumption regarding the interface, any operations that extend the Base.Filesystem or FilePathsBase.jl interface would need to provide consistent behaviour otherwise we can't really write generic application code (same issue as we're finding with the default fallbacks). Perhaps the safest path forward is to introduce MethodErrors for operations we don't think we can support at all. Introducing a separate CLI style interface (with different function names or namespace) might be the way to go if folks are expecting to translated aws s3 commands (vs filepath operations) into a Julia script.

@ericphanson
Copy link
Member Author

The issue with overloading all the methods that don’t work is that we don’t know which ones those are and tend to find out at bad times or in code that’s hard to quickly iterate on for unrelated reasons. That’s why I think it’s better to start with no fallbacks for AWSS3 and implement things as-needed with a particular eye to s3’s semantics. This is particular to s3 because the semantics are really different than other file systems, so these problems come up more.

We could add a bunch of overloads with MethodErrors instead but I don’t really see the advantage of that; sounds annoying to maintain and update as FilePathsBase gets new functions.

@rofinn
Copy link
Member

rofinn commented Dec 8, 2021

That’s why I think it’s better to start with no fallbacks for AWSS3 and implement things as-needed with a particular eye to s3’s semantics.

That's fine, but I think my recommendation still holds that you shouldn't plan to extend Base.Filesystem functions if you plan to have functions that behave fundamentally differently (e.g., changing what splitdir, readdir, etc returns). That's the kind of thing that's been a nuisance to walk back from with the FilePaths stuff.

sounds annoying to maintain and update as FilePathsBase gets new functions

That doesn't happen very often, but sure.

@ericphanson
Copy link
Member Author

you shouldn't plan to extend Base.Filesystem functions if you plan to have functions that behave fundamentally differently (e.g., changing what splitdir, readdir, etc returns)

I don't really get this point, do you mean returning S3Paths from readdir when join=true? I don't think that is behaving fundamentally differently. Yes, S3Paths are different from Strings, but the more important point is that you can do

file = first(readdir(path; join=true))
bytes = read(file)

whether or not path is a Base filesystem path (a string) or an S3Path. If we returned a string there, we would match the output type of readdir(::String), but we would break the more fundamental interface, which is that you get something you can actually use as a file path. (That's the whole point of duck typing, right? You can make things agnostic to the specific implementation detail of the type used and still have everything work).

I would argue that returning Strings from readdir(path; join=true) for non-String paths is actually breaking the interface, because the thing you get out cannot be used as a filepath in all cases, and for some input types you need to do an extra step of converting it to a valid path in order to use it.

@ericphanson
Copy link
Member Author

ericphanson commented Jan 3, 2022

urgh, rofinn/FilePathsBase.jl#152 broke things for AWSS3 since it assumes readdir emits strings always. Specifically, that PR adds join=true to readdir(::AbstractPath) with different semantics than the earlier #129 for AWSS3, and then assumes readdir always emits strings by calling parse on the output.

Workaround:

Base.parse(::Type{P}, path::P; kwargs...) where {P <: S3Path} = path

edit: turns out this was already filed here: #227

@rofinn
Copy link
Member

rofinn commented Jan 4, 2022

Yeah, cause readdir should be type stable. I did comment on the AWSS3 PR before make the FPB change #129 (comment). You can always use readpath if you want to return a path type.

@ericphanson
Copy link
Member Author

Ok, well I think that's breaking the interface presented by Base, and moreover adding the parse line means it's not only inconsistent but also that any function using readdpath (like rm apparently does) is broken on S3Paths with any FilesPathBase release after that PR was merged. If FilesPathBase doesn't want to support S3Paths (e.g. tries ensure they are not broken by patch releases) then that's OK (I guess) but we really should not use the interface, hence this issue.

@rofinn
Copy link
Member

rofinn commented Jan 4, 2022

FilesPathBase doesn't want to support S3Paths

I don't think it's the responsibility of the FilePathBase package to adjust to adhoc API decisions within S3Path. Yes, S3Path has probably overloaded enough functionality at this point that perhaps it should just define its own behaviour for everything.

@ericphanson
Copy link
Member Author

ericphanson commented Jan 4, 2022

adhoc API decisions? You approved that PR back in January and then 10 months later decided to make a different API decision in FilesPathBase...

@rofinn
Copy link
Member

rofinn commented Jan 4, 2022

I'm also referring to various changes to how configs, constructors and delimiters are handled over the past couple years, not just that particular PR. I approved cause there was an explicit use-case that needed to be solved and there wasn't anything technically wrong with the solution within that context. I noticed an issue that I missed during code review which made generalizing it problematic. I realize that's frustrating, but it happens.

@omus
Copy link
Member

omus commented Jan 4, 2022

As AWSS3.jl is very sensitive to FilePathsBase.jl changes we should probably use a fixed version of FilePathsBase.jl or a fixed version range. At least this would be a good short term fix but we'd need extensive testing in AWSS3.jl to ensure we can update this range safely

@omus
Copy link
Member

omus commented Jan 4, 2022

Would it be possible to use AbstractPath just be a supertype with no defined behaviour in FilePathsBase and have another type used as the parent for path types wanting to adopt the default behaviour (e.g. InheritAbstractPath <: AbstractPath)? That would allow S3Path to still be a subtype of AbstractPath and make it's own define it's own behaviour without worrying about fallback behaviour. Additionally, this allows end users to continue to use AbstractPath for custom functions expecting paths.

@rofinn
Copy link
Member

rofinn commented Jan 4, 2022

use a fixed version of FilePathsBase.jl or a fixed version range

Yeah, I can just tag a 1.0 release of FilePathsBase.jl. Any new features (to support new julia releases) would then be minor releases that you can choose to accept or not.

Would it be possible to use AbstractPath just be a supertype with no defined behaviour in FilePathsBase and have another type used as the parent for path types wanting to adopt the default behaviour

Sure, but that would be yet another breaking release before 1.0. Why can't S3Path just overload all the default behaviour since it'll need to do that anyways? Is it an issue with ambiguities or something? FWIW, a couple concerns to consider are:

  1. Is overloading all the default behaviour going to introduce a maintenance burden (ie: APIs between Base.Filesystem, FilePathsBase, AWSS3 further deviate from one another over time)?
  2. Is all the overloaded behaviour conceptually the same as the methods in Base.Filesystem and FilePathsBase? If we want behaviour to better align with awscli then maybe that should just exist as separate module/functions?

@ExpandingMan
Copy link
Contributor

My take on this is that the problem is that the underlying S3 API is a mess and contains some half-assed features without proper support (e.g. versioning). I suppose there's a good argument to be made that it was just a key-value store so nobody should have been treating it like a file system, but the fact is that all sorts of things that commonly make use of S3 assume it can be used that way and this has certainly made its way into the API itself by now. I therefore cannot see the benefit in dropping FilePathsBase as we all know it's going to be used like a file system and it's just going to wind up implementing all the methods anyway. The practical effect will be that we will have FilePathsBase look-alike which is far less compatible than it could be. As @rofinn pointed out, even if every method from FilePathsBase had to be re-implemented, it seems preferable to do so rather than abandon it.

That said, I think this package has exposed a number of issues that suggest some reformation of FilePathsBase is in order. It seems somewhat overzealous in assuming that everything which looks like a file system must implement everything required for POSIX compatibility. Certainly one could implement a file system without any concept of permissions, for example. It's been a while since I've looked at FilePathsBase, but I think that some traits-based interface for indicating different file system capabilities on the part of implementations is in order.

@ExpandingMan
Copy link
Contributor

I have made a speculative draft PR outlining what I think a more general FilePathsBase might look like. I have done very little, but I was hoping that interested parties in this thread could comment on the approach before I put any real effort into it. Please let me know your thoughts.

bors bot added a commit that referenced this issue Feb 4, 2022
232: Set FilePathsBase compatibility to 0.9.11 - 0.9.15 r=omus a=omus

Fixes #227 by avoiding using FilePathsBase versions that require `readdir` when using `join=true` to return strings. As changing the return type could be breaking for some users we'll just limit the versions of FilePathsBase supported here for now for this release. For AWSS3.jl version 0.10 we'll correct that problem in this package and start using a minimum version of FilePathsBase 0.9.16

At first glance appears to address #226 but it does not as we are just temporarily setting an upperbound.

Also, we cannot use Pkg version ranges using a hypen as that requires a minimum version of Julia 1.4

Co-authored-by: Curtis Vogt <[email protected]>
bors bot added a commit that referenced this issue Feb 4, 2022
232: Set FilePathsBase compatibility to 0.9.11 - 0.9.15 r=omus a=omus

Fixes #227 by avoiding using FilePathsBase versions that require `readdir` when using `join=true` to return strings. As changing the return type could be breaking for some users we'll just limit the versions of FilePathsBase supported here for now for this release. For AWSS3.jl version 0.10 we'll correct that problem in this package and start using a minimum version of FilePathsBase 0.9.16

At first glance appears to address #226 but it does not as we are just temporarily setting an upperbound.

Also, we cannot use Pkg version ranges using a hypen as that requires a minimum version of Julia 1.4

Co-authored-by: Curtis Vogt <[email protected]>
@ExpandingMan
Copy link
Contributor

Another suggestion might be to strip this package down to fundamental functionality and create a separate package S3Paths.jl which depends on both this package and FilePathsBase, perhaps that would be best for everyone.

If I wound up being the steward of S3Paths.jl (which I'd be willing to do), I'd do the following;

  • Drop support for features which I consider "broken" in the underlying API such as versioning.
  • Only perform testing with Minio. Hopefully between that and AWS tests for AWSS3.jl we'd be mostly covered 🤷

@rofinn
Copy link
Member

rofinn commented Feb 8, 2022

I do worry that splitting this up further might just make the compatibility issues worse. Part of the problem is that FilePathsBase.jl doesn't have any control over how AWSS3 overrides functionality, apart from the provide test suite. If we were to introduced a 3rd package then it would be one more point of failure. I do think your suggestion of moving towards a more trait based approach could help reduce those issues to some extent, I just haven't gotten around to reviewing your proposal yet. I wonder if it'd be simpler for S3Paths.jl package to just extend AWS.jl directly and ignore some of the assumptions/conventions that AWSS3.jl is making?

Drop support for features which I consider "broken" in the underlying API such as versioning

You mean drop support for object versioning? Yeah, I agree, I don't think AWS ever fully flushed that out. I always kinda thought versions should be accessible via a more specific convention (ie: s3://<bucket_name>/<version>/<prefix>)?

@ararslan
Copy link
Member

ararslan commented Feb 8, 2022

Versioning is the worst... I hate being able to recover something I accidentally overwrote.

@ExpandingMan
Copy link
Contributor

I wonder if it'd be simpler for S3Paths.jl package to just extend AWS.jl directly and ignore some of the assumptions/conventions that AWSS3.jl is making?

That would be wonderful, but unfortunately the reality is that AWS.jl is very thin, and the AWS API is not very good, so sadly I don't think there's any way around wrapping it.

Versioning is the worst... I hate being able to recover something I accidentally overwrote.

I wasn't addressing the merits of versioning, only that the AWS API treats it as a "second class" feature, not to mention that it's not a concept that's built into FilePathsBase.

@ericphanson
Copy link
Member Author

ericphanson commented Feb 8, 2022

I think different folks have different needs. Recently, I've been advocating internally that we should write our own s3 utilities on top of AWS.jl, since that already provides a full wrapper of the AWS API, and then drop AWSS3. I see Beacon's needs (or at least that of my team) as...

  1. a Julia object to represent a bucket + key + version to serve as a reference to data/files
    • helpful to be a single object; may be stored in a table or hardcoded in a package, const DATASET = ....
    • needs to support serialization to different formats (JSON, Arrow, etc)
    • must support s3 versioning for reproducibility
  2. a convenient way to read/write/query s3 objects, versions, and prefixes
    • respecting granular permissions: I may be able to query /bucket/path/dir but not /bucket/path/dir2 or /bucket/path (ref Add prefix to _s3_exists_dir to avoid permissions issues #214).
    • needed for both interactive use at the REPL and programmatic use in package code
    • possibly with different AWS.jl backends (Downloads.jl or HTTP.jl)
  3. utilities for performance: multipart writes, splitting reads of large objects into multiple byte-range'd requests, helpers for bulk actions (threaded or async'd)

Currently, S3Paths serve both purposes (1) and (2), and I don't see an advantage to splitting them (if I have a reference to an object, I might as well be able to Base.read it, right?). It also helps to provide some path-like functionality for use in generic code, but I think this should be fairly tightly constrained and conservative, because...

  1. All the above needs to be very stable and robust.

Patch releases of dependencies must not knowingly break the code. It's just not viable to have issues like that at the base of a software stack in my opinion.

@ExpandingMan
Copy link
Contributor

For me portability of code between S3 and local file systems is a huge part of why AWSS3.jl is so useful. There are a number of things such as parquet which sort of force you to treat them the same way, and it can be useful for "microservices" which treat S3 as a shared storage resource. I agree that pretending S3 paths are real file system paths is not ideal, but too many things already do that for me to embrace that.

@omus
Copy link
Member

omus commented Feb 9, 2022

The purpose of the path-type for me has always to been to support the creation of generic functions which require something path-like and to allow for specialization when needed. Over time FilePathsBase.jl has become more specific to file systems while AWSS3.jl's S3Path has become more S3 specific. These systems having some non-overlapping features have resulted in the pain points we've all felt.

I think the best path forward forward would be some kind of trait system where S3Paths can specify what methods it supports which can be used in generic path functions. S3Path should also be free to provide S3 specific functionality, like versioning, but should avoid breaking the generic usage. I can also see the possibility of additional S3 types for specializing on Minio S3 paths vs. AWS S3 paths. It may even be useful to have S3 types/parameters for allowing users to specify the behaviour of functions like isfile.

Most likely how we actually proceed here will be to create new S3Path type(s) instead of modifying the existing implementation. It would be great to have a shared path supertype that provides little to no default functionality but probably any variation of this change would be too disruptive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants