-
Notifications
You must be signed in to change notification settings - Fork 641
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AWS S3 added listObjects endpoint including common prefixes for a delimiter #2023
Conversation
At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user |
I've added the missing email address and signed the CLA |
I'm not sure what the story with the failing build is. The same test is failing for me in Intellij, but even before my commits. When I run it in SBT directly it's fine for both. |
That's alright. We've previously identified this as a flakey test. It needs further investigation, but it may just need a longer timeout since travis VMs can be rather underpowered. Thanks for the PR! I'll do a review later today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that there is a lot of duplicate code between the listBucket
and listObjects
DSL and implementation. It looks like it would be natural to merge ListBucketResultCommonPrefixes
into ListBucketResultContents
and make the prefix
property optional and only set when a prefix is used in the original query. The only downside to this is that it would require modifying ListBucketResultContents
which would break backwards compatibility for users who use this type in tests, but we could add constructor overloads to support the additional field. Also, since it's optional, it has an obvious base case in the existing constructor of None
.
Before I review any further I want to understand your rationale and general thoughts about consolidating this return type in case I misunderstood something.
EDIT: I thought about it a little more and I see how it makes sense to treat them differently. It would be nice to DRY up the code in S3Stream
though. Maybe you could pass a function from listObjects
and listBucket
to a consolidated implementation method in S3Stream
that will construct the return type you want. For the object with prefix results, you could return a tuple of (ListBucketResultContents, ListBucketResultCommonPrefixes)
instead of trying to combine them together with the new base trait you added.
* | ||
* The `alpakka.s3.list-bucket-api-version` can be set to 1 to use the older API version 1 | ||
* | ||
* @see https://docs.aws.amazon.com/AmazonS3/latest/API/v2-RESTBucketGET.html (version 1 API) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this is actually the v2 API. Can you update this reference and the one in listBucket
javadoc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those links seem to be outdated and redirect to https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html (and https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html accordingly). I'll update the links and fix the version numbers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do, see eg. #2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. i've updated the other s3 api links too
* @param s3Headers any headers you want to add | ||
* @return [[akka.stream.scaladsl.Source Source]] of [[ListBucketResultBase]] | ||
*/ | ||
def listObjects(bucket: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this should be another listBucket
overload. Is there a reason you gave it a different name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or better yet, listBucketWithPrefix
to accommodate the different return type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
listBucketWithPrefixes could work. Then we could make the delimiter parameter even non-optional.
I've taken the listObjects name from the actual S3 API Naming as I was shortly confused myself which alpakka S3 function is linked to which API call. But I guess this only makes sense if in the long run the listBucket function gets deprecated and completely replaced by listObjects*. Until then it would probably just confuse more. So yea lets make it listBucketWithPrefixes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh it seems this "list contents/prefixes" endpoint used to be called "GET Bucket (List Objects)" in the documentation but has been renamed to "ListObjectsV2". Now I get where the original name is coming from.
We still stick to listBucket and listBucketWithPrefixes? (or better listBucketWithCommonPrefixes to be accurate?)
Thanks for the feedback. Sorry about the DRY, it seemed some discussion would follow so I wanted to keep it WET and wait with premature optimisations until we've decided the final interface. Probably should have mentioned that ;) Once it's in place I'll follow your guide.
One streaming event can be EITHER a Contents OR a CommonPrefixes item. So I'm not sure how returning a tuple would work here. The only way to keep it separate would be to return a tuple of Sources with those types. |
Sounds good!
I see. Thanks for the clarification. I understand now why you took this approach. Since we're exploding the contents results there's no real appropriate place to include the prefixes result too, so you've added them to the stream as elements. This leaves it to the end user to match on the correct concrete type. I suppose one alternative would be to add a reference to the prefixes in every |
You mean every |
I mean iterate over Though your reply made me think of another approach. Instead of merging results and common prefixes into a single stream you could return a |
I think the CommonPrefixes should be Elements of a Source as well as there might still be a lot of them.
That's what I meant earlier by a tuple of sources. I prolly should use more code as examples instead of confusing descriptions ;) How about this interface: def listBucketWithCommonPrefixes(
bucket: String,
delimiter: String,
prefix: Option[String] = None,
s3Headers: S3Headers = S3Headers.empty
): Source[(ListBucketResultCommonPrefixes, ListBucketResultContents), NotUsed] This way the user could consume both independently. Even though I wonder if we should internally keep the common base trait. Then the user could still opt in to concat both sources and do pattern matching on a single stream (similar to the original approach). |
True, but they've already been deserialized to a
Do you mean this: Source[(Source[ListBucketResultCommonPrefixes, NotUsed], Source[ListBucketResultContents, NotUsed]), NotUsed] |
Exactly, the
Ah sorry, I actually meant: def listBucketWithCommonPrefixes(
bucket: String,
delimiter: String,
prefix: Option[String] = None,
s3Headers: S3Headers = S3Headers.empty
):
(
Source[ListBucketResultCommonPrefixes, NotUsed],
Source[ListBucketResultContents, NotUsed]
) = ??? Here are a few usage use cases I'd see with this interface: // I don't care about the contents, only commonPrefixes
val (commonPrefixes, _) = listBucketWithCommonPrefixes("bucket","/")
commonPrefixes.runForeach(???)
// I care about both separately
val (commonPrefixes, contents) = listBucketWithCommonPrefixes("bucket","/")
commonPrefixes.runForeach(???)
contents.runForeach(???)
// That's not really a use case as you can just use the existing listBucket function if you're only interested in the contents
val (_, contents) = listBucketWithCommonPrefixes("bucket","/")
// if we keep the ListContentsResultBase trait one could even merge them to one single stream and use pattern matching later
val (commonPrefixes, contents) = listBucketWithCommonPrefixes("bucket","/")
val merged: Source[ListBucketResultBase, NotUsed] = commonPrefixes.merge(contents)
merged.runForeach {
case contents: ListBucketResultContents => ???
case prefixes: ListBucketResultCommonPrefixes => ???
} EDIT: For javadsl we could use akka.japi.Pair |
I see. That could work. I would have thought this would be the most common use case: // That's not really a use case as you can just use the existing listBucket function if you're only interested in the contents
val (_, contents) = listBucketWithCommonPrefixes("bucket","/") Because you're using the delimiter to essentially filter one level of results, but you don't care about the common prefixes. Do you want to experiment with this API? |
I don't think keeping |
True!
Yep I'll update the PR |
My current use case is similar to the S3 console inside AWS: A "directory" based browser, listing both, files and commonPrefixes. Pretty much like an Doing it yourself without a common base trait would involve mapping to an Either and then pattern match later again. It's not too bad but seems rather unnecessarily clunky: val (commonPrefixes, contents) = listBucketWithCommonPrefixes("bucket","/")
val commonPrefixesEither = commonPrefixes.map(Right(_))
val contentsEither = contents.map(Left(_))
commonPrefixesEither.merge(contentsEither).runForeach {
case Left(content) => ???
case Right(commonPrefix) => ???
} Or am I missing some easier way? I'd have thought leaving a common base trait wouldn't affect the API directly anyway so it does not hurt but makes it easier for the end user in case he needs it. In any case, that's not a show stopper for me I'll continue doing the other changes. |
Well, the implementation of the Tuple Source seems to be tricky. I'm now down to having a Source[(Seq[ListBucketResultCommonPrefixes], Seq[ListBucketResultContents]), NotUsed] (every element is a page from the result) (
Source[ListBucketResultCommonPrefixes, NotUsed],
Source[ListBucketResultContents, NotUsed]
) without forcing the client to use custom graph logic like e.g. described in https://stackoverflow.com/questions/38438983/split-akka-stream-source-into-two Do you have any suggestions @seglo ? Otherwise returning to the original single source using a base trait might be the better solution after all? You can still fairly easy consume e.g. only the contents: def listBucketWithCommonPrefixes(
bucket: String,
delimiter: String,
prefix: Option[String] = None,
s3Headers: S3Headers = S3Headers.empty
): Source[ListBucketResultBase, NotUsed] = ???
val contentsOnly: Source[ListBucketResultContents, NotUsed] =
listBucketWithCommonPrefixes("bucket", "/").collect {
case contents: ListBucketResultContents => contents
} EDIT: The only things comes to mind is using two Source.queue and feeding them but that seems a pretty dodgy workaround |
I see. I'm unfamiliar with this AWS S3 utility. When you do a
I think the return type you mention here is fine. I tried playing around with this as well and discovered that it doesn't really make sense to return the nested sources since they would be emitted for every result anyway. I realize this makes your use case a little more tricky, but I think it's still the best option over mixing the elements together with a common base trait. We could also forego returning a tuple/pair and just return the concrete upstream result object (or something like it). I think you already have this code, but in case it's useful I pushed my experimental code as a demonstration. |
Alright, I've pushed it all. The only main change I made to your code was making the Otherwise:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking really good. I left a few comments.
Something else I was considering was adding a new listBucket
overload that only returns ListBucketResultContents
for the one hierarchy level since that's going to be a common use case. It could have the same return type as the other listBuckets
because it's just flattening out the one sequence and skipping common prefixes. We can do this in another PR though.
Thanks for your patience!
it should "list keys and common prefixes for a given bucket with a prefix and delimiter using the version 1 api" in { | ||
mockListBucketAndCommonPrefixesVersion1() | ||
|
||
//#list-bucket-and-common-prefixes-attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a reference to this code snippet in docs. These types of comments anchor code snippets that get extracted into paradox.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, that was c&p from the listBucket which is used for the S3 Attributes example. I guess it's still ok to test the passing of the attributes but no need for the paradox comments. I'll get rid of them
Co-Authored-By: Sean Glover <[email protected]>
Co-Authored-By: Sean Glover <[email protected]>
I've added it. Have a look if that's what you're after 3ea4954 . |
Exactly what I was looking for. Thanks! |
Some stage 1 check failures. If you look at the right-most field it gives you the command you can run locally. https://github.com/akka/alpakka/pull/2023/checks?check_run_id=321556931 |
…llection conversions
Thanks for the tip. Was some back and forth due to cross compilation, looking good now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks for the contribution @an-tex ! I appreciated your patience during the discussion :) |
Purpose
The listObjects endpoint allows to specify a delimiter, which is used to return only one hierarchy level of objects. This allows for a directory style browsing (like in the AWS S3 console) without having to retrieve all objects.
References
References #2021
Changes
Background Context
I've choosen to create a new function instead of changing the existing listBucket one. The listBucket function returns a
ListBucketResultContents
, which can't be a common prefix. Compared to the actual listObjects endpoint, which can return both,ListBucketResultContents
andListBucketResultCommonPrefixes
(API, Guide).To reflect the two kinds of stream events, the new listObjects Source emits a
ListBucketResultBase
trait, which can either be aListBucketResultContents
orListBucketResultCommonPrefixes
.This change would have broken existing implementations, that's why I opted to create a new function. Does this make sense?
I postponed updating the documentation until this approach is validated.