Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS S3 added listObjects endpoint including common prefixes for a delimiter #2023

Merged
merged 19 commits into from
Nov 27, 2019

Conversation

an-tex
Copy link
Contributor

@an-tex an-tex commented Nov 20, 2019

Purpose

The listObjects endpoint allows to specify a delimiter, which is used to return only one hierarchy level of objects. This allows for a directory style browsing (like in the AWS S3 console) without having to retrieve all objects.

References

References #2021

Changes

  • added listObjects function taking an optional delimiter parameter

Background Context

I've choosen to create a new function instead of changing the existing listBucket one. The listBucket function returns a ListBucketResultContents, which can't be a common prefix. Compared to the actual listObjects endpoint, which can return both, ListBucketResultContents and ListBucketResultCommonPrefixes (API, Guide).

To reflect the two kinds of stream events, the new listObjects Source emits a ListBucketResultBase trait, which can either be a ListBucketResultContents or ListBucketResultCommonPrefixes.

This change would have broken existing implementations, that's why I opted to create a new function. Does this make sense?

I postponed updating the documentation until this approach is validated.

@lightbend-cla-validator

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

@an-tex
Copy link
Contributor Author

an-tex commented Nov 20, 2019

At least one pull request committer is not linked to a user. See https://help.github.com/en/articles/why-are-my-commits-linked-to-the-wrong-user#commits-are-not-linked-to-any-user

I've added the missing email address and signed the CLA

@an-tex
Copy link
Contributor Author

an-tex commented Nov 21, 2019

I'm not sure what the story with the failing build is. The same test is failing for me in Intellij, but even before my commits. When I run it in SBT directly it's fine for both.

@seglo
Copy link
Contributor

seglo commented Nov 21, 2019

I'm not sure what the story with the failing build is. The same test is failing for me in Intellij, but even before my commits. When I run it in SBT directly it's fine for both.

That's alright. We've previously identified this as a flakey test. It needs further investigation, but it may just need a longer timeout since travis VMs can be rather underpowered.

#1802

Thanks for the PR! I'll do a review later today.

@seglo seglo self-assigned this Nov 21, 2019
Copy link
Contributor

@seglo seglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that there is a lot of duplicate code between the listBucket and listObjects DSL and implementation. It looks like it would be natural to merge ListBucketResultCommonPrefixes into ListBucketResultContents and make the prefix property optional and only set when a prefix is used in the original query. The only downside to this is that it would require modifying ListBucketResultContents which would break backwards compatibility for users who use this type in tests, but we could add constructor overloads to support the additional field. Also, since it's optional, it has an obvious base case in the existing constructor of None.

Before I review any further I want to understand your rationale and general thoughts about consolidating this return type in case I misunderstood something.

EDIT: I thought about it a little more and I see how it makes sense to treat them differently. It would be nice to DRY up the code in S3Stream though. Maybe you could pass a function from listObjects and listBucket to a consolidated implementation method in S3Stream that will construct the return type you want. For the object with prefix results, you could return a tuple of (ListBucketResultContents, ListBucketResultCommonPrefixes) instead of trying to combine them together with the new base trait you added.

*
* The `alpakka.s3.list-bucket-api-version` can be set to 1 to use the older API version 1
*
* @see https://docs.aws.amazon.com/AmazonS3/latest/API/v2-RESTBucketGET.html (version 1 API)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is actually the v2 API. Can you update this reference and the one in listBucket javadoc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those links seem to be outdated and redirect to https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html (and https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html accordingly). I'll update the links and fix the version numbers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do, see eg. #2024

Copy link
Contributor Author

@an-tex an-tex Nov 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. i've updated the other s3 api links too

* @param s3Headers any headers you want to add
* @return [[akka.stream.scaladsl.Source Source]] of [[ListBucketResultBase]]
*/
def listObjects(bucket: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that this should be another listBucket overload. Is there a reason you gave it a different name?

Copy link
Contributor

@seglo seglo Nov 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or better yet, listBucketWithPrefix to accommodate the different return type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listBucketWithPrefixes could work. Then we could make the delimiter parameter even non-optional.

I've taken the listObjects name from the actual S3 API Naming as I was shortly confused myself which alpakka S3 function is linked to which API call. But I guess this only makes sense if in the long run the listBucket function gets deprecated and completely replaced by listObjects*. Until then it would probably just confuse more. So yea lets make it listBucketWithPrefixes

Copy link
Contributor Author

@an-tex an-tex Nov 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh it seems this "list contents/prefixes" endpoint used to be called "GET Bucket (List Objects)" in the documentation but has been renamed to "ListObjectsV2". Now I get where the original name is coming from.

We still stick to listBucket and listBucketWithPrefixes? (or better listBucketWithCommonPrefixes to be accurate?)

@an-tex
Copy link
Contributor Author

an-tex commented Nov 22, 2019

Thanks for the feedback. Sorry about the DRY, it seemed some discussion would follow so I wanted to keep it WET and wait with premature optimisations until we've decided the final interface. Probably should have mentioned that ;) Once it's in place I'll follow your guide.

For the object with prefix results, you could return a tuple of (ListBucketResultContents, ListBucketResultCommonPrefixes) instead of trying to combine them together with the new base trait you added.

One streaming event can be EITHER a Contents OR a CommonPrefixes item. So I'm not sure how returning a tuple would work here. The only way to keep it separate would be to return a tuple of Sources with those types.

@seglo
Copy link
Contributor

seglo commented Nov 22, 2019

Thanks for the feedback. Sorry about the DRY, it seemed some discussion would follow so I wanted to keep it WET and wait with premature optimisations until we've decided the final interface. Probably should have mentioned that ;) Once it's in place I'll follow your guide.

Sounds good!

One streaming event can be EITHER a Contents OR a CommonPrefixes item. So I'm not sure how returning a tuple would work here. The only way to keep it separate would be to return a tuple of Sources with those types.

I see. Thanks for the clarification. I understand now why you took this approach. Since we're exploding the contents results there's no real appropriate place to include the prefixes result too, so you've added them to the stream as elements. This leaves it to the end user to match on the correct concrete type. I suppose one alternative would be to add a reference to the prefixes in every ListBucketResultContents emitted for that response. That would keep the return type the same. WDYT?

@an-tex
Copy link
Contributor Author

an-tex commented Nov 22, 2019

I suppose one alternative would be to add a reference to the prefixes in every ListBucketResultContents emitted for that response. That would keep the return type the same. WDYT?

You mean every ListBucketResultContents element contains an additional field of type Source[ListBucketResultCommonPrefixes] as the reference?

@seglo
Copy link
Contributor

seglo commented Nov 22, 2019

You mean every ListBucketResultContents element contains an additional field of type Source[ListBucketResultCommonPrefixes] as the reference?

I mean iterate over ListBucketResult.contents and emit downstream a new type like ListBucketResultContents, but that includes a Seq[ListBucketResultCommonPrefixes].

Though your reply made me think of another approach. Instead of merging results and common prefixes into a single stream you could return a Source for each, which the user could then handle separately.

@an-tex
Copy link
Contributor Author

an-tex commented Nov 22, 2019

I mean iterate over ListBucketResult.contents and emit downstream a new type like ListBucketResultContents, but that includes a Seq[ListBucketResultCommonPrefixes].

I think the CommonPrefixes should be Elements of a Source as well as there might still be a lot of them.

Though your reply made me think of another approach. Instead of merging results and common prefixes into a single stream you could return a Source for each, which the user could then handle separately.

That's what I meant earlier by a tuple of sources. I prolly should use more code as examples instead of confusing descriptions ;)

How about this interface:

  def listBucketWithCommonPrefixes(
                  bucket: String,
                  delimiter: String,
                  prefix: Option[String] = None,
                  s3Headers: S3Headers = S3Headers.empty
): Source[(ListBucketResultCommonPrefixes, ListBucketResultContents), NotUsed]

This way the user could consume both independently. Even though I wonder if we should internally keep the common base trait. Then the user could still opt in to concat both sources and do pattern matching on a single stream (similar to the original approach).

@seglo
Copy link
Contributor

seglo commented Nov 22, 2019

I think the CommonPrefixes should be Elements of a Source as well as there might still be a lot of them.

True, but they've already been deserialized to a Seq from that page/response, so referencing that Seq multiple times shouldn't be a concern, though it could be confusing if the common prefixes are paged as well.

Source[(ListBucketResultCommonPrefixes, ListBucketResultContents), NotUsed]

Do you mean this:

Source[(Source[ListBucketResultCommonPrefixes, NotUsed], Source[ListBucketResultContents, NotUsed]), NotUsed]

@an-tex
Copy link
Contributor Author

an-tex commented Nov 22, 2019

True, but they've already been deserialized to a Seq from that page/response, so referencing that Seq multiple times shouldn't be a concern, though it could be confusing if the common prefixes are paged as well.

Exactly, the CommonPrefixes should be treated just like the Contents especially due to the paging.

Source[(ListBucketResultCommonPrefixes, ListBucketResultContents), NotUsed]

Do you mean this:

Source[(Source[ListBucketResultCommonPrefixes, NotUsed], Source[ListBucketResultContents, NotUsed]), NotUsed]

Ah sorry, I actually meant:

  def listBucketWithCommonPrefixes(
                                    bucket: String,
                                    delimiter: String,
                                    prefix: Option[String] = None,
                                    s3Headers: S3Headers = S3Headers.empty
                                  ):
  (
    Source[ListBucketResultCommonPrefixes, NotUsed],
    Source[ListBucketResultContents, NotUsed]
  ) = ???

Here are a few usage use cases I'd see with this interface:

// I don't care about the contents, only commonPrefixes
val (commonPrefixes, _) = listBucketWithCommonPrefixes("bucket","/")
commonPrefixes.runForeach(???)

// I care about both separately
val (commonPrefixes, contents) = listBucketWithCommonPrefixes("bucket","/")
commonPrefixes.runForeach(???)
contents.runForeach(???)

// That's not really a use case as you can just use the existing listBucket function if you're only interested in the contents
val (_, contents) = listBucketWithCommonPrefixes("bucket","/")

// if we keep the ListContentsResultBase trait one could even merge them to one single stream and use pattern matching later
val (commonPrefixes, contents) = listBucketWithCommonPrefixes("bucket","/")
val merged: Source[ListBucketResultBase, NotUsed] = commonPrefixes.merge(contents)
  merged.runForeach {
    case contents: ListBucketResultContents => ???
    case prefixes: ListBucketResultCommonPrefixes => ???
  }

EDIT: For javadsl we could use akka.japi.Pair

@seglo
Copy link
Contributor

seglo commented Nov 22, 2019

I see. That could work. I would have thought this would be the most common use case:

// That's not really a use case as you can just use the existing listBucket function if you're only interested in the contents
val (_, contents) = listBucketWithCommonPrefixes("bucket","/")

Because you're using the delimiter to essentially filter one level of results, but you don't care about the common prefixes.

Do you want to experiment with this API?

@seglo
Copy link
Contributor

seglo commented Nov 22, 2019

// if we keep the ListContentsResultBase trait one could even merge them to one single stream and use pattern matching later
val (commonPrefixes, contents) = listBucketWithCommonPrefixes("bucket","/")
val merged: Source[ListBucketResultBase, NotUsed] = commonPrefixes.merge(contents)
  merged.runForeach {
    case contents: ListBucketResultContents => ???
    case prefixes: ListBucketResultCommonPrefixes => ???
  }

I don't think keeping ListBucketResultBase is that useful. It seems odd to me to have these two different things in one stream. If the user really wants to do that they can do it themselves.

@an-tex
Copy link
Contributor Author

an-tex commented Nov 25, 2019

Because you're using the delimiter to essentially filter one level of results, but you don't care about the common prefixes.

True!

Do you want to experiment with this API?

Yep I'll update the PR

@an-tex
Copy link
Contributor Author

an-tex commented Nov 25, 2019

I don't think keeping ListBucketResultBase is that useful. It seems odd to me to have these two different things in one stream. If the user really wants to do that they can do it themselves.

My current use case is similar to the S3 console inside AWS: A "directory" based browser, listing both, files and commonPrefixes. Pretty much like an ls in the shell. So I need exactly that one stream ;)

Doing it yourself without a common base trait would involve mapping to an Either and then pattern match later again. It's not too bad but seems rather unnecessarily clunky:

  val (commonPrefixes, contents) = listBucketWithCommonPrefixes("bucket","/")
  
  val commonPrefixesEither = commonPrefixes.map(Right(_))
  val contentsEither = contents.map(Left(_))
  
  commonPrefixesEither.merge(contentsEither).runForeach {
    case Left(content) => ???
    case Right(commonPrefix) => ???
  }

Or am I missing some easier way?

I'd have thought leaving a common base trait wouldn't affect the API directly anyway so it does not hurt but makes it easier for the end user in case he needs it. In any case, that's not a show stopper for me I'll continue doing the other changes.

@an-tex
Copy link
Contributor Author

an-tex commented Nov 25, 2019

Well, the implementation of the Tuple Source seems to be tricky. I'm now down to having a

Source[(Seq[ListBucketResultCommonPrefixes], Seq[ListBucketResultContents]), NotUsed]

(every element is a page from the result)
but I don't know how to turn that into the desired

(
    Source[ListBucketResultCommonPrefixes, NotUsed],
    Source[ListBucketResultContents, NotUsed]
  )

without forcing the client to use custom graph logic like e.g. described in https://stackoverflow.com/questions/38438983/split-akka-stream-source-into-two

Do you have any suggestions @seglo ?

Otherwise returning to the original single source using a base trait might be the better solution after all? You can still fairly easy consume e.g. only the contents:

 def listBucketWithCommonPrefixes(
                                    bucket: String,
                                    delimiter: String,
                                    prefix: Option[String] = None,
                                    s3Headers: S3Headers = S3Headers.empty
                                  ): Source[ListBucketResultBase, NotUsed] = ???

  val contentsOnly: Source[ListBucketResultContents, NotUsed] =
    listBucketWithCommonPrefixes("bucket", "/").collect {
      case contents: ListBucketResultContents => contents
    }

EDIT: The only things comes to mind is using two Source.queue and feeding them but that seems a pretty dodgy workaround

@seglo
Copy link
Contributor

seglo commented Nov 25, 2019

My current use case is similar to the S3 console inside AWS: A "directory" based browser, listing both, files and commonPrefixes. Pretty much like an ls in the shell. So I need exactly that one stream ;)

I see. I'm unfamiliar with this AWS S3 utility. When you do a ls does it mix common prefixes with results? Could you paste a snippet?

Well, the implementation of the Tuple Source seems to be tricky. I'm now down to having a

Source[(Seq[ListBucketResultCommonPrefixes], Seq[ListBucketResultContents]), NotUsed]

(every element is a page from the result)
but I don't know how to turn that into the desired

I think the return type you mention here is fine. I tried playing around with this as well and discovered that it doesn't really make sense to return the nested sources since they would be emitted for every result anyway.

I realize this makes your use case a little more tricky, but I think it's still the best option over mixing the elements together with a common base trait. We could also forego returning a tuple/pair and just return the concrete upstream result object (or something like it).

I think you already have this code, but in case it's useful I pushed my experimental code as a demonstration.

https://github.com/an-tex/alpakka/pull/1/files

@an-tex
Copy link
Contributor Author

an-tex commented Nov 26, 2019

Alright, I've pushed it all. The only main change I made to your code was making the delimiter parameter mandatory. As otherwise the commonPrefixes would always be empty and the function would behave exactly like the original listBucket.

Otherwise:

  • paradox documentation
  • DRY
  • scala&java specs
  • merge upstream
  • build successful

Copy link
Contributor

@seglo seglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking really good. I left a few comments.

Something else I was considering was adding a new listBucket overload that only returns ListBucketResultContents for the one hierarchy level since that's going to be a common use case. It could have the same return type as the other listBuckets because it's just flattening out the one sequence and skipping common prefixes. We can do this in another PR though.

Thanks for your patience!

docs/src/main/paradox/s3.md Outdated Show resolved Hide resolved
s3/src/main/scala/akka/stream/alpakka/s3/scaladsl/S3.scala Outdated Show resolved Hide resolved
s3/src/main/scala/akka/stream/alpakka/s3/model.scala Outdated Show resolved Hide resolved
it should "list keys and common prefixes for a given bucket with a prefix and delimiter using the version 1 api" in {
mockListBucketAndCommonPrefixesVersion1()

//#list-bucket-and-common-prefixes-attributes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reference to this code snippet in docs. These types of comments anchor code snippets that get extracted into paradox.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true, that was c&p from the listBucket which is used for the S3 Attributes example. I guess it's still ok to test the passing of the attributes but no need for the paradox comments. I'll get rid of them

@an-tex
Copy link
Contributor Author

an-tex commented Nov 26, 2019

Something else I was considering was adding a new listBucket overload that only returns ListBucketResultContents for the one hierarchy level since that's going to be a common use case. It could have the same return type as the other listBuckets because it's just flattening out the one sequence and skipping common prefixes. We can do this in another PR though.

I've added it. Have a look if that's what you're after 3ea4954 .

@seglo
Copy link
Contributor

seglo commented Nov 26, 2019

I've added it. Have a look if that's what you're after 3ea4954 .

Exactly what I was looking for. Thanks!

@seglo
Copy link
Contributor

seglo commented Nov 26, 2019

Some stage 1 check failures. If you look at the right-most field it gives you the command you can run locally.

https://github.com/akka/alpakka/pull/2023/checks?check_run_id=321556931

@an-tex
Copy link
Contributor Author

an-tex commented Nov 27, 2019

Some stage 1 check failures. If you look at the right-most field it gives you the command you can run locally.

Thanks for the tip. Was some back and forth due to cross compilation, looking good now!

@seglo seglo self-requested a review November 27, 2019 13:59
Copy link
Contributor

@seglo seglo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@seglo seglo merged commit eab5486 into akka:master Nov 27, 2019
@seglo
Copy link
Contributor

seglo commented Nov 27, 2019

Thanks for the contribution @an-tex ! I appreciated your patience during the discussion :)

@an-tex
Copy link
Contributor Author

an-tex commented Nov 27, 2019

Thanks for the contribution @an-tex ! I appreciated your patience during the discussion :)

Glad I could contribute! Thanks for the guidance @seglo 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants