Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic Sync Depagination #26

Closed
millems opened this issue Jul 3, 2017 · 29 comments
Closed

Automatic Sync Depagination #26

millems opened this issue Jul 3, 2017 · 29 comments
Assignees
Labels
2.0 New feature-request A feature should be added or improved.

Comments

@millems
Copy link
Contributor

millems commented Jul 3, 2017

Customers are currently required to iterate over paginated result sets themselves in almost all locations by making extra service calls to retrieve more results if they're loading large sets of results.

We currently provide depaginators for S3 (S3Objects) and DynamoDBMapper results (PaginatedList), but these patterns are applied inconsistently across the code base.

Provide general-purpose List-view methods for traversing and accessing full result sets without having to make multiple service calls manually or loading the full result set in memory at one time.

@kiiadi
Copy link
Contributor

kiiadi commented Jul 17, 2017

Related : aws/aws-sdk-java#1239

@rdifalco
Copy link

I for my part would like a simple internal iterate for de-pagination that takes a Consumer<AwsResult>. For example:

   ec2.describeInstances(request, result -> System.out::println);

Even a standard pattern of external depagination would require multiple lines instead of the above one liner. But I'm sure whatever you guys come up with will be just the ticket.

@millems
Copy link
Contributor Author

millems commented Jul 26, 2017

Do you think it would be idiomatic and easy to read if we provided easy access to a stream of the results?

Making up some syntax:

ec2.describeInstances(request).results().forEach(System.out::println);

@jodastephen
Copy link

Both approaches are useful - standard iterators and stream-style internal iterators. ie. ideally you should support:

ec2.describeInstances(request).results().forEach(System.out::println);

and

for (Foo foo : ec2.describeInstances(request).results()) {
  // process
}

You get the stream-style method for free if implementing Iterable.

@kiiadi
Copy link
Contributor

kiiadi commented Jul 27, 2017

I think the problem with this approach is that when invoking the operation it's not obvious to the SDK whether we should return a single result or return a stream of them. The thing I like about @rdifalco's suggested approach is that it separates the regular, single call from a call that does the pagination.

If we want to expose it more 'simply' I think we'd have to have a different operation names for Streams and Iterables:

ec2.describeInstancesAsStream(request)
    .flatMap(r -> r.reservations().stream())
    .filter(...)
    .collect();

for (DescribeInstancesResponse response : ec2.describeInstancesAsIterable(request)) {
    //do processing
}

@millems
Copy link
Contributor Author

millems commented Jul 28, 2017

We could also consider exposing the different views in the response object, to prevent having multiple methods:

ec2.describeInstances(request).allInstances() // SdkIterable<Instance> (implements Iterable<Instance>)
ec2.describeInstances(request).allInstances().stream() // Stream<Instance>
ec2.describeInstances(request).allInstances().iterator() // Iterator<Instance>
ec2.describeInstances(request).instances() // List<Instance> (single page)

We might be able to then support the DynamoDB-style pagination strategy when customers want standard collection types:

ec2.describeInstances(request).allInstances().asList(DepaginationStrategy.LOAD_ALL); // List<Instance>

@brharrington
Copy link
Contributor

However it is done, it would be nice if it was consistent across all calls. For example, if added to the response objects, then it would be nice if there was a base interface implemented so that the way to get all results or a single page is not custom per request type. In @millems example, I think the proposal is:

ec2.describeInstances(request).allInstances()
ec2.describeInstances(request).instances()

elb.describeLoadBalancers(request).allLoadBalancers()
elb.describeLoadBalancers(request).loadBalancers()

cloudwatch.listMetrics(request).allMetrics()
cloudwatch.listMetrics(request).metrics()

To the extent possible, I want all of them to work the same way so I can interact with the SDK in a generic way. Something like:

interface PaginatedResponse<T> {
  SdkIterable<T> allResults();
  List<T> singlePage();
}

ec2.describeInstances(request).allResults()
ec2.describeInstances(request).singlePage()

elb.describeLoadBalancers(request).allResults()
elb.describeLoadBalancers(request).singlePage()

cloudwatch.listMetrics(request).allResults()
cloudwatch.listMetrics(request).singlePage()

As an FYI, at Netflix there is a simple pagination helper for the 1.x sdk in use by some teams to make the access a bit more consistent.

@varunnvs92 varunnvs92 self-assigned this Aug 16, 2017
@millems
Copy link
Contributor Author

millems commented Aug 28, 2017

One thing we're discussing having is just having an SdkIterable<T> where T would be the usual return type of the method for paginated APIs.

So instead of:

interface Ec2Client {
    DescribeInstancesResponse describeInstances(DescribeInstancesRequest);
}

we'd have

interface Ec2Client {
    SdkIterable<DescribeInstancesResponse> describeInstances(DescribeInstancesRequest);
}
interface SdkIterable<T> extends Iterable<T> {
    Stream<T> stream();
}

You could then get a stream of all instances with:

Stream<Instance> allInstances = ec2.describeInstances(request).stream()
                                   .flatMap(response -> response.instances().stream());

I'm not sure how we will handle APIs that start non-paginated and have pagination added as an option later. It's also not ideal if you just want the non-paginated portion of the response:

Owner owner = s3.listBuckets(request).iterator().next().owner(); // ???

@brharrington
Copy link
Contributor

Not sure I follow the s3.listBuckets example as it can return many buckets and so the iterable seems like it would fit just fine. I would argue there is a clear distinction between list and describe calls that are expected to return many results and get calls that return a particular item. For example, with s3 I wouldn't expect s3.getS3AccountOwner to ever return more than one item so it could just return the object. However, for ec2.describeImages I am expecting many results and it may need pagination, but at least with 1.11.x it is not supported.

Are there any example of getXYZ calls in the 1.11.x SDK that require pagination today? If so, could/should those be listXYZ or describeXYZ instead?

@millems
Copy link
Contributor Author

millems commented Aug 28, 2017

The listBuckets example is just illustrating what it would look like when you only need a member out of the first response in a paginated API. In this example, the buckets themselves aren't needed - just the value of the "owner" member.

The difference between Get, List, and Describe is kind of nuanced in our current API standards.

Get - Retrieves the contents of one or more resources.
Describe - Retrieves metadata about one or more resources.
List - Retrieves the list of instances of a given resource type.

It's possible that there are Get APIs that could need to be paginated if they have a large set of contents associated with a particular resource, but that seems more rare than List and Describe results.

We don't always do a good job making sure we are using precisely the right verb in every case. Some APIs were also created before these standards were really nailed down, so the verbs aren't quite right.

Unfortunately any potentially-wrong verbs are probably here to stay, even in 2.x. We really want to remain consistent with the service-specific wire-protocol documentation and the other SDKs. It would be confusing for a customer to find the GetSprocketsRequest in the service documentation, but ListSprocketsRequest in the Java SDK.

@varunnvs92
Copy link
Contributor

The following are the two designs we are considering for automatic de-pagination. Please provide your feedback on these designs and let us know which one you like. If you have better ideas, share them too.

Option 1: Pagination on the Response objects

All paginated operations will return a custom iterable that can be used to iterate over multiple response objects. As you process current response, SDK will make service calls internally to get the next one.

Code sample (Print all EC2 reservation ids)

SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstances();
 
responseIterable.stream()
        .flatMap(response -> response.reservations().stream())
        .map(r -> r.reservationId())
        .forEach(System.out::println);

Use case: You only need first response and don't want additional service calls.
Call the first() method on the response which gives you only the first page and doesn't make any more service calls internally.

SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstances();
DescribeInstancesResponse singleResponse = responseIterable.first();

Option 2: Pagination on both Response objects and the data structure in the response

For each paginated operation supported by service, there will be two operations in the client. One will return the iterable of paginated data structure and the other returns iterable of response objects (same as in Option 1)

a) Work directly with the paginated data structure
Example: When calling describeInstances, majority of users might be interested in the stream of reservations (instances) in the response object and not the other metadata. SDK will return an iterable of the reservations, which can be used to iterate through all the reservations in your account. SDK will make service calls internally to provide you a continuous stream of reservations.

Code Sample

SdkIterable<Reservation> reservations = ec2.describeInstances();
 
reservations.stream()
        .map(r -> r.reservationId())
        .forEach(System.out::println);

b) Work with the stream of response objects
This is same as option 1. But the name of operation will be different (The current name is not final and suggestions are welcome)

SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstancesResponses(); 

responseIterable.stream()
        .flatMap(response -> response.reservations().stream())
        .map(r -> r.reservationId())
        .forEach(System.out::println);

@MikeFHay
Copy link

For these SdkIterables, it's not clear to me what the behaviour is if I call .stream() twice on the same object. Does it create a new chain of API calls for each stream?

@jodastephen
Copy link

jodastephen commented Sep 25, 2017

Option 2 seems considerably more friendly in the common use case of wanting to access the data. Forcing flatMap on everyone seems a bit mean.

However I don't see why SdkIterable can't have both methods in option 2.

// dual generics - most use cases wouldn't actually assign this to a variable
SdkIterable<DescribeInstancesResponse, Reservation> paginated = ec2.describeInstances();

// stream() would stream the objects users generally care about
ec2.describeInstances().stream()
    .map(r -> r.reservationId())
    .forEach(System.out::println);

// streamResponses() would stream the responses
ec2.describeInstances().streamResponses()
    .flatMap(response -> response.reservations().stream())
    .map(r -> r.reservationId())
    .forEach(System.out::println);

// SdkIterable would also implement Iterable
for (Reservation r : ec2.describeInstances()) {
    System.out::println(r.reservationId());
}

// and toList() would return the complete list (with appropriate warnings on memory size)
ec2.describeInstances().toList();

@brharrington
Copy link
Contributor

Either option is fine with me, but option 2 would be more convenient for our use-cases. More detailed comments below.


I think option 1 would be more palatable if it could be flattened in a generic way. Perhaps by making the response object iterable:

class DescribeInstancesResponse implements SdkIterable<Reservation> {}

SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstances();
 
responseIterable.stream()
    .flatMap(SdkIterable::stream)
    .map(r -> r.reservationId())
    .forEach(System.out::println);

Option 2 is probably more convenient for many use-cases, but I think it would take a bit longer for a new user to understand why the two variants of the call exist on the client. Purely from a usage standpoint, @jodastephen's proposal would probably work fine, though SdkIterable doesn't seem like the right name for that return type.

Does any of the additional metadata vary across calls? In 1.11, it looks like it is mostly just generic response metadata from AmazonWebServiceResult and the tokens used to drive pagination. Since the user doesn't need to paginate manually, the tokens probably do not need to be exposed (though maybe that is still useful if a user wanted to implement resume if there was a failure in the middle?). If it is just the generic metadata, then that could be paired separately and the DescribeInstancesResponse object could represent the overall response rather than a response for a particular HTTP request. Something like:

// Makes the signature easier to read for the user and maybe keeps options open
// if additional stuff needs to be added later with less risk of compatibility issues.
public class DescribeInstancesResponse implements SdkIterable<Reservation> {
  public Stream<Reservation> stream();
  public Iterator<Reservation> iterator();
  public Stream<Map.Entry<ResponseMetadata, List<Reservation>>> streamWithMetadata();
}

DescribeInstancesResponse response = ec2.describeInstances();

// stream() would stream the objects users generally care about
ec2.describeInstances().stream()
    .map(r -> r.reservationId())
    .forEach(System.out::println);

// streamWithMetadata() would stream pairs of the metadata and list of reservations
ec2.describeInstances().streamWithMetadata()
    .flatMap(entry -> entry.getValue().stream())
    .map(r -> r.reservationId())
    .forEach(System.out::println);

// SdkIterable would also implement Iterable
for (Reservation r : ec2.describeInstances()) {
    System.out::println(r.reservationId());
}

@shorea
Copy link
Contributor

shorea commented Sep 25, 2017

Does any of the additional metadata vary across calls?

@brharrington It depends on the service. For the most part it is generic metadata. For DDB it has the very useful consumed capacity which users may definitely want to throttle their requests. https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-dynamodb/src/main/java/com/amazonaws/services/dynamodbv2/model/QueryResult.java#L89

@kiiadi
Copy link
Contributor

kiiadi commented Sep 25, 2017

For these SdkIterables, it's not clear to me what the behaviour is if I call .stream() twice on the same object. Does it create a new chain of API calls for each stream?

@MikeFHay great question - what do you think would be the expected behaviour? My initial thought was that we'd throw an IllegalStateException, but it may be equally valid to start a new call chain as you elude to.

@millems
Copy link
Contributor Author

millems commented Sep 25, 2017

@kiiadi That's interesting, because my initial impression is that it would behave the same as calling stream() on a Collection, or calling StreamSupport.stream(sdkIterable.spliterator(), false);: you get a new stream.

@brharrington
Copy link
Contributor

brharrington commented Sep 25, 2017

@millems would that mean that it would also refetch the data from the first request? If the first request is made when describeInstances() is called and then stream is called multiple times with some meaningful delay in between, then it could create an odd result where the first set of data is stale and subsequent pages are current.

@kiiadi
Copy link
Contributor

kiiadi commented Sep 25, 2017

SdkIterable probably would only make requests lazily.

@MikeFHay
Copy link

MikeFHay commented Sep 25, 2017

@kiiadi well if it doesn't support multiple iteration I'd be a bit confused at it being called SdkIterable. Iterables in Java are usually expected to support multiple iteration. There are a few options here, none perfect as far as I can tell:

  1. OneShotIterable: Implement Iterable but don't support multiple iteration, and accept the inconsistency with most implementations of Iterable.

  2. SdkStream: Don't implement Iterable, and require users to jump a weird hoop to do a for-each loop, i.e. for(T t : stream::iterator)

  3. LazyIterable: Do support multiple traversal, and accept the inconsistency that SdkIterable return values are fully lazy, whereas non-paginated calls are eager (potentially confusing when some errors aren't thrown immediately).

  4. EagerIterable: Do support multiple traversal, eagerly load the first page, but have subsequent calls to .stream()/.iterator() load it again.

  5. Stream: Forget this whole SdkIterable thing and have all paginated operations take a request and immediately return a plain old Stream<Response>, avoiding much of this ambiguity but requiring users to flatMap to a Stream<T>, and again making for-each loops slightly awkward.

Personally I like option 5, but I can see the usability argument for option 4. Not sure what your feelings are on eagerly loading the first page.

@kiiadi
Copy link
Contributor

kiiadi commented Sep 26, 2017

From reading all the feedback, I think what we're saying is having the response type be something like this:

public interface Paginated<PageT, ItemT> extends Iterable<ItemT> {
    
    Stream<ItemT> stream();
    
    PageT firstPage();

    Stream<PageT> pageStream();

    Iterator<PageT> pageIterator();

    void forEach(BiConsumer<PageT, ItemT> consumer);
}

In the above case the PageT represents each individual SDK response and the ItemT represents the thing that the response contains a list of. So for describeInstances the signature would look like so:

Paginated<DescribeInstancesResponse, Reservation> describeInstances(DescribeInstancesRequest request);

I can then use it in the following way:

//The standard use-case
for (Reservation r : client.describeInstances()) {
    System.out.println(r.reservationId());
}

//The standard use-case using streams
Optional<Reservation> someId = client.describeInstances().stream()
         .filter(r -> r.reservationId() == "Some Id")
         .findFirst();

//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().firstPage();

//If I need to have response information then I can get that
client.describeInstances().pageStream()
         .filter(p -> p.reservations().size() > 20)
         .collect(toList());

//If I want to iterate over the items but also have the response detail for a given item
client.describeInstances().forEach((response, item) -> System.out.println(response + ":" + item));

In order to preserve the 'fail-fast' behaviour, the SDK would greedily request the first page (and thus ensure that the request was valid and well-formed) and this page would be used in the first call to iterator or stream. Subsequent calls to iterator or stream would create a new request chain.

@shorea
Copy link
Contributor

shorea commented Sep 26, 2017

I really don't care for making the "items" the first class thing for multiple reasons

  1. Less clear we are making multiple HTTP requests. When you get back an Iterable of response objects it's much more clear the SDK is making multiple requests. This is potentially dangerous as what may work fine in a dev environment where the number of pages are few, may explode in a prod environment where results could be unbounded
  2. Request level options are more ambiguous. Do request timeouts apply to each request (assuming you know the SDK is making requests), the time it takes to return the iterable (i.e. the first call), or every request to the service (this is how we'd likely implement it).
  3. Deviates from the service's API. DescribeInstances returns a DescribeInstanceResponse, not Reservations. I feel like we are hiding what the service returns which in some cases is useful data.
  4. The layering just feels off to me. Iterating over pages is more powerful and more flexible. That should be our primitive and we should build niceties like item iteration on top of it, not the other way around.
  5. Less clear that service exceptions may be thrown in the middle of iteration. Since you don't know when additional calls are being made (if they are at all) then you just have to know to try/catch around the iteration as a whole.

I propose we flip the suggestion above and have iteration over pages as the first class thing.

public interface SdkIterable<T> extends Iterable<T> {
    Stream<T> stream();
    
}
public interface Paginated<PageT, ItemT> extends SdkIterable<PageT> {
    
    // stream() inherited from SdkIterable

    PageT firstPage();

    SdkIterable<ItemT> allItems();

    // Not sure if we need this? Guess it could be nice
    void forEach(BiConsumer<PageT, ItemT> consumer);
}

Example usage

//The standard use-case
for(Reservation r : client.describeInstances().allItems()) {
    System.out.println(r.reservationId());
}

//The standard use-case using streams
Optional<Reservation> someId = client.describeInstances().allItems().stream()
         .filter(r -> r.reservationId() == "Some Id")
         .findFirst();


//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().firstPage();

//If I need to have response information then I can get that
client.describeInstances().
         .filter(p -> p.reservations().size() > 20)
         .collect(toList());

//If I want to iterate over the items but also have the response detail for a given item
client.describeInstances().forEach((response, item) -> System.out.println(response + ":" + item));

// Can use flatmap if you're into that kind of thing
client.describeInstances().
         .flatmap(r -> r.reservations().stream())
         .collect(toList());

// Or nested for if that floats your boat
for(DescribeInstancesResponse response : client.describeInstances()) {
    for(Reservation r : response.reservations()) {
        // Do something
    }
}

@kiiadi
Copy link
Contributor

kiiadi commented Sep 26, 2017

👍 Couple of small tweaks:

  1. I'd push first() down to the Paginated interface, we may use SdkIterable for other things where the existence of at least one value is not necessarily guaranteed
  2. I'd potentially rename items() to allItems() or some other name that makes it clear that we're getting all items from all pages.

@shorea
Copy link
Contributor

shorea commented Sep 26, 2017

+1, made edits to the comment above.

@brharrington
Copy link
Contributor

//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().first();

s/first()/firstPage()/

@millems
Copy link
Contributor Author

millems commented Sep 26, 2017

I like it! Do we have an answer for APIs that start off non-paginated and have pagination added later on as an optional feature?

@varunnvs92
Copy link
Contributor

The usage experience in shorea@ suggestion looks good. It satisfies both use cases (iterating top level responses and iterating items). And there is no need to use flat map too :)

One problem with all these designs we discussed is potential backwards compatibility issue. If a service team changes an existing non-paginated operation into a paginated operation, this will change the API in the generated SDK and break customers.

Example:
Now Foo is a non-paginated operation

FooResponse foo(FooRequest); // API declaration

FooResponse response = client.foo(FooRequest); //Usage

Service team need to send large lists in the FooResponse and decides to make this a paginated operation. This cases happen occasionally. If we don't detect these cases, then the API in generated SDK will look like:

Paginated<FooResponse, ItemT> foo(FooRequest);  // API declaration

Paginated<FooResponse, ItemT> response = client.foo(FooRequest); //Usage

Solutions

  1. SDK team can detect these changes and make customization to change the API name. So now the client will have following APIs:
FooResponse foo(FooRequest); // Old API

Paginated<FooResponse, ItemT> fooPaginated(FooRequest); // New API

The main problem with this solution is inconsistency of naming across paginated operations. Most paginated operations will have names without "Paginated" keyword (like describeInstances). But few operations that turned from non-paginated to paginated will have "Paginated" keyword. This might cause customer confusion.
The other problem is if SDK validation doesn't catch these cases, customer applications will break.

  1. The other solution is to always have two methods for paginated operations. I will use describeInstances example. The method with normal name (describeInstances) will be same as in v1. It is not auto-paginated and customers has to use do-while loops (or similar) to get all results. The other method (describeInstancesAutoPaginated) will be the auto-paginated design shorea@ mentioned.
DescribeInstancesResponse describeInstances(DescribeInstancesRequest); // v1 style. no auto-pagination
Paginated<DescribeInstancesResponse, Reservation> describeInstancesAutoPaginated(DescribeInstancesRequest);

// Similarly for foo API
FooResponse foo(FooRequest);
Paginated<FooResponse, ItemT> fooAutoPaginated(FooRequest);

In this solution, we don't need to worry about backwards compatibility as the method declaration (foo or describeInstances) will always be the same. More optional parameters might be added in Request/Response shapes to enable pagination when the API transitions from non-paginated to paginated.
The only problem I see with this is discoverability. When I look describeInstances API and IDE shows describeInstances method first and describeInstancesAutoPaginated next, I might not care about the second method and won't find that SDK supports the awesome auto-pagination feature.

@shorea shorea changed the title Automatic Depagination Automatic Sync Depagination Sep 27, 2017
@shorea
Copy link
Contributor

shorea commented Sep 27, 2017

I think we are closing down on the Sync discussion and converging towards a solution. Created #185 to continue the discussion for Async clients. Let us know what you think!

@varunnvs92
Copy link
Contributor

The feature is released in "2.0.0-preview-9" version. Here is a blog post about this feature with code samples. Please try out the feature and provide us your feedback. Thank you.

@justnance justnance added feature-request A feature should be added or improved. and removed Feature Request labels Apr 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.0 New feature-request A feature should be added or improved.
Projects
None yet
Development

No branches or pull requests

9 participants