Automatic Sync Depagination #26
Comments
Related : aws/aws-sdk-java#1239 |
I for my part would like a simple internal iterate for de-pagination that takes a
Even a standard pattern of external depagination would require multiple lines instead of the above one liner. But I'm sure whatever you guys come up with will be just the ticket. |
Do you think it would be idiomatic and easy to read if we provided easy access to a stream of the results? Making up some syntax: ec2.describeInstances(request).results().forEach(System.out::println); |
Both approaches are useful - standard iterators and stream-style internal iterators. ie. ideally you should support:
and
You get the stream-style method for free if implementing |
I think the problem with this approach is that when invoking the operation it's not obvious to the SDK whether we should return a single result or return a stream of them. The thing I like about @rdifalco's suggested approach is that it separates the regular, single call from a call that does the pagination. If we want to expose it more 'simply', I think we'd have to have different operation names for each:
ec2.describeInstancesAsStream(request)
.flatMap(r -> r.reservations().stream())
.filter(...)
.collect();
for (DescribeInstancesResponse response : ec2.describeInstancesAsIterable(request)) {
//do processing
} |
We could also consider exposing the different views in the response object, to prevent having multiple methods:
ec2.describeInstances(request).allInstances() // SdkIterable<Instance> (implements Iterable<Instance>)
ec2.describeInstances(request).allInstances().stream() // Stream<Instance>
ec2.describeInstances(request).allInstances().iterator() // Iterator<Instance>
ec2.describeInstances(request).instances() // List<Instance> (single page)
We might then be able to support the DynamoDB-style pagination strategy when customers want standard collection types:
ec2.describeInstances(request).allInstances().asList(DepaginationStrategy.LOAD_ALL); // List<Instance> |
However it is done, it would be nice if it was consistent across all calls. For example, if added to the response objects, then it would be nice if there was a base interface implemented so that the way to get all results or a single page is not custom per request type. In @millems's example, I think the proposal is:
ec2.describeInstances(request).allInstances()
ec2.describeInstances(request).instances()
elb.describeLoadBalancers(request).allLoadBalancers()
elb.describeLoadBalancers(request).loadBalancers()
cloudwatch.listMetrics(request).allMetrics()
cloudwatch.listMetrics(request).metrics()
To the extent possible, I want all of them to work the same way so I can interact with the SDK in a generic way. Something like:
interface PaginatedResponse<T> {
    SdkIterable<T> allResults();
    List<T> singlePage();
}
ec2.describeInstances(request).allResults()
ec2.describeInstances(request).singlePage()
elb.describeLoadBalancers(request).allResults()
elb.describeLoadBalancers(request).singlePage()
cloudwatch.listMetrics(request).allResults()
cloudwatch.listMetrics(request).singlePage()
As an FYI, at Netflix there is a simple pagination helper for the 1.x SDK in use by some teams to make the access a bit more consistent. |
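To illustrate why a shared base interface helps, here is a minimal sketch of a helper that works for any paginated response regardless of service. It is not SDK code: `PaginatedResponseSketch`, `countAll`, and `inMemory` are hypothetical names, and a plain `Iterable<T>` stands in for the proposed `SdkIterable<T>`.

```java
import java.util.Arrays;
import java.util.List;

public class PaginatedResponseSketch {

    /** Hypothetical common interface, following the shape proposed in the comment above. */
    interface PaginatedResponse<T> {
        Iterable<T> allResults();   // assumption: plain Iterable stands in for SdkIterable
        List<T> singlePage();
    }

    /** Service-agnostic helper: counts every result across all pages. */
    static <T> long countAll(PaginatedResponse<T> response) {
        long count = 0;
        for (T ignored : response.allResults()) {
            count++;
        }
        return count;
    }

    /** In-memory stand-in for a service response; no real service call is made. */
    static <T> PaginatedResponse<T> inMemory(List<T> firstPage, List<T> everything) {
        return new PaginatedResponse<T>() {
            @Override public Iterable<T> allResults() { return everything; }
            @Override public List<T> singlePage() { return firstPage; }
        };
    }

    public static void main(String[] args) {
        PaginatedResponse<String> metrics = inMemory(
                Arrays.asList("CPUUtilization"),
                Arrays.asList("CPUUtilization", "NetworkIn", "NetworkOut"));
        System.out.println(countAll(metrics));            // total across all pages
        System.out.println(metrics.singlePage().size());  // first page only
    }
}
```

Because `countAll` only depends on the shared interface, the same helper would apply to EC2, ELB, or CloudWatch responses if they all implemented it.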
One thing we're discussing having is just having an So instead of:
interface Ec2Client {
    DescribeInstancesResponse describeInstances(DescribeInstancesRequest request);
}
we'd have
interface Ec2Client {
    SdkIterable<DescribeInstancesResponse> describeInstances(DescribeInstancesRequest request);
}
interface SdkIterable<T> extends Iterable<T> {
    Stream<T> stream();
}
You could then get a stream of all instances with:
Stream<Instance> allInstances = ec2.describeInstances(request).stream()
    .flatMap(response -> response.instances().stream());
I'm not sure how we will handle APIs that start non-paginated and have pagination added as an option later. It's also not ideal if you just want the non-paginated portion of the response:
Owner owner = s3.listBuckets(request).iterator().next().owner(); // ??? |
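A minimal, self-contained sketch of the `SdkIterable` idea follows. It assumes `stream()` can be a default method built on `StreamSupport`, which the thread does not confirm; `Page`, `samplePages`, and `allInstances` are hypothetical stand-ins for real response types and client calls.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class SdkIterableSketch {

    /** Hypothetical: an Iterable that can also hand out a sequential Stream. */
    interface SdkIterable<T> extends Iterable<T> {
        default Stream<T> stream() {
            return StreamSupport.stream(spliterator(), false);
        }
    }

    /** Hypothetical stand-in for one response page. */
    static final class Page {
        private final List<String> instances;
        Page(String... instances) { this.instances = Arrays.asList(instances); }
        List<String> instances() { return instances; }
    }

    /** Flattens a sequence of pages into the items they contain. */
    static List<String> allInstances(SdkIterable<Page> pages) {
        return pages.stream()
                .flatMap(page -> page.instances().stream())
                .collect(Collectors.toList());
    }

    /** In-memory pages; a real client would fetch these from the service. */
    static SdkIterable<Page> samplePages() {
        List<Page> pages = Arrays.asList(new Page("i-1", "i-2"), new Page("i-3"));
        return pages::iterator; // iterator() is the only abstract method of SdkIterable
    }

    public static void main(String[] args) {
        System.out.println(allInstances(samplePages()));
    }
}
```

Since `iterator()` is the only abstract method, an `SdkIterable` in this sketch can be created from any method reference or lambda that returns an `Iterator`.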
Not sure I follow the Are there any example of getXYZ calls in the 1.11.x SDK that require pagination today? If so, could/should those be listXYZ or describeXYZ instead? |
The The difference between
It's possible that there are We don't always do a good job making sure we are using precisely the right verb in every case. Some APIs were also created before these standards were really nailed down, so the verbs aren't quite right. Unfortunately any potentially-wrong verbs are probably here to stay, even in 2.x. We really want to remain consistent with the service-specific wire-protocol documentation and the other SDKs. It would be confusing for a customer to find the |
The following are the two designs we are considering for automatic de-pagination. Please give us your feedback on these designs and let us know which one you prefer. If you have better ideas, share them too.
Option 1: Pagination on the Response objects
All paginated operations will return a custom iterable that can be used to iterate over multiple response objects. As you process the current response, the SDK will make service calls internally to get the next one. Code sample (print all EC2 reservation IDs)
Use case: You only need the first response and don't want additional service calls.
Option 2: Pagination on both Response objects and the data structure in the response
For each paginated operation supported by the service, there will be two operations in the client. One will return an iterable of the paginated data structure and the other will return an iterable of response objects (same as in Option 1).
a) Work directly with the paginated data structure. Code sample
b) Work with the stream of response objects
|
For these |
Option 2 seems considerably more friendly in the common use case of wanting to access the data. Forcing However I don't see why
|
Either option is fine with me, but option 2 would be more convenient for our use-cases. More detailed comments below. I think option 1 would be more palatable if it could be flattened in a generic way. Perhaps by making the response object iterable:
class DescribeInstancesResponse implements SdkIterable<Reservation> {}
SdkIterable<DescribeInstancesResponse> responseIterable = ec2.describeInstances();
responseIterable.stream()
.flatMap(SdkIterable::stream)
.map(r -> r.reservationId())
.forEach(System.out::println);
Option 2 is probably more convenient for many use-cases, but I think it would take a bit longer for a new user to understand why the two variants of the call exist on the client. Purely from a usage standpoint, @jodastephen's proposal would probably work fine, though SdkIterable doesn't seem like the right name for that return type.
Does any of the additional metadata vary across calls? In 1.11, it looks like it is mostly just generic response metadata from AmazonWebServiceResult and the tokens used to drive pagination. Since the user doesn't need to paginate manually, the tokens probably do not need to be exposed (though maybe that is still useful if a user wanted to implement resume if there was a failure in the middle?). If it is just the generic metadata, then that could be paired separately and the DescribeInstancesResponse object could represent the overall response rather than a response for a particular HTTP request. Something like:
// Makes the signature easier to read for the user and maybe keeps options open
// if additional stuff needs to be added later with less risk of compatibility issues.
public class DescribeInstancesResponse implements SdkIterable<Reservation> {
public Stream<Reservation> stream();
public Iterator<Reservation> iterator();
public Stream<Map.Entry<ResponseMetadata, List<Reservation>>> streamWithMetadata();
}
DescribeInstancesResponse response = ec2.describeInstances();
// stream() would stream the objects users generally care about
ec2.describeInstances().stream()
.map(r -> r.reservationId())
.forEach(System.out::println);
// streamWithMetadata() would stream pairs of the metadata and list of reservations
ec2.describeInstances().streamWithMetadata()
.flatMap(entry -> entry.getValue().stream())
.map(r -> r.reservationId())
.forEach(System.out::println);
// SdkIterable would also implement Iterable
for (Reservation r : ec2.describeInstances()) {
System.out.println(r.reservationId());
} |
@brharrington It depends on the service. For the most part it is generic metadata. For DDB it has the very useful consumed capacity, which users may well want in order to throttle their requests. https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-dynamodb/src/main/java/com/amazonaws/services/dynamodbv2/model/QueryResult.java#L89 |
@MikeFHay great question - what do you think would be the expected behaviour? My initial thought was that we'd throw an |
@kiiadi That's interesting, because my initial impression is that it would behave the same as calling |
@millems would that mean that it would also refetch the data from the first request? If the first request is made when |
|
@kiiadi well if it doesn't support multiple iteration I'd be a bit confused at it being called SdkIterable. Iterables in Java are usually expected to support multiple iteration. There are a few options here, none perfect as far as I can tell:
Personally I like option 5, but I can see the usability argument for option 4. Not sure what your feelings are on eagerly loading the first page. |
From reading all the feedback, I think what we're saying is having the response type be something like this:
public interface Paginated<PageT, ItemT> extends Iterable<ItemT> {
Stream<ItemT> stream();
PageT firstPage();
Stream<PageT> pageStream();
Iterator<PageT> pageIterator();
void forEach(BiConsumer<PageT, ItemT> consumer);
}
In the above case the Paginated<DescribeInstancesResponse, Reservation> describeInstances(DescribeInstancesRequest request);
I can then use it in the following way:
//The standard use-case
for (Reservation r : client.describeInstances()) {
System.out.println(r.reservationId());
}
//The standard use-case using streams
Optional<Reservation> someId = client.describeInstances().stream()
.filter(r -> "Some Id".equals(r.reservationId()))
.findFirst();
//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().firstPage();
//If I need to have response information then I can get that
client.describeInstances().pageStream()
.filter(p -> p.reservations().size() > 20)
.collect(toList());
//If I want to iterate over the items but also have the response detail for a given item
client.describeInstances().forEach((response, item) -> System.out.println(response + ":" + item));
In order to preserve the 'fail-fast' behaviour, the SDK would greedily request the first page (and thus ensure that the request was valid and well-formed) and this page would be used in the first call to |
I really don't care for making the "items" the first class thing for multiple reasons
I propose we flip the suggestion above and have iteration over pages as the first class thing.
public interface SdkIterable<T> extends Iterable<T> {
Stream<T> stream();
}
public interface Paginated<PageT, ItemT> extends SdkIterable<PageT> {
// stream() inherited from SdkIterable
PageT firstPage();
SdkIterable<ItemT> allItems();
// Not sure if we need this? Guess it could be nice
void forEach(BiConsumer<PageT, ItemT> consumer);
}
Example usage
//The standard use-case
for(Reservation r : client.describeInstances().allItems()) {
System.out.println(r.reservationId());
}
//The standard use-case using streams
Optional<Reservation> someId = client.describeInstances().allItems().stream()
.filter(r -> "Some Id".equals(r.reservationId()))
.findFirst();
//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().firstPage();
//If I need to have response information then I can get that
client.describeInstances().stream()
.filter(p -> p.reservations().size() > 20)
.collect(toList());
//If I want to iterate over the items but also have the response detail for a given item
client.describeInstances().forEach((response, item) -> System.out.println(response + ":" + item));
// Can use flatmap if you're into that kind of thing
client.describeInstances().stream()
.flatMap(r -> r.reservations().stream())
.collect(toList());
// Or nested for if that floats your boat
for(DescribeInstancesResponse response : client.describeInstances()) {
for(Reservation r : response.reservations()) {
// Do something
}
} |
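The page-first design can be sketched with a token-driven iterator that fetches each page only when it is asked for. This is an illustration under stated assumptions, not the SDK's implementation: `Page`, `Pages`, `fakeFetch`, and the `nextToken` field are hypothetical, the "service" is an in-memory fake, and the sketch restarts from the first page on every `iterator()` call rather than eagerly loading the first page.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class LazyPagesSketch {

    /** Hypothetical response page: some items plus a token for the next page (null on the last page). */
    static final class Page {
        final List<String> reservations;
        final String nextToken;
        Page(List<String> reservations, String nextToken) {
            this.reservations = reservations;
            this.nextToken = nextToken;
        }
    }

    /** Iterates over pages first; items are a flattened view on top of the pages. */
    static final class Pages implements Iterable<Page> {
        private final Function<String, Page> fetch; // token -> page; null token means "first page"

        Pages(Function<String, Page> fetch) { this.fetch = fetch; }

        @Override
        public Iterator<Page> iterator() {
            // Each iterator starts again from the first page, so multiple iteration works.
            return new Iterator<Page>() {
                private boolean started;
                private String token;

                @Override public boolean hasNext() { return !started || token != null; }

                @Override public Page next() {
                    if (!hasNext()) throw new NoSuchElementException();
                    Page page = fetch.apply(token); // one service call per page, made on demand
                    started = true;
                    token = page.nextToken;
                    return page;
                }
            };
        }

        Stream<Page> stream() { return StreamSupport.stream(spliterator(), false); }

        Stream<String> allItems() { return stream().flatMap(p -> p.reservations.stream()); }
    }

    static int calls = 0; // counts fake service calls, for demonstration only

    /** In-memory fake of a paginated service operation with two pages. */
    static Page fakeFetch(String token) {
        calls++;
        if (token == null) return new Page(Arrays.asList("r-1", "r-2"), "t1");
        return new Page(Arrays.asList("r-3"), null);
    }

    public static void main(String[] args) {
        Pages pages = new Pages(LazyPagesSketch::fakeFetch);
        System.out.println(pages.allItems().collect(Collectors.toList()));
        System.out.println("service calls: " + calls);
    }
}
```

Restarting on each `iterator()` call is only one of the behaviours discussed in the thread; caching the first page or failing on a second iteration would need a different iterator.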
👍 Couple of small tweaks:
|
+1, made edits to the comment above. |
//I can get the first page if that's the only thing I care about
DescribeInstancesResponse firstPage = client.describeInstances().first();
s/first()/firstPage()/ |
I like it! Do we have an answer for APIs that start off non-paginated and have pagination added later on as an optional feature? |
The usage experience in shorea@'s suggestion looks good. It satisfies both use cases (iterating top-level responses and iterating items). And there is no need to use flatMap either :) One problem with all the designs we discussed is a potential backwards-compatibility issue. If a service team changes an existing non-paginated operation into a paginated operation, this will change the API in the generated SDK and break customers. Example:
The service team needs to send large lists in the FooResponse and decides to make this a paginated operation. Such cases happen occasionally. If we don't detect these cases, then the API in the generated SDK will look like:
Solutions
The main problem with this solution is inconsistent naming across paginated operations. Most paginated operations will have names without the "Paginated" keyword (like describeInstances), but the few operations that turned from non-paginated to paginated will have it. This might confuse customers.
In this solution, we don't need to worry about backwards compatibility as the method declaration (foo or describeInstances) will always be the same. More optional parameters might be added in Request/Response shapes to enable pagination when the API transitions from non-paginated to paginated. |
I think we are closing down on the Sync discussion and converging towards a solution. Created #185 to continue the discussion for Async clients. Let us know what you think! |
The feature was released in version "2.0.0-preview-9". Here is a blog post about this feature with code samples. Please try out the feature and give us your feedback. Thank you. |
Customers are currently required to iterate over paginated result sets themselves in almost all locations by making extra service calls to retrieve more results if they're loading large sets of results.
We currently provide depaginators for S3 (S3Objects) and DynamoDBMapper results (PaginatedList), but these patterns are applied inconsistently across the code base.
Provide general-purpose List-view methods for traversing and accessing full result sets without having to make multiple service calls manually or loading the full result set in memory at one time.
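For context, the manual pattern this issue wants to remove usually looks like the token loop below. It is a sketch against an in-memory fake, not a real SDK call: `Page`, `listPage`, and `listAll` are hypothetical names. Note that this version also loads the full result set into memory, which is the other drawback mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ManualPaginationSketch {

    /** Hypothetical response page: items plus a continuation token (null on the last page). */
    static final class Page {
        final List<String> items;
        final String nextToken;
        Page(List<String> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    /** In-memory fake of one paginated service call; a null token requests the first page. */
    static Page listPage(String token) {
        if (token == null) return new Page(Arrays.asList("a", "b"), "page-2");
        if (token.equals("page-2")) return new Page(Arrays.asList("c"), "page-3");
        return new Page(Arrays.asList("d"), null);
    }

    /** The loop every caller has to write by hand when the client does not depaginate. */
    static List<String> listAll() {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = listPage(token); // one extra service call per page
            all.addAll(page.items);
            token = page.nextToken;
        } while (token != null);
        return all;
    }

    public static void main(String[] args) {
        System.out.println(listAll());
    }
}
```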