-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE REQ] improve spring-data-cosmos paging for non-trivial datasets #24262
Comments
@stliu @kushagraThapar could you please take a look |
@Blackbaud-MikeLueders - let me discuss this issue within my team - Here is some general information that I found - https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-pagination @j82w - tagging you here, in case you have any ideas on how to implement pagination in our SDKs with cosmos queries in general. |
@Blackbaud-MikeLueders - We are still discussing this feature with our query team to make sure we cover every aspect of this. Also, another limitation on continuation tokens is that they don't work for DISTINCT / GROUP BY queries, as mentioned here - https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-pagination#continuation-tokens |
@kushagraThapar - No worries, thanks for the update. I think that's a reasonable limitation, especially since DISTINCT does work if you add an ORDER BY. Depending on the route we go, we're likely going to need to include documentation on how paging works with spring-data, so I think this will be fine as long as we clearly document the limitations and include workarounds for the DISTINCT/GROUP BY queries (e.g. using sliceQuery from CosmosTemplate). The documentation specifically says "you cannot use continuation tokens" - do you know specifically what that means? Will a continuation token not be returned as a reply header on the initial query? So is it the case that the library would need to detect use of DISTINCT (without ORDER BY) and GROUP BY and disallow those queries if in 'continuation token' mode? Otherwise, it would 'work' in the sense that the first page of results would be returned but it would appear as there were no more results. Or is there another way to detect this? |
@Blackbaud-MikeLueders Regarding the continuation token, it will be returned from the initial page for the GROUP BY / DISTINCT query, however, SDK won't resume from that continuation token returned in the first page. It throws the error - Line 32 in 3f31d68
|
@Blackbaud-MikeLueders what if the getTotalPages/getTotalElements/etc is made to be lazy while making the hasNext depend on the the continuation token? This way it avoids the count query without causing a breaking change so most users will get the benefit without needing to opt-in via some option. |
@j82w yeah, i think that's probably a better approach. We should still consider a configuration option for this - defaulting to lazy but configurable as eager. The reason being, some clients (such as my company) do not use the built-in autoscale mechanism of cosmos but instead rely on a proxy to the repository object which can be used to intercept method requests and scale up only after retry. We've found that the vast majority of throttle events do not require a scale up but just a single retry to succeed. If we were relying on the count, having the count operation part of the repository method call would be desirable since the retry/autoscale mechanism would kick in - executing the count lazily would require detecting the Page response and wrapping that in a separate proxy. That said, I'm not going to fight for the configuration option since I can't think of a scenario I would use it (since I don't use paging in the first place) but I wanted to present an actual usage scenario for consideration, since there might be clients who rely on an eager count operation. |
I agree with the configuration option. |
@kushagraThapar @j82w are we aligned on an approach? Should I begin working on a PR or are y'all still discussing? |
@Blackbaud-MikeLueders We are aligned, would be great to start the work and then we can continue the discussion on the PR. |
I've closed the associated PR as I no longer believe the overall approach makes sense. Rather than use |
Is your feature request related to a problem? Please describe.
spring-data paging requires row counts in order to function. Row counts in cosmos are generally expensive. For datasets of non-trivial size (say, >100k), a count query which includes an order clause is orders of magnitude more expensive than the same query without the count clause. In the spring-data-cosmos library, every 'paged' query also issues a count query in order to conform to the spring-data page model. For collections of non-trivial size, this makes paged queries prohibitively expensive. For small datasets, this is not an issue and things work as you would expect. For larger datasets, this makes paging unusable.
Describe the solution you'd like
One option is to do away with the facade of paging in spring-data and lean into the continuation token for all operations. This would mean certain methods of org.springframework.data.domain.Page would simply fail - getTotalPages/getTotalElements/previous/etc - while hasNext would rely on the continuation token instead of total element calculation. There are other options like not failing but returning incorrect data, though I think failing would be better from a user perspective so as to not give the false illusion of functioning. In any case, the important part is the elimination of the count query.
This brings up two obvious issues. First, it breaks traditional paging for clients that are using spring-data-cosmos and do not have large datasets. Second, it is not backwards compatible with the existing implementation. To sidestep these issues, a configuration option could be introduced which would allow the client to select which paging strategy to use. The current functionality would remain the default but the user could opt into the continuation token strategy which would not require count queries on paged requests.
Thoughts? Am I missing something wrt to the count queries being extremely inefficient? Are there other approaches which should be considered? I'd like to get consensus on an approach before I begin work on a PR.
Additional context
To be clear, I realize I could use CosmosTemplate:sliceQuery to execute paged queries since no count query is executed as part of the slice method. However, this requires manually constructing the query and using the CosmosTemplate directly, as opposed to using dynamic query derivation via repository methods which accept a Pageable input. Using the CosmosTemplate for paged queries and the repository for all other queries is far from ideal.
Information Checklist
Kindly make sure that you have added all the following information above and checkoff the required fields otherwise we will treat the issuer as an incomplete report
The text was updated successfully, but these errors were encountered: