
Change Datastore QueryResult#next? to use NOT_FINISHED #803

Merged
1 commit merged on Aug 1, 2016

Conversation

blowmage
Contributor

Change what check is made to paginate. Requested by the Datastore team.

[closes #793]

@blowmage blowmage added the api: datastore Issues related to the Datastore API. label Jul 25, 2016
@blowmage blowmage added this to the v0.12 milestone Jul 25, 2016
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Jul 25, 2016
@blowmage
Contributor Author

@pcostell Can you verify this is correct? We'd like to get this resolved and released ASAP. Thanks!

@blowmage blowmage changed the title from "Change Datastore#next? to use NOT_FINISHED" to "Change Datastore QueryResult#next? to use NOT_FINISHED" Jul 25, 2016
@@ -133,7 +133,7 @@ def initialize arr = []
      # end
      #
      def next?
-       !no_more?
+       not_finished?
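The semantics of the change can be sketched with a stubbed result class (FakeQueryResult is invented here for illustration only): next? now keys off NOT_FINISHED rather than "anything but NO_MORE_RESULTS", so a result whose more_results is MORE_RESULTS_AFTER_LIMIT no longer reports a next page.

```ruby
# Stand-in for the library's query results object, for illustration only.
class FakeQueryResult
  def initialize more_results
    @more_results = more_results
  end

  def not_finished?
    @more_results == :NOT_FINISHED
  end

  def no_more?
    @more_results == :NO_MORE_RESULTS
  end

  def next?
    not_finished? # was: !no_more?
  end
end

FakeQueryResult.new(:NOT_FINISHED).next?             #=> true
FakeQueryResult.new(:MORE_RESULTS_AFTER_LIMIT).next? #=> false (was true before this PR)
FakeQueryResult.new(:NO_MORE_RESULTS).next?          #=> false
```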


@blowmage
Contributor Author

This PR is an attempt to un-stick users of the emulator and to be more in line with how the Datastore team wants the client to be used. There is a larger issue of not exposing NOT_FINISHED and always streaming results, but that is outside of this PR and will require a larger set of changes.

@pcostell
Contributor

I guess I'm confused about how next? is used. It seems like maybe it is used for both paging and batching. Right now it is implemented like it should be for paging, but that breaks batching. But the PR changes it to be implemented like it should for batching, but that will break paging.

@blowmage
Contributor Author

The other cloud services don't differentiate between paging and batching. We want consistency with the other services, which can use #next?, #next, and all. But perhaps we should discuss the larger question of how much of the Datastore API needs to be exposed. I'm surprised that NOT_FINISHED should not be exposed to users, for example.

@pcostell
Contributor

Consider the case where a user specifies limit(100). If we were implementing a streaming API, all the multi-RPC handling would happen inside of gRPC. The end result would be the user getting a stream of exactly 100 results (or fewer, if there aren't that many). In this world, there is no such thing as NOT_FINISHED; if the stream wasn't finished, it would have continued to stream data. When the stream completes, more_results will have only 3 possible values: MORE_RESULTS_AFTER_LIMIT, MORE_RESULTS_AFTER_END_CURSOR, or NO_MORE_RESULTS.

In a streaming API world, the API doesn't need any batching primitives (like more_results = NOT_FINISHED). However, it would still expose paging primitives which are user facing and must work across streams (because they'll generally be associated with exposing something to the end user, like a "Next page" button, or fancy ajax infinite scrolling).

Since the desired state is a streaming API, we should hide all batching details inside the client library, since it's really just an implementation detail until streaming is supported.
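A minimal plain-Ruby sketch of that idea, with a stubbed RPC (Batch and fake_run_query are invented for illustration): the client loops internally while more_results is NOT_FINISHED, so callers only ever see a terminal value.

```ruby
Batch = Struct.new(:entities, :more_results, :end_cursor)

# Pretend RPC: seven results, returned in batches of three.
def fake_run_query cursor
  data = (1..7).to_a
  start = cursor || 0
  chunk = data[start, 3] || []
  finished = start + chunk.size >= data.size
  Batch.new chunk, (finished ? :NO_MORE_RESULTS : :NOT_FINISHED), start + chunk.size
end

# The client keeps fetching while NOT_FINISHED; users never see that state.
def run_all
  entities = []
  cursor = nil
  loop do
    batch = fake_run_query cursor
    entities.concat batch.entities
    cursor = batch.end_cursor
    return [entities, batch.more_results] unless batch.more_results == :NOT_FINISHED
  end
end

entities, more_results = run_all
entities     #=> [1, 2, 3, 4, 5, 6, 7]
more_results #=> :NO_MORE_RESULTS (a terminal value; NOT_FINISHED never escapes)
```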

Here's the same discussion on gcloud-node.

@blowmage
Contributor Author

Thanks for the heads up. We will absorb all this and come up with a new recommendation.

@blowmage
Contributor Author

After reading the discussion on gcloud-node I think I have a better appreciation for the terminology being used. To date gcloud-ruby has not differentiated between "batching" and "paging", but what we are intending to offer in Datastore is batching. Unfortunately we have been referring to this as "pagination" because that is what we are calling it in non-Datastore services.

We have no plans currently to provide pagination by incrementing offset. We use the cursor of the previously last returned result to batch fill the results. There may be instances where this approach does not play nicely with limit or offset.

Our original approach was to model the batched list of results in an object, but we have recently implemented Enumerators on the result list's #all method. With more services moving to grpc and streaming, it may make sense to drop the batched list object completely and only expose Enumerators.

# Currently
datastore.run query #=> Datastore::QueryResult

# Future
datastore.run query #=> Enumerator

This would allow us to stream results as they are needed. And users could use Enumerable to collect the results as they want.

# get the first 100 results, regardless of how many batch requests are made
datastore.run(query).take(100)

# use a lazy enumerator to get the first 25 titles
datastore.run(query).lazy.map { |entry| entry["title"] }.take(25)
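One way such an Enumerator might be built, sketched in plain Ruby with a stubbed batch fetch (fetch_batch is hypothetical; the real implementation would issue RPCs):

```ruby
# Pretend RPC wrapper: returns [entities, end_cursor, more_results].
def fetch_batch cursor
  data = ("a".."g").to_a
  start = cursor || 0
  chunk = data[start, 3] || []
  more = start + chunk.size >= data.size ? :NO_MORE_RESULTS : :NOT_FINISHED
  [chunk, start + chunk.size, more]
end

# An Enumerator that pulls batches lazily, only as results are consumed.
def run_as_enum
  Enumerator.new do |yielder|
    cursor = nil
    loop do
      batch, cursor, more = fetch_batch cursor
      batch.each { |entity| yielder << entity }
      break unless more == :NOT_FINISHED
    end
  end
end

run_as_enum.take(4)                      #=> ["a", "b", "c", "d"] — only two batch "requests" made
run_as_enum.lazy.map(&:upcase).first(2)  #=> ["A", "B"]
```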

@bmclean
Contributor

bmclean commented Jul 27, 2016

@blowmage If changed to return an Enumerator instead of Datastore::QueryResult we would still receive the query’s cursor and more_results value though, right?

@quartzmo
Member

Based on @pcostell's comments in the gcloud-node thread on this issue, I think any solution must continue to expose cursor and more_results:

This means we'll need to expose the end_cursor and more_results so that the user can...

@blowmage
Contributor Author

Also, starting in v1beta3 there is a per-result cursor that we will also need to continue to support.

@quartzmo
Member

We currently have a looping method, all_with_cursor, that exposes the cursor for each result:

query = datastore.query "Task"
tasks = datastore.run query
tasks.all_with_cursor do |task, cursor|
  puts "Task #{task.key.id} (#{cursor})"
end

Can we add more_results to the block? (It's not pretty to go beyond two block parameters, but...)

query = datastore.query "Task"
tasks = datastore.run query
tasks.all_with_cursor do |task, cursor, more_results|
  puts "Task #{task.key.id} (#{cursor})"
  if more_results == "NO_MORE_RESULTS"
    # disable some UI element...
  end
end

@blowmage
Contributor Author

blowmage commented Jul 27, 2016

The #all_with_cursor Enumerator batches requests as needed, so it probably doesn't make much sense to pass more_results as an argument to the yielded block, since the value will change as more and more batches are made. I don't see the benefit. The intention is to be analogous to Enumerator#each_with_index.

One option would be to place the value on the Enumerator object.

require "gcloud"

gcloud = Gcloud.new
datastore = gcloud.datastore

query = datastore.query("Task")
results = datastore.run query

results #=> #<Enumerator: [...]:run more_results="NOT_FINISHED">
results.take(100).each do |entry|
  puts entry["name"]
end
results #=> #<Enumerator: [...]:run more_results="NO_MORE_RESULTS">

@quartzmo
Member

@blowmage Can you tweak your example to show both cursor and more_results being accessible? (I agree that more_results does not need to be available per-result; its value should be NOT_FINISHED until iteration completes.)

@blowmage
Contributor Author

I can't say for sure how it would work, I'm just throwing out some ideas. It is possible that the Enumerator resets each time it is used, and it isn't possible for the more_results value shown below to be mutable on the Enumerator object. But if possible, we could squash the QueryResults behavior onto the Enumerator object:

require "gcloud"

gcloud = Gcloud.new
datastore = gcloud.datastore

query = datastore.query("Task")
results = datastore.run query

results #=> #<Enumerator: [...]:run more_results="NOT_FINISHED">

# 10 results
results.take(10).each do |entry|
  puts entry["name"]
end

results.more_results #=> "NOT_FINISHED"
results.not_finished? #=> true

# take 100 results with cursor, with two batch API requests, but there are only 70 results available...
results.take(100).count #=> 70
results.take(100).each_with_cursor do |entry, cursor|
  puts "#{entry["name"]} - #{cursor}"
end

results.more_results #=> "NO_MORE_RESULTS"
results.no_more? #=> true
results #=> #<Enumerator: [...]:run more_results="NO_MORE_RESULTS">

It is worth pointing out that the 10 results pulled will also be pulled in the take(100) call.

@quartzmo
Member

@blowmage Thanks!
@pcostell How does the example above look to you?

@pcostell
Contributor

The example seems reasonable, but I'm a little concerned with how take and limit interact. If take is the ruby-esque / gcloud-ruby way of doing things, maybe we can drop setting a limit on query and pass through take to the query before actually running it.

@quartzmo
Member

quartzmo commented Jul 28, 2016

@pcostell My understanding is that gcloud-ruby's limit works with Datastore's limit/offset-based pagination, but that Ruby's Enumerable#take, which can be called on the return value from gcloud-ruby's all, works with Datastore's cursor-based batching. To the extent that these mechanisms are intended to be used separately in Datastore, they should be used separately in gcloud-ruby. (Is there any practical reason to use limit with cursors? Edit: Yes, see below.)

If you dig into the implementation of QueryResults#all, you will see that it will make repeated API requests using the last cursor, until there are NO_MORE_RESULTS. Using take will transform this relatively open-ended iteration into a concrete array of a specific length.

Use case: Give me the top 5 records of 100,000 records.
Solution: Use limit, do not use all or take.

Use case: Give me an in-memory array holding the top 1,000 records of 100,000 records.
Solution: Use all and take, do not use limit.
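The two use cases can be illustrated in plain Ruby (simulated batches, no Datastore calls; make_results is invented for illustration): take truncates an open-ended, batch-fetching enumeration on the client, while limit would bound the query on the server.

```ruby
# Three simulated batches standing in for repeated API requests.
batches = [(1..40).to_a, (41..80).to_a, (81..100).to_a]

# Build a fresh enumerator plus a probe for how many batches were fetched.
make_results = lambda do
  count = 0
  enum = Enumerator.new do |y|
    batches.each do |batch|
      count += 1
      batch.each { |e| y << e }
    end
  end
  [enum, -> { count }]
end

results, fetched = make_results.call
results.take(5).size #=> 5
fetched.call         #=> 1 — only the first batch was needed

results, fetched = make_results.call
results.take(1_000).size #=> 100
fetched.call             #=> 3 — every batch drained, then truncated client-side
```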

@bmclean
Contributor

bmclean commented Jul 28, 2016

@quartzmo What we currently have implemented is the following:
For a particular entity kind give me the top 25 records of 100,000 records from datastore.
The user scrolls down the page, and we use the provided cursor to query for the next 25 records. And the next. And so on.
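That pattern, simulated in plain Ruby (run_page and the integer cursors are invented for illustration; real Datastore cursors are opaque):

```ruby
ALL_TASKS = ("a".."z").to_a # stands in for 100,000 records

# Hypothetical helper: run a query with a limit, resuming at a cursor.
def run_page limit:, cursor: nil
  start = cursor || 0
  page = ALL_TASKS[start, limit] || []
  [page, start + page.size] # the page plus its end cursor
end

page1, cursor = run_page(limit: 25)                  # first page for the UI
page2, _end   = run_page(limit: 25, cursor: cursor)  # fetched as the user scrolls
page1.size #=> 25
page2      #=> ["z"]
```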

@quartzmo
Member

@bmclean So for the above, you use limit and cursor together? So then it is offset and cursor that are nonsensical to use together, actually? Thanks for the example.

But I'm pretty sure you would never use all (and take) for this use case, you use the cursor yourself directly, correct?

@bmclean
Contributor

bmclean commented Jul 28, 2016

@quartzmo Correct, we use limit and cursor together for most of our user facing queries. Yes, we use the cursor directly. Not sure about offset, I don't think we have ever used it.

I love the idea of using all and take but bringing back more entities from datastore than we need to display at any given time will increase our page load.

@quartzmo
Member

@bmclean Now that you mention it, this use case seems totally obvious (and probably super common). Appreciate your input!

@pcostell
Contributor

cursor + limit is very common for Cloud Datastore (e.g. showing a page of results on your site). @bmclean's use case is also what the more_results field (and the paging feature in general) is used for: knowing when to stop trying to get more results in the UI.

I believe cursor + offset is less useful, but still totally possible (start at the cursor, but then skip 100 results).

You might use it if you have a paging UI (eg Google web search):

Prev Page 1 2 3 4 5 6 7 Next Page

Prev Page and Next Page would just use cursors to continue, but if you wanted to skip here from page 4 to page 6 you would use the page 4 end cursor plus an offset of page-size to skip over page 5.
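Simulated in plain Ruby (run and the integer cursors are invented for illustration), the page-4-to-page-6 jump looks like:

```ruby
PAGE_SIZE = 10
ITEMS = (1..100).to_a

# Hypothetical helper: run a query with a limit, optional cursor, and offset.
def run limit:, cursor: nil, offset: 0
  start = (cursor || 0) + offset
  page = ITEMS[start, limit] || []
  [page, start + page.size]
end

# Page 4 covers items 31..40, so its end cursor is 40.
_page4, cursor4 = run(limit: PAGE_SIZE, cursor: 30)

# Jump to page 6: resume at page 4's end cursor and skip one page-size (page 5).
page6, _ = run(limit: PAGE_SIZE, cursor: cursor4, offset: PAGE_SIZE)
page6.first #=> 51
page6.last  #=> 60
```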

@quartzmo
Member

Currently, this PR simply corrects the logic in QueryResult#next? to use NOT_FINISHED. Am I correct to think that this is adequate, and that the PR can be accepted?

Each new QueryResults instance returned by #next already exposes more_results and cursor. Based on the use cases above, I do not see a need for the Enumerator returned by #all to expose these values as well, but does anyone else?

@timanovsky

What is the resolution here? It would be great to get a 0.12 release; meanwhile I have to monkey-patch to bring this fix in.

@quartzmo
Member

quartzmo commented Aug 1, 2016

I'm going to accept this PR presently for release today unless anyone objects.

@quartzmo quartzmo merged commit 0f70805 into googleapis:master Aug 1, 2016
@timanovsky

@quartzmo It looks like 0.12 has already been released. Will there be separate release containing this PR?

@quartzmo
Member

quartzmo commented Aug 1, 2016

@timanovsky Yes. I am working now on release 0.12.1.

@quartzmo
Member

quartzmo commented Aug 1, 2016

@timanovsky Please try 0.12.1 and confirm the fix. Thanks again.

@timanovsky

@quartzmo confirming

Successfully merging this pull request may close these issues.

QueryResults.next? keeps returning true forever under emulator
6 participants