Support pagination #962

Open
fcollonval opened this issue Sep 5, 2022 · 25 comments

Comments

@fcollonval (Member)

fcollonval commented Sep 5, 2022

Problem

It would be good to optionally support pagination for GET requests (at least for contents; sessions and kernels are probably less relevant).

Proposed Solution

Implement pagination following GitHub's logic (except for cursor-based pagination): https://docs.github.com/en/rest/overview/resources-in-the-rest-api#pagination

Below is a copy of the referenced documentation.

The idea is to implement this in a backward-compatible manner: a ContentsManager that does not accept pagination will keep working, and no Link header will be set, which notifies the frontend. Conversely, a frontend that does not send the per_page query argument will receive all elements.

On the Python side, ContentsManager.get will gain two new keyword arguments: page and per_page.
The question is how to return the last page. I propose to add it to the model returned by get as an optional attribute.

The handler will take care of setting the Link header if pagination is supported.
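As a sketch of the proposal (all names here are illustrative, not an agreed API), get could accept the two keywords and slice the directory content, with the handler building the Link header from an optional last_page attribute on the model:

```python
import math

class PaginatedContentsManager:
    """Sketch only: a ContentsManager-like get() accepting page/per_page."""

    def __init__(self, entries):
        self._entries = entries  # stand-in for a real directory listing

    def get(self, path, page=None, per_page=None):
        model = {"type": "directory", "path": path, "content": list(self._entries)}
        if page is not None:
            # Paginated request: slice one page and report the last page
            # as an optional attribute on the model.
            per_page = per_page or 30
            model["last_page"] = max(1, math.ceil(len(model["content"]) / per_page))
            start = (page - 1) * per_page
            model["content"] = model["content"][start : start + per_page]
        return model  # no page argument -> full listing, backward compatible


def link_header(base_url, page, per_page, last_page):
    """Build a GitHub-style Link header for the handler's response."""
    links = []
    if page < last_page:
        links.append(f'<{base_url}?page={page + 1}&per_page={per_page}>; rel="next"')
    links.append(f'<{base_url}?page={last_page}&per_page={per_page}>; rel="last"')
    return ", ".join(links)
```

A frontend omitting page would get the unchanged full model, so existing clients keep working.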


Pagination

Requests that return multiple items will be paginated to 30 items by default. You can specify further pages with the page parameter. For some resources, you can also set a custom page size up to 100 with the per_page parameter. Note that for technical reasons not all endpoints respect the per_page parameter, see events for example.

$ curl 'https://api.github.com/user/repos?page=2&per_page=100'

Note that page numbering is 1-based and that omitting the page parameter will return the first page.

For more information on pagination, check out our guide on Traversing with Pagination.

Link header

Note: It's important to form calls with Link header values instead of constructing your own URLs.

The Link header includes pagination information. For example:

Link: <https://api.github.com/user/repos?page=3&per_page=100>; rel="next",
  <https://api.github.com/user/repos?page=50&per_page=100>; rel="last"

The example includes a line break for readability.

This Link response header contains one or more Hypermedia link relations, some of which may require expansion as URI templates.

The possible rel values are:

The possible rel values are:

  • next: the link relation for the immediate next page of results.
  • last: the link relation for the last page of results.
  • first: the link relation for the first page of results.
  • prev: the link relation for the immediate previous page of results.
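For reference, a client can extract these rel values with a minimal parser (a sketch, not a full RFC 8288 implementation):

```python
import re

def parse_link_header(value):
    """Parse a Link header into a {rel: url} mapping.

    Minimal sketch: assumes URLs contain no commas and rel values
    are double-quoted, as in the GitHub example above.
    """
    links = {}
    for part in value.split(","):
        m = re.search(r'<([^>]+)>\s*;\s*rel="([^"]+)"', part)
        if m:
            links[m.group(2)] = m.group(1)
    return links
```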

Additional context

UI responsiveness is severely reduced when a folder contains lots of items. Pagination would help improve that.

@davidbrochart (Contributor)

Looks like we could use https://uriyyo-fastapi-pagination.netlify.app for Jupyverse.

@krassowski (Collaborator)

Cross-referencing previous related discussions: #538 jupyterlab/jupyterlab#8700

@kevin-bates (Member)

Thanks for opening this issue.

Do we know where the cost is occurring? (I apologize that I haven't caught up on the related discussions in Lab.) If I touch 10,000 files, the GET request is around 1.7 secs (which isn't cheap, although direct os.listdir calls appear to be 7-8 ms, so this cost is probably in building the response model). Does the presentation add to that and by what amount?

Is the idea that we'd capture the result-set and hold on to it across (stateful) requests? Since os.listdir (and friends) are not idempotent relative to their results, it seems we'd need a way to persist/cache the results. We also need to sort the results prior to the initial response - which implies they all need to be present.

If we were to forgo state management, we could recalculate on each access but could encounter "missing" or "duplicate" items if file-system changes affected a previously traversed page. This approach doesn't seem very user-friendly.

I also see Lab repeating the GET request about every 10 seconds or so, presumably in an attempt to keep its presentation consistent with the actual "file system". I'm assuming this would constitute another series of traversals (i.e., a fresh result-set) and essentially enable the ability to "see changes" that occurred during the previous traversal (as is the case today). As a result, we'd need to be able to distinguish the difference between a fresh GET request vs. a traversal GET request that happens to use the same parameters (page=1) as a fresh request.

The question is how to return the last page.

I'm not following the concern here. Isn't the last page specified in the link header?

@davidbrochart (Contributor)

Is the idea that we'd capture the result-set and hold on to it across (stateful) requests?

I don't think so; we want to keep the server stateless. I think that if a client wants a paginated result, it accepts the risk of getting inconsistent results.

@kevin-bates (Member)

I don't think so; we want to keep the server stateless.

I agree with this in principle, although the server is already stateful when it comes to kernel management (unless you proxy to a Gateway server) and Sessions for that matter. Introducing state is definitely a commitment and something we should be in agreement with - which is why I asked if we know where the cost really is. If the cost is building the response then perhaps we can introduce an option that responds with a "raw payload". But if the true cost is in sorting and rendering the results, it seems like this should be addressed in the client (which appears to be what much of the discussion is about in the lab issues).

If we retained a stateless approach but implemented paging, we're looking at the 7-8 ms (for large directories) plus sorting of the entire results per page requested, along with the cost to build the response for that page. This would make things worse than they are now since we'd be multiplying the overall cost by a factor of N (pages) that would otherwise be amortized in a stateful approach. This seems like an expensive proposition only to reduce the payload to a page of results.

@davidbrochart (Contributor)

There is no sorting needed on the server side, right?
For better performance, the server can use scandir.

@kevin-bates (Member)

If the results of the os.xxx calls are not idempotent (scandir also states that "the entries are yielded in arbitrary order"), then we need to sort so that the pages are laid out in an ordered fashion. Otherwise, you could find the same files on each page of results.

@davidbrochart (Contributor)

davidbrochart commented Sep 7, 2022

But I don't know whether "arbitrary order" means "not reproducible order". If there is no particular order, but two consecutive calls on the same file system yield the same result, that's fine.

@davidbrochart (Contributor)

And from what I can see, it is reproducible.

@kevin-bates (Member)

kevin-bates commented Sep 7, 2022

True, (and great!) although I think the end-user experience will be odd since you'd see results from each page interleaved amongst the existing pages because that display is sorted. As a result, the user wouldn't see the correct set of entries until all entries have been fetched.

@davidbrochart (Contributor)

I'd say the end-user experience is the responsibility of the client. It could display a spinner showing that the directory content is being rendered, for instance.

@ellisonbg (Contributor)

ellisonbg commented Sep 7, 2022 via email

@kevin-bates (Member)

I am not following the discussion here closely, but wanted to voice
agreement that the REST APIs should remain stateless

Yes, I believe we're all in agreement here. It's really a matter of protecting the server from making things worse and paging w/o state, IMHO, makes things worse - for both the server and end-user.

@fcollonval @krassowski - what would Lab do with paging? Would it retain a partial result set based on where the user's cursor is within the results - discarding other results that are (sufficiently) "out of view"? Would it assume the results are ordered relative to the complete result set? Or would it essentially build a complete result set, retaining all previous sets (which would imply the crux of the issue is simply the response containing the complete results)?

@fcollonval (Member, Author)

Thanks for the discussion.

Preliminary comment: to clarify the proposal, the new query arguments will be transmitted to the contents manager, so a virtual file system backed by a database can optimize requests.

Do we know where the cost is occurring? (I apologize that I haven't caught up on the related discussions in Lab.) If I touch 10,000 files, the GET request is around 1.7 secs (which isn't cheap, although direct os.listdir calls appear to be 7-8 ms, so this cost is probably in building the response model). Does the presentation add to that and by what amount?

The first bottleneck is usually the JupyterLab frontend (too many DOM nodes to handle). But in a previous discussion, developers at Siemens saw that the cost of calling os.lstat can become non-negligible. So they implemented pagination.

Is the idea that we'd capture the result-set and hold on to it across (stateful) requests? Since os.listdir (and friends) are not idempotent relative to their results, it seems we'd need a way to persist/cache the results. We also need to sort the results prior to the initial response - which implies they all need to be present.

If we were to forgo state management, we could recalculate on each access but could encounter "missing" or "duplicate" items if file-system changes affected a previously traversed page. This approach doesn't seem very user-friendly.

I would go for a stateless request too. I tested this in Gmail; there, pagination is stateless: if you receive emails while navigating, page N will not match the previous page N, as elements are shifted by the number of new elements received. I think this is fine, but it will forbid endless scrolling in the frontend.

Regarding the sorting of items, this is actually a pretty good point. It seems that, in addition to pagination, we should add a query argument for ordering. But that raises the question of which values would be accepted for ordering; in Notebook and JupyterLab, only the name and the last modification time are available. The name ordering is not guaranteed by os.listdir (as mentioned earlier), and getting the modification time for sorting will not improve efficiency for the standard code, as os.lstat is what takes non-negligible execution time. So this is definitely an open question.
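A stateless server honouring an ordering argument would re-sort the full listing on every request and slice out one page; a minimal sketch (field names assumed, not part of any agreed API):

```python
import math

def paginate(entries, page=1, per_page=30, sort_by="name", reverse=False):
    """Stateless pagination sketch: sort the complete listing on every
    request, then slice out one page.

    `sort_by` is 'name' or 'last_modified' here, mirroring the two
    orderings available in Notebook/JupyterLab; entries are plain dicts.
    """
    ordered = sorted(entries, key=lambda e: e[sort_by], reverse=reverse)
    last_page = max(1, math.ceil(len(ordered) / per_page))
    start = (page - 1) * per_page
    return ordered[start : start + per_page], last_page
```

The sort happens on every request, which is exactly the per-page cost multiplication discussed above; the sketch only shows the stateless contract.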

I also see Lab repeating the GET request about every 10 seconds or so, presumably in an attempt to keep its presentation consistent with the actual "file system". I'm assuming this would constitute another series of traversals (i.e., a fresh result-set) and essentially enable the ability to "see changes" that occurred during the previous traversal (as is the case today). As a result, we'd need to be able to distinguish the difference between a fresh GET request vs. a traversal GET request that happens to use the same parameters (page=1) as a fresh request.

Indeed this is to keep the displayed list consistent with the backend content.

The question is how to return the last page.

I'm not following the concern here. Isn't the last page specified in the link header?

The question is how to return the last page from the ContentsManager so the handler can produce the Link header. Should it be part of the directory model? That seems awkward, as it depends on the per_page argument. Should it be returned as a second tuple element (model, last_page)? Making that backward compatible adds complexity, and it feels weird when the path is a file. We could maybe expand the directory model with the number of items instead.
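If the directory model were expanded with the total number of items, the handler could derive the last (and next) page itself, keeping pagination details out of the ContentsManager; a hypothetical sketch:

```python
import math

def links_from_total(base_url, page, per_page, total_items):
    """Derive the Link header from a total item count on the model
    (e.g. a hypothetical model["total"]), instead of having the
    ContentsManager compute last_page itself."""
    last = max(1, math.ceil(total_items / per_page))
    links = []
    if page < last:
        links.append(f'<{base_url}?page={page + 1}&per_page={per_page}>; rel="next"')
    links.append(f'<{base_url}?page={last}&per_page={per_page}>; rel="last"')
    return ", ".join(links)
```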

what would Lab do with paging? Would it retain a partial result set based on where the user's cursor is within the results - discarding other results that are (sufficiently) "out of view"? Would it assume the results are ordered relative to the complete result set? Or would it essentially build a complete result set, retaining all previous sets (which would imply the crux of the issue is simply the response containing the complete results)?

Given all the previous elements, the best approach would be a panel displaying only one page at a time (refreshed periodically, or when a file system action is triggered from the interface).

Lastly, to add one more element to the discussion: in JupyterLab the user can filter the list of folder items. To be consistent with pagination, this sounds like yet another query argument to pass to the backend...


Another approach

A simpler API could be to only support pagination as a way of providing the directory listing in chunks for the frontend. The UI can quickly show something, but the final list will only be available once all chunks have been obtained. Something similar to an endless scrolling list, with the side effect that sorting and filtering may change the order of the initial list.
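This chunked alternative amounts to slicing the listing into fixed-size pieces and letting the UI render them as they arrive; sketched as a generator:

```python
def iter_chunks(entries, chunk_size=100):
    """Yield the directory listing in fixed-size chunks, so the
    frontend can render partial content while the rest is fetched.
    The final order is only known once all chunks have arrived."""
    for i in range(0, len(entries), chunk_size):
        yield entries[i : i + chunk_size]
```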

@davidbrochart (Contributor)

Maybe we should wait and take advantage of the event system to implement that feature. I think it would be a good fit because in the case of huge directories, polling the REST API at regular intervals is extremely inefficient anyway, as the server basically has to send almost the same information repeatedly. With the event system, we can send only the file system changes, which should be very small even for big directories. Initially sending the whole directory content could be done when a client connects, and in chunks.

@kevin-bates (Member)

I've gone ahead and added this to the agenda for tomorrow's Jupyter Server meeting for those interested in this topic. I hope the interested parties can join.

@fcollonval (Member, Author)

fcollonval commented Sep 8, 2022

take advantage of the event system to implement that feature

I would rather not mix APIs to achieve this, by virtue of the separation of concerns.

@davidbrochart (Contributor)

I didn't mean to use both APIs, but one or the other.
It's already the case with the contents API in collaborative mode: JupyterLab doesn't use the REST API anymore but the YDocWebSocket.

@fcollonval (Member, Author)

I didn't mean to use both APIs, but one or the other.

The event API will not provide the initial directory listing, will it? So how will you get it, except by using the contents API?

It's already the case with the contents API in collaborative mode: JupyterLab doesn't use the REST API anymore but the YDocWebSocket.

This is only partially true, as the contents API is still used for everything other than editing files that support collaboration.

@davidbrochart (Contributor)

The event API will not provide the initial directory listing, will it?

Yes, we can decide that it provides the initial directory listing on client connection.

@kevin-bates (Member)

The event API will not provide the initial directory listing, will it? So how will you get it, except by using the contents API?

Yes, we can decide that it provides the initial directory listing on client connection.

IIUC, lab only issues content requests for the currently viewed directory. So if the user navigates to another directory (even a sub-directory of the former), it must use the Contents service to acquire the initial set of files. It seems that an event-based approach would only satisfy changes to the existing view (directory). Is that correct?

@davidbrochart (Contributor)

The event system could send initial content and changes of the whole directory tree (including sub-directories).

@vidartf (Member)

vidartf commented Sep 8, 2022

Some discussion points from today's meeting that might be simpler than introducing pagination:

  • Enable compression for your server, e.g. jupyterlab/jupyterlab#4634 ("How can we gzip the assets (specifically the main.bundle) which is loaded by JLab on browser"). Maybe this should be enabled by default?
  • Don't use the default no-store cache busting in JupyterLab for contents requests? If we can rely on the ETag for the contents API (which we should ensure jupyter server allows), then both directory listings and file content can be cached by the browser. Is there some history in Lab where caching against the contents API did not work as expected? Or is this just an overly cautious approach?
  • Allow clients to request an additional content type besides the current JSON, one with better random-access mechanics, e.g. https://jsonlines.org/ for something that is still similar. That way clients can do local pagination, parsing only the JSON for the page of interest. The client would have to opt in and process it correctly (this would e.g. require some additions in jupyterlab/services).
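With JSON Lines, each directory entry is one line, so a client can do local pagination by parsing only the lines it needs; a sketch of both sides (function names hypothetical):

```python
import json

def to_jsonlines(content):
    """Server side (sketch): serialize directory entries as JSON Lines,
    one JSON document per line."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in content)

def read_page(payload, page, per_page):
    """Client side (sketch): 'local pagination' by parsing only the
    lines belonging to the requested page."""
    lines = payload.splitlines()
    start = (page - 1) * per_page
    return [json.loads(line) for line in lines[start : start + per_page]]
```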

@vidartf (Member)

vidartf commented Sep 8, 2022

Advantages of the above: it relies on standard HTTP headers to do all the needed changes, so clients can negotiate with the server without any breakage on either side. E.g.

  • if the server does not support etags, then it should never send 304 (Not Modified) responses
  • most clients support compression, and servers will only compress the response if the client adds gzip etc. to the Accept-Encoding header
  • the server will still reply with JSON as now if the client does not indicate, via the Accept header, that it would prefer another format; servers that do not support the newer content type will send the old one, and clients can then figure out which one they got by inspecting the response.

@vidartf (Member)

vidartf commented Sep 16, 2022

Note that there are also improvements to be made on the server side by using https://docs.python.org/3/library/os.html#os.scandir instead of os.listdir (relevant for #964, which was closed in favour of this issue).

Using scandir() instead of listdir() can significantly increase the performance of code that also needs file type or file attribute information, because os.DirEntry objects expose this information if the operating system provides it when scanning a directory.
