This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Issues with results when page count gets high #565

Closed
kgodey opened this issue Jul 24, 2020 · 4 comments
Assignees
Labels
🛠 goal: fix Bug fix 🙅 status: discontinued Not suitable for work as repo is in maintenance 🏷 status: label work required Needs proper labelling before it can be worked on

Comments

@kgodey
Contributor

kgodey commented Jul 24, 2020

Email from an API consumer:

I found that we get some error messages from the CC Search API when the page parameter is high, e.g. 170 and above. For page=169 we still get results, but for bigger numbers it returns an internal server error (HTTP 500 or 502) even though page_count is not exceeded.

The example I was going with had the following parameters:

  • text: "trees"
  • source: "wikimedia,thorvaldsensmuseum,thingiverse,svgsilh,statensmuseum,spacex,smithsonian,sketchfab,sciencemuseum,rijksmuseum,rawpixel,phylopic,nypl,nasa,museumsvictoria,met,mccordmuseum,iha,geographorguk,floraon,flickr,europeana,eol,digitaltmuseum,deviantart,clevelandmuseum,brooklynmuseum,bio_diversity,behance,animaldiversity,WoRMS,CAPL,500px"
  • page: 170

When page=169 I get the following metadata:

result_count: 10000
page_count: 250

Another example is:

  • text: "machine%20learning"
  • source: "wikimedia,thorvaldsensmuseum,thingiverse,svgsilh,statensmuseum,spacex,smithsonian,sketchfab,sciencemuseum,rijksmuseum,rawpixel,phylopic,nypl,nasa,museumsvictoria,met,mccordmuseum,iha,geographorguk,floraon,flickr,europeana,eol,digitaltmuseum,deviantart,clevelandmuseum,brooklynmuseum,bio_diversity,behance,animaldiversity,WoRMS,CAPL,500px"
  • page: 201

When page=200 I get the following metadata:

result_count: 10000
page_count: 250
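The relationship between the reported metadata fields can be sketched as simple pagination math. Note the 40-results-per-page figure below is an assumption inferred from the numbers in both examples (10000 / 250 = 40), not something stated in the thread:

```python
import math

def page_count(result_count: int, page_size: int) -> int:
    """Number of pages needed to expose result_count results."""
    return math.ceil(result_count / page_size)

# Both examples above report result_count = 10000 and page_count = 250,
# which is consistent with a page size of 40 results per page.
print(page_count(10000, 40))  # → 250
```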
@dhirensr
Contributor

dhirensr commented Jul 27, 2020

I tried to reproduce the same error for the first query (text = trees, same sources). Page 170 was failing sometimes with 502 Bad Gateway and sometimes worked in Postman; it also sometimes fails with smaller page numbers like page = 100 or 150.
I guess this error is caused by the server getting overloaded and isn't a bug in the code.

@aldenstpage aldenstpage added help wanted Open to participation from the community and removed not ready for work labels Aug 5, 2020
@aldenstpage
Contributor

aldenstpage commented Aug 5, 2020

It looks like the cause of this is inefficient link rot validation. If you jump straight to page 170 before any other similar requests have been made and cached, the result is that the server walks through every image in the prior pages and sends a HEAD request. That means 3400 HEAD requests for q=trees&page=170! This times out and you get a gateway error. The good news is that most people start on page 0 and work their way up to it; if you take that approach, the page works since the cache is warmed up.
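The cost asymmetry described above can be sketched in a few lines. This is a toy model, not the API's actual validation code: a dict stands in for the real cache, the HEAD request is stubbed out, and the page size of 20 is assumed from the 3400-requests-for-170-pages figure:

```python
# Minimal sketch of cache-backed dead-link validation (names hypothetical).
validated_cache: dict[str, bool] = {}  # url -> is_alive; stands in for the real cache

def head_request(url: str) -> bool:
    """Stub for an HTTP HEAD liveness check; pretend every link is alive."""
    return True

def validate_urls(urls: list[str]) -> int:
    """Validate every URL not already cached; return the number of HEAD requests sent."""
    sent = 0
    for url in urls:
        if url not in validated_cache:
            validated_cache[url] = head_request(url)
            sent += 1
    return sent

PAGE_SIZE = 20  # assumed: 3400 HEAD requests / 170 pages
urls_through_page_170 = [f"https://example.org/img/{i}" for i in range(170 * PAGE_SIZE)]

# Cold jump straight to page 170: every prior result must be checked first.
cold = validate_urls(urls_through_page_170)
# Requesting the same range again (cache warmed, as when paging up gradually)
# sends no new HEAD requests.
warm = validate_urls(urls_through_page_170)
print(cold, warm)  # → 3400 0
```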

It is absolutely necessary to validate that all of the prior images exist in order to prevent inconsistent result pagination.

I would classify this as a minor bug, as the overwhelming majority of users do not paginate deeply or make cold jumps to the end of the search results. Fixing it would require overhauling image validation to be more efficient, which is feasible but will take some time due to the complexity in this area.

Here are our options:

Preferred

  • Explicitly don't let users paginate so deeply. Nobody is going to care if we only expose the first 20 pages of results. Easy and cheap.
  • BONUS: Reduce technical debt in the image validation area. I recommend scrapping this bitmask validation cache and instead making Elasticsearch aware of link rot at the last moment; that way we can leave the complexity of ensuring result consistency to Elasticsearch. This will be harder than you think because it is a performance-sensitive problem.
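The preferred "don't let users paginate so deeply" option could look something like the sketch below. The cap of 20 pages comes from the comment above; the function name and error shape are made up for illustration:

```python
MAX_PAGE = 20  # proposed depth cap from the discussion above

class DeepPaginationError(ValueError):
    """Raised when a request asks for a page beyond the supported depth."""

def clamp_page(requested_page: int, max_page: int = MAX_PAGE) -> int:
    """Reject overly deep page numbers up front, so the server never
    attempts to validate thousands of prior results for a cold jump."""
    if requested_page < 1:
        raise DeepPaginationError("page must be >= 1")
    if requested_page > max_page:
        raise DeepPaginationError(
            f"page {requested_page} exceeds the maximum of {max_page}"
        )
    return requested_page
```

A real implementation would translate DeepPaginationError into an HTTP 400 response and report the cap in the page_count metadata, so clients never see an unreachable page number.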

Less preferred

  • Buy bigger EC2 instances with more network bandwidth and cores for making concurrent validation requests. This is probably limited by the maximum number of concurrent requests that each server allows. Easy, expensive, uncertain.
  • Use a cluster of dedicated servers for performing link validation. Hard, expensive.

@kgodey
Contributor Author

kgodey commented Aug 5, 2020

Nobody is going to care if we only expose the first 20 pages of results. Easy and cheap.

This was a bug that X5gon ran into; they are using our API to power their image search for OER. So someone cares. :)

@aldenstpage
Contributor

aldenstpage commented Aug 5, 2020

Yes, one exception! But we should have an alternative option for people who are looking to bulk scrape our whole catalog, such as a data dump or a bulk load endpoint. The search endpoint is optimized for finding the best results for your search query, not bulk downloads. I would also expect that the deeper you go, the worse the results are.

You'll find that other search products often limit the result set as well.

@kgodey kgodey self-assigned this Aug 24, 2020
@kgodey kgodey added blocked and removed help wanted Open to participation from the community labels Aug 24, 2020
@kgodey kgodey added 🚧 status: blocked Blocked & therefore, not ready for work 🛠 goal: fix Bug fix and removed blocked labels Sep 24, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
Development

No branches or pull requests

4 participants