This repository has been archived by the owner on Feb 22, 2023. It is now read-only.

Issues with results when page count gets high (original #565) #13

Closed
obulat opened this issue Apr 21, 2021 · 1 comment

Comments

@obulat
Contributor

obulat commented Apr 21, 2021

This issue has been migrated from the CC Search Frontend repository

Author: kgodey
Date: Sat Jul 25 2020
Labels: 🏷 status: label work required, 🙅 status: discontinued, 🛠 goal: fix

Email from an API consumer:

I found that we get error messages from the CC Search API when the page parameter is high, e.g. 170+. For page=169 we still get results, but for larger page numbers it returns an internal server error (500 or 502) even though the page_count is not exceeded.

The example I was working with had the following parameters:

  • text: "trees"
  • source: "wikimedia,thorvaldsensmuseum,thingiverse,svgsilh,statensmuseum,spacex,smithsonian,sketchfab,sciencemuseum,rijksmuseum,rawpixel,phylopic,nypl,nasa,museumsvictoria,met,mccordmuseum,iha,geographorguk,floraon,flickr,europeana,eol,digitaltmuseum,deviantart,clevelandmuseum,brooklynmuseum,bio_diversity,behance,animaldiversity,WoRMS,CAPL,500px"
  • page: 170

When page=169 I get the following metadata:

result_count: 10000
page_count: 250

Another example is:

  • text: "machine%20learning"
  • source: "wikimedia,thorvaldsensmuseum,thingiverse,svgsilh,statensmuseum,spacex,smithsonian,sketchfab,sciencemuseum,rijksmuseum,rawpixel,phylopic,nypl,nasa,museumsvictoria,met,mccordmuseum,iha,geographorguk,floraon,flickr,europeana,eol,digitaltmuseum,deviantart,clevelandmuseum,brooklynmuseum,bio_diversity,behance,animaldiversity,WoRMS,CAPL,500px"
  • page: 201

When page=200 I get the following metadata:

result_count: 10000
page_count: 250
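
For reference, a minimal sketch of how such a request might be reproduced (the base URL and parameter names below are assumptions based on the report, not confirmed against the actual CC Search API):

```python
import requests

# Hypothetical base URL for illustration only; substitute the real CC Search API endpoint.
API_URL = "https://api.creativecommons.engineering/v1/images"

params = {
    "q": "trees",
    # Abbreviated; the full comma-separated source list from the report goes here.
    "source": "wikimedia,thorvaldsensmuseum,thingiverse",
    "page": 170,
}

response = requests.get(API_URL, params=params, timeout=30)
# Per the report: page=169 returns results, while page=170 fails with a 500/502.
print(response.status_code)
```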

Original Comments:

Issue author dhirensr commented on Mon Jul 27 2020:

I tried to reproduce the error for the first query (q=trees, same sources). Page 170 sometimes failed with a 502 Bad Gateway and sometimes worked in Postman; it also sometimes failed with smaller page numbers such as page=100 or 150.
I suspect this error is caused by the server getting overloaded rather than by a bug in the code.
source

aldenstpage commented on Thu Aug 06 2020:

It looks like the cause of this is inefficient link rot validation. If you jump straight to page 170 before any other similar requests have been made and cached, the result is that the server walks through every image in the prior pages and sends a HEAD request. That means 3400 HEAD requests for q=trees&page=170! This times out and you get a gateway error. The good news is that most people start on page 0 and work their way up to it; if you take that approach, the page works since the cache is warmed up.
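
A rough sketch of why a cold jump to a deep page is so expensive under this scheme (the function, page size, and data shape below are illustrative, not the actual validation code):

```python
import requests

PAGE_SIZE = 20  # illustrative; 170 pages x 20 results ≈ 3400 HEAD requests

def validate_results_up_to(page, results):
    """Naive link-rot check: HEAD every image URL on all pages up to `page`
    so dead links can be dropped without shifting page boundaries."""
    live = []
    for result in results[: page * PAGE_SIZE]:
        try:
            resp = requests.head(result["url"], timeout=2, allow_redirects=True)
            if resp.status_code < 400:
                live.append(result)
        except requests.RequestException:
            pass  # treat unreachable URLs as dead links
    return live
```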

It is absolutely necessary to validate that all of the prior images exist in order to prevent inconsistent result pagination.

I would classify this as a minor bug, as the overwhelming majority of users do not paginate deeply or make cold jumps to the end of the search results. Fixing it would require overhauling image validation to be more efficient, which is feasible but will take some time due to the complexity in this area.

Here are our options:

Preferred

  • Explicitly don't let users paginate so deeply. Nobody is going to care if we only expose the first 20 pages of results. Easy and cheap (see the sketch after this list).
    • BONUS: Reduce technical debt in the image validation area. I recommend scrapping the bitmask validation cache and instead making Elasticsearch aware of link rot at the last moment; that way we can leave the complexity of ensuring result consistency to Elasticsearch. This will be harder than it sounds because it is a performance-sensitive problem.
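
A minimal sketch of what capping pagination depth could look like (the names and the 20-page cap are illustrative, not the actual API code). Returning a 400 for out-of-range pages instead of clamping would work equally well; the point is that the server never has to validate more than a bounded number of results.

```python
MAX_DEEP_PAGE = 20  # hypothetical cap on how deep users may paginate

def clamp_page(requested_page: int, page_count: int) -> int:
    """Limit the requested page to both the query's page_count and a hard cap,
    so a cold jump can never trigger thousands of validation requests."""
    allowed_max = max(1, min(page_count, MAX_DEEP_PAGE))
    return min(max(requested_page, 1), allowed_max)
```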

Less preferred

  • Buy bigger EC2 instances with more network bandwidth and cores for making concurrent validation requests. This is probably limited by the maximum number of concurrent requests that each server allows. Easy, expensive, uncertain.
  • Use a cluster of dedicated servers for performing link validation. Hard, expensive.
    source

kgodey commented on Thu Aug 06 2020:

Nobody is going to care if we only expose the first 20 pages of results. Easy and cheap.

This was a bug that X5gon ran into; they are using our API to power their image search for OER. So someone cares. :)
source

aldenstpage commented on Thu Aug 06 2020:

Yes, one exception! But we should have an alternative option for people who are looking to bulk scrape our whole catalog, such as a data dump or a bulk load endpoint. The search endpoint is optimized for finding the best results for your search query, not bulk downloads. I would also expect that the deeper you go, the worse the results are.

You'll find that other search products often limit the result set as well.
source

@sarayourfriend
Contributor

This issue was fixed as described in #859. Further work will happen to improve dead link validation, but for the moment this issue as described is no longer a problem.
