This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

Issues with results when page count gets high #565

Closed
kgodey opened this issue Jul 24, 2020 · 4 comments
Assignees
Labels
🛠 goal: fix Bug fix 🙅 status: discontinued Not suitable for work as repo is in maintenance 🏷 status: label work required Needs proper labelling before it can be worked on

Comments

@kgodey
Contributor

kgodey commented Jul 24, 2020

Email from an API consumer:

I found that we get some error messages from the CC Search API when the page parameter is high, e.g. 170 and above. For page=169 we still get results, but for bigger numbers it returns an internal server error (HTTP 500 or 502) even though page_count is not exceeded.

The example I was going with had the following parameters:

  • text: "trees"
  • source: "wikimedia,thorvaldsensmuseum,thingiverse,svgsilh,statensmuseum,spacex,smithsonian,sketchfab,sciencemuseum,rijksmuseum,rawpixel,phylopic,nypl,nasa,museumsvictoria,met,mccordmuseum,iha,geographorguk,floraon,flickr,europeana,eol,digitaltmuseum,deviantart,clevelandmuseum,brooklynmuseum,bio_diversity,behance,animaldiversity,WoRMS,CAPL,500px"
  • page: 170

When page=169 I get the following metadata:

result_count: 10000
page_count: 250

Another example is:

  • text: "machine%20learning"
  • source: "wikimedia,thorvaldsensmuseum,thingiverse,svgsilh,statensmuseum,spacex,smithsonian,sketchfab,sciencemuseum,rijksmuseum,rawpixel,phylopic,nypl,nasa,museumsvictoria,met,mccordmuseum,iha,geographorguk,floraon,flickr,europeana,eol,digitaltmuseum,deviantart,clevelandmuseum,brooklynmuseum,bio_diversity,behance,animaldiversity,WoRMS,CAPL,500px"
  • page: 201

When page=200 I get the following metadata:

result_count: 10000
page_count: 250
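The relationship between the reported metadata fields can be sketched as simple pagination math. Note the 40-results-per-page figure below is an assumption inferred from the numbers in both examples (10000 / 250 = 40), not something stated in the thread:

```python
import math

def page_count(result_count: int, page_size: int) -> int:
    """Number of pages needed to expose result_count results."""
    return math.ceil(result_count / page_size)

# Both examples above report result_count = 10000 and page_count = 250,
# which is consistent with a page size of 40 results per page.
print(page_count(10000, 40))  # → 250
```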
@dhirensr
Contributor

dhirensr commented Jul 27, 2020

I tried to reproduce the same error for the first query (text = trees, same sources). Page 170 was failing sometimes with 502 Bad Gateway and sometimes worked in Postman; it also sometimes fails with smaller page numbers like page = 100 or 150.
I guess this error is caused by the server getting overloaded and isn't a bug in the code.

@aldenstpage aldenstpage added help wanted Open to participation from the community and removed not ready for work labels Aug 5, 2020
@aldenstpage
Contributor

aldenstpage commented Aug 5, 2020

It looks like the cause of this is inefficient link rot validation. If you jump straight to page 170 before any other similar requests have been made and cached, the result is that the server walks through every image in the prior pages and sends a HEAD request. That means 3400 HEAD requests for q=trees&page=170! This times out and you get a gateway error. The good news is that most people start on page 0 and work their way up to it; if you take that approach, the page works since the cache is warmed up.
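The cost asymmetry described above can be sketched in a few lines. This is a toy model, not the API's actual validation code: a dict stands in for the real cache, the HEAD request is stubbed out, and the page size of 20 is assumed from the 3400-requests-for-170-pages figure:

```python
# Minimal sketch of cache-backed dead-link validation (names hypothetical).
validated_cache: dict[str, bool] = {}  # url -> is_alive; stands in for the real cache

def head_request(url: str) -> bool:
    """Stub for an HTTP HEAD liveness check; pretend every link is alive."""
    return True

def validate_urls(urls: list[str]) -> int:
    """Validate every URL not already cached; return the number of HEAD requests sent."""
    sent = 0
    for url in urls:
        if url not in validated_cache:
            validated_cache[url] = head_request(url)
            sent += 1
    return sent

PAGE_SIZE = 20  # assumed: 3400 HEAD requests / 170 pages
urls_through_page_170 = [f"https://example.org/img/{i}" for i in range(170 * PAGE_SIZE)]

# Cold jump straight to page 170: every prior result must be checked first.
cold = validate_urls(urls_through_page_170)
# Requesting the same range again (cache warmed, as when paging up gradually)
# sends no new HEAD requests.
warm = validate_urls(urls_through_page_170)
print(cold, warm)  # → 3400 0
```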

It is absolutely necessary to validate that all of the prior images exist in order to prevent inconsistent result pagination.

I would classify this as a minor bug, as the overwhelming majority of users do not paginate deeply or make cold jumps to the end of the search results. Fixing it would require overhauling image validation to be more efficient, which is feasible but will take some time due to the complexity in this area.

Here are our options:

Preferred

  • Explicitly don't let users paginate so deeply. Nobody is going to care if we only expose the first 20 pages of results. Easy and cheap.
  • BONUS: Reduce technical debt in the image validation area. I recommend scrapping this bitmask validation cache and instead making Elasticsearch aware of link rot at the last moment; that way we can leave the complexity of ensuring result consistency to Elasticsearch. This will be harder than you think because it is a performance-sensitive problem.
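The preferred "don't let users paginate so deeply" option could look something like the sketch below. The cap of 20 pages comes from the comment above; the function name and error shape are made up for illustration:

```python
MAX_PAGE = 20  # proposed depth cap from the discussion above

class DeepPaginationError(ValueError):
    """Raised when a request asks for a page beyond the supported depth."""

def clamp_page(requested_page: int, max_page: int = MAX_PAGE) -> int:
    """Reject overly deep page numbers up front, so the server never
    attempts to validate thousands of prior results for a cold jump."""
    if requested_page < 1:
        raise DeepPaginationError("page must be >= 1")
    if requested_page > max_page:
        raise DeepPaginationError(
            f"page {requested_page} exceeds the maximum of {max_page}"
        )
    return requested_page
```

A real implementation would translate DeepPaginationError into an HTTP 400 response and report the cap in the page_count metadata, so clients never see an unreachable page number.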

Less preferred

  • Buy bigger EC2 instances with more network bandwidth and cores for making concurrent validation requests. This is probably limited by the maximum number of concurrent requests that each server allows. Easy, expensive, uncertain.
  • Use a cluster of dedicated servers for performing link validation. Hard, expensive.

@kgodey
Contributor Author

kgodey commented Aug 5, 2020

Nobody is going to care if we only expose the first 20 pages of results. Easy and cheap.

This was a bug that X5gon ran into; they are using our API to power their image search for OER. So someone cares. :)

@aldenstpage
Contributor

aldenstpage commented Aug 5, 2020

Yes, one exception! But we should have an alternative option for people who are looking to bulk scrape our whole catalog, such as a data dump or a bulk load endpoint. The search endpoint is optimized for finding the best results for your search query, not bulk downloads. I would also expect that the deeper you go, the worse the results are.

You'll find that other search products often limit the result set as well.

@kgodey kgodey self-assigned this Aug 24, 2020
@kgodey kgodey added blocked and removed help wanted Open to participation from the community labels Aug 24, 2020
@kgodey kgodey added 🚧 status: blocked Blocked & therefore, not ready for work 🛠 goal: fix Bug fix and removed blocked labels Sep 24, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
Development

No branches or pull requests

4 participants