This repository has been archived by the owner on Feb 22, 2023. It is now read-only.
Author: kgodey
Date: Sat Jul 25 2020
Labels: 🏷 status: label work required, 🙅 status: discontinued, 🛠 goal: fix
Email from an API consumer:
I found that we get some error messages from the CC Search API when the page parameter is high, e.g. 170+. For page=169 we still get results, but for bigger numbers it returns an internal server error (500, or sometimes 502) even though page_count is not exceeded.
The example I was going with had the following parameters:
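(The consumer's specific parameters and the response metadata for these examples were not preserved in this archive.) For illustration only, here is a request of the general shape being described, written in Python; the host and route are assumptions based on the public CC Search API of the time, and the query/page values come from the q=trees example discussed later in the thread, not from the original report:

```python
# Illustrative reproduction only: the endpoint URL is an assumption, and the
# query/page values come from the q=trees example discussed later in this
# thread, not from the original report.
import requests

API_URL = "https://api.creativecommons.engineering/v1/images"  # assumed endpoint

def fetch_page(query: str, page: int) -> requests.Response:
    """Request a single page of search results and print the status code."""
    resp = requests.get(API_URL, params={"q": query, "page": page}, timeout=60)
    print(f"page={page} -> HTTP {resp.status_code}")
    return resp

fetch_page("trees", 169)  # reportedly returns results
fetch_page("trees", 170)  # reportedly fails intermittently with 500 or 502
```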
Issue author dhirensr commented on Mon Jul 27 2020:
I tried to reproduce the error with the first query (q=trees, same sources). Page 170 sometimes failed with a 502 Bad Gateway and sometimes worked in Postman, and it also occasionally failed with smaller page numbers like page=100 or page=150.
My guess is that this error is caused by the server getting overloaded rather than a bug in the code. source
aldenstpage commented on Thu Aug 06 2020:
It looks like the cause of this is inefficient link rot validation. If you jump straight to page 170 before any other similar requests have been made and cached, the server walks through every image on the prior pages and sends a HEAD request for each one. That means 3400 HEAD requests for q=trees&page=170! This times out and you get a gateway error. The good news is that most people start on page 0 and work their way up; if you take that approach, the page works since the cache has been warmed up.
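To make the cost concrete, here is a minimal sketch of the cold-jump behaviour described above, assuming a page size of 20 results (consistent with the 3400 figure: 170 pages × 20 results). The function and cache names are hypothetical and are not taken from the actual validation code:

```python
# Hypothetical sketch of the cold-cache deep-pagination cost; names are invented.
import requests

PAGE_SIZE = 20  # assumed results per page

def validate_prior_results(results, page, validated_urls):
    """HEAD-check every result up to and including the requested page,
    skipping URLs that an earlier request already validated and cached."""
    head_requests = 0
    for result in results[: page * PAGE_SIZE]:
        url = result["url"]
        if url in validated_urls:
            continue  # warm cache: users paginating upward pay this cost gradually
        requests.head(url, timeout=2)  # one HEAD request per prior image
        validated_urls.add(url)
        head_requests += 1
    return head_requests

# With a cold cache, a jump straight to page 170 forces 170 * 20 = 3400 HEAD
# requests before anything can be returned, which blows past the gateway
# timeout and surfaces to the client as a 502.
```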
It is absolutely necessary to validate that all of the prior images exist in order to prevent inconsistent result pagination: if dead links were silently dropped from earlier pages, the contents of every later page would shift.
I would classify this as a minor bug, as the overwhelming majority of users do not paginate deeply or make cold jumps to the end of the search results. Fixing it would require overhauling image validation to be more efficient, which is feasible but will take some time due to the complexity in this area.
Here are our options:
Preferred
Explicitly don't let users paginate so deeply. Nobody is going to care if we only expose the first 20 pages of results. Easy and cheap (a rough sketch of such a cap follows this list of options).
BONUS: Reduce technical debt in the image validation area. I recommend scrapping this bitmask validation cache and instead making Elasticsearch aware of link rot at the last moment; that way we can leave the complexity of ensuring result consistency to Elasticsearch. This will be harder than you think because it is a performance-sensitive problem.
Less preferred
Buy bigger EC2 instances with more network bandwidth and cores for making concurrent validation requests. This is probably limited by the maximum number of concurrent requests that each server allows. Easy, expensive, uncertain.
Use a cluster of dedicated servers for performing link validation. Hard, expensive. source
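Here is a rough sketch of the "preferred" option above: cap pagination depth and reject deep requests before any link validation work starts. The constant and function names are hypothetical and are not taken from the actual patch:

```python
# Hypothetical page-depth cap, not the actual change that shipped.
MAX_PAGE = 20  # only expose the first 20 pages of results

def clamp_page(page: int) -> int:
    """Reject deep page requests up front with a clear client error,
    instead of timing out on link validation and returning a 500/502."""
    if page < 1 or page > MAX_PAGE:
        raise ValueError(f"page must be between 1 and {MAX_PAGE}")
    return page
```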
kgodey commented on Thu Aug 06 2020:
Nobody is going to care if we only expose the first 20 pages of results. Easy and cheap.
This was a bug that X5gon ran into; they are using our API to power their image search for OER. So someone cares. :) source
aldenstpage commented on Thu Aug 06 2020:
Yes, one exception! But we should have an alternative option for people who are looking to bulk scrape our whole catalog, such as a data dump or a bulk load endpoint. The search endpoint is optimized for finding the best results for your search query, not bulk downloads. I would also expect that the deeper you go, the worse the results are.
You'll find that other search products often limit the result set as well. source
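For context on that last point: Elasticsearch, which backs the search endpoint (see the bonus option above), imposes a similar limit itself; from + size pagination is capped by the index.max_result_window setting, 10,000 hits by default. A small illustration with the elasticsearch-py client, where the index name is an assumption:

```python
# Illustrative only: shows Elasticsearch's own deep-pagination limit.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# index.max_result_window defaults to 10,000; from + size requests past that
# are rejected unless the window is explicitly raised, so deep pagination
# limits are a normal property of search backends. The index name "image"
# is an assumption.
print(es.indices.get_settings(index="image"))
```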
This issue was fixed (as described) in #859. Further work will happen to improve dead link validation, but at the moment the issue as described is no longer a problem.
This issue has been migrated from the CC Search Frontend repository