internetarchive version: 3.5.0 (Python version and OS seem irrelevant for this issue)

When searching for queries that return more than 10,000 items, e.g. `mediatype:software`, the following error is always raised:
```python
if i != num_found:
    raise ReadTimeout('The server failed to return results in the'
                      f' allotted amount of time for {r.request.url}')
```
When tracing the issue, I found that the URL named in `r.request.url` is retrieved correctly. In fact, my results contained one more item than the API reported: for the query `mediatype:software`, `i` is 1043904 while `num_found` is 1043903.
I don't know why the API returns more results than it reports for this query, but raising a `ReadTimeout` based on the condition `i != num_found` is too restrictive, especially since `self._handle_scrape_error(j)` is invoked earlier and should already catch errors.
Nevertheless, I assume this condition was included for a reason, which I cannot figure out right now, so I can only suggest rough ideas for resolving the issue. Two that come to mind: remove the `if` conditional altogether (and potentially enhance `self._handle_scrape_error(j)`), or weaken the condition to `if i < num_found:`.
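To illustrate the second suggestion, here is a minimal, self-contained sketch of the weakened check. `ReadTimeout` is a stand-in for `requests.exceptions.ReadTimeout`, and the function name `check_result_count` is hypothetical; `i`, `num_found`, and the URL mirror the names in the original code:

```python
class ReadTimeout(Exception):
    """Stand-in for requests.exceptions.ReadTimeout, so this sketch
    runs without the requests package installed."""

def check_result_count(i: int, num_found: int, url: str) -> None:
    # Weakened condition: only fail when the scrape yielded *fewer*
    # docs than the API reported. A surplus (i > num_found), such as
    # the off-by-one seen with mediatype:software, passes silently.
    if i < num_found:
        raise ReadTimeout('The server failed to return results in the'
                          f' allotted amount of time for {url}')

# The case from this report (1043904 docs vs. 1043903 reported)
# no longer raises, while a genuine shortfall still does.
check_result_count(1043904, 1043903, 'https://archive.org/services/search/v1/scrape')
```

This keeps the protection against silently truncated result sets (the scenario the check was added for) while tolerating the API returning slightly more docs than `num_found`.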
P.S. I checked for duplicate issues but could not find any. A complete traceback can be provided; however, since I identified the problem, posting it seemed redundant. Let me know if I am wrong and you want it anyway.
Thanks for the report @bumatic. This was added to deal with an issue on the archive.org side of things (a timeout happening on the backend, leading to the search API failing silently). The aggressive doc-count check is there to keep someone from thinking they dumped a full result set when in fact they haven't.

Let me look into this more and give it some thought. Thanks again for reporting, and sorry for the trouble.