internetarchive version: 3.5.0 (Python version and OS seem irrelevant for this issue)

When searching for queries that return more than 10,000 items, e.g. `mediatype:software`, the following error is always raised:
```python
if i != num_found:
    raise ReadTimeout('The server failed to return results in the'
                      f' allotted amount of time for {r.request.url}')
```
When tracing the issue, I found that the URL named in `r.request.url` is retrieved correctly. In fact, my results contained one more item than the API reported: for the query `mediatype:software`, `i` is 1043904 while `num_found` is 1043903.
I don't know why the API returns more results than it reports for this query, but raising a `ReadTimeout` based on the condition `i != num_found` is too restrictive, especially since `self._handle_scrape_error(j)` is invoked earlier and should already catch errors.
Nevertheless, I assume this condition was included for a reason, which I cannot figure out right now, so I can only suggest rough ideas for resolving the issue. Two that come to mind: remove the `if` conditional altogether (and potentially enhance `self._handle_scrape_error(j)`), or weaken the condition to `if i < num_found:`.
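To illustrate the second suggestion, here is a minimal, self-contained sketch of the weakened check. `ReadTimeout` is a stand-in for `requests.exceptions.ReadTimeout`, and the function name `check_result_count` is hypothetical; `i`, `num_found`, and the URL mirror the names in the original code:

```python
class ReadTimeout(Exception):
    """Stand-in for requests.exceptions.ReadTimeout, so this sketch
    runs without the requests package installed."""

def check_result_count(i: int, num_found: int, url: str) -> None:
    # Weakened condition: only fail when the scrape yielded *fewer*
    # docs than the API reported. A surplus (i > num_found), such as
    # the off-by-one seen with mediatype:software, passes silently.
    if i < num_found:
        raise ReadTimeout('The server failed to return results in the'
                          f' allotted amount of time for {url}')

# The case from this report (1043904 docs vs. 1043903 reported)
# no longer raises, while a genuine shortfall still does.
check_result_count(1043904, 1043903, 'https://archive.org/services/search/v1/scrape')
```

This keeps the protection against silently truncated result sets (the scenario the check was added for) while tolerating the API returning slightly more docs than `num_found`.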
P.S. I checked for duplicate issues but could not find any. A complete traceback can be provided; however, since I identified the problem, posting it seemed redundant. Let me know if I am wrong and you want it anyway.
Thanks for the report @bumatic. This was added to deal with an issue on the archive.org side of things (a timeout happening on the backend, leading to the search API failing silently). The aggressive doc-count check is there to keep someone from thinking they dumped a full result set when in fact they haven't.

Let me look into this more and give it some thought. Thanks again for reporting, and sorry for the trouble.