search_inside API does not work for some items with spaces in the document name #1260

Open
pidgezero-one opened this issue Oct 22, 2023 · 3 comments

Comments

pidgezero-one commented Oct 22, 2023

Searching the text contents of some items is failing with a message that simply states "Sorry, there was an error with your search. Please try again."

It appears to be tied to the fulltext/inside.php endpoint, and I suspect it is an issue of query params not being encoded somewhere in the API backend.

Evidence / Screenshot (if possible)

[Screenshot of the search error message]

Here is the error thrown by ia-sentry.min.js in the developer console after executing a search:

Search Inside Response Error Whoops! Traceback (most recent call last):
  File "./inside.py", line 158, in <module>
    reply = urllib.request.urlopen(es_url).read()
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 1397, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/usr/lib/python3.8/urllib/request.py", line 1354, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/usr/lib/python3.8/http/client.py", line 1256, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.8/http/client.py", line 1267, in _send_request
    self.putrequest(method, url, **skips)
  File "/usr/lib/python3.8/http/client.py", line 1101, in putrequest
    self._validate_path(url)
  File "/usr/lib/python3.8/http/client.py", line 1201, in _validate_path
    raise InvalidURL(f"URL can't contain control characters. {url!r} "
http.client.InvalidURL: URL can't contain control characters. '/api/v1/searchinside?exists=true&ia_id=nintendo-magazine-system-uk-43-april-1996&filename=Nintendo Magazine System (UK) 43 April 1996_abbyy.gz' (found at least ' ')

It looks like the doc name passed to the filename query parameter is not being URL-encoded somewhere in the backend.
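For anyone who wants to confirm this without access to inside.py, here is a minimal sketch that reproduces the same exception locally. The path is copied from the traceback above; the host name and the `is_rejected` helper are placeholders of mine, not anything from the real backend. It relies on `http.client` validating the request path before any socket is opened, so nothing is sent over the network.

```python
import http.client

# Path copied from the traceback; the filename value contains raw spaces.
RAW_PATH = (
    "/api/v1/searchinside?exists=true"
    "&ia_id=nintendo-magazine-system-uk-43-april-1996"
    "&filename=Nintendo Magazine System (UK) 43 April 1996_abbyy.gz"
)


def is_rejected(path: str) -> bool:
    """Return True if http.client refuses to build a request for this path."""
    # "localhost" is a placeholder host. putrequest() validates the path and
    # buffers the request line; it does not connect, so no request is sent.
    conn = http.client.HTTPConnection("localhost")
    try:
        conn.putrequest("GET", path)
    except http.client.InvalidURL:
        return True
    finally:
        conn.close()
    return False


print(is_rejected(RAW_PATH))                       # True
print(is_rejected(RAW_PATH.replace(" ", "%20")))   # False
```

The second call only swaps spaces for %20 to show that the space is what triggers the error; it is not meant as the actual fix.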

The corresponding network request does appear to encode the doc name from the given payload:

item_id: nintendo-magazine-system-uk-43-april-1996
doc: Nintendo Magazine System (UK) 43 April 1996
path: /27/items/nintendo-magazine-system-uk-43-april-1996
q: "mario rpg"
pre_tag: {{{
post_tag: }}}
callback: jQuery36107855882184027965_1697993173593

URL: https://ia601906.us.archive.org/fulltext/inside.php?item_id=nintendo-magazine-system-uk-43-april-1996&doc=Nintendo%20Magazine%20System%20(UK)%2043%20April%201996&path=/27/items/nintendo-magazine-system-uk-43-april-1996&q=%22mario%20rpg%22&pre_tag=%7B%7B%7B&post_tag=%7D%7D%7D&callback=jQuery36107855882184027965_1697993173593
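To illustrate why the browser request looks fine while the backend still fails, here is a small sketch that decodes the request URL above with Python's standard library. The `decoded_params` helper is hypothetical; I am only assuming that the server decodes the query string the usual way, after which the doc value contains literal spaces again and would have to be re-encoded before being placed in another URL.

```python
from urllib.parse import parse_qs, urlsplit

# Request URL copied from the network console above.
REQUEST_URL = (
    "https://ia601906.us.archive.org/fulltext/inside.php"
    "?item_id=nintendo-magazine-system-uk-43-april-1996"
    "&doc=Nintendo%20Magazine%20System%20(UK)%2043%20April%201996"
    "&path=/27/items/nintendo-magazine-system-uk-43-april-1996"
    "&q=%22mario%20rpg%22&pre_tag=%7B%7B%7B&post_tag=%7D%7D%7D"
    "&callback=jQuery36107855882184027965_1697993173593"
)


def decoded_params(url: str) -> dict:
    """Return the query parameters of url, percent-decoded, one value per key."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


params = decoded_params(REQUEST_URL)
print(repr(params["doc"]))   # 'Nintendo Magazine System (UK) 43 April 1996'
print(" " in params["doc"])  # True: the decoded value contains raw spaces
```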

Relevant url?

Example of an item experiencing this error: https://archive.org/details/nintendo-magazine-system-uk-43-april-1996/

(This isn't explicitly an openlibrary.org url, so apologies if this is the wrong repo to file this issue under, but this endpoint is covered under the openlibrary API documentation, and I don't know which of the 200+ repos owned by internetarchive might contain "inside.py" or "inside.php".)

Steps to Reproduce

  1. I was searching in archive.org for items containing the text "mario rpg" (in quotes): https://archive.org/search?query=%22mario+rpg%22&sin=TXT
  2. The above linked item appears as a search result. Navigating into the item immediately shows the above screenshotted error.
  3. Loading the item URL in a new tab and searching for the quoted text displays the same error.
  • Actual: The developer console displays the ia-sentry.min.js JavaScript error shown above, while the network console shows a successful 200 response from the /inside.php request above.
  • Expected: The left side pane shows the search results.

Details

  • Logged in (Y/N)? Y
  • Browser type/version? Chrome 118.0.5993.70 ARM64
  • Operating system? macOS Ventura 13.5.2
  • Environment (prod/dev/local)? prod

Proposal & Constraints

The URL constructed in inside.py should be fully encoded, including the query parameters. (This is probably simple enough not to need an example, but in the interest of meeting the requirements: https://stackoverflow.com/a/69811079/5306408)
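Since I can't see inside.py, the following is only a sketch of what the proposed encoding could look like. The function name, the base URL, and the assumption that the filename is the doc name plus "_abbyy.gz" (inferred from the traceback) are all mine. It uses `urllib.parse.urlencode` with `quote_via=quote` so that spaces become %20 rather than +.

```python
from urllib.parse import quote, urlencode

# Hypothetical base URL; the real Elasticsearch host used by inside.py is unknown.
ES_BASE = "https://es.example.invalid/api/v1/searchinside"


def build_searchinside_url(ia_id: str, doc: str, base: str = ES_BASE) -> str:
    """Build the backend URL with every query parameter percent-encoded."""
    params = {
        "exists": "true",
        "ia_id": ia_id,
        # Assumption from the traceback: filename is the doc name + "_abbyy.gz".
        "filename": f"{doc}_abbyy.gz",
    }
    # quote_via=quote encodes spaces as %20 (the default quote_plus uses "+").
    return f"{base}?{urlencode(params, quote_via=quote)}"


url = build_searchinside_url(
    "nintendo-magazine-system-uk-43-april-1996",
    "Nintendo Magazine System (UK) 43 April 1996",
)
print(url)
```

With the item from this report, the filename parameter comes out as `Nintendo%20Magazine%20System%20%28UK%29%2043%20April%201996_abbyy.gz`, which contains no characters that `http.client` would reject. Whether the backend expects %20 or + for spaces is something only the maintainers can confirm.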

Related files

I can't actually find them. There's an inside.php and inside.py and one or the other is likely to be the culprit, but they don't appear to be in this repository. I opened the issue here because that endpoint is covered by the openlibrary API docs.

Stakeholders

mekarpeles transferred this issue from internetarchive/openlibrary on Oct 23, 2023
@mekarpeles
Member

Transferring this issue to BookReader repo :)

@pidgezero-one
Author

@mekarpeles Thank you! Although, I don't see an inside.php or inside.py in this repo either.

@shivanshsin0203

Hello everyone, can anyone please help with the project deployment process? I am facing some issues.
