Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use full_path to show search results #5821

Closed
stsewd opened this issue Jun 18, 2019 · 4 comments · Fixed by #7255
Closed

Use full_path to show search results #5821

stsewd opened this issue Jun 18, 2019 · 4 comments · Fixed by #7255
Labels
Improvement Minor improvement to code Needed: design decision A core team decision is required

Comments

@stsewd
Copy link
Member

stsewd commented Jun 18, 2019

This is a follow up of the discussion around #5785

Currently we are indexing the page name as path in ES, page name and path are different concepts.

  • Page name doesn't include the file extension, path does.
  • Page name is something used in sphinx internals to identify a page (doesn't matter the build output I guess)

We used to build the original path (with extension) using the doctype, but that was removed.

When showing search results we rely on sphinx

https://github.com/rtfd/readthedocs.org/blob/a908684874e7ae71ba9b0c93178491d3339c68ec/readthedocs/core/static-src/core/js/doc-embed/search.js#L41-L42

This isn't always correct, there are cases where this gives 404 in the results (see https://anymail.readthedocs.io/en/stable/search/?q=Anymail+integrates+several+transactional&check_keywords=yes&area=default)

But if we ever extend to mkdocs or any other builder, we are going to face the problem of not knowing the original path (with extension!)

My thoughts:

  • If page name is important, we can keep it, but maybe rename it to page_name, so isn't confusing.
  • Use full_path to show results from the search api, so we don't depend on any external things to add the correct extension.
@stsewd stsewd added Improvement Minor improvement to code Needed: design decision A core team decision is required labels Jun 18, 2019
@ericholscher
Copy link
Member

ericholscher commented Jun 18, 2019

Proper names

Yea, I think we need to do a full audit of our code and figure out all the different names we call these concepts. I suggest:

  • page for the Sphinx concept (doc without extension)
  • path for the Filesystem concept (path to file on disk)

Right now, in the embed API we're actually cutting the HTML off and expecting a page instead of a path:

https://github.com/rtfd/readthedocs.org/blob/826093735bbbcbc9a484f41a3c62488958aa8051/readthedocs/templates/projects/project_embed.html#L59

Similarly in other code we're expecting a page:

https://github.com/rtfd/readthedocs.org/blob/826093735bbbcbc9a484f41a3c62488958aa8051/readthedocs/api/v2/views/core_views.py#L27

So we need to define these concepts, and choose what to do with them.

Search next steps

My suggestion for a path forward:

  • Store both the page and path in the ImportedFile object
  • Index both in Elastic, and try to use path's where possible.

Larger Picture

We also run into issues with this in other parts of our code:

  • The SphinxDomain models store page data
  • The upcoming Analytics code will need to choose whether to store page's or path's as the canonical source of a document

There's also the question of mkdocs, which I don't believe has these concepts, but does output multiple types of HTML structures as well.

Edit on GitHub

We also have an issue here when we try and convert from HTML Files to RST files, on the Edit on GitHub links. We need to track the actual RST files that content has come from, which is generally what the page data is tracking.

There is a problem where we can have an HTML file and not know where it came from:

  • foo.rst -> foo/index.html
  • foo/index.rst -> foo/index.html

We need to have the page, along with knowing the output HTML file, in order to be able to know what RST file to link to for a specific HTML file in this case.

@stsewd
Copy link
Member Author

stsewd commented Jun 18, 2019

Thanks for the explanation, now it's more clear for me the use of page name!

@ericholscher
Copy link
Member

We just ran into this again. Intersphinx data is storing URL's (so something like /api/), but we are storing actual HTML names in the ImportedFile models (/api/index.html). This caused us not to be able to map them on sphinx_htmldir builds.

@humitos
Copy link
Member

humitos commented Aug 21, 2019

page for the Sphinx concept (doc without extension)

Sphinx calls this docname internally. I would be good if we use the same name for the same concept to avoid miss understandings.

Another useful concepts that may add some context here are: labelid (which are the ones used in :ref: role) and sectionname.

Besides, I'm not 100% sure, but refdocname seems to be what Sphinx uses for the filename path (containing the .html).

stsewd added a commit that referenced this issue May 12, 2020
We were returning an invalid link from the api,
and the cliente was in charge of guessing the correct link.

This also drops our builder as dependency for search 🎉

Half fixex/ #5821
Fixes #6102
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Improvement Minor improvement to code Needed: design decision A core team decision is required
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants