Use full_path to show search results #5821

stsewd · 2019-06-18T17:18:50Z

This is a follow up of the discussion around #5785

Currently we are indexing the page name as path in ES, page name and path are different concepts.

Page name doesn't include the file extension, path does.
Page name is something used in sphinx internals to identify a page (doesn't matter the build output I guess)

We used to build the original path (with extension) using the doctype, but that was removed.

When showing search results we rely on sphinx

https://github.com/rtfd/readthedocs.org/blob/a908684874e7ae71ba9b0c93178491d3339c68ec/readthedocs/core/static-src/core/js/doc-embed/search.js#L41-L42

This isn't always correct, there are cases where this gives 404 in the results (see https://anymail.readthedocs.io/en/stable/search/?q=Anymail+integrates+several+transactional&check_keywords=yes&area=default)

But if we ever extend to mkdocs or any other builder, we are going to face the problem of not knowing the original path (with extension!)

My thoughts:

If page name is important, we can keep it, but maybe rename it to page_name, so isn't confusing.
Use full_path to show results from the search api, so we don't depend on any external things to add the correct extension.

The text was updated successfully, but these errors were encountered:

ericholscher · 2019-06-18T17:58:06Z

Proper names

Yea, I think we need to do a full audit of our code and figure out all the different names we call these concepts. I suggest:

page for the Sphinx concept (doc without extension)
path for the Filesystem concept (path to file on disk)

Right now, in the embed API we're actually cutting the HTML off and expecting a page instead of a path:

https://github.com/rtfd/readthedocs.org/blob/826093735bbbcbc9a484f41a3c62488958aa8051/readthedocs/templates/projects/project_embed.html#L59

Similarly in other code we're expecting a page:

https://github.com/rtfd/readthedocs.org/blob/826093735bbbcbc9a484f41a3c62488958aa8051/readthedocs/api/v2/views/core_views.py#L27

So we need to define these concepts, and choose what to do with them.

Search next steps

My suggestion for a path forward:

Store both the page and path in the ImportedFile object
Index both in Elastic, and try to use path's where possible.

Larger Picture

We also run into issues with this in other parts of our code:

The SphinxDomain models store page data
The upcoming Analytics code will need to choose whether to store page's or path's as the canonical source of a document

There's also the question of mkdocs, which I don't believe has these concepts, but does output multiple types of HTML structures as well.

Edit on GitHub

We also have an issue here when we try and convert from HTML Files to RST files, on the Edit on GitHub links. We need to track the actual RST files that content has come from, which is generally what the page data is tracking.

There is a problem where we can have an HTML file and not know where it came from:

foo.rst -> foo/index.html
foo/index.rst -> foo/index.html

We need to have the page, along with knowing the output HTML file, in order to be able to know what RST file to link to for a specific HTML file in this case.

stsewd · 2019-06-18T18:11:33Z

Thanks for the explanation, now it's more clear for me the use of page name!

ericholscher · 2019-07-25T18:44:03Z

We just ran into this again. Intersphinx data is storing URL's (so something like /api/), but we are storing actual HTML names in the ImportedFile models (/api/index.html). This caused us not to be able to map them on sphinx_htmldir builds.

humitos · 2019-08-21T14:34:31Z

page for the Sphinx concept (doc without extension)

Sphinx calls this docname internally. I would be good if we use the same name for the same concept to avoid miss understandings.

Another useful concepts that may add some context here are: labelid (which are the ones used in :ref: role) and sectionname.

Besides, I'm not 100% sure, but refdocname seems to be what Sphinx uses for the filename path (containing the .html).

We were returning an invalid link from the api, and the cliente was in charge of guessing the correct link. This also drops our builder as dependency for search 🎉 Half fixex/ #5821 Fixes #6102

stsewd added Improvement Minor improvement to code Needed: design decision A core team decision is required labels Jun 18, 2019

dojutsu-user mentioned this issue Jun 25, 2019

Add section linking for the search result #5829

Merged

ericholscher mentioned this issue Jun 26, 2019

Add a relation from SphinxDomain -> HTMLFile #5848

Merged

dojutsu-user mentioned this issue Jul 25, 2019

Fix logic involving creation of Sphinx Domains #5997

Merged

ericholscher mentioned this issue Aug 20, 2019

Return full path URL (including .html) on /api/v2/docurl/ endpoint #6082

Merged

dojutsu-user mentioned this issue Aug 26, 2019

Projects built with htmldir have invalid link for homepage in search results #6102

Closed

stsewd mentioned this issue May 12, 2020

Search: generate full link from the server side #7070

Merged

stsewd mentioned this issue Jun 4, 2020

Search: Use serializers to parse search results #7157

Merged

stsewd mentioned this issue Jul 2, 2020

Search: implement stable API #7255

Merged

stsewd closed this as completed in #7255 Aug 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use full_path to show search results #5821

Use full_path to show search results #5821

stsewd commented Jun 18, 2019 •

edited

Loading

ericholscher commented Jun 18, 2019 •

edited

Loading

stsewd commented Jun 18, 2019

ericholscher commented Jul 25, 2019

humitos commented Aug 21, 2019

Use full_path to show search results #5821

Use full_path to show search results #5821

Comments

stsewd commented Jun 18, 2019 • edited Loading

ericholscher commented Jun 18, 2019 • edited Loading

Proper names

Search next steps

Larger Picture

Edit on GitHub

stsewd commented Jun 18, 2019

ericholscher commented Jul 25, 2019

humitos commented Aug 21, 2019

stsewd commented Jun 18, 2019 •

edited

Loading

ericholscher commented Jun 18, 2019 •

edited

Loading