docs: improve search for documentation #6952

tmair · 2022-05-02T11:15:39Z

Description:
This PR ports the search web worker and its integration from angular.io to the RxJS documentation site. The main changes are outlined on the original commit in the angular repository: angular/angular@fccffc6

In addition to the changes ported from the Angular documentation, the following changes were made:

Adaptation of the default lunr stop word filter so that from, of and every are no longer treated as stop words (this allows searching for the of function)
~~disabling of the ignore word list for class and interface members (this allows to find results for next as next is a member of the Observable interface)~~ This is now also fixed on angular.io
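
As an illustration of the stop word adaptation, here is a minimal, dependency-free sketch. It does not use lunr itself; the shortened word list and the names `createStopWordFilter` and `keepSearchable` are assumptions made for this example only.

```javascript
// Hypothetical, shortened stop word list; lunr's real default list is much longer.
const defaultStopWords = ['a', 'and', 'every', 'from', 'of', 'the', 'with'];

// Words that double as RxJS API names and should stay searchable.
const keepSearchable = new Set(['from', 'of', 'every']);

// Builds a filter that returns the token unchanged, or undefined to drop it.
function createStopWordFilter(stopWords) {
  const blocked = new Set(stopWords);
  return (token) => (blocked.has(token) ? undefined : token);
}

// Remove the API names from the default list before building the filter.
const customStopWords = defaultStopWords.filter((word) => !keepSearchable.has(word));
const stopWordFilter = createStopWordFilter(customStopWords);

const tokens = ['the', 'of', 'function', 'from', 'with'];
const kept = tokens.filter((token) => stopWordFilter(token) !== undefined);
console.log(kept); // [ 'of', 'function', 'from' ]
```

With a filter like this, a query for of is no longer discarded before it reaches the index.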

Note:
This PR needs to be rebased after the dependency issues in #6913 are merged.

Related issue (if exists):
This should fix #6500
There should be no regression for #4536

jakovljevic-mladen · 2022-05-30T19:37:13Z

@tmair, this is a great PR, thank you for it. I will review it as soon as I find some time. I should be able to do it in the next couple of days.

jakovljevic-mladen

This is a great PR, thank you a lot for making it. It took me a while to get into this code so that I could do a proper review. The changes look good to me; however, I have some questions and comments.

One of them concerns duplicated results (not caused by this PR, but it would be handy if we fixed the issue). For example, in this picture:

You can see that repeatWhen appears twice (one result links to /api/index/function/repeatWhen and the other to /api/operators/repeatWhen). Is it possible to remove docs with the /api/operators/ path, since this path is deprecated?

tmair · 2022-10-01

How does the search work?

The search is split into a preprocessing step at build time and an index creation step (inside a web worker) at runtime.

The preprocessing happens in generateKeywords.js and performs the following steps:

1. Calculate the set of pages that need to be indexed for searching.
2. For each non-ignored property (ignoredProperties) of each page document, extract the keywords used to build the search index (mainTokens). Here the ignore word list is used to filter out common English terminology (e.g. you, from, every, ...).
3. Extract all public members from the code and add them to the search index (memberTokens), bypassing the ignore word list.
4. Extract keywords from the headings of the documents (headingTokens), using the ignore word list.
5. Generate the search data that will be downloaded by the web worker to build the search index. The search data is already stemmed (https://lunrjs.com/guides/core_concepts.html#stemming) to further reduce the amount of data that has to be transferred to the web worker.
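
The preprocessing steps above can be sketched roughly as follows. This is not the code from generateKeywords.js: the page shape, the ignore list and the function names are assumptions for illustration, and stemming is left out to keep the example short.

```javascript
// Hypothetical ignore list and page shape; the real ones live in the docs tooling.
const ignoreWords = new Set(['you', 'from', 'every', 'the', 'to', 'next']);

// Splits text into lower-case word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9_]+/g) || [];
}

// Produces one search data entry for a page.
function extractKeywords(page) {
  const withoutIgnored = (words) => words.filter((word) => !ignoreWords.has(word));
  return {
    path: page.path,
    title: page.title, // titles are kept as-is
    headings: withoutIgnored(tokenize(page.headings.join(' '))),
    keywords: withoutIgnored(tokenize(page.content)),
    // Member names bypass the ignore list, so "next" stays searchable.
    members: page.members.map((member) => member.toLowerCase()),
  };
}

const page = {
  path: 'api/index/class/Observable', // hypothetical path
  title: 'Observable',
  headings: ['Subscribing to every value'],
  content: 'You call next from the Observable',
  members: ['next', 'subscribe'],
};

const entry = extractKeywords(page);
console.log(entry.keywords); // [ 'call', 'observable' ]
console.log(entry.members); // [ 'next', 'subscribe' ]
```

Note how next is dropped from the body keywords but survives as a member token, which is the behavior described in step 3.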

The web worker performs the following steps before it can answer any search queries:

1. Load the search data computed by the preprocessing step described above.
2. Create a processing pipeline without a stemming step, since stemming was already performed during preprocessing. A stop word filtering step is still performed, however.
3. Run the data through the processing pipeline to create the search index.
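
A minimal sketch of the worker-side steps, written without lunr so that it stays self-contained. The payload shape, the stop word list and the helper names are assumptions; the real worker builds a lunr index instead of the plain Map used here.

```javascript
// Hypothetical payload produced by the preprocessing step (already stemmed).
const searchData = [
  { path: 'api/index/function/from', title: 'from', keywords: 'observ creat arrai' },
  { path: 'guide/overview', title: 'Overview', keywords: 'the observ introduct' },
];

// Hypothetical stop word list applied while indexing.
const stopWords = new Set(['the', 'a', 'and']);

// Pipeline: lower-case, then stop word filter. No stemmer, the data is pre-stemmed.
const pipeline = [
  (token) => token.toLowerCase(),
  (token) => (stopWords.has(token) ? undefined : token),
];

// Runs one token through all steps; undefined means the token was dropped.
function runPipeline(token) {
  let current = token;
  for (const step of pipeline) {
    current = step(current);
    if (current === undefined) return undefined;
  }
  return current;
}

// Builds an inverted index: token -> Set of page paths.
function buildIndex(pages) {
  const index = new Map();
  for (const page of pages) {
    const rawTokens = `${page.title} ${page.keywords}`.split(/\s+/).filter(Boolean);
    for (const raw of rawTokens) {
      const token = runPipeline(raw);
      if (token === undefined) continue;
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(page.path);
    }
  }
  return index;
}

const index = buildIndex(searchData);
console.log([...index.get('observ')]); // both pages
console.log(index.has('the')); // false, removed by the stop word step
```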

During the preprocessing and indexing of the search data, there are two places where words are ignored. The ignore word list is used to filter common English terms during the preprocessing phase (it is not applied to member tokens). The stop word filter performs the same operation when the index is built in the web worker.

Improvements to the searching process done in this PR

Ignore words and stop words filtering

Because it is strange to have two ignore word lists and filtering steps, I have decided to remove the stop word filter from the index builder pipeline completely, as it does the same work twice and is quite confusing. However, I have left all words (including from, every, of, with) in the ignore word filter, as the search results are not great when those words are not ignored. For example, from appears whenever a documentation page includes a code example, because of the import ... from ... syntax. The relevant pages are still found because the title of the page is not filtered. Therefore, searching for from will still return the pages where from is part of the title.

Searching for `from`

Regarding the different search results for from, here is what is happening during the search:

1. First, search for documents that contain all of the search terms provided (separated by spaces; in our case just from).
2. If the search has at least one result, return that result set (this is what happens for from, so we only get one search result).
3. Search for documents that contain any of the search terms.
4. If the search has at least one result, return that result set.
5. Search for documents that contain any of the search terms and contain the first search term somewhere in the title.

So for from, as there is only one result, we return that single result.
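
The fallback order described above can be sketched like this. It is a simplified, lunr-free illustration: the page entries and field names are hypothetical, and the last stage is interpreted here as a loose substring match of the first term against the title, which is an assumption about how that query behaves.

```javascript
// Hypothetical index entries; field names are assumptions for illustration.
const pages = [
  { path: 'api/index/function/from', title: 'from', keywords: ['observable', 'array'] },
  { path: 'api/index/function/of', title: 'of', keywords: ['observable', 'values'] },
  { path: 'guide/observable', title: 'Observable', keywords: ['subscribe', 'values'] },
];

// All exact words a page can be found by.
const wordsOf = (page) => new Set([page.title.toLowerCase(), ...page.keywords]);

function search(query) {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];

  // Stage 1: pages that contain ALL search terms.
  const all = pages.filter((p) => terms.every((t) => wordsOf(p).has(t)));
  if (all.length > 0) return all;

  // Stage 2: pages that contain ANY search term.
  const any = pages.filter((p) => terms.some((t) => wordsOf(p).has(t)));
  if (any.length > 0) return any;

  // Stage 3 (assumed interpretation): first term appears anywhere in the title.
  return pages.filter((p) => p.title.toLowerCase().includes(terms[0]));
}

console.log(search('from').map((p) => p.path)); // [ 'api/index/function/from' ]
console.log(search('from values').length); // 3, falls through to stage 2
console.log(search('observ').map((p) => p.path)); // [ 'guide/observable' ], stage 3
```

Because stage 1 already succeeds for from, the later and broader stages never run, which is why only a single result is returned.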

I modified the initial search so that it gives more weight to search terms within the title of the page. Maybe that will result in better search results.
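
To illustrate the idea of weighting title matches, here is a small scoring sketch. The boost value, the page entries and the `score` and `rank` helpers are all hypothetical; the actual change is expressed as a lunr query and may weight results differently.

```javascript
// Hypothetical weight given to a match in the page title.
const TITLE_BOOST = 10;

// Hypothetical index entries.
const pages = [
  { path: 'guide/operators', title: 'Operators', keywords: ['buffer', 'map'] },
  { path: 'api/index/function/buffer', title: 'buffer', keywords: ['observable'] },
];

// A title match counts TITLE_BOOST, a keyword match counts 1.
function score(page, term) {
  let total = 0;
  if (page.title.toLowerCase() === term) total += TITLE_BOOST;
  if (page.keywords.includes(term)) total += 1;
  return total;
}

// Returns matching pages, best score first.
function rank(term) {
  return pages
    .map((page) => ({ path: page.path, score: score(page, term) }))
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score);
}

console.log(rank('buffer'));
// The API page for buffer (score 10) is listed before the guide page (score 1).
```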

Removing deprecated entrypoints from search results

I also removed the deprecated entrypoints from the search results (e.g. when searching for buffer). This will also reduce the size of the generated search data.
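
A minimal sketch of such a filter, assuming that deprecated entry points can be recognized by the /api/operators/ path prefix mentioned in the review. The prefix list and the `isDeprecatedEntryPoint` helper are assumptions, not the code from this PR.

```javascript
// Assumption: deprecated entry points are recognizable by these path prefixes.
const DEPRECATED_PREFIXES = ['api/operators/'];

// Returns true when a page path belongs to a deprecated entry point.
function isDeprecatedEntryPoint(path) {
  const normalized = path.replace(/^\//, ''); // tolerate a leading slash
  return DEPRECATED_PREFIXES.some((prefix) => normalized.startsWith(prefix));
}

const pages = [
  { path: '/api/index/function/repeatWhen', title: 'repeatWhen' },
  { path: '/api/operators/repeatWhen', title: 'repeatWhen' },
];

// Only non-deprecated pages end up in the generated search data.
const searchable = pages.filter((page) => !isDeprecatedEntryPoint(page.path));
console.log(searchable.map((page) => page.path)); // [ '/api/index/function/repeatWhen' ]
```

Dropping these pages before the search data is generated removes the duplicate repeatWhen result and shrinks the payload at the same time.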

benlesh · 2023-01-21T07:01:59Z

@jakovljevic-mladen is this still something you're considering merging?

jakovljevic-mladen · 2023-01-21T13:49:53Z

@benlesh, yes, but I'm still waiting for #6913 to be merged. I should've written that in the comments earlier so that @tmair knows as well, I'm so sorry about that.

benlesh · 2023-03-07T21:54:43Z

@jakovljevic-mladen #6913 is approved and ready to merge. Sorry for the delay.

jakovljevic-mladen

@tmair, thank you a lot for making these changes. I read your exhaustive explanation of how search works and I agree with these changes. During my testing a couple of months ago, I had issues when searching for the term from: my results were different from what you described, so I assumed this was due to conflicting dependencies. Now that #6913 is merged, I rebased this PR against master, ran npm install and tried the search, and it worked the way you described.

Since I rebased, I'd like you to review this PR to check that the merge conflicts were resolved correctly. Also, we have issues with merging to master due to an issue with Node 14, which has been removed, but the pipelines still show it as a required step.

tmair · 2023-03-13T11:34:32Z

@jakovljevic-mladen I reviewed the PR and also did a short local test of the new search implementation. So far everything looks good.

The only thing that was different for me was that an npm install altered package-lock.json. I don't know if that is an issue, since you have already committed the changes to the lock file.

jakovljevic-mladen · 2023-03-13T13:47:34Z

> I don't know if that is an issue

It shouldn't be an issue. Doing an npm install changes package-lock.json almost every time, but we have the latest package-lock.json, so we should be fine. I'll merge as soon as we fix the Node 14 build issue.

* docs: update search to aio search
* chore: use custom stop word filter for lunr
* docs: applay review suggestions
* docs: remove stop word filtering during search
* docs: improve first search query to search only in title
* docs: remove deprecated import paths from search results
* chore: commit package-lock.json

---------

Co-authored-by: Mladen Jakovljević <[email protected]>

(cherry picked from commit 0175187)

jakovljevic-mladen · 2023-03-27T10:25:08Z

Thanks for the fix, @tmair, and sorry that you had to wait this long for the merge. It has been cherry-picked to the 7.x branch and deployed to rxjs.dev.

tmair force-pushed the update-search branch from a8ba29d to b290007 Compare May 24, 2022 06:54

jakovljevic-mladen reviewed Jul 11, 2022

tmair force-pushed the update-search branch from b290007 to ecabaed Compare October 1, 2022 17:23

tmair marked this pull request as ready for review October 1, 2022 17:34

tmair changed the title ~~Improve documentation search~~ docs: improve search for documentation Oct 1, 2022

benlesh assigned jakovljevic-mladen Mar 7, 2023

tmair and others added 7 commits March 13, 2023 09:19

docs: update search to aio search

0a6221d

chore: use custom stop word filter for lunr

6933763

docs: applay review suggestions

73e07c7

docs: remove stop word filtering during search

8d2425c

docs: improve first search query to search only in title

5e1c4a8

docs: remove deprecated import paths from search results

cc602b2

chore: commit package-lock.json

82436e0

jakovljevic-mladen force-pushed the update-search branch from ecabaed to 82436e0 Compare March 13, 2023 08:34

jakovljevic-mladen approved these changes Mar 13, 2023

jakovljevic-mladen merged commit 0175187 into ReactiveX:master Mar 27, 2023

tmair deleted the update-search branch March 27, 2023 16:26

scottwad mentioned this pull request Jun 28, 2023

[Snyk] Upgrade rxjs from 7.5.1 to 7.8.1 scottwad/rxjs#4

Open

+                  "forth",
+                  "found",
+                  "four",
+                  "from",

+                'we',
+                'were',
+                'what',
+                'when',
