docs: improve search for documentation #6952

Merged
merged 7 commits into from
Mar 27, 2023

Conversation

tmair
Contributor

@tmair tmair commented May 2, 2022

Description:
This PR ports the search web worker and its integration from angular.io to the RxJS documentation site. The main changes are outlined on the original commit in the angular repository: angular/angular@fccffc6

In addition to the changes ported from the Angular documentation, the following changes were made:

  • Adapted the default lunr stop word filter so that from, of and every are no longer treated as stop words (this allows searching for the of function)
  • Disabled the ignore word list for class and interface members (this makes it possible to find results for next, as next is a member of the Observable interface). This is now also fixed on angular.io

Note:
This PR needs to be rebased after the dependency issues in #6913 are merged.

Related issue (if exists):
This should fix #6500
There should be no regression for #4536

@jakovljevic-mladen
Member

@tmair, this is a great PR, thank you for this one. I will review it as soon as I find some time. I should be able to do it in the next couple of days.

Member

@jakovljevic-mladen jakovljevic-mladen left a comment

This is a great PR, thank you a lot for making it. It took me a while to get into this stuff so that I could do a proper review. The changes look good to me; however, I've got some questions/comments.

One of them is related to the duplicated results (not related to this PR, but could be handy if we fixed the issue). For example, in this picture:

image

You can see that repeatWhen appears twice (one result links to this path: /api/index/function/repeatWhen, and the other to this path: /api/operators/repeatWhen). Is it possible to remove docs with the /api/operators/ path (since this path is deprecated)?

Other questions/comments in files.

docs_app/.gitignore (outdated, resolved)
"forth",
"found",
"four",
"from",
Member

It looks like this line breaks the existing functionality of the generated keywords. Please take a look at this gif (where I compared the changes from this PR with the official site):

from

However, removing this line won't make it much better:

image

Querying "from" will find 244 results instead of only one (like in the gif). So, I'm not sure what would be the best approach here...

Contributor Author

Please see the explanation of this issue in my comment below.

'we',
'were',
'what',
'when',
Member

Searching for "when" still returns some results (because of repeatWhen and others).
image

How do these stop words work? If "when" is declared as a stop word, shouldn't there be no results in that case?

Anyways, I think that "when" should be removed from stop words as there are some operators that have this word in their name. "with" and "while" as well (e.g. withLatestFrom and skipWhile) and possibly others (I haven't checked the entire list).

Contributor Author

Stop words only match whole terms. So when the word when is merely part of a longer term (e.g. repeatWhen), that term is not filtered.
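
A minimal sketch of that behaviour, not the lunr implementation itself: the filter compares whole tokens against the list, so when is dropped while repeatwhen survives. The helper name makeStopWordFilter and the sample tokens are hypothetical.

```typescript
// Hypothetical helper: builds a filter that drops a token only when the
// whole token is a stop word. Substring matches are deliberately ignored.
function makeStopWordFilter(stopWords: string[]): (token: string) => string | undefined {
  const stopSet = new Set(stopWords);
  return (token) => (stopSet.has(token) ? undefined : token);
}

const stopWordFilter = makeStopWordFilter(["when", "with", "while"]);

// Tokens as they might look after lower-casing operator names.
const filteredTokens = ["when", "repeatwhen", "skipwhile", "with"]
  .map(stopWordFilter)
  .filter((token): token is string => token !== undefined);
```

Only the standalone when and with are removed; repeatwhen and skipwhile remain searchable.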

@tmair
Contributor Author

tmair commented Jul 13, 2022

Hi @jakovljevic-mladen,

thanks for the review. I will look at your questions, but right now I don't have the time for it. It will take me 3 to 4 weeks to get back to it, but I will continue working on it.

@tmair
Contributor Author

tmair commented Oct 1, 2022

I rebased the branch on the current master and tried to incorporate your review feedback into my branch.

Since the search is somewhat difficult to understand, I have made some notes on how it works. I hope they help in understanding the code base and can also serve as a reference for future work in this area.

How does the search work?

The search is split into a preprocessing step at build time and an index creation step (inside a web worker) at runtime.

The preprocessing happens in generateKeywords.js and performs the following steps:

  1. Calculate the set of pages that need to be indexed for searching.
  2. For each non-ignored property (ignoredProperties) of each page document, extract the keywords used to build the search index (mainTokens). Here the ignore word list is used to filter out common English terms (e.g. you, from, every, ...).
  3. Extract all public members from the code and add those to the search index (memberTokens), ignoring the ignore word list.
  4. Extract keywords from the headings of the documents (headingTokens), using the ignore word list.
  5. Generate the search data that will be downloaded by the web worker to build the search index. The search data is already stemmed (https://lunrjs.com/guides/core_concepts.html#stemming) to further reduce the amount of data that needs to be transferred to the web worker.
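
The token extraction in steps 2 and 3 can be sketched roughly as follows. This is a simplified illustration, not the code in generateKeywords.js: the PageDoc and SearchEntry shapes, the tokenize helper, the sample path and the short ignore word list are all assumptions, and stemming and heading tokens are left out.

```typescript
// Assumed, simplified input and output shapes.
interface PageDoc {
  path: string;
  title: string;
  content: string;
  members: string[];
}

interface SearchEntry {
  path: string;
  title: string;
  keywords: string;
  members: string;
}

// A tiny stand-in for the real ignore word list.
const ignoreWords = new Set(["you", "from", "every", "of", "with"]);

// Split text into lower-case word tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9_]+/g) || [];
}

function buildSearchEntry(doc: PageDoc): SearchEntry {
  // mainTokens: page text with the ignore word list applied.
  const mainTokens = tokenize(doc.content).filter((t) => !ignoreWords.has(t));
  // memberTokens: public member names, NOT filtered by the ignore word list.
  const memberTokens = doc.members.map((m) => m.toLowerCase());
  return {
    path: doc.path,
    title: doc.title,
    keywords: Array.from(new Set(mainTokens)).join(" "),
    members: Array.from(new Set(memberTokens)).join(" "),
  };
}

const entry = buildSearchEntry({
  path: "api/index/class/Observable",
  title: "Observable",
  content: "import { Observable } from 'rxjs'; every value you emit",
  members: ["next", "subscribe", "from"],
});
```

In this sketch from is dropped from the page keywords but kept among the member tokens, which is the asymmetry that lets a member such as next be found.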

The web worker performs the following steps before it can answer any search queries:

  1. Load the search data computed by the preprocessing step described above
  2. Create a processing pipeline that omits the stemming step, as stemming was already performed during preprocessing. However, a stop word filtering step is still performed.
  3. Process the data through the processing pipeline to create a search index
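
A rough sketch of these steps, assuming a plain inverted index instead of lunr. The SearchData shape, the runPipeline and buildIndex helpers, and the sample entries are hypothetical; the point is only that the pipeline contains a stop word step and no stemmer.

```typescript
// Assumed shape of one entry in the preprocessed search data.
interface SearchData {
  path: string;
  title: string;
  keywords: string; // already stemmed at build time
}

// A pipeline step returns the (possibly changed) token, or undefined to drop it.
type PipelineStep = (token: string) => string | undefined;

function runPipeline(tokens: string[], steps: PipelineStep[]): string[] {
  const result: string[] = [];
  for (const token of tokens) {
    let current: string | undefined = token;
    for (const step of steps) {
      if (current === undefined) break;
      current = step(current);
    }
    if (current !== undefined) result.push(current);
  }
  return result;
}

// Inverted index: token -> set of page paths containing it.
function buildIndex(data: SearchData[], steps: PipelineStep[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const page of data) {
    const raw = `${page.title} ${page.keywords}`.toLowerCase().split(/\s+/).filter(Boolean);
    for (const token of runPipeline(raw, steps)) {
      const paths = index.get(token) ?? new Set<string>();
      paths.add(page.path);
      index.set(token, paths);
    }
  }
  return index;
}

const stopWords = new Set(["the", "a"]);
const dropStopWords: PipelineStep = (t) => (stopWords.has(t) ? undefined : t);

// No stemmer in the pipeline: only a stop word step.
const searchIndex = buildIndex(
  [
    { path: "api/index/function/from", title: "from", keywords: "creat observ the" },
    { path: "api/index/function/of", title: "of", keywords: "emit a valu" },
  ],
  [dropStopWords],
);
```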

During the preprocessing and indexing of the search data there are two places where words are ignored. The ignore word list is used to filter common English terms during the preprocessing phase (it is not applied to member tokens). The stop word filter performs the same operation when building the index in the web worker.

Improvements to the searching process done in this PR

Ignore words and stop words filtering

Because it is strange to have two ignore word lists and two filtering steps, I decided to remove the stop word filter from the index builder pipeline completely, as it does the same work twice and is quite confusing. However, I have left all words (including from, every, of, with) in the ignore word filter, as the search results are not great when those words are not ignored. For example, from appears on every documentation page that includes an example, because of the import ... from ... syntax. The relevant pages are still found, because the title of the page is not filtered. Therefore, searching for from still returns the pages where from is included in the title.

Searching for from

Regarding the different search results for from, here is what is happening during the search:

  1. First, search for documents that contain all of the provided search terms (separated by spaces); in our case, from
  2. If the search has at least one result, return that result set (this is what happens for from, which is why there is only one search result)
  3. Search for documents that contain any of the search terms
  4. If the search has at least one result, return that result set
  5. Search for documents that contain any of the search terms and contain the first search term somewhere in the title

So for from, the first step already yields exactly one result, and that single result is returned.
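
The fallback order can be sketched as below. This is an illustration only, with exact token matching in place of lunr queries; the IndexedPage shape, the stagedSearch name and the sample pages are hypothetical, and the last stage is modeled as a substring match of the first term against the title, which is an assumption about what "somewhere in the title" means.

```typescript
// Assumed, simplified shape of an indexed page.
interface IndexedPage {
  path: string;
  title: string;
  tokens: string[];
}

function stagedSearch(pages: IndexedPage[], query: string): IndexedPage[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return [];

  // Stage 1: every search term must be present.
  const allTerms = pages.filter((p) => terms.every((t) => p.tokens.indexOf(t) !== -1));
  if (allTerms.length > 0) return allTerms;

  // Stage 2: at least one search term must be present.
  const anyTerm = pages.filter((p) => terms.some((t) => p.tokens.indexOf(t) !== -1));
  if (anyTerm.length > 0) return anyTerm;

  // Stage 3 (assumption): the first term appears anywhere inside the title.
  const first = terms[0];
  return pages.filter((p) => p.title.toLowerCase().indexOf(first) !== -1);
}

const samplePages: IndexedPage[] = [
  { path: "api/index/function/from", title: "from", tokens: ["from", "creat", "observ"] },
  { path: "api/index/function/fromEvent", title: "fromEvent", tokens: ["fromevent", "event"] },
  { path: "api/index/function/withLatestFrom", title: "withLatestFrom", tokens: ["withlatestfrom"] },
];

const fromResults = stagedSearch(samplePages, "from");
const latestResults = stagedSearch(samplePages, "latest");
```

With this sample data, the query from stops at the first stage with a single hit, while latest falls through to the title stage and finds withLatestFrom.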

I modified the initial search so that it gives more weight to search terms that appear in the title of the page. Hopefully that will produce better search results.

Removing deprecated entrypoints from search results

I also removed the deprecated entrypoints from the search results (e.g. when searching for buffer). This will also reduce the size of the generated search data.
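
One way such a filter could look, as a sketch: the thread does not say how the PR detects deprecated entrypoints, so matching on the /api/operators/ path prefix (the deprecated path mentioned in the review above) is an assumption, and isDeprecatedPath is a hypothetical helper.

```typescript
// Assumption: deprecated entrypoints are recognised by their path prefix.
const deprecatedPrefixes = ["api/operators/"];

function isDeprecatedPath(path: string): boolean {
  // Ignore leading slashes so "/api/..." and "api/..." are treated alike.
  const normalized = path.replace(/^\/+/, "");
  return deprecatedPrefixes.some((prefix) => normalized.startsWith(prefix));
}

const candidatePaths = ["/api/index/function/repeatWhen", "/api/operators/repeatWhen"];

// Only non-deprecated pages are kept for the search data.
const indexedPaths = candidatePaths.filter((p) => !isDeprecatedPath(p));
```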

@tmair tmair marked this pull request as ready for review October 1, 2022 17:34
@tmair tmair changed the title Improve documentation search docs: improve search for documentation Oct 1, 2022
@benlesh
Member

benlesh commented Jan 21, 2023

@jakovljevic-mladen is this still something you're considering merging?

@jakovljevic-mladen
Member

jakovljevic-mladen commented Jan 21, 2023

@benlesh, yes, but I'm still waiting for #6913 to be merged. I should've written it in the comments earlier so that @tmair knows as well, I'm so sorry about that.

@benlesh
Member

benlesh commented Mar 7, 2023

@jakovljevic-mladen #6913 is approved and ready to merge. Sorry for the delay.

Member

@jakovljevic-mladen jakovljevic-mladen left a comment

@tmair, thank you a lot for making these changes. I read your exhaustive explanation of how the search works and I agree with these changes. During my testing a couple of months ago, I had issues when searching for the term from: my results were different from what you were describing. So, I assumed this was due to conflicting dependencies. Now that #6913 is merged, I rebased this PR against master, ran npm install and tried the search, and it worked the way you described it.

Since I did the rebase, I'd like you to review this PR to check that the merge conflicts are resolved correctly. Also, we have issues with merging to master due to an issue with Node 14, which is now removed, but the pipelines still show it as a required step.

@tmair
Contributor Author

tmair commented Mar 13, 2023

@jakovljevic-mladen I reviewed the PR and also did a short local test of the new search implementation. So far everything looks good.

The only thing that was different for me was that running npm install altered package-lock.json. I don't know if that is an issue, since you have already committed the changes to the lock file.

@jakovljevic-mladen
Member

I don't know if that is an issue

It shouldn't be an issue. Doing an npm install changes package-lock.json almost every time, but we have the latest package-lock.json, so we should be fine. I'll merge as soon as we fix the Node 14 build issue.

@jakovljevic-mladen jakovljevic-mladen merged commit 0175187 into ReactiveX:master Mar 27, 2023
jakovljevic-mladen pushed a commit that referenced this pull request Mar 27, 2023
* docs: update search to aio search

* chore: use custom stop word filter for lunr

* docs: applay review suggestions

* docs: remove stop word filtering during search

* docs: improve first search query to search only in title

* docs: remove deprecated import paths from search results

* chore: commit package-lock.json

---------

Co-authored-by: Mladen Jakovljević <[email protected]>

(cherry picked from commit 0175187)
@jakovljevic-mladen
Member

Thanks for the fix @tmair, and sorry that you had to wait this long for the merge. It is cherry-picked to the 7.x branch and deployed to rxjs.dev.

Successfully merging this pull request may close these issues.

Searching for "fromEvent" shows wrong/random results
3 participants