rustdoc: Smaller Search Index #13431

lifthrasiir · 2014-04-09T18:24:53Z

This is a series of inter-related commits which depend on #13402 (Prune the paths that do not appear in the index). Please consider this an early review request; I'll rebase once the parent PR is merged, if a rebase is required.

This PR aims at reducing the size of the search index without removing any actual information. In my measurements with both the library and compiler docs, the search index is 52% smaller before gzip and 16% smaller after gzip:

 1719473 search-index-old.js
 1503299 search-index.js (after #13402, 13% gain)
  724955 search-index-new.js (after this PR, 52% gain w.r.t. #13402)

  262711 search-index-old.js.gz
  214205 search-index.js.gz (after #13402, 18.5% gain)
  179396 search-index-new.js.gz (after this PR, 16% gain w.r.t. #13402)
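The percentages in the listing can be re-derived from the byte counts. A minimal sketch follows; the object and function names (`uncompressed`, `gzipped`, `gainPercent`) are made up for this illustration, and only the byte counts come from the measurement above.

```javascript
// Byte counts copied from the listing above.
const uncompressed = { old: 1719473, parent: 1503299, thisPr: 724955 };
const gzipped = { old: 262711, parent: 214205, thisPr: 179396 };

// Relative size reduction, in percent, when going from `before` to `after`.
function gainPercent(before, after) {
  return (1 - after / before) * 100;
}

const gains = {
  // #13402 relative to the old index
  parentPlain: gainPercent(uncompressed.old, uncompressed.parent),
  parentGz: gainPercent(gzipped.old, gzipped.parent),
  // this PR relative to #13402
  prPlain: gainPercent(uncompressed.parent, uncompressed.thisPr),
  prGz: gainPercent(gzipped.parent, gzipped.thisPr),
};

for (const [label, value] of Object.entries(gains)) {
  console.log(label, value.toFixed(1) + "%");
}
```

Rounded to the precision used in the listing, these reproduce the quoted gains.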

Both the uncompressed and compressed sizes of the search index have been taken into account. While the former will be less relevant once #12597 (Web site should be transferring data compressed) is resolved, the uncompressed index will be around for a while anyway and directly affects the UX of the docs. Moreover, LZ77 (and thus gzip) can only remove some repeated strings (since its search window is limited in size), so optimizing for the uncompressed size often has a positive effect on the compressed size as well.
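The window limitation can be observed directly. The sketch below assumes Node.js and its built-in `zlib` module; the helper `pseudoRandomBytes` and the block sizes are choices made for this illustration. It gzips a block followed by a copy of itself, once with the copy well inside DEFLATE's 32 KiB back-reference distance and once with the copy beyond it.

```javascript
const zlib = require("zlib");

// Deterministic, hard-to-compress filler (xorshift32); illustration only.
function pseudoRandomBytes(n, seed) {
  let x = seed;
  const out = Buffer.alloc(n);
  for (let i = 0; i < n; i++) {
    x ^= x << 13;
    x ^= x >>> 17;
    x ^= x << 5;
    out[i] = x & 0xff;
  }
  return out;
}

const WINDOW = 32 * 1024; // maximum back-reference distance in DEFLATE

// A 1 KiB block: its repeat starts 1 KiB later, inside the window.
const near = pseudoRandomBytes(1024, 1);
// A 40 KiB block: its repeat starts 40 KiB later, outside the window.
const far = pseudoRandomBytes(WINDOW + 8 * 1024, 2);

const nearOnce = zlib.gzipSync(near).length;
const nearTwice = zlib.gzipSync(Buffer.concat([near, near])).length;
const farOnce = zlib.gzipSync(far).length;
const farTwice = zlib.gzipSync(Buffer.concat([far, far])).length;

console.log("near repeat:", nearOnce, "->", nearTwice);
console.log("far repeat: ", farOnce, "->", farTwice);
```

The nearby repeat should cost almost nothing, while the distant repeat should roughly double the compressed size, which is why shrinking the raw index can also help the gzipped one.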

Each commit represents one of the following incremental improvements, in order:

1. Parent paths were referred to by their AST `NodeId`, which tends to be large. We don't need the actual node ID, so we remap them to smaller sequential numbers. This also means that the list of paths can be a flat array instead of an object.
2. We remap each item type to a small predefined number. This is strictly intended to reduce the uncompressed size of the search index.
3. We use arrays instead of objects and reconstruct the original objects in the JavaScript code. Since this removes a lot of boilerplate, it affects both the uncompressed and compressed size.
4. (I've found that a centralized `searchIndex` is easier to handle in JS, so I got rid of one global variable.)
5. Finally, repeated paths in consecutive items are omitted (replaced by an empty string). This also greatly affects both the uncompressed and compressed size.
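The first two steps can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the sample items, the node IDs, the `ITEM_TYPES` table and the function name `remap` are all hypothetical.

```javascript
// Hypothetical item-type table; a type's number is its position in this array.
const ITEM_TYPES = ["mod", "struct", "enum", "fn", "trait", "method"];

// Hypothetical input: items refer to their parent by a large node ID.
const rawItems = [
  { ty: "struct", name: "Vec", path: "std::vec", parent: null },
  { ty: "method", name: "push", path: "std::vec", parent: 104729 },
  { ty: "method", name: "pop", path: "std::vec", parent: 104729 },
  { ty: "method", name: "len", path: "std::str", parent: 224737 },
];
const rawParents = new Map([
  [104729, { ty: "struct", name: "Vec" }],
  [224737, { ty: "trait", name: "StrSlice" }],
]);

function remap(items, parents) {
  const seq = new Map(); // node ID -> small sequential number
  const paths = [];      // flat array, indexed by that number
  const out = items.map((item) => {
    let parent = null;
    if (item.parent !== null) {
      if (!seq.has(item.parent)) {
        // First time this parent is seen: give it the next free index.
        seq.set(item.parent, paths.length);
        const p = parents.get(item.parent);
        paths.push({ ty: ITEM_TYPES.indexOf(p.ty), name: p.name });
      }
      parent = seq.get(item.parent);
    }
    return {
      ty: ITEM_TYPES.indexOf(item.ty),
      name: item.name,
      path: item.path,
      parent,
    };
  });
  return { items: out, paths };
}

const remapped = remap(rawItems, rawParents);
console.log(JSON.stringify(remapped));
```

Because the parent numbers are assigned in order of first use, they double as indices into `paths`, which is what allows the path list to be a flat array.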

There have been several unsuccessful attempts to reduce the search index. In particular, I explicitly avoided complex optimizations like encoding paths in a compressed form, and only applied an optimization when its gain was substantial relative to the size of the change. Also, while I've tried to be careful, the lack of proper (non-smoke) tests makes me a bit worried; any advice on testing the search indices would be appreciated.

alexcrichton · 2014-04-10T14:54:56Z

Impressive wins, nice job!

I do wish we had some tests for this, but I think the generic "I search for something" should be sufficient for now.

lifthrasiir · 2014-04-10T16:54:15Z

@alexcrichton I've reworked the item types commit and the subsequent commits (GitHub seems to have lost track of the prior commits; the original commit is 2e8063459ccc348ce1ec25bb318570d3e4301918). Is this good enough?

Regarding the search index tests, by the way, I'm thinking of simple equivalence tests between two search indices (ignoring the item and path orders). I think we don't have tests for inlining docs for publicly reexported private items (#11391) either; I've filed #13444 for this matter.
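One way such an order-insensitive equivalence check could look is sketched below. The index shape (`items` with a numeric `parent` pointing into `paths`) and the function names are assumptions made for this example. Parent numbers depend on path order, so they are resolved to the parent's type and name before comparing.

```javascript
// Turn each item into a canonical string with its parent resolved,
// then sort so that item order no longer matters.
function canonicalItems(index) {
  return index.items
    .map((item) => {
      const parent = item.parent === null ? null : index.paths[item.parent];
      return JSON.stringify([
        item.ty,
        item.name,
        item.path,
        parent ? [parent.ty, parent.name] : null,
      ]);
    })
    .sort();
}

function equivalentIndices(a, b) {
  const ca = canonicalItems(a);
  const cb = canonicalItems(b);
  return ca.length === cb.length && ca.every((s, i) => s === cb[i]);
}

// Two hypothetical indices with the same content in different orders.
const indexA = {
  paths: [{ ty: 1, name: "Vec" }, { ty: 4, name: "StrSlice" }],
  items: [
    { ty: 5, name: "push", path: "std::vec", parent: 0 },
    { ty: 5, name: "len", path: "std::str", parent: 1 },
  ],
};
const indexB = {
  paths: [{ ty: 4, name: "StrSlice" }, { ty: 1, name: "Vec" }],
  items: [
    { ty: 5, name: "len", path: "std::str", parent: 0 },
    { ty: 5, name: "push", path: "std::vec", parent: 1 },
  ],
};
// Same as indexA, but `len` is attached to the wrong parent.
const indexC = {
  paths: indexA.paths,
  items: [
    { ty: 5, name: "push", path: "std::vec", parent: 0 },
    { ty: 5, name: "len", path: "std::str", parent: 0 },
  ],
};

console.log(equivalentIndices(indexA, indexB), equivalentIndices(indexA, indexC));
```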

alexcrichton · 2014-04-10T16:57:07Z

Thanks for filing an issue! This looks great to me!

Commit message for the last change in the series (omitting repeated paths): Since the items roughly follow lexical order, there are many consecutive items with the same path value, which can be easily compressed. For the library and compiler docs, this commit decreases the index size by 26% before gzip and 6% after gzip.
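The path omission can be sketched as a pair of functions. This is a minimal illustration, assuming every real path is non-empty so that an empty string can safely mean "same path as the previous item"; the row shape `[name, path]` and the names `elidePaths` and `restorePaths` are hypothetical.

```javascript
// Encoder: blank out a path when it equals the previous item's path.
// Assumption: no item has a genuinely empty path.
function elidePaths(rows) {
  let last = "";
  return rows.map(([name, path]) => {
    const encoded = [name, path === last ? "" : path];
    last = path;
    return encoded;
  });
}

// Decoder: an empty path means "reuse the most recent non-empty path".
function restorePaths(rows) {
  let last = "";
  return rows.map(([name, path]) => {
    if (path !== "") last = path;
    return [name, last];
  });
}

const sampleRows = [
  ["Vec", "std::vec"],
  ["push", "std::vec"],
  ["pop", "std::vec"],
  ["len", "std::str"],
  ["trim", "std::str"],
];

const elided = elidePaths(sampleRows);
const restored = restorePaths(elided);
console.log(JSON.stringify(elided));
```

The scheme pays off only when items with the same path are adjacent, which is the lexical-order observation made in the commit message.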

lifthrasiir · 2014-04-14T01:42:47Z

The merge seems to be blocked by 0bf4e90. Rebased now.

nrc · 2016-07-18T19:59:31Z

What is the motivation for having a smaller search index? Does it make any impact on performance?

steveklabnik · 2016-07-18T20:01:14Z

@nrc I would imagine that everyone hitting https://doc.rust-lang.org/std appreciates having to download less data, and yeah, it should be faster as well.

nrc · 2016-07-19T03:35:03Z

Ah right, we download the whole search index.

@steveklabnik other than time to download, what would get faster? Is JSON faster to deserialise if it is smaller (I mean, denser really, obviously if there is less work to do then it will be faster)?

lifthrasiir added 5 commits April 14, 2014 09:59

rustdoc: Use smaller sequential numbers instead of NodeIds for parents. (ab6915d)

`allPaths` is now a flat array in effect. This decreases the size of the search index by about 4-5% (gzipped or not).

rustdoc: Represent item types as a small number in the search index. (f1de04c)

Has negligible improvements with gzip, but saves about 7% without it. This also has an effect of changing the tie-breaking order of item types.

rustdoc: Use an array instead of an object for the search index. (f6854ab)

`buildIndex` JS function recovers them into the original object form. This greatly reduces the size of the uncompressed search index (27%), while this effect is less visible after gzipped (~5%).

rustdoc: Get rid of allPaths global variable by merging it into `searchIndex`. (9eb336a)
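The array-instead-of-object change (f6854ab) can be illustrated with a small sketch. The field list, the sample items and the function names `toRow` and `fromRow` are assumptions for this example; they are not taken from rustdoc's actual `buildIndex` code.

```javascript
// Assumed field order shared by the encoder and the decoder.
const FIELDS = ["ty", "name", "path", "desc", "parent"];

// Encoder side: drop the keys and keep only the values, in a fixed order.
function toRow(item) {
  return FIELDS.map((field) => item[field]);
}

// Decoder side: rebuild the original object from a positional row.
function fromRow(row) {
  const item = {};
  FIELDS.forEach((field, i) => {
    item[field] = row[i];
  });
  return item;
}

// Hypothetical sample items.
const sampleItems = [
  { ty: 1, name: "Vec", path: "std::vec", desc: "An owned, growable vector", parent: null },
  { ty: 5, name: "push", path: "std::vec", desc: "Append an element", parent: 0 },
];

const asObjects = JSON.stringify(sampleItems);
const asArrays = JSON.stringify(sampleItems.map(toRow));
const rebuilt = JSON.parse(asArrays).map(fromRow);

console.log(asObjects.length, "bytes as objects,", asArrays.length, "bytes as arrays");
```

The saving comes from not repeating the key names for every item; the cost is that the decoder must know the field order in advance.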

bors closed this Apr 14, 2014

bors merged commit 8f5d71c into rust-lang:master Apr 14, 2014

jonas-schievink mentioned this pull request Jul 18, 2016

IndexItem should be serialised as an object rather than an array #34678

Closed
