-
Follow-up: this test site is (obviously) served by GitHub Pages, which applies gzip compression only to certain media types, and the search index's default extension isn't one of them. So, in the GitHub action I added a step so that the index is served as a media type GitHub Pages will compress. This reduces the download from 6.0 MB to 2.3 MB; we are definitely moving in the right direction. But the Lighthouse (mobile) report remains essentially the same. Brotli can compress this further, to about 1.6 MB, but that's not a configurable option for GitHub Pages. My next step is a Netlify test. It would be convenient if the default index extension corresponded to a media type that is compressed by default on Apache, NGINX, etc.
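For anyone who wants to reproduce the size figures locally before touching the deployment, here is a minimal sketch (the index filename is a placeholder, and `brotli` is a third-party package, not part of the standard library):

```python
import gzip
from pathlib import Path

import brotli  # third-party: pip install brotli

index_path = Path("federalist.st")  # placeholder: point this at your generated Stork index
raw = index_path.read_bytes()

gz = gzip.compress(raw, compresslevel=9)   # roughly what a gzip-capable host serves
br = brotli.compress(raw, quality=11)      # best-case offline Brotli

def mb(n: int) -> str:
    return f"{n / 1_000_000:.1f} MB"

print(f"raw:    {mb(len(raw))}")
print(f"gzip:   {mb(len(gz))}")
print(f"brotli: {mb(len(br))}")
```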
-
Thanks for this investigation, and I'm excited to see how you're using Stork! I'll pick out pieces and respond individually:
Definitely something I want to look at - it's been on my list for a while. I think this change would reduce index size by about 20%.
This will provide the most meaningful size reduction, since much of an index file is made up of the mapping between words and results. I'll play with some different ways to reduce the number of words indexed in the output file -- I'm excited by some of the ideas in the listed issue but want to think more about how they'd affect the configuration API.
Definitely something I want to work on more. At minimum, the indexer should gzip the bag of bytes before saving the file, and the WASM module should unzip it upon registration. Getting the server & browser to do this automatically by saving the file as …
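To make the "gzip before save, unzip on registration" idea concrete, here is a minimal sketch of the round-trip in Python (Stork's indexer and WASM loader are Rust, so this only illustrates the idea, not the actual implementation):

```python
import gzip

def save_index(index_bytes: bytes, path: str) -> None:
    # Indexer side: compress the serialized "bag of bytes" before writing it out.
    with open(path, "wb") as f:
        f.write(gzip.compress(index_bytes, compresslevel=9))

def load_index(path: str) -> bytes:
    # Loader side: decompress transparently when the index is registered.
    with open(path, "rb") as f:
        return gzip.decompress(f.read())
```

The appeal of baking compression into the file format is that the index stays small even on hosts that won't compress unfamiliar media types; the trade-off is that decompression happens in the WASM/JS layer instead of the browser's built-in Content-Encoding handling.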
-
Update: the test site is now available on both GitHub Pages and Netlify. Although Brotli is capable of reducing the index file from 6.0 MB to 1.5 MB, Netlify's use of Brotli is less aggressive, producing a 2.1 MB file. That's only about a 10% improvement over gzip compression. I've opened a related topic on the Netlify forum: https://answers.netlify.com/t/serving-pre-compressed-brotli-files/53515

The Lighthouse (mobile) report for the site served by Netlify is essentially the same as what I am seeing for the site served by GitHub, so I'm not going to post the results. No need to respond to this; I just wanted to provide an update.
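For context on why on-the-fly Brotli can be larger than an offline build: Brotli's quality level trades CPU time per request for compression ratio, and hosts that compress per-request generally pick a lower level than the offline maximum of 11. A small sketch to see the spread on a given index (the filename is a placeholder, and `brotli` is a third-party package):

```python
import brotli  # third-party: pip install brotli

raw = open("federalist.st", "rb").read()  # placeholder: your generated index

# Lower qualities are common for per-request compression; 11 is the offline maximum.
for quality in (4, 7, 11):
    size = len(brotli.compress(raw, quality=quality))
    print(f"quality {quality:2d}: {size / 1_000_000:.1f} MB")
```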
-
I was curious whether splitting a large file into chunks would improve performance (downloading the chunks in parallel), so I ran some rudimentary tests. Short answer: chunking appears to hurt rather than help. Details here:
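For anyone who wants to repeat the experiment, the shape of such a test is roughly the sketch below (URLs are placeholders; real results depend heavily on HTTP/2 multiplexing and connection reuse, which likely explains why chunking doesn't help):

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

WHOLE = "https://example.com/index.st"  # placeholder URL
CHUNKS = [f"https://example.com/index.st.{i:03d}" for i in range(8)]  # placeholder pre-split parts

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def timed(label: str, fn) -> None:
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f} s")

timed("single file", lambda: fetch(WHOLE))
with ThreadPoolExecutor(max_workers=8) as pool:
    timed("8 parallel chunks", lambda: list(pool.map(fetch, CHUNKS)))
```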
-
For the test site referenced above, v1.4.1 and v1.4.2 produce different index files (as expected), though the file size is identical. @jameslittle230, does this make sense to you?

The index produced by v1.4.2 does, however, have better potential for compression: gzipped, it is about 14% smaller than the gzipped v1.4.1 index, and the reduction with Brotli compression is comparable.
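A quick way to reproduce that comparison between the two builds (the index filenames are placeholders):

```python
import gzip
from pathlib import Path

def gzipped_size(path: str) -> int:
    return len(gzip.compress(Path(path).read_bytes(), compresslevel=9))

old = gzipped_size("index-1.4.1.st")  # placeholder filenames for the two builds
new = gzipped_size("index-1.4.2.st")
print(f"v1.4.1: {old} B  v1.4.2: {new} B  reduction: {1 - new / old:.1%}")
```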
-
I think it would make sense to add the size of the gzipped index to the benchmarks.
-
I recognize that I am pressing the limits, but I wanted to understand where Stork is a good fit today, and where it might be a good fit tomorrow. My starting point was, "The client publishes one short article per week. Where will they be in 5 years? Let's double that and test Stork."
Live site: https://jmooring.github.io/hugo-stork
Source: https://github.com/jmooring/hugo-stork (published site is in the gh-pages branch)
There are 500 articles, with an average of about 520 words per article. Once the site is loaded, the search is ~~fast~~ ~~really fast~~ instantaneous.

But as you might guess, with a 6 MB index file, the site doesn't load as fast as I might like, and the Lighthouse (mobile) report reflects that. Yeah, I don't trust Lighthouse that much either. But clients do.
Does anyone have any tips for how to intelligently decrease the size of the index with the tools we have today?
In the future, it seems like #250 would decrease the index size, though I don't know if it would produce a meaningful reduction.
As I understand it, the index includes the full text of each file it indexes. This is great because you have context when displaying the search results, but on this site that's about 1.5 million characters. Perhaps there's an opportunity for improving compression, or an option to index the entire file but only display a short summary in the search results (the summary would be a separate element in the file object).
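Rough arithmetic on the figures above (the characters-per-word value is my assumption) puts the stored full text at roughly a quarter of the 6 MB index, which is why both better compression and a summary-only display option seem worth pursuing:

```python
# Back-of-the-envelope estimate of the full-text share of the index.
articles = 500
words_per_article = 520          # average, per the figures above
chars_per_word = 6               # assumption, including a trailing space

full_text_chars = articles * words_per_article * chars_per_word
index_bytes = 6_000_000          # observed index size

print(f"~{full_text_chars / 1e6:.1f} million characters of stored text")
print(f"roughly {full_text_chars / index_bytes:.0%} of the index")
```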
Any advice would be appreciated.