Index Size for Impact indexes #2057

thibault-formal · 2023-01-30T11:38:26Z

Hi,
I am indexing the outputs of a bunch of SPLADE models with Anserini, following the reproduction guide. To index, I simply run:

$PATH_ANSERINI/target/appassembler/bin/IndexCollection -collection JsonVectorCollection -input jsonl_collection/ -index lucene_index/ -generator DefaultLuceneDocumentGenerator -threads 16 -impact -pretokenized

I notice that, even for very sparse SPLADE models, the index size remains quite large, e.g. 1.3GB for a model with an average doc size of 20. In contrast, the BM25 index is around 0.6GB.
Is there something I am missing regarding impact indexes or quantization? I would expect the SPLADE index to be much smaller.

Thanks,
Thibault

thibault-formal · 2023-01-30T11:47:18Z

Could this be related to the -optimize option? (cc @cadurosar)

2 replies

lintool Jan 30, 2023
Maintainer

Hi @thibault-formal - SPLADE indexes being larger in size is actually expected and known; see, for example, https://dl.acm.org/doi/10.1145/3576922 and https://arxiv.org/abs/2110.11540

The -optimize option merges multiple index segments down into a single segment. This is needed if you want to get an accurate count of the vocab size.

Hope this helps!

thibault-formal Jan 31, 2023
Author

Hi @lintool, thanks for the answer!
I am quite aware of this paper and of this issue in general, but the SPLADEv2 model used there does a lot of expansion, so document vectors end up having more terms than the original documents, which explains the larger indexes.
What is bothering me, however, is that when I index very sparse models (for instance, with an average size of 20 for document vectors), the index size is still large. I don't think I am doing anything wrong (like storing the raw documents), so I am wondering where it might come from...
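One way to narrow down where the bytes go is to break the index directory down by file type. The sketch below is only an illustration, not something from the thread: the function names `size_by_extension` and `report` are made up here. It relies on Lucene's usual file extensions, where `.fdt` holds stored fields, `.doc` holds document IDs and frequencies, `.pos` holds positions, and `.tim` holds the term dictionary. If the index uses compound files (`.cfs`), everything is bundled together and the breakdown is much less informative.

```python
from collections import defaultdict
from pathlib import Path


def size_by_extension(index_dir):
    """Return (extension, total_bytes) pairs for index_dir, largest first."""
    totals = defaultdict(int)
    for path in Path(index_dir).iterdir():
        if path.is_file():
            # Files without a suffix (e.g. segments_N) are keyed by their name.
            key = path.suffix or path.name
            totals[key] += path.stat().st_size
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def report(index_dir):
    """Print one line per extension with its total size in MB."""
    for ext, size in size_by_extension(index_dir):
        print(f"{ext:>12}  {size / 1e6:10.1f} MB")
```

Calling `report("lucene_index/")` on the SPLADE index and on the BM25 index should show whether the difference sits in the postings files (`.doc`, `.pos`) or in stored fields (`.fdt`).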

JMMackenzie · 2023-02-28T03:28:30Z

Hey @thibault-formal - it's probably something to do with the compression codec on the frequencies. You might like to try building a quantized BM25 index to see if that makes the size closer to what you are observing. If so, then my assumption would be that the integer codec for compressing frequencies assumes very small numbers and struggles to compress larger quantized values.

This is totally a guess though, just based on prior experience. I just stumbled across this conversation so I thought I'd add my 2c.
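To make the guess above concrete, here is a rough sketch of why larger values cost more under block-wise bit-packing. It assumes a simplified scheme where each block of 128 values is stored at the bit width of its largest value; this is a stand-in for frame-of-reference style codecs, not Lucene's actual implementation. The function `packed_bits` and both value lists are hypothetical.

```python
def packed_bits(values, block_size=128):
    """Bits needed to store values with per-block fixed-width bit-packing.

    Each block uses the bit width of its largest value. This is a simplified
    model of a frequency codec, NOT Lucene's real one.
    """
    total = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        width = max(v.bit_length() for v in block)
        total += width * len(block)
    return total


# Hypothetical postings list with 1024 entries.
# BM25-style term frequencies: mostly 1, occasionally 2 or 3.
term_freqs = [1, 1, 2, 1, 1, 3, 1, 1] * 128
# Hypothetical quantized impact scores spread over 1..255.
impacts = [(i * 37) % 255 + 1 for i in range(1024)]

tf_bits = packed_bits(term_freqs)
impact_bits = packed_bits(impacts)

# Bits per posting for each list.
print(tf_bits / len(term_freqs), impact_bits / len(impacts))  # prints "2.0 8.0"
```

Under these assumptions, the same number of postings takes four times as many bits in the frequency slot, so a sparse impact index can still end up larger than a BM25 index with more postings.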

1 reply

thibault-formal Feb 28, 2023
Author

Hi @JMMackenzie - Thanks for the answer, I will have a look later in the week!
