Index Size for Impact indexes #2057
-
Could this be related to the |
-
Hey @thibault-formal - it's probably something to do with the compression codec used for the frequencies. You might like to try building a quantized BM25 index to see whether its size comes closer to what you are observing. If it does, my assumption would be that the integer codec for compressing frequencies expects very small numbers and struggles to compress larger quantized values. This is totally a guess though, just based on prior experience. I just stumbled across this conversation so I thought I'd add my 2c.
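To make that guess a bit more concrete, here is a rough sketch of why larger values can cost more under block-wise bit-packing. This is not Lucene's actual codec: the block size of 128, the helper names, and the sample values are all assumptions made up for illustration. The idea is only that each block is stored at the bit width of its largest value, so a few large quantized impacts widen every entry in the block.

```python
def bits_per_value(block):
    """Bit width needed to store every entry of a block at a fixed width.

    The width is set by the largest value in the block (at least 1 bit).
    """
    return max(1, max(block).bit_length())


def packed_size_bytes(values, block_size=128):
    """Total bytes if values are bit-packed block by block (hypothetical scheme)."""
    total_bits = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        total_bits += bits_per_value(block) * len(block)
    return (total_bits + 7) // 8  # round up to whole bytes


# Made-up data: raw term frequencies are mostly tiny...
bm25_tfs = [1, 1, 2, 1, 3, 1, 1, 2] * 16
# ...while quantized impact scores spread over a much wider range.
splade_impacts = [37, 112, 5, 201, 64, 18, 90, 143] * 16

print(packed_size_bytes(bm25_tfs))        # 128 values at 2 bits each -> 32 bytes
print(packed_size_bytes(splade_impacts))  # 128 values at 8 bits each -> 128 bytes
```

With the same number of postings, the wider values take four times the space in this toy setup. Whether this explains the gap in the real index would still need checking, for example by comparing against a quantized BM25 index as suggested above.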
-
Hi,
I am indexing a bunch of SPLADE models with Anserini, following the reproduction guide. To index, I simply do:
$PATH_ANSERINI/target/appassembler/bin/IndexCollection -collection JsonVectorCollection -input jsonl_collection/ -index lucene_index/ -generator DefaultLuceneDocumentGenerator -threads 16 -impact -pretokenized
I notice that, even for very sparse SPLADE models, the index size remains quite large, e.g. 1.3GB for a model with an average doc size of 20. In contrast, the BM25 index weighs around 0.6GB.
Is there something I am missing regarding impact indexes or quantization? I would expect the SPLADE index to be much smaller.
Thanks,
Thibault