Index is apparently limited to 4 GB #351
Oh yikes, I'm sorry you're running into that. It does look like this would involve supporting / moving to a 64-bit-based index file. That's not work we have slated, but I think a PR would be appreciated. We're actively running it on what I thought was a large repository, but it looks like the repo itself is only about 8 gigs.
This seems to be an important issue.
@rfan-debug it's likely something we'll have to do at some point. Are you interested in tackling it?
I think fixing it is not difficult. However, I am not sure how to test it reliably if I change any code. It seems that we don't have sufficient integration tests.
I think that's part of what makes this issue tricky. If you're willing to write unit or integration tests, I'd definitely welcome that as well.
I think unit tests are sufficient for the current change. I skimmed over the codesearch code, and I found that the root cause of the 4GB limit is the data type used for index offsets. Now I think a good way to build up the integration test set is:
I gave it a shot because I also thought that it would be straightforward, but it's more difficult than expected. The biggest hurdle is that the index size is tightly bound to the maximum size of an array/slice, so a 64-bit-sized index couldn't directly be mapped to a slice. I think a better approach would be to support different backend implementations for the index type. E.g. I could imagine that an implementation with an SQLite or bbolt backend would be quite easy and would automatically support very large index files.
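The pluggable-backend idea above could be sketched roughly like this (all names here are hypothetical, not Hound's API): reads go through an interface keyed by a 64-bit offset, so an in-memory or mmap'd slice is just one implementation among others, and a SQLite/bbolt-backed one could serve indexes larger than a slice can hold.

```go
package main

import "fmt"

// IndexBackend abstracts index storage behind 64-bit offsets,
// instead of indexing directly into a single []byte.
type IndexBackend interface {
	// ReadAt returns n bytes starting at a 64-bit offset.
	ReadAt(off int64, n int) ([]byte, error)
	Size() int64
}

// memBackend keeps the whole index in one slice. This is the
// implementation that hits the array/slice size limit; a SQLite or
// bbolt backend would satisfy the same interface without it.
type memBackend struct{ data []byte }

func (m *memBackend) ReadAt(off int64, n int) ([]byte, error) {
	if off < 0 || off+int64(n) > int64(len(m.data)) {
		return nil, fmt.Errorf("read out of range: off=%d n=%d", off, n)
	}
	return m.data[off : off+int64(n)], nil
}

func (m *memBackend) Size() int64 { return int64(len(m.data)) }

func main() {
	var b IndexBackend = &memBackend{data: []byte("hello index")}
	chunk, err := b.ReadAt(6, 5)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", chunk) // prints "index"
}
```

The point of the interface is that callers never see a slice, so swapping in a backend with no 4 GiB (or 2 GiB `int`) ceiling requires no changes at the call sites.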
👋 Hound developers!
I am trying to index a pretty large repo (144GB - all current sources of openSUSE), and unsurprisingly the index turns out to be larger than 4GB, thus I hit this fatal message:
hound/codesearch/index/write.go, line 561 in e3b1b43
Would it be possible / how hard would it be to support larger indexes?
I only had a brief look at read.go, and it seems to me that 32 bit offsets are part of the index file format, so changing that would require re-indexing/converting/supporting two file formats, is that correct?

Thanks for all your efforts on Hound!