Direct intersections with serialized RoaringBitmaps #281

Merged
merged 12 commits into from
Jun 8, 2024

Conversation

Kerollmops
Member

@Kerollmops Kerollmops commented Jun 5, 2024

This PR is related to #263 and implements methods that compute intersections directly against serialized data. This avoids deserializing a whole bitmap into memory only to shrink it right afterwards, which drastically reduces both the amount of allocated memory and the number of expensive intersection operations.
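To illustrate the idea, here is a minimal sketch using only the standard library. It is not the roaring-rs implementation and does not use the real Roaring serialization format: the toy layout, the `Containers` alias, and the `serialize` and `intersection_with_serialized` functions below are all assumptions made for this example. The point it shows is that containers whose key is absent from the in-memory side can be skipped with a seek, so their payload is never read or allocated.

```rust
use std::collections::BTreeMap;
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical in-memory stand-in for a bitmap:
/// container key (high 16 bits) -> sorted low 16 bits.
type Containers = BTreeMap<u16, Vec<u16>>;

/// Serializes containers into a toy layout (NOT the real Roaring format):
/// [u32 container count], then per container [u16 key][u32 len][len * u16 values].
fn serialize(containers: &Containers) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(containers.len() as u32).to_le_bytes());
    for (key, values) in containers {
        out.extend_from_slice(&key.to_le_bytes());
        out.extend_from_slice(&(values.len() as u32).to_le_bytes());
        for v in values {
            out.extend_from_slice(&v.to_le_bytes());
        }
    }
    out
}

fn read_u16<R: Read>(r: &mut R) -> io::Result<u16> {
    let mut buf = [0u8; 2];
    r.read_exact(&mut buf)?;
    Ok(u16::from_le_bytes(buf))
}

fn read_u32<R: Read>(r: &mut R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}

/// Intersects `lhs` with serialized data, decoding only the containers whose
/// key also exists in `lhs`. Returns the intersection together with the number
/// of containers that were actually decoded.
fn intersection_with_serialized<R: Read + Seek>(
    lhs: &Containers,
    mut rhs: R,
) -> io::Result<(Containers, usize)> {
    let mut result = Containers::new();
    let mut decoded = 0;
    let count = read_u32(&mut rhs)?;
    for _ in 0..count {
        let key = read_u16(&mut rhs)?;
        let len = read_u32(&mut rhs)?;
        let left = match lhs.get(&key) {
            Some(left) => left,
            None => {
                // No matching key: jump over the payload without reading it.
                rhs.seek(SeekFrom::Current(i64::from(len) * 2))?;
                continue;
            }
        };
        decoded += 1;
        let mut common = Vec::new();
        for _ in 0..len {
            let v = read_u16(&mut rhs)?;
            // `left` is sorted, so a binary search is enough to test membership.
            if left.binary_search(&v).is_ok() {
                common.push(v);
            }
        }
        if !common.is_empty() {
            result.insert(key, common);
        }
    }
    Ok((result, decoded))
}
```

Wrapping the serialized bytes in a `std::io::Cursor` is enough to call the function; any `Read + Seek` source would work the same way in this sketch.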

To Do

  • Improve function documentation.
  • Use the containers offsets when available.
  • Convert operation methods to use the BitAndAssign/BitAnd traits (too much work for now, let's ship)

@Kerollmops Kerollmops force-pushed the intersection-with-serialized branch from 0a11dd0 to 88b848b Compare June 6, 2024 14:41
@Kerollmops Kerollmops marked this pull request as ready for review June 7, 2024 22:12
@Kerollmops Kerollmops merged commit 6391a97 into main Jun 8, 2024
4 checks passed
@Kerollmops Kerollmops deleted the intersection-with-serialized branch June 8, 2024 01:46
meili-bors bot added a commit to meilisearch/meilisearch that referenced this pull request Jun 11, 2024
4682: Speed Up Filter ANDs operations r=Kerollmops a=Kerollmops

This PR fixes #4659 and improves the way we do AND operations by using the latest [RoaringBitmap feature to do intersections with serialized bitmaps](RoaringBitmap/roaring-rs#281). Doing so drastically reduces the time spent reading and copying bytes in memory when only a subset of the bitmap's containers ends up being used and kept.

### Some Example Results

With a 45M-document dataset running on a good NVMe drive, this example filter took 77ms before this PR and takes only 13ms with it (6x speedup):

```sql
artist = 'The Beatles' AND (duration 150 TO 500 OR duration NOT EXISTS) AND genres IN [Rock, 'Rock and Roll'] AND rating > 4 AND released_year 1960 TO 1990
```

By reordering the filter AND clauses we can reach a constant 8ms execution time. However, note that this reordering is a manual operation. By comparison, the previous filter pipeline stays at a constant 45ms execution time with this reordered filter (6x speedup).

```sql
artist = 'The Beatles' AND genres IN [Rock, 'Rock and Roll'] AND released_year 1960 TO 1990 AND (duration 150 TO 500 OR duration NOT EXISTS)
```

### To Do
- [x] Rebase on `release-v1.9.0`.
- [ ] ~Skip branches of the facet/filter tree when nothing is in common with the universe~ slower this way.
- [x] When the universe is required use the universe given in parameter if possible.

Co-authored-by: Clément Renault <[email protected]>