Full-text search #40

Closed
joepio opened this issue Nov 21, 2020 · 11 comments

@joepio
Member

joepio commented Nov 21, 2020

Being able to search through data inside your personal atomic server seems like a nice feature to have. In this issue, I'd like to explore the requirements and some possible approaches for implementing a full text search service.

Wishes

  • Find specific resources and their URLs extremely fast. This enables use in things like semantic document editors, where instead of simply typing a word, we insert the specific URL of some thing. For that to work, items that are used often should come out on top, so we need a ranking algorithm that uses historical searches and selections. We also need some form of autocompletion.
  • Be able to either ignore or include types of resources. Perhaps I'm looking for a specific person, and I don't want to see all documents with that person's name inside them.
  • A person's first name in a "firstName" property should weigh more than that string in a "description". Perhaps make character count diminish the relevance score.
  • Incremental indexing on Commits / when resources are added to the store.
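
To make the weighting and filtering wishes above more concrete, here is a minimal sketch using only the Rust standard library. Everything in it is an assumption for illustration: the `Resource` struct, the weights in `property_weight`, and the choice to divide by the square root of the value's length are not taken from atomic-server or any search library.

```rust
use std::collections::HashMap;

/// A resource reduced to what this sketch needs: its URL (subject), its class,
/// and a map from property name to string value. Hypothetical shape.
pub struct Resource {
    pub subject: String,
    pub class: String,
    pub props: HashMap<String, String>,
}

/// Hypothetical per-property weights; properties not listed get 1.0.
fn property_weight(property: &str) -> f64 {
    match property {
        "firstName" | "name" => 5.0,
        _ => 1.0,
    }
}

/// Score one resource against a search term (case-insensitive substring match).
/// Each matching property contributes weight / sqrt(length in chars), so a hit
/// in a short field counts for more than the same hit in a long text.
pub fn score(resource: &Resource, term: &str) -> f64 {
    let term = term.to_lowercase();
    resource
        .props
        .iter()
        .filter(|(_, value)| value.to_lowercase().contains(&term))
        .map(|(prop, value)| {
            let len = value.chars().count().max(1) as f64;
            property_weight(prop) / len.sqrt()
        })
        .sum()
}

/// Return the subjects of matching resources, best score first.
/// `class` optionally restricts the results to one type of resource.
pub fn search(resources: &[Resource], term: &str, class: Option<&str>) -> Vec<String> {
    let mut hits: Vec<(f64, &str)> = resources
        .iter()
        .filter(|r| class.map_or(true, |c| r.class == c))
        .map(|r| (score(r, term), r.subject.as_str()))
        .filter(|(s, _)| *s > 0.0)
        .collect();
    hits.sort_by(|a, b| b.0.partial_cmp(&a.0).unwrap_or(std::cmp::Ordering::Equal));
    hits.into_iter().map(|(_, subject)| subject.to_string()).collect()
}
```

With these made-up weights, a person whose "firstName" is "Alice" ranks above a document that merely mentions Alice in a long "description", and passing `Some("Person")` drops the document entirely. Ranking by historical selections and autocompletion are not covered here.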

Approaches / implementation ideas

  • The API itself should use atomic data, too. The Collections model is probably useful here, since it introduces paginated content.
  • The Sonic crate offers performant search features, and simply returns a URL. The developer is planning on making it embeddable.
  • It would be nice if it works as an installable plugin. This would require that plugins function as some sort of middleware handler, offer a custom endpoint, and be able to write and access data.
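
As a rough illustration of the plugin idea above, the sketch below shows one way a search plugin could expose an endpoint, react to commits, and return paginated results. It is not atomic-server's plugin API: the `Plugin` trait, the `Store` alias, the `Page` struct, and the "/search" path are all hypothetical names made up for this example, and the matching is a naive substring scan.

```rust
use std::collections::{BTreeMap, HashMap};

/// Minimal stand-in for the server's store: subject URL -> text. Hypothetical.
pub type Store = HashMap<String, String>;

/// One page of results, loosely modelled on a paginated collection.
pub struct Page {
    pub total: usize,
    pub members: Vec<String>,
}

/// Hypothetical plugin interface: a plugin names the endpoint it serves,
/// is notified of commits, and answers queries with a page of subject URLs.
pub trait Plugin {
    fn endpoint(&self) -> &str;
    fn on_commit(&mut self, store: &Store, subject: &str);
    fn handle(&self, query: &str, page: usize, page_size: usize) -> Page;
}

pub struct SearchPlugin {
    /// Lowercased text per subject, kept sorted so paging is stable.
    texts: BTreeMap<String, String>,
}

impl SearchPlugin {
    pub fn new() -> Self {
        SearchPlugin { texts: BTreeMap::new() }
    }
}

impl Plugin for SearchPlugin {
    fn endpoint(&self) -> &str {
        "/search"
    }

    /// Re-read the committed subject from the store; drop it if it was deleted.
    fn on_commit(&mut self, store: &Store, subject: &str) {
        match store.get(subject) {
            Some(text) => {
                self.texts.insert(subject.to_string(), text.to_lowercase());
            }
            None => {
                self.texts.remove(subject);
            }
        }
    }

    /// Zero-based paging over all subjects whose text contains the query.
    fn handle(&self, query: &str, page: usize, page_size: usize) -> Page {
        let q = query.to_lowercase();
        let hits: Vec<&String> = self
            .texts
            .iter()
            .filter(|(_, text)| text.contains(&q))
            .map(|(subject, _)| subject)
            .collect();
        let members = hits
            .iter()
            .skip(page * page_size)
            .take(page_size)
            .map(|s| s.to_string())
            .collect();
        Page { total: hits.len(), members }
    }
}
```

A server could keep a list of such plugins, call `on_commit` for each after applying a commit, and route requests whose path equals `endpoint()` to `handle`.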
joepio added the "plugin" label (Should probably be an Atomic Plugin) Jan 8, 2021
@AlexMikhalev
Collaborator

@joepio maybe https://endler.dev/2019/tinysearch/ is a better approach?
What is the type of data we are searching for?

@joepio
Member Author

joepio commented Nov 9, 2021

Thanks for sharing, @AlexMikhalev! I think tinysearch might be a bit too static for this use case: the index should be updated very frequently (every time a user creates a commit).

I'm currently considering Tantivy, a Lucene alternative, as it can be fully embedded and seems really fast.

@jonassmedegaard

Qualities that I appreciate in web-based search engines are a) support for offline search (which also means fewer round trips and less tracking while online), and b) support for OpenSearch.

For a) I generally like lunr - and seem to recall that there's a crate integrating that.

@jonassmedegaard

Yes, not the original Lunr but a derivative: https://crates.io/crates/elasticlunr-rs

@joepio
Member Author

joepio commented Nov 10, 2021

That one's also statically indexed, which is not a good option for atomic-server, I'm afraid. I need incremental indexing, so new commits are also searchable as users update stuff.
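
To illustrate what incremental indexing means here, below is a minimal std-only sketch of an inverted index that is updated one resource at a time instead of being rebuilt. It is not how Tantivy, Sonic, or atomic-server work internally; `IncrementalIndex`, `apply_commit`, and the whitespace/punctuation tokenizer are assumptions made for the example.

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Default)]
pub struct IncrementalIndex {
    /// term -> subjects whose text contains that term
    postings: HashMap<String, BTreeSet<String>>,
    /// subject -> terms currently indexed for it, so updates can drop stale postings
    terms_by_subject: HashMap<String, BTreeSet<String>>,
}

/// Split on anything that is not alphanumeric and lowercase the pieces.
fn tokenize(text: &str) -> BTreeSet<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

impl IncrementalIndex {
    pub fn new() -> Self {
        Self::default()
    }

    /// Apply one commit: replace whatever was indexed for `subject` with `text`.
    /// Only this subject's postings are touched; the rest of the index is untouched.
    pub fn apply_commit(&mut self, subject: &str, text: &str) {
        self.remove(subject);
        let terms = tokenize(text);
        for term in &terms {
            self.postings
                .entry(term.clone())
                .or_default()
                .insert(subject.to_string());
        }
        self.terms_by_subject.insert(subject.to_string(), terms);
    }

    /// Remove a subject and clean up postings lists that become empty.
    pub fn remove(&mut self, subject: &str) {
        if let Some(old_terms) = self.terms_by_subject.remove(subject) {
            for term in old_terms {
                let now_empty = match self.postings.get_mut(&term) {
                    Some(subjects) => {
                        subjects.remove(subject);
                        subjects.is_empty()
                    }
                    None => false,
                };
                if now_empty {
                    self.postings.remove(&term);
                }
            }
        }
    }

    /// Exact, case-insensitive single-term lookup; subjects come back sorted.
    pub fn search(&self, term: &str) -> Vec<String> {
        self.postings
            .get(&term.to_lowercase())
            .map(|subjects| subjects.iter().cloned().collect())
            .unwrap_or_default()
    }
}
```

The cost of `apply_commit` depends on the size of the changed resource rather than the size of the whole store, which is the property a statically built index lacks.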

@AlexMikhalev
Collaborator

Sonic also statically rebuilds its FST trees. It doesn't support incremental indexing either, but Sonic hides that from the user.

@joepio
Member Author

joepio commented Nov 10, 2021

Hmm, interesting. Maybe if indexing is fast enough (takes under 60 seconds for 1 GB of atomic data), we could rebuild at most once every minute.
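
The "rebuild at most once per minute" idea could be sketched as a small throttle that tracks whether anything changed and when the last rebuild ran. This is only an illustration under assumed names (`RebuildThrottle`, `mark_dirty`, `should_rebuild`); the caller passes in the current time so the logic is easy to test, and the actual rebuild is left out.

```rust
use std::time::{Duration, Instant};

/// Decides when a full index rebuild may run. Hypothetical helper.
pub struct RebuildThrottle {
    min_interval: Duration,
    last_rebuild: Option<Instant>,
    dirty: bool,
}

impl RebuildThrottle {
    pub fn new(min_interval: Duration) -> Self {
        RebuildThrottle { min_interval, last_rebuild: None, dirty: false }
    }

    /// Call this for every commit: the index is now out of date.
    pub fn mark_dirty(&mut self) {
        self.dirty = true;
    }

    /// Returns true if a rebuild should start at `now`, and records it.
    /// A rebuild is due only when something changed and the previous
    /// rebuild started at least `min_interval` ago (or never ran).
    pub fn should_rebuild(&mut self, now: Instant) -> bool {
        if !self.dirty {
            return false;
        }
        let due = match self.last_rebuild {
            None => true,
            Some(last) => now.saturating_duration_since(last) >= self.min_interval,
        };
        if due {
            self.dirty = false;
            self.last_rebuild = Some(now);
        }
        due
    }
}
```

A background loop could call `should_rebuild(Instant::now())` every few seconds and kick off the rebuild when it returns true. With this approach a new commit can take up to the interval plus the rebuild time before it becomes searchable.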

@AlexMikhalev
Collaborator

> qualities that I appreciate in web-based search engines are a) support for offline search (which also means less round-trips and less tracking while online), and b) support for OpenSearch
>
> For a) I generally like lunr - and seem to recall that there's a crate integrating that.

Do you see atomic server deployed on a server or "at the edge", on the user's device? If it's on the user's device, it's already "offline first".

joepio added this to the v1.0.0 milestone Nov 10, 2021
@jonassmedegaard

  • on an untrusted system (e.g. a remote server) I would prefer efficient network traffic and an option to do client-side search (i.e. hand me a lookup table and JavaScript code to traverse it, allowing me to "reveal" only the hits I want to get, not the queries leading me to the hits)
  • on a separate trusted system (e.g. a local server) I might prefer efficient network traffic or efficient processing, depending on network bandwidth versus capacity of the server (might be a corporate rack server or a 32-bit ARM device)
  • directly on a user-facing system (e.g. a laptop or a phone) I would prefer efficient processing (think battery-powered) and direct APIs (no JavaScript or web browsing, but more direct socket-based connections)

joepio added a commit that referenced this issue Nov 10, 2021
@joepio
Member Author

joepio commented Nov 10, 2021

I made quite a bit of progress in one night, and it's actually working quite well! I'm really amazed that it took so little time to implement. Hats off to Tantivy's authors for creating this awesome library.

[Image: Screenshot 2021-11-10 at 22 25 55]

joepio added a commit that referenced this issue Nov 11, 2021
joepio added a commit that referenced this issue Nov 12, 2021
joepio added a commit that referenced this issue Nov 13, 2021
@joepio
Member Author

joepio commented Nov 13, 2021

Alrighty, tests are passing, time to release a new version! It's pretty decent, but there's still room for improvement. See #210 for some of the things that can be improved upon.
