Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a config property/env variable specific index field #117

Merged

Conversation

marko-bekhta
Copy link
Collaborator

I was looking into Max's comment quarkusio/quarkusio.github.io#1825 (comment) and I've tried a few things like shingle filter and some other stuff..
The idea I've ended up with is that we add a config property/env variable specific field. Using a regex tokenizer we'll put only tokens that are config properties or env variables. Then, while searching, we'd use a keyword tokenizer on this particular field, which would mean if we find a match there we can boost it with > 1 boost since the search term seems to be looking suspiciously similar to a config property.

@yrodiere
Copy link
Member

yrodiere commented Jan 8, 2024

I need to have a closer look, but first: why don't we currently have a match exactly? I'd expect quarkus.websocket.max-frame-size to match quarkus.websocket.max-frame-size even if it's tokenized at some point... There's something odd going on, no?

@marko-bekhta
Copy link
Collaborator Author

marko-bekhta commented Jan 8, 2024

hey, yes sure.
So the original problem (the odd part happening) is the content extraction part, basically the changes from the second commit are to solve that specific issue:

-            Element content = body.selectFirst(".content .grid__item");
+            Element content = body.selectFirst(".guide");

if you go to a quarkus page https://quarkus.io/guides/all-config and then try running a selection query (that is currently used) from a dev tools console document.querySelector('.content .grid__item') you'd get a

Quarkus is open. All dependencies of this project are available under the [Apache Software License 2.0](https://www.apache.org/licenses/LICENSE-2.0) or compatible license.

section from the footer ... so it just didn't index anything from the page, hence no results.
then the other commit with a regex tokenizer and everything else ... I was just thinking that usually those env variables or config properties are long and people may be searching only by part of them so that is what I was trying to cover there 😃

since we won't be searching for something that is < 2 symbols there's no need to generate that 1 gram token
@marko-bekhta marko-bekhta force-pushed the i103-searching-for-config-properties branch from 9b08c6a to 5edfff7 Compare January 8, 2024 10:25
@yrodiere
Copy link
Member

yrodiere commented Jan 8, 2024

I see, thanks for the explanation. I'd have expected just raising the max_gram on the existing autocomplete analyzer to be enough, but if not, and if this does not make indexing outrageously slow... +1 let's do this.

I'll try this locally and then will merge.

@marko-bekhta
Copy link
Collaborator Author

marko-bekhta commented Jan 8, 2024

yeah, give it a try and see what you think 😃. I wasn't sure about increasing the max on the current analyzer, since it would probably breakup more words than just properties... but then I didn't really check how much relly loooooong words we have in the guides 😃 . And then there's also the fact that a config property would be tokenized and not treated as a single keyword... I was thinking: if I'm searching for a config and start with something like quarkus.foo-bar then I probably would want a result where a property is present to show up higher than a doc where we just have foo and some text then bar in Quarkus

*EDIT: oh and because of that regex tokenizer it should only include the env variables and config properties, no other text

yrodiere added a commit to yrodiere/search.quarkus.io that referenced this pull request Jan 8, 2024
The last change to that size was supposed to set it to 512MiB
but mistakenly set it to 512GiB... which is way too much.

Currently we only use ~500MB per index, and this will rise to ~1GB per
index after quarkusio#117, so 5GiB should be more than enough and leave room for
changes to analysis config and indexing of more content in the
foreseable future.
Copy link
Member

@yrodiere yrodiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so this almost doubles the size of indexes (from 500MB to 1GB), which is kind of odd considering the new field supposedly only indexes config properties...

But I guess that's okay, since we have more than enough room to accommodate these relatively still small indexes (see #119 :x )

Will merge, thanks!

@yrodiere yrodiere merged commit 7730eb6 into quarkusio:main Jan 8, 2024
1 check passed
yrodiere added a commit to yrodiere/search.quarkus.io that referenced this pull request Jan 8, 2024
The last change to that size was supposed to set it to 512MiB
but mistakenly set it to 512GiB... which is way too much.

Currently we only use ~500MB per index, and this will rise to ~1GB per
index after quarkusio#117, so 5GiB should be more than enough and leave room for
changes to analysis config and indexing of more content in the
foreseable future.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants