Vespa schema changes for query control & general quality of life #163

Draft
wants to merge 11 commits into base: main

@kdutia (Member) commented Dec 18, 2024

Description

A batch of schema changes that make use of inheritance, input parameters and summaries. The aim is to give us more control over search without requiring schema changes in future. The easiest way to review this PR is commit by commit.

Also:

  • sets the document language to English
  • adds text fields with a _bolding suffix, which return the bolded version of each field when a search is run
  • adds a new hybrid profile using nativeRank, Vespa's alternative to BM25 that gives a little more control and produces normalised scores
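
For readers unfamiliar with nativeRank, here is a minimal sketch of what a hybrid profile built on it could look like. It is not the profile from this PR: the profile name, the field names text_block and text_embedding, and the plain sum of the two terms are all assumptions made for illustration.

```
# Hypothetical profile: names and fields are illustrative, not taken from this PR.
rank-profile hybrid_nativerank_sketch inherits default {
    first-phase {
        # nativeRank is a normalised text-match score and closeness is
        # 1 / (1 + distance), so both terms are on a comparable scale.
        # closeness only contributes when the query uses nearestNeighbor
        # against the (assumed) text_embedding tensor field.
        expression: nativeRank(text_block) + closeness(field, text_embedding)
    }
}
```

With bm25 in place of nativeRank, the lexical term is unbounded, which is why a hybrid sum usually needs extra scaling in that case.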

Sidenote: I tried to test this on the backend using the Test PyPI package published by this PR's CI, but its Vespa dependency seemed to be broken. Not sure whether that's just me or whether it's actually broken 🤷

Proposed version

Please select the most relevant option from the list below. This
will be used to generate the next tag version name during auto-tagging.

  • Skip auto-tagging
  • Patch
  • Minor version
  • Major version

Visit the Semver website to understand the
difference between MAJOR, MINOR, and PATCH versions.

Notes:

  • If none of these options are selected, auto-tagging will fail (integrated soon)
  • Where multiple options are selected, the most senior option ticked will be
    used -- e.g. Major > Minor > Patch
  • If you are selecting the version in the list above using the textbox, make
    sure your selected option is marked [x] with no spaces in between the
    brackets and the x

Type of change

Please select the option(s) below that are most relevant:

  • Bug fix
  • New feature
  • Breaking change

How Has This Been Tested?

Please describe the tests that you added to verify your changes.

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.


@kdutia changed the title from "set of schema changes" to "Vespa schema changes for query control & general quality of life" Dec 18, 2024
@kdutia marked this pull request as ready for review December 18, 2024 12:17
@kdutia requested a review from a team as a code owner December 18, 2024 12:17
@kdutia marked this pull request as draft December 18, 2024 12:20
@kdutia marked this pull request as ready for review December 18, 2024 12:31
@kdutia marked this pull request as draft December 18, 2024 13:10
@olaughter (Contributor) left a comment

This looks good to me! Left some comments below. I'm looking forward to seeing how we get on with nativerank!

@@ -5,6 +5,15 @@ schema document_passage {
stemming: none
}

field language type string {
indexing: "en" | set_language

Looking at the docs for this:

"The recommended use is to have one field in the document containing the language code, and that field should be the first field in the document, as it will only affect the fields defined after it in the schema."

https://docs.vespa.ai/en/reference/indexing-language-reference.html#set_language

It feels odd to configure a document-level setting on a field, but it is what it is! Do we need to move the language field above text_block_not_stemmed?
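
A sketch of the ordering that the quoted docs passage implies. Only the relative order of the two fields matters here; the indexing pipeline shown for text_block_not_stemmed (including the `input text_block` source) is an assumption, not copied from this PR.

```
# Declared first, so the language applies to every field defined after it.
field language type string {
    indexing: "en" | set_language
}

# Assumed definition: only its position after `language` is the point.
field text_block_not_stemmed type string {
    indexing: input text_block | summary | index
    stemming: none
}
```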

summary text_block_page {}
summary text_block_coords {}
summary concepts {}
document-summary search_summary_with_tokens inherits search_summary {

Ohh this is nice!

Comment on lines +23 to +26
field language type string {
indexing: "en" | set_language
}


As with document_passage, I'm wondering whether this needs to be at the top of the document.

@@ -173,16 +173,6 @@ schema document_passage {
tokens
}
}

rank-profile exact inherits default {

Nice to clean this up now that we use exact_not_stemmed instead 🎉

}
function name_score() {
expression: attribute(name_weight) * bm25(family_name_index)
query(description_closeness_weight) double: 0.0

Getting my head around this: does defaulting this to 0.0 mean it has no effect?
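
To make the question concrete, here is a minimal sketch of how a query input with a default typically behaves, assuming the weight is only ever used as a multiplier. The profile name, the description_score function and the family_description_embedding field are hypothetical; only query(description_closeness_weight), attribute(name_weight) and bm25(family_name_index) appear in this PR.

```
# Hypothetical profile illustrating a query input with a default value.
rank-profile input_default_sketch inherits default {
    inputs {
        # Used whenever the query does not supply its own value.
        query(description_closeness_weight) double: 0.0
    }
    function name_score() {
        expression: attribute(name_weight) * bm25(family_name_index)
    }
    function description_score() {
        # With the default of 0.0 this term is always 0, i.e. switched off.
        expression: query(description_closeness_weight) * closeness(field, family_description_embedding)
    }
    first-phase {
        expression: name_score() + description_score()
    }
}
```

Under those assumptions the term stays off until a request overrides the input, for example by passing input.query(description_closeness_weight)=0.5 as a query parameter, which is what lets the weighting change without a schema redeploy.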

bolding: true
}

field family_description_bolding type string {
- indexing: input family_description_index | index
+ indexing: input family_description_index | summary | index

Any reason to keep index in the indexing statement for these when they come from a field that is already indexed?


2 participants