Vespa schema changes for query control & general quality of life #163

kdutia · 2024-12-18T11:10:27Z

Description

A batch of schema changes that make use of inheritance, input parameters and summaries. The aim is to give us more control over search without requiring further schema changes in future. The easiest way to review this PR is commit by commit.

Also:

sets the document language to English
adds text fields with a _bolding suffix, which return the bolded version of each field when a search is run
adds a new hybrid profile with nativeRank - Vespa's alternative to BM25 that gives a little more control and produces normalised scores
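
To illustrate what "more control without schema changes" could look like from the client side, here is a minimal sketch of a query request body that overrides a rank-profile input at query time. It is not the project's actual client code: the `build_query_body` helper is hypothetical, the schema name `family_document` is assumed from the file name, and the YQL is deliberately simplified. `description_closeness_weight`, the `hybrid` profile and `search_summary` are names that appear in this PR; `input.query(...)`, `ranking.profile` and `presentation.summary` are standard Vespa Query API parameters.

```python
import json


def build_query_body(user_query, description_closeness_weight=None,
                     summary="search_summary"):
    """Build a Vespa query request body (illustrative sketch only)."""
    body = {
        # Assumed schema name and a simplified YQL statement.
        "yql": "select * from family_document where userQuery()",
        "query": user_query,
        "ranking.profile": "hybrid",
        "presentation.summary": summary,
    }
    # Only send the input when the caller wants to override the default
    # declared in the rank profile's inputs.
    if description_closeness_weight is not None:
        body["input.query(description_closeness_weight)"] = (
            description_closeness_weight
        )
    return body


body = build_query_body("climate adaptation", description_closeness_weight=0.5)
print(json.dumps(body, indent=2))
```

When the input is omitted from the request, the default declared in the profile applies, so existing callers would not need to change.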

Sidenote: I tried to test this on the backend using the package that this PR's CI published to Test PyPI, but the Vespa dependency in it seemed to be broken. Not sure whether this is just me or it's actually broken 🤷

Proposed version

Please select the most relevant option from the list below. This
will be used to generate the next tag version name during auto-tagging.

Skip auto-tagging
Patch
Minor version
Major version

Visit the Semver website to understand the
difference between MAJOR, MINOR, and PATCH versions.

Notes:

If none of these options are selected, auto-tagging will fail (integrated soon)
Where multiple options are selected, the most senior option ticked will be
used -- e.g. Major > Minor > Patch
If you are selecting the version in the list above using the textbox, make
sure your selected option is marked [x] with no spaces in between the
brackets and the x

Type of change

Please select the option(s) below that are most relevant:

Bug fix
New feature
Breaking change

How Has This Been Tested?

Please describe the tests that you added to verify your changes.

Before submitting

I've read and followed all steps in the Making a pull request
section of the CONTRIBUTING docs.
I've updated or added any relevant docstrings following the syntax described in the
Writing docstrings section of the CONTRIBUTING docs.
If this PR fixes a bug, I've added a test that will fail without my fix.
If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

linear · 2024-12-18T11:10:31Z

SCI-155 last set of schema changes

olaughter

This looks good to me! Left some comments below. I'm looking forward to seeing how we get on with nativerank!

olaughter · 2024-12-18T16:20:04Z

tests/local_vespa/test_app/schemas/document_passage.sd

@@ -5,6 +5,15 @@ schema document_passage {
        stemming: none
    }

+    field language type string {
+        indexing: "en" | set_language


Looking at the docs for this:

The recommended use is to have one field in the document containing the language code, and that field should be the first field in the document, as it will only affect the fields defined after it in the schema.

https://docs.vespa.ai/en/reference/indexing-language-reference.html#set_language

It feels weird to me to have document-level config set on a field, but it is what it is! I'm wondering if we need to move the language field above text_block_not_stemmed?

olaughter · 2024-12-18T16:21:36Z

tests/local_vespa/test_app/schemas/document_passage.sd

-        summary text_block_page {}
-        summary text_block_coords {}
-        summary concepts {}
+    document-summary search_summary_with_tokens inherits search_summary {


Ohh this is nice!

olaughter · 2024-12-18T16:22:31Z

tests/local_vespa/test_app/schemas/family_document.sd

+    field language type string {
+        indexing: "en" | set_language
+    }
+


As on the document passage, wondering if this needs to be at the top of the doc

olaughter · 2024-12-18T16:24:56Z

tests/local_vespa/test_app/schemas/document_passage.sd

@@ -173,16 +173,6 @@ schema document_passage {
            tokens
        }
    }
-
-    rank-profile exact inherits default {


Nice cleaning this up now we use exact_not_stemmed instead 🎉

olaughter · 2024-12-18T16:26:38Z

tests/local_vespa/test_app/schemas/family_document.sd

-        }
-        function name_score() {
-            expression: attribute(name_weight) * bm25(family_name_index)
+            query(description_closeness_weight) double: 0.0


Getting my head around this, does setting this to 0.0 make it have no effect?
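
A small sketch of the arithmetic behind this question, assuming the profile combines its terms as a weighted sum (the full ranking expression is not shown in this diff excerpt, so the `hybrid_score` function and the other weight names here are hypothetical). Under that assumption, a weight of 0.0 multiplies the closeness term away, so it has no effect unless a query sends a non-zero value for the input.

```python
def hybrid_score(name_score, description_score, description_closeness,
                 name_weight=1.0, description_weight=1.0,
                 description_closeness_weight=0.0):
    """Weighted sum of ranking terms (assumed form, not the real expression)."""
    return (
        name_weight * name_score
        + description_weight * description_score
        + description_closeness_weight * description_closeness
    )


# With the default weight of 0.0 the closeness value is irrelevant.
low = hybrid_score(2.0, 3.0, description_closeness=0.1)
high = hybrid_score(2.0, 3.0, description_closeness=0.9)

# Overriding the weight at query time brings the term back in.
weighted = hybrid_score(2.0, 3.0, description_closeness=0.9,
                        description_closeness_weight=0.5)
```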

olaughter · 2024-12-18T16:32:14Z

tests/local_vespa/test_app/schemas/family_document.sd

        bolding: true
    }

    field family_description_bolding type string {
-        indexing: input family_description_index | index
+        indexing: input family_description_index | summary | index


Any reason to hang onto the index attribute for these when they come from an index field?

set document language to english

9048c5b

kdutia added 8 commits December 18, 2024 11:14

remove unused 'exact' rank-profile

9b8ea4a

add weights with defaults to hybrid profile

59084c5

hybrid_no_closeness schema inherits from hybrid

401105e

remove hybrid_no_description_embedding rank-profile

6a7c6b0

search_summary_with_tokens inherits search_summary

780bfa4

add all features to rank profile summary-features

7e2411d

add nativerank profiles

4f238b1

add field variants with bolding

7ab8260

kdutia changed the title ~~set of schema changes~~ Vespa schema changes for query control & general quality of life Dec 18, 2024

bump version to 1.12.0

d6e21d8

kdutia marked this pull request as ready for review December 18, 2024 12:17

kdutia requested a review from a team as a code owner December 18, 2024 12:17

kdutia marked this pull request as draft December 18, 2024 12:20

add summary to family_name_bolding & family_description_bolding

3aec5fc

kdutia marked this pull request as ready for review December 18, 2024 12:31

kdutia marked this pull request as draft December 18, 2024 13:10

olaughter reviewed Dec 18, 2024
