Multilingual Search with multilingual embeddings

This sample application demonstrates multilingual search using multilingual embeddings.

Quick start

The following is a quick start recipe for this application. Prerequisites:

Docker Desktop installed and running. At least 4 GB of memory available to Docker is recommended. Refer to Docker memory for details and troubleshooting.
Alternatively, deploy using Vespa Cloud.
Operating system: Linux, macOS or Windows 10 Pro (Docker requirement).
Architecture: x86_64 or arm64.
Homebrew to install the Vespa CLI, or download a Vespa CLI release from GitHub releases.

Validate the Docker resource settings. At least 4 GB of memory should be available:

$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"
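The check above can also be scripted. The following is a minimal sketch, assuming that `docker info` prints a line of the form `Total Memory: 7.667GiB`; the helper name `docker_total_memory_gib` is hypothetical and not part of this sample application.

```python
import re
from typing import Optional

# Binary unit multipliers, assuming Docker reports KiB/MiB/GiB/TiB suffixes.
_UNITS = {"KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def docker_total_memory_gib(docker_info_line: str) -> Optional[float]:
    """Parse a 'Total Memory: <value><unit>' line and return the size in GiB.

    Returns None when the line does not match the assumed format.
    """
    match = re.search(r"Total Memory:\s*([\d.]+)\s*(KiB|MiB|GiB|TiB)", docker_info_line)
    if match is None:
        return None
    value, unit = float(match.group(1)), match.group(2)
    return value * _UNITS[unit] / _UNITS["GiB"]


if __name__ == "__main__":
    memory = docker_total_memory_gib(" Total Memory: 7.667GiB")
    print("enough memory" if memory is not None and memory >= 4 else "increase Docker memory")
```

In practice, the input line would come from the `docker info | grep "Total Memory"` command shown above.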

Install Vespa CLI:

$ brew install vespa-cli

For local deployment using the Docker image, set the CLI target:

$ vespa config set target local

Pull and start the Vespa Docker container image:

$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
  --publish 8080:8080 --publish 19071:19071 \
  vespaengine/vespa

Verify that the configuration service (deploy API) is ready:

$ vespa status deploy --wait 300
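Conceptually, waiting for the configuration service means polling it until it answers or a timeout expires. The following is a rough sketch of that idea, not the Vespa CLI's implementation; the function names and the health URL `http://localhost:19071/state/v1/health` are assumptions.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 300.0,
                     interval_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call probe() repeatedly until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or reset while the container is still starting.
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


def config_server_is_up(url: str = "http://localhost:19071/state/v1/health") -> bool:
    """Return True on HTTP 200. The URL is an assumed health endpoint."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status == 200


if __name__ == "__main__":
    print("ready" if wait_until_ready(config_server_is_up, timeout_s=300) else "timed out")
```

The probe is passed in as a function so that the retry loop can be exercised without a running container.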

Download this sample application:

$ vespa clone multilingual-search my-app && cd my-app

The embedder configuration in this sample app's services.xml points to a quantized model.

Alternatively, export your own model. See also the export script in simple-semantic-search.

Deploy the application:

$ vespa deploy --wait 300

Deployment note

It is possible to deploy this app to Vespa Cloud.

Evaluation

The following reproduces the results reported on the MIRACL Swahili (sw) dataset.

Install trec_eval:

$ git clone --depth 1 --branch v9.0.8 https://github.com/usnistgov/trec_eval && cd trec_eval && make install && cd ..

Index the dataset. This step also embeds the texts and is compute intensive. On an M1 laptop, it takes about 1052 seconds (125 operations/s).

$ zstdcat ext/sw-feed.jsonl.zst | vespa feed -
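The two figures quoted for the feed step imply the approximate number of feed operations, and the same arithmetic can be used to estimate feed time at a different throughput. A small sketch follows; the function names are illustrative, and the 60 operations/s figure is a made-up example rather than a measurement.

```python
def estimated_operations(seconds: float, ops_per_second: float) -> int:
    """Approximate number of feed operations from duration and throughput."""
    return round(seconds * ops_per_second)


def estimated_feed_seconds(operations: int, ops_per_second: float) -> float:
    """Approximate feed duration for a given throughput."""
    return operations / ops_per_second


if __name__ == "__main__":
    operations = estimated_operations(1052, 125)
    print(operations)  # 131500, derived from the rounded figures above
    # Hypothetical slower machine that embeds at 60 operations/s.
    print(round(estimated_feed_seconds(operations, 60) / 60, 1), "minutes")
```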

The evaluation script queries Vespa (requires pandas and requests libraries):

$ pip3 install pandas requests

E5 multilingual embedding model

Run the evaluation using the multilingual embedding model:

$ python3 ext/evaluate.py --endpoint http://localhost:8080/search/ \
 --query_file ext/topics.miracl-v1.0-sw-dev.tsv \
 --ranking semantic --hits 100 --language sw
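trec_eval reads results in the standard TREC run format, one line per hit: query id, the literal `Q0`, document id, rank, score and a run tag. The sketch below shows how ranked hits could be turned into such lines. It is an illustration under that assumption and not the code in ext/evaluate.py; the function name and the sample ids are hypothetical.

```python
from typing import Iterable, List, Tuple


def format_trec_run(query_id: str, hits: Iterable[Tuple[str, float]],
                    run_tag: str) -> List[str]:
    """Format (doc_id, score) pairs as TREC run lines, best score first."""
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical query id, document ids and scores.
    for line in format_trec_run("7", [("doc-b", 0.5), ("doc-a", 0.9)], "semantic"):
        print(line)
```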

Compute NDCG@10 using trec_eval with the dev relevance judgments:

$ trec_eval -mndcg_cut.10 ext/qrels.miracl-v1.0-sw-dev.tsv semantic.run

This should produce the following:

ndcg_cut_10           	all 	0.6848
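For readers unfamiliar with the metric, NDCG@10 compares the discounted gain of the top 10 returned documents with that of an ideal ordering of the judged documents. The following is a simplified single-query sketch, assuming linear gain and a log2 rank discount; trec_eval remains the reference implementation and averages the score over all queries.

```python
import math
from typing import Dict, List, Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains))


def ndcg_at_k(ranked_doc_ids: List[str], judgments: Dict[str, int], k: int = 10) -> float:
    """NDCG@k for one query; unjudged documents count as non-relevant."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted((g for g in judgments.values() if g > 0), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical judgments: documents "a" and "b" are relevant.
    print(ndcg_at_k(["a", "b", "x"], {"a": 1, "b": 1}))  # perfect ranking
    print(ndcg_at_k(["x", "a"], {"a": 1, "b": 1}))       # one relevant hit at rank 2
```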

BM25

Using traditional keyword search with BM25 ranking:

$ python3 ext/evaluate.py --endpoint http://localhost:8080/search/ \
 --query_file ext/topics.miracl-v1.0-sw-dev.tsv \
 --ranking bm25 --hits 100 --language sw

Compute NDCG@10 using trec_eval with the same relevance judgments:

$ trec_eval -mndcg_cut.10 ext/qrels.miracl-v1.0-sw-dev.tsv bm25.run

This should produce the following:

ndcg_cut_10           	all 	0.424

Cleanup

Tear down the running container:

$ docker rm -f vespa
