Documentation

Raphael Waltenspül edited this page Jan 7, 2025 · 32 revisions

Here, we document the inner workings of vitrivr-engine, introduce the concepts it employs, and aim to provide a good overview of its components.

Terminology

This chapter introduces common terminology.

Introduction

In content-based multimedia retrieval, the aim is to search within multimedia collections (e.g. video, image, audio, 3D objects) on a content, hence semantic, level. This is a non-trivial problem due to the so-called semantic gap: the stark difference in semantic understanding of content between humans and machines. Recent developments in foundation models have narrowed this gap, yet various techniques are still needed to search efficiently within large collections of multimedia data.

Ingestion / Offline Phase

In (multimedia) retrieval, a common distinction is made between two phases. The first is the ingestion phase (also known as the offline phase), during which the multimedia content is analysed and representations of the content are stored in an efficient way for later use.

Retrieval / Online Phase

The retrieval phase (also known as the online phase) describes actions performed after ingestion: (user) queries to the system are analysed in the same manner as the multimedia data was, and query and content are compared on the basis of those representations. The outcome is usually a list of results, each with an accompanying similarity score, which indicates how similar the result is to the query. Commonly, a similarity score of 1 represents identity, while a similarity score of 0 indicates the greatest dissimilarity.

Feature

In multimedia retrieval, a feature is a means of representing the multimedia content.

Toy Example

A very primitive feature is the average colour: Given an image (either a still image or a frame from a video), one calculates the average colour by averaging the individual pixels' RGB values. While this is not very expressive on its own, it demonstrates how features work.

During ingestion, the average colour is calculated for all the input data (again, this could be for example a bunch of images or a couple of representative frames from a video) and stored in the database as three-element vectors (R,G,B).

During retrieval time, the query consists of a single three-element vector (R,G,B) and a Nearest Neighbour Search (NNS) is performed on those average colour vectors. The distance then is converted to a similarity score s on the interval $$s \in [0,1]$$ for all items in the database.
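The toy example can be sketched in a few lines of Python. This is an illustration only, not vitrivr-engine code: the function names, the tiny made-up "images", and in particular the linear mapping from Euclidean distance to a similarity score are assumptions chosen for the example.

```python
import math

def average_colour(pixels):
    """Average the RGB values of a sequence of (r, g, b) pixels."""
    pixels = list(pixels)
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

# Largest possible Euclidean distance between two RGB vectors (black vs. white).
MAX_DISTANCE = math.sqrt(3 * 255 ** 2)

def similarity(a, b):
    """Map the Euclidean distance to [0, 1]; 1 means identical (assumed mapping)."""
    return 1.0 - math.dist(a, b) / MAX_DISTANCE

def nearest_neighbours(query, database, k=3):
    """Score every stored vector against the query and return the k best."""
    scored = [(item_id, similarity(query, vec)) for item_id, vec in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Ingestion: one average-colour vector per (hypothetical) image.
database = {
    "red-ish": average_colour([(255, 0, 0), (200, 20, 20)]),
    "blue-ish": average_colour([(0, 0, 255), (10, 10, 220)]),
}

# Retrieval: the query is itself a three-element (R, G, B) vector.
results = nearest_neighbours((250, 5, 5), database, k=2)
```

A real system would not scan every vector linearly; the sketch only shows how ingestion, query and score relate.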

Further Reading

  • Basics: Wikipedia
  • Research: vitrivr
  • Book: Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval, ACM Press Books, 1999 (1st edition), 2011 (2nd edition)

There are a lot of (research) publications out there which cover (multimedia) retrieval in great detail.

Data Model vitrivr-engine

vitrivr-engine's data model is based on almost a decade of research in multimedia retrieval. Influenced by its predecessor, the retrieval engine Cineast, the aim of the data model is to be as flexible as possible while still providing foundational guidelines for consumers of vitrivr-engine.

Retrievable

In vitrivr-engine, a retrievable is the unit of retrieval and the logical representation of multimedia data. Depending on the type of multimedia, one (e.g. image) or more (e.g. video) retrievables exist.

For an image file, a single retrievable of the type SOURCE:IMAGE is created. For a video file, a single retrievable of the type SOURCE:VIDEO is created, plus a number of retrievables of the type SEGMENT, depending on the segmentation strategy. With a 30s video and a fixed segment length of 1s, 31 retrievables result: one per second plus the one for the file. The one-second-segment retrievables have a partOf relationship towards the source retrievable.
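The arithmetic of the video example can be made concrete with a small Python sketch. The Retrievable class and the segment_video helper are hypothetical and only mirror the description above; they are not vitrivr-engine's actual types.

```python
import math
import uuid
from dataclasses import dataclass, field

@dataclass
class Retrievable:
    """Hypothetical stand-in for a retrievable: a type plus a unique id."""
    type: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))

def segment_video(duration_s, segment_length_s):
    """Create one source retrievable and one segment retrievable per fixed-length segment."""
    source = Retrievable("SOURCE:VIDEO")
    count = math.ceil(duration_s / segment_length_s)
    segments = [Retrievable("SEGMENT") for _ in range(count)]
    # Each segment points to its source via a partOf relationship.
    relationships = [(segment.id, "partOf", source.id) for segment in segments]
    return source, segments, relationships

source, segments, relationships = segment_video(duration_s=30, segment_length_s=1)
total_retrievables = 1 + len(segments)  # the source plus its segments
```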

Descriptor

The descriptor describes a retrievable in vitrivr-engine. The fundamental concept is that a retrievable's content is represented by descriptors, which are based on features.

For an image file and the average colour example: The source retrievable is described by one average colour descriptor. For a 30s video file and the average colour example: Each of the 30 one-second-segment retrievables is described by one average colour descriptor; the source retrievable is not described.

Overview of Descriptors

In vitrivr-engine, there are four distinct high-level types of descriptors:

  • Vector descriptors have a type (e.g. float) and a length. Ideal for NNS.
  • Struct descriptors have pre-defined sub-fields of various types.
  • Scalar descriptors consist of a single typed value.
  • Tensor descriptors represent a mathematical tensor. Not yet implemented [June, 2024]

Schema

vitrivr-engine operates on the notion of a named schema, similar to a database or a collection, essentially providing, among other things, a namespace.

{
  "schemas": {
    "my-schema": {}
  }
}

Database Connection

Each schema has to have a database connection which describes where the schema is persisted (and read from). The database which is supported by vitrivr-engine is CottontailDB.

{
  "database": "CottontailConnectionProvider",
  "parameters": {
    "Host": "127.0.0.1",
    "port": "1865"
  }
}

Field

In vitrivr-engine, the term field represents features which are to be used. In particular, each field is uniquely named and might be parameterised.

Note: In technical terms, each field has to be backed by an Analyser, whose output is a descriptor. During ingestion, the analyser produces the descriptor that represents a retrievable; during retrieval, the analysis step involves executing a query using the derived descriptor.

"uniqueName": {
  "factory": "FactoryClass",
  "parameters":{
    "key": "value"
  }
}

A note about fields in vitrivr-engine: Due to its highly modular architecture, a handful of features to be used as fields are shipped with vitrivr-engine. The toy example is the AverageColor. Depending on use case, custom features can be added.

See analyser / field overview.

Exporter

In contrast to an analyser / a field, an exporter in vitrivr-engine produces and exports new, derived data.

"uniqueName": {
    "factory": "FactoryClass",
    "resolverName": "resolverName",
    "parameters": {
        "key": "value"
    }
}

Resolver

A resolver is responsible for resolving a physical resource based on information present in a retrievable.

"uniqueName": {
    "factory": "FactoryClass",
    "parameters": {
        "key": "value"
    }
}

Schema Configuration

The schema configuration is the foundation of vitrivr-engine and therefore required on startup. The configuration consists of blocks for the database connection (one), fields (many), exporters (many), and resolvers (many):

{
    "schemas": {
        "schema-name": {
            "connection": {
                "database": "CottontailConnectionProvider",
                "parameters": {
                    "Host": "127.0.0.1",
                    "port": "1865"
                }
            },
            "fields": {
                "my-field-1": {
                    "factory": "AnalyserFactory"
                },
                "my-other-field": {
                    "factory": "AnotherAnalyserFactory"
                }
            },
            "resolvers": {
                "my-resolver": {
                    "factory": "ResolverFactory",
                    "parameters": {
                        "key": "value"
                    }
                }
            },
            "exporters": {
                "my-exporter": {
                    "factory": "ExporterFactory",
                    "resolverName": "my-resolver",
                    "parameters": {
                        "key1": "value1",
                        "key2": "value2"
                    }
                }
            },
            "extractionPipelines": {
                "my-video-pipeline": {
                    "path": "./videos.json"
                },
                "my-image-pipeline": {
                    "path": "./images.json"
                }
            }
        }
    }
}

The newly introduced property extractionPipelines maps the names of ingestion pipelines to the path of the JSON file containing the respective pipeline configuration. This is useful if pre-defined ingestion pipelines are to be used. However, it is also possible to provide the pipeline configuration on the fly, which is why this property is optional.

Creating indexes

pgVector

In general, an index can be added by declaring it on the field in the schema.json config, e.g.:

"whisperasr": {
  "factory": "ASR",
  "indexes": [
    {
      "attributes": ["value"],
      "type": "FULLTEXT",
      "parameters": {"type": "gin", "language": "english"}
    }
  ],
  "parameters": {
    "host": "http://10.34.64.83:8888/",
    "model": "whisper",
    "timeoutSeconds": "100",
    "retries": "1000"
  }
}

In pgVector we provide the following indexes for the query types FullText, NNS and SCALAR.

SCALAR Search

Nearest Neighbor Search NNS

Index hnsw

The hierarchical navigable small world index (HNSW) can be set up for a VECTOR field by adding the following configuration to the field in the schema config.

Parameters:

  • attributes: at most one attribute is allowed. The value "vector" is the attribute name in the database.
  • type: "NNS" describes the query type.
  • parameters.type: "hnsw" describes the index type.
  • distance: the distance metric for this index. The hnsw index provides:
    • "manhatten"
    • "euclidean"
    • "cosine"
    • "hamming"
    • "jaccard"
  • m: the maximum number of connections per layer (16 by default)
  • efConstruction: the size of the dynamic candidate list for constructing the graph (64 by default)
  • efSearch: the size of the dynamic candidate list for search (100 by default)

  "indexes": [
    {
      "attributes": [
        "vector"
      ],
      "type": "NNS",
      "parameters": {
        "type": "hnsw",
        "distance": "cosine",
        "m": "4",
        "efConstruction": "10",
        "efSearch": "1000"
      }
    }
  ]

FullText Search

Index gin

The generalized inverted index (GIN) can be set up for a FULLTEXT field by adding the following configuration to the field in the schema config.

Parameters:

  • attributes: the value "value" is the attribute name in the database. All attributes will be concatenated with the delimiter " || ' ' || " to create a document.
  • type: "FULLTEXT" describes the query type.
  • parameters.type: "gin" describes the index type.
  • language: the language of the full-text index (default "english")

  "indexes": [
    {
      "attributes": [
        "value"
      ],
      "type": "FULLTEXT",
      "parameters": {
        "type": "gin",
        "language": "english"
      }
    }
  ]

Cottontail

index create <schema>.<descriptor_field> <descriptor> LUCENE
index rebuild <index-name>
index create warren.ptt.descriptor_asr descriptor LUCENE
index rebuild warren.ptt.descriptor_asr.idx_descriptor_lucene

Ingestion

During ingestion, the multimedia data is analysed and features are extracted. Ingestion in vitrivr-engine is based on an ingestion pipeline definition, centered around so-called operators. The previously introduced analysers are one kind of such operators, which extract feature(s) corresponding to their field. Other operators include the previously introduced exporters.

Ingestion is schema-dependent and always directly linked to one specific schema.

Ingestion Context

The ingestion context provides vital information (the context) to an ingestion pipeline. Specifically, there is a global and a local context. The former provides key-value pairs for all operators of the pipeline, while the latter provides key-value pairs to specific operators based on their name. Moreover, the local context may override the global one: if there is a global "limit":"100" key-value pair and a local context provides a "limit":"50" key-value pair for one operator, this operator will have a limit of 50, provided it supports a limit.
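The override rule can be sketched as a simple lookup in Python. The resolve function, the operator name my-operator and the "global" key are assumptions made for this illustration; only the "local" key and the precedence of local over global values are taken from this page.

```python
def resolve(context, operator_name, key, default=None):
    """Look up a context property: the operator's local value wins over the global one."""
    local = context.get("local", {}).get(operator_name, {})
    if key in local:
        return local[key]
    return context.get("global", {}).get(key, default)

context = {
    "global": {"limit": "100"},                 # applies to every operator
    "local": {"my-operator": {"limit": "50"}},  # applies to one named operator only
}

limit_for_my_operator = resolve(context, "my-operator", "limit")
limit_for_other_operator = resolve(context, "other-operator", "limit")
```

Here limit_for_my_operator is "50" because the local entry overrides the global one, while any other operator falls back to the global "100".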

The ingestion context additionally has two essential properties, contentFactory and resolverName:

Content Factory

The ingestion context also includes a mandatory property, contentFactory, which requires the name of a ContentFactoriesFactory class. The purpose of this factory is to produce ContentFactory instances, which in turn produce Content, vitrivr-engine's representation of the media.

vitrivr-engine provides two such factories:

| Class | Description | Local Context Properties |
| --- | --- | --- |
| InMemoryContentFactory | Produces content and stores it in memory, which works fine for small datasets. | |
| CachedContentFactory | Produces content and caches the contents on disk. Designed for large datasets with large individual items (e.g. long high-res videos). | content.location: The path of the cache location; defaults to a temporary directory called vitrivr-cache |

The content.location local context property notation should be read as:

{
  "context":{
    "contentFactory":"CachedContentFactory",
    "resolverName":"<resolver-name>",
    "local":{
      "content":{
        "location":"<path-to-cache>"
      }
    }
  }
}

Fill in the placeholders <resolver-name> and <path-to-cache> as necessary.

Resolver Name

Another special ingestion context property is resolverName, which has to reference a resolver defined on the schema. The reason is that certain components may produce data which is relevant for both ingestion and retrieval, and the shared resolver ensures a common path.

Ingestion Operator

An ingestion operator is first defined and then used as one component within a pipeline. Operators come in various types:

  • ENUMERATOR enumerates sources and therefore serves as the starting point
  • DECODER decodes the content into consumable elements
  • EXTRACTOR extracts features and has to be backed by a field
  • EXPORTER exports derived data from the multimedia data
  • TRANSFORMER transforms the incoming retrievables to outgoing retrievables, possibly filtering them

The base structure of an ingestion operator is as follows:

{
  "type": "<type>",
  "<addressKey>":"<provider>"
}

Where the <type> represents one of the above introduced types, <addressKey> is one of factory (enumerator, decoder, transformer), fieldName (extractor), or exporterName (exporter). Some operators do have additional key-value configuration.

See Ingestion Operator Overview for further information on concrete implementations.

Enumerator

The enumerator emits elements based on its configuration.

{
  "type":"ENUMERATOR",
  "factory":"FactoryClass",
  "mediaTypes":["<mt>"]
}

Where <mt> stands for one of the following mediaTypes: IMAGE (images), VIDEO (videos), AUDIO (audio), MESH (3d objects). An enumerator can emit multiple media types, if necessary.

Decoder

The decoder segments the media data into content and thereby provides the segmentation to work on.

{
  "type":"DECODER",
  "factory":"FactoryClass"
}

Extractor

The extractor is backed by a field. It analyses the media content and extracts the feature representation.

{
  "type":"EXTRACTOR",
  "fieldName":"my-field"
}

my-field must be a field name defined on the schema.

Exporter

An exporter produces derived data, e.g. thumbnails from a video.

{
  "type":"EXPORTER",
  "exporterName":"my-exporter"
}

Where my-exporter is the name of an exporter defined on the schema.

Transformer

In vitrivr-engine, a transformer consumes retrievables and emits them, not necessarily one-to-one. For example, a filter transformer might drop retrievables based on a property.

{
  "type":"TRANSFORMER",
  "factory":"FactoryClass"
}

Ingestion Operations

In the ingestion configuration, the operations define the ingestion operator pipeline as a directed graph. An operation is a named node in the graph:

"operation-name": {
  "operator": "operator-name",
  "inputs": ["<input-stages>"],
  "merge": "<merge-strategy>"
}

operator-name must reference a previously defined operator, and <input-stages> must reference other existing operations. <merge-strategy> must be one of MERGE, COMBINE or CONCAT, see below.

The inputs and merge properties are optional with the following rules:

  • if the operator-name references an enumerator, then no inputs are expected, as the enumerator is the start node of the pipeline graph
  • if there is more than one element in the inputs list, then the merge property is required.
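The rules above lend themselves to a small consistency check. The following Python sketch is an assumption about how such a check could look, not vitrivr-engine's validation logic; the operator and operation names are made up, and the dictionaries mirror the JSON shapes shown on this page.

```python
MERGE_TYPES = {"MERGE", "COMBINE", "CONCAT"}

def validate_operations(operators, operations):
    """Return a list of rule violations; an empty list means all rules hold."""
    problems = []
    for name, operation in operations.items():
        operator = operators.get(operation.get("operator"))
        inputs = operation.get("inputs", [])
        if operator is None:
            problems.append(f"{name}: references an undefined operator")
            continue
        # Enumerators are start nodes of the graph and take no inputs.
        if operator["type"] == "ENUMERATOR" and inputs:
            problems.append(f"{name}: an enumerator must not have inputs")
        for ref in inputs:
            if ref not in operations:
                problems.append(f"{name}: input '{ref}' is not a defined operation")
        # More than one input requires an explicit merge strategy.
        if len(inputs) > 1 and operation.get("merge") not in MERGE_TYPES:
            problems.append(f"{name}: more than one input requires a merge strategy")
    return problems

operators = {
    "enumerator": {"type": "ENUMERATOR", "factory": "FactoryClass"},
    "decoder": {"type": "DECODER", "factory": "FactoryClass"},
    "colour": {"type": "EXTRACTOR", "fieldName": "my-field"},
    "thumbs": {"type": "EXPORTER", "exporterName": "my-exporter"},
    "filter": {"type": "TRANSFORMER", "factory": "FactoryClass"},
}
operations = {
    "enumerate": {"operator": "enumerator"},
    "decode": {"operator": "decoder", "inputs": ["enumerate"]},
    "extract": {"operator": "colour", "inputs": ["decode"]},
    "export": {"operator": "thumbs", "inputs": ["decode"]},
    "joined": {"operator": "filter", "inputs": ["extract", "export"], "merge": "COMBINE"},
}
problems = validate_operations(operators, operations)
```

In this example, decode feeds two operations (a branch) and joined has two inputs (a merge), so only joined needs the merge property.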

Branching and Merging

By defining the operations accordingly, two things can happen implicitly.

Branching: If an operation is used as input for multiple other operations, this results in branching. This is handled automatically by wrapping the associated Operator in a BroadcastOperator.

Merging: If an operation has multiple inputs, this results in merging, which combines multiple flows of Retrievables into a single flow. The merging strategy (MergeType) must be specified explicitly in the operation.

Currently, vitrivr-engine supports three types of merging strategies:

  • MERGE : Merges the Retrievables from the input operations in the order they arrive. Neither deduplication nor ordering is performed.
  • COMBINE : Merges Retrievables from the input operations and emits a Retrievable once it has been received on every input.
  • CONCAT : Collects Retrievables from the incoming flows in order of occurrence, i.e., operation 1, then operation 2, etc.
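To make the difference between the three strategies tangible, here is a simplified Python model. It is a sketch under stated assumptions rather than the engine's implementation: each input flow is modelled as a list of (arrival_time, retrievable_id) pairs, whereas the real operators work on asynchronous flows of Retrievables.

```python
def merge(flows):
    """MERGE: interleave all inputs in arrival order, without deduplication."""
    events = [event for flow in flows for event in flow]
    return [rid for _, rid in sorted(events, key=lambda event: event[0])]

def combine(flows):
    """COMBINE: emit an id as soon as it has been received on every input."""
    events = sorted((t, i, rid) for i, flow in enumerate(flows) for t, rid in flow)
    seen, emitted = {}, []
    for _, flow_index, rid in events:
        seen.setdefault(rid, set()).add(flow_index)
        if len(seen[rid]) == len(flows) and rid not in emitted:
            emitted.append(rid)
    return emitted

def concat(flows):
    """CONCAT: everything from the first input, then the second, and so on."""
    return [rid for flow in flows for _, rid in flow]

# Two hypothetical input operations with (arrival_time, retrievable_id) events.
flow_a = [(0, "r1"), (3, "r2")]
flow_b = [(1, "r2"), (2, "r3")]

merged = merge([flow_a, flow_b])
combined = combine([flow_a, flow_b])
concatenated = concat([flow_a, flow_b])
```

With these inputs, merged follows the timestamps, combined contains only r2 (the one id seen on both inputs), and concatenated lists all of flow_a before flow_b.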

Persistence

In order to persist the results of the ingestion, one or more operations have to be specified in the special output property of the ingestion configuration. If multiple operations are specified as output, a mergeType additionally has to be defined, see merging.

Ingestion Configuration

An ingestion pipeline is stored as a JSON file.

Its properties are as follows:

For a simple example, see the Getting Started guide's ingestion pipeline. For a more advanced example, see the Example guide's ingestion pipeline.

Retrieval

During retrieval time, queries are sent to vitrivr-engine with the aim to retrieve information, based on previous ingestion. Centered around retrieval operators, vitrivr-engine comes with its own query language which consists of four core components:

  • The inputs define the query payload
  • The operations define the order of retrieval operators as a pipeline
  • The query context provides, similar to the ingestion context, vital contextual information
  • The output specifies which operation is returned

Retrieval in vitrivr-engine is schema dependent and directly linked to one schema.

Retrieval Context

Similar to the ingestion context, the retrieval context consists of a local and global component. See ingestion context for more information.

Query Input

Essentially the payload of the query, the input is a typed, named component of a query. The types supported are:

  • TEXT for textual input
  • IMAGE for image input
  • VECTOR for vector input
  • ID to query for an ID
  • BOOLEAN for boolean input
  • NUMERIC for numerical input
  • DATE for datetime inputs

See Query Input Overview for further information.

Query Operators

There are three types of query operators, which do have a certain similarity to the ingestion operators by design:

  • RETRIEVERs are the EXTRACTORs' counterpart; they are backed by a field and perform retrieval
  • TRANSFORMERs transform the retrievables, similar to ingestion TRANSFORMERs
  • AGGREGATORs aggregate multiple retrievables.

See Query Operator Overview for further information.

Retriever Query Operator

The retriever operator retrieves retrievables from the storage layer based on its analyser's capacity. Retrievers are by definition backed by a field; hence, their semantics depend very much on that field.

{
  "type":"RETRIEVER",
  "field":"fieldname",
  "input":"<inputname>"
}

Where fieldname is the name of a field defined on the schema and <inputname> is the name of an input.

Retrievers may have additional properties set in the local or global query context.

Simple Boolean Query

A special notation for StructDescriptors (see Analyser Overview) is in place to formulate simple Boolean queries. Given an input with a comparison specified, the dot (.) notation as in the following example results in a simple Boolean query on the subfield:

{
  "type":"RETRIEVER",
  "field":"fieldname.subfieldname",
  "input":"input-with-comparison"
}

Assuming the input-with-comparison is defined as follows:

{
  "type":"NUMERIC",
  "data":"10000",
  "comparison":">="
}

And given that fieldname.subfieldname is numerical (e.g. the FileSourceMetadata.size subfield), the simple Boolean query reads as

Give me retrievables of fieldname where the subfield's value is greater than or equal to 10000
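The semantics of such a query can be mimicked in a few lines of Python. This is a sketch, not how vitrivr-engine evaluates queries (that happens in the storage layer); the COMPARATORS table is an assumption beyond the ">=" shown above, and the descriptors are made-up data.

```python
import operator

# Assumed mapping from comparison strings to Python operators; only ">=" appears on this page.
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    ">=": operator.ge,
    ">": operator.gt,
}

def matches(descriptor, subfield, query_input):
    """Check whether a struct descriptor's subfield satisfies the input's comparison."""
    compare = COMPARATORS[query_input["comparison"]]
    return compare(descriptor[subfield], float(query_input["data"]))

query_input = {"data": "10000", "comparison": ">="}

# Hypothetical struct descriptors keyed by retrievable id, each with a numeric size subfield.
descriptors = {
    "small-file": {"size": 512},
    "exact-file": {"size": 10000},
    "large-file": {"size": 250000},
}

hits = [rid for rid, d in descriptors.items() if matches(d, "size", query_input)]
```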

Transformer Query Operator

The transformer operator takes retrievables, processes them and emits retrievables again. This is not necessarily a one-to-one operation. Common transformations include, among others, the expansion of relationships as well as the lookup of certain (sub)fields.

{
  "type":"TRANSFORMER",
  "transformerName":"TransformerClass",
  "input":"<input-stage>"
}

Transformers may have additional properties set in the global or local context.

See Query Transformer Overview for further information.

Aggregator Query Operator

The aggregator operator aggregates incoming retrievables based on its aggregation strategy, inherent to the aggregator.

{
  "type":"AGGREGATOR",
  "aggregatorName":"AggregatorClass",
  "inputs":["<input-operations>"]
}

Where the <input-operations> are previously defined operations.

Aggregators may have additional properties set in the global or local context.

See Query Aggregator Overview for further information.

Query Configuration

The query configuration is provided as JSON. It consists of the following properties:

For a simple example, see the Getting Started guide's query. For a more advanced example, see the Example guide's query.