Documentation
Here, we document the inner workings of vitrivr-engine, introduce the concepts employed and aim to provide a good overview of its components.
This chapter introduces common terminology.
In content-based multimedia retrieval, the aim is to search within multimedia collections (e.g. video, image, audio, 3d objects) on a content, and hence semantic, level. This is a non-trivial problem due to the so-called semantic gap: the stark difference in semantic understanding of content between humans and machines. Recent developments in foundation models have reduced this gap, yet, to efficiently search within large collections of multimedia data, various techniques are employed.
In (multimedia) retrieval, a common distinction is made between two phases. The ingestion phase (also known as offline phase) is the phase during which the multimedia content is analysed and representations of the content are stored in an efficient way for later use.
The retrieval phase (also known as online phase) describes actions performed after ingestion, when (user) queries to the system are analysed in the same manner as the multimedia data was, and the comparison of query and content operates on those representations. The outcome is usually a list of results, each with an accompanying similarity score, which indicates how similar the result is to the query. Commonly, a similarity score of 1 represents identity, while a similarity score of 0 indicates the greatest dissimilarity.
In multimedia retrieval, a feature is a means of representing the multimedia content.
A very primitive feature is the average colour: Given an image (either a still image or a frame from a video), one calculates the average colour by averaging the individual pixels' RGB values. While this is not very expressive on its own, it demonstrates how features work.
During ingestion, the average colour is calculated for all the input data (again, this could be for example a bunch of images or a couple of representative frames from a video) and stored in the database as three-element vectors (R,G,B).
During retrieval time, the query consists of a single three-element vector (R,G,B) and a Nearest Neighbour Search (NNS) is performed on those average colour vectors. The distance is then converted to a similarity score s on the interval [0, 1].
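The average colour example can be sketched in a few lines of Python. This is only an illustration and not vitrivr-engine code: the function names are hypothetical, the search is brute force, and the linear mapping from Euclidean distance to a score (1 for identity, 0 for the greatest dissimilarity) is an assumption chosen for simplicity.

```python
from math import sqrt

# Largest possible Euclidean distance between two RGB vectors (black vs. white).
MAX_DISTANCE = sqrt(3 * 255 ** 2)


def average_colour(pixels):
    """Average the RGB values of a non-empty list of (R, G, B) pixel tuples."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def similarity(a, b):
    """Map the Euclidean distance of two RGB vectors to a score in [0, 1].

    The linear mapping 1 - d / d_max is an assumption made for this sketch.
    """
    distance = sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 - distance / MAX_DISTANCE


def nearest_neighbours(query, stored, k=1):
    """Brute-force NNS over a dict of id -> average colour vector."""
    scored = [(rid, similarity(query, vec)) for rid, vec in stored.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Ingestion: compute and store one vector per (tiny, made-up) image.
stored = {
    "reddish": average_colour([(255, 0, 0), (245, 10, 10)]),
    "blue": average_colour([(0, 0, 255)]),
}
# Retrieval: a pure red query vector ranks the reddish image first.
results = nearest_neighbours((255, 0, 0), stored, k=2)
```

A real system would replace the brute-force loop with an index in the database, but the ranking principle stays the same.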
- Basics: Wikipedia
- Research: vitrivr
- Book: Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval, ACM Press Books, 1999 (1st edition), 2011 (2nd edition)
There are a lot of (research) publications out there which cover (multimedia) retrieval in great detail.
vitrivr-engine's data model is based on almost a decade of research in multimedia retrieval. Influenced by its predecessor, the retrieval engine Cineast, the aim of the data model is to be as flexible as possible while still providing foundational guidelines for consumers of vitrivr-engine.
In vitrivr-engine, a retrievable is the unit of retrieval and the logical representation of multimedia data. Depending on the type of multimedia, one (e.g. image) or more (e.g. video) retrievables exist.
For an image file, a single retrievable of the type SOURCE:IMAGE is created.
For a video file, a single retrievable of the type SOURCE:VIDEO is created, along with several retrievables of the type SEGMENT, depending on the segmentation strategy. For a 30s video with a 1s fixed-length segmentation, the result is 31 retrievables: one per second plus the one for the file. The one-second-segment retrievables have a partOf relationship towards the source retrievable.
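The arithmetic of the video example can be made concrete with a small sketch. The Retrievable class and the id scheme below are hypothetical stand-ins, not vitrivr-engine's actual types; only the type names and the partOf relationship are taken from the text.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Retrievable:
    """Hypothetical stand-in for a retrievable."""
    id: str
    type: str
    part_of: Optional[str] = None  # id of the source retrievable (partOf), if any


def retrievables_for_video(name, duration_s, segment_s=1.0):
    """Create one source retrievable plus one segment per fixed-length slice."""
    source = Retrievable(id=name, type="SOURCE:VIDEO")
    count = math.ceil(duration_s / segment_s)
    segments = [
        Retrievable(id=f"{name}#{i}", type="SEGMENT", part_of=source.id)
        for i in range(count)
    ]
    return [source] + segments


# A 30s video with 1s segments: 30 segments plus the source itself.
all_retrievables = retrievables_for_video("clip", duration_s=30)
```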
The descriptor describes a retrievable in vitrivr-engine. The fundamental concept is that a retrievable's content is represented by descriptors, which are based on features.
For an image file and the average colour example: The source retrievable is described by one average colour descriptor. For a 30s video file and the average colour example: Each of the 30 one-second-segment retrievables is described by one average colour descriptor; the source retrievable is not described.
In vitrivr-engine, there are four distinct high-level types of descriptors:
- Vector descriptors have a type (e.g. float) and a length. Ideal for NNS.
- Struct descriptors have pre-defined sub-fields of various types.
- Scalar descriptors consist of a single typed value.
- Tensor descriptors represent a mathematical tensor. Not yet implemented [June, 2024]
vitrivr-engine operates on the notion of a named schema, similarly to a database or a collection, essentially providing, among other things, a namespace.
{
"schemas": {
"my-schema": {}
}
}
Each schema has to have a database connection which describes where the schema is persisted (and read from). The database which is supported by vitrivr-engine is CottontailDB.
{
"database": "CottontailConnectionProvider",
"parameters": {
"Host": "127.0.0.1",
"port": "1865"
}
}
In vitrivr-engine, the term field represents features which are to be used. In particular, each field is uniquely named and might be parameterised.
Note: In technical terms, each field has to be backed by an Analyser, whose output is a descriptor. During ingestion, the analyser produces the descriptor representing a retrievable; during retrieval, the analysis step involves the execution of a query using the derived descriptor.
"uniqueName": {
"factory": "FactoryClass",
"parameters":{
"key": "value"
}
}
A note about fields in vitrivr-engine: Due to its highly modular architecture, a handful of features to be used as fields are shipped with vitrivr-engine. The toy example is AverageColor. Depending on the use case, custom features can be added.
See analyser / field overview.
In contrast to an analyser / a field, an exporter in vitrivr-engine produces new, derived data.
"uniqueName": {
"factory": "FactoryClass",
"resolverName": "resolverName",
"parameters": {
"key": "value"
}
}
A resolver is responsible for resolving a physical resource based on information present in a retrievable.
"uniqueName": {
"factory": "FactoryClass",
"parameters": {
"key": "value"
}
}
The schema configuration is the foundation of vitrivr-engine and therefore required on startup. The configuration consists of blocks for the database connection (one), fields (many), exporters (many), and resolvers (many):
{
"schemas": {
"schema-name": {
"connection": {
"database": "CottontailConnectionProvider",
"parameters": {
"Host": "127.0.0.1",
"port": "1865"
}
},
"fields": {
"my-field-1": {
"factory": "AnalyserFactory"
},
"my-other-field": {
"factory": "AnotherAnalyserFactory"
}
},
"resolvers": {
"my-resolver": {
"factory": "ResolverFactory",
"parameters": {
"key": "value"
}
}
},
"exporters": {
"my-exporter": {
"factory": "ExporterFactory",
"resolverName": "my-resolver",
"parameters": {
"key1": "value1",
"key2": "value2"
}
}
},
"extractionPipelines": {
"my-video-pipeline": {
"path": "./videos.json"
},
"my-image-pipeline": {
"path": "./images.json"
}
}
}
}
}
The newly introduced property extractionPipelines is a list of named ingestion pipelines, each with the path to the JSON file containing the pipeline configuration.
This is useful if pre-defined ingestion pipelines are to be used. However, there is also the possibility to provide the pipeline configuration on-the-fly, which is why this property is optional.
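Since the schema configuration is required on startup, a quick consistency check before launching can be useful. The sketch below assumes that an exporter's resolverName is meant to name an entry under resolvers of the same schema, as the example configuration suggests. The function name is hypothetical and this is not part of vitrivr-engine.

```python
def find_dangling_resolvers(config):
    """Return (schema, exporter, resolverName) triples that name no resolver."""
    problems = []
    for schema_name, schema in config.get("schemas", {}).items():
        resolvers = schema.get("resolvers", {})
        for exporter_name, exporter in schema.get("exporters", {}).items():
            resolver_name = exporter.get("resolverName")
            if resolver_name not in resolvers:
                problems.append((schema_name, exporter_name, resolver_name))
    return problems


# In practice the dict would come from json.load() on the schema configuration file.
example = {
    "schemas": {
        "schema-name": {
            "resolvers": {"my-resolver": {"factory": "ResolverFactory"}},
            "exporters": {
                "my-exporter": {"factory": "ExporterFactory", "resolverName": "my-resolver"},
                "broken-exporter": {"factory": "ExporterFactory", "resolverName": "missing"},
            },
        }
    }
}
dangling = find_dangling_resolvers(example)
```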
In general, an index can be added by adding it to the schema.json config, e.g.:
"whisperasr": {
"factory": "ASR",
"indexes": [{"attributes":["value"],"type":"FULLTEXT","parameters":{"type":"gin", "language": "english"}}],
"parameters": {
"host": "http://10.34.64.83:8888/",
"model": "whisper",
"timeoutSeconds": "100",
"retries": "1000"
}
},
In pgVector we provide the following indexes for the query types FULLTEXT, NNS and SCALAR.
The hierarchical navigable small world (HNSW) index can be set up for a VECTOR field by adding the following configuration to the field in the schema config.
Parameters:
- attributes: at most one attribute is allowed. The value "vector" is the attribute name in the database.
- type: "NNS" describes the query type.
- parameters.type: "hnsw" describes the index type.
- distance: describes the distance metric for this index. The hnsw index provides:
  - "manhatten"
  - "euclidean"
  - "cosine"
  - "hamming"
  - "jaccard"
- m: the max number of connections per layer (16 by default)
- efConstruction: the size of the dynamic candidate list for constructing the graph (64 by default)
- efSearch: the size of the dynamic candidate list for search (100 by default)
"indexes": [
{
"attributes": [
"vector"
],
"type": "NNS",
"parameters": {
"type": "hnsw",
"distance": "cosine",
"m": "4",
"efConstruction": "10",
"efSearch": "1000"
}
}
]
The generalized inverted index (GIN) can be set up for a FULLTEXT field by adding the following configuration to the field in the schema config.
Parameters:
- attributes: The value "value" is the attribute name in the database. All attributes will be concatenated with the delimiter " || ' ' || " to create a document.
- type: "FULLTEXT" describes the query type.
- parameters.type: "gin" describes the index type.
- language: the language for the full-text index (default "english").
"indexes": [
{
"attributes": [
"value"
],
"type": "FULLTEXT",
"parameters": {
"type": "gin",
"language": "english"
}
}
]
index create <schema>.<descriptor_field> <descriptor> LUCENE
index rebuild <name-of-index>
index create warren.ptt.descriptor_asr descriptor LUCENE
index rebuild warren.ptt.descriptor_asr.idx_descriptor_lucene
During ingestion, the multimedia data is analysed and features are extracted. Ingestion in vitrivr-engine is based on an ingestion pipeline definition, centered around so-called operators. The previously introduced analysers are one kind of such operators, which extract feature(s) corresponding to their field. Other operators include the previously introduced exporters.
Ingestion is schema-dependent and always directly linked to one specific schema.
The ingestion context provides vital information (the context) to an ingestion pipeline.
Specifically, there is a global and a local context.
The former provides key-value pairs for all operators of the pipeline, while the latter provides key-value pairs to specific operators based on their name. Moreover, the local context may override the global one: if there is a global "limit":"100" key-value pair and a local context provides a "limit":"50" key-value pair for one operator, that operator will have a limit of 50, provided it supports a limit.
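The override rule can be sketched as a simple dictionary merge. The "global" key and the function name are assumptions made for illustration. The document only shows the "local" key in its JSON examples, and vitrivr-engine's actual lookup may differ.

```python
def resolve_context(context, operator_name):
    """Effective key-value pairs for one operator: global first, local on top."""
    effective = dict(context.get("global", {}))
    effective.update(context.get("local", {}).get(operator_name, {}))
    return effective


context = {
    "global": {"limit": "100"},
    "local": {"my-operator": {"limit": "50"}},
}
limited = resolve_context(context, "my-operator")      # local value wins
unlimited = resolve_context(context, "other-operator")  # falls back to global
```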
The ingestion context additionally has two essential properties, contentFactory and resolverName.
The mandatory contentFactory property requires the name of a ContentFactoriesFactory class.
The purpose of this factory is to produce ContentFactory instances, which in turn produce Content, vitrivr-engine's representation of the media.
vitrivr-engine provides two such factories:
Class | Description | Local Context Properties |
---|---|---|
InMemoryContentFactory | Produces content and stores it in-memory, which works fine for small datasets. | |
CachedContentFactory | Produces content and caches the contents on disk. Designed for large datasets with large individual items (e.g. long high-res videos). | content.location: The path location for the cache, defaults to a temporary directory called vitrivr-cache |
The content.location local context property notation should be read as:
{
"context":{
"contentFactory":"CachedContentFactory",
"resolverName":"<resolver-name>",
"local":{
"content":{
"location":"<path-to-cache>"
}
}
}
}
Fill in the placeholders <resolver-name> and <path-to-cache> as necessary.
Another special ingestion context property is the resolverName property, which has to reference a resolver defined on the schema.
The reason is that certain components may produce data which is relevant for both ingestion and retrieval, and the shared resolver ensures a common path.
An ingestion operator is first defined and then used as one component within a pipeline. Operators come in various types:
- ENUMERATOR: enumerates sources and therefore serves as the starting point
- DECODER: decodes the content into consumable elements
- EXTRACTOR: extracts features and has to be backed by a field
- EXPORTER: exports derived data from the multimedia data
- TRANSFORMER: transforms the incoming retrievables to outgoing retrievables, possibly filtering them
The base structure of an ingestion operator is as follows:
{
"type": "<type>",
"<addressKey>":"<provider>"
}
Where <type> represents one of the types introduced above and <addressKey> is one of factory (enumerator, decoder, transformer), fieldName (extractor), or exporterName (exporter).
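The relation between operator type and address key can be captured in a small lookup table. The helper below is a hypothetical convenience for generating operator definitions and not part of vitrivr-engine. The type names and address keys are those listed above.

```python
# Which address key each operator type uses, as described in the text.
ADDRESS_KEYS = {
    "ENUMERATOR": "factory",
    "DECODER": "factory",
    "TRANSFORMER": "factory",
    "EXTRACTOR": "fieldName",
    "EXPORTER": "exporterName",
}


def make_operator(op_type, provider, **extra):
    """Build an operator definition dict with the address key matching its type."""
    if op_type not in ADDRESS_KEYS:
        raise ValueError(f"unknown operator type: {op_type}")
    operator = {"type": op_type, ADDRESS_KEYS[op_type]: provider}
    operator.update(extra)  # additional key-value configuration, if any
    return operator


extractor = make_operator("EXTRACTOR", "my-field")
enumerator = make_operator("ENUMERATOR", "FactoryClass", mediaTypes=["IMAGE"])
```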
Some operators do have additional key-value configuration.
See Ingestion Operator Overview for further information on concrete implementations.
The enumerator emits elements based on its configuration.
{
"type":"ENUMERATOR",
"factory":"FactoryClass",
"mediaTypes":["<mt>"]
}
Where <mt> stands for one of the following mediaTypes: IMAGE (images), VIDEO (videos), AUDIO (audio), MESH (3d objects). An enumerator can emit multiple media types, if necessary.
The decoder segments the media data into content and thereby provides the segmentation to work on.
{
"type":"DECODER",
"factory":"FactoryClass"
}
The extractor is backed by a field. It analyses the media content and extracts the feature representation.
{
"type":"EXTRACTOR",
"fieldName":"my-field"
}
my-field must be a field name defined on the schema.
An exporter produces derived data, e.g. thumbnails from a video.
{
"type":"EXPORTER",
"exporterName":"my-exporter"
}
Where my-exporter is the name of an exporter defined on the schema.
In vitrivr-engine, a transformer consumes retrievables and emits them, not necessarily one-to-one. For example, there might be a filter transformer which filters retrievables on a property.
{
"type":"TRANSFORMER",
"factory":"FactoryClass"
}
In the ingestion configuration, the operations define the ingestion operator pipeline / directed graph. An operation is a named node in the graph:
"operation-name": {
"operator":"operator-name",
"inputs": ["<input-stages>"],
"merge":"<merge-strategy>"
}
operator-name must reference a previously defined operator, and each entry in <input-stages> must reference another existing operation.
<merge-strategy> must be one of MERGE, COMBINE or CONCAT, see below.
The inputs and merge properties are optional with the following rules:
- If the operator-name references an enumerator, then no inputs are expected, as the enumerator is the start node of the pipeline graph.
- If there is more than one element in the inputs list, then the merge property is required.
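These rules lend themselves to a small pre-flight check. The following sketch validates a dict of operations against a dict of operators. The function name and error messages are hypothetical, and vitrivr-engine performs its own validation, which may be stricter or differ in detail.

```python
VALID_MERGE_TYPES = {"MERGE", "COMBINE", "CONCAT"}


def validate_operations(operators, operations):
    """Check operations against the rules described above; return error strings."""
    errors = []
    for name, operation in operations.items():
        operator = operators.get(operation.get("operator"))
        if operator is None:
            errors.append(f"{name}: unknown operator {operation.get('operator')!r}")
            continue
        inputs = operation.get("inputs", [])
        for stage in inputs:
            if stage not in operations:
                errors.append(f"{name}: unknown input operation {stage!r}")
        if operator["type"] == "ENUMERATOR" and inputs:
            errors.append(f"{name}: an enumerator must not have inputs")
        if len(inputs) > 1 and operation.get("merge") not in VALID_MERGE_TYPES:
            errors.append(f"{name}: multiple inputs require a valid merge strategy")
    return errors


operators = {
    "enumerator": {"type": "ENUMERATOR", "factory": "FactoryClass"},
    "decoder": {"type": "DECODER", "factory": "FactoryClass"},
    "extractor": {"type": "EXTRACTOR", "fieldName": "my-field"},
    "filter": {"type": "TRANSFORMER", "factory": "FactoryClass"},
}
operations = {
    "enumerate": {"operator": "enumerator"},
    "decode": {"operator": "decoder", "inputs": ["enumerate"]},
    "extract": {"operator": "extractor", "inputs": ["decode"]},
    "filtered": {"operator": "filter", "inputs": ["decode", "extract"], "merge": "COMBINE"},
}
errors = validate_operations(operators, operations)
```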
By defining the operations accordingly, there are two things that can happen implicitly.
Branching: If an operation is used as input for multiple other operations, this results in a branching. This is handled automatically by wrapping the associated Operator in a BroadcastOperator.
Merging: If an operation has multiple inputs, this results in a merging, which combines multiple flows of Retrievables into a single flow. The merging strategy (MergeType) must be specified explicitly in the operation.
Currently, vitrivr-engine supports three types of merging strategies:
- MERGE: Merges the Retrievables from the input operations in the order they arrive. No deduplication or ordering is performed.
- COMBINE: Merges Retrievables from the input operations and emits a Retrievable once it has been received on every input.
- CONCAT: Collects Retrievables from the incoming flows in order of occurrence, i.e., operation 1, then operation 2 etc.
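The difference between the three strategies is easiest to see on a toy example. The sketch below models each input flow as a list of (arrival_time, retrievable_id) events. This is a simplification for illustration only, since vitrivr-engine operates on asynchronous flows rather than lists, and the function names are hypothetical.

```python
def merge(flows):
    """MERGE: emit everything in arrival order, without deduplication."""
    events = [event for flow in flows for event in flow]
    return [rid for _, rid in sorted(events, key=lambda e: e[0])]


def combine(flows):
    """COMBINE: emit a retrievable once it has been received on every input."""
    events = sorted(
        ((t, rid, index) for index, flow in enumerate(flows) for t, rid in flow),
        key=lambda e: e[0],
    )
    seen_on = {}   # retrievable id -> set of input indices it arrived on
    emitted = []
    for _, rid, index in events:
        seen_on.setdefault(rid, set()).add(index)
        if len(seen_on[rid]) == len(flows) and rid not in emitted:
            emitted.append(rid)
    return emitted


def concat(flows):
    """CONCAT: all of operation 1, then all of operation 2, and so on."""
    return [rid for flow in flows for _, rid in flow]


flow_a = [(0, "r1"), (3, "r2")]
flow_b = [(1, "r2"), (2, "r1"), (4, "r3")]
merged = merge([flow_a, flow_b])        # ["r1", "r2", "r1", "r2", "r3"]
combined = combine([flow_a, flow_b])    # ["r1", "r2"]
concatenated = concat([flow_a, flow_b])  # ["r1", "r2", "r2", "r1", "r3"]
```

Note that r3 never appears in the COMBINE result, because it was only received on one of the two inputs.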
In order to persist the results of the ingestion, one or more operations have to be specified in the special output property of the ingestion configuration.
If multiple operations are specified as output, then additionally, a mergeType has to be defined, see merging.
An ingestion pipeline is stored as a JSON file.
Its properties are as follows:
- schema: The schema the ingestion operates on
- context: The global and local ingestion context
- operators: The ingestion operators
- operations: The ingestion operations
- output: The persistence operations
- mergeType: Optional merge strategy
For a simple example, see the Getting Started guide's ingestion pipeline. For a more advanced example, see the Example guide's ingestion pipeline.
During retrieval time, queries are sent to vitrivr-engine with the aim to retrieve information, based on previous ingestion. Centered around retrieval operators, vitrivr-engine comes with its own query language which consists of four core components:
- The inputs define the query payload
- The operations define the order of retrieval operators as a pipeline
- The query context provides, similar to the ingestion context, vital contextual information
- The output specifies which operation is returned
Retrieval in vitrivr-engine is schema dependent and directly linked to one schema.
Similar to the ingestion context, the retrieval context consists of a local and a global component.
See ingestion context for more information.
Essentially the payload of the query, the input is a typed, named component of a query. The types supported are:
- TEXT for textual input
- IMAGE for image input
- VECTOR for vector input
- ID to query for an ID
- BOOLEAN for boolean input
- NUMERIC for numerical input
- DATE for datetime inputs
See Query Input Overview for further information.
There are three types of query operators, which by design bear a certain similarity to the ingestion operators:
- RETRIEVERs are the EXTRACTORs' counterpart: they are backed by a field and perform retrieval
- TRANSFORMERs transform the retrievables, similar to ingestion TRANSFORMERs
- AGGREGATORs aggregate multiple retrievables.
See Query Operator Overview for further information.
The retriever operator retrieves retrievables from the storage layer based on its analyser's capacity. Retrievers are by definition backed by a field and hence, the semantics very much depend on the field.
{
"type":"RETRIEVER",
"field":"fieldname",
"input":"<inputname>"
}
Where fieldname is the name of a field defined on the schema and <inputname> is the name of an input.
Retrievers may have additional properties set in the local or global query context.
A special notation for StructDescriptors (see Analyser Overview) is in place to formulate simple Boolean queries.
Given an input with a comparison specified, the dot (.) notation as in the following example results in a simple Boolean query on the subfield:
{
"type":"RETRIEVER",
"field":"fieldname.subfieldname",
"input":"input-with-comparison"
}
Assuming the input-with-comparison is defined as follows:
{
"type":"NUMERICAL",
"data":"10000",
"comparison":">="
}
And given that fieldname.subfieldname is numerical (e.g. the FileSourceMetadata.size subfield), the simple Boolean query reads as:
Give me retrievables of fieldname where the subfield's value is greater than or equal to 10000
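The semantics of such a simple Boolean query can be illustrated with a short filter. This is a sketch of the meaning only. In vitrivr-engine the comparison is evaluated by the database, and of the comparison symbols below only ">=" appears in the text above, the others being assumed for illustration.

```python
import operator

COMPARISONS = {
    ">=": operator.ge,  # the comparison used in the example above
    ">": operator.gt,   # the remaining symbols are assumptions for this sketch
    "<=": operator.le,
    "<": operator.lt,
    "==": operator.eq,
}


def matches(subfield_value, query_input):
    """Apply the input's comparison to a numerical subfield value."""
    compare = COMPARISONS[query_input["comparison"]]
    return compare(subfield_value, float(query_input["data"]))


def filter_retrievables(retrievables, subfield, query_input):
    """Keep the retrievables whose subfield satisfies the comparison."""
    return [r for r in retrievables if matches(r[subfield], query_input)]


query_input = {"type": "NUMERICAL", "data": "10000", "comparison": ">="}
files = [
    {"id": "a", "size": 9999},
    {"id": "b", "size": 10000},
    {"id": "c", "size": 250000},
]
large_files = filter_retrievables(files, "size", query_input)
```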
The transformer operator takes retrievables, processes them and emits retrievables again. This is not necessarily a one-to-one operation. Common transformations include, among others, the expansion of relationships as well as the lookup of certain (sub)fields.
{
"type":"TRANSFORMER",
"transformerName":"TransformerClass",
"input":"<input-stage>"
}
Transformers may have additional properties set in the global or local context.
See Query Transformer Overview for further information.
The aggregator operator aggregates incoming retrievables based on its aggregation strategy, inherent to the aggregator.
{
"type":"AGGREGATOR",
"aggregatorName":"AggregatorClass",
"inputs":["<input-operations>"]
}
Where the <input-operations> are previously defined operations.
Aggregators may have additional properties set in the global or local context.
See Query Aggregator Overview for further information.
The query configuration is provided as JSON. It consists of the following properties:
- context: The global and local query context
- inputs: The input payloads
- operations: The query operators, as named operations
- output: The operation name that is eventually emitted to the caller
For a simple example, see the Getting Started guide's query. For a more advanced example, see the Example guide's query.
Found an issue in the wiki? Post it!
Have a question? Ask it
Disclaimer: Please keep in mind, vitrivr and vitrivr-engine are predominantly research prototypes.