This extension "airport
" enables the use of Arrow Flight with DuckDB.
Arrow Flight is an RPC framework for high-performance data services based on Apache Arrow and is built on top of gRPC and the Arrow IPC format.
airport_list_flights(location, criteria, auth_token="token_value", secret="secret_name")
Description: This function returns a list of Arrow Flights that are available at a particular endpoint.
Parameter Name | Type | Description |
---|---|---|
location |
VARCHAR |
This is the location of the Flight server |
criteria |
VARCHAR |
This is free-form criteria to pass to the Flight server |
Parameter Name | Type | Description |
---|---|---|
auth_token |
VARCHAR |
A bearer value token to present to the server, the header is formatted like Authorization: Bearer <auth_token> |
secret |
VARCHAR |
This is the name of the DuckDB secret to use to supply the value for the auth_token |
> select * from airport_list_flights('http://127.0.0.1:8815', null);
flight_descriptor = [uploaded.parquet]
endpoint = [{'ticket': uploaded.parquet, 'location': [grpc://0.0.0.0:8815], 'expiration_time': NULL, 'app_metadata': }]
ordered = false
total_records = 3
total_bytes = 363
app_metadata =
schema = Character: string
In addition to the criteria
parameter, the Airport extension will pass additional GRPC headers.
The airport-duckdb-json-filters
header is sent on the GRPC requests. The header contains a JSON serialized representation of all of the conditional filters that are going applied to the results.
To illustrate this through an example:
select * from airport_list_flights('grpc://localhost:8815/', null) where total_bytes = 5;
The GRPC header airport-duckdb-json-filters
will be set to
{
"filters": [
{
"expression_class": "BOUND_COMPARISON",
"type": "COMPARE_EQUAL",
"left": {
"expression_class": "BOUND_COLUMN_REF",
"type": "BOUND_COLUMN_REF",
"alias": "total_bytes",
"return_type": {
"id": "BIGINT"
}
},
"right": {
"expression_class": "BOUND_CONSTANT",
"type": "VALUE_CONSTANT",
"value": {
"type": {
"id": "BIGINT"
},
"is_null": false,
"value": 5
}
}
}
]
}
The airport-duckdb-json-filters
header will not contain newlines, but the JSON has been reformatted in this document for ease of comprehension.
It is up to the implementer of the server to use this header to apply optimizations. The Airport DuckDB extension will still apply the filters to the result returned by the server. This means that the filter logic is purely advisory. In the author's experience, if Arrow Flight servers implement the filtering logic server side it can unlock some impressive optimizations. The JSON schema of the serialized filter expressions is not guaranteed to remain unchanged across DuckDB versions, the serialization is performed by the DuckDB code.
The header airport-duckdb-column-ids
will contain a comma-separated list of column indexes that are used in the query. The Arrow Flight server can return nulls for columns that are not requested. This can be used to reduce the amount of data that is transmitted in the response.
airport_take_flight(location, descriptor, auth_token="token_value", secret="secret_name")
Description: This function is a table returning function, it returns the contents of the Arrow Flight.
Parameter Name | Type | Description |
---|---|---|
location |
VARCHAR |
This is the location of the Flight server |
descriptor |
ANY |
This is the descriptor of the flight. If it is a VARCHAR or BLOB it is interpreted as a command, if it is an ARRAY or VARCHAR[] it is considered a path-based descriptor. |
Parameter Name | Type | Description |
---|---|---|
auth_token |
VARCHAR |
A bearer value token to present to the server, the header is formatted like Authorization: Bearer <auth_token> |
secret |
VARCHAR |
This is the name of the DuckDB secret to use to supply the value for the auth_token |
select * from airport_take_flight('grpc://localhost:8815/', ['counter-stream']) limit 5;
┌─────────┐
│ counter │
│ int64 │
├─────────┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
│ 4 │
└─────────┘
To create a secret that can be used by airport_take_flight
and airport_list_flight
use the standard DuckDB CREATE SECRET
command.
CREATE SECRET airport_hello_world (
type airport,
token 'test-token',
scope 'grpc+tls://server.example.com/'
);
The Airport extension respects the scope(s) specified in the secret. If a value for auth_token
isn't supplied, but a secret exists with a scope that matches the server location the value for the auth_token
will be used from the secret.
- Integration with the DuckDB catalog.
- Investigate the multithreaded endpoint support.
- Write support? How to do updates of rows against tables?
In the Arrow Flight protocol specification, the schema of Flight is sent uncompressed when flights are listed. When a server has many flights, this can cause the response size of listing flights to be quite large. A change was made to the Arrow Flight protocol to compress these schemas with ZStandard to reduce the amount of bytes transmitted.
This extension applied a change to the C interface of Apache Arrow that enforces the 8-byte alignment of the column data. DuckDB requires that this alignment is present, Apache Arrow and Arrow Flight do not require this.
If other extensions use Apache Arrow please ensure that the patch align-record-batch.patch
is applied from vcpkg-overlay/arrow
.
# Clone this repo with submodules.
# duckdb and extension-ci-tools are submodules.
git clone --recursive [email protected]:rustyconover/duckdb-airport-extension.git
# Clone the vcpkg repo
git clone https://github.com/Microsoft/vcpkg.git
# Bootstrap vcpkg
./vcpkg/bootstrap-vcpkg.sh
export VCPKG_TOOLCHAIN_PATH=`pwd`/vcpkg/scripts/buildsystems/vcpkg.cmake
# Build the extension
make
# If you have ninja installed, you can use it to speed up the build
# GEN=ninja make
The main binaries that will be built are:
./build/release/duckdb
./build/release/test/unittest
./build/release/extension/airport/airport.duckdb_extension
duckdb
is the binary for the duckdb shell with the extension code automatically loaded.unittest
is the test runner of duckdb. Again, the extension is already linked into the binary.airport.duckdb_extension
is the loadable binary as it would be distributed.
To run the extension code, simply start the shell with ./build/release/duckdb
. This duckdb shell will have the extension pre-loaded.
Now we can use the features from the extension directly in DuckDB.
Different tests can be created for DuckDB extensions. The primary way of testing DuckDB extensions should be the SQL tests in ./test/sql
. These SQL tests can be run using:
make test