Skip to content

Commit

Permalink
apacheGH-38594: [Docs][C++][Gandiva] Document how to register Gandiva…
Browse files Browse the repository at this point in the history
… external functions (apache#38763)

### Rationale for this change
Provide a basic documentation on how to register and develop Gandiva external functions, including IR functions and C functions.

### What changes are included in this PR?
A markdown doc providing an overview and detailed steps for integrating external functions into Gandiva. 

### Are these changes tested?
It is a doc change

### Are there any user-facing changes?
No
* Closes: apache#38594

Lead-authored-by: Yue Ni <[email protected]>
Co-authored-by: Yue <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
  • Loading branch information
niyue authored and dgreiss committed Feb 17, 2024
1 parent 37ceb33 commit d292e50
Show file tree
Hide file tree
Showing 5 changed files with 482 additions and 116 deletions.
140 changes: 24 additions & 116 deletions docs/source/cpp/gandiva.rst
Original file line number Diff line number Diff line change
Expand Up @@ -40,119 +40,27 @@ pre-compiled into LLVM IR (intermediate representation).
.. _LLVM: https://llvm.org/


Building Expressions
====================

Gandiva provides a general expression representation where expressions are
represented by a tree of nodes. The expression trees are built using
:class:`TreeExprBuilder`. The leaves of the expression tree are typically
field references, created by :func:`TreeExprBuilder::MakeField`, and
literal values, created by :func:`TreeExprBuilder::MakeLiteral`. Nodes
can be combined into more complex expression trees using:

* :func:`TreeExprBuilder::MakeFunction` to create a function
node. (You can call :func:`GetRegisteredFunctionSignatures` to
get a list of valid function signatures.)
* :func:`TreeExprBuilder::MakeIf` to create if-else logic.
* :func:`TreeExprBuilder::MakeAnd` and :func:`TreeExprBuilder::MakeOr`
to create boolean expressions. (For "not", use the ``not(bool)`` function in ``MakeFunction``.)
* :func:`TreeExprBuilder::MakeInExpressionInt32` and the other "in expression"
functions to create set membership tests.

Each of these functions create new composite nodes, which contain the leaf nodes
(literals and field references) or other composite nodes as children. By
composing these, you can create arbitrarily complex expression trees.

Once an expression tree is built, they are wrapped in either :class:`Expression`
or :class:`Condition`, depending on how they will be used.
``Expression`` is used in projections while ``Condition`` is used in filters.

As an example, here is how to create an Expression representing ``x + 3`` and a
Condition representing ``x < 3``:

.. literalinclude:: ../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Create expressions)
:end-before: (Doc section: Create expressions)
:dedent: 2


Projectors and Filters
======================

Gandiva's two execution kernels are :class:`Projector` and
:class:`Filter`. ``Projector`` consumes a record batch and projects
into a new record batch. ``Filter`` consumes a record batch and produces a
:class:`SelectionVector` containing the indices that matched the condition.

For both ``Projector`` and ``Filter``, optimization of the expression IR happens
when creating instances. They are compiled against a static schema, so the
schema of the record batches must be known at this point.

Continuing with the ``expression`` and ``condition`` created in the previous
section, here is an example of creating a Projector and a Filter:

.. literalinclude:: ../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Create projector and filter)
:end-before: (Doc section: Create projector and filter)
:dedent: 2

Once a Projector or Filter is created, it can be evaluated on Arrow record batches.
These execution kernels are single-threaded on their own, but are designed to be
reused to process distinct record batches in parallel.

Evaluating projections
----------------------

Execution is performed with :func:`Projector::Evaluate`. This outputs
a vector of arrays, which can be passed along with the output schema to
:func:`arrow::RecordBatch::Make()`.

.. literalinclude:: ../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Evaluate projection)
:end-before: (Doc section: Evaluate projection)
:dedent: 2

Evaluating filters
------------------

:func:`Filter::Evaluate` produces :class:`SelectionVector`,
a vector of row indices that matched the filter condition. The selection vector
is a wrapper around an arrow integer array, parameterized by bitwidth. When
creating the selection vector (you must initialize it *before* passing to
``Evaluate()``), you must choose the bitwidth, which determines the max index
value it can hold, and the max number of slots, which determines how many indices
it may contain. In general, the max number of slots should be set to your batch
size and the bitwidth the smallest integer size that can represent all integers
less than the batch size. For example, if your batch size is 100k, set the
maximum number of slots to 100k and the bitwidth to 32 (since 2^16 = 64k which
would be too small).

Once ``Evaluate()`` has been run and the :class:`SelectionVector` is
populated, use the :func:`SelectionVector::ToArray()` method to get
the underlying array and then :func:`::arrow::compute::Take()` to materialize the
output record batch.

.. literalinclude:: ../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Evaluate filter)
:end-before: (Doc section: Evaluate filter)
:dedent: 2

Evaluating projections and filters
----------------------------------

Finally, you can also project while apply a selection vector, with
:func:`Projector::Evaluate()`. To do so, first make sure to initialize the
:class:`Projector` with :func:`SelectionVector::GetMode()` so that the projector
compiles with the correct bitwidth. Then you can pass the
:class:`SelectionVector` into the :func:`Projector::Evaluate()` method.


.. literalinclude:: ../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Evaluate filter and projection)
:end-before: (Doc section: Evaluate filter and projection)
:dedent: 2
Expression, Projector and Filter
================================
To effectively utilize Gandiva, you will construct expression trees with ``TreeExprBuilder``,
including the creation of function nodes, if-else logic, and boolean expressions.
Subsequently, leverage ``Projector`` or ``Filter`` execution kernels to efficiently evaluate these expressions.
See :doc:`./gandiva/expr_projector_filter` for more details.


External Functions Development
==============================
Gandiva offers the capability of integrating external functions, encompassing
both C functions and IR functions. This feature broadens the spectrum of
functions that can be applied within Gandiva expressions. For developers
looking to customize and enhance their computational solutions,
Gandiva provides the opportunity to develop and register their own external
functions, thus allowing for a more tailored and flexible use of the Gandiva
environment.
See :doc:`./gandiva/external_func` for more details.

.. toctree::
:maxdepth: 2

gandiva/expr_projector_filter
gandiva/external_func
137 changes: 137 additions & 0 deletions docs/source/cpp/gandiva/expr_projector_filter.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements. See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership. The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied. See the License for the
.. specific language governing permissions and limitations
.. under the License.
=========================================
Gandiva Expression, Projector, and Filter
=========================================

Building Expressions
====================

Gandiva provides a general expression representation where expressions are
represented by a tree of nodes. The expression trees are built using
:class:`TreeExprBuilder`. The leaves of the expression tree are typically
field references, created by :func:`TreeExprBuilder::MakeField`, and
literal values, created by :func:`TreeExprBuilder::MakeLiteral`. Nodes
can be combined into more complex expression trees using:

* :func:`TreeExprBuilder::MakeFunction` to create a function
node. (You can call :func:`GetRegisteredFunctionSignatures` to
get a list of valid function signatures.)
* :func:`TreeExprBuilder::MakeIf` to create if-else logic.
* :func:`TreeExprBuilder::MakeAnd` and :func:`TreeExprBuilder::MakeOr`
to create boolean expressions. (For "not", use the ``not(bool)`` function in ``MakeFunction``.)
* :func:`TreeExprBuilder::MakeInExpressionInt32` and the other "in expression"
functions to create set membership tests.

Each of these functions create new composite nodes, which contain the leaf nodes
(literals and field references) or other composite nodes as children. By
composing these, you can create arbitrarily complex expression trees.

Once an expression tree is built, they are wrapped in either :class:`Expression`
or :class:`Condition`, depending on how they will be used.
``Expression`` is used in projections while ``Condition`` is used in filters.

As an example, here is how to create an Expression representing ``x + 3`` and a
Condition representing ``x < 3``:

.. literalinclude:: ../../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Create expressions)
:end-before: (Doc section: Create expressions)
:dedent: 2


Projectors and Filters
======================

Gandiva's two execution kernels are :class:`Projector` and
:class:`Filter`. ``Projector`` consumes a record batch and projects
into a new record batch. ``Filter`` consumes a record batch and produces a
:class:`SelectionVector` containing the indices that matched the condition.

For both ``Projector`` and ``Filter``, optimization of the expression IR happens
when creating instances. They are compiled against a static schema, so the
schema of the record batches must be known at this point.

Continuing with the ``expression`` and ``condition`` created in the previous
section, here is an example of creating a Projector and a Filter:

.. literalinclude:: ../../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Create projector and filter)
:end-before: (Doc section: Create projector and filter)
:dedent: 2

Once a Projector or Filter is created, it can be evaluated on Arrow record batches.
These execution kernels are single-threaded on their own, but are designed to be
reused to process distinct record batches in parallel.

Evaluating projections
----------------------

Execution is performed with :func:`Projector::Evaluate`. This outputs
a vector of arrays, which can be passed along with the output schema to
:func:`arrow::RecordBatch::Make()`.

.. literalinclude:: ../../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Evaluate projection)
:end-before: (Doc section: Evaluate projection)
:dedent: 2

Evaluating filters
------------------

:func:`Filter::Evaluate` produces :class:`SelectionVector`,
a vector of row indices that matched the filter condition. The selection vector
is a wrapper around an arrow integer array, parameterized by bitwidth. When
creating the selection vector (you must initialize it *before* passing to
``Evaluate()``), you must choose the bitwidth, which determines the max index
value it can hold, and the max number of slots, which determines how many indices
it may contain. In general, the max number of slots should be set to your batch
size and the bitwidth the smallest integer size that can represent all integers
less than the batch size. For example, if your batch size is 100k, set the
maximum number of slots to 100k and the bitwidth to 32 (since 2^16 = 64k which
would be too small).

Once ``Evaluate()`` has been run and the :class:`SelectionVector` is
populated, use the :func:`SelectionVector::ToArray()` method to get
the underlying array and then :func:`::arrow::compute::Take()` to materialize the
output record batch.

.. literalinclude:: ../../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Evaluate filter)
:end-before: (Doc section: Evaluate filter)
:dedent: 2

Evaluating projections and filters
----------------------------------

Finally, you can also project while apply a selection vector, with
:func:`Projector::Evaluate()`. To do so, first make sure to initialize the
:class:`Projector` with :func:`SelectionVector::GetMode()` so that the projector
compiles with the correct bitwidth. Then you can pass the
:class:`SelectionVector` into the :func:`Projector::Evaluate()` method.


.. literalinclude:: ../../../../cpp/examples/arrow/gandiva_example.cc
:language: cpp
:start-after: (Doc section: Evaluate filter and projection)
:end-before: (Doc section: Evaluate filter and projection)
:dedent: 2
49 changes: 49 additions & 0 deletions docs/source/cpp/gandiva/external_func.mmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
%% Licensed to the Apache Software Foundation (ASF) under one
%% or more contributor license agreements. See the NOTICE file
%% distributed with this work for additional information
%% regarding copyright ownership. The ASF licenses this file
%% to you under the Apache License, Version 2.0 (the
%% "License"); you may not use this file except in compliance
%% with the License. You may obtain a copy of the License at
%%
%% http://www.apache.org/licenses/LICENSE-2.0
%%
%% Unless required by applicable law or agreed to in writing,
%% software distributed under the License is distributed on an
%% "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
%% KIND, either express or implied. See the License for the
%% specific language governing permissions and limitations
%% under the License.

graph TD
Rust(Rust) --> CFunction(C function)
Cpp(C++) --> CFunction
OtherLangs(Other langs) --> CFunction

C(C) --clang--> LLVMIR(LLVM IR)
Cpp1(C++) --clang--> LLVMIR
OtherLangs1(Other langs) --rustc/etc--> LLVMIR

LLVMIR --LLVM toolchain--> LLVMBitcode(LLVM bitcode)

CFunction --> Application(application)
LLVMBitcode --> Application

Application --Register--> FunctionRegistry

subgraph Gandiva
BuiltInIRFunctions(built-in IR functions) --> LLVMGenerator(LLVMGenerator)
BuiltInCFunctions(built-in C functions) --> LLVMGenerator

FunctionRegistry(FunctionRegistry) --> LLVMGenerator


LLVMGenerator --> LLVMJITEngine(LLVM JIT engine)

LLVMJITEngine --codegen--> MachineCode(machine code)
end

classDef node stroke-width:0px;
class Rust,Cpp,OtherLangs,C,Cpp1,OtherLangs1,LLVMIR,LLVMBitcode,CFunction,Application,BuiltInIRFunctions,BuiltInCFunctions,FunctionRegistry,LLVMGenerator,LLVMJITEngine,MachineCode node;
classDef subGraph fill:#f5f5f5,stroke:#5a5a5a,stroke-width:2px,rx:10,ry:10;
class Gandiva subGraph;
Binary file added docs/source/cpp/gandiva/external_func.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading

0 comments on commit d292e50

Please sign in to comment.