From f802bb5b385b1d9a1c861c8604737ad0c08898f8 Mon Sep 17 00:00:00 2001 From: philo Date: Wed, 8 Jun 2022 11:16:17 +0800 Subject: [PATCH 1/5] Initial commit --- docs/Columnar-Expression-Developer-Guide.md | 57 +++++++++++++++++++++ 1 file changed, 57 insertions(+) create mode 100644 docs/Columnar-Expression-Developer-Guide.md diff --git a/docs/Columnar-Expression-Developer-Guide.md b/docs/Columnar-Expression-Developer-Guide.md new file mode 100644 index 000000000..2133d9c5b --- /dev/null +++ b/docs/Columnar-Expression-Developer-Guide.md @@ -0,0 +1,57 @@ +## Columnar Expression Developer Guide + +Currently, the columnar expressions in Gazelle are implemented based on Arrow/gandiva. Developer needs to +implement a columnar expression class in Gazelle scala code and also add code logic to replace Spark expression +with the implemented columnar expression (see ColumnarExpressionConverter.scala). The native code for expression's +core functionality is implemented in Arrow/gandiva. + +We should check whether the desired function is already implemented in gandiva. For functions already implemented, +we need to make few code changes to meet the compatibility with Spark. + +Take `regexp_extract` as example. + +### Arrow/gandiva Native Code + +See [arrow/pull/97](https://github.com/oap-project/arrow/pull/97). + +Since C++ lib google/RE2 is leveraged, we implemented the core function in a gandiva function holder. + +In extract_holder.h, we need to add a constructor and declare the required function and member variable. +``` +static Status Make(const FunctionNode &node, std::shared_ptr *holder); +``` +This function is used to construct ExtractHolder and check the legality of input. It will be called by +`function_holder_registry.h` to register function holder. + +In `gdv_function_stubs.cc`,we need to implement a function called `gdv_fn_regexp_extract_utf8_utf8_int32`. +The overloaded `operator()` in ExtractHolder is the core function to do the extract work. 
It is called by +`gdv_fn_regexp_extract_utf8_utf8_int32`. + +We need also register `regexp_extract` in `function_registry_string.cc` (for functions handling string). +This exposed function name will be used to create function tree in our Gazelle scala code. In this case, +function holder is required, so we should specify `NativeFunction::kNeedsFunctionHolder` in the registry. + +For unit test, please refer to `extract_holder_test.cc`. + +* `cd arrow/cpp/release-build` (create by yourself if not exists) +* `cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_CSV=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON +-DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_WITH_SNAPPY=ON +-DARROW_FILESYSTEM=ON -DARROW_JSON=ON -DARROW_WITH_PROTOBUF=ON -DARROW_DATASET=ON -DARROW_IPC=ON +-DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF -DARROW_BUILD_TESTS=ON ..` +* `make -j` +* `./release/gandiva-internals-test` + +### Gazelle Scala Code + +See [gazelle_plugin/pull/847](https://github.com/oap-project/gazelle_plugin/pull/847). + +In scala code, we created class `ColumnarRegExpExtract` which extends Spark's `RegExpExtract`. + +We can add `buildCheck` to check the input types. For legal types currently not supported by Gazelle, we can throw +an `UnsupportedOperationException` to let it fallback. If whole stage code is not supported, we should override +`supportColumnarCodegen` to let it return false. + +In `doColumnarCodeGen`, arrow function node is constructed. The function name `regexp_extract` is specified. + +In `replaceWithColumnarExpression` of `ColumnarExpressionConverter.scala`,we need to convert Spark's expression +to the implemented columnar expression. 
\ No newline at end of file From ef5a79f71a2fce83261de9d0f6449cdacd393c06 Mon Sep 17 00:00:00 2001 From: philo Date: Wed, 8 Jun 2022 11:32:42 +0800 Subject: [PATCH 2/5] Do some changes --- docs/Columnar-Expression-Developer-Guide.md | 34 ++++++++++----------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/docs/Columnar-Expression-Developer-Guide.md b/docs/Columnar-Expression-Developer-Guide.md index 2133d9c5b..cb3db232e 100644 --- a/docs/Columnar-Expression-Developer-Guide.md +++ b/docs/Columnar-Expression-Developer-Guide.md @@ -1,12 +1,12 @@ ## Columnar Expression Developer Guide Currently, the columnar expressions in Gazelle are implemented based on Arrow/gandiva. Developer needs to -implement a columnar expression class in Gazelle scala code and also add code logic to replace Spark expression -with the implemented columnar expression (see ColumnarExpressionConverter.scala). The native code for expression's -core functionality is implemented in Arrow/gandiva. +implement a columnar expression class in Gazelle scala code and also add some logic to replace Spark expression +with the implemented columnar expression. And the native code for expression's core functionality is implemented +in Arrow/gandiva. We should check whether the desired function is already implemented in gandiva. For functions already implemented, -we need to make few code changes to meet the compatibility with Spark. +we may still need to make a few code changes to meet the compatibility with Spark. Take `regexp_extract` as example. @@ -16,22 +16,22 @@ See [arrow/pull/97](https://github.com/oap-project/arrow/pull/97). Since C++ lib google/RE2 is leveraged, we implemented the core function in a gandiva function holder. -In extract_holder.h, we need to add a constructor and declare the required function and member variable. +In `extract_holder.h`, we need declare the below functions other than some other necessary functions. 
``` static Status Make(const FunctionNode &node, std::shared_ptr *holder); ``` -This function is used to construct ExtractHolder and check the legality of input. It will be called by -`function_holder_registry.h` to register function holder. +This function is used to construct `ExtractHolder` and check the legality of input if needed. It will +be called by `function_holder_registry.h` to register function holder. -In `gdv_function_stubs.cc`,we need to implement a function called `gdv_fn_regexp_extract_utf8_utf8_int32`. -The overloaded `operator()` in ExtractHolder is the core function to do the extract work. It is called by -`gdv_fn_regexp_extract_utf8_utf8_int32`. +In `gdv_function_stubs.cc`,we need to implement a function called `gdv_fn_regexp_extract_utf8_utf8_int32`, +which will invoke overloaded `operator()` in `ExtractHolder`. The `operator()` is the core function to do +the extract work. We need also register `regexp_extract` in `function_registry_string.cc` (for functions handling string). -This exposed function name will be used to create function tree in our Gazelle scala code. In this case, +This exposed function name will be used to create function tree in Gazelle scala code. In this case, function holder is required, so we should specify `NativeFunction::kNeedsFunctionHolder` in the registry. -For unit test, please refer to `extract_holder_test.cc`. +For unit test, please refer to `extract_holder_test.cc`. Here is the compile steps. * `cd arrow/cpp/release-build` (create by yourself if not exists) * `cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_CSV=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON @@ -45,13 +45,13 @@ For unit test, please refer to `extract_holder_test.cc`. See [gazelle_plugin/pull/847](https://github.com/oap-project/gazelle_plugin/pull/847). -In scala code, we created class `ColumnarRegExpExtract` which extends Spark's `RegExpExtract`. +In scala code, `ColumnarRegExpExtract` is created to replace Spark's `RegExpExtract`. 
We can add `buildCheck` to check the input types. For legal types currently not supported by Gazelle, we can throw -an `UnsupportedOperationException` to let it fallback. If whole stage code is not supported, we should override +an `UnsupportedOperationException` to let it fallback. If whole stage codegen is not supported, we should override `supportColumnarCodegen` to let it return false. -In `doColumnarCodeGen`, arrow function node is constructed. The function name `regexp_extract` is specified. +In `doColumnarCodeGen`, arrow function node is constructed with gandiva function name `regexp_extract` specified. -In `replaceWithColumnarExpression` of `ColumnarExpressionConverter.scala`,we need to convert Spark's expression -to the implemented columnar expression. \ No newline at end of file +At last, in `replaceWithColumnarExpression` of `ColumnarExpressionConverter.scala`,we need to replace Spark's +expression to the implemented columnar expression. \ No newline at end of file From ab9d0b304421d9a22881b309a09d5fb57d2b537c Mon Sep 17 00:00:00 2001 From: philo Date: Wed, 8 Jun 2022 11:47:56 +0800 Subject: [PATCH 3/5] Refine the doc --- docs/Columnar-Expression-Developer-Guide.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/Columnar-Expression-Developer-Guide.md b/docs/Columnar-Expression-Developer-Guide.md index cb3db232e..ec728669c 100644 --- a/docs/Columnar-Expression-Developer-Guide.md +++ b/docs/Columnar-Expression-Developer-Guide.md @@ -2,11 +2,11 @@ Currently, the columnar expressions in Gazelle are implemented based on Arrow/gandiva. Developer needs to implement a columnar expression class in Gazelle scala code and also add some logic to replace Spark expression -with the implemented columnar expression. And the native code for expression's core functionality is implemented -in Arrow/gandiva. +with the implemented columnar expression. 
And the native code is implemented in Arrow/gandiva for expression's +core functionality. -We should check whether the desired function is already implemented in gandiva. For functions already implemented, -we may still need to make a few code changes to meet the compatibility with Spark. +Before native code development, we should check whether the desired function is already implemented +in gandiva. If so, we can directly use it or just make a few code changes to meet the compatibility with Spark. Take `regexp_extract` as example. @@ -21,13 +21,13 @@ In `extract_holder.h`, we need declare the below functions other than some other static Status Make(const FunctionNode &node, std::shared_ptr *holder); ``` This function is used to construct `ExtractHolder` and check the legality of input if needed. It will -be called by `function_holder_registry.h` to register function holder. +be called by `function_holder_registry.h` to register the function holder. In `gdv_function_stubs.cc`,we need to implement a function called `gdv_fn_regexp_extract_utf8_utf8_int32`, which will invoke overloaded `operator()` in `ExtractHolder`. The `operator()` is the core function to do the extract work. -We need also register `regexp_extract` in `function_registry_string.cc` (for functions handling string). +We need also register `regexp_extract` in `function_registry_string.cc` (for functions handling strings). This exposed function name will be used to create function tree in Gazelle scala code. In this case, function holder is required, so we should specify `NativeFunction::kNeedsFunctionHolder` in the registry. @@ -47,9 +47,9 @@ See [gazelle_plugin/pull/847](https://github.com/oap-project/gazelle_plugin/pull In scala code, `ColumnarRegExpExtract` is created to replace Spark's `RegExpExtract`. -We can add `buildCheck` to check the input types. For legal types currently not supported by Gazelle, we can throw -an `UnsupportedOperationException` to let it fallback. 
If whole stage codegen is not supported, we should override -`supportColumnarCodegen` to let it return false. +We can add `buildCheck` to check the input types. For legal types currently not supported in this implementation, +we can throw an `UnsupportedOperationException` to let the expression fallback. If whole stage codegen is not +supported, we should override `supportColumnarCodegen` to let it return false. In `doColumnarCodeGen`, arrow function node is constructed with gandiva function name `regexp_extract` specified. From eb3e8942e46e34f3192557b31e8b4935b6d241cd Mon Sep 17 00:00:00 2001 From: Yuan Zhou Date: Wed, 8 Jun 2022 17:16:42 +0800 Subject: [PATCH 4/5] refine Signed-off-by: Yuan Zhou --- docs/Columnar-Expression-Developer-Guide.md | 38 ++++++++++++++------- 1 file changed, 25 insertions(+), 13 deletions(-) diff --git a/docs/Columnar-Expression-Developer-Guide.md b/docs/Columnar-Expression-Developer-Guide.md index ec728669c..0ce6c1687 100644 --- a/docs/Columnar-Expression-Developer-Guide.md +++ b/docs/Columnar-Expression-Developer-Guide.md @@ -1,4 +1,4 @@ -## Columnar Expression Developer Guide +# Columnar Expression Developer Guide Currently, the columnar expressions in Gazelle are implemented based on Arrow/gandiva. Developer needs to implement a columnar expression class in Gazelle scala code and also add some logic to replace Spark expression @@ -10,8 +10,9 @@ in gandiva. If so, we can directly use it or just make a few code changes to mee Take `regexp_extract` as example. -### Arrow/gandiva Native Code +## Arrow/gandiva Native Code +### functions need to use external C++ libs See [arrow/pull/97](https://github.com/oap-project/arrow/pull/97). Since C++ lib google/RE2 is leveraged, we implemented the core function in a gandiva function holder. @@ -31,27 +32,38 @@ We need also register `regexp_extract` in `function_registry_string.cc` (for fun This exposed function name will be used to create function tree in Gazelle scala code. 
In this case, function holder is required, so we should specify `NativeFunction::kNeedsFunctionHolder` in the registry. -For unit test, please refer to `extract_holder_test.cc`. Here is the compile steps. +### functions that do not need to use external C++ libs +See [arrow/pull/103](https://github.com/oap-project/arrow/pull/103). +The `pmod` function uses standard C libs only so it can be precompiled in Gandiva. The idea is similar +to adding functions using external libs: +- add the function implementation in precompile/xxx.c +- register the function pointer in the function registry +Here's also one [detailed guide](https://www.dremio.com/blog/adding-a-user-define-function-to-gandiva/) from Dremio. + -* `cd arrow/cpp/release-build` (create by yourself if not exists) -* `cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_CSV=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON +For unit tests, please refer to `extract_holder_test.cc`. Here are the compile steps. +``` +cd arrow/cpp/release-build  # create the directory first if it does not exist +cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_CSV=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_WITH_SNAPPY=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON -DARROW_WITH_PROTOBUF=ON -DARROW_DATASET=ON -DARROW_IPC=ON --DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF -DARROW_BUILD_TESTS=ON ..` -* `make -j` -* `./release/gandiva-internals-test` - -### Gazelle Scala Code +-DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF -DARROW_BUILD_TESTS=ON .. +make -j +./release/gandiva-internals-test +``` +## Gazelle Scala Code See [gazelle_plugin/pull/847](https://github.com/oap-project/gazelle_plugin/pull/847). In scala code, `ColumnarRegExpExtract` is created to replace Spark's `RegExpExtract`. We can add `buildCheck` to check the input types. For legal types currently not supported in this implementation, -we can throw an `UnsupportedOperationException` to let the expression fallback.
If whole stage codegen is not -supported, we should override `supportColumnarCodegen` to let it return false. +we can throw an `UnsupportedOperationException` to let the expression fall back. In `doColumnarCodeGen`, arrow function node is constructed with gandiva function name `regexp_extract` specified. +The `supportColumnarCodegen` function indicates whether columnar whole-stage codegen is supported; set it +to `false` if it is not implemented. + At last, in `replaceWithColumnarExpression` of `ColumnarExpressionConverter.scala`,we need to replace Spark's -expression to the implemented columnar expression. \ No newline at end of file +expression to the implemented columnar expression. From 9a5a59aeac539b672ee30a02e48c3d5aa12618b2 Mon Sep 17 00:00:00 2001 From: Yuan Zhou Date: Wed, 8 Jun 2022 17:29:51 +0800 Subject: [PATCH 5/5] refine Signed-off-by: Yuan Zhou --- docs/Columnar-Expression-Developer-Guide.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/Columnar-Expression-Developer-Guide.md b/docs/Columnar-Expression-Developer-Guide.md index 0ce6c1687..95efbfc1c 100644 --- a/docs/Columnar-Expression-Developer-Guide.md +++ b/docs/Columnar-Expression-Developer-Guide.md @@ -1,5 +1,7 @@ # Columnar Expression Developer Guide +This is a short guide on adding new columnar expressions in Gazelle. + Currently, the columnar expressions in Gazelle are implemented based on Arrow/gandiva. Developer needs to implement a columnar expression class in Gazelle scala code and also add some logic to replace Spark expression with the implemented columnar expression. And the native code is implemented in Arrow/gandiva for expression's