Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-37753: [C++][Gandiva] Add external function registry support #37787

Closed
wants to merge 1 commit into from

Conversation

niyue
Copy link
Contributor

@niyue niyue commented Sep 19, 2023

Add external function registry support to Gandiva. JSON based function registry can be used for describing the function metadata, and LLVM bitcode can be automatically loaded as pre-compiled external functions.

Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See #37753 for more details.

What changes are included in this PR?

This PR implements basic support of external function registry. It primarily implements two parts to make it work:

  1. define a JSON based function registry format to describe the function metadata, and support loading and parsing such JSON based function registry files from some directory
  2. automatically load and parse LLVM bitcode from files, so that third party pre-compiled functions can be integrated via LLVM bitcode

The overall flow looks like this:
external-func-registry

Are these changes tested?

Some unit tests are added to verify this enhancement

Are there any user-facing changes?

No change to the existing behavior. But some new ways to interfacing the library are added in this PR.

Notes

  • Performance
    • since the function registry is loaded once and stored as a static map internally, there shouldn't too much performance impact typically by externally loading the JSON function registry files.
    • If external function bitcodes are added, loading them from disk may take some additional time (but it won't affect anything if no external function bitcode are added). I am not sure if we should cache such parsed bitcode in memory for better performance, please shed some light on this. Thanks so much.
  • Integration with other programming languages via LLVM IR/bitcode
    • So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
  • Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. I am currently thinking using approach like loading shared libraries during runtime, and find the corresponding function pointer symbols from the shared library, but I don't do any experiment yet so I am not sure if it works this way. Any comment on how this should be done is appreciated.
  • Certain complex data types, such as map, struct, are not allowed to be used in registry yet, as far as I know, Gandiva doesn't support using these types as input/output for function either, but the registry format could be extended to support them in the future if needed
  • JSON schema for the external registry format
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "functions": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string"
          },
          "aliases": {
            "type": "array",
            "items": {
              "type": "string"
            }
          },
          "param_types": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "type": {
                  "type": "string",
                  "enum": ["null", "boolean", "uint8", "int8", "uint16", "int16", "uint32", "int32", "uint64", "int64", "float16", "float32", "float64", "utf8", "large_utf8", "binary", "large_binary", "date32", "date64", "day_time_interval", "month_interval", "timestamp", "decimal", "decimal128", "decimal256", "list"]
                },
                "unit": {
                  "type": "string",
                  "enum": ["second", "milli", "micro", "nano"]
                },
                "precision": {
                  "type": "integer"
                },
                "scale": {
                  "type": "integer"
                },
                "value_type": {
                  "$ref": "#/properties/functions/items/properties/param_types/items"
                }
              },
              "required": ["type"]
            }
          },
          "return_type": {
            "$ref": "#/properties/functions/items/properties/param_types/items"
          },
          "result_nullable": {
            "type": "string",
            "enum": ["ifnull", "never", "internal"]
          },
          "pc_name": {
            "type": "string"
          },
          "needs_context": {
            "type": "boolean"
          },
          "needs_function_holder": {
            "type": "boolean"
          },
          "can_return_errors": {
            "type": "boolean"
          }
        },
        "required": ["name"]
      }
    }
  },
  "required": ["functions"]
}

@github-actions
Copy link

⚠️ GitHub issue #37753 has been automatically assigned in GitHub to PR creator.

namespace gandiva {
namespace rj = rapidjson;

class JsonRegistryParser {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A JSON registry parser is implemented to parse the function metadata from the JSON content.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Sep 19, 2023
TEST_F(TestFunctionRegistry, LookupMultipleFuncs) {
ExtensionDirSetter ext_dir_setter("multiple_functions_registry", []() {
FunctionRegistry::pc_registry_.clear();
FunctionRegistry::pc_registry_map_ = FunctionRegistry::InitPCMap();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bit tricky to unit test the function registry, I manually set the environment variable for the extension dir and reload the registry to get it tested

@@ -45,6 +47,21 @@ std::vector<NativeFunction> FunctionRegistry::pc_registry_;

SignatureMap FunctionRegistry::pc_registry_map_ = InitPCMap();

std::vector<NativeFunction> LoadExternalFunctionRegistry() {
std::string ext_dir;
const char* ext_dir_env = std::getenv("GANDIVA_EXTENSION_DIR");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can specify an environment variable called GANDIVA_EXTENSION_DIR to ask the library to load the external registry file and corresponding bitcode files from this directory. This directory could contain multiple json registry files and bitcode files, and all of them will be loaded automatically once this env var is set.

@niyue niyue force-pushed the feature/external-func-registry branch 8 times, most recently from 2e00633 to c9aa09c Compare September 21, 2023 02:31
@kou
Copy link
Member

kou commented Sep 21, 2023

(Please mention me when this is ready for review.)

@niyue niyue force-pushed the feature/external-func-registry branch 3 times, most recently from 14cb827 to 068e7a3 Compare September 21, 2023 09:23
@niyue
Copy link
Contributor Author

niyue commented Sep 21, 2023

@kou this PR is ready for review now. Thanks for the help.
There are still 4 CI checks failing. But they seemed to be a failure about test_python_version, which is probably irrelevant with this PR.

@kou
Copy link
Member

kou commented Sep 22, 2023

Could you rebase on the Dev / Source Release and Merge Script failures?
Could you check the failure on AppVeyor?

https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/48091110#L2859

[ RUN      ] TestFunctionRegistry.LookupMultipleFuncs
C:/projects/arrow/cpp/src/gandiva/function_registry_test.cc(105): error: Expected: (say_hello_func) != (nullptr), actual: NULL vs (nullptr)
unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
Stack trace:
[  FAILED  ] TestFunctionRegistry.LookupMultipleFuncs (9 ms)
[ RUN      ] TestFunctionRegistry.LookupExternalFuncs
C:/projects/arrow/cpp/src/gandiva/function_registry_test.cc(125): error: Expected: (func) != (nullptr), actual: NULL vs (nullptr)
unknown file: error: SEH exception with code 0xc0000005 thrown in the test body.
Stack trace:
[  FAILED  ] TestFunctionRegistry.LookupExternalFuncs (12 ms)
[----------] 5 tests from TestFunctionRegistry (41 ms total)
[----------] 6 tests from TestExternalFunctionRegistry

@niyue niyue force-pushed the feature/external-func-registry branch from 068e7a3 to 96bc782 Compare September 24, 2023 15:28
@niyue
Copy link
Contributor Author

niyue commented Sep 24, 2023

@kou I've rebased to the latest commit (772a01c), but the CI ran into another issue (https://github.com/apache/arrow/actions/runs/6290902281/job/17078761957?pr=37787), which seems to be related with LLVM 17 (#37834)? Do I have to fix that or is there any specific commit I should rebase onto?

Could you check the failure on AppVeyor?
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/48091110#L2859

Do you know what the AppVeyor check in CI is? Because I cannot reproduce this issue locally (A M1 MacBook), and several CI checks passed as well. I would like to know what the AppVeyor check in CI is and how it is different compared to other C++ checks in CI to see if I can figure out why they failed in that environment. Thanks

@kou
Copy link
Member

kou commented Sep 25, 2023

@kou I've rebased to the latest commit (772a01c), but the CI ran into another issue (https://github.com/apache/arrow/actions/runs/6290902281/job/17078761957?pr=37787), which seems to be related with LLVM 17 (#37834)? Do I have to fix that or is there any specific commit I should rebase onto?

Oh. We can ignore the failures. We can work on it in a separated PR.

Could you check the failure on AppVeyor?
https://ci.appveyor.com/project/ApacheSoftwareFoundation/arrow/builds/48091110#L2859

Do you know what the AppVeyor check in CI is? Because I cannot reproduce this issue locally (A M1 MacBook), and several CI checks passed as well. I would like to know what the AppVeyor check in CI is and how it is different compared to other C++ checks in CI to see if I can figure out why they failed in that environment. Thanks

AppVeyor uses Windows and Visual C++. The failure may be related to Windows and/or Visual C++.

@kou kou changed the title GH-37753: [C++][Gandiva] Add external function registry support to gandiva GH-37753: [C++][Gandiva] Add external function registry support Sep 25, 2023
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you rebase on main after #37850 is merged and format *.cmake files by archery lint --cmake-format --fix?
See also: https://arrow.apache.org/docs/dev/developers/cpp/development.html#code-style-linting-and-ci

Could you add the JSON schema too?

Could you add document for this feature? Or we can defer it to a follow-up PR by creating a new issue for it.

Could you send an e-mail to [email protected] https://lists.apache.org/[email protected] https://arrow.apache.org/community/#mailing-lists with [DISCUSS][Gandiva] ... subject to get more attention for this specification?

@@ -137,6 +137,10 @@ Status Engine::LoadFunctionIRs() {
if (!functions_loaded_) {
ARROW_RETURN_NOT_OK(LoadPreCompiledIR());
ARROW_RETURN_NOT_OK(DecimalIR::AddFunctions(this));
const char* ext_dir_env = std::getenv("GANDIVA_EXTENSION_DIR");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using ::arrow::internal::GetEnvVarNative() instead?

Can we use ${prefix}/lib/gandiva/extension/ or something as the default extension directory?

cpp/src/gandiva/engine.cc Show resolved Hide resolved
cpp/src/gandiva/engine.cc Outdated Show resolved Hide resolved
cpp/src/gandiva/engine.cc Outdated Show resolved Hide resolved
cpp/src/gandiva/engine.cc Outdated Show resolved Hide resolved
cpp/src/gandiva/function_registry_external.h Outdated Show resolved Hide resolved
cpp/src/gandiva/function_registry_external_test.cc Outdated Show resolved Hide resolved
cpp/src/gandiva/function_registry_external_test.cc Outdated Show resolved Hide resolved
cpp/src/gandiva/precompiled/CMakeLists.txt Outdated Show resolved Hide resolved
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Sep 25, 2023
@niyue niyue force-pushed the feature/external-func-registry branch from 96bc782 to a61c4da Compare September 25, 2023 13:41
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Sep 25, 2023
@niyue niyue force-pushed the feature/external-func-registry branch from a61c4da to 4d59814 Compare September 25, 2023 14:29
…n registry can be used for describing the function metadata, and LLVM bitcode can be automatically loaded as pre-compiled external functions.
@niyue niyue force-pushed the feature/external-func-registry branch from 4d59814 to bd6ce1f Compare September 25, 2023 15:15
@niyue
Copy link
Contributor Author

niyue commented Sep 25, 2023

AppVeyor uses Windows and Visual C++. The failure may be related to Windows and/or Visual C++.

Previously, I made the PR to pass several Windows related checks, such as "C++ / AMD64 Windows MinGW CLANG64 C++ " and "C++ / AMD64 Windows MinGW MINGW64 C++ " (but now they are failing again due to the LLVM 17 issue). Do you know what are the difference between them and the AppVeyor CI check? Considering both of them are Windows environments, do they differ in compiler primarily (clang vs. Visual C++ compiler)?

@niyue
Copy link
Contributor Author

niyue commented Sep 25, 2023

Could you rebase on main after #37850 is merged and format *.cmake files by archery lint --cmake-format --fix? See also: https://arrow.apache.org/docs/dev/developers/cpp/development.html#code-style-linting-and-ci

Sure. I will wait for the #37850 to be merged.

Could you add the JSON schema too?
Could you add document for this feature? Or we can defer it to a follow-up PR by creating a new issue for it.

Do you have any recommended folder(s) to add:

  1. JSON schema
  2. the corresponding doc

There are some existing folders, but none of them seem to be a good fit for either the JSON schema or doc (I don't find any doc related directory under the src/gandiva directory)

Could you send an e-mail to [email protected] https://lists.apache.org/[email protected] https://arrow.apache.org/community/#mailing-lists with [DISCUSS][Gandiva] ... subject to get more attention for this specification?

Sure. I sent an email to the dev mailing list a while ago, and I will collect some feedback and see how I should proceed.

@kou
Copy link
Member

kou commented Sep 25, 2023

Do you know what are the difference between them and the AppVeyor CI check?

They use GCC/Clang (built on MSYS2 that is an UNIX emulation layer on Windows).
AppVeyor use Visual C++.

Considering both of them are Windows environments, do they differ in compiler primarily (clang vs. Visual C++ compiler)?

Yes.

Do you have any recommended folder(s) to add:

  1. JSON schema

How about https://github.com/apache/arrow/tree/main/format (format/gandiva/)?

  1. the corresponding doc

How about https://github.com/apache/arrow/tree/main/docs/source/format (docs/source/format/Gandiva.rst)?

Sure. I sent an email to the dev mailing list a while ago, and I will collect some feedback and see how I should proceed.

Thanks!

The thread: https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod

@niyue
Copy link
Contributor Author

niyue commented Oct 7, 2023

After discussing this proposal in mailing list, I will close this PR in favor of #38116 Thank you all for the help!

@niyue niyue closed this Oct 7, 2023
kou added a commit that referenced this pull request Nov 8, 2023
# Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See #37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
  /// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers will register a set of function metadatas (`funcs`) for all functions in a LLVM bitcode file, by giving either the path to the LLVM bitcode file or an `arrow::Buffer` containing the LLVM bitcode buffer.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

Some new ways to interfacing the library are added in this PR:
* The `Configuration` class now supports accepting a customized function registry, which developers can register their own external functions and uses it as the function registry
* The `FunctionRegistry` class has two new APIs mentioned above
* The `FunctionRegistry` class, after instantiation, now it doesn't have any built-in function registered in it. And we switch to use a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * Some library depending on Gandiva C++ library, such as Gandiva's Ruby binding's `Gandiva::FunctionRegistry` class behavior is changed accordingly

# Notes
* Performance
    * the code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if the externally function is not used in the given expression at all. But this is not a new issue, and it applies to built-in functions as well (the more built-in functions are there, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes non trivial of time, which happens to every IR loaded, and the `RemoveUnusedFunctions` happens after that, which doesn't help to reduce the time of `linkModule`. We may have to selectively load only necessary IR (primarily selectively doing `linkModule` for these IR), but more metadata may be needed to tell which functions can be found in which IR. This could be a separated PR for improving it, please advice if any one has any idea on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
    * Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. Another issue #38589 was logged for this.
* The discussion thread in dev mail list, https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I submitted another PR previously (#37787) which introduced JSON based function registry, and after discussion, I will close that PR and use this PR instead
* Closes: #37753

Lead-authored-by: Yue Ni <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
JerAguilon pushed a commit to JerAguilon/arrow that referenced this pull request Nov 9, 2023
…apache#38116)

# Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See apache#37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
  /// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers will register a set of function metadatas (`funcs`) for all functions in a LLVM bitcode file, by giving either the path to the LLVM bitcode file or an `arrow::Buffer` containing the LLVM bitcode buffer.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

Some new ways to interfacing the library are added in this PR:
* The `Configuration` class now supports accepting a customized function registry, which developers can register their own external functions and uses it as the function registry
* The `FunctionRegistry` class has two new APIs mentioned above
* The `FunctionRegistry` class, after instantiation, now it doesn't have any built-in function registered in it. And we switch to use a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * Some library depending on Gandiva C++ library, such as Gandiva's Ruby binding's `Gandiva::FunctionRegistry` class behavior is changed accordingly

# Notes
* Performance
    * the code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if the externally function is not used in the given expression at all. But this is not a new issue, and it applies to built-in functions as well (the more built-in functions are there, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes non trivial of time, which happens to every IR loaded, and the `RemoveUnusedFunctions` happens after that, which doesn't help to reduce the time of `linkModule`. We may have to selectively load only necessary IR (primarily selectively doing `linkModule` for these IR), but more metadata may be needed to tell which functions can be found in which IR. This could be a separated PR for improving it, please advice if any one has any idea on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
    * Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. Another issue apache#38589 was logged for this.
* The discussion thread in dev mail list, https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I submitted another PR previously (apache#37787) which introduced JSON based function registry, and after discussion, I will close that PR and use this PR instead
* Closes: apache#37753

Lead-authored-by: Yue Ni <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
loicalleyne pushed a commit to loicalleyne/arrow that referenced this pull request Nov 13, 2023
…apache#38116)

# Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See apache#37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
  /// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers will register a set of function metadatas (`funcs`) for all functions in a LLVM bitcode file, by giving either the path to the LLVM bitcode file or an `arrow::Buffer` containing the LLVM bitcode buffer.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

Some new ways to interfacing the library are added in this PR:
* The `Configuration` class now supports accepting a customized function registry, which developers can register their own external functions and uses it as the function registry
* The `FunctionRegistry` class has two new APIs mentioned above
* The `FunctionRegistry` class, after instantiation, now it doesn't have any built-in function registered in it. And we switch to use a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * Some library depending on Gandiva C++ library, such as Gandiva's Ruby binding's `Gandiva::FunctionRegistry` class behavior is changed accordingly

# Notes
* Performance
    * the code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if the externally function is not used in the given expression at all. But this is not a new issue, and it applies to built-in functions as well (the more built-in functions are there, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes non trivial of time, which happens to every IR loaded, and the `RemoveUnusedFunctions` happens after that, which doesn't help to reduce the time of `linkModule`. We may have to selectively load only necessary IR (primarily selectively doing `linkModule` for these IR), but more metadata may be needed to tell which functions can be found in which IR. This could be a separated PR for improving it, please advice if any one has any idea on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
    * Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. Another issue apache#38589 was logged for this.
* The discussion thread in dev mail list, https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I submitted another PR previously (apache#37787) which introduced JSON based function registry, and after discussion, I will close that PR and use this PR instead
* Closes: apache#37753

Lead-authored-by: Yue Ni <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
dgreiss pushed a commit to dgreiss/arrow that referenced this pull request Feb 19, 2024
…apache#38116)

# Rationale for this change

This PR tries to enhance Gandiva by supporting external function registry, so that developers can author third party functions without modifying Gandiva's core codebase. See apache#37753 for more details. In this PR, the external function needs to be compiled into LLVM IR for integration.

# What changes are included in this PR?
Two new APIs are added to `FunctionRegistry`:
```C++
/// \brief register a set of functions into the function registry from a given bitcode
  /// file
arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         const std::string& bitcode_path);

  /// \brief register a set of functions into the function registry from a given bitcode
  /// buffer
  arrow::Status Register(const std::vector<NativeFunction>& funcs,
                         std::shared_ptr<arrow::Buffer> bitcode_buffer);
```
Developers can use these two APIs to register external functions. Typically, developers will register a set of function metadatas (`funcs`) for all functions in a LLVM bitcode file, by giving either the path to the LLVM bitcode file or an `arrow::Buffer` containing the LLVM bitcode buffer.

The overall flow looks like this:
![image](https://github.com/apache/arrow/assets/27754/b2b346fe-931f-4253-b198-4c388c57a56b)

# Are these changes tested?

Some unit tests are added to verify this enhancement

# Are there any user-facing changes?

Some new ways to interfacing the library are added in this PR:
* The `Configuration` class now supports accepting a customized function registry, which developers can register their own external functions and uses it as the function registry
* The `FunctionRegistry` class has two new APIs mentioned above
* The `FunctionRegistry` class, after instantiation, now it doesn't have any built-in function registered in it. And we switch to use a new function `GANDIVA_EXPORT std::shared_ptr<FunctionRegistry> default_function_registry();` to retrieve the default function registry, which contains all the Gandiva built-in functions.
    * Some library depending on Gandiva C++ library, such as Gandiva's Ruby binding's `Gandiva::FunctionRegistry` class behavior is changed accordingly

# Notes
* Performance
    * the code generation time grows with the number of externally added function bitcodes (the more functions are added, the slower the codegen will be), even if the externally function is not used in the given expression at all. But this is not a new issue, and it applies to built-in functions as well (the more built-in functions are there, the slower the codegen will be). In my limited testing, this is because `llvm::Linker::linkModule` takes non trivial of time, which happens to every IR loaded, and the `RemoveUnusedFunctions` happens after that, which doesn't help to reduce the time of `linkModule`. We may have to selectively load only necessary IR (primarily selectively doing `linkModule` for these IR), but more metadata may be needed to tell which functions can be found in which IR. This could be a separated PR for improving it, please advice if any one has any idea on improving it. Thanks.
* Integration with other programming languages via LLVM IR/bitcode
    * So far I only added an external C++ function in the codebase for unit testing purpose. Rust based function is possible but I gave it a try and found another issue (Rust has std lib which needs to be processed in different approach), I will do some exploration for other languages such as zig later.
    * Non pre-compiled functions, may require some different approach to get the function pointer, and we may discuss and work on it in a separated PR later. Another issue apache#38589 was logged for this.
* The discussion thread in dev mail list, https://lists.apache.org/thread/lm4sbw61w9cl7fsmo7tz3gvkq0ox6rod
     * I submitted another PR previously (apache#37787) which introduced JSON based function registry, and after discussion, I will close that PR and use this PR instead
* Closes: apache#37753

Lead-authored-by: Yue Ni <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[C++][Gandiva] Support external function registry
2 participants