diff --git a/rfcs/2021-12-20-10517-llvm-backend-for-vrl.md b/rfcs/2021-12-20-10517-llvm-backend-for-vrl.md
new file mode 100644
index 00000000000000..5d3e9ae8ec9398
--- /dev/null
+++ b/rfcs/2021-12-20-10517-llvm-backend-for-vrl.md
@@ -0,0 +1,530 @@
+# RFC 10517 - 2021-12-20 - LLVM Backend for VRL
+
+Performance is a key aspect of VRL. We aim to provide a language that is ["extremely fast and efficient"](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/website/cue/reference/remap/features/compilation.cue#L6) and ["ergonomically safe in that it makes it difficult to create slow or buggy VRL programs"](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/website/cue/reference/remap/features/ergonomic_safety.cue#L4). Moving towards this goal, we propose to speed up general program execution by using LLVM to eliminate the runtime overhead that is currently associated with interpreting a VRL program.
+
+## Common Misconceptions
+
+Below we present a list of statements that have commonly come up when discussing the current execution model of VRL. We illustrate where these thoughts come from and why the reality behind them is often counter-intuitive.
+
+### "VRL programs are compiled to and run as native Rust code" [↪](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/website/cue/reference/remap/features/compilation.cue#L4)
+
+Vector's codebase consists entirely of code written in Rust, so one might be inclined to conclude that anything running inside of Vector is running as "native Rust" code. This is true from a computability point of view: VRL processes events in a way that is semantically indistinguishable from transformations that were hand-written in Rust. However, execution time is determined by implementation details, not just by the semantic definition of the computation.
+
+In particular, replacing the hidden indirection of "we implement a mechanism that can perform sufficiently general computation within our program to execute program logic" with "we implement a program that executes program logic" makes a surprisingly large [difference](#the-time-spent-within-the-vrl-interpretervm-itself-is-small-removing-this-overhead-can-hardly-result-in-significant-performance-improvements), even if the aforementioned mechanism is written in a high-performance language.
+
+In its current implementation, "VRL programs are compiled to a representation that is interpreted in native Rust code" would be a more fitting description.
+
+### "VRL programs are extremely fast and efficient, with performance characteristics very close to Rust itself" [↪](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/website/cue/reference/remap/features/compilation.cue#L6)
+
+Many code paths taken during execution of a VRL program have been compiled by Rust, e.g. when inserting or deleting paths of a VRL value, parsing JSON or matching with regular expressions. These paths are highly optimized and don't incur any runtime overhead on top of what the Rust compiler is able to produce.
+
+However, the top-level control flow of a VRL program is orchestrated at runtime and therefore differs from a semantically equivalent transformation implemented in Rust. The main difference is that when a Rust program is compiled, the branches taken between expressions are known statically (aside from conditionals and error handling), whereas the branches between VRL expressions are only discovered at runtime. We elaborate on the outsized effects this has on performance further [below](#the-time-spent-within-the-vrl-interpretervm-itself-is-small-removing-this-overhead-can-hardly-result-in-significant-performance-improvements).
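+
+To make the difference concrete, here is a minimal, illustrative sketch (deliberately simplified, and not actual VRL internals) contrasting interpreter-style dispatch through trait objects, where each step is an indirect call whose target depends on runtime data, with the statically known control flow of a compiled program:
+
+```rust
+// Interpreted: each step is an indirect call through a vtable. The branch
+// target depends on runtime data, so the CPU frequently mispredicts and stalls.
+trait Expression {
+    fn resolve(&self, state: &mut i64);
+}
+
+struct Add(i64);
+impl Expression for Add {
+    fn resolve(&self, state: &mut i64) { *state += self.0; }
+}
+
+struct Mul(i64);
+impl Expression for Mul {
+    fn resolve(&self, state: &mut i64) { *state *= self.0; }
+}
+
+fn run_interpreted(program: &[Box<dyn Expression>], state: &mut i64) {
+    for expression in program {
+        expression.resolve(state); // unpredictable indirect call per expression
+    }
+}
+
+// Compiled: the same logic (for the program `[Add(1), Mul(2)]`) with control
+// flow known at compile time. The optimizer is free to inline, pipeline and
+// even const-fold the whole sequence.
+fn run_compiled(state: &mut i64) {
+    *state += 1;
+    *state *= 2;
+}
+```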
+
+### "VRL has no runtime [...]" [↪](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/website/cue/reference/remap/features/compilation.cue#L7)
+
+There is no way to point the CPU instruction counter at a VRL [`Program`](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/lib/vrl/compiler/src/program.rs#L5-L10) to execute it. Instead, Vector relies on a runtime that interprets a VRL program through a [`resolve`](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/lib/vrl/compiler/src/expression.rs#L51-L56) method implemented for each expression. Even though we never explicitly named this runtime system, it fits the very definition of ["_any behavior not directly attributable to the program itself_"](https://en.wikipedia.org/wiki/Runtime_system#Overview) well.
+
+### The time spent within the VRL interpreter/VM itself is small; removing this overhead can hardly result in significant performance improvements.
+
+When inspecting flamegraphs of VRL program execution, one can see that, as a very rough estimate, no more than 25% of the time is spent in the interpret call itself without progressing the program state. Removing 25% of the work would yield at most a 1 / (1 - 0.25) ≈ 1.33x speedup, so how could one expect any performance increase bigger than 33% from eliminating this overhead?
+
+The answer largely has to do with the [memory bottleneck in the von Neumann architecture](https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_Neumann_bottleneck) and the mechanisms modern CPU architectures employ to mitigate it.
+
+The CPU can improve the execution speed of subsequent instructions through [instruction pipelining](https://en.wikipedia.org/wiki/Instruction_pipelining), as long as those instructions are not fragmented across unpredictable paths. When conditional branches exist but are heavily biased, the CPU can [speculatively execute](https://en.wikipedia.org/wiki/Branch_predictor) instructions and read/write main memory ahead of time to hide the [latency of memory access](https://en.wikipedia.org/wiki/Memory_hierarchy#Examples), which is _orders of magnitude_ higher than accessing CPU registers/caches or executing arithmetic operations.
+
+In the current execution model, the CPU is not able to predict any control flow at the boundaries between VRL (sub)expressions, severely limiting CPU utilization by [stalling](https://en.wikipedia.org/wiki/CPU_cache#CPU_stalls) the CPU.
+
+### Optimizing for single-core performance is not as important when one can resort to parallelism first.
+
+When a problem looks [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel), squeezing out single-core performance might not seem very worthwhile, since adding more threads would seemingly always have an outsized effect.
+
+However, even a small number of synchronization points can have a detrimental impact on the achievable performance. According to [Amdahl's law](https://en.wikipedia.org/wiki/Amdahl%27s_law), even when only 5% of a program cannot be parallelized, the maximum performance increase with an _infinite_ number of threads is capped at 20x.
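+
+For reference, this bound follows directly from Amdahl's formula for the speedup $S$ achievable with a parallelizable fraction $p$ on $n$ threads:
+
+```latex
+S(n) = \frac{1}{(1 - p) + p/n},
+\qquad
+\lim_{n \to \infty} S(n) = \frac{1}{1 - p} = \frac{1}{0.05} = 20
+\quad \text{for } p = 0.95.
+```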
+
+### LLVM is a virtual machine.
+
+Judging from its name, one might assume that LL**VM** stands for "... virtual machine"[^1] and that it's merely a more general and sophisticated VM implementation than our upcoming special-purpose VRL virtual machine.
+
+However, even though LLVM provides a virtual instruction set architecture, it is an intermediate representation that exists only during the compilation process between a high-level language and machine code, **without** any interpretation at runtime.
+
+An essential part of LLVM is its optimization passes that act on LLVM IR, e.g. by inlining functions, merging code branches, promoting memory access to register access, performing constant-folding, batching allocations and more.
+
+Rust uses LLVM to emit machine code, and we intend to employ the exact same technique.
+
+## Context / Cross cutting concerns
+
+There's ongoing work on implementing a VM for VRL: [#10011](https://github.com/vectordotdev/vector/pull/10011). While it reduces the interpretation overhead compared to the current expression traversal, it doesn't eliminate the overhead entirely. More importantly, it doesn't fundamentally improve behavior for speculative execution / branch prediction, since the CPU can't predict the next instruction in the interpreter loop.
+
+## Scope
+
+### In scope
+
+Migrating the execution model of VRL to direct machine code execution without runtime interpretation overhead.
+
+### Out of scope
+
+Any optimization that applies to all execution models for VRL (traversal, VM and LLVM) is not relevant to this consideration, e.g. improving access paths to VRL [`Value`](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/lib/vrl/compiler/src/value.rs#L21-L32)s.
+
+## Pain
+
+Performance investigations of various Vector topologies suggested that the single-core performance of VRL is a bottleneck in many cases.
+
+## Proposal
+
+### User Experience
+
+The semantics of VRL stay **unchanged**. Any case where a VRL program is not strictly faster in the LLVM execution model than under traversal or the VM is considered a definite bug.
+
+This is an unconditional win for user experience.
+
+### Introduction to LLVM
+
+To get familiar with what LLVM IR looks like and how its code generation builder works, I recommend reading through the official tutorial ["Kaleidoscope: Code generation to LLVM IR"](https://llvm.org/docs/tutorial/MyFirstLanguageFrontend/LangImpl03.html).
+
+There exists an adapted version of the [LLVM Kaleidoscope tutorial in Rust](https://github.com/TheDan64/inkwell/blob/master/examples/kaleidoscope/main.rs) for [`inkwell`](https://github.com/TheDan64/inkwell), a crate that exposes a safe wrapper around [LLVM's C API](https://llvm.org/doxygen/group__LLVMC.html).
+
+Another great reference is Mukul Rathi's ["A Complete Guide to LLVM for Programming Language Creators"](https://mukulrathi.com/create-your-own-programming-language/llvm-ir-cpp-api-tutorial/).
+
+Godbolt's [Compiler Explorer](https://godbolt.org/) is a great way to understand how compilers emit LLVM IR. E.g. running
+
+```rust
+#[no_mangle]
+pub extern "C" fn foo(n: i32) -> i32 {
+    n * 42
+}
+```
+
+through the compiler and setting the `rustc` argument to `--emit=llvm-ir -O` emits
+
+```llvm
+define i32 @foo(i32 %n) unnamed_addr #0 !dbg !6 {
+  %0 = mul i32 %n, 42, !dbg !10
+  ret i32 %0, !dbg !11
+}
+```
+
+Running `rustc ./program.rs --crate-type=lib --emit=llvm-ir -O` locally accomplishes the same.
+
+### Implementation
+
+At a high level, the goal is to produce executable machine code for a VRL program by emitting LLVM IR. When Vector launches, the VRL program is parsed, translated to LLVM IR, compiled to machine code via LLVM and dynamically loaded into the running process. The resulting `vrl_execute` function symbol is then resolved from the compiled module and called for each event to be transformed.
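+
+A minimal sketch of what that load-and-resolve step could look like with `inkwell`'s JIT (the function names and pointer types here are illustrative, not the final interface):
+
+```rust
+use inkwell::{context::Context as LlvmContext, OptimizationLevel};
+
+// Illustrative signature for the emitted symbol; the real arguments would be
+// pointers to the VRL `Context` and the `Resolved` result described below.
+type VrlExecuteFn = unsafe extern "C" fn(*mut std::ffi::c_void, *mut std::ffi::c_void);
+
+fn compile_and_run(ctx: *mut std::ffi::c_void, result: *mut std::ffi::c_void) -> Result<(), String> {
+    let llvm = LlvmContext::create();
+    let module = llvm.create_module("vrl");
+    // ... `emit_llvm` (introduced below) populates `module` here ...
+    let engine = module
+        .create_jit_execution_engine(OptimizationLevel::Aggressive)
+        .map_err(|e| e.to_string())?;
+    // Resolving and calling the symbol is unsafe: LLVM cannot verify that the
+    // Rust-side type matches the signature of the emitted function.
+    let vrl_execute = unsafe { engine.get_function::<VrlExecuteFn>("vrl_execute") }
+        .map_err(|e| e.to_string())?;
+    unsafe { vrl_execute.call(ctx, result) }; // invoked once per event
+    Ok(())
+}
+```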
+
+Instead of recursively calling [`resolve`](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/lib/vrl/compiler/src/expression.rs#L51-L56) on an [`Expression`](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/lib/vrl/compiler/src/expression.rs#L50), we add an `emit_llvm` method to the trait:
+
+```rust
+/// Emit LLVM IR that computes the `Value` for this expression.
+fn emit_llvm<'ctx>(
+    &self,
+    state: &crate::state::Compiler,
+    context: &mut crate::llvm::Context<'ctx>,
+) -> Result<(), String>;
+```
+
+where `Context` is defined as
+
+```rust
+pub struct Context<'ctx> {
+    context: &'ctx inkwell::context::Context,
+    execution_engine: inkwell::execution_engine::ExecutionEngine<'ctx>,
+    module: inkwell::module::Module<'ctx>,
+    builder: inkwell::builder::Builder<'ctx>,
+    function: inkwell::values::FunctionValue<'ctx>,
+    context_ref: inkwell::values::PointerValue<'ctx>,
+    result_ref: inkwell::values::PointerValue<'ctx>,
+    ...
+}
+```
+
+By convention, each expression can call `context.result_ref()` to get an LLVM `PointerValue` pointing to where the [`Resolved`](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/lib/vrl/compiler/src/expression.rs#L48) value should be stored.
+
+The result pointer can be temporarily changed using `context.set_result_ref()`. This mechanism allows a parent expression to call `emit_llvm` on a child while controlling where the machine code emitted for the child expression stores its result. This is useful e.g. when emitting a binary operation, where both of its operands need to be computed first.
+
+Calling `context.context_ref()` returns a reference to the VRL [`Context`](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/lib/vrl/compiler/src/context.rs#L5-L9) that is provided by the `remap` transform as a function argument.
+
+For anything less trivial than emitting branches or calling functions, we want to leverage the Rust compiler. For one, this means we don't have to concern ourselves with memory layout, and it doesn't force us to define an FFI as long as we only use basic integer types and pointers/references. It also provides us with Rust's memory safety guarantees for large parts of the emitted LLVM IR.
+
+_TODO: Elaborate on the expression compilation a bit more._
+
+_TODO: Explain how constants are handled._
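+
+To sketch how these pieces compose (the `block` and `call_bool` helpers on `Context` are hypothetical conveniences, the `inkwell` calls are approximate, and other trait items are elided), an if-statement's `emit_llvm` could look roughly like this:
+
+```rust
+impl Expression for IfStatement {
+    fn emit_llvm<'ctx>(
+        &self,
+        state: &crate::state::Compiler,
+        ctx: &mut crate::llvm::Context<'ctx>,
+    ) -> Result<(), String> {
+        // Create the basic blocks that mirror the expression's control flow.
+        let if_branch = ctx.block("if_statement_if_branch");
+        let else_branch = ctx.block("if_statement_else_branch");
+        let end = ctx.block("if_statement_end");
+
+        // Resolve the predicate into `result_ref`, then branch on its value
+        // via the precompiled `vrl_resolved_boolean_is_true` helper below.
+        self.predicate.emit_llvm(state, ctx)?;
+        let is_true = ctx.call_bool("vrl_resolved_boolean_is_true", ctx.result_ref());
+        ctx.builder().build_conditional_branch(is_true, if_branch, else_branch);
+
+        ctx.builder().position_at_end(if_branch);
+        self.consequent.emit_llvm(state, ctx)?;
+        ctx.builder().build_unconditional_branch(end);
+
+        ctx.builder().position_at_end(else_branch);
+        if let Some(alternative) = &self.alternative {
+            alternative.emit_llvm(state, ctx)?;
+        }
+        ctx.builder().build_unconditional_branch(end);
+
+        ctx.builder().position_at_end(end);
+        Ok(())
+    }
+}
+```
+
+Note how the basic block names match the ones in the IR excerpt for the example program further below.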
+
+Below we show a preliminary, work-in-progress excerpt of the precompiled functions. The LLVM module will be initialized with the resulting bitcode. Therefore, these function symbols may no longer exist at runtime if they are optimized out by LLVM.
+
+```rust
+#[no_mangle]
+pub extern "C" fn vrl_resolved_initialize(result: *mut Resolved) {
+    unsafe { result.write(Ok(Value::Null)) };
+}
+
+#[no_mangle]
+pub extern "C" fn vrl_resolved_drop(result: *mut Resolved) {
+    drop(unsafe { result.read() });
+}
+
+#[no_mangle]
+pub extern "C" fn vrl_resolved_is_err(result: &mut Resolved) -> bool {
+    result.is_err()
+}
+
+#[no_mangle]
+pub extern "C" fn vrl_resolved_boolean_is_true(result: &Resolved) -> bool {
+    result.as_ref().unwrap().as_boolean().unwrap()
+}
+
+#[no_mangle]
+pub extern "C" fn vrl_expression_assignment_target_insert_external_impl(
+    ctx: &mut Context,
+    path: &LookupBuf,
+    resolved: &Resolved,
+) {
+    let value = resolved.as_ref().unwrap().clone();
+    let _ = ctx.target_mut().insert(path, value);
+}
+
+#[no_mangle]
+pub extern "C" fn vrl_expression_literal_impl(value: &Value, result: &mut Resolved) {
+    *result = Ok(value.clone());
+}
+
+#[no_mangle]
+pub extern "C" fn vrl_expression_op_eq_impl(rhs: &mut Resolved, result: &mut Resolved) {
+    let rhs = std::mem::replace(rhs, Ok(Value::Null));
+    *result = match (result.clone(), rhs) {
+        (Ok(lhs), Ok(rhs)) => Ok(Value::Boolean(rhs == lhs)),
+        _ => unimplemented!(),
+    };
+}
+
+#[no_mangle]
+pub extern "C" fn vrl_expression_query_target_external_impl(
+    context: &mut Context,
+    path: &LookupBuf,
+    result: &mut Resolved,
+) {
+    *result = Ok(context
+        .target()
+        .get(path)
+        .ok()
+        .flatten()
+        .unwrap_or(Value::Null));
+}
+```
+
+With the precompiled library, we can emit code in terms of it using only stack allocations, branches and function calls. E.g. the LLVM IR for the following VRL program:
+
+```vrl
+if .status == 123 {
+    .foo = "bar"
+}
+```
+
+would look like this:
+
+```llvm
+; Function Attrs: mustprogress nofree norecurse nosync nounwind readnone uwtable willreturn
+define void @vrl_execute(%"vrl_compiler::Context"* noalias nocapture align 8 dereferenceable(32) %context, %"std::result::Result"* noalias nocapture align 8 dereferenceable(88) %result) unnamed_addr #55 {
+start:
+  br label %if_statement_begin
+
+if_statement_begin:                               ; preds = %start
+  br label %"op_==_begin"
+
+"op_==_begin":                                    ; preds = %if_statement_begin
+  call void @vrl_expression_query_target_external_impl(%"vrl_compiler::Context"* %context, %"lookup_buf::LookupBuf"* bitcast ([32 x i8]* @status to %"lookup_buf::LookupBuf"*), %"std::result::Result"* %result)
+  %rhs = alloca %"std::result::Result", align 8
+  call void @vrl_resolved_initialize(%"std::result::Result"* %rhs)
+  br label %literal_begin
+
+literal_begin:                                    ; preds = %"op_==_begin"
+  call void @vrl_expression_literal_impl(%"memmem::SearcherKind"* bitcast ([40 x i8]* @"123" to %"memmem::SearcherKind"*), %"std::result::Result"* %rhs)
+  call void @vrl_expression_op_eq_impl(%"std::result::Result"* %rhs, %"std::result::Result"* %result)
+  call void @vrl_resolved_drop(%"std::result::Result"* %rhs)
+  %vrl_resolved_boolean_is_true = call i1 @vrl_resolved_boolean_is_true(%"std::result::Result"* %result)
+  br i1 %vrl_resolved_boolean_is_true, label %if_statement_if_branch, label %if_statement_else_branch
+
+if_statement_end:                                 ; preds = %if_statement_else_branch, %block_end
+  ret void
+
+if_statement_if_branch:                           ; preds = %literal_begin
+  br label %block_begin
+
+if_statement_else_branch:                         ; preds = %literal_begin
+  br label %if_statement_end
+
+block_begin:                                      ; preds = %if_statement_if_branch
+  br label %assignment_single_begin
+
+block_end:                                        ; preds = %block_next, %block_error
+  br label %if_statement_end
+
+block_error:                                      ; preds = %assignment_single_end
+  br 
label %block_end + +assignment_single_begin: ; preds = %block_begin + br label %literal_begin1 + +assignment_single_end: ; preds = %literal_begin1 + %vrl_resolved_is_err = call i1 @vrl_resolved_is_err(%"std::result::Result"* %result) + br i1 %vrl_resolved_is_err, label %block_error, label %block_next + +literal_begin1: ; preds = %assignment_single_begin + call void @vrl_expression_literal_impl(%"memmem::SearcherKind"* bitcast ([40 x i8]* @"\22bar\22" to %"memmem::SearcherKind"*), %"std::result::Result"* %result) + call void @vrl_expression_assignment_target_insert_external_impl(%"vrl_compiler::Context"* %context, %"lookup_buf::LookupBuf"* bitcast ([32 x i8]* @foo to %"lookup_buf::LookupBuf"*), %"std::result::Result"* %result) + br label %assignment_single_end + +block_next: ; preds = %assignment_single_end + br label %block_end +} +``` + +After running several LLVM optimization passes over the LLVM IR: + +```llvm +; Function Attrs: nofree norecurse nosync nounwind readnone uwtable willreturn +define void @vrl_execute(%142* noalias nocapture align 8 dereferenceable(32) %0, %752* noalias nocapture align 8 dereferenceable(88) %1) unnamed_addr #87 personality i32 (i32, i32, i64, %462*, %9*)* @rust_eh_personality { + %3 = alloca %529*, align 8 + %4 = alloca %752, align 8 + %5 = alloca [5 x i64], align 8 + %6 = alloca %135, align 8 + %7 = alloca %116, align 8 + %8 = alloca %135, align 8 + tail call void @vrl_expression_query_target_external_impl(%142* nonnull %0, %74* bitcast ([32 x i8]* @16146 to %74*), %752* nonnull %1) #104 + %9 = alloca %752, align 8 + %10 = getelementptr inbounds %752, %752* %9, i64 0, i32 0 + store i64 0, i64* %10, align 8 + %11 = getelementptr inbounds %752, %752* %9, i64 0, i32 1 + %12 = bitcast [10 x i64]* %11 to i8* + store i8 8, i8* %12, align 8 + tail call void @llvm.experimental.noalias.scope.decl(metadata !99596) + %13 = bitcast [5 x i64]* %5 to i8* + call void @llvm.lifetime.start.p0i8(i64 40, i8* nonnull %13), !noalias !99596 + %14 = bitcast [5 x i64]* %5 to %135* + call fastcc void @17203(%135* noalias nocapture nonnull dereferenceable(40) %14, %135* nonnull align 8 dereferenceable(40) bitcast ([40 x i8]* @16147 to %135*)) #104, !noalias !99596 + %15 = bitcast [10 x i64]* %11 to %135* + invoke fastcc void @17183(%135* nonnull %15) + to label %18 unwind label %16 + +common.resume: ; preds = %49, %16 + %common.resume.op = phi { i8*, i32 } [ %17, %16 ], [ %50, %49 ] + resume { i8*, i32 } %common.resume.op + +16: ; preds = %2 + %17 = landingpad { i8*, i32 } + cleanup + store i64 0, i64* %10, align 8, !alias.scope !99596 + call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(40) %12, i8* noundef nonnull align 8 dereferenceable(40) %13, i64 40, i1 false) + br label %common.resume + +18: ; preds = %2 + store i64 0, i64* %10, align 8, !alias.scope !99596 + call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(40) %12, i8* noundef nonnull align 8 dereferenceable(40) %13, i64 40, i1 false) + call void @llvm.lifetime.end.p0i8(i64 40, i8* nonnull %13), !noalias !99596 + call void @vrl_expression_op_eq_impl(%752* nonnull %9, %752* nonnull %1) #104 + %19 = bitcast %752* %4 to i8* + call void @llvm.lifetime.start.p0i8(i64 88, i8* nonnull %19) + %20 = bitcast %752* %9 to i8* + call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(88) %19, i8* noundef nonnull align 8 dereferenceable(88) %20, i64 88, i1 false) #104 + %21 = getelementptr inbounds %752, %752* %4, i64 0, i32 0 + %22 = load i64, i64* 
%21, align 8, !range !220, !alias.scope !99599 + %23 = icmp eq i64 %22, 0 + %24 = getelementptr inbounds %752, %752* %4, i64 0, i32 1 + br i1 %23, label %25, label %27 + +25: ; preds = %18 + %26 = bitcast [10 x i64]* %24 to %135* + call fastcc void @17183(%135* nonnull %26) #104 + br label %29 + +27: ; preds = %18 + %28 = bitcast [10 x i64]* %24 to %529* + call void @17184(%529* nonnull %28) + br label %29 + +29: ; preds = %25, %27 + call void @llvm.lifetime.end.p0i8(i64 88, i8* nonnull %19) + %30 = getelementptr %752, %752* %1, i64 0, i32 0 + %31 = load i64, i64* %30, align 8, !range !220 + %32 = getelementptr inbounds %752, %752* %1, i64 0, i32 1 + %33 = icmp eq i64 %31, 0 + br i1 %33, label %38, label %34 + +34: ; preds = %29 + %35 = bitcast %529** %3 to i8* + call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %35), !noalias !99602 + %36 = bitcast %529** %3 to [10 x i64]** + store [10 x i64]* %32, [10 x i64]** %36, align 8, !noalias !99602 + %37 = bitcast %529** %3 to {}* + call void @_ZN4core6result13unwrap_failed17h0f27636d1d025391E([0 x i8]* noalias nonnull readonly align 1 bitcast (<{ [43 x i8] }>* @13883 to [0 x i8]*), i64 43, {}* nonnull align 1 %37, [3 x i64]* noalias readonly align 8 dereferenceable(24) bitcast (<{ i8*, [16 x i8], i8*, [0 x i8] }>* @6297 to [3 x i64]*), %71* noalias nonnull readonly align 8 dereferenceable(24) bitcast (<{ i8*, [16 x i8] }>* @6302 to %71*)) #104 + unreachable + +38: ; preds = %29 + %39 = bitcast [10 x i64]* %32 to %135* + %40 = bitcast [10 x i64]* %32 to i8* + %41 = load i8, i8* %40, align 8, !range !1540 + %42 = icmp eq i8 %41, 3 + %43 = getelementptr inbounds %135, %135* %39, i64 0, i32 1, i64 0 + %44 = load i8, i8* %43, align 1 + %45 = select i1 %42, i8 %44, i8 2 + br label %NodeBlock + +NodeBlock: ; preds = %38 + %Pivot = icmp slt i8 %45, 2 + br i1 %Pivot, label %LeafBlock, label %LeafBlock21 + +LeafBlock21: ; preds = %NodeBlock + %SwitchLeaf22 = icmp eq i8 %45, 2 + br i1 %SwitchLeaf22, label %46, label %NewDefault + +LeafBlock: ; preds = %NodeBlock + %SwitchLeaf = icmp eq i8 %45, 0 + br i1 %SwitchLeaf, label %47, label %NewDefault + +46: ; preds = %LeafBlock21 + call void @_ZN4core9panicking5panic17h367b69984712bd50E([0 x i8]* noalias nonnull readonly align 1 bitcast (<{ [43 x i8] }>* @13881 to [0 x i8]*), i64 43, %71* noalias nonnull readonly align 8 dereferenceable(24) bitcast (<{ i8*, [16 x i8] }>* @6303 to %71*)) #104 + unreachable + +47: ; preds = %LeafBlock, %71 + ret void + +NewDefault: ; preds = %LeafBlock21, %LeafBlock + br label %48 + +48: ; preds = %NewDefault + call void @llvm.experimental.noalias.scope.decl(metadata !99605) + call void @llvm.lifetime.start.p0i8(i64 40, i8* nonnull %13), !noalias !99605 + call fastcc void @17203(%135* noalias nocapture nonnull dereferenceable(40) %14, %135* nonnull align 8 dereferenceable(40) bitcast ([40 x i8]* @16148 to %135*)) #104, !noalias !99605 + invoke fastcc void @17183(%135* nonnull %39) + to label %51 unwind label %49 + +49: ; preds = %48 + %50 = landingpad { i8*, i32 } + cleanup + store i64 0, i64* %30, align 8, !alias.scope !99605 + call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(40) %40, i8* noundef nonnull align 8 dereferenceable(40) %13, i64 40, i1 false) + br label %common.resume + +51: ; preds = %48 + store i64 0, i64* %30, align 8, !alias.scope !99605 + call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(40) %40, i8* noundef nonnull align 8 dereferenceable(40) %13, i64 40, i1 false) + call void 
@llvm.lifetime.end.p0i8(i64 40, i8* nonnull %13), !noalias !99605 + call void @llvm.experimental.noalias.scope.decl(metadata !99608) + %52 = getelementptr inbounds %135, %135* %8, i64 0, i32 0 + call void @llvm.lifetime.start.p0i8(i64 40, i8* nonnull %52), !noalias !99611 + call fastcc void @17203(%135* noalias nocapture nonnull dereferenceable(40) %8, %135* nonnull align 8 dereferenceable(40) %39) #104, !noalias !99611 + %53 = bitcast %116* %7 to i8* + call void @llvm.lifetime.start.p0i8(i64 24, i8* nonnull %53), !noalias !99611 + %54 = getelementptr inbounds %142, %142* %0, i64 0, i32 0, i32 0 + %55 = load {}*, {}** %54, align 8, !alias.scope !99613, !noalias !99616, !nonnull !1 + %56 = getelementptr inbounds %142, %142* %0, i64 0, i32 0, i32 1 + %57 = load [3 x i64]*, [3 x i64]** %56, align 8, !alias.scope !99613, !noalias !99616, !nonnull !1 + %58 = getelementptr inbounds %135, %135* %6, i64 0, i32 0 + call void @llvm.lifetime.start.p0i8(i64 40, i8* nonnull %58), !noalias !99611 + call void @llvm.memcpy.p0i8.p0i8.i64(i8* noundef nonnull align 8 dereferenceable(40) %58, i8* noundef nonnull align 8 dereferenceable(40) %52, i64 40, i1 false), !noalias !99611 + %59 = getelementptr inbounds [3 x i64], [3 x i64]* %57, i64 0, i64 4 + %60 = bitcast i64* %59 to void (%116*, {}*, %74*, %135*)** + %61 = load void (%116*, {}*, %74*, %135*)*, void (%116*, {}*, %74*, %135*)** %60, align 8, !invariant.load !1, !noalias !99611, !nonnull !1 + call void %61(%116* noalias nocapture nonnull sret(%116) dereferenceable(24) %7, {}* nonnull align 1 %55, %74* noalias nonnull readonly align 8 dereferenceable(32) bitcast ([32 x i8]* @16149 to %74*), %135* noalias nocapture nonnull dereferenceable(40) %6) #104, !noalias !99608 + call void @llvm.lifetime.end.p0i8(i64 40, i8* nonnull %58), !noalias !99611 + %62 = getelementptr inbounds %116, %116* %7, i64 0, i32 0 + %63 = load {}*, {}** %62, align 8, !noalias !99611 + %64 = icmp eq {}* %63, null + %65 = bitcast {}* %63 to i8* + br i1 %64, label %71, label %66 + +66: ; preds = %51 + %67 = getelementptr inbounds %116, %116* %7, i64 0, i32 1, i64 0 + %68 = load i64, i64* %67, align 8, !noalias !99611 + %69 = icmp eq i64 %68, 0 + br i1 %69, label %71, label %70 + +70: ; preds = %66 + call void @__rust_dealloc(i8* nonnull %65, i64 %68, i64 1) #104, !noalias !99608 + br label %71 + +71: ; preds = %51, %66, %70 + call void @llvm.lifetime.end.p0i8(i64 24, i8* nonnull %53), !noalias !99611 + call void @llvm.lifetime.end.p0i8(i64 40, i8* nonnull %52), !noalias !99611 + br label %47 +} +``` + +Note the batched stack allocations, inlining of function calls and consolidation of control flow. + +## Rationale + +As long as the single-core performance of VRL is the bottleneck of a topology, general performance improvements to VRL are extremely valuable as they equate to an equally sized performance improvement to the entire topology. + +We want to live up to the [performance guarantees](https://github.com/vectordotdev/vector/blob/f1404bea186ba83c4426a32bbef3f633c17cf4d2/website/cue/reference/remap/features/compilation.cue#L4-L8) outlined in VRL's list of features. + +Every VRL program benefits from the reduced runtime overhead, without us needing to optimize any specific use case. Consistent execution speed is important to build trust in the language. + +Being able to execute our log transformation DSL at speeds which would otherwise only be attainable by hand-writing Rust programs will strengthen a key value proposition of Vector: best-in-class performance. 
+
+## Drawbacks
+
+By generating machine code via LLVM, we are no longer (largely) immune to memory violations. To reduce the error surface as much as possible, we employ industry practices like fuzz-testing VRL programs and static analysis via LLVM, and we rely on the Rust compiler for any non-trivial code fragments. That being said, memory safety has always been an abstraction that rests on lower layers upholding their invariants; the difference is that we now become partly responsible for maintaining those invariants ourselves instead of being able to defer to a third party for correctness. We still provide an inherently memory safe language to the user.
+
+Producing LLVM bitcode for Rust's `std` library is guarded behind the [`-Z build-std`](https://doc.rust-lang.org/cargo/reference/unstable.html#build-std) flag and only available on the nightly compiler toolchain. We need `std` to fully link our precompiled LLVM bitcode. Further investigation is needed to determine whether this means we need to upgrade our entire compiler toolchain to nightly, or whether it's possible to build a binary-compatible `std` according to our `rust-toolchain.toml` file.
+
+Statically linking LLVM to the Vector binary adds roughly 9 MB, in addition to the precompiled bitcode that needs to be included with the binary. If this is a concern, we can consider shipping binaries with the LLVM feature disabled.
+
+While LLVM is a widely used framework within the industry, working with it requires rather specialized knowledge about compiler construction. However, there exists plenty of publicly accessible material for code generation using LLVM, some of which I linked to [above](#introduction-to-llvm). In addition, we should take great care to document this part of the code base extraordinarily well.
+
+## Prior Art
+
+_TODO._
+
+## Alternatives
+
+### Compile to Rust
+
+Using Rust as a compilation target would require us to
+
+- ship a Rust compiler and its libraries
+- ship Vector source code and its dependent crates
+
+which would add hundreds of MB and is therefore infeasible.
+
+### Compile to C
+
+Using C as a compilation target would require us to
+
+- ship a C compiler
+
+while
+
+- not gaining any better safety guarantees
+- not being able to inline functions, and therefore missing optimization potential
+
+and would therefore not provide any significant benefit over using LLVM directly.
+
+### Compile to WebAssembly
+
+Using WebAssembly as a compilation target would require us to
+
+- ship a Wasm runtime
+- define an FFI
+- pay for costly serialization at the boundaries
+- copy data in and out of WebAssembly or use `mmap`ing techniques
+- accept likely slower execution speed
+- precompile WebAssembly bitcode or miss optimization potential
+
+In addition to these drawbacks, we just dropped support for the WebAssembly transform.
+
+### Compile to Bytecode
+
+As mentioned in the context, we are currently moving forward with a VM for VRL. Compared to the current execution model and an LLVM-based approach, the VM provides a middle ground in terms of execution speed, memory safety and sophistication.
+
+Weighing the benefits depends on the real-world performance of both approaches.
+
+## Plan Of Attack
+
+Incremental steps to execute this change. These will be converted to issues after the RFC is approved:
+
+- Submit a PR with spike-level code _roughly_ demonstrating the change: [#10442](https://github.com/vectordotdev/vector/pull/10442).
+- Extract a core library from VRL that exposes its types with minimal dependencies, necessary to reduce the size of the precompiled bitcode.
+- Refine code generation by taking into account type information.
+- Add unit tests for each expression in isolation.
+- Add fuzz testing infrastructure that compares the results of all three execution modes (see the sketch below).
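+
+The differential fuzzing step could assert a property along these lines (a minimal sketch; the `Backend` trait, its type aliases and implementations are hypothetical stand-ins for the AST traversal, VM and LLVM execution modes):
+
+```rust
+/// Hypothetical stand-ins for the shared input/output types; the real
+/// implementation would use VRL's `Value` and error types.
+type Event = String;
+type Output = Result<String, String>;
+
+/// Hypothetical stand-in for one execution mode (AST traversal, VM, LLVM).
+trait Backend {
+    fn resolve(&self, event: Event) -> Output;
+}
+
+/// Property asserted for every fuzzer-generated program and event: all
+/// backends must agree on the resulting value *and* on any error.
+fn assert_backends_agree(backends: &[&dyn Backend], event: &Event) {
+    let results: Vec<Output> = backends
+        .iter()
+        .map(|backend| backend.resolve(event.clone()))
+        .collect();
+    // Any divergence is a bug in one of the backends and becomes a
+    // minimized fuzz reproducer.
+    assert!(results.windows(2).all(|pair| pair[0] == pair[1]));
+}
+```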
+
+---
+
+[^1]: It certainly doesn't help that "LLVM was originally an initialism for Low Level Virtual Machine". However, the "LLVM abbreviation has officially been removed to avoid confusion, as LLVM has evolved into an umbrella project that has little relationship to what most current developers think of as (more specifically) process virtual machines." [↪](https://en.wikipedia.org/wiki/LLVM#History)