How to use ArrayBuilder most efficiently? #307

HDembinski (Maintainer) · 2021-03-13T11:29:30Z

I am wondering how smartly ArrayBuilder is implemented. Consider the following typical analysis code, where we read a chunk of events, process them with numba, and then fill the result into some (boost-)histograms. ArrayBuilder is needed for this because the size of the output array is not predictable in the analysis.

from awkward import ArrayBuilder
from numba import njit

@njit
def process(builder, a, b, c):
    ...  # analysis code that appends its results to builder


# tree is some tree read with uproot, hist is some histogram
for a, b, c in tree.iterate(["a", "b", "c"], how=tuple):
    builder = ArrayBuilder()
    process(builder, a, b, c)
    hist.fill(builder.snapshot())

This should work, but it is not efficient, since ArrayBuilders are created and destroyed in each iteration of the outer loop and allocating memory from the OS is an expensive operation. ArrayBuilder should reuse its accumulated memory similarly to a std::vector on clear for the next iteration.

I read in the docs that it is safe to reuse the builder after calling snapshot. So I was wondering if the following code is smart enough to not accumulate an ever increasing amount of memory:

builder = ArrayBuilder()
for a, b, c in tree.iterate(["a", "b", "c"], how=tuple):
    process(builder, a, b, c)
    hist.fill(builder.snapshot())

I read in the docs that the array returned by builder.snapshot is a memory view. So at least in theory this array once destroyed could call back into its parent to signal the viewed memory is now free again to be used for building the next array.

So does it work this way or could it be implemented to work in this way?

jpivarski (Maintainer) · 2021-03-13T18:28:59Z

ArrayBuilder code is completely contained within the src/libawkward/builder directory (interface in include/awkward/builder).

Each ArrayBuilder in Python (or Numba) is an instance of ArrayBuilder in C++, which is an interface class that holds a tree of Builder nodes. The first Builder is an UnknownBuilder which represents data of unknown type. If you only call ArrayBuilder's null method, the UnknownBuilder just counts nulls, but if you call any other method, such as integer, then the UnknownBuilder gets replaced by an Int64Builder (or whichever). If you had accumulated many nulls, the Int64Builder would be wrapped in an OptionBuilder. As long as the methods that you call correspond to the Builder tree that's already there, it only appends data to the buffers associated with the existing tree, but if you ever call a surprising method, it adds to the tree to reflect the more complex type. In the most generic situations, you'd end up with a lot of UnionBuilders, but if you stick to lists, records, and numeric types, the tree-building finishes early in the process and you're just growing buffers. (Some types are considered subtypes of each other: appending a floating point number to integers retroactively turns the integers into floating point, appending a complex number turns them into complex, and adding a field to a record retroactively adds the field to previous instances of the record with missing values up to now. More detail here.)
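The promotion rules in the paragraph above can be pictured with a toy model. This is a pure-Python sketch, not Awkward Array's implementation: the class `ToyBuilder` and its attributes are hypothetical. It only tracks the type a builder would report, and it covers just nulls, integers, and floating-point numbers.

```python
class ToyBuilder:
    """Hypothetical model of type discovery; not Awkward Array's C++ code."""

    # Wider numeric types absorb narrower ones.
    _RANK = {"int64": 0, "float64": 1, "complex128": 2}
    _CAST = {"int64": int, "float64": float, "complex128": complex}

    def __init__(self):
        self.kind = "unknown"   # plays the role of UnknownBuilder
        self.has_nulls = False  # plays the role of an OptionBuilder wrapper
        self.values = []

    @property
    def type(self):
        return ("?" if self.has_nulls else "") + self.kind

    def null(self):
        self.has_nulls = True
        self.values.append(None)

    def integer(self, value):
        self._append("int64", value)

    def real(self, value):
        self._append("float64", value)

    def _append(self, kind, value):
        if self.kind == "unknown" or self._RANK[kind] > self._RANK[self.kind]:
            # A "surprising" call: switch to the wider type and retroactively
            # convert everything accumulated so far.
            self.kind = kind
            cast = self._CAST[kind]
            self.values = [None if v is None else cast(v) for v in self.values]
        # Expected calls only append, converting to the current type.
        self.values.append(self._CAST[self.kind](value))


builder = ToyBuilder()
builder.null()
builder.null()
type_after_nulls = builder.type    # "?unknown"
builder.integer(1)
type_after_integer = builder.type  # "?int64"
builder.real(2.5)
print(builder.type)    # ?float64
print(builder.values)  # [None, None, 1.0, 2.5]
```

In this sketch the integer `1` becomes `1.0` as soon as a floating-point number arrives, mirroring the retroactive conversion described above. Lists, records, and unions are left out to keep the model short.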

The reason for all this generality is to make ak.Array([any data goes here]) work. It makes the construction of small examples very easy. But as you can imagine, dynamically checking types and growing trees that consist of vtable-heavy subclasses is not the fastest way to accumulate data.

Each Builder node in this tree can have zero or more GrowableBuffers associated with it. A GrowableBuffer grows in a similar way to std::vector (originally, I was just going to use std::vector, but there's a technical issue with that). These are templated by the type of data they can fill (primitive type T), so in a different context, they might be fast. A GrowableBuffer holds one std::shared_ptr<T> pointing to data, and the std::shared_ptr is constructed with a deleter that deletes the array (array_deleter) when there are no more references to the std::shared_ptr. After an initial allocation (default is 1024 items), filling the GrowableBuffer inserts items into the buffer until reaching its maximum, and then it allocates a new, larger std::shared_ptr<T> (by default, 1.5 times larger: see initial and resize arguments), copies the old data to the new, and lets the old std::shared_ptr<T> go out of scope.
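To get a feeling for what the default parameters mean, here is a small sketch that computes the sequence of buffer capacities and the number of items copied along the way. The function `growth_schedule` is hypothetical, and it assumes the new capacity is the old one times `resize`, rounded up; the rounding in the real GrowableBuffer may differ.

```python
import math


def growth_schedule(n_items, initial=1024, resize=1.5):
    """Return (capacities, items_copied) for appending n_items one by one.

    Assumption: new capacity = ceil(old capacity * resize).
    """
    capacities = [initial]
    items_copied = 0
    while capacities[-1] < n_items:
        old = capacities[-1]
        items_copied += old  # the full old buffer is copied into the new one
        capacities.append(math.ceil(old * resize))
    return capacities, items_copied


capacities, items_copied = growth_schedule(5000)
print(capacities)    # [1024, 1536, 2304, 3456, 5184]
print(items_copied)  # 8320
```

Because the capacities grow geometrically, the total number of copied items stays proportional to the final size, so appending remains cheap on average even though individual appends occasionally trigger a full copy.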

When you perform a snapshot, the data are not copied: the ArrayBuilder just passes a reference to the new tree of Content nodes that is created. If growing the ArrayBuilder means putting more items into the shared buffers, that's okay because the Content doesn't "see" them. (The Content has a fixed length, so what happens in its buffers beyond that length is invisible.) If the ArrayBuilder appends enough that its GrowableBuffers allocate new buffers, that's okay too, since the Content created by the snapshot is an owner of the buffers: they stay in scope if and only if somebody is using them.
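The sharing behaviour described above can be imitated in pure Python with `array` and `memoryview`. This is only a sketch of the idea: `SharedGrowableBuffer` is a hypothetical class, Python's reference counting stands in for `std::shared_ptr`, and a sliced `memoryview` stands in for a fixed-length Content.

```python
import math
from array import array


class SharedGrowableBuffer:
    """Hypothetical model of the current policy; not the C++ GrowableBuffer."""

    def __init__(self, initial=1024, resize=1.5):
        self._data = array("q", [0]) * initial
        self._length = 0
        self._resize = resize

    def append(self, value):
        if self._length == len(self._data):
            # Allocate a larger buffer and copy. The old buffer stays alive
            # for as long as an earlier snapshot still references it.
            new = array("q", [0]) * math.ceil(len(self._data) * self._resize)
            new[: self._length] = self._data
            self._data = new
        self._data[self._length] = value
        self._length += 1

    def snapshot(self):
        # No copy: a fixed-length view of the shared buffer.
        return memoryview(self._data)[: self._length]


buf = SharedGrowableBuffer(initial=4)
for x in (1, 2, 3):
    buf.append(x)
snap = buf.snapshot()

buf.append(4)  # written into the same buffer, beyond the snapshot's length
shared_before_growth = snap.obj is buf._data  # True

buf.append(5)  # buffer is full, so a new one is allocated
buf.append(6)
shared_after_growth = snap.obj is buf._data   # False

print(snap.tolist())            # [1, 2, 3]
print(buf.snapshot().tolist())  # [1, 2, 3, 4, 5, 6]
```

The first snapshot never changes, whether the builder writes past its end or moves on to a new allocation.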

But this policy has some bad consequences. For instance, if you wanted to use an ArrayBuilder in a way that clears the existing data and starts appending from zero, such an action would have to replace, not reuse, the std::shared_ptrs associated with each GrowableBuffer. Reusing the same std::shared_ptr could invalidate data that is in some Content somewhere. Also, @ianna discovered that replacing std::shared_ptrs with std::unique_ptrs would have a noticeable impact on performance. (I don't remember the exact number, but it was something like 10%.) Also also, the buffers given to Content nodes can't be trimmed to their actual size—they're some 1.5× multiple of 1024.

This choice of policy favors frequent snapshots of a long-growing array. That doesn't seem to be the most popular use, so @ianna and I have been talking about changes in policy. Here are all the possibilities we've considered:

1. GrowableBuffers own std::shared_ptrs, extending them by making exponentially larger ones, and share them with Content at snapshot time (current policy).
2. GrowableBuffers own std::unique_ptrs, extending them by making exponentially larger ones, and copy the data into new buffers at snapshot time. This makes the snapshot more costly, but the ArrayBuilders can now have a clear operation that does not need to allocate data (so once a high-water mark is reached, it probably won't have to allocate any more in a long-running process that frequently clears the buffers). It also means that the data owned by the Content nodes can be trimmed to the minimal necessary size, which makes sense because in most uses, the Content are more long-lived than the ArrayBuilder that made them, as the ArrayBuilder is usually just implicit in the ak.Array constructor.
3. Instead of extending GrowableBuffers by replacing their data with exponentially larger allocations, the way std::vector does, perhaps it should allocate equal-size, non-contiguous chunks, pushing to a std::vector<std::unique_ptr<T>>. With this policy, a snapshot operation has to copy, since the snapshot would have to concatenate them for the Content nodes, which need contiguous data.

The second policy would be good for frequent clearing and reuse, which I didn't know was a common use, but it appears to be in your case. The third policy might be better for memory management, since a field full of equal-size allocations that get deleted when each ArrayBuilder is done would be less fragmented: new ArrayBuilders can fill in those equal-size slots. Thus, the third policy might be best for situations with many ArrayBuilders making arrays, which I think might be common (it certainly is in our unit tests!).
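As an illustration of the third policy, here is a pure-Python sketch in which every allocation has the same size and the snapshot concatenates the chunks into one contiguous array. `ChunkedBuffer` is a hypothetical name, and a Python list of `array` objects stands in for the `std::vector<std::unique_ptr<T>>`.

```python
from array import array


class ChunkedBuffer:
    """Hypothetical model of equal-size chunks with copy on snapshot."""

    def __init__(self, chunk_size=1024):
        self._chunk_size = chunk_size
        self._chunks = [array("d", [0.0]) * chunk_size]
        self._fill = 0  # number of items used in the last chunk

    def __len__(self):
        return (len(self._chunks) - 1) * self._chunk_size + self._fill

    def append(self, value):
        if self._fill == self._chunk_size:
            # Existing data is never moved; just add another equal-size chunk.
            self._chunks.append(array("d", [0.0]) * self._chunk_size)
            self._fill = 0
        self._chunks[-1][self._fill] = value
        self._fill += 1

    def snapshot(self):
        # Concatenate into one contiguous, exactly sized array (a copy).
        out = array("d")
        for chunk in self._chunks[:-1]:
            out.extend(chunk)
        out.extend(self._chunks[-1][: self._fill])
        return out


buf = ChunkedBuffer(chunk_size=4)
for i in range(10):
    buf.append(float(i))

n_chunks = len(buf._chunks)
snap = buf.snapshot()
print(n_chunks)  # 3
print(len(snap))  # 10
```

Appending never copies old data in this scheme; all of the copying is deferred to the snapshot, which also yields a buffer trimmed to the actual length.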

Independent of the problem of finding the best GrowableBuffer policy, there's another performance bottleneck in the way that ArrayBuilder has to discover the type of its data while accumulating. For months, we've been thinking about adding a TypedArrayBuilder that would fix this problem by requiring the type up-front. It could then build the one and only tree for that type, and most commands, such as begin_list where an integer is expected, would raise errors. @ianna is implementing this in PR #769, and it looks nearly done.

Although the TypedArrayBuilder knows the type before filling, it doesn't know the type at compile-time, since we have to distribute a wheel for Python that can't instantiate the infinitely many types, so we need something that "almost compiles" at runtime. @ianna is implementing this in C++ using AwkwardForth (arxiv:2102.13516), which is also intended for communication between Uproot and Awkward Array: it's a low-level language to generate code for (i.e. not writing it by hand) that is optimized for filling buffers quickly. Instructions in AwkwardForth are not quite as fast as compiled C++ (they're about 5 ns per instruction with no ability to do compiler-optimizations to reduce the number of instructions), but some AwkwardForth programs are faster than fetching data from memory, and certainly from disk. Most importantly, they are written at runtime. @ianna's implementation of TypedArrayBuilder has fixed C++ code for the begin_list, append integer, etc. commands that call into an AwkwardForth VM whose code is generated at runtime, from the type of the array one wishes to build.

This talk about generating code at runtime raises the question, "Why not use a JIT-compiler?" so here is where I will talk about Numba. AwkwardForth is an alternative to a full JIT-compiler without dependencies; its purpose is to ensure that Awkward Array doesn't strictly require LLVM for all users. But if you have LLVM because you're using JAX or Numba, that's a different story.

The non-typed ArrayBuilder is implemented in Numba by having Numba call ArrayBuilder's C++ through function pointers. (The Numba lowering is in src/awkward/_connect/_numba/builder, ArrayBuilder's function pointer interface is here.) From Numba's perspective, the ArrayBuilder is a single opaque type, since the types of Numba objects can't change at runtime the way that ArrayBuilder does. From a performance perspective, this means that we gain nothing from Numba's ability to JIT-compile to a specific type, and LLVM can't optimize code that is called through external function pointers.

The TypedArrayBuilder concept could help here, but not the C++ implementation linked through function pointers. To really gain a performance advantage from a TypedArrayBuilder in Numba, we would have to use the given type to generate specialized Numba code, which unfortunately means a complete reimplementation. We're going forward with TypedArrayBuilder in C++ now because (a) it has non-Numba applications and (b) it sets the interface for how a TypedArrayBuilder should work, so that implementing a Numba version would be adhering to an already-developed standard. (This "design work" is relevant because the TypedArrayBuilder interface is not exactly the same as the ArrayBuilder interface—some commands, like begin_record, end_record, and field become unnecessary in a context where the type is already known.)

As a very first step, we should think about Numba-native GrowableBuffers, which would be useful even without the (Typed)ArrayBuilder. You've brought that up before, @HDembinski. As described above, there are at least three ways to do the buffer-growing: (1) replace with exponentially larger buffers and share with snapshot, (2) replace with exponentially larger buffers and copy to snapshot, and (3) accumulate a list of equal-size buffers and concatenate + copy to snapshot. We should probably figure out what is the fastest kind of GrowableBuffer (for the right set of use-cases) before deciding on one to implement in Numba.

I don't see how a "changing size array" concept fits into JAX's view of JIT-compilation, but if you have any ideas there, let me know!

Most of what I've talked about above presupposes that we change how ArrayBuilder (or at least GrowableBuffer) works to improve performance. In its current state, you have to

for a, b, c in tree.iterate(["a", "b", "c"], how=tuple):
    builder = ArrayBuilder()
    process(builder, a, b, c)
    hist.fill(builder.snapshot())

and can't

builder = ArrayBuilder()
for a, b, c in tree.iterate(["a", "b", "c"], how=tuple):
    process(builder, a, b, c)
    hist.fill(builder.snapshot())

because the latter would fill histograms with more and more data (refilling them with what was filled on previous iterations). There's no clear operation, and it certainly isn't implicit. If this is the use-case we're optimizing for, I think that GrowableBuffer policy (2) would be the best one, since that policy would make a clear operation inexpensive (no new allocations) and reuse memory buffers that have reached their high-water mark, at the expense of making the snapshot operation more expensive than it is right now.
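To make the trade-off of policy (2) concrete, here is a pure-Python sketch of a buffer with a `clear` operation and a copying snapshot. Nothing in it is existing Awkward Array API: `ReusableBuffer`, its `clear` method, and the `allocations` counter are all hypothetical, and the nested lists stand in for the chunks a loop like the one above would process.

```python
import math
from array import array


class ReusableBuffer:
    """Hypothetical model of policy (2): exclusive ownership, copy on snapshot."""

    def __init__(self, initial=1024, resize=1.5):
        self._data = array("d", [0.0]) * initial
        self._length = 0
        self._resize = resize
        self.allocations = 1  # counts only the builder's own buffers

    def append(self, value):
        if self._length == len(self._data):
            new = array("d", [0.0]) * math.ceil(len(self._data) * self._resize)
            new[: self._length] = self._data
            self._data = new
            self.allocations += 1
        self._data[self._length] = value
        self._length += 1

    def snapshot(self):
        # Slicing an array copies, so the result is trimmed and independent.
        return self._data[: self._length]

    def clear(self):
        # Keep the allocation; just start writing from the beginning again.
        self._length = 0


buf = ReusableBuffer(initial=4)
results = []
for chunk in ([1.0] * 10, [2.0] * 3, [3.0] * 7):
    buf.clear()
    for x in chunk:
        buf.append(x)
    results.append(buf.snapshot())

print([len(r) for r in results])  # [10, 3, 7]
print(buf.allocations)            # 4
```

All four allocations happen while the first chunk is processed; the later, smaller chunks reuse the buffer. The price is the copy made by every `snapshot` call, which is the extra cost mentioned above.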

Also implicit in this example: it looks like your process only ever makes unstructured, one-dimensional arrays? If you're histogramming it, then it must be flat. In that case, you really want the GrowableBuffer without any ArrayBuilder, typed or otherwise. If "a", "b", and "c" are also flat or rectilinear, then bonus: you can use library="np" and get plain NumPy arrays. It all depends on what your process does and what the inputs are.

I think it should be possible to implement a GrowableBuffer in Numba, even without extending Numba. Each buffer would be a NumPy array in Numba. I don't know if Numba's type system allows you to replace a variable representing an array, but it's possible to make typed lists, and you can replace an element of a list or append to the list. (By replacing a length-1 list, you could implement policy (1) or (2); by appending, you could implement policy (3).) I searched for an example of this, and this StackOverflow answer shows how to do it in a @jitclass, which is yet another way to make mutable non-arrays in Numba.

Looking even further into the future, your problem would be more easily solved if histogram objects were recognized by Numba (by writing a Numba extension for it). @henryiii and I have talked about that—that would be great. It would be a matter of replacing the boost-histogram with a Numba model consisting of NumPy arrays viewing the histogram's counts and then implementing the fill operation as a lowered Numba function. Then you wouldn't even need to fill a large array with data that you're just going to be reducing into a histogram anyway. Writing Numba extensions requires some insider knowledge—it's not fully documented—that I accumulated through successive versions of Awkward Array. I'd be willing to share what I've learned for the purposes of lowering histograms. @henryiii and I have already talked about this and looked at code examples for a day, but that was over a year ago now. If you're ever interested in lowering histograms as a Numba extension, I'm here to help!
