Skip to content
Paul Rogers edited this page Jan 31, 2017 · 15 revisions

Memory management in Drill is a complex topic. Here we describe just enough to explain a serious design flaw that impairs our ability to implement effective resource management for memory. We come at the topic from three directions:

  • Explanation of Drill's memory allocator
  • Explanation of Drill batch sizes
  • Explanation of how the above two are in direct conflict.

Good references:

Memory Manager

Drill is a columnar system, with data for each column stored in a value vector. Each vector is backed by a direct memory allocation. While Java provides robust management of heap memory, each application must invent its own manager for direct memory. Drill's manager is based on that provided by Netty, since Drill uses Netty to send (and thus release) and receive (and thus allocate) value vectors.

Memory management is a very complex topic; this discussion does not attempt to repeat that story. Rather, we focus on a number of critical aspects.

Java provides the Unsafe class that, via devious means, and application can use to allocate blocks of memory directly from the OS. Java provides no management at all: the application is responsible for freeing the memory when no longer needed (unlike heap memory, which is automatically garbage collected.)

This writer does not know how Java handles memory management below the Unsafe level. Is memory obtained from a call to the C malloc routine? Does Java provide its own memory manager above malloc? Presumably, the Unsafe memory manager is inefficient because Netty layers its own memory manager on top of Unsafe.

The Netty manager is implemented in the io.netty.buffer.PooledByteBufAllocator class. This class, in Drill, provides a per-thread, lock-free memory allocator. Basically, each minor fragment has its own pooled allocator instance. The "pooled" in the name is the key: the allocator maintains a free list of direct memory "chunks". In Drill, each chunk is 16 MB in size.

Allocation proceeds as follows:

  • Round the allocation up to the next power of 2. (A 5K request, say, is rounded up to 8K.)
  • Scan the pool's free list for an available chunk.
  • If a free chunk is available, and has sufficient capacity, carve off the desired allocation.
  • If the request is 16MB in size, the request uses the entire chunk. If smaller, the request slices up a chunk to provide the allocation.
  • If no free chunks exist or the request is larger than 16MB, request the memory directly from Unsafe.

Freeing proceeds as follows:

  • If the request is 16MB or larger, release memory back to the pool in 16 MB chunks. (Need to verify this...)
  • If the request is smaller than 16 MB, find the existing chunk in the memory pool and mark the used portion as now free.

Notice the asymmetry: any size request can be made. But, releasing memory carves it up into 16 MB chunks. Large requests come from Unsafe, but releases of those blocks go into the pooled allocator as 16 MB chunks.

Record Batch Management

Drill works with records that typically contain multiple fields. Upon reading, Drill "rotates" the data from row to columnar format. Thus, a record with, say, 10 fields becomes a group of 10 value vectors. Vectors only make sense when used to hold a group of records. In Drill, that group (or, more specifically, the vectors that make up the group) are called a "record batch." (Not really, a "record batch" is the implementation of an operator that works on the record batch, but we shall ignore that confusion here and simply use the term to refer to the "bundle" of vectors.)

The above description immediately raises the question: how many records should appear in each batch? Should the number be fixed, or should it depend on the size of each record (or column)? The answer in Drill is, "it depends."

It seems Drill was originally designed so that all record batches hold 64K records. This number is an obvious choice: it is the maximum number addressable by a two-byte selection vector (a so-called "SV2.") This choice works well for records that consist of small numbers of numeric columns. For example, a record of 10 floats, needs only:

10 columns * 8 bytes per double * 64K rows = 5 MB

This is quite a small allocation for a big data system.

Later, it was realized (it seems) that not all rows are (or can be restricted to) such a favorable layout. Consider analytics over documents. Each row may be 10s of K in size. Document databases, which favor denormalized data, are often of this size. Now the math becomes:

1 column * 50K characters * 64K rows = 3.3 GB

It was realized that this size is just a bit too big for comfort. But, the 64K row goal is still desirable. So, various operators tried for compromises: choose 64K rows, or as much data as will fit into some target record batch size. This gave rise to two additional issues.

First, Drill provides no metadata information in order to predict (or track) column widths. Thus, Drill is blind to the size of record batches during execution. Apparently some operators found work-arounds (need to research.) For example, the "managed" external sort measures the size of the entire record batch, divides this by the record count and obtains a crude estimate of row width. (But, not information about individual columns.)

Second, the target batch size is a per-operator decision. The Flatten operator chose 512 MB. The text scanner appears to have chosen 128 MB. Other operators don't enforce size. The result is that there is no standard, no guideline for how large record batches can be (or even if size is a concern.)

Vector Serialization

Some operators serialize value vectors. Indeed, serialization is one of the key benefits of the vector format.

When creating a vector, Drill may (need to verify) rely on Netty's concept of a "composite" ByteBuf: a logical buffer that is backed by a collection of separate physical buffers. Or, Drill may allocate a contiguous buffer.

When serializing a vector, Drill (actually Netty) writes the bytes out as one continuous stream of bytes. Drill prepends type and length information.

When deserializing, Drill:

  • Obtains type information
  • Obtains the data length
  • Allocates a (single contiguous) buffer to hold the data
  • Reads data into the newly created buffer
  • Wraps the data in a value vector depending on type type information

Notice that deserialization is always contiguous, this will be important in the next section.

Memory Fragmentation of Death

We can now combine the above two descriptions to explain the fatal flaw in Drill's memory management design.

Consider a summary of the facts presented above:

  • Allocations of any size are permitted: those above 16 MB come directly from Unsafe.
  • Release of memory always goes into the pooled allocator with large allocations sliced into 16 MB chunks.
  • Drill targets 64K record batches, and has no effective knowledge of column width.
  • Deserialization reads an entire value vector into one contiguous buffer.

Now, consider what happens when input records have large columns, say 8K in size. The scanner attempts to create input batches of size 128 MB. Since our columns are 1K in size, this means 16K rows. (Or, the target record count is 16K and batch size works out to 128MB. Need to research to determine which is the case.)

The 128MB is a single column. This batch is fed into the external sort. If the file is too large for memory (say a 18 GB file sorted in 3 GB of memory), the sort will spill. For efficiency, let's say we want large runs, of maybe 32 MB in size. During the run we continually receive batches, spill them, and free the underlying vectors. All these vectors go into Netty's pooled memory as 16 MB free chunks.

When reading each batch back from the spill file, we invoke the Drill deserializer. As explained above, Drill creates a single vector for each spilled vector. In this case, each vector is 32 MB in size.

On each such allocation, Drill asks for an allocation. Since the request is above 16 MB, the request goes directly to Unsafe.

Now, here is the issue. We set the external sort's memory at 3GB, which is most of the memory given to the Drillbit. The fragment's own pooled allocator already owns the 3GB. When we request Unsafe to give us more (for the 32 MB allocation), there is no more to give.

The result is that the allocation fails with an out-of-memory (OOM) error. And, does so despite the fact that a vast pool of memory exists in the pooled allocator -- but divided up into 16 MB chunks unusable to satisfy a 32 MB allocation.

Restating the Problem

The above should give you a good understanding of how we discovered the issues. At this point, it may be helpful to restate the problem clearly.

  • The original External Sort assumed all Varchar columns are 50 characters or less, even when they are actually larger. The original sort will run out of memory at random times with increasing probability as the width of columns increases above 50 bytes simply because the original sort does not know how bit the data actually is and so cannot properly plan spilling.
  • We revised the sort (to create the "managed" version). This version does plan spilling based on actual data size. This solves the above issue. As a result, users can now sort data with much larger widths, say up to 8K per column.
  • In most parts of Drill, record batches are sized by rows, not bytes. With the sort able to handle long columns, we now ask other parts of Drill to also handle long columns. Since batches are sized by rows, we end up allocating very large vectors to hold the fixed number of 8K columns.
  • The Drill (really Netty) memory allocator tracks memory in 16 MB chunks. With wide columns, we end up with buffers of very large sizes. That results in direct allocations from the system, bypassing the Netty free list.
  • With many large allocations of, essentially, random size, direct memory rapidly becomes fragmented.
  • The result is that, at some point, an allocation asks for a buffer greater than 16 MB, cannot be served by the ample 16 MB chunks on the free list, but the underlying direct memory has become so fragmented the it can't serve the request either. The query fails with OOM.
  • This issue was discovered as part of the Resource Management project. Here, we wish to maximize usage of memory: assigning each operator a known-size pool in which to operate, and adjust each operator to work within that pool. The more tightly we try to manage memory, the less "slop" exists to absorb memory losses due to fragmentation. So, the more we try to manage memory, the more likely we will encounter OOM errors in production.

The problem will not occur if columns are 256 bytes or less. (In this case 64K rows * 256 bytes = 16 MB buffer.) There is no magic number at which OOM errors are guaranteed to occur. As column widths increase (and as RM attempts to better manage memory), the more likely it becomes that fragmentation will trigger OOM.

Solutions

The above is, as was said, a fundamental flaw in Drill's memory management algorithm. It simply does not work to allow Drill to allocate any size of buffer, but only keep a free list of 16 MB chunks. Something has to change.

Better Free List Management?

One could dispense with the free list. However, direct memory has no garbage collector. If it did, then the GC might be able to shift data around in memory to coalesce free blocks into one large contiguous block. Since no GC is available, Drill (or Netty) must manage memory. Since a global malloc is expensive (due to locking), a local allocator is required.

However, the local allocator does not request all its memory at once; it does so in 16 MB chunks from the global allocator. As a result, the free chunks in the pool cannot be coalesced to form a large chunk when needed. Instead, a new, large chunk must be requested from the underlying system allocator.

Better Drill "Page Buffer" Management?

On the other hand, we could restrict Drill to allocate only blocks of 16 MB. Relational databases have long done this: each defines some fixed page size. All data fits into page buffers (perhaps spanning pages when needed.) A page buffer pool manager manages free, in-use and dirty pages, writing data to and from disk as needed.

Drill could use a similar idea: use fixed-size pages to back vectors. However, doing so requires an extensive rewrite of Drill's memory allocator, and requires changes to how Netty allocates and frees memory (since Netty is the source and/or sink of many Drill vectors.)

The problem also is that Drill would have to be aware of column widths and know how to limit vector sizes based on such widths. The required meta-data does not, however, exist in Drill at present.

Use Java Heap

Yet another solution is to rely on the Java heap, and Java garbage collector, for memory management. In the abstract, this is a bad idea as Drill's frequent, large allocations will lead to excessive GC events. One must ask, however, if the costs of avoiding GC outweigh the benefits of doing so. This is not an easy question to answer.

Force Flush of Free List

It may be possible to force a flush of all blocks on the free list back to the global pool. However, unlike Java GC, ,the global pool has no way to move allocated blocks in order to gather and coalesce free blocks. It is thus not clear whether releasing a collection of 16 MB chunks back to the system will result in the ability to allocate larger chunks.

Increase the Chunk Size

If Drill allocates blocks larger than the 16 MB blocks which Netty tracks, perhaps Drill should increase the chunk size to the largest that Drill will allocate. The problem is that the size is, essentially, unbounded. Drill works with record counts, not with column widths. We saw in an example above that columns with 10K of text would require allocations of over 3GB. At that level, memory fragmentation, even in a large direct memory pool, will become a severe problem.

Practical Solutions

Of the above, only two appear practical:

  1. Switch to use heap memory.
  2. Design a buffer pool system for Drill.

Other Related Material

The text scanner (and many others), allocate value vectors based on assumed row counts and widths:

      AllocationHelper.allocate(v, Character.MAX_VALUE, 50, 10);

That is:

  • Row count is assumed to be 64K.
  • Row width is assumed to be 50 bytes.
  • In repeated fields, each repeatable element is assumed to repeat 10 times.

The above results in:

  • Initial Varchar vector size of 50 * 64K = 3,276,750 bytes.
  • Initial Repeated Varchar size also of 3,276,750 bytes, in this case we assume each element is 50 / 10 = 5 bytes in size.
  • For repeated vectors (such as used when reading from a text file), the index vector starts with 64K int elements, or 256K bytes.

Value vectors enforce their own upper allocation size. For UInt4Vector:

  public static final int MAX_ALLOCATION_SIZE = Integer.MAX_VALUE;

  private void allocateBytes(final long size) {
    if (size > MAX_ALLOCATION_SIZE) {
      throw new OversizedAllocationException("Requested amount of memory is more than max allowed allocation size");
    }
    ...

This means that vectors can allocate buffers of up to 2 GB (far larger than the free list size of 16 MB.)

Vector Growth

When reading, the value vector is initialized as above. The vector size then grows as data arrives. On each growth, the vectors double in size. (One mystery is that, for a text reader, the vector starts at the 3 MB size above, but somehow shrinks to 32K by the time that the first reallocation occurs...) Each reallocation consists of:

  • Allocate a new buffer double the original size
  • Copy the bytes from the old to the new buffer
  • Release the old buffer.

The above shows that Drill is prone to memory fragmentation: Drill happily allocates buffers larger than the 16 MB free list chunk size. The code does not support "composite" buffers that result from stringing together buffers (say 16 MB chunks) to form a larger "logical" buffer.

The frequent memory allocations, releases and copies are an obvious performance penalty that serve no useful purpose.

Maximum Vector Size

Each record batch contains up to 64K records, or up to end of file. No limit applies to the memory consumed by the record batch. The memory consumed is driven purely by the 64K record goal and the actual row width of data on disk. In practice, no single column can be above the limits above:

Max. column width = 2 GB allocation limit / 64 K row limit = 32,768 bytes.

This means that if any single input file were to contain a run of 64K records where the average column width is greater than 32K, the scan operator will fail when it attempts to (re)allocate the buffer for the data. (Of course, as the scan operator doubles the vector size, it will badly fragment the Unsafe memory pool.)

Double Vectors Passed to Sort

The test case in question reads records with a single 8K column from an 18 GB file. The scanner resizes the column vector as it reads up to a maximum size of 64 MB. The project above the scanner then grows another vector of the same type up to 64 MB. (Need to clarify this aspect.) This is seems an unnecessary allocation (since there is nothing in the SQL that would require copying the values.) This also, somehow, causes the sort (sitting above the project) to see an incoming batch of 128 MB in size. The double size appears to come from the project: data is read into a RepeatableVarChar vector, the project copies it into a NullableVarChar vector. Both are passed to the sort.

Vector Storage

Vectors are built upon DrillBuf which is built upon Netty's "Unsafe Direct Little Endian" (UDLE) family of classes. Of these, the most important are:

  • PooledUnsafeLittleEndian which references chunks of memory of 16 MB size or less, and
  • UnpooledUnsafeDirectByteBuf which references memory larger than 16 MB.

The decision about which to allocate is made in Netty's PooledByteBufAllocatorL.InnerAllocator class. Actual allocation is done via the java.nio.DirectByteBuffer class. The NIO class takes care of page alignment, allocation from the Java VM and so on. On deallocation, NIO provides a Cleaner mechanism. Netty seems to take over part of that process itself. However, ultimately, the memory passes back to the JVM; it is not sliced up into the free list via Unsafe.freeMemory().

Tentative Conclusion

Drill is a big data analytics engine. Big data mostly means large quantities of numbers, perhaps with short text columns. Memory fragmentation is not a new issue. Indeed, it is well-established in the literature. Here is a passage from an ’84 paper on buffer managers:

Since pages can be displaced deliberately, variable-sized pages would cause heavy fragmentation problems.

Since the original Drill designers chose to use variable-sized pages for column widths greater than 256 characters, they either did not know about this fundamental fact of buffer pool design, or they understood the problem but determined it was not an actual risk because we designed Drill to handle columns small enough to use Netty's fixed-size 16 MB chunks.

The result is that either Drill usage is restricted to a maximum column width of 256 characters (allows 64K values per value vector in 16 MB of memory), or a redesign is needed of the memory allocator. Since that redesign is out of scope, we must live with the limitation at present.

Clone this wiki locally