GitHub - specht/diob_lkm: A loadable kernel module for dynamic I/O buffering

Introduction

The idea of this module is to provide automatic buffering for I/O on Linux at the system call level. Files are being watched while they are read, and whenever it appears that a large file is read in many small chunks (which is the default behaviour of most libc-linked programs), a buffer of increasing size (256 kB, 1 MB, 4 MB) is allocated in kernel memory and used to store and return pre-fetched file content.

This module has been developed and tested on Cent OS 5.5 with a Linux 2.6.18-348.6.1.el5. It might also work with other configurations.

Results

With the module loaded in a VirtualBox, a reduction in executing time to about 50% could be observed while running sha1sum on a 100 MB file, roughly corresponding to the effect seen when using dd with a large block size and piping its output to sha1sum. Without virtualization, the effect is minimal. However, it is expected that this module will mitigate I/O problems on cluster file systems due to excessive amounts of I/O system calls resulting from small default buffer sizes. With the module loaded, the DDoS-like effects resulting from small default buffer sizes are mitigated before they reach the cluster file system.

Alternatives

Writing a kernel module carries a great potential to mess a system up so bad it needs to be reset. There are a couple of possible alternatives:

the provided functionality could be implemented as patched gnulibc, but there are problems with the LD_PRELOAD approach: it’s not working reliably.
the file system driver could be patched to provide the same functionality

Strategy

Every process has a file descriptor table in which process-specific file descriptors are mapped to entries in the system-wide open file table. These open file table pointers are converted into a CRC-16 checksum (“the hash”), resulting in a maximum of 2^16 files being watched for I/O behaviour, and each file is managed internally via its hash.

The hashing is necessary because different processes use the same file descriptors for different files (ruling out raw file descriptors as distinctive keys) and the open file table does not seem to be a table where each entry has an offset but file objects seem to be scattered throughout kernel memory, hence the pointer hashing. CRC-16, when implemented with a reasonable polynomial, can be expected to uniformly distribute various pointers over a range of 64k slots. In the case of hash collision, the second file will not be handled by the module while handling for the first file continues as if nothing happened.

Whenever a file gets opened, a couple of checks are performed to determine whether the file should be watched:

the owner of the file must not be root
it must be a regular file
it must have a minimum size of 16 MB

If these tests pass, the file will be watched. The aim of these tests is to skip files which cannot be seeked and to discard most files for which additonal buffering wouldn’t buy much.

Whenever the read system call is invoked for a watched file, a counter is increased every time a block smaller than 128k is read (this is considered a small read). When this counter reaches 1024, buffering gets activated for the file. First, a 256 kB buffer is allocated in kernel memory, filled from the current file position and returned to user space in small chunks as requested. The buffer gets re-filled automatically whenever it is necessary. After 1024 more small reads, the buffer size is increased to 1 MB, and later to 4 MB. On every read call which just returns data from the buffer instead actually performing a file system read, the file pointer is advanced as necessary via lseek (which should only affect internal kernel data structures and result in no actual I/O), thus mimicking a default read call.

Any call to open, close, lseek or write will de-activate read buffering. In the case of lseek, counting of small reads starts from the beginning.

Implementation details

There are three states a file can be in:

unwatched: This is the default.
watched: The transition from unwatched to watched happens in open if the file meets certain requirements. A file is being watched if hash_watcher[hash].file_pointer is equal to the file pointer referenced in the currently executing system call hook. If it’s not NULL, but different from the current file pointer, there’s a hash collision and your file is not invited to the party. If a hash slot is unused, its file_pointer is NULL.
buffered: A file is being buffered if it is being watched and hash_watcher[hash].accelerator is not NULL.

At the beginning of every hook function, the 16 bit file hash is determined by looking at the file descriptor and the open file descriptor table of the current process.

`init`

clear all hash slots

`open`

if there’s already a different file being watched on the same hash slot, return and do nothing
if the file is worth being watched, reset the hash watcher slot and set hash_watcher[hash].file_pointer to out file pointer, thus marking the file as being watched

`close`

if the file is being watched, reset its watcher slot

`lseek` and `write`

if the file is being watched, rewind its watcher (reset stage and disable buffering)

Memory footprint

In addition to the module code, there’s a fixed minimum RAM usage of 1.5 MiB for the 64k hash watcher slots.

For every file which is buffered, a full page is allocated for every r_fd_accelerator structure, which is only 32 bytes, while a page is 4096 bytes on most systems. This means that 99% of the page is wasted. With MAX_ACCELERATORS set to 256, this means that no more than 1.0 MiB is allocated for all accelerators, and we could reduce this number to 8 kb by replacding the call to vmalloc with a simple fixed-size table.

By default, buffers come in three sizes: 256k, 1M and 4M, resulting in a maximum memory usage of 1 GiB for buffers if 256 files are simultaneously buffered with 4M.

Installation

Before you can compile the module, you must save sys_call_table.template.h as sys_call_table.h and set SYS_CALL_TABLE to the correct value. Now type make.

To do list

it is unclear what happens when the module is unloaded and hooked syscalls, especially reads, are still pending
it is unclear what happens when a system call hook is interrupted and restarted (especially when a buffer is allocated in the read hook)
don’t use vmalloc() to allocate accelerators, manage a fixed array of accelerator slots to save memory
add support for writing
test and handle dup and dup2
maybe don’t bump buffer size until it’s used up (otherwise, we could end up reading the same data multiple times)
if a process which opened a file dies, we won’t notice that the file gets closed in background, thus piling up stale file watchers
there are many ways a file can be manipulated, don’t assume we know the file position for example, query it inside the read function, don’t rely on data remaining consistent between system calls
Are the hooks thread safe? Probably, but maybe not yet.

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.gitignore		.gitignore
Makefile		Makefile
README.textile		README.textile
diob.c		diob.c
sys_call_table.template.h		sys_call_table.template.h

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Results

Alternatives

Strategy

Implementation details

`init`

`open`

`close`

`lseek` and `write`

Memory footprint

Installation

To do list

About

Releases

Packages

Languages

specht/diob_lkm

Folders and files

Latest commit

History

Repository files navigation

Introduction

Results

Alternatives

Strategy

Implementation details

init

open

close

lseek and write

Memory footprint

Installation

To do list

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

`init`

`open`

`close`

`lseek` and `write`

Packages