This package contains a CLOCK based approximate-LRU cache optimized for concurrent usage. It minimizes both the number of locks held, and the number of writes to shared cache lines. In the worst case, this implementation requires at most 1 shared-cache write on a hit, not counting lock acquisition.
Typically, LRU is implemented with a map and a doubly-linked list, all elements are stored in both the map, and the list. Whenever there's a hit on a cache-item, it's moved to the head of the list; evictions are done from the tail of the list. This ensures that it's always the least-recently-hit element that's evicted.
While conceptually simple, the standard implementation of LRU as significant issues in a multi-processor setting, since every read to the cache writes to the underlying datastructures. Updating the LRU list requires at least 4 writes on shared cache lines; 2 to unlink the element from it's old place in the list, and 2 more to add it to the head. When the LRU is being accessed from multiple cores, these cache lines will bounce among them, causing a significant slowdown.
The CLOCK algorithm approximates LRU. Conceptually, it augments a map with a
bitset, with each element owning one bit in the bitset. Whenever an element is
hit, it's bit is set to 1
. The evictor maintains a cursor into the bitset, and
reads it round-robin.
eviction cursor
v
bitset: 0 1 0 0 1 1 0 1 0 0 1 1 1 1 0
Whenever the evictor sees a 1
it sets the bit to 0
.
Whenever the evictor see a 0
, it evicts the element.
This means that eviction always evicts an element that hasn't been read since
the last time the evictor ran, and, since it reads the element in order, the
first such element is the element that hasn't been touched the longest since the
last run.
CLOCK two important benefits over regular LRU in a concurrent setting; the state that changes on a hit is per-element instead of shared, and the changes made by a hit are idempotent.
Regular LRU relies on a shared list, which is manipulated every time an element
is hit. CLOCK only requires a per-element bit to be set on a hit. This means
that on a hit, only a single value, and with the proper layout, a single cache
line, needs to be written to. Already, this reduces the worst-case number of
writes to 1 from 4. We can further reduce the number of writes by observing that
for an LRU to be useful, hits must be more common than evictions, and in CLOCK
any successive hits before an eviction need not change any state; the bit is
already set to 1
. This means that we can check the bit before writing it, and
do nothing if it's already set, further reducing the number of writes.
The cache is built off a sharded hashmap; values are first hashed to see which hashmap they should be stored in, then stored in an individual shard. The shards are protected by Read/Write locks, with the read side taken for cache lookups, and the write side for cache inserts and evictions. Eviction locks a single shard at a time, going in round-robin fashion until it finds an element to evict.
Each shard consists of a hashmap and linkedlist. The map is used to perform key
lookups, and the list is used to ensure a consistent order for eviction checks.
The actual elements and staleness-bits are stored in the individual nodes of the
list, there is no separate bitset. List nodes are padded to take a cache line
each to prevent false-sharing. Staleness-bits are set using atomic operations,
and are only written after they have been read to be 0
.