Advanced Topic: Non Aligned Writes
# LISTING 1: THE CRISIS OF NON-ALIGNED WRITES AND ITS RESOLUTION
import numpy as np
from cloudvolume import CloudVolume
vol = CloudVolume('gs://test/test/test/')
vol.bounds # Bbox( [0,0,0], [500, 500, 500] )
vol.chunk_size # (100, 100, 100)
black = np.zeros(shape=(100,100,100), dtype=vol.dtype)
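vol[0:100, 0:100, 0:100] = black # works: the write is chunk aligned
vol[1:101, 0:100, 0:100] = black # fails: non-aligned writes are disabled by default
vol.non_aligned_writes = True # opt in to non-aligned writes
vol[1:101, 0:100, 0:100] = black # works, at the cost of extra downloads and uploads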
Non-aligned writes are disabled in CloudVolume by default. Attempting one fails with a message that encourages you to rethink your course of action before proceeding. Why?
The Precomputed format evenly divides arbitrarily large images into (possibly multichannel) three-dimensional chunks. Each chunk is represented by a single file whose name encodes its bounding box, e.g. "0-10_0-10_0-5". If your writes are chunk aligned, writing data is easy: simply overwrite each existing file or create a new one. Importantly, chunk-aligned writing also means that each file written covers exactly the bounding box encoded in its name, so writes do not overlap and writing a file can be regarded as an atomic operation.
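To make the naming scheme concrete, here is a minimal sketch of how a bounding box maps onto chunk file names and how alignment can be checked. The helpers `is_chunk_aligned` and `chunk_filenames` are hypothetical names written for this page, not part of CloudVolume. The sketch assumes the volume's voxel offset is zero and ignores clipping of chunks at the volume boundary.

```python
from itertools import product

def is_chunk_aligned(start, stop, chunk_size):
    """True if every edge of the box [start, stop) lies on a chunk boundary."""
    return all(
        lo % c == 0 and hi % c == 0
        for lo, hi, c in zip(start, stop, chunk_size)
    )

def chunk_filenames(start, stop, chunk_size):
    """Names of every chunk file that the box [start, stop) touches."""
    ranges = [
        range((lo // c) * c, hi, c)  # chunk origins along one axis
        for lo, hi, c in zip(start, stop, chunk_size)
    ]
    names = []
    for origin in product(*ranges):
        spans = [f"{o}-{o + c}" for o, c in zip(origin, chunk_size)]
        names.append("_".join(spans))
    return names

chunk = (100, 100, 100)
print(is_chunk_aligned((0, 0, 0), (100, 100, 100), chunk))  # True
print(is_chunk_aligned((1, 0, 0), (101, 100, 100), chunk))  # False
print(chunk_filenames((1, 0, 0), (101, 100, 100), chunk))
```

Shifting the write by one voxel along x makes it touch two files instead of one, which is the situation the next paragraph describes.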
However, when there is even a single voxel of deviation from this paradigm, we have to do things differently. You can't write a single voxel to a remote file; instead, you need to upload a whole new file. Since that file could already exist and contain data, you have to download it, paint the voxel into it, and then upload it. In the worst case, where a chunk-sized bounding box is offset along all three axes and extends diagonally into the adjacent chunks, it touches 2 x 2 x 2 = 8 chunks, so you'll incur 8x the I/O both for reading and for writing. However, you only need to download and repaint the chunks in the non-chunk-aligned shell of your bounding box, so as the box grows to wholly contain more chunks, the situation improves.
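The cost argument above can be checked with a little arithmetic. The sketch below counts how many chunks a box touches and how many of those are only partially covered, i.e. need a download as well as an upload. `io_cost` is a hypothetical helper, and a zero voxel offset is again assumed.

```python
def io_cost(start, stop, chunk_size):
    """Return (touched, partial) chunk counts for the box [start, stop)."""
    touched = 1
    full = 1
    for lo, hi, c in zip(start, stop, chunk_size):
        first = lo // c            # index of the first chunk touched
        last = (hi - 1) // c       # index of the last chunk touched
        touched *= last - first + 1
        first_full = -(-lo // c)   # ceil(lo / c): first chunk wholly inside
        end_full = hi // c         # one past the last chunk wholly inside
        full *= max(0, end_full - first_full)
    # Partially covered chunks must be downloaded, painted and re-uploaded.
    return touched, touched - full

chunk = (100, 100, 100)
print(io_cost((0, 0, 0), (100, 100, 100), chunk))     # (1, 0)
print(io_cost((1, 1, 1), (101, 101, 101), chunk))     # (8, 8)
print(io_cost((1, 1, 1), (1001, 1001, 1001), chunk))  # (1331, 602)
```

For the chunk-sized box every touched chunk is partial, while for the larger box fewer than half are, which is the improvement the paragraph refers to.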
On this performance basis alone, it would be advisable to avoid non-chunk-aligned writes. However, things are worse than that. There's a race condition that is best explained by the diagram below.
If two processes try to write to each half of the same chunk at about the same time, only one half of the chunk will be updated: each process downloads the chunk, paints its own half, and uploads the whole file, so whichever upload finishes last overwrites the other's changes.
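The lost update can be reproduced without any cloud storage. The following toy simulation replaces the remote bucket with a dictionary and a chunk with a two-element list (left half, right half). The `download` and `upload` functions are stand-ins invented for illustration, and the interleaving is written out by hand rather than produced by real concurrency.

```python
CHUNK = "0-100_0-100_0-100"
storage = {CHUNK: [0, 0]}  # toy chunk: [left half, right half]

def download(name):
    return list(storage[name])  # a copy, as a real download would be

def upload(name, data):
    storage[name] = list(data)  # replaces the whole file

# Both processes download before either one uploads.
copy_a = download(CHUNK)
copy_b = download(CHUNK)

copy_a[0] = 1  # process A paints the left half
copy_b[1] = 2  # process B paints the right half

upload(CHUNK, copy_a)  # storage now holds [1, 0]
upload(CHUNK, copy_b)  # storage now holds [0, 2]: A's write is lost

print(storage[CHUNK])  # [0, 2]
```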
There are some conditions under which it is safe to consider using non-aligned writes.
- Single Process: There's no chance of an update anomaly. Useful for experimenting with small volumes.
- Careful Design: You have designed your access patterns such that the non-aligned writes will never coincide with each other.
- Locking: It's possible to use a distributed locking mechanism like etcd to ensure that intersecting bounding boxes aren't accessed at the same time.
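As a rough illustration of the locking option, the sketch below serializes the download, paint, upload cycle with one lock per chunk. It is only an in-process analogy: `threading.Lock` stands in for a distributed lock, the dictionary stands in for remote storage, and `locked_paint` is a hypothetical helper. A real deployment across machines would need a service such as etcd, whose API is not shown here.

```python
import threading
from collections import defaultdict

CHUNK = "0-100_0-100_0-100"
storage = {CHUNK: [0, 0]}                  # toy chunk: [left half, right half]
chunk_locks = defaultdict(threading.Lock)  # one lock per chunk name
registry_guard = threading.Lock()          # protects chunk_locks itself

def locked_paint(name, index, value):
    with registry_guard:
        lock = chunk_locks[name]
    with lock:                      # only one writer per chunk at a time
        data = list(storage[name])  # download
        data[index] = value         # paint
        storage[name] = data        # upload

threads = [
    threading.Thread(target=locked_paint, args=(CHUNK, 0, 1)),
    threading.Thread(target=locked_paint, args=(CHUNK, 1, 2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(storage[CHUNK])  # [1, 2]
```

Because the second writer cannot download until the first has uploaded, both halves survive regardless of which thread runs first.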
Assuming your use case falls into one of the three exceptions above and you're willing to incur the performance hit, you can set vol.non_aligned_writes = True to allow non-aligned writes. Listing 1 shows what that looks like.