I was doing some profiling of my data pipeline and found that the `Batch` transformation was a severe bottleneck. Here are the critical lines in `operations.py`:
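Roughly, the stacking branch looks like this (a paraphrase rather than an exact quote; `SharedMemoryArray` and its `metadata` return value are as I recall them from the source):

```python
# Inside BatchOperation's batching code (approximately):
def stacking_function(*args):
  if not self._use_shared_memory:
    # Private-memory path: np.stack allocates the batch in the worker's
    # private heap; a later CopyNumPyArrayToSharedMemory transform copies
    # it a second time into shared memory.
    return np.stack(args)
  # Shared-memory path: allocate the batch buffer in shared memory and
  # stack directly into it, so the data is copied exactly once.
  first = np.asanyarray(args[0])
  shm_array = SharedMemoryArray((len(args), *first.shape), first.dtype)
  np.stack(args, out=shm_array)
  return shm_array.metadata
```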
I found that `self._use_shared_memory == True` iff you used the deprecated `grain.BatchOperation`, rather than the "recommended" `grain.Batch`. And what do you know, switching to `grain.BatchOperation` gave me a 3x increase in throughput! This matches my intuition: in the `self._use_shared_memory == True` branch, there is only one copy, made directly into shared memory. But in the `self._use_shared_memory == False` branch, `np.stack` makes one copy into private memory, and the later `CopyNumPyArrayToSharedMemory` transform then performs an explicit second copy into shared memory. It's not too surprising that copying all of the pipeline's data an extra time slows things down significantly.
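For concreteness, the switch that gave the 3x speedup is a one-line change in the `DataLoader` setup (the data source, sampler, batch size, and worker count below are just placeholders for my pipeline):

```python
import grain.python as grain

# Recommended API: goes through the np.stack-into-private-memory path,
# followed by CopyNumPyArrayToSharedMemory (two copies per batch).
loader_slow = grain.DataLoader(
    data_source=source,
    sampler=sampler,
    operations=[grain.Batch(batch_size=256, drop_remainder=True)],
    worker_count=16,
)

# Deprecated API: batches straight into shared memory (one copy per batch).
loader_fast = grain.DataLoader(
    data_source=source,
    sampler=sampler,
    operations=[grain.BatchOperation(batch_size=256, drop_remainder=True)],
    worker_count=16,
)
```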
Here comes the real problem -- I want to use grain through airio, which doesn't go through the standard `DataLoader` but through the much more complex `lazy_dataset` API. In `lazy_dataset`, batching goes through a different code path that has no option to enable this optimization: it always batches into private memory, and the `MultiprocessPrefetchLazyIterDataset` then makes a second copy into shared memory.
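To make the cost concrete, here is a standalone sketch of what the two paths amount to per batch (plain numpy and multiprocessing.shared_memory, not grain code):

```python
import numpy as np
from multiprocessing import shared_memory

elements = [np.ones((128, 128), dtype=np.float32) for _ in range(256)]
shape = (len(elements), *elements[0].shape)
nbytes = elements[0].nbytes * len(elements)

# Two-copy path (lazy_dataset today): batch into private memory, then the
# prefetch layer copies the finished batch into shared memory.
private_batch = np.stack(elements)                                     # copy 1
shm_a = shared_memory.SharedMemory(create=True, size=nbytes)
np.ndarray(shape, np.float32, buffer=shm_a.buf)[...] = private_batch   # copy 2

# One-copy path (BatchOperation with _use_shared_memory=True): stack the
# elements directly into a shared-memory buffer.
shm_b = shared_memory.SharedMemory(create=True, size=nbytes)
np.stack(elements, out=np.ndarray(shape, np.float32, buffer=shm_b.buf))  # single copy

for shm in (shm_a, shm_b):
    shm.close()
    shm.unlink()
```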
I manually added a (slightly hacky) solution that enables batching directly into shared memory iff the batch operation is a parent of a `MultiprocessPrefetchLazyIterDataset`. With it, I saw a significant performance increase when using grain through airio. Is this something that could be upstreamed into grain?
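The idea is roughly the following; this is a simplified sketch rather than the actual diff, and the `parents` attribute, class name, and flag are approximations of the `lazy_dataset` internals:

```python
# Simplified sketch of the idea, not the actual patch.
def _enable_shared_memory_batching(ds) -> None:
  """Walks upstream from a MultiprocessPrefetchLazyIterDataset and flips a
  flag on any batching dataset so it stacks directly into shared memory."""
  for parent in getattr(ds, "parents", ()):
    if type(parent).__name__ == "BatchLazyMapDataset":  # hypothetical class name
      parent._use_shared_memory = True                  # hypothetical flag
    _enable_shared_memory_batching(parent)

# Called on the parent pipeline when constructing a
# MultiprocessPrefetchLazyIterDataset, so the optimization only kicks in when
# the copy into shared memory would otherwise happen anyway.
```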