Batching into shared memory is deprecated, but essential for performance #492

Open
kvablack opened this issue Jun 28, 2024 · 3 comments

@kvablack

I was doing some profiling of my data pipeline and found that the Batch transformation was a severe bottleneck. Here are the critical lines in operations.py:

def stacking_function(*args):
  first_arg = np.asanyarray(args[0])
  shape, dtype = (len(args),) + first_arg.shape, first_arg.dtype
  if not self._use_shared_memory or dtype.hasobject:
    return np.stack(args)
  return np.stack(args, out=SharedMemoryArray(shape, dtype=dtype)).metadata

I found that self._use_shared_memory == True iff you used the deprecated grain.BatchOperation, rather than the "recommended" grain.Batch. And what do you know, switching to grain.BatchOperation gave me a 3x increase in throughput! This matches up with my intuition, because in the self._use_shared_memory == True branch, there is only one copy that goes directly into shared memory. But in the self._use_shared_memory == False branch, the np.stack will induce one copy into private memory, and then the later CopyNumPyArrayToSharedMemory transform performs an explicit second copy into shared memory. It's not too surprising that adding another copy of all of the pipeline's data could slow things down significantly.
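To make the extra copy concrete, here is a minimal standalone sketch of the two paths. It uses Python's multiprocessing.shared_memory as a stand-in for grain's internal SharedMemoryArray (an assumption about what that class wraps), and toy data in place of a real pipeline:

import numpy as np
from multiprocessing import shared_memory

# A toy batch of 32 elements, each a 128x128 float32 array.
elements = [np.random.rand(128, 128).astype(np.float32) for _ in range(32)]
shape = (len(elements),) + elements[0].shape
dtype = elements[0].dtype
nbytes = int(np.prod(shape)) * dtype.itemsize

# Two-copy path (grain.Batch followed by CopyNumPyArrayToSharedMemory):
# stack into ordinary private memory, then copy the result into shared memory.
private_batch = np.stack(elements)  # copy 1
shm_a = shared_memory.SharedMemory(create=True, size=nbytes)
shared_view = np.ndarray(shape, dtype=dtype, buffer=shm_a.buf)
np.copyto(shared_view, private_batch)  # copy 2

# One-copy path (the deprecated grain.BatchOperation with shared memory):
# stack directly into a shared-memory-backed output array.
shm_b = shared_memory.SharedMemory(create=True, size=nbytes)
shared_out = np.ndarray(shape, dtype=dtype, buffer=shm_b.buf)
np.stack(elements, out=shared_out)  # copy 1, and done

for shm in (shm_a, shm_b):
  shm.close()
  shm.unlink()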

Here comes the real problem -- I want to use grain through airio, which doesn't go through the standard DataLoader but through the much more complex lazy_dataset API. In lazy_dataset, batching is done through a different code path that has no option to enable this optimization: it always batches into private memory, and then the MultiprocessPrefetchLazyIterDataset does a second copy into shared memory.

I manually added a (slightly hacky) solution that enables batching directly into shared memory iff the batch operation is a parent of a MultiprocessPrefetchLazyIterDataset. Indeed, I saw a significant performance increase when using grain through airio. Is this something that could possibly be upstreamed into grain?
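For context, the shape of that hack is roughly the sketch below. Every name in it (the _Node class, the parents list, the use_shared_memory flag) is a hypothetical placeholder for whatever grain's lazy_dataset classes actually expose, not the real API; it only illustrates "walk upward from the prefetch node and flip the flag on any batch node it finds":

# Hypothetical sketch only: class and attribute names are placeholders
# for grain's real lazy_dataset API, which the issue proposes changing.
class _Node:
  def __init__(self, parents=(), is_batch=False):
    self.parents = list(parents)
    self.is_batch = is_batch
    self.use_shared_memory = False  # placeholder for the proposed flag

def enable_shared_memory_batching(prefetch_node):
  """Walks the dataset tree under the multiprocess-prefetch node and flips
  the (hypothetical) flag on every batch node, so batches are stacked
  straight into shared memory and the later per-worker copy is skipped."""
  stack = list(prefetch_node.parents)
  while stack:
    node = stack.pop()
    if node.is_batch:
      node.use_shared_memory = True
    stack.extend(node.parents)

# Toy usage: batch -> prefetch.
batch = _Node(is_batch=True)
prefetch = _Node(parents=[batch])
enable_shared_memory_batching(prefetch)
assert batch.use_shared_memory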

@quanvuong

+1

@Mddct

Mddct commented Jul 19, 2024

+1

@mhyatt000

+1
