
[Performance] Faster PrioritizedSliceSampler._padded_indices #2433

Merged: 1 commit into pytorch:main, Sep 12, 2024

Conversation

@kurtamohler (Collaborator) commented Sep 12, 2024

Description

Speeds up PrioritizedSliceSampler._padded_indices by roughly 2x (up to ~2.8x on the benchmark below).

Running the performance script given in #2431 (comment) on my machine gives the following:

time: 7.804846733536882 ms
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      900    2.739    0.003    3.114    0.003 samplers.py:1845(_padded_indices)
      900    0.197    0.000    6.895    0.008 samplers.py:1907(sample)
      900    0.067    0.000    0.342    0.000 samplers.py:473(sample)

This is a speedup of 22.235 / 7.804 ≈ 2.8x over the time measured before this change.

That said, the runtime per sample call sometimes reaches 11.8 ms, which is still a ~1.8x speedup.
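For reference, the numbers above come from profiling repeated sample() calls with cProfile. A minimal sketch of such a harness follows; the real script lives in #2431, and `sampler`, `storage`, and the batch size here are assumed placeholders set up as in that issue:

    import cProfile
    import pstats

    # Hypothetical harness: `sampler` is a PrioritizedSliceSampler and
    # `storage` a replay-buffer storage, both configured as in the #2431
    # script (assumed, not shown here).
    prof = cProfile.Profile()
    prof.enable()
    for _ in range(900):  # 900 calls, matching the ncalls column above
        sampler.sample(storage, 320)  # batch size is an assumed placeholder
    prof.disable()
    pstats.Stats(prof).sort_stats("tottime").print_stats(10)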

Motivation and Context

Closes #2431.

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

What types of changes does your code introduce? Remove all that do not apply:

  • Bug fix (non-breaking change which fixes an issue)

Checklist

Go over all the following points, and put an x in all the boxes that apply.
If you are unsure about any of these, don't hesitate to ask. We are here to help!

  • I have read the CONTRIBUTION guide (required)
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.

pytorch-bot (bot) commented Sep 12, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2433

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 7 Unrelated Failures

As of commit 98b369b with merge base fb9cc2c:


👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), Sep 12, 2024

@kurtamohler (Collaborator, Author) commented Sep 12, 2024 on this line of the old code:

    @implement_for("torch", None, "2.4")

On my machine, both of the previous implementations had roughly the same performance.

@kurtamohler (Collaborator, Author) commented Sep 12, 2024 on this hunk (the flip-based return replaced by the new loop):

    -    pad = nt.to_padded_tensor(-1).flip(-1).flip(0)
    -    return pad
    +    for pad_row, group_start, group_end, pad_len in zip(

It would be nice to be able to get rid of this for loop, but I don't see a good way to do it. I think indexing differently sized ranges under each row of a tensor is only possible if you list out all the indices, as with torch.index_select.

If torch.nn.utils.rnn.pad_sequence supported left padding, we could just use that, but it doesn't.
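As an aside, left padding can be emulated with pad_sequence by reversing each sequence, right-padding, and reversing back, though that reintroduces exactly the flip copies this PR removes. A sketch on the example from _padded_indices's docstring (quoted later in this thread); newer PyTorch releases have reportedly since added a padding_side argument to pad_sequence:

    import torch
    from torch.nn.utils.rnn import pad_sequence

    groups = [torch.arange(5), torch.arange(5, 8), torch.arange(8, 12)]
    # Right-pad the reversed groups, then reverse each row back: left padding.
    left_padded = pad_sequence(
        [g.flip(0) for g in groups], batch_first=True, padding_value=-1
    ).flip(-1)
    # tensor([[ 0,  1,  2,  3,  4],
    #         [-1, -1,  5,  6,  7],
    #         [-1,  8,  9, 10, 11]])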

@kurtamohler (Collaborator, Author) commented Sep 12, 2024

FWIW, I did try getting rid of the for loop by listing out all the indices and using index_copy_, like so:

    shapes = shapes.flatten()
    num_groups = shapes.shape[0]
    max_group_len = shapes.max()
    # Each group is padded on the left, so group k's elements are shifted
    # right by the total padding of groups 0..k (inclusive cumulative sum
    # of the pad lengths).
    pad_lengths = max_group_len - shapes
    pad_before_groups = pad_lengths.cumsum(0)
    pad_before_indices = torch.repeat_interleave(pad_before_groups, shapes)
    # Destination index of each element of `arange` in the flat padded output.
    indices = (
        torch.arange(arange.shape[0], dtype=arange.dtype, device=arange.device)
        + pad_before_indices
    )
    # Fill with -1, scatter the valid indices into place, reshape to 2-D.
    p = torch.full(
        (num_groups * max_group_len,),
        -1,
        dtype=arange.dtype,
        device=arange.device,
    )
    p.index_copy_(0, indices, arange)
    pad = p.reshape((num_groups, max_group_len))

It worked, but it gave very similar performance to the old implementations.
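For concreteness, that approach condensed into a self-contained check on the docstring example (the shapes/arange inputs are assumed; -1 marks padding):

    import torch

    shapes = torch.tensor([5, 3, 4])  # group lengths from the docstring example
    arange = torch.arange(12)         # the flat, unpadded indices 0..11

    max_group_len = int(shapes.max())
    pad_lengths = max_group_len - shapes                  # [0, 2, 1]
    # Destination of each element: shifted right by the cumulative padding.
    indices = arange + torch.repeat_interleave(pad_lengths.cumsum(0), shapes)
    p = torch.full((shapes.numel() * max_group_len,), -1, dtype=arange.dtype)
    p.index_copy_(0, indices, arange)
    print(p.reshape(shapes.numel(), max_group_len))
    # tensor([[ 0,  1,  2,  3,  4],
    #         [-1, -1,  5,  6,  7],
    #         [-1,  8,  9, 10, 11]])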

A Contributor commented

IIRC, torch.nn.utils.rnn.pad_sequence left padding was mentioned during a review not long ago.
cc @mikaylagawarecki

The old implementation under discussion:

    def _padded_indices(self, shapes, arange) -> torch.Tensor:
        # this complex mumbo jumbo creates a left padded tensor with valid indices on the right, e.g.
        # tensor([[ 0,  1,  2,  3,  4],
        #         [-1, -1,  5,  6,  7],
        #         [-1,  8,  9, 10, 11]])
        # where the -1 items on the left are padded values
        st, off = torch._nested_compute_contiguous_strides_offsets(shapes.flip(0))
        nt = torch._nested_view_from_buffer(
            arange.flip(0).contiguous(), shapes.flip(0), st, off
        )
        pad = nt.to_padded_tensor(-1).flip(-1).flip(0)
        return pad
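For readers unfamiliar with those private calls: they build a jagged (nested) view over the flipped arange buffer so that to_padded_tensor can pad it, and since nested tensors pad on the right, the buffer and the result are flipped to turn that into left padding. The right-padding behavior is visible with the public nested-tensor API (a sketch, not the torchrl code):

    import torch

    # Nested tensors pad at the end of each row, i.e. on the right.
    nt = torch.nested.nested_tensor([torch.arange(5), torch.arange(5, 8)])
    print(torch.nested.to_padded_tensor(nt, -1))
    # tensor([[ 0,  1,  2,  3,  4],
    #         [ 5,  6,  7, -1, -1]])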
@kurtamohler (Collaborator, Author) commented Sep 12, 2024

It seems that the main thing the new implementation improves upon is that it doesn't need these torch.flip operations, and the time it saves on those seems to outweigh the overhead of the for loop.
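Pieced together from the diff fragments quoted in this thread, the flip-free approach looks roughly like the following (a sketch with assumed names and details; the merged code may differ):

    import torch

    def padded_indices_loop(shapes: torch.Tensor, arange: torch.Tensor) -> torch.Tensor:
        # Preallocate the padded output, then copy each group into the
        # right-hand end of its row, leaving -1 padding on the left.
        # No flips required.
        shapes = shapes.flatten()
        max_group_len = int(shapes.max())
        pad = arange.new_full((shapes.numel(), max_group_len), -1)
        group_ends = shapes.cumsum(0)
        group_starts = group_ends - shapes
        pad_lens = max_group_len - shapes
        for pad_row, group_start, group_end, pad_len in zip(
            pad, group_starts, group_ends, pad_lens
        ):
            pad_row[pad_len:] = arange[group_start:group_end]
        return pad

On the docstring example (shapes [5, 3, 4], arange 0..11), this reproduces the left-padded tensor shown above.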

@kurtamohler (Collaborator, Author) commented

Something interesting shows up in the cProfile results I got. Before this PR, _padded_indices took 11.616 s of cumulative time and sample about 20.767 s. After this PR, _padded_indices is at 3.114 s and sample at 6.895 s.

_padded_indices decreased by 8.502 s, while sample decreased by 13.872 s. I would have expected all of the time savings to come from _padded_indices, but apparently something outside of that function also gained a performance boost.

I would guess the reason is that _padded_indices now outputs a contiguous tensor, whereas it used to be noncontiguous, so operations performed on the output can be a little faster.
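That hypothesis is at least consistent with how PyTorch behaves in general: many kernels have fast paths for contiguous inputs, while operations on non-contiguous tensors may first have to gather elements across strides. A generic illustration (not a measurement of the torchrl code):

    import torch

    x = torch.arange(12).reshape(3, 4)
    y = x.t()                  # a transposed view: same storage, swapped strides
    print(y.is_contiguous())   # False: row elements are no longer adjacent in memory
    z = y.contiguous()         # copies the data into row-major order
    print(z.is_contiguous())   # True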

@kurtamohler (Collaborator, Author) commented

One other thing: I tried to measure CUDA performance, but changing the device in my script (#2431 (comment)) to 'cuda' still ended up calling _padded_indices with CPU tensors. I'm not sure whether CUDA is just not supported in PrioritizedSliceSampler or whether something is wrong with how I set the device in the script.

@vmoens (Contributor) left a review comment

LGTM, thanks for taking care of this!


@vmoens merged commit 361b763 into pytorch:main on Sep 12, 2024 (59 of 71 checks passed).

@vmoens added the performance label (performance issue or suggestion for improvement), Sep 12, 2024.
Labels: CLA Signed, performance

Linked issue closed by this PR: [BUG] PrioritizedSliceSampler._padded_indices is slow (#2431)