Support fine grained activation sharding. #881
Conversation
Thanks! Some questions about the API...
axlearn/common/attention.py
Outdated
x = self.norm(inputs)
x = maybe_shard(x, cfg.premlp_partition_spec)
Do we need this? I suppose norm usually does not change the partition spec?
We want to change the spec post-norm so that the norm's spec does not propagate to MLP linear1.
So for sequence parallel we would set premlp_partition_spec = ((fsdp, data), None, None) to force an all-gather on the sequence dimension.
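For illustration only (not the PR's exact code), a minimal sketch of a maybe_shard-style helper and the two specs described above; the helper and the axis names are assumptions drawn from this thread, and applying the constraint requires a JAX mesh context so the PartitionSpec can be resolved.

# Sketch only: apply a sharding constraint when a spec is configured.
from typing import Optional

import jax
from jax.sharding import PartitionSpec


def maybe_shard(x: jax.Array, partition_spec: Optional[tuple]) -> jax.Array:
    if partition_spec is None:
        return x
    return jax.lax.with_sharding_constraint(x, PartitionSpec(*partition_spec))


# Pre-norm (sequence parallel): sequence dimension sharded over the tensor-parallel axis.
prenorm_partition_spec = (("fsdp", "data"), "model", None)
# Pre-MLP: drop the sequence sharding so an all-gather of the sequence dimension is
# inserted before linear1, rather than the norm's spec propagating into the matmul.
premlp_partition_spec = (("fsdp", "data"), None, None)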
Thanks for the explanation. Maybe we can add output_partition_spec to RMSNorm.Config to be consistent with Linear.Config. This is also more flexible, as RMSNorm can be used in other places. WDYT?
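A rough sketch of that suggestion, using a plain dataclass and function rather than AXLearn's config machinery; only the field name follows the comment above, everything else is an assumption.

# Sketch only: an RMSNorm-style config carrying an output_partition_spec.
import dataclasses
from typing import Optional

import jax
import jax.numpy as jnp
from jax.sharding import PartitionSpec


@dataclasses.dataclass
class RMSNormConfig:
    eps: float = 1e-6
    # If not None, how to partition the normalized output, e.g. (("fsdp", "data"), None, None).
    output_partition_spec: Optional[tuple] = None


def rms_norm(cfg: RMSNormConfig, x: jax.Array, scale: jax.Array) -> jax.Array:
    out = x * jax.lax.rsqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + cfg.eps) * scale
    if cfg.output_partition_spec is None:
        return out
    # Constrain the output so downstream layers do not inherit the norm's sharding.
    return jax.lax.with_sharding_constraint(out, PartitionSpec(*cfg.output_partition_spec))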
Yes, I actually originally had this! I have revised to take this approach.
axlearn/common/attention.py
Outdated
# If not None, how to partition pre attention activation values.
preattention_partition_spec: Optional[tuple[Optional[str]]] = None
# If not None, how to partition post attention activation values.
postattention_partition_spec: Optional[tuple[Optional[str]]] = None
How is this different from setting attn_layer.output_linear.param_partition_spec = (fsdp_axis_names, tp_axis_names, None) (axlearn/common/attention.py, line 3319 in 2134a25)?
+1. I also have a high-level question: shouldn't the sharding config be auto-derived and optimized by XLA for most of the intermediate activations, based on weights, outputs, etc.?
I'd like to see a concrete example to understand why explicit sharding is needed here.
OpenXLA ShardingPropagation often makes non-optimal propagation decisions. The SPMD partitioner then partitions based on whatever HloSharding was propagated to each HloInstruction, so if the propagation is not optimal you will see collectives such as all-to-alls generated to resolve the sharding conflicts.
For example, if you set the partition spec for premlp or preattention to ((fsdp, data), model, None) to force the computation to run in sequence parallel, the ShardingPropagation pass will propagate that HloSharding down to the matmul, causing an all-to-all collective.
This is less optimal than an all-gather of the sequence dimension before the matmul.
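A minimal, self-contained sketch of that placement (not AXLearn code). Assumptions: a 1-D "model" mesh, toy shapes, and a sequence length divisible by the device count. The first constraint keeps the norm sequence-parallel; the second drops the sequence sharding right before the matmul so an all-gather, rather than an all-to-all, resolves the specs.

# Sketch only: sequence-parallel norm, then an explicit un-sharded spec before the projection.
import numpy as np

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

mesh = Mesh(np.array(jax.devices()), ("model",))


def norm_then_project(x, w1):
    # Sequence parallel: shard the sequence dimension over the "model" axis.
    x = jax.lax.with_sharding_constraint(x, NamedSharding(mesh, PartitionSpec(None, "model", None)))
    x = x * jax.lax.rsqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + 1e-6)  # RMS-norm-like op
    # Pre-MLP/pre-attention spec: un-shard the sequence dimension before the matmul.
    x = jax.lax.with_sharding_constraint(x, NamedSharding(mesh, PartitionSpec(None, None, None)))
    return jnp.einsum("bsd,df->bsf", x, w1)


x = jnp.ones((2, 8, 16))
w1 = jnp.ones((16, 32))
y = jax.jit(norm_then_project)(x, w1)  # runs on however many devices are available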
@@ -814,8 +821,14 @@ def _create_layer_parameter_specs(self) -> dict[str, ParameterSpec]:
        )

    def forward(self, x: Tensor) -> Tensor:
        cfg = self.config
Same question here: can we handle this inside the XLA compiler directly instead of making it explicit here? Adding this kind of fine-grained sharding spec is a divergence from the GSPMD programming paradigm, imho.
Responded above with a concrete example. The main issues we are trying to solve here are:
- We want Norms/Dropouts/Residuals to be computed in sequence parallel, i.e. partitioned along the sequence dimension, to avoid redundant compute across tensor-parallel workers.
- OpenXLA ShardingPropagation often makes non-optimal decisions. By inserting these partition specs, they show up in the HLO as custom-call (Sharding); this guides ShardingPropagation toward better decisions, since it prioritizes propagating user-provided partition specs (https://github.com/openxla/xla/blob/main/xla/service/sharding_propagation.h). See the sketch after this list.
- Those non-optimal decisions, such as propagating sequence-level sharding to the left-hand side of the attention QKV projection or the MLP up projection, cause the SPMD partitioner to insert all-to-alls, collective-permutes, or involuntary full rematerialization.
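A tiny sketch (toy shapes; the 1-D "model" mesh is an assumption) of inspecting how a user-provided constraint lands in the lowered module; it is typically recorded as a @Sharding custom-call, which is what ShardingPropagation prioritizes.

# Sketch only: lower a constrained function and inspect the module text.
import numpy as np

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

mesh = Mesh(np.array(jax.devices()), ("model",))


def constrained(x):
    # User-provided spec: first dimension sharded over the "model" axis.
    return jax.lax.with_sharding_constraint(x, NamedSharding(mesh, PartitionSpec("model", None)))


print(jax.jit(constrained).lower(jnp.ones((8, 4))).as_text())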
I am curious why Axlearn has an activation-level sharding spec in use today for linear2.output_partition_spec.
Was this added to also address non-optimal SPMD partitioning or propagation?
axlearn/common/attention_test.py
Outdated
cfg_layer.self_attention.attention.input_linear = self_attention_input_linear_cfg
cfg_layers = [cfg_layer, cfg_layer]

cfg_layer.self_attention.prenorm_partition_spec = (fsdp_axis_names, tp_axis_names, None,)
Interesting. Is this the real use case you are testing nowadays?
Yes, this is the real config for sequence parallel. I have confirmed that with these specs and internal Neuron changes we can generate partitioned HLOs where Norms/Residuals/Dropouts are partitioned along the sequence dimension and matmuls along the model dimension, without any unnecessary collectives such as all-to-alls, collective-permutes, or involuntary full rematerialization.
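An illustrative stand-in for that kind of config (SimpleNamespace plays the role of the layer config; the field paths and axis names are assumptions based on the snippets in this thread, not verbatim AXLearn config):

# Illustrative stand-in only (not real AXLearn config objects).
from types import SimpleNamespace

fsdp_axis_names = ("fsdp", "data")
tp_axis_names = "model"

cfg_layer = SimpleNamespace(self_attention=SimpleNamespace(), feed_forward=SimpleNamespace())
# Norms/Residuals/Dropouts run sequence-parallel: sequence dim sharded on the TP axis.
cfg_layer.self_attention.prenorm_partition_spec = (fsdp_axis_names, tp_axis_names, None)
# Matmul inputs drop the sequence sharding so only an all-gather is inserted.
cfg_layer.feed_forward.premlp_partition_spec = (fsdp_axis_names, None, None)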
Thanks. Please re-request review when it's ready.
return x
return with_sharding_constraint(x, PartitionSpec(*partition_spec))


pre-commit wanted these blank lines. I ran pre-commit run -a.
I have revised the PR based on your comments. I removed "WIP", as the PR is now production-ready. I removed the Neuron-specific logic; I think that should be in a separate PR, and this PR should enable hardware-agnostic activation sharding. I ran pre-commit and pytest and both pass. Please take a look! Thank you.
Thanks! Are there tests that can be added for the new logic?
@ruomingp What test do you envision? What should I assert? This is effectively just putting with_sharding_constraint(tensor, PartitionSpec('fsdp', 'model')) behind a config.
@ruomingp I could add a test where I create an RMSNorm, set the partition specs in the config, and then assert they are the same specs, but I am not sure what benefit that adds. I do not see tests like that for other configs, as that would be a setter/getter test.
How about patching
Do you have an example of this?
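For reference, one shape such a patch-based test could take (a sketch, not the PR's actual tests; it assumes the patching suggestion refers to the sharding-constraint helper, and uses a local maybe_shard stand-in rather than the real layer):

# Sketch only: patch jax.lax.with_sharding_constraint and assert the configured
# spec was requested. A real test would patch wherever the layer imports it from.
from unittest import mock

import jax
import jax.numpy as jnp
from jax.sharding import PartitionSpec


def maybe_shard(x, partition_spec):  # stand-in for the PR's helper
    if partition_spec is None:
        return x
    return jax.lax.with_sharding_constraint(x, PartitionSpec(*partition_spec))


def test_spec_is_applied():
    spec = (("fsdp", "data"), None, None)
    x = jnp.ones((2, 8, 16))
    with mock.patch("jax.lax.with_sharding_constraint", side_effect=lambda t, s: t) as m:
        maybe_shard(x, spec)
    m.assert_called_once_with(x, PartitionSpec(*spec))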
@ruomingp @kelvin-zou Thank you!
@ruomingp @kelvin-zou Thank you for the approval!!
Hello, this is a draft PR to support fine-grained activation sharding. All specs default to None, so this PR maintains all existing behavior of the Axlearn codebase.
I am requesting feedback and comments on the approach.
With this PR the user can control more closely how GSPMD partitions activations, allowing more control over the collectives generated by the SPMD partitioner. For example, GSPMD will generate all-to-alls around the gather operation on the vocab table, but with this PR the user can annotate partition specs to generate a reduce-scatter instead.
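As an illustration of that vocab-table case, a sketch under assumptions (1-D "model" mesh, toy sizes, vocab size divisible by the device count; not AXLearn code): shard the table over the vocab dimension and pin the gathered activations to a sequence-parallel spec, which nudges the partitioner toward a reduce-scatter rather than all-to-alls around the gather.

# Sketch only: annotate the embedding lookup output instead of leaving it to propagation.
import numpy as np

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec

mesh = Mesh(np.array(jax.devices()), ("model",))


def embed(tokens, table):
    # Vocab-parallel table: vocab dimension sharded over the "model" axis.
    table = jax.lax.with_sharding_constraint(table, NamedSharding(mesh, PartitionSpec("model", None)))
    out = jnp.take(table, tokens, axis=0)  # [batch, seq, emb] lookup
    # Annotate the lookup output: sequence-parallel rather than replicated.
    return jax.lax.with_sharding_constraint(out, NamedSharding(mesh, PartitionSpec(None, "model", None)))


tokens = jnp.zeros((2, 8), dtype=jnp.int32)
table = jnp.ones((32000, 16))
print(jax.jit(embed).lower(tokens, table).as_text())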
Thank you!