Implement aten::index | feat(torchlib) #862

Merged
merged 13 commits into from
Jul 17, 2023

Conversation

@justinchuby justinchuby added the topic: torch_lib Related to the torch/aten function lib in development label Jul 12, 2023
@codecov

codecov bot commented Jul 12, 2023

Codecov Report

Merging #862 (fbfcf66) into main (0851f0f) will increase coverage by 0.04%.
The diff coverage is 100.00%.

❗ Current head fbfcf66 differs from pull request most recent head f707360. Consider uploading reports for the commit f707360 to get more accurate results

@@            Coverage Diff             @@
##             main     #862      +/-   ##
==========================================
+ Coverage   76.46%   76.50%   +0.04%     
==========================================
  Files         112      112              
  Lines       13373    13383      +10     
  Branches     1342     1344       +2     
==========================================
+ Hits        10225    10239      +14     
+ Misses       2816     2812       -4     
  Partials      332      332              
| Impacted Files | Coverage Δ |
| --- | --- |
| ...ipt/tests/function_libs/torch_lib/ops_test_data.py | 96.72% <ø> (ø) |
| onnxscript/function_libs/torch_lib/ops/core.py | 76.93% <100.00%> (+0.05%) ⬆️ |
| ...ript/tests/function_libs/torch_lib/extra_opinfo.py | 98.36% <100.00%> (+0.07%) ⬆️ |

... and 1 file with indirect coverage changes

@justinchuby justinchuby added the help wanted Extra attention is needed label Jul 12, 2023


@torch_op("aten::index", trace_only=True)
def aten_index(self: TensorType, indices: Sequence[Optional[INT64]]) -> TensorType:
Contributor

This should work for all cases except for bool mask index, if possible.

Let me know if you can find a bug! @justinchuby

Contributor

We can try op.NonZero to convert bool mask to integer index.
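
As a rough illustration of the NonZero idea, the sketch below uses NumPy's `np.nonzero` as a stand-in for `op.NonZero` (an assumption made so the example runs without ONNX; the helper name `mask_to_indices` is hypothetical and not part of torchlib). It checks that indexing with the converted integer indices selects the same elements as indexing with the boolean mask.

```python
import numpy as np


def mask_to_indices(mask: np.ndarray) -> tuple:
    """Convert a boolean mask into integer index arrays (hypothetical helper)."""
    # np.nonzero returns a tuple holding one int64 array per dimension of `mask`.
    return np.nonzero(mask)


data = np.arange(12).reshape(3, 4)
row_mask = np.array([True, False, True])

by_mask = data[row_mask]                    # boolean mask indexing
by_index = data[mask_to_indices(row_mask)]  # equivalent integer indexing

assert np.array_equal(by_mask, by_index)
print(by_index)
```

Note that ONNX `NonZero` returns a single `[rank, N]` tensor rather than a tuple, so a real implementation would probably need to split or transpose that result before gathering.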

s = 5
test_args = [
([common_methods_invocations.index_variable(2, s, device=device)],),
# ([torch.tensor()],)
Collaborator Author

@justinchuby will add more tests

Contributor

Sharing the snippet I used to test. Testing is complicated because indexing, i.e. data[], lowers to multiple operators, usually a couple of aten.slice calls followed by an aten.index at the end. So you might need to figure out how to construct the proper input state if you are testing aten.index.Tensor alone.

import torch
import onnxruntime
import onnxscript


class IndexModel(torch.nn.Module):
    def forward(self, data, index, index2):
        # return data[..., index, index2]
        # return data[index, :, index2, ...]
        # return data[..., index, index2, :]
        # return data[index, :, index2]
        return data[:, :, index, index2]


data = torch.arange(0, 7 * 3 * 4 * 5 * 6).view(7, 3, 4, 5, 6)
index = torch.tensor([2, 1])
index2 = torch.tensor([[[0]], [[2]]])

model = IndexModel()
pt_output = model(data, index, index2)
print("PyTorch output shape:", pt_output.shape)

export_output = torch.onnx.dynamo_export(model, data, index, index2)
export_output.diagnostic_context.dump("log.sarif")

print(onnxscript.proto2text(export_output.model_proto))
sess = onnxruntime.InferenceSession(
    export_output.model_proto.SerializeToString(), providers=["CPUExecutionProvider"]
)
output = sess.run(None, {"arg0": data.numpy(), "arg1": index.numpy(), "arg2": index2.numpy()})

torch.testing.assert_close(pt_output.numpy(), output[0])

print("PASS")

pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Jul 12, 2023
Needed by 'aten.index.Tensor', where 'indices' is a list of optional
tensors.

Related microsoft/onnxscript#862
Pull Request resolved: #105040
Approved by: https://github.com/titaiwangms, https://github.com/thiagocrepaldi
@justinchuby
Collaborator Author

justinchuby commented Jul 13, 2023

@justinchuby justinchuby mentioned this pull request Jul 13, 2023
@justinchuby justinchuby removed the help wanted Extra attention is needed label Jul 13, 2023
@gramalingam
Collaborator

Is this migrating the logic used by the previous exporter (for this op)? Or, is this different or new?

@BowenBao
Contributor

Is this migrating the logic used by the previous exporter (for this op)? Or, is this different or new?

It is new and hopefully more robust, while the logic is similar.

The old implementation used opset9.GatherElements; at that point GatherND did not exist yet.

@gramalingam
Collaborator

The old implementation used opset9.GatherElements; at that point GatherND did not exist yet.

Oh ... I thought Spandan introduced the new Gather op variants to support the pytorch converter.

@BowenBao
Contributor

Oh ... I thought Spandan introduced the new Gather op variants to support the pytorch converter.

Yes it was used in some other ops, as well as extended edge case support for aten.index after opset11, but there wasn't a full rewrite.

@justinchuby justinchuby deleted the justinchu/tensor-index branch July 17, 2023 17:12
justinchuby added a commit that referenced this pull request Jul 17, 2023
WIP: need more tests

Gather op:
- https://github.com/openxla/xla/blob/main/docs/operation_semantics.md?rgh-link-date=2023-07-13T01%3A09%3A16Z#gather
- https://www.pathpartnertech.com/gather-scatter-operation-in-deep-learning-framework/

---------

Co-authored-by: BowenBao <bowbaomicrosoft.com>

ghstack-source-id: a9912360be34cc91c6a15bd9a09ee70d6c6c04fd
Pull Request resolved: #883
@justinchuby
Collaborator Author

Replaced by #883

justinchuby added a commit that referenced this pull request Jul 17, 2023
---

**This change implements the logic for `aten::index` and adds tests for different nd index combinations and permutations.**

## Understanding `aten::index`

For `arg0` with shape `[7, 3, 4, 5, 6]`
The indexing operation `arg0[0, :, 1:2, tensor([[4,5]])]` will be translated to
```
+>  select: i64[3, 4, 5, 6] = torch.ops.aten.select.int(arg0, 0, 0);
+>  slice_1: i64[3, 4, 5, 6] = torch.ops.aten.slice.Tensor(select, 0, 0, 9223372036854775807);
+>  slice_2: i64[3, 1, 5, 6] = torch.ops.aten.slice.Tensor(slice_1, 1, 1, 2);
+>  index: i64[3, 1, 1, 2, 6] = torch.ops.aten.index.Tensor(slice_2, [None, None, arg1]);
```
Here,
- `indices = [None, None, arg1]` is equivalent to `indices = [None, None, arg1, None]`
- The operation `arg0[0, :, 1:2, tensor([[4,5]])]` is equivalent to `arg0[0, :, 1:2, tensor([[4,5]]), :]`

`None` entries in `indices` act as fillers for dimensions that cannot be removed in the process.
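
The lowering above can be replayed step by step. The sketch below is only an illustration: it uses NumPy as a stand-in for the ATen ops (an assumption made so it runs without PyTorch), and the index values `[[3, 4]]` are an assumption chosen so that every index stays in bounds for the size-5 dimension. Each step is applied separately because NumPy orders the output dimensions differently when a scalar index and an array index appear in one expression.

```python
import numpy as np

# Same input shape as in the description above.
arg0 = np.arange(7 * 3 * 4 * 5 * 6).reshape(7, 3, 4, 5, 6)
# Assumed index tensor of shape (1, 2); values are illustrative only.
arg1 = np.array([[3, 4]])

select = arg0[0]             # aten.select.int(arg0, 0, 0)            -> (3, 4, 5, 6)
slice_1 = select[0:]         # aten.slice.Tensor(select, 0, 0, MAX)   -> (3, 4, 5, 6)
slice_2 = slice_1[:, 1:2]    # aten.slice.Tensor(slice_1, 1, 1, 2)    -> (3, 1, 5, 6)
index = slice_2[:, :, arg1]  # aten.index.Tensor(slice_2, [None, None, arg1])

# Padding the index list with a trailing None (a full slice) gives the same result.
index_padded = slice_2[:, :, arg1, :]

assert index.shape == (3, 1, 1, 2, 6)
assert np.array_equal(index, index_padded)
```

The two leading `None` entries correspond to the two full slices `:, :` that keep the first two dimensions of `slice_2` untouched.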

## Gather op reference

- https://github.com/openxla/xla/blob/main/docs/operation_semantics.md?rgh-link-date=2023-07-13T01%3A09%3A16Z#gather
- https://www.pathpartnertech.com/gather-scatter-operation-in-deep-learning-framework/

---------

Co-authored-by: BowenBao <bowbaomicrosoft.com>

[ghstack-poisoned]
justinchuby added a commit that referenced this pull request Jul 17, 2023
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at
bottom):
* __->__ #883

---

**This change implements the logic for `aten::index` and adds tests for
different nd index combinations and permutations.**

## Understanding `aten::index`

For `arg0` with shape `[7, 3, 4, 5, 6]`
The indexing operation `arg0[0, :, 1:2, tensor([[4,5]])]` will be
translated to
```
+>  select: i64[3, 4, 5, 6] = torch.ops.aten.select.int(arg0, 0, 0);
+>  slice_1: i64[3, 4, 5, 6] = torch.ops.aten.slice.Tensor(select, 0, 0, 9223372036854775807);
+>  slice_2: i64[3, 1, 5, 6] = torch.ops.aten.slice.Tensor(slice_1, 1, 1, 2);
+>  index: i64[3, 1, 1, 2, 6] = torch.ops.aten.index.Tensor(slice_2, [None, None, arg1]);
```
Here,
- `indices = [None, None, arg1]` is equivalent to `indices = [None,
None, arg1, None]`
- The operation `arg0[0, :, 1:2, tensor([[4,5]])]` is equivalent to
`arg0[0, :, 1:2, tensor([[4,5]]), :]`

`None` entries in `indices` act as fillers for dimensions that cannot be removed
in the process.

## Gather op reference

- https://github.com/openxla/xla/blob/main/docs/operation_semantics.md?rgh-link-date=2023-07-13T01%3A09%3A16Z#gather
- https://www.pathpartnertech.com/gather-scatter-operation-in-deep-learning-framework/

---------

Co-authored-by: BowenBao <[email protected]>
kunal-vaishnavi added a commit to microsoft/onnxruntime that referenced this pull request Oct 23, 2023
### Description
This PR contains fusion-level and kernel-level optimizations for [Meta's
LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

Some of the added optimizations include:

- SimplifiedLayerNorm changes
  - Fusions for multiple variants
- SkipSimplifiedLayerNorm changes
  - Kernel support for CPU
- Rotary embeddings (previously did not exist)
  - Fusions for multiple variants
  - CPU and CUDA kernels
  - Supports interleaving and non-interleaving in the same kernels
  - Optimized cache that requires half of its originally exported sizes
    - Reduced from `(max_sequence_length, head_size)` to `(max_sequence_length, head_size / 2)`
- Multi-head attention
  - Support for 2D and 3D attention masks
- Group query attention (for FP16 CUDA and INT4 CUDA)
  - Integration with flash attention v2 and past-present buffer sharing
  - Removes need for `attention_mask` input as it is supported in the kernel
- 4 bit quantization
  - `block_size` parameter is available for customizing
- Support the new changes for [Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
- Support combinations of the below variants (ex: export ORT version and run with Optimum)
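
To illustrate why a rotary embedding cache of shape `(max_sequence_length, head_size / 2)` can be sufficient, here is a minimal sketch. It assumes the exported cache is built by concatenating the per-position frequency table with itself, which is my understanding of the Hugging Face LLaMA layout; the PR text does not spell this out, and the function name and `base` default are illustrative.

```python
import numpy as np


def build_full_cos_cache(max_sequence_length: int, head_size: int, base: float = 10000.0) -> np.ndarray:
    """Build a cosine cache of shape (max_sequence_length, head_size) (assumed layout)."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_size, 2, dtype=np.float64) / head_size))
    positions = np.arange(max_sequence_length, dtype=np.float64)
    freqs = np.outer(positions, inv_freq)          # (max_sequence_length, head_size / 2)
    emb = np.concatenate([freqs, freqs], axis=-1)  # second half repeats the first
    return np.cos(emb)


full = build_full_cos_cache(max_sequence_length=8, head_size=4)
half = full[:, : full.shape[1] // 2]               # (max_sequence_length, head_size / 2)

# Under this assumption the two halves are identical, so storing one is enough.
assert np.array_equal(full[:, :2], full[:, 2:])
print(full.shape, half.shape)
```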

Supported variants of LLaMA-2 include:
- [ORT version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
  - Produces one ONNX file that is already optimized (and quantized if requested)
  - Integrates with Optimum
- [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
  - Already exported and available off-the-shelf
  - Faster versions of those models will be uploaded there soon
- [Hugging Face version](https://huggingface.co/meta-llama)
  - Models that end with `-hf`
  - Some older and current versions of [`transformers`](https://github.com/huggingface/transformers) and [`optimum`](https://github.com/huggingface/optimum) that export the model to ONNX differently
    - Note that while some older versions are supported, it is recommended to use the latest package versions.

### Usage

To use the optimizations, please see `README.md` for details. Please
note the various `requirements.txt` files for the package versions
recommended in order to use these changes.

To run the ORT transformer optimizer separately, run the script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps the following issues:
- #14997
- #16254
- #17681
- #17925
- microsoft/onnxruntime-inference-examples#320

This PR uses changes from the following PRs:
- pytorch/pytorch#104468
- pytorch/pytorch#109759
- #17020
- #17674
- #17890
- #17920
- huggingface/transformers#26162
- huggingface/optimum#1257
- huggingface/optimum#1289
- huggingface/optimum#1462

### New TorchDynamo Exporter (experimental stage)

This PR uses changes from the following issues and PRs to begin
supporting the [new TorchDynamo
exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):
- huggingface/transformers#26307
- pytorch/pytorch#104903
- pytorch/pytorch#105040
- microsoft/onnxscript#847
- microsoft/onnxscript#862
- microsoft/onnxscript#493
wejoncy added a commit to microsoft/onnxruntime that referenced this pull request Oct 26, 2023
commit 538e97c
Author: Patrice Vignola <[email protected]>
Date:   Wed Oct 25 19:56:16 2023 -0700

    [DML EP] Add dynamic graph compilation (#17876)

    Historically, DML was only able to fuse partitions when all sizes are
    known in advance or when we were overriding them at session creation
    time. But in practice, it should be possible to compile partitions at
    compute time if the caller knows that the dimensions won't be changed
    for every inference (e.g. resizing a webcam window, or padding the input
    to powers of 2). This graph will be cached and reused until the sizes
    change.

    This is an opt-in option gated under the `enable_dynamic_graph_fusion`
    option, which means that it will only be enabled when the caller
    requests it since they have more context on how their model will be
    called between inferences.

    This PR also adds the option to disable metacommands from the python
    API, which is an option for the C API but was lacking for python.
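
The "cached and reused until the sizes change" behaviour described above can be sketched as a single-entry cache keyed by input shapes. This is a hypothetical illustration, not the DML EP code: the class name, `compile_fn`, and `compile_count` are all made up for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

Shape = Tuple[int, ...]


class ShapeKeyedGraphCache:
    """Recompile only when the input shapes differ from the previous call."""

    def __init__(self, compile_fn: Callable[[Tuple[Shape, ...]], object]) -> None:
        self._compile_fn = compile_fn
        self._cached_key: Optional[Tuple[Shape, ...]] = None
        self._cached_graph: object = None
        self.compile_count = 0

    def get(self, input_shapes: Sequence[Sequence[int]]) -> object:
        key = tuple(tuple(shape) for shape in input_shapes)
        if key != self._cached_key:
            # Sizes changed (or first call): compile and replace the cached graph.
            self._cached_graph = self._compile_fn(key)
            self._cached_key = key
            self.compile_count += 1
        return self._cached_graph


cache = ShapeKeyedGraphCache(lambda shapes: f"graph compiled for {shapes}")
cache.get([(1, 3, 224, 224)])
cache.get([(1, 3, 224, 224)])  # same sizes: reuses the cached graph
cache.get([(1, 3, 256, 256)])  # sizes changed: triggers a recompile
print(cache.compile_count)
```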

commit d30d4d3
Author: Jambay Kinley <[email protected]>
Date:   Wed Oct 25 15:34:58 2023 -0700

    Add MatMul FP4 and NF4 Support (#18066)
    Add a contrib op MatMulBnb4 (FP4 and NF4) and related toolchain to
    support quantization on weight.

    This PR adds:
    - schema for contrib op MatMulBnb4 which can support FP4 (4-bit floating
    point) and NF4 (4-bit NormalFloat) quantization on weight.
    - a naive implementation for MatMulBnb4 on CPU and GPU, i.e.,
    implemented like MatMul(A, Dequantize(B)).
    - a special implementation for GemV for MatMulBnb4 and related benchmark
    tool.
    - tool to quantize model to FP4 or NF4.
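
The "MatMul(A, Dequantize(B))" formulation above can be sketched as follows. This is a simplified illustration under stated assumptions: the 16-entry codebook is a uniform placeholder (the real FP4 and NF4 tables differ), codes are stored one per byte instead of packed two per byte, and all function names are hypothetical.

```python
import numpy as np

# Placeholder 16-entry codebook; NOT the actual FP4 or NF4 value table.
CODEBOOK = np.linspace(-1.0, 1.0, 16)


def quantize_blockwise(weights: np.ndarray, block_size: int):
    """Map each weight to the nearest codebook entry after per-block absmax scaling."""
    flat = weights.reshape(-1, block_size)
    absmax = np.abs(flat).max(axis=1, keepdims=True)
    absmax[absmax == 0] = 1.0                      # avoid dividing by zero
    normalized = flat / absmax                     # values now lie in [-1, 1]
    codes = np.abs(normalized[..., None] - CODEBOOK).argmin(axis=-1).astype(np.uint8)
    return codes, absmax


def dequantize_blockwise(codes: np.ndarray, absmax: np.ndarray, shape) -> np.ndarray:
    return (CODEBOOK[codes] * absmax).reshape(shape)


def matmul_4bit_naive(a: np.ndarray, codes: np.ndarray, absmax: np.ndarray, b_shape) -> np.ndarray:
    # Naive path: dequantize the whole weight matrix, then run a regular MatMul.
    return a @ dequantize_blockwise(codes, absmax, b_shape)


rng = np.random.default_rng(0)
a = rng.normal(size=(2, 8))
b = rng.normal(size=(8, 4))
codes, absmax = quantize_blockwise(b, block_size=8)
out = matmul_4bit_naive(a, codes, absmax, b.shape)
print(out.shape)
```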

commit d88d52e
Author: snadampal <[email protected]>
Date:   Wed Oct 25 13:34:57 2023 -0500

    [aarch64] Remove mmla kernel support from apple (#18082)
    The mmla kernels require additional ISA flags
    and are currently supported only on Linux
    more context is in #15270

    cc: @skottmckay , @chenfucn , @snnn

commit 706e13e
Author: liqun Fu <[email protected]>
Date:   Wed Oct 25 10:46:04 2023 -0700

    implement affinegrid cpu kernel (#17777)

commit 2c6b31c
Author: pengwa <[email protected]>
Date:   Wed Oct 25 15:11:02 2023 +0800

    FP16 optimizer automatically detect DeepSpeed compatibility (#18084)

    Optimum/Transformers use the accelerate library to prepare models, so our
    FP16 optimizer wrapper has not worked for a long time. This is because
    the namespace is `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper`,
    which underneath still calls into the DeepSpeed stage 1 and 2 optimizer.

    This PR includes the following changes:
    1. Add `accelerate.utils.deepspeed.DeepSpeedOptimizerWrapper` to the
    modifier registry, plus a check that its contained `optimizer` property
    MUST be a DeepSpeed stage 1 and 2 optimizer. (Let's cover the Stage 3
    optimizer later.)
    2. For DeepSpeed versions > 0.9.1, we will store the source code in a
    version list. As long as the related function in DeepSpeed remains
    unchanged in a new release, we won't need to manually upgrade the
    version check any more. If the source code someday does not match, a
    warning will be raised asking users to add the new version of the
    source code to the list.

    With the above change, we will have our FP16 Optimizer working again in
    Optimum.

    ![image](https://github.com/microsoft/onnxruntime/assets/10530022/d35b4aa9-b371-46f1-98ae-73114f91179b)
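
The source-based compatibility check described in item 2 could look roughly like the sketch below. It is a hypothetical illustration: `textwrap.dedent` stands in for the DeepSpeed function being tracked, and `is_source_known` and `KNOWN_SOURCES` are invented names, not the ORT implementation.

```python
import inspect
import textwrap
import warnings


def is_source_known(func, known_sources) -> bool:
    """Return True if the current source of `func` matches a recorded version."""
    current = inspect.getsource(func)
    if current in known_sources:
        return True
    warnings.warn(
        f"Source of {func.__name__} does not match any recorded version; "
        "add the new source to the list."
    )
    return False


# Stand-in for the list of recorded source versions of the tracked function.
KNOWN_SOURCES = [inspect.getsource(textwrap.dedent)]

print(is_source_known(textwrap.dedent, KNOWN_SOURCES))  # source unchanged: compatible
```

Comparing source text instead of version numbers means a new library release needs no code change as long as the tracked function itself is untouched.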

commit ae85619
Author: Sumit Agarwal <[email protected]>
Date:   Tue Oct 24 19:41:10 2023 -0700

    Introduce new optimizer MatMul + BatchNormalization (#17915)
    Introduce a new ORT L1 optimizer under the RewriteRule category to fuse
    MatMul + BatchNormalization nodes. This optimizer looks for a specific
    pattern observed in one of the impacting customer models and fuses the
    MatMul and BatchNormalization nodes into a Gemm node. For details on the
    pattern matching and fusion, please refer to the comment section of
    `matmul_bn_fusion.cc`.

    To visualize, this optimizer will replace the following subgraph with a
    Gemm node.
    <pre>
                   MatMul                  GEMM
                     |                       |
                  Reshape ^     --->      Reshape ^
                     |                       |
                Transpose ^             Transpose ^
                     |
           BatchNormalization
    Note: ^ means there can be >=0 occurrence(s) of that node.
    A few example fusable patterns:
    * - MatMul -> Reshape -> Transpose -> BatchNormalization ---> GEMM ->
    Reshape -> Transpose
    * - MatMul -> Reshape -> BatchNormalization ---> GEMM -> Reshape
    * - MatMul -> Transpose -> BatchNormalization ---> GEMM -> Transpose
    * - MatMul -> Reshape -> Reshape -> BatchNormalization ---> GEMM ->
    Reshape -> Reshape
    * - MatMul -> Reshape -> Transpose -> Reshape -> BatchNormalization --->
    GEMM -> Reshape -> Transpose -> Reshape
    * - MatMul -> BatchNormalization ---> GEMM
    </pre>

    Note: This optimizer may evolve in the future to be more generic in
    terms of the pattern matching.
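
For intuition on why MatMul followed by BatchNormalization can collapse into a single Gemm, here is a small numeric sketch. It assumes inference-mode BatchNormalization applied per output column of a 2-D MatMul result and ignores the optional Reshape/Transpose nodes. The helper names are made up for illustration, and the actual logic in `matmul_bn_fusion.cc` may differ in detail.

```python
import math


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def batch_norm_columns(y, scale, bias, mean, var, eps=1e-5):
    """Inference-mode BatchNormalization with one channel per column."""
    return [
        [(v - mean[j]) / math.sqrt(var[j] + eps) * scale[j] + bias[j]
         for j, v in enumerate(row)]
        for row in y
    ]


def fold_bn_into_gemm(w, scale, bias, mean, var, eps=1e-5):
    """Return (W', b') such that X @ W' + b' equals BN(X @ W)."""
    s = [scale[j] / math.sqrt(var[j] + eps) for j in range(len(scale))]
    w_folded = [[w_ij * s[j] for j, w_ij in enumerate(row)] for row in w]
    b_folded = [bias[j] - mean[j] * s[j] for j in range(len(s))]
    return w_folded, b_folded


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
scale, bias = [1.0, 2.0, 0.5], [0.1, -0.2, 0.3]
mean, var = [0.4, -0.6, 1.0], [1.0, 4.0, 0.25]

# Unfused path: MatMul followed by BatchNormalization.
reference = batch_norm_columns(matmul(x, w), scale, bias, mean, var)

# Fused path: a single Gemm (X @ W' + b') with folded weights and bias.
w_f, b_f = fold_bn_into_gemm(w, scale, bias, mean, var)
fused = [[v + b_f[j] for j, v in enumerate(row)] for row in matmul(x, w_f)]
```

Both paths agree up to floating-point rounding, because BatchNormalization at inference time is an affine transform per channel.
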
    - Why is this change required? What problem does it solve?
    One of the users of the ORT+DML EP needs this to better target the model
    to DML. But this transformation applies more broadly, so it was added
    as an L1 optimizer.

commit 76e275b
Author: Jian Chen <[email protected]>
Date:   Tue Oct 24 15:17:36 2023 -0700

    Merge Cuda docker files into a single one (#18020)

commit 6ec45f2
Author: Changming Sun <[email protected]>
Date:   Tue Oct 24 13:04:08 2023 -0700

    Merge aiinfra-linux-ARM64-CPU-2019 and onnxruntime-linux-ARM64-CPU-2019 (#18069)
    Merge aiinfra-linux-ARM64-CPU-2019 and onnxruntime-linux-ARM64-CPU-2019
    machines to a single one to ease management.

commit efa0cc2
Author: liqun Fu <[email protected]>
Date:   Tue Oct 24 10:58:54 2023 -0700

    implement isinf20 and isnan20 (#17874)

commit abb3291
Author: Changming Sun <[email protected]>
Date:   Tue Oct 24 10:50:12 2023 -0700

    Update win-wasm-ci.yml: increase the timeout value (#18023)

commit e63ccd3
Author: Jian Chen <[email protected]>
Date:   Tue Oct 24 10:47:23 2023 -0700

    Install CUDA 12.2 on Windows (#18044)

commit eb47008
Author: Jiajia Qin <[email protected]>
Date:   Tue Oct 24 13:56:56 2023 +0800

    [js/webgpu] FP16 Cast, Resize (#18035)

    Cast and Resize with f16 support, which vae-decoder-f16 needs, were
    missing. With this change, vae-decoder-f16 goes from over 1 second to
    315 ms.

commit 688524a
Author: Tianlei Wu <[email protected]>
Date:   Mon Oct 23 22:00:02 2023 -0700

    [CUDA EP] Add warning logs when adding memcpy nodes (#18032)

    Memcpy nodes can hurt performance, and they also make ORT unable to
    run CUDA graph.

    Here we add a warning log for the CUDA EP when this happens. It can help
    with troubleshooting. For example, when CUDA graph cannot run, we can
    check the logs to find out where the Memcpy nodes were inserted (this is
    also possible by saving the optimized model, but that needs more time
    and disk space).

    Note that the warning is per graph. When there are subgraphs, we might
    see multiple warnings if the issue happens in multiple graphs.

    Example logs:
    ```
    2023-10-19 20:58:10.678176531 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after input_ids for CUDAExecutionProvider
    2023-10-19 20:58:10.678198702 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/ArgMax_output_0 for CUDAExecutionProvider
    2023-10-19 20:58:10.678211727 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/Gather_3_output_0 for CUDAExecutionProvider
    2023-10-19 20:58:10.678257903 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 3 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
    ```
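
When many such lines are logged, the insertion points can be pulled out mechanically. The sketch below is only an illustration: the regular expression is derived from the example log lines above, so it is an assumption that the format stays the same across releases, and `find_memcpy_insertions` is a made-up helper name.

```python
import re

# Pattern derived from the example log lines above; the format may change.
ADD_COPY_RE = re.compile(
    r"AddCopyNode\] Add (?P<op>Memcpy\w+) after (?P<value>\S+) for (?P<ep>\w+)"
)


def find_memcpy_insertions(log_text: str):
    """Return (memcpy_op, value_name, provider) for every AddCopyNode log line."""
    return [
        (m.group("op"), m.group("value"), m.group("ep"))
        for m in ADD_COPY_RE.finditer(log_text)
    ]


sample_log = """\
2023-10-19 20:58:10.678176531 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after input_ids for CUDAExecutionProvider
2023-10-19 20:58:10.678198702 [I:onnxruntime:, transformer_memcpy.cc:329 AddCopyNode] Add MemcpyFromHost after /text_model/ArgMax_output_0 for CUDAExecutionProvider
"""

insertions = find_memcpy_insertions(sample_log)
for op, value, provider in insertions:
    print(f"{op} inserted after {value} ({provider})")
```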

commit 555b2af
Author: Chi Lo <[email protected]>
Date:   Tue Oct 24 02:41:15 2023 +0000

    [TensorRT EP] Add unit test for user provided cuda stream (#17974)

    Add a unit test for testing user provided CUDA stream

commit 4ffd022
Author: Chi Lo <[email protected]>
Date:   Tue Oct 24 00:46:38 2023 +0000

    [TensorRT EP] Refactor of TRT plugins support (#17946)

    Make sure the "trt.plugins" custom op domain is only registered once.
    The bottom line is that the "trt.plugins" custom op domain needs to be
    registered before model load.

    `CreateTensorRTCustomOpDomainList()` is the TRT EP's function to create
    the "trt.plugins" custom op domain. The following are the places where
    this function is called. (This function only fetches all the TRT plugins
    from the TRT plugin registry; it does not yet register them to the ORT
    custom op registry. The real registration happens in
    `AddCustomOpDomains()`.)

    C/C++ APIs:

    - `OrtApis::SessionOptionsAppendExecutionProvider_TensorRT_XX`: This
    function will make session option object contain the "trt.plugins"
    custom op domain for ORT to register. So that later the session creation
    api can register the custom op domain accordingly and won't complain
    about invalid onnx node.
    - `InferenceSession::RegisterExecutionProvider`: In some cases, users
    might create the session object first and later call
    session_object.RegisterExecutionProvider(). This function will call
    p_exec_provider->GetCustomOpDomainList() which returns "trt.plugins"
    custom op domain. Otherwise, session_object.Load(model) will complain.

    Python APIs:

    - `RegisterTensorRTPluginsAsCustomOps`: Need to call this function so
    that session option object contains the "trt.plugins" custom op domain
    for ORT to register.

    Different language bindings have slightly different workflows for
    initializing the session. This might cause duplicate custom op domains in
    `session_option.custom_op_domains_` or
    `CreateTensorRTCustomOpDomainList()` being called more than once, but we
    added checks to make sure the EP's custom op domain won't be registered
    twice.
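
The "register only once" guard can be pictured with a small sketch. This is not the ORT implementation (which is C++); `CustomOpDomainRegistry` and the plugin name `PluginA` are hypothetical and only show the idea of making repeated registration a no-op.

```python
class CustomOpDomainRegistry:
    """Hypothetical registry that ignores repeated registration of a domain."""

    def __init__(self):
        self._domains = {}

    def register(self, name: str, ops: list) -> bool:
        """Register `name` once; return False if it was already registered."""
        if name in self._domains:
            return False
        self._domains[name] = list(ops)
        return True

    def domain_names(self) -> list:
        """Return the names of all registered domains."""
        return list(self._domains)


registry = CustomOpDomainRegistry()
first = registry.register("trt.plugins", ["PluginA"])
second = registry.register("trt.plugins", ["PluginA"])  # duplicate call is a no-op
```

Because the second call is ignored, it does not matter which binding-specific code path reaches the registration first.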

commit 2c50b75
Author: Dmitri Smirnov <[email protected]>
Date:   Mon Oct 23 17:42:20 2023 -0700

    Functions Ahead Of Time inlining (#17764)
    Inline functions in an EP aware fashion.

    The result of this PR is that models that have been inlined by the
    ONNX inliner and then optimized, and models that have been AOT inlined,
    appear to be visually identical.

    For tests I used two models. The only difference is the resulting size
    because ONNX inliner removes local function definitions and AOT does
    not. Difference in sizes for `HF Mobile` model was 2.5 MB, and for `HF
    Bart` it was ~500K. It seems that the resulting model size affects the
    load time more than the actual optimizations.

    In general, the inlined models grow in size very fast and can easily
    exceed the 2 GB limit.

    Q. Should we make AOT optional?

    `If` constant folding and the removal of local inlined models will be
    coming in other PRs.

    Some stats:

    ![image](https://github.com/microsoft/onnxruntime/assets/11303988/fcb4c815-2e06-4574-8d96-5a0a727d1ecf)

commit f3cfe08
Author: satyajandhyala <[email protected]>
Date:   Mon Oct 23 16:02:50 2023 -0700

    [JS/Web] Enabled 1d spatial input to GlobalAveragePool (#17973)
    Enable one-dimensional spatial input to GlobalAveragePool.
    Currently only 2D input is supported.

commit 780ee18
Author: snadampal <[email protected]>
Date:   Mon Oct 23 16:49:04 2023 -0500

    [aarch64] Implement QGEMM kernels with UMMLA/SMMLA instructions (#17160)
    This PR adds UMMLA and SMMLA based QGEMM kernels for aarch64. This
    covers
    (i) symmetric quantization (zero point is zero)
    (ii) asymmetric quantization (zero point is non-zero)
    (iii) per channel as well as per tensor quantization
    (iv) Signed weights (U8S8 Gemm)
    (v) Unsigned weights (U8U8 Gemm) and
    (vi) Signed activations and weights (S8S8 Gemm) scenarios
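
As background for the symmetric and asymmetric cases listed above, the sketch below shows how zero points enter an integer dot product under the usual affine quantization model, `real = scale * (q - zero_point)`. It is plain Python for illustration only and says nothing about how the UMMLA/SMMLA kernels are written; the function names and sample values are made up.

```python
def quantized_dot(qa, qb, za, zb):
    """Integer accumulation of sum((qa_i - za) * (qb_i - zb))."""
    return sum((a - za) * (b - zb) for a, b in zip(qa, qb))


def quantized_dot_expanded(qa, qb, za, zb):
    """Same result, split into a raw integer dot product plus zero-point corrections."""
    k = len(qa)
    raw = sum(a * b for a, b in zip(qa, qb))
    return raw - zb * sum(qa) - za * sum(qb) + k * za * zb


qa = [12, 200, 37, 255]   # unsigned 8-bit activations (U8)
qb = [-5, 17, -128, 90]   # signed 8-bit weights (S8)

# Symmetric case: both zero points are zero, so only the raw product remains.
symmetric = quantized_dot(qa, qb, za=0, zb=0)

# Asymmetric case: non-zero zero points add correction terms.
asymmetric = quantized_dot(qa, qb, za=128, zb=3)
```

The expanded form shows why the symmetric case is cheaper: with both zero points at zero, every correction term vanishes.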

    I've enabled the UMMLA/SMMLA kernels based on a cpuinfo check for `I8MM`
    support.
    MMLA QGEMM kernels are enabled for all the devices that support I8MM
    instructions.
    This is to improve INT8 quantized MatMul performance on aarch64
    platform.
    I have run the below benchmarking script (bert, roberta and gpt2 model
    inference) on AWS Graviton3 based c7g.4xl instance and observed up to
    1.33x performance improvement compared to the optimized UDOT qgemm
    kernel performance.

    ```
    cd onnxruntime/python/tools/transformers
    python3 benchmark.py
    ```
    I have also run the unit tests, and made sure all are passing

    ```
    ./build.sh --config RelWithDebInfo --build_shared_lib --parallel --compile_no_warning_as_error --skip_submodule_sync

    ```

commit 2a17d5c
Author: kunal-vaishnavi <[email protected]>
Date:   Mon Oct 23 13:00:56 2023 -0700

    LLaMA Model Optimization (#18021)
    This PR contains fusion-level and kernel-level optimizations for [Meta's
    LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

    Some of the added optimizations include:

    - SimplifiedLayerNorm changes
      - Fusions for multiple variants
    - SkipSimplifiedLayerNorm changes
      - Kernel support for CPU
    - Rotary embeddings (previously did not exist)
      - Fusions for multiple variants
      - CPU and CUDA kernels
      - Supports interleaving and non-interleaving in the same kernels
      - Optimized cache that requires half of its originally exported sizes
        - Reduced from `(max_sequence_length, head_size)` to
          `(max_sequence_length, head_size / 2)`
    - Multi-head attention
      - Support for 2D and 3D attention masks
    - Group query attention (for FP16 CUDA and INT4 CUDA)
      - Integration with flash attention v2 and past-present buffer sharing
      - Removes need for `attention_mask` input as it is supported in the
        kernel
    - 4 bit quantization
      - `block_size` parameter is available for customizing
    - Support the new changes for [Microsoft
    version](https://github.com/microsoft/Llama-2-Onnx)
    - Support combinations of the below variants (ex: export ORT version and
    run with Optimum)
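
To illustrate why the rotary embedding cache can be halved, here is a sketch that assumes the Hugging Face-style non-interleaved layout, where the angle table is concatenated with itself so the second half of each row repeats the first. The function names are hypothetical and `base=10000.0` is the conventional RoPE default rather than something stated in this PR.

```python
import math


def rotary_cos_cache_half(max_sequence_length, head_size, base=10000.0):
    """Build a cos table of shape (max_sequence_length, head_size // 2)."""
    half = head_size // 2
    inv_freq = [base ** (-2.0 * i / head_size) for i in range(half)]
    return [[math.cos(pos * f) for f in inv_freq] for pos in range(max_sequence_length)]


def expand_to_full(cache_half):
    """Rebuild the (max_sequence_length, head_size) table by repeating each row."""
    return [row + row for row in cache_half]


half_cache = rotary_cos_cache_half(max_sequence_length=4, head_size=8)
full_cache = expand_to_full(half_cache)
```

Since the full table carries no information beyond the half table under this layout, storing `(max_sequence_length, head_size / 2)` is enough.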

    Supported variants of LLaMA-2 include:
    - [ORT
    version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
      - Produces one ONNX file that is already optimized (and quantized if
        requested)
      - Integrates with Optimum
    - [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
      - Already exported and available off-the-shelf
      - Faster versions of those models will be uploaded there soon
    - [Hugging Face version](https://huggingface.co/meta-llama)
      - Models that end with `-hf`
      - Some older and current versions of
        [`transformers`](https://github.com/huggingface/transformers) and
        [`optimum`](https://github.com/huggingface/optimum) that export the
        model to ONNX differently
        - Note that while some older versions are supported, it is recommended
          to use the latest package versions.

    To use the optimizations, please see `README.md` for details. Please
    note the various `requirements.txt` files for the package versions
    recommended in order to use these changes.

    To run the ORT transformer optimizer separately, run the script as
    follows:
    ```
    $ cd onnxruntime/onnxruntime/python/tools/transformers/
    $ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
    ```
    This PR helps the following issues:
    - #14997
    - #16254
    - #17681
    - #17925
    - microsoft/onnxruntime-inference-examples#320

    This PR uses changes from the following PRs:
    - pytorch/pytorch#104468
    - pytorch/pytorch#109759
    - #17020
    - #17674
    - #17890
    - #17920
    - huggingface/transformers#26162
    - huggingface/optimum#1257
    - huggingface/optimum#1289
    - huggingface/optimum#1462

    This PR uses changes from the following issues and PRs to begin
    supporting the [new TorchDynamo
    exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):
    - huggingface/transformers#26307
    - pytorch/pytorch#104903
    - pytorch/pytorch#105040
    - microsoft/onnxscript#847
    - microsoft/onnxscript#862
    - microsoft/onnxscript#493

commit 8a12b2c
Author: Jiajia Qin <[email protected]>
Date:   Tue Oct 24 02:02:19 2023 +0800

    [js/webgpu] Fix the transpose error when dims > 4D (#18027)
    Currently, the uniform support has bugs when the rank of dims is larger
    than 4. See #17860 item 1.
    So this PR only enables shape uniforms for transpose when the shape rank
    is <= 4. Otherwise, the compilation errors below are thrown:
    ```
    1 error(s) generated while compiling the shader:
    :3:50 error: uniform storage requires that array elements are aligned to 16 bytes, but array element of type 'u32' has a stride of 4 bytes. Consider using a vector or struct as the element type instead.
          struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
                                                     ^^^^^^^^^^^^^

    :3:7 note: see layout of struct:
    /*            align(4) size(84) */ struct Uniforms {
    /* offset( 0) align(4) size( 4) */   output_size : u32;
    /* offset( 4) align(4) size(20) */   a_shape : array<u32, 5>;
    /* offset(24) align(4) size(20) */   a_strides : array<u32, 5>;
    /* offset(44) align(4) size(20) */   output_shape : array<u32, 5>;
    /* offset(64) align(4) size(20) */   output_strides : array<u32, 5>;
    /*                              */ };
          struct Uniforms { output_size:u32, a_shape:array<u32, 5>, a_strides:array<u32, 5>, output_shape:array<u32, 5>, output_strides:array<u32, 5> };
          ^^^^^^

    :4:42 note: 'Uniforms' used in address space 'uniform' here
          @group(0) @binding(2) var<uniform> uniforms: Uniforms;
                                             ^^^^^^^^
    ```

commit f0d5ea5
Author: Hector Li <[email protected]>
Date:   Mon Oct 23 09:01:29 2023 -0700

    [QNN EP] Disable flaky test QnnCPUBackendTests.MatMulOp_Broadcast (#18033)

    Disable flaky test QnnCPUBackendTests.MatMulOp_Broadcast. The test
    failed on Linux randomly.

commit b7ae293
Author: JiCheng <[email protected]>
Date:   Sun Oct 22 23:33:29 2023 +0800

    Support large model export using multi-gpu (#17990)

    This PR implements an exporter that works for large language
    models (LLMs).
    It works for models like Llama2-70b or gpt-175.

    The main idea is to utilize multiple GPUs and dispatch different layers
    to different GPUs; in short, it simply implements automatic pipeline
    parallelism.
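
The layer-dispatch idea can be sketched as a simple mapping from layer index to device index. This is an illustrative assumption about the scheme (contiguous, near-equal chunks), not the code from the PR, and `assign_layers_to_devices` is a hypothetical helper.

```python
def assign_layers_to_devices(num_layers: int, num_devices: int) -> dict:
    """Map each layer index to a device index using contiguous, near-equal chunks."""
    if num_layers <= 0 or num_devices <= 0:
        raise ValueError("num_layers and num_devices must be positive")
    base, extra = divmod(num_layers, num_devices)
    mapping = {}
    layer = 0
    for device in range(num_devices):
        # The first `extra` devices take one additional layer.
        count = base + (1 if device < extra else 0)
        for _ in range(count):
            mapping[layer] = device
            layer += 1
    return mapping


# Example: 80 decoder layers over 8 GPUs gives 10 consecutive layers per device.
device_map = assign_layers_to_devices(num_layers=80, num_devices=8)
```

Keeping consecutive layers on the same device means activations cross a GPU boundary only once per chunk.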

    For example, to export Llama2-70b, you need 8x V100-32GB, 4x A100-80G,
    or more GPU memory.

    It expects to export decoder-only models. Encoder-decoder
    models have not been tested yet.

    ---------

    Co-authored-by: Justin Chu <[email protected]>

commit 444a0ed
Author: pengwa <[email protected]>
Date:   Sat Oct 21 19:45:45 2023 +0800

    Avoid one time clone to save memory peak (#17934)

commit 009cd4e
Author: RandySheriffH <[email protected]>
Date:   Fri Oct 20 16:12:21 2023 -0700

    Allow cuda custom ops allocate deferred cpu mem (#17893)

    Expose a new allocator from cuda stream.
    The allocator manages deferred CPU memory, which only gets recycled
    before stream destruction.

    ---------

    Co-authored-by: Randy Shuai <[email protected]>

commit 2f57625
Author: Chi Lo <[email protected]>
Date:   Fri Oct 20 22:09:46 2023 +0000

    [TensorRT EP] Add stream sync after enqueue (#18026)

    If the model is partitioned into TRT subgraphs and CUDA EP nodes, we
    observed a CUDA stream synchronization issue when multithreading. Calling
    the stream sync API after enqueue can solve this issue without adding much
    performance overhead.

commit 020824e
Author: liqun Fu <[email protected]>
Date:   Fri Oct 20 15:08:25 2023 -0700

    Update ONNX to 1.15.0rc1 (#17914)

commit a43c57f
Author: Baiju Meswani <[email protected]>
Date:   Fri Oct 20 11:39:57 2023 -0700

    ResizeGrad CUDA/ROCM kernel implementation (#17772)

commit cc7e8cc
Author: Changming Sun <[email protected]>
Date:   Fri Oct 20 09:24:21 2023 -0700

    Update dockerfiles/Dockerfile.source to avoid installing onnx (#17975)
    Update dockerfiles/Dockerfile.source to avoid installing onnx python
    package. ONNX is not listed in
    https://github.com/microsoft/onnxruntime/blob/main/requirements.txt.in.
    We do not have to install it. Especially when we do not run tests, the
    package provides no help when building onnxruntime from source.
    Resolve #17781

commit 99b8dca
Author: Yi Zhang <[email protected]>
Date:   Fri Oct 20 23:41:40 2023 +0800

    Disable dml stage in windows GPU pipeline temporarily. (#18034)
tianleiwu pushed a commit to microsoft/onnxruntime that referenced this pull request Oct 31, 2023
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024
### Description
This PR contains fusion-level and kernel-level optimizations for [Meta's
LLaMA-2](https://blogs.microsoft.com/blog/2023/07/18/microsoft-and-meta-expand-their-ai-partnership-with-llama-2-on-azure-and-windows/).

Some of the added optimizations include:

- SimplifiedLayerNorm changes
  - Fusions for multiple variants
- SkipSimplifiedLayerNorm changes
  - Kernel support for CPU
- Rotary embeddings (previously did not exist)
  - Fusions for multiple variants
  - CPU and CUDA kernels
  - Supports interleaving and non-interleaving in the same kernels
  - Optimized cache that requires half of its originally exported sizes
    - Reduced from `(max_sequence_length, head_size)` to
      `(max_sequence_length, head_size / 2)`
- Multi-head attention
  - Support for 2D and 3D attention masks
- Group query attention (for FP16 CUDA and INT4 CUDA)
  - Integration with flash attention v2 and past-present buffer sharing
  - Removes need for `attention_mask` input as it is supported in the
    kernel
- 4 bit quantization
  - `block_size` parameter is available for customizing
- Support the new changes for [Microsoft
version](https://github.com/microsoft/Llama-2-Onnx)
- Support combinations of the below variants (ex: export ORT version and
run with Optimum)

Supported variants of LLaMA-2 include:
- [ORT
version](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama)
  - Produces one ONNX file that is already optimized (and quantized if
    requested)
  - Integrates with Optimum
- [Another Microsoft version](https://github.com/microsoft/Llama-2-Onnx)
  - Already exported and available off-the-shelf
  - Faster versions of those models will be uploaded there soon
- [Hugging Face version](https://huggingface.co/meta-llama)
  - Models that end with `-hf`
  - Some older and current versions of
    [`transformers`](https://github.com/huggingface/transformers) and
    [`optimum`](https://github.com/huggingface/optimum) that export the
    model to ONNX differently
    - Note that while some older versions are supported, it is recommended
      to use the latest package versions.

### Usage

To use the optimizations, please see `README.md` for details. Please
note the various `requirements.txt` files for the package versions
recommended in order to use these changes.

To run the ORT transformer optimizer separately, run the script as
follows:
```
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 optimizer.py --input <filename>.onnx --output <filename>.onnx --model_type gpt2 --num_heads <number of attention heads> --hidden_size <attention hidden size> --use_external_data_format --opt_level 0
```

### Motivation and Context
This PR helps the following issues:
- microsoft#14997
- microsoft#16254
- microsoft#17681
- microsoft#17925
- microsoft/onnxruntime-inference-examples#320

This PR uses changes from the following PRs:
- pytorch/pytorch#104468
- pytorch/pytorch#109759
- microsoft#17020
- microsoft#17674
- microsoft#17890
- microsoft#17920
- huggingface/transformers#26162
- huggingface/optimum#1257
- huggingface/optimum#1289
- huggingface/optimum#1462

### New TorchDynamo Exporter (experimental stage)

This PR uses changes from the following issues and PRs to begin
supporting the [new TorchDynamo
exporter](https://pytorch.org/docs/stable/onnx.html#torchdynamo-based-onnx-exporter):
- huggingface/transformers#26307
- pytorch/pytorch#104903
- pytorch/pytorch#105040
- microsoft/onnxscript#847
- microsoft/onnxscript#862
- microsoft/onnxscript#493