
[MLIR][TORCH] Fix mean and mean.dim op for large-sized inputs #1605

Merged (1 commit) on Nov 22, 2022

Conversation

vivekkhandelwal1 (Collaborator)

This commit fixes the aten.mean and aten.mean.dim op decompositions to support large-sized inputs.
It also fixes the formatting of the file stats.py.

Signed-Off By: Vivek Khandelwal <[email protected]>
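
For context, here is a minimal standalone PyTorch sketch (not code from the patch) of the numeric issue being fixed: with roughly 3.1M inputs in [100, 200), an fp32 running sum reaches about 4.7e8, where a single fp32 ulp is 32, so each further addend of ~150 gets rounded coarsely.

# Standalone sketch, not part of this patch.
import torch

big = torch.tensor(4.7e8, dtype=torch.float32)    # magnitude of the running sum
small = torch.tensor(150.0, dtype=torch.float32)  # magnitude of one element
print(((big + small) - big).item())               # prints 160.0, not 150.0

# Upcasting before the reduction (roughly what the fixed decomposition does)
# keeps the accumulation accurate; how large the fp32 error actually is
# depends on the backend's summation order.
x = torch.rand(3, 4, 256, 1024) * 100 + 100
reference = x.double().mean()
fp32_mean = x.mean()
upcast_mean = x.to(torch.float64).sum() / x.numel()
print((fp32_mean.double() - reference).abs().item(),
      (upcast_mean - reference).abs().item())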

vivekkhandelwal1 marked this pull request as ready for review on November 17, 2022, 12:52
@pashu123 (Member) left a comment:

LGTM!

Resolved (outdated) review comments on lib/Dialect/Torch/Transforms/DecomposeComplexOps.cpp and python/torch_mlir_e2e_test/test_suite/stats.py
@pashu123 (Member) left a comment:

This is in terms of f16 support → we shouldn't be upcasting f16's to f64's; casting to f32's should be enough.

@ramiro050 (Collaborator)

> This is in terms of f16 support → we shouldn't be upcasting f16's to f64's; casting to f32's should be enough.

@vivekkhandelwal1 @pashu123, is this still needed?

@vivekkhandelwal1 (Collaborator, Author)

> This is in terms of f16 support → we shouldn't be upcasting f16's to f64's; casting to f32's should be enough.
>
> @vivekkhandelwal1 @pashu123, is this still needed?

Not right now. Since the fp16 tests don't work on the RefBackend, this support will be added in a separate PR.

@vivekkhandelwal1 (Collaborator, Author)

@ramiro050 Is this patch good to merge?

@ramiro050 (Collaborator) left a comment:

LGTM!

@tanyokwok (Collaborator)

Hi @vivekkhandelwal1, we are seeing a significant performance regression with this commit, since all mean/sum-style reduction computations are now done in f64.

I am curious why we force the computation precision to f64. Could we add an option to enable f64 for your scenario and keep f32 computation by default?

cc @ramiro050 @silvasean

Affected issue: alibaba/BladeDISC#799

@silvasean (Contributor)

AFAICT PyTorch uses f32 for this computation instead of f64. Why do we need f64? Since it is causing a major regression I think we should revert this patch.

We don't want an option to control f32 or f64 here. We should just be consistent with PyTorch's behavior.

@vivekkhandelwal1 (Collaborator, Author) commented Nov 29, 2022

> Hi @vivekkhandelwal1, we are seeing a significant performance regression with this commit, since all mean/sum-style reduction computations are now done in f64.
>
> I am curious why we force the computation precision to f64. Could we add an option to enable f64 for your scenario and keep f32 computation by default?
>
> cc @ramiro050 @silvasean
>
> Affected issue: alibaba/BladeDISC#799

If we revert the conversion to f64, then the tests with large input sizes fail.

@silvasean (Contributor)

I tried reverting the .cpp file changes in this patch locally and all the e2e test cases in the patch continue to pass. We should just revert this patch since it does not appear to contain an e2e test that fails before but passes after.

@qedawkins (Collaborator) commented Nov 30, 2022

In the past I referenced this to see how PyTorch was doing the mean and variance computation on CPU when debugging a correctness issue for var.correction. This culminated in the VarCorrectionLargeInputModule_basic test added in #1100, but if the large input tests pass both with and without double precision then the tests must not be adequately testing the precision differences.

The summation order used by the backend also matters here: e.g., to sum 1000 elements, we might first sum the elements in groups of 4 and then fully reduce using these partial sums (see the sketch below). This can mitigate accuracy problems when accumulating at lower precision, but it doesn't mean that summing in single precision in this lowering is correct. My guess is that PyTorch + CUDA won't do double-precision accumulation, though.
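
A rough sketch of the grouped partial-sum idea described above (illustrative Python only, not the actual backend reduction; the group size of 4 mirrors the example):

# Illustrative only: reduce in small groups, then reduce the partial sums.
# Each accumulator stays small relative to its addends, which limits the
# rounding error compared with one long sequential fp32 accumulation.
import torch

def chunked_sum(t: torch.Tensor, group: int = 4) -> torch.Tensor:
    flat = t.flatten()
    pad = (-flat.numel()) % group              # zero-pad to a multiple of `group`
    if pad:
        flat = torch.cat([flat, flat.new_zeros(pad)])
    partial = flat.view(-1, group).sum(dim=1)  # first-level partial sums
    return partial.sum()                       # final reduction over the partials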

> AFAICT PyTorch uses f32 for this computation instead of f64. Why do we need f64? Since it is causing a major regression I think we should revert this patch.
>
> We don't want an option to control f32 or f64 here. We should just be consistent with PyTorch's behavior.

In short, agreed that PyTorch's behavior should have the final say. If they are no longer doing double-precision accumulation for these ops, then we can happily drop this patch. For context, this precision issue first showed up for aten.var.correction with V-Diffusion; I'm unsure which model had problems with the mean.dim op, though.

EDIT: I changed the lines I linked to

tanyokwok pushed a commit to alibaba/BladeDISC that referenced this pull request Dec 5, 2022
@tanyokwok (Collaborator) commented Dec 5, 2022

> The summation order used by the backend also matters here: e.g., to sum 1000 elements, we might first sum the elements in groups of 4 and then fully reduce using these partial sums.

@qedawkins I agree with you that the summation order and backend implementation do matter. Accumulation and computation types are chosen according to the backend, the algorithm, and the data types. The example you mentioned is a special case rather than the general rule (it is only selected when reducing squares for a variance over all dimensions), and CUDA's implementation is different: ReduceMomentKernel.cu#L13-L42. My two cents:

  1. Torch-MLIR conversions at the torch-dialect level cannot tell which summation order and accumulation types should be used; those should be selected by the backends.
  2. The updates in this PR changed the accumulation and computation types of AtenMeanOp, which is inconsistent with what PyTorch does.

cc @vivekkhandelwal1 @silvasean

@silvasean (Contributor)

I agree @tanyokwok. Or, to put it another way: if an e2e test would fail when comparing different upstream PyTorch backends (e.g. CPU vs CUDA), then we do not want to include it in our e2e test suite.

@vivekkhandelwal1 can you revert this patch?

@vivekkhandelwal1 (Collaborator, Author)

> I tried reverting the .cpp file changes in this patch locally and all the e2e test cases in the patch continue to pass. We should just revert this patch since it does not appear to contain an e2e test that fails before but passes after.

Hi @silvasean, if you run the following test:

class MeanLargeInputModule(torch.nn.Module):
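    # Note: assumes the usual stats.py test-suite helpers
    # (torch, export, annotate_args, register_test_case, TestUtils).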
    def __init__(self):
        super().__init__()

    @export
    @annotate_args([
        None,
        ([-1, -1, -1, -1], torch.float32, True),
    ])
    def forward(self, x):
        return torch.ops.aten.mean(x)


@register_test_case(module_factory=lambda: MeanLargeInputModule())
def MeanLargeInputModule_basic(module, tu: TestUtils):
    module.forward(tu.rand(3, 4, 256, 1024, low=100, high=200))

without this patch, it won't pass. If you still want this patch to be reverted, I can do that, or I can add this test for the other ops too.

@silvasean (Contributor)

Yes, I think we want to revert this patch. Does the test pass on CUDA?

@vivekkhandelwal1 (Collaborator, Author)

> Yes, I think we want to revert this patch. Does the test pass on CUDA?

How can it be tested on CUDA?

@silvasean (Contributor)

We don't have a pre-built e2e config for CUDA, but you can just write a standalone script that runs the single op on CUDA and on CPU and compares the results.
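
For example, a possible standalone comparison along those lines (a hypothetical script, assuming a CUDA-enabled PyTorch build; the input shape and value range mirror the e2e test above):

# Hypothetical check: compare aten.mean on CPU and CUDA against an f64 reference.
import torch

torch.manual_seed(0)
x = torch.rand(3, 4, 256, 1024, dtype=torch.float32) * 100 + 100

cpu_mean = torch.ops.aten.mean(x)
cuda_mean = torch.ops.aten.mean(x.cuda()).cpu()
reference = x.double().mean().float()

print("cpu  err:", (cpu_mean - reference).abs().item())
print("cuda err:", (cuda_mean - reference).abs().item())
print("cpu vs cuda:", (cpu_mean - cuda_mean).abs().item())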

@silvasean (Contributor)

@vivekkhandelwal1 can you please revert this patch?

@powderluv (Collaborator)

Is the revert due to the lack of CUDA e2e tests (and the lack of a GPU CI?), or due to the earlier regression from moving to f64?
(Either way, it's ok to revert until we figure out the way forward, but I'm trying to understand how we want to support CUDA models, which are a mainstay of any SOTA model.)

@silvasean (Contributor) commented Dec 8, 2022

My understanding is that the CUDA and CPU backends provide different precision guarantees, and this test enforces the CPU precision guarantee. We generally do not want e2e tests that would fail when comparing PyTorch's native CPU and CUDA backends. Once we have a GPU CI we can add a native Torch CUDA e2e test config that enforces this; for now the policy is enforced manually.

I documented the rationale for removing this here:

https://github.com/llvm/torch-mlir/blob/main/docs/adding_an_e2e_test.md#special-kinds-of-tests

> The testing of functions with special numerical precision considerations can also be tricky. Our rule of thumb is that if a test would fail across two upstream PyTorch backends (e.g. CPU and CUDA) due to different numerical precision choices, then it should not be included in our e2e test suite. See https://github.com/llvm/torch-mlir/pull/1605 for context.

@vivekkhandelwal1 (Collaborator, Author)

@silvasean @tanyokwok #1692 is merged now.
