[MLIR][TORCH] Fix mean and mean.dim op for large-sized inputs #1605
Conversation
LGTM!
This is in terms of f16 support: we shouldn't be upcasting f16 to f64; casting to f32 should be enough.
@vivekkhandelwal1 @pashu123, is this still needed?
Not right now. Since the fp16 tests don't work on the refbackend, this support would be added in a separate PR.
@ramiro050 Is this patch good to merge?
LGTM!
Hi @vivekkhandelwal1, we found a significant performance regression with this commit, since all mean/sum-style reduction computations are now done in f64. I am curious why we force the computation precision to f64. Can we use an option to enable f64 in your scenario and leave f32 computation as the default? Affected issue: alibaba/BladeDISC#799
AFAICT PyTorch uses f32 for this computation instead of f64. Why do we need f64? Since it is causing a major regression, I think we should revert this patch. We don't want an option to control f32 or f64 here; we should just be consistent with PyTorch's behavior.
If we revert this conversion to
I tried reverting the .cpp file changes in this patch locally and all the e2e test cases in the patch continue to pass. We should just revert this patch since it does not appear to contain an e2e test that fails before but passes after.
In the past I referenced this to see how PyTorch was doing the mean and variance computation on CPU when debugging a correctness issue for var.correction. This culminated in the VarCorrectionLargeInputModule_basic test added in #1100, but if the large-input tests pass both with and without double precision, then the tests must not be adequately exercising the precision differences. Also, summation order in the backend matters here: e.g., to sum 1000 elements, we might first sum the elements in groups of 4 and then fully reduce using these partial sums. This can avoid underflow problems with lower precision, but it doesn't mean that summing with single precision in this lowering is correct. My guess is that PyTorch + CUDA won't do double-precision accumulation though.
In short, agreed that PyTorch's behavior should have the final say. If they are no longer doing double-precision accumulation for these ops, then we can happily drop this patch. For context, this precision issue first showed up for aten.var.correction with V-Diffusion; I'm unsure which model had problems for the mean.dim op though. EDIT: I changed the lines I linked to.
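To make the summation-order point concrete, here is a small, self-contained sketch (values and chunk size are arbitrary choices for illustration, not anything the actual lowering does) showing that strictly sequential float32 accumulation saturates at 2**24, while summing in groups of 4 first recovers the exact total:

```python
import numpy as np

# 2**24 ones followed by 1000 more ones.
n = 2**24 + 1000
x = np.ones(n, dtype=np.float32)

# Strictly sequential float32 accumulation: np.add.accumulate walks the array
# in order keeping the running sum in float32, so once it reaches 2**24,
# adding 1.0 no longer changes it.
sequential = np.add.accumulate(x)[-1]

# Group-of-4 accumulation as described above: each partial sum (4.0) is
# exactly representable, and so is every multiple of 4 up to the true total.
partial_sums = x.reshape(-1, 4).sum(axis=1, dtype=np.float32)
chunked = np.add.accumulate(partial_sums)[-1]

print(n)                  # 16778216 (the exact sum)
print(float(sequential))  # 16777216.0 -- stuck at 2**24
print(float(chunked))     # 16778216.0
```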
@qedawkins
I agree @tanyokwok. Or, to put it another way: if an e2e test would fail when comparing different upstream PyTorch backends (e.g. CPU vs CUDA), then we do not want to include it in our e2e test suite. @vivekkhandelwal1 can you revert this patch?
Hi @silvasean, if you run the following test:
without this patch, then it won't pass. If you still want this patch to be reverted, I can do that, or I can add this test for the other ops too.
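(The exact test referenced above is not preserved in this thread. Purely for illustration, a large-input mean e2e test in the style of the torch-mlir test suite might look roughly like the sketch below; the module name, shapes, and import paths are assumptions and may not match the actual test or the suite's layout at the time.)

```python
import torch

# Import paths follow the torch-mlir e2e test suite conventions; the exact
# module layout has changed across versions.
from torch_mlir_e2e_test.framework import TestUtils
from torch_mlir_e2e_test.registry import register_test_case
from torch_mlir_e2e_test.annotations import export, annotate_args


class MeanDimLargeInputModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

    @export
    @annotate_args([
        None,
        ([-1, -1], torch.float32, True),
    ])
    def forward(self, x):
        return torch.mean(x, dim=1)


@register_test_case(module_factory=lambda: MeanDimLargeInputModule())
def MeanDimLargeInputModule_basic(module, tu: TestUtils):
    # Large values over a large reduction dimension are what exposed the
    # accumulation-precision problem.
    module.forward(100 * torch.randn(8, 1024 * 1024))
```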
Yes, I think we want to revert this patch. Does the test pass on CUDA?
How can it be tested on CUDA?
We don't have a pre-built e2e config for CUDA, but you can just write a standalone script that runs the one op on CUDA and CPU and compares the results.
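For what it's worth, such a standalone comparison script (outside the e2e suite) could be as simple as the following sketch; the op, shape, and tolerances here are arbitrary choices for illustration:

```python
import torch

assert torch.cuda.is_available(), "this comparison needs a CUDA device"

# A large input of the kind discussed in this thread.
x = 100.0 * torch.randn(1024, 65536, dtype=torch.float32)

cpu_result = torch.mean(x, dim=1)
cuda_result = torch.mean(x.cuda(), dim=1).cpu()

# If native CPU and CUDA disagree beyond this tolerance, an e2e test built
# around the CPU result is enforcing a precision guarantee that PyTorch's
# own backends do not share.
print(torch.allclose(cpu_result, cuda_result, rtol=1e-5, atol=1e-5))
print((cpu_result - cuda_result).abs().max())
```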
@vivekkhandelwal1 can you please revert this patch?
Is the revert for lack of CUDA e2e tests (and lack of a GPU CI?) or for the earlier regression from moving to f64?
My understanding is that the CUDA and CPU backends provide different precision guarantees, and this test enforces the CPU precision guarantee. We generally do not want e2e tests that would fail if comparing PyTorch's native CPU and CUDA. When we have a GPU CI we can add a native torch CUDA e2e test config which enforces this, but for now the policy is enforced manually. I documented the rationale for removing this here:
@silvasean @tanyokwok #1692 is merged now.
This commit fixes the aten.mean and aten.mean.dim op decomposition for supporting large-sized inputs.
This commit also fixes the formatting for the file stats.py.
Signed-off-by: Vivek Khandelwal <[email protected]>
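Conceptually, the change described in this commit message amounts to computing the reduction in a wider accumulation type and casting the result back, along the lines of the Python sketch below (this is only a model of the idea, not the actual C++ decomposition code):

```python
import torch

def mean_dim_upcast(x: torch.Tensor, dim, keepdim=False):
    # Accumulate in a wider dtype so large reductions do not lose precision,
    # then cast the result back to the input dtype.
    wide = x.to(torch.float64)
    reduced = wide.sum(dim=dim, keepdim=keepdim)
    count = 1
    for d in dim:
        count *= x.shape[d]
    return (reduced / count).to(x.dtype)

x = 100.0 * torch.randn(1 << 20, 8, dtype=torch.float32)
print(mean_dim_upcast(x, dim=[0]))
print(torch.mean(x, dim=0))  # compare against PyTorch's native result
```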