
Disable autocast #794

Merged: 4 commits merged into main on Jun 18, 2022

Conversation

anijain2305
Contributor

With AMP, the AOT Autograd traced graph already reflects the AMP modifications.

However, TorchScript does not know that and can try to AMP-ify the already AMP-ified AOT Autograd traced graph, resulting in weird type promotion errors.

Concern (and that's why this is WIP): entering the with torch.cuda.amp.autocast(enabled=False) block for each forward and backward pass might add overhead. Is there a better way?
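
For context, a minimal sketch of the per-call wrapping being questioned (illustrative names, not the actual functorch internals): every invocation of a compiled graph enters an autocast-disabled region.

import torch

def run_compiled(compiled_graph, *args):
    # The graph was traced with the AMP casts already baked in, so disable autocast
    # around the call to keep TorchScript from AMP-ifying it a second time.
    with torch.cuda.amp.autocast(enabled=False):
        return compiled_graph(*args)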

@eellison
Contributor

eellison commented May 11, 2022

We can probably provide a private API to run without AMP.

@anijain2305
Contributor Author

We can probably provide a private API to run without AMP.

Yes, this would be super helpful.

Just to confirm: we would like the private API to be applicable to a TorchScript graph. We already have global flags that we can disable/enable, but those change the behavior of the user code.
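
As a sketch of that trade-off (assuming the private JIT flag torch._C._jit_set_autocast_mode, which exists in recent PyTorch builds but is not a stable, documented API):

import torch

# Turning the TorchScript autocast pass off once, e.g. when compiling the traced graph,
# leaves the user's own autocast context managers untouched.
torch._C._jit_set_autocast_mode(False)

# A user-visible global toggle would also silence the pass, but it changes the numerics
# of the rest of the user's model, which is the objection above.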

@eellison
Contributor

@anijain2305 I guess using the JIT compiler for forward/backward isn't in tree currently?

@anijain2305
Contributor Author

@anijain2305 I guess using the JIT compiler for forward/backward isn't in tree currently?

Not sure I fully understand.

We can definitely use TorchScript - https://github.com/pytorch/functorch/blob/main/functorch/_src/compilers.py#L23

This is the place where we call it - https://github.com/pytorch/functorch/blob/main/functorch/_src/aot_autograd.py#L169

(basically fw_compiler = ts_compile)

This is the AOT Autograd return object - https://github.com/pytorch/functorch/blob/main/functorch/_src/aot_autograd.py#L143-L185: an autograd.Function with its forward and backward set to the compiled graphs. This is where this PR wraps the forward and backward calls in with torch.cuda.amp.autocast(enabled=False).
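
A rough sketch of that structure (hypothetical names; the real aot_autograd.py also handles saving activations for backward, which is elided here):

import torch

def make_compiled_function(compiled_fw, compiled_bw):
    class CompiledFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, *args):
            # Run the already-AMP-traced forward graph with autocast disabled so
            # TorchScript does not apply AMP to it again.
            with torch.cuda.amp.autocast(enabled=False):
                return compiled_fw(*args)

        @staticmethod
        def backward(ctx, *grad_outputs):
            # Same guard around the compiled backward graph.
            with torch.cuda.amp.autocast(enabled=False):
                return compiled_bw(*grad_outputs)

    return CompiledFunction.apply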

@eellison
Contributor

eellison commented May 11, 2022

Thanks! Yeah, I was just looking for the ts_compile function to know where I need to add hooks to set this. I missed the invocation here.

@rwightman

I was about to file a separate issue for an AMP problem, but it might be covered here. Testing PT 1.12 w/ the 0.2 release branch built locally, I can no longer use aot-autograd with AMP (and that's really the only combo I'm interested in). PT 1.11 w/ the current pypi release of functorch seemed fine. Now, w/ AMP enabled, I get errors like:

WARNING: "The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "<eval_with_key>.9", line 54, in forward
    getitem_174 = native_batch_norm_backward_2[1]
    getitem_175 = native_batch_norm_backward_2[2];  native_batch_norm_backward_2 = None
    convolution_backward_2 = torch.ops.aten.convolution_backward(getitem_173, relu__45, _to_copy_51, [0], [1, 1], [0, 0], [1, 1], False, [0, 0], 1, [True, True, False]);  getitem_173 = _to_copy_51 = None
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    getitem_176 = convolution_backward_2[0]
    getitem_177 = convolution_backward_2[1];  convolution_backward_2 = None
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

and

Traceback of TorchScript (most recent call last):
  File "<eval_with_key>.3", line 39, in forward
    _unsafe_view_2 = torch.ops.aten._unsafe_view(bmm, [256, 12, 197, 197]);  bmm = None
    mul = torch.ops.aten.mul(_unsafe_view_2, 0.125);  _unsafe_view_2 = None
    _softmax = torch.ops.aten._softmax(mul, -1, True);  mul = None
               ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _to_copy_6 = torch.ops.aten._to_copy(_softmax, dtype = torch.float16);  _softmax = None
    expand_3 = torch.ops.aten.expand(_to_copy_6, [256, 12, 197, 197]);  _to_copy_6 = None
RuntimeError: conversion is supported for Half type only

@anijain2305
Contributor Author

anijain2305 commented Jun 16, 2022

I was about to file a separate issue for an AMP problem, but it might be covered here. […]

Hi @rwightman, does this PR work for you? If you have a script, I can try on my end as well.

We did not merge this one because there is a slightly better fix that @eellison is working on in PyTorch core. But it's still work in progress. So, if this PR works for you, I am inclined to merge this in and bring it into the functorch 0.2 release to unblock.
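
A hypothetical minimal script in that spirit (placeholder model and shapes, assuming the functorch.compile.memory_efficient_fusion API from the 0.2 release; not rwightman's actual script):

import torch
from functorch.compile import memory_efficient_fusion

# Toy conv/bn/relu stack standing in for a timm model.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.BatchNorm2d(8),
    torch.nn.ReLU(),
).cuda()
compiled = memory_efficient_fusion(model)

x = torch.randn(4, 3, 32, 32, device="cuda")
with torch.cuda.amp.autocast():
    loss = compiled(x).float().sum()
# Without the autocast guard from this PR, the TorchScript-compiled backward can hit
# the dtype-mismatch errors shown above.
loss.backward()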

anijain2305 changed the title from "WIP - Disable autocast" to "Disable autocast" on Jun 18, 2022
@Chillee (Contributor) left a comment

LGTM, one minor nit.

functorch/_src/aot_autograd.py (review thread resolved)
anijain2305 merged commit 12553c5 into main on Jun 18, 2022
@rwightman

@anijain2305 I cherry-picked this onto the 0.2 release branch locally and it appears to resolve the AMP issues for me.

Chillee pushed a commit that referenced this pull request Jun 21, 2022
* Disable autocast

* Add global flag

* Add a test
zou3519 pushed a commit to zou3519/pytorch that referenced this pull request Jul 20, 2022
* Disable autocast

* Add global flag

* Add a test
bigfootjon pushed a commit to pytorch/pytorch that referenced this pull request Jul 21, 2022
* Disable autocast

* Add global flag

* Add a test
clrpackages pushed a commit to clearlinux-pkgs/pytorch that referenced this pull request Feb 6, 2023
….13.1


Avery Wang (1):
      Added logging for the Reducer's non-member functions. (#65023)

Ayaka Mikazuki (1):
      [docs] Move a sentence from `nn.Transformer` to `nn.TransformerEncoder` (#78337)

Ayman Yousef (1):
      Add Hpu to the rebuild component list

BBuf (1):
      fix resize bug (#61166)

Baichuan Yuan (2):
      Weighted decay with frequency (count-based) (#60382)
      support counter-based fused rowwise adagrad (#66177)

Bairen Yi (1):
      Fix incorrect decomposition for native_dropout (#77933)

Balaji (1):
      Bug in CosineAnnealingWarmRestarts in optim/lr_scheduler.py (#64758)

Bangsheng Tang (2):
      graceful failure for draw_graph() in acc_utils.py (#66631)
      [hpc][inference] enable cuda graph in engine holder (#66738)

Banit Agrawal (1):
      [PyTorch GPU Allocator] Better use of blocks with rounding of allocation sizes (#74213)

Baoshuo Ren (1):
      chore: remove git.io

Bartek Rymkowski (1):
      CoreML .mlmodel export support (#84784)

Basil Hosmer (7):
      remove redundant getDispatchKeySetUnboxed(eligibleKeys) (#58535)
      fix nn.MHA scriptability (#58727)
      bump out repeat_interleave BC allow date (#59057)
      configurable pre/post LayerNorm in nn.Transformer (#60593)
      faster generate_square_subsequent_mask in nn.Transformer (#60631)
      preserve residual in transformer norm_first (#61692)
      MaybeOwned page for dev wiki (#63450)

Behrooz (1):
      Fix lists in the docstring

Beilei Zheng (1):
      Add BFloat16 support for multinomial and poisson on CPU

Ben Ahlbrand (1):
      [functorch] update typo in README.md (pytorch/functorch#596)

Ben Koopman (17):
      [quant] Add fp32/fp16 zero_point support for CPU fakeQuant (#65055)
      [quant] Add op benchmark for CPU FakeQuantizePerChannel with float zero_points (#65241)
      Clean up unused model instantiation (#65487)
      [quant][embedding qat] Add basic EmbeddingBag QAT fakeQuant workflow (#65443)
      [quant][embedding qat] Enable quint4 in EmbeddingBag QAT workflow (#66348)
      [quant][embedding qat] Add eager QAT test for EmbeddingBag+Linear model (#66334)
      [quant][embedding qat][bugfix] Fix and test QAT EmbeddingBag from_float error message (#66989)
      [quant] Fix comparison against reference for test_qat_functional_linear (#68061)
      [quant][embedding qat] Support non-partial functions in qconfig comparison (#68067)
      [quant][embedding qat] eager mode QAT for Embeddings (#66429)
      [quant][embedding qat] Add benchmarks for QAT Embedding+EmbeddingBag (#66560)
      [quant][embedding qat] Set FakeQuant zeropoint dtype matches observer (#68390)
      [quant][embedding qat] Fix bug enforcing quant_min <= zero_point <= quant_max for float zeropoint (#68852)
      [quant][embedding qat] Support Embedding QAT via FX API (#68296)
      [quant][embedding qat] Add FX support for QAT EmbeddingBag (#68121)
      [quant][embedding qat] Re-Land Support Embedding QAT via FX API (#69333)
      [quant][embedding qat] Re-land Add FX support for QAT EmbeddingBag (#69334)

Ben Wallace (1):
      Fix typos in `torch.package` documentation (#82994)

Benjamin Rowell (1):
      Adds keyword only args to gradcheck (#65290)

Benoit Steiner (1):
      Revert D39583438: Multisect successfully blamed D39583438 for test or build failures (#85277)

Bert Maher (80):
      [nnc][scripts] Add a script for bisecting the TE fuser pass (#58357)
      [nnc] Make the pretty printer prettier (#57874)
      [nnc] Do not fuse unsqueeze with variable dim (#58346)
      VaryingShape<Strides>::isComplete() needs to consider whether each Stride is complete (#58510)
      [nnc] Enable CPU fusion inside Facebook, take 2 (#58347)
      Revert D28461013: [nnc] Enable CPU fusion inside Facebook, take 2
      [nnc] Use int64 to compute matmul flops heuristic (#58676)
      [nnc] Concat input shapes must be known to fuse (#58974)
      [nnc] LLVMCodeGen for any target (#58713)
      [nnc] Enable CPU fusion inside Facebook, take 3 (#59253)
      Revert D28800692: [nnc] Enable CPU fusion inside Facebook, take 3
      [nnc] Enable CPU fusion inside Facebook, take 4
      Revert D28859795: [nnc] Enable CPU fusion inside Facebook, take 4
      [nnc] Add hardsigmoid (#59069)
      Fix symbolic derivative of hardswish (#59405)
      [nnc] Infer device type from nodes if inputs are all scalars (#59430)
      [nnc] Enable CPU fuser inside FB, take 5 (#59461)
      [nnc] Do not fuse matmul/conv2d if inputs are discontiguous. (#59754)
      [nnc] Limit the number of inputs to a fusion group.
      [nnc] Handle more cases of excessive # of cat args (#60043)
      [nnc] Move operator implementations into a subdirectory (#59988)
      [nnc] Move batchnorm to operators library (#59992)
      [nnc] Speed up batchnorm benchmark
      [nnc][tests] Tests and benchmarks for computeSum (#60160)
      Reland D29190420: [nnc][tests] Tests and benchmarks for computeSum (#60550)
      [nnc] Merge inconsistent profiling information (#60510)
      Fix the NNC-disabled path in static runtime for perf comparisons
      [nnc] Serialize initialization of LLVM targets (#60996)
      [nnc] Get rid of fuser trigger counters (#57334)
      [nnc] Insert alloc/free at global scope (#61725)
      Linker version script to hide LLVM symbols (#62906)
      Hide all symbols in llvm namespace (#63272)
      Retry apt-get during setup_ci_workspace (#63319)
      [nnc] Support thread level parallelism in fused kernels (#63386)
      Remove flag to toggle CPU fusion in the presence of parallelism (#63514)
      [nnc] Enable CPU fusion (#63545)
      Revert D30417127: Remove flag to toggle CPU fusion in the presence of parallelism
      Revert D30360382: [nnc] Support thread level parallelism in fused kernels
      [nnc] Re-enable CPU fusion" (#63665)
      Fix some memory bugs in onnx passes (#63754)
      [nnc] Disable erf and erfc (#63775)
      Don't switch executors mid test (#63830)
      Re-apply: [nnc] Support thread level parallelism in fused kernels (#63776)
      [nnc] Fix dtype promotion involving scalars (#64002)
      [nnc] Fix batchnorm implementation (#64112)
      Parse int64 sizes/strides (#64076)
      [nnc] Make 64-bit dimensions work (#64077)
      [nnc] Fix half2float conversion and re-enable float16 (#64199)
      [nnc] Enable fusion of bfloat16 ops (#64196)
      [nnc] Make our exceptions c10::Errors, get C++ stacktraces (#64332)
      Revert D30745610: [nnc] Make our exceptions c10::Errors, get C++ stacktraces
      [nnc] Provide helpful error messages about turning off the fuser (#64516)
      Lock unpickling of source ranges
      Avoid UB when indexing into size-0 tensors (#65878)
      [nnc] Add call_with_numel interface for fast CUDA calls (#65213)
      [nnc] Add BufHandle.store to python API (#65213)
      Fix typo in name of LayerNormBackwardCUDAKernel (#66000)
      Make handle_torch_function_no_python_arg_parser public (#66054)
      Rename tensorexpr::Value so that it can coexist with torch::jit::Value (#66467)
      [nnc] Use a descriptive name for fused kernels when profiling (#66990)
      Benchmarks for various fusers (#67622)
      [pytorch/tensorexpr] Update use of LLJIT::lookup for LLVM 15
      [functorch] Support functions with multiple outputs in `compiled_function` (pytorch/functorch#127)
      [functorch] Introduce compiled_module for eager compilation of modules (pytorch/functorch#133)
      [functorch] Remove some commented code (pytorch/functorch#146)
      [functorch] Support buffers in compiled_module (pytorch/functorch#147)
      [functorch] Shape-specialization key for op caching
      [functorch] Helper to convert SpecializationKey to python object
      [functorch] Class for caching compilation results
      [functorch] Proxies for binding compilation results to python objects
      [functorch] Num arg-and-dim specialized cache for generated kernels
      [functorch] Num arg specialized cache
      [functorch] Complete compile cache, with in-out specialization
      [functorch] Python bindings for compilation cache
      [functorch] Python pointwise compiler implementation (pytorch/functorch#163)
      [functorch] Revert the compile cache (pytorch/functorch#168)
      [functorch] Re-land the compile cache (pytorch/functorch#169)
      [functorch] "Scorecard" benchmarks for pointwise op authoring (pytorch/functorch#193)
      [functorch] Fix PointwiseCompiler on CUDA (pytorch/functorch#203)
      [functorch] Clean up perf scorecard and add barplot generation script (pytorch/functorch#212)

Bhavya Medishetty (1):
      To add hipify_torch as a submodule in pytorch/third_party  (#74704)

Bill Darrow (1):
      [rpc/distributed] eliminate code duplication in distributed/rendezvou… (#81577)

Bin Bao (29):
      Enable NNC fusion for relu6 (#58773)
      [JIT] Add a phase to perform inplace<->functional conversion for activation operators (#57477)
      [NNC] Add a dtypeToCppString virtual method in IRPrinter (#59449)
      [NNC] Handle int64 indices and loop bounds (#59769)
      [JIT] Initialize CUDA context before launching fused kernel (#65064)
      [LT] Add ir_util for ComputePostOrder (#67282)
      [LT] Merge permutation_util into master (#67766)
      [LT] Merge cache.h (#67929)
      Add lazy::Shape::numel() (#68314)
      [LT] Sync LTC branch changes on torch/csrc/lazy/core (#69012)
      [LT] Upstream more util functions (#69098)
      [LT] Upstream LazyView and view ops IR Nodes (#69277)
      [LT] Sync with the lazy_tensor_staging branch (#69527)
      [LTC] Upstream utils in computation_client (#69621)
      [LTC] Upstream several internal ops (#69716)
      [LTC] Upstream LazyTensor and LazyGraphExecutor (#69815)
      [LTC] Fix stride accessors in LTCTensorImpl (#70623)
      Dispatch to at::convolution instead of at::_convolution in _convolution_double_backward (#70661)
      [LT] Add a flag to control IR reusing
      [LT] Move MakeNode into ir_builder.h
      [LT] Add a trie data structure for caching IR nodes
      [LT] Store OpKind for each IR subclass in a static field
      [LT] Move device lock in LazyGraphExecutor to a later place
      Revert "Revert "[LT] Store OpKind for each IR subclass in a static field""
      [LT] Codegen ReuseNode for supported ops
      Revert "Revert "[LT] Codegen ReuseNode for supported ops""
      [LT] Add IR resuing support for manually-implemented ops
      [LTC] Pass a BackendDevice parameter into GetIrValueForScalarFromCodegen (#82970)
      Add a flag to trigger inductor testing (#85183)

Bin Chen (3):
      Named pipe based watchdog timer (#83695)
      Add watchdog to TorchElastic agent and trainers (#84081)
      Log Watchdog events to scuba (#85391)

Bin Wen (5):
      Add a timeout argument to RPC shutdown() (#65425)
      add gather to ShardedTensor (#65671)
      [fbcode] Fix operator_benchmark with jit mode (#67382)
      [fbcode][static runtime] out-variant for quantized::linear_dynamic_fp16 (#67663)
      [torch.package][doc] PackageExporter does not have file_structure (#79948)

Bo Tan (1):
      Only set sccache_epilogue to run on build job exits (#67798)

Bo Wang (9):
      Make broadcast_object_list accept a device parameter. (#61305)
      Compare DDP static graph (C++ core) with legacy DDP forward and backward delay. (#61507)
      Add driver function to run test_sharded_tensor.py and test_sharding_spec.py (#63189)
      Extend _sharded_tensor constructor to support other ops like torch.ones (#63378)
      Merge common fields from TensorInitParams and ShardedTensorMetadata into TensorProperties (#63731)
      More sharded_tensor creation ops: sharded_tensor.zeros, sharded_tensor.full, sharded_tensor.rand (#63732)
      Add torch.nn.init.uniform_ operator to ShardedTensor. (#63997)
      Enroll bowangbj@ to PyTorch distributed package (#67062)
      Add torch.nn.init.normal_ and torch.nn.init.kaiming_uniform_ ops to ShardedTensor (#67057)

Bo Wu (1):
      Back out "Make TorchScript Preserve Fully Qualified Class Name for Python Exceptions"

BoTorch website deployment script (1):
      Update SobolEngine docstring w/ correct behavior (#62548)

Bobby Impollonia (1):
      Fix typo in comment (#85635)

Bowen Bao (13):
      [ONNX] Support conv-bn fusion in blocks (#66152) (#67272)
      [ONNX] Update value name copying logic for onnx (#66170) (#67275)
      [ONNX] Update onnx function export with comments and clean up (#66817) (#67803)
      [ONNX] Suppress ort warnings in onnx related test (#67054) (#67804)
      [ONNX…