Releases: oneapi-src/oneDNN
v1.4-rc
This is a release candidate for DNNL v1.4. Please provide feedback and report bugs in GitHub issues.
v1.3
Performance optimizations
- Introduced broad release quality optimizations for future Intel(R) Xeon(R) Scalable processors (code name Cooper Lake).
- Improved performance of matmul primitive for 3D tensors (batched matrix-matrix multiplication) on all supported processors.
- Improved performance of binary primitive for cases where one of the tensors has to be broadcast, on all supported processors (see the sketch after this list).
- Improved performance of convolution primitive for 3D tensors and 1x1 kernel size on all supported processors.
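The broadcast case called out above maps directly onto the binary primitive API. Below is a minimal C++ sketch with illustrative shapes, applying a per-channel tensor broadcast against a full NCHW tensor; binary_max is one of the algorithms this release adds:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // src0 is a full NCHW tensor; src1 holds one value per channel and is
    // broadcast over the batch and spatial dimensions.
    memory::desc src0_md({2, 16, 7, 7}, memory::data_type::f32, memory::format_tag::nchw);
    memory::desc src1_md({1, 16, 1, 1}, memory::data_type::f32, memory::format_tag::nchw);
    memory::desc dst_md({2, 16, 7, 7}, memory::data_type::f32, memory::format_tag::nchw);

    // binary_add and binary_mul are set up the same way as binary_max.
    binary::desc bd(algorithm::binary_max, src0_md, src1_md, dst_md);
    binary::primitive_desc pd(bd, eng);
    binary prim(pd);

    memory src0(src0_md, eng), src1(src1_md, eng), dst(dst_md, eng);
    prim.execute(s, {{DNNL_ARG_SRC_0, src0}, {DNNL_ARG_SRC_1, src1},
                     {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}
```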
New functionality
- Introduced fused depthwise convolution support for convolutions with 1x1 filters. The implementation is available for all supported processors and data types. The functionality is not implemented for Intel Processor Graphics.
- Introduced peephole support for LSTM cell on all supported processors. The functionality is not implemented for Intel Processor Graphics.
- Implemented matmul primitive for Intel Processor Graphics.
- Extended binary primitive with support for min and max algorithms.
- Extended eltwise primitive (see the first sketch after this list):
  - Introduced erf-based implementation of the gelu algorithm
  - Introduced pow algorithm
  - Introduced a backpropagation flavor that relies on the destination tensor as input for elu, exp, logistic, relu, sqrt, and tanh algorithms
- Extended set of operations for memory descriptors (see the second sketch after this list):
  - Added support for changing the number of dimensions with the existing dnnl::memory::desc::reshape() method
  - Introduced dnnl::memory::desc::permute_axes() method to change logical axes order
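A minimal sketch of the erf-based gelu flavor; the eltwise_gelu_erf algorithm kind is assumed to be the name behind the item above, and the shape is illustrative:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    memory::desc data_md({2, 16, 7, 7}, memory::data_type::f32, memory::format_tag::nchw);

    // alpha and beta are unused by gelu; for eltwise_pow they would carry
    // the scale and the exponent.
    eltwise_forward::desc ed(prop_kind::forward_inference,
            algorithm::eltwise_gelu_erf, data_md, 0.f, 0.f);
    eltwise_forward::primitive_desc pd(ed, eng);
    eltwise_forward prim(pd);

    memory src(data_md, eng), dst(data_md, eng);
    prim.execute(s, {{DNNL_ARG_SRC, src}, {DNNL_ARG_DST, dst}});
    s.wait();
    return 0;
}
```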
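Both memory descriptor operations are plain value transformations on dnnl::memory::desc; a minimal sketch with illustrative shapes:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    // 2D fp32 descriptor, 6 x 4, plain row-major layout.
    memory::desc md({6, 4}, memory::data_type::f32, memory::format_tag::ab);

    // reshape() can now change the number of dimensions as long as the
    // total element count is preserved: 6 x 4 -> 2 x 3 x 4.
    memory::desc md_3d = md.reshape({2, 3, 4});

    // permute_axes() changes the logical axes order; {1, 0} swaps the two
    // axes of the original 2D descriptor.
    memory::desc md_t = md.permute_axes({1, 0});

    (void)md_3d;
    (void)md_t;
    return 0;
}
```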
Thanks to the contributors
This release contains contributions from the project core team as well as Araujo Mitrano, Arthur @aaraujom, Aaron Mark Johnson @aaronjohnson, Benjamin Hipple @bhipple, Sergey Nesterov @cepera, @gaurav1086, Ilya Taraban @itaraban, Mesut Meterelliyoz @mmeterel, @nSircombe, Peter Caday @petercad, and Rafik Saliev @rsaliev. We would also like to thank everyone who asked questions and reported issues.
v1.2.2
This is a patch release containing the following changes to v1.2.1:
- Fixed overflow in transposition in bfloat16 weights gradient convolution (0d28389)
- Added a workaround for corrupted unique_ptr usage in scratchpad (91c89a9)
- Fixed int8 deconvolution with int32 output on Intel AVX2 systems (ef2d652)
- Fixed segmentation fault in concat due to incorrect memory alignment #668 (7a0c3a9)
- Fixed performance regression in no-copy gemm dispatching #525 (89a303b)
- Fixed segmentation fault in fp32 weights gradient convolution with dilation and large padding (50546ad)
- Fixed bfloat16/fp32 scalability for eltwise primitive (e281a4a)
v1.3-rc
This is a release candidate for DNNL v1.3. Please provide feedback and report bugs in GitHub issues.
v0.21.4
This is a patch release containing the following changes to v0.21.3:
v2.0-beta05
This is a preview release for oneDNN v2.0. The release is a patch release based on DNNL v2.0-beta04.
Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.
Known Limitations
- Weights gradient convolution for the bfloat16 data type with a 1D spatial tensor and dilation may produce incorrect results on CPU.
- Weights gradient convolution for the bfloat16 data type with a 2D spatial tensor and dilation may crash on Intel AVX512 systems.
- Optimized primitives can crash or fail for huge spatial sizes on CPU.
- dnnl_sgemm, dnnl_gemm_u8s8u32, and inner product functionality do not support sizes exceeding 2^32.
- Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check that the GPU device is an Intel device. For more control, create a DNNL engine by passing a SYCL device and context explicitly (see the sketch after these limitations).
- Intel Processor Graphics Gen11 is not supported.
- GPU kernels that take longer than a certain time to run (the threshold depends on OS and system settings) may cause the application to appear to hang. Configure the driver to disable this timeout to avoid hangs in DPC++ or OpenCL programs, including DNNL examples.
On Linux:
$ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
On Windows, increase the TdrDelay and TdrDdiDelay registry values.
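For the explicit engine creation mentioned in the limitations above, a minimal sketch follows; the engine constructor taking a SYCL device and context is assumed from this beta's DPC++ build and should be checked against the shipped headers:

```cpp
#include <CL/sycl.hpp>
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    // By index: device order comes from the SYCL runtime, and the library
    // does not verify that the selected GPU is an Intel device.
    engine eng_by_index(engine::kind::gpu, 0);

    // Explicit device and context (assumed interop constructor): the
    // application decides exactly which device is used.
    cl::sycl::device dev{cl::sycl::gpu_selector{}};
    cl::sycl::context ctx{dev};
    engine eng_explicit(engine::kind::gpu, dev, ctx);
    return 0;
}
```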
v1.2.1
This is a patch release containing the following changes to v1.2:
- Improved GEMM performance for 1 thread (1fd2bc0)
- Fixed RNN cell backpropagation computations (4b15a0c)
- Fixed alpha and beta handling in vanilla RNN cell (70f8b87)
- Reduced sizes in performance profiling example to avoid memory overflow for systems with less than 2 GB memory (f6e2ef9)
- Fixed correctness for strided convolution with a 1x1 filter and non-matching source and destination formats (0405c9a)
- Removed lambda calls from OpenMP loops as a workaround for Intel C/C++ Compiler 19.1 (a603593)
- Added -O1 flag for backward convolution gtests as a workaround for Intel C/C++ Compiler 19.1 (495b91f)
v2.0-beta04
This is a preview release for oneDNN v2.0. The release is based on oneDNN v1.2.
Binary distribution of this software is available as Intel(R) oneAPI Deep Neural Network Library in Intel(R) oneAPI.
Known Limitations
- Non-Intel GPUs are not supported. The library API allows creating a DNNL engine by index (the order of devices is determined by the SYCL runtime), and there is no check that the GPU device is an Intel device. For more control, create a DNNL engine by passing a SYCL device and context explicitly.
- Intel Processor Graphics Gen11 is not supported.
- GPU kernels that take longer than a certain time to run (the threshold depends on OS and system settings) may cause the application to appear to hang. Configure the driver to disable this timeout to avoid hangs in DPC++ or OpenCL programs, including DNNL examples.
On Linux:
$ sudo bash -c 'echo N > /sys/module/i915/parameters/enable_hangcheck'
On Windows, increase the TdrDelay and TdrDdiDelay registry values.
v1.2
Performance optimizations
- Improved 1D backward convolution performance on CPU.
- Improved int8 inference performance on pre-Intel AVX512 systems.
- Improved int8 inference performance for 3D spatial data on CPU.
- Improved performance of convolution and other primitives on GPU.
New functionality
- Introduced general-purpose matrix-matrix multiplication (matmul) primitive (see the sketch after this list). The functionality supports fp32, bfloat16, and int8 data types with asymmetric quantization.
- Introduced logsoftmax and resampling primitives.
- Introduced clip and log algorithms support in elementwise primitive.
- Introduced int8 and bf16 data types support for binary primitive (CPU only).
- Introduced fully functional support of int8 (inference) and bfloat16 (inference and training) data types on GPU. The functionality is not intended to deliver performance improvements over f32 on current Intel Integrated Graphics, but to enable conformance experiments.
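A minimal fp32 sketch of the matmul primitive with illustrative shapes; a 3D shape such as {batch, M, K} expresses the batched case in the same way:

```cpp
#include "dnnl.hpp"
using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);
    stream s(eng);

    // C[M, N] = A[M, K] * B[K, N] in fp32.
    const memory::dim M = 64, K = 32, N = 48;
    memory::desc a_md({M, K}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f32, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f32, memory::format_tag::ab);

    matmul::desc md(a_md, b_md, c_md);
    matmul::primitive_desc pd(md, eng);
    matmul prim(pd);

    memory a(a_md, eng), b(b_md, eng), c(c_md, eng);
    prim.execute(s, {{DNNL_ARG_SRC, a}, {DNNL_ARG_WEIGHTS, b},
                     {DNNL_ARG_DST, c}});
    s.wait();
    return 0;
}
```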
Usability improvements
- Added JIT code annotations for linux-perf profiler.
- Added mechanism to control CPU dispatcher behavior at runtime via the DNNL_MAX_CPU_ISA environment variable or a function call (see the sketch after this list).
- Extended DNNL_VERBOSE output with more information about runtimes and devices.
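A minimal sketch of the runtime dispatcher control from code; the same cap can come from the environment (for example DNNL_MAX_CPU_ISA=AVX2) before the process starts:

```cpp
#include "dnnl.hpp"

int main() {
    // Cap the JIT dispatcher at AVX2. The call must happen before the
    // first primitive is created; after that the setting is frozen.
    dnnl::set_max_cpu_isa(dnnl::cpu_isa::avx2);

    // Subsequent primitives will not dispatch to AVX512 kernels.
    dnnl::engine eng(dnnl::engine::kind::cpu, 0);
    return 0;
}
```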
Thanks to the contributors
This release contains contributions from the project core team as well as Aaron Johnson @aaronjohnson, Attila T. Áfra @atafra, Ben Fitch, Ilya Taraban @itaraban, Michał Gallus @Sand3r-, Peter Caday @petercad, Qiyou Chen @chenqy4933 and Jun Luan @junluan. We would also like to thank everyone who asked questions and reported issues.