Release v0.8.0
The new stable version offers significant performance and code-quality improvements for the generated kernel programs.
- Added support for on-the-fly specialization of kernels using dynamic partial evaluation.
- Added support for dynamic shared memory (CPU & Cuda backends).
- Added new KernelConfig structure to specify launch dimensions for explicitly grouped kernels.
- Added new Index1 structure to avoid name clashes with new System.Index structure.
- Added additional tuple conversion methods to Index2 and Index3 types.
- Added new EntryPointDescription structure to specify an entry point and its index type.
- Added RuntimeKernelConfig structure to combine static and dynamic information about a particular kernel launch.
- Added support for linear arrays in local memory.
- Added support for enum-value interop (#66).
- Reworked explicitly grouped kernel launchers to use the new KernelConfig structure instead of GroupedIndex types.
- Simplified static Grid and Group properties.
- Removed all GroupedIndex types.
- Updated the whole compilation pipeline to enable more aggressive optimizations.
- Significantly improved performance of emitted PTX and OpenCL code by enabling more aggressive optimizations and clever code generation (#70).
- Added support for "unmanaged" C# structures in the scope of buffers and views.
- Reworked PTX backend to support all API changes and to fix several critical code-generation issues. This also includes emission of PTX instructions that mimic the Cuda compiler (#68).
- Reworked OpenCL backend to support all API changes and to fix several critical code-generation issues (#67, #72, #73, #74, #78, #85, #88, #91, #92).
- New debug information input module to support the latest PDB format updates.
- Considerably improved error messages using debug information (#86).
- Reduced memory consumption during the compilation process.
- Performance improvements of the internal compilation pipeline.
- Improved performance of kernel launchers.
- Extended CudaAPI to support page-locked host-memory allocation functions.
- Extended ExchangeBuffer to use new page-locked memory allocation (if available).
- Added new IR-rewriter API to perform more advanced IR transformations.
- Adapted all existing transformations to use the new rewriter API.
- Reduced memory consumption of all nodes by compressing information.
- Redesigned several IR nodes to support global program transformations.
- Reworked implementation of GetSubView in the context of generic and multidimensional array views (#19).
- Fixed several issues in the scope of address-space inference.
- Fixed critical code generation issues that could occur when replacing values.
Special thanks to @MoFtZ for contributing to this release.