🔥🔥🔥 This repository lists some awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
- Awesome-CUDA-Triton-HPC
- Official Version
- Awesome List
- Learning Resources
- Frameworks
- Applications
- Blogs
- Videos
- Interview
-
CUDA : CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on graphics processing units (GPUs).
-
NVIDIA/cuda-python : CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. CUDA Python Low-level Bindings. nvidia.github.io/cuda-python/
-
cuBLAS : Basic Linear Algebra on NVIDIA GPUs. NVIDIA cuBLAS is a GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance.
-
cuDNN : The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.
-
CUTLASS : CUDA Templates for Linear Algebra Subroutines. CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels and scales within CUDA. It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN.
-
TensorRT : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. developer.nvidia.com/tensorrt
-
TensorRT-LLM : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. nvidia.github.io/TensorRT-LLM
-
Triton : Triton is a language and compiler for parallel programming. It aims to provide a Python-based programming environment for productively writing custom DNN compute kernels capable of running at maximal throughput on modern GPU hardware. triton-lang.org/
-
TVM : Open deep learning compiler stack for cpu, gpu and specialized accelerators. tvm.apache.org/
-
MLIR : Multi-Level Intermediate Representation Compiler Framework. The MLIR project is a novel approach to building reusable and extensible compiler infrastructure. MLIR aims to address software fragmentation, improve compilation for heterogeneous hardware, significantly reduce the cost of building domain specific compilers, and aid in connecting existing compilers together.
-
awesome-cuda-triton-hpc : A list of awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR and High Performance Computing (HPC) projects.
-
Erkaman/Awesome-CUDA : This is a list of useful libraries and resources for CUDA development.
-
jslee02/awesome-gpgpu : 😎 A curated list of awesome GPGPU (CUDA/OpenCL/Vulkan) resources.
-
mikeroyal/CUDA-Guide : A guide covering CUDA including the applications and tools that will make you a better and more efficient CUDA developer.
-
rkinas/triton-resources : A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
-
chenzomi12/AISystem : AISystem covers AI systems, including full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks.
-
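chenzomi12/AIFoundation : AIFoundation covers what happens when AI systems meet large models: the full-stack core technologies for supporting large-model training and inference at the system level, from the bottom layer to the top.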
-
-
NVIDIA CUDA Toolkit Documentation : CUDA Toolkit Documentation.
-
NVIDIA CUDA C++ Programming Guide : CUDA C++ Programming Guide.
-
NVIDIA CUDA C++ Best Practices Guide : CUDA C++ Best Practices Guide.
-
NVIDIA/cuda-samples : Samples for CUDA developers that demonstrate features in the CUDA Toolkit.
-
NVIDIA/CUDALibrarySamples : CUDA Library Samples.
-
NVIDIA/cuda-python : CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. CUDA Python Low-level Bindings. nvidia.github.io/cuda-python/
-
CuPy : NumPy & SciPy for GPU. cupy.dev. CuPy User Guide
-
NVIDIA-developer-blog/code-samples : Source code examples from the Parallel Forall Blog.
-
HeKun-NVIDIA/CUDA-Programming-Guide-in-Chinese : A Chinese translation of the CUDA C Programming Guide.
-
cuda-mode/lectures : Material for cuda-mode lectures.
-
cuda-mode/resource-stream : CUDA related news and material links.
-
brucefan1983/CUDA-Programming : Sample codes for my CUDA programming book.
-
YouQixiaowu/CUDA-Programming-with-Python : Python versions of the sample code from the book CUDA Programming, written using the pycuda module.
-
QINZHAOYU/CudaSteps : A CUDA learning path based on the book 《cuda编程-基础与实践》 (CUDA Programming: Basics and Practice) by 樊哲勇 (Zheyong Fan).
-
MAhaitao999/CUDA_Programming : Code for the book 《CUDA编程基础与实践》 (CUDA Programming: Basics and Practice).
-
sangyc10/CUDA-code : Companion code for the bilibili video series "Introduction to CUDA Programming Basics (continuously updated)".
-
RussWong/CUDATutorial : A CUDA tutorial for learning CUDA programming from scratch.
-
DefTruth/CUDA-Learn-Notes : 🎉 CUDA/C++ notes / hand-written CUDA kernels for large models / technical blog, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
-
BBuf/how-to-optim-algorithm-in-cuda : How to optimize some algorithms in CUDA.
-
PaddleJitLab/CUDATutorial : A self-learning tutorial for CUDA high-performance programming, starting from scratch.
-
ifromeast/cuda_learning : learning how CUDA works.
-
leimao/CUDA-GEMM-Optimization : CUDA Matrix Multiplication Optimization. This repository contains the CUDA kernels for general matrix-matrix multiplication (GEMM) and the corresponding performance analysis.
-
interestingLSY/CUDA-From-Correctness-To-Performance-Code : Codes & examples for "CUDA - From Correctness to Performance". The lecture can be found at https://wiki.lcpu.dev/zh/hpc/from-scratch/cuda.
-
Liu-xiandong/How_to_optimize_in_GPU : This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is basically at or near the theoretical limit.
-
tpoisonooo/how-to-optimize-gemm : row-major matmul optimization. zhuanlan.zhihu.com/p/65436463.
-
Bruce-Lee-LY/matrix_multiply : Several common methods of matrix multiplication are implemented on CPU and Nvidia GPU using C++11 and CUDA.
-
Bruce-Lee-LY/cuda_hgemm : Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruction.
-
Bruce-Lee-LY/cuda_hgemv : Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
-
enp1s0/ozIMMU : FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme. arxiv.org/abs/2306.11975
-
Cjkkkk/CUDA_gemm : A simple high performance CUDA GEMM implementation.
-
AyakaGEMM/Hands-on-GEMM : A GEMM tutorial.
-
zpzim/MSplitGEMM : Large matrix multiplication in CUDA.
-
jundaf2/CUDA-INT8-GEMM : CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API.
-
chanzhennan/cuda_gemm_benchmark : Based on gtest/benchmark; refers to https://github.com/Liu-xiandong/How_to_optimize_in_GPU.
-
YuxueYang1204/CudaDemo : Implement custom operators in PyTorch with cuda/c++.
-
CoffeeBeforeArch/cuda_programming : Code from the "CUDA Crash Course" YouTube series by CoffeeBeforeArch.
-
rbaygildin/learn-gpgpu : Algorithms implemented in CUDA + resources about GPGPU.
-
godweiyang/NN-CUDA-Example : Several simple examples for popular neural network toolkits calling custom CUDA operators.
-
yhwang-hub/Matrix_Multiplication_Performance_Optimization : Matrix Multiplication Performance Optimization.
-
caiwanxianhust/ClusteringByCUDA : A series of clustering algorithms implemented in CUDA C++.
-
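ulrichstern/cuda-convnet : Alex Krizhevsky's original code from Google Code. See the article from the WeChat official account 「人工智能大讲堂」 (AI Lecture Hall): "Found AlexNet's original source code: no framework, CUDA/C++ written by hand from scratch".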
-
PacktPublishing/Learn-CUDA-Programming : Learn CUDA Programming, published by Packt.
-
PacktPublishing/Hands-On-GPU-Programming-with-Python-and-CUDA : Hands-On GPU Programming with Python and CUDA, published by Packt.
-
PacktPublishing/Hands-On-GPU-Accelerated-Computer-Vision-with-OpenCV-and-CUDA : Hands-On GPU Accelerated Computer Vision with OpenCV and CUDA, published by Packt.
-
BobMcDear/neural-network-cuda : Neural network from scratch in CUDA/C++.
-
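zjhellofss/KuiperLLama : "Build Your Own Large Model Inference Framework". KuiperLLama is a hands-on, build-it-yourself large model inference framework supporting LLama2/3 and Qwen2.5. A good project for campus recruitment (autumn and spring hiring) and internships that walks you through implementing the framework from scratch.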
-
zjhellofss/KuiperInfer : A good project for campus recruitment (autumn and spring hiring) and internships! Implement a high-performance deep learning inference library step by step, with support for inference of models such as llama2, Unet, Yolov5, and Resnet.
-
zjhellofss/kuiperdatawhale : Build your own deep learning inference framework from scratch.
-
MarioSieg/magnetron : (WIP) A small but powerful, homemade PyTorch from scratch. Minimalistic homemade PyTorch alternative, written in C99 and Python.
-
lucasdelimanogueira/PyNorch : Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)
-
xgqdut2016/cuda_code : Easy CUDA code: a simple introduction to CUDA programming.
-
xgqdut2016/hpc_project : Some HPC projects for learning.
-
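xgqdut2016/hpc2torch : This repository aims to build a test framework for a high-performance low-level library. It will implement high-performance kernels for ONNX operators as a complement to PyTorch, and compare the performance and accuracy of the hand-written kernels against PyTorch library functions from the Python side.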
-
-
-
NVIDIA TensorRT Docs : NVIDIA Deep Learning TensorRT Documentation.
-
TensorRT : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. developer.nvidia.com/tensorrt
-
TensorRT-LLM : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. nvidia.github.io/TensorRT-LLM
-
HeKun-NVIDIA/TensorRT-Developer_Guide_in_Chinese : A Chinese version of the NVIDIA TensorRT developer guide, personally translated with the translator's own notes added.
-
kalfazed/tensorrt_starter : This repository gives a guideline for learning CUDA and TensorRT from the beginning.
-
LitLeo/TensorRT_Tutorial : TensorRT_Tutorial.
-
-
-
Triton : Development repository for the Triton language and compiler. triton-lang.org/
-
Triton Docs : Triton Documentation.
-
hyperai/triton-cn : Triton documentation in Simplified Chinese. triton.hyper.ai
-
-
- Apache TVM 中文站 (Apache TVM Chinese site) : Apache TVM documentation in Chinese!
-
-
LLVM Docs : LLVM Documentation.
-
MLIR Docs : MLIR Code Documentation.
-
BBuf/tvm_mlir_learn : A collection of compiler learning resources.
-
j2kun/mlir-tutorial : This is the code repository for a series of articles on the MLIR framework for building compilers.
-
KEKE046/mlir-tutorial : Hands-On Practical MLIR Tutorial.
-
AyakaGEMM/Hands-on-MLIR : Hands-on-MLIR.
-
yao-jiashu/KernelCodeGen : GEMM/Conv2d CUDA/HIP kernel code generation using MLIR.
-
-
-
LAFF-On-PfHP : LAFF-On Programming for High Performance.
-
flame/how-to-optimize-gemm : How To Optimize Gemm wiki pages. https://github.com/flame/how-to-optimize-gemm/wiki
-
flame/blislab : BLISlab: A Sandbox for Optimizing GEMM. Check the tutorial for more details.
-
tpoisonooo/how-to-optimize-gemm : row-major matmul optimization. zhuanlan.zhihu.com/p/65436463.
-
YichengDWu/matmul.mojo : High Performance Matrix Multiplication in Pure Mojo 🔥
-
-
-
-
-
CCCL : CUDA C++ Core Libraries. The concept for the CUDA C++ Core Libraries (CCCL) grew organically out of the Thrust, CUB, and libcudacxx projects that were developed independently over the years with a similar goal: to provide high-quality, high-performance, and easy-to-use C++ abstractions for CUDA developers.
-
HIP : HIP: C++ Heterogeneous-Compute Interface for Portability. HIP is a C++ Runtime API and Kernel Language that allows developers to create portable applications for AMD and NVIDIA GPUs from single source code. rocmdocs.amd.com/projects/HIP/
-
-
-
NVIDIA/cuda-python : CUDA Python is the home for accessing NVIDIA’s CUDA platform from Python. CUDA Python Low-level Bindings. nvidia.github.io/cuda-python/
-
PyCUDA : PyCUDA: Pythonic Access to CUDA, with Arrays and Algorithms. mathema.tician.de/software/pycuda
-
-
-
jessfraz/advent-of-cuda : Doing advent of code with CUDA and rust.
-
Bend : A massively parallel, high-level programming language. higherorderco.com
-
HVM : A massively parallel, optimal functional runtime in Rust. higherorderco.com
-
ZLUDA : CUDA on AMD GPUs.
-
Rust-CUDA : Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
-
cudarc : cudarc: minimal and safe api over the cuda toolkit.
-
bindgen_cuda : A crate similar to bindgen in philosophy. It helps create automatic bindings to CUDA kernel source files and makes them easier to use directly from Rust.
-
cuda-driver : A CUDA runtime environment based on the CUDA Driver API.
-
async-cuda : Asynchronous CUDA for Rust.
-
async-tensorrt : Asynchronous TensorRT for Rust.
-
krnl : Safe, portable, high performance compute (GPGPU) kernels.
-
custos : A minimal OpenCL, CUDA, WGPU and host CPU array manipulation engine / framework.
-
spinorml/nvlib : Rust interoperability with NVIDIA CUDA NVRTC and Driver.
-
DoeringChristian/cuda-rs : Cuda Bindings for rust generated with bindgen-cli (similar to cust_raw).
-
romankoblov/rust-nvrtc : NVRTC bindings for Rust.
-
solkitten/astro-cuda : CUDA Driver API bindings for Rust.
-
bokutotu/curs : cuda&cublas&cudnn wrapper for Rust.
-
rust-cuda/cuda-sys : Rust binding to CUDA APIs.
-
bheisler/RustaCUDA : Rusty wrapper for the CUDA Driver API.
-
tmrob2/cuda2rust_sandpit : Minimal examples to get CUDA linear algebra programs working with Rust using CC & FFI.
-
PhDP/rust-cuda-template : Simple template for Rust + CUDA.
-
neka-nat/cuimage : Rust implementation of image processing library with CUDA.
-
yanghaku/cuda-driver-sys : Rust binding to CUDA Driver APIs.
-
Canyon-ml/canyon-sys : Rust Bindings for Cuda, CuDNN.
-
cea-hpc/HARP : Small tool for profiling the performance of hardware-accelerated Rust code using OpenCL and CUDA.
-
Conqueror712/CUDA-Simulator : A self-developed version of the user-mode CUDA emulator project and a learning repository for Rust.
-
cszach/rust-cuda-template : A Rust CUDA template with detailed instructions.
-
exor2008/fluid-simulator : Rust CUDA fluid simulator.
-
chichieinstein/rustycuda : Convenience functions for generic handling of CUDA resources on the Rust side.
-
Jafagervik/cruda : CRUDA - Writing rust with cuda.
-
lennyerik/cutransform : CUDA kernels in any language supported by LLVM.
-
cjordan/hip-sys : Rust bindings for HIP.
-
rust-gpu : 🐉 Making Rust a first-class language and ecosystem for GPU shaders 🚧 shader.rs
-
wgpu : Safe and portable GPU abstraction in Rust, implementing WebGPU API. wgpu.rs
-
Vulkano : Safe and rich Rust wrapper around the Vulkan API. Vulkano is a Rust wrapper around the Vulkan graphics API. It follows the Rust philosophy, which is that as long as you don't use unsafe code you shouldn't be able to trigger any undefined behavior. In the case of Vulkan, this means that non-unsafe code should always conform to valid API usage.
-
Ash : Vulkan bindings for Rust.
-
ocl : OpenCL for Rust.
-
opencl3 : A Rust implementation of the Khronos OpenCL 3.0 API.
-
-
-
CUDA.jl : CUDA programming in Julia. juliagpu.org/
-
AMDGPU.jl : AMD GPU (ROCm) programming in Julia.
-
-
-
-
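FlagPerf : FlagPerf is an open-source software platform for benchmarking AI chips. It is an integrated AI hardware evaluation engine built jointly by the Beijing Academy of Artificial Intelligence (智源研究院) and AI hardware vendors. It aims to establish an industry-practice-oriented metric system and to evaluate the real-world capability of AI hardware under software stack combinations (model + framework + compiler).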
-
te42kyfo/gpu-benches : collection of benchmarks to measure basic GPU capabilities.
-
-
-
cuBLAS : Basic Linear Algebra on NVIDIA GPUs. NVIDIA cuBLAS is a GPU-accelerated library for accelerating AI and HPC applications. It includes several API extensions for providing drop-in industry standard BLAS APIs and GEMM APIs with support for fusions that are highly optimized for NVIDIA GPUs. The cuBLAS library also contains extensions for batched operations, execution across multiple GPUs, and mixed- and low-precision execution with additional tuning for the best performance.
-
CUTLASS : CUDA Templates for Linear Algebra Subroutines.
-
MUTLASS : MUSA Templates for Linear Algebra Subroutines.
-
MatX : MatX - GPU-Accelerated Numerical Computing in Modern C++. An efficient C++17 GPU numerical computing library with Python-like syntax. nvidia.github.io/MatX
-
GenericLinearAlgebra.jl : Generic numerical linear algebra in Julia.
-
custos-math : This crate provides CUDA, OpenCL, CPU (and Stack) based matrix operations using custos.
-
-
-
FlashAttention : Fast and memory-efficient exact attention. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". (arXiv 2022).
-
66RING/tiny-flash-attention : flash attention tutorial written in python, triton, cuda, cutlass.
-
weishengying/tiny-flash-attention : A simplified flash-attention implementation using CUTLASS, intended for teaching purposes.
-
jepeake/tiny-flash-attention : flash attention in ~20 lines.
-
-
-
cuDNN : The NVIDIA CUDA® Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.
-
PyTorch : Tensors and Dynamic neural networks in Python with strong GPU acceleration. pytorch.org
-
MooreThreads/torch_musa : torch_musa is an open source repository based on PyTorch, which can make full use of the super computing power of MooreThreads graphics cards.
-
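PaddlePaddle : PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning). www.paddlepaddle.org/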
-
flashlight/flashlight : A C++ standalone library for machine learning. fl.readthedocs.io/en/latest/
-
yhwang-hub/dl_model_infer : This is a C++ version of an AI inference library. Currently it only supports inference of TensorRT models. Follow-up plans include C++ inference for frameworks such as OpenVINO, NCNN, and MNN. Pre- and post-processing come in two versions, C++ and CUDA; the CUDA version is recommended. This repository provides accelerated deployment cases for popular deep learning CV models, and the CUDA C code supports dynamic-batch image processing, inference, decoding, and NMS.
-
NVlabs/tiny-cuda-nn : Lightning fast C++/CUDA neural network framework.
-
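zjhellofss/KuiperLLama : "Build Your Own Large Model Inference Framework". KuiperLLama is a hands-on, build-it-yourself large model inference framework supporting LLama2/3 and Qwen2.5. A good project for campus recruitment (autumn and spring hiring) and internships that walks you through implementing the framework from scratch.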
-
zjhellofss/KuiperInfer : A good project for campus recruitment (autumn and spring hiring) and internships! Implement a high-performance deep learning inference library step by step, with support for inference of models such as llama2, Unet, Yolov5, and Resnet.
-
zjhellofss/kuiperdatawhale : Build your own deep learning inference framework from scratch.
-
MarioSieg/magnetron : (WIP) A small but powerful, homemade PyTorch from scratch. Minimalistic homemade PyTorch alternative, written in C99 and Python.
-
lucasdelimanogueira/PyNorch : Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)
-
-
-
-
TensorRT : NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT. developer.nvidia.com/tensorrt
-
TensorRT-LLM : TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines. nvidia.github.io/TensorRT-LLM
-
vLLM : A high-throughput and memory-efficient inference and serving engine for LLMs. docs.vllm.ai
-
MLC LLM : Enable everyone to develop, optimize and deploy AI models natively on everyone's devices. mlc.ai/mlc-llm
-
Lamini : Lamini: The LLM engine for rapidly customizing models 🦙.
-
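datawhalechina/self-llm : 《开源大模型食用指南》 (A Guide to Using Open-Source Large Models): tutorials for quickly deploying open-source large models on Linux, tailored to users in China.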
-
ninehills/llm-inference-benchmark : LLM Inference benchmark.
-
-
-
llm.c : LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation.
-
llama2.c : Inference Llama 2 in one file of pure C. Train the Llama 2 LLM architecture in PyTorch then inference it with one simple 700-line C file (run.c).
-
-
-
gemma.cpp : gemma.cpp is a lightweight, standalone C++ inference engine for the Gemma foundation models from Google.
-
whisper.cpp : High-performance inference of OpenAI's Whisper automatic speech recognition (ASR) model.
-
ChatGLM.cpp : C++ implementation of ChatGLM-6B and ChatGLM2-6B.
-
MegEngine/InferLLM : InferLLM is a lightweight LLM model inference framework that mainly references and borrows from the llama.cpp project.
-
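DeployAI/nndeploy : nndeploy is an end-to-end model deployment framework. Built around multi-platform inference and directed-acyclic-graph-based model deployment, it aims to give users a cross-platform, easy-to-use, high-performance model deployment experience. nndeploy-zh.readthedocs.io/zh/latest/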
-
zjhellofss/KuiperInfer (build-your-own deep learning inference framework) : Implement a high-performance deep learning inference library step by step, with support for inference of models such as llama, Unet, Yolov5, and Resnet.
-
skeskinen/llama-lite : Embeddings focused small version of Llama NLP model.
-
Const-me/Whisper : High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model.
-
wangzhaode/ChatGLM-MNN : Pure C++, Easy Deploy ChatGLM-6B.
-
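ztxz16/fastllm : A pure C++ large-model library with no third-party dependencies and CUDA acceleration support. It currently supports the Chinese-developed large models ChatGLM-6B and MOSS, and can run ChatGLM-6B smoothly on Android devices.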
-
davidar/eigenGPT : Minimal C++ implementation of GPT2.
-
Tlntin/Qwen-TensorRT-LLM : Inference acceleration for Qwen-7B-Chat using TRT-LLM.
-
FeiGeChuanShu/trt2023 : NVIDIA TensorRT Hackathon 2023 second-round topic: building and optimizing the Tongyi Qianwen Qwen-7B model with TensorRT-LLM.
-
TRT2022/trtllm-llama : ☢️ TensorRT 2023 second round: inference acceleration and optimization of the Llama model based on TensorRT-LLM.
-
-
-
llama2.mojo : Inference Llama 2 in one file of pure 🔥
-
dorjeduck/llm.mojo : port of Andrej Karpathy's llm.c to Mojo.
-
-
-
Candle : Minimalist ML framework for Rust.
-
Safetensors : Simple, safe way to store and distribute tensors. huggingface.co/docs/safetensors
-
Tokenizers : 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production. huggingface.co/docs/tokenizers
-
Burn : Burn - A Flexible and Comprehensive Deep Learning Framework in Rust. burn-rs.github.io/
-
dfdx : Deep learning in Rust, with shape checked tensors and neural networks.
-
luminal : Deep learning at the speed of light. www.luminalai.com/
-
crabml : crabml is focusing on the reimplementation of GGML using the Rust programming language.
-
TensorFlow Rust : Rust language bindings for TensorFlow.
-
tch-rs : Rust bindings for the C++ api of PyTorch.
-
rustai-solutions/candle_demo_openchat_35 : candle_demo_openchat_35.
-
llama2.rs : A fast llama2 decoder in pure Rust.
-
Llama2-burn : Llama2 LLM ported to Rust burn.
-
gaxler/llama2.rs : Inference Llama 2 in one file of pure Rust 🦀
-
whisper-burn : A Rust implementation of OpenAI's Whisper model using the burn framework.
-
stable-diffusion-burn : Stable Diffusion v1.4 ported to Rust's burn framework.
-
coreylowman/llama-dfdx : LLaMa 7b with CUDA acceleration implemented in rust. Minimal GPU memory needed!
-
tazz4843/whisper-rs : Rust bindings to whisper.cpp.
-
rustformers/llm : Run inference for Large Language Models on CPU, with Rust 🦀🚀🦙.
-
Chidori : A reactive runtime for building durable AI agents. docs.thousandbirds.ai.
-
llm-chain : llm-chain is a collection of Rust crates designed to help you work with Large Language Models (LLMs) more effectively. llm-chain.xyz
-
Atome-FE/llama-node : Believe in AI democratization. llama for nodejs backed by llama-rs and llama.cpp, work locally on your laptop CPU. support llama/alpaca/gpt4all/vicuna model. www.npmjs.com/package/llama-node
-
Noeda/rllama : Rust+OpenCL+AVX2 implementation of LLaMA inference code.
-
lencx/ChatGPT : 🔮 ChatGPT Desktop Application (Mac, Windows and Linux). NoFWL.
-
Synaptrix/ChatGPT-Desktop : Fuel your productivity with ChatGPT-Desktop - Blazingly fast and supercharged!
-
Poordeveloper/chatgpt-app : A ChatGPT App for all platforms. Built with Rust + Tauri + Vue + Axum.
-
mxismean/chatgpt-app : A Tauri project: ChatGPT App.
-
sonnylazuardi/chat-ai-desktop : Chat AI Desktop App. Unofficial ChatGPT desktop app for Mac & Windows menubar using Tauri & Rust.
-
yetone/openai-translator : The translator that does more than just translation - powered by OpenAI.
-
m1guelpf/browser-agent : A browser AI agent, using GPT-4. docs.rs/browser-agent
-
sigoden/aichat : Using ChatGPT/GPT-3.5/GPT-4 in the terminal.
-
uiuifree/rust-openai-chatgpt-api : "rust-openai-chatgpt-api" is a Rust library for accessing the ChatGPT API, a powerful NLP platform by OpenAI. The library provides a simple and efficient interface for sending requests and receiving responses, including chat. It uses reqwest and serde for HTTP requests and JSON serialization.
-
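1595901624/gpt-aggregated-edition : Aggregates multiple platforms, including the official ChatGPT, the free ChatGPT, ERNIE Bot (文心一言), Poe, and chatchat, with support for importing custom platforms.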
-
Cormanz/smartgpt : A program that provides LLMs with the ability to complete complex tasks using plugins.
-
femtoGPT : femtoGPT is a pure Rust implementation of a minimal Generative Pretrained Transformer. discord.gg/wTJFaDVn45
-
shafishlabs/llmchain-rs : 🦀Rust + Large Language Models - Make AI Services Freely and Easily. Inspired by LangChain.
-
flaneur2020/llama2.rs : A Rust reimplementation of https://github.com/karpathy/llama2.c.
-
Heng30/chatbox : A Chatbot for OpenAI ChatGPT. Based on Slint-ui and Rust.
-
fairjm/dioxus-openai-qa-gui : a simple openai qa desktop app built with dioxus.
-
purton-tech/bionicgpt : Accelerate LLM adoption in your organisation. Chat with your confidential data safely and securely. bionic-gpt.com
-
-
-
llama2.zig : Inference Llama 2 in one file of pure Zig.
-
renerocksai/gpt4all.zig : ZIG build for a terminal-based chat client for an assistant-style large language model with ~800k GPT-3.5-Turbo Generations based on LLaMa.
-
EugenHotaj/zig_inference : Neural Network Inference Engine in Zig.
-
-
-
Ollama : Get up and running with Llama 2, Mistral, Gemma, and other large language models. ollama.com
-
go-skynet/LocalAI : 🤖 Self-hosted, community-driven, local OpenAI-compatible API. Drop-in replacement for OpenAI running LLMs on consumer-grade hardware. Free Open Source OpenAI alternative. No GPU required. LocalAI is an API to run ggml compatible models: llama, gpt4all, rwkv, whisper, vicuna, koala, gpt4all-j, cerebras, falcon, dolly, starcoder, and many others. localai.io
-
-
-
-
NVIDIA/nccl : Optimized primitives for collective multi-GPU communication.
-
NVIDIA/multi-gpu-programming-models : Examples demonstrating available options to program multiple GPUs in a single node or a cluster.
-
wilicc/gpu-burn : Multi-GPU CUDA stress test.
-
SCUDA : SCUDA: GPU-over-IP. SCUDA is a GPU over IP bridge allowing GPUs on remote machines to be attached to CPU-only machines.
-
-
- Cupoch : Robotics with GPU computing.
-
-
Tachyon : Modular ZK(Zero Knowledge) backend accelerated by GPU.
-
Blitzar : Zero-knowledge proof acceleration with GPUs for C++ and Rust. www.spaceandtime.io/
-
blitzar-rs : High-Level Rust wrapper for the blitzar-sys crate. www.spaceandtime.io/
-
ICICLE : ICICLE is a library for ZK acceleration using CUDA-enabled GPUs.
-
-
-
-
- BobMcDear/attorch : A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
-
-
Liger-Kernel : Efficient Triton Kernels for LLM Training. arxiv.org/pdf/2410.10989
-
FlagGems : FlagGems is a high-performance general operator library implemented in OpenAI Triton. It aims to provide a suite of kernel functions to accelerate LLM training and inference.
-
linxihui/dkernel : This repo contains customized CUDA kernels written in OpenAI Triton. As of now, it contains the sparse attention kernel used in phi-3-small models. The sparse attention is also supported in vLLM for efficient inference.
-
-
- harleyszhang/lite_llama : The llama model inference lite framework by triton.
-
-
-
-
'gpu' Dialect : This dialect provides middle-level abstractions for launching GPU kernels following a programming model similar to that of CUDA or OpenCL.
-
'amdgpu' Dialect : The AMDGPU dialect provides wrappers around AMD-specific functionality and LLVM intrinsics.
-
-
- pyMLIR : Python interface for MLIR - the Multi-Level Intermediate Representation. pyMLIR is a full Python interface to parse, process, and output MLIR files according to the syntax described in the MLIR documentation. pyMLIR supports the basic dialects and can be extended with other dialects.
-
-
Torch-MLIR : The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.
-
ONNX-MLIR : Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure.
-
TPU-MLIR : Machine learning compiler based on MLIR for Sophgo TPU. TPU-MLIR is an open-source machine-learning compiler based on MLIR for TPU. This project provides a complete toolchain, which can convert pre-trained neural networks from different frameworks into binary files bmodel that can be efficiently operated on TPUs.
-
IREE : IREE: Intermediate Representation Execution Environment. A retargetable MLIR-based machine learning compiler and runtime toolkit. iree.dev/
-
ByteIR : The ByteIR Project is a ByteDance model compilation solution. ByteIR includes compiler, runtime, and frontends, and provides an end-to-end model compilation solution. byteir.ai
-
Xilinx/mlir-aie : An MLIR-based toolchain for AMD AI Engine-enabled devices. This repository contains an MLIR-based toolchain for AI Engine-enabled devices, such as AMD Ryzen™ AI and Versal™.
-
-
-
-
BLAS : BLAS (Basic Linear Algebra Subprograms). The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations.
-
LAPACK : LAPACK development repository. LAPACK — Linear Algebra PACKage. LAPACK is written in Fortran 90 and provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.
-
OpenBLAS : OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version. www.openblas.net
-
BLIS : BLAS-like Library Instantiation Software Framework.
-
NumPy : The fundamental package for scientific computing with Python. numpy.org
-
SciPy : SciPy library main repository. SciPy (pronounced "Sigh Pie") is an open-source software for mathematics, science, and engineering. It includes modules for statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more. scipy.org
-
Gonum : Gonum is a set of numeric libraries for the Go programming language. It contains libraries for matrices, statistics, optimization, and more. www.gonum.org/
-
YichengDWu/matmul.mojo : High Performance Matrix Multiplication in Pure Mojo 🔥. Matmul.🔥 is a high performance multi-threaded implementation of the BLIS algorithm in pure Mojo 🔥.
-
-
-
- emptysoal/cuda-image-preprocess : Speeds up image preprocessing with CUDA for image handling or TensorRT inference.
-
-
laugh12321/TensorRT-YOLO : 🚀 TensorRT-YOLO: Support YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, PP-YOLOE using TensorRT acceleration with EfficientNMS! TensorRT-YOLO is an inference acceleration project supporting YOLOv3, YOLOv5, YOLOv6, YOLOv7, YOLOv8, YOLOv9, YOLOv10, PP-YOLOE and PP-YOLOE+, optimized with NVIDIA TensorRT. The project integrates the EfficientNMS TensorRT plugin to enhance post-processing and uses CUDA kernels to accelerate preprocessing. It supports both C++ and Python inference and aims to provide a fast, optimized object detection solution.
-
l-sf/Linfer : A high-performance C++ inference library based on TensorRT, supporting YOLOv10, YOLOPv2, YOLOv5/7/X/8, RT-DETR, and the single-object trackers OSTrack and LightTrack.
-
Melody-Zhou/tensorRT_Pro-YOLOv8 : This repository is based on shouxieai/tensorRT_Pro, with adjustments to support YOLOv8. It currently supports high-performance inference for YOLOv8, YOLOv8-Cls, YOLOv8-Seg, YOLOv8-OBB, YOLOv8-Pose, RT-DETR, ByteTrack, YOLOv9, YOLOv10, and RTMO. 🚀🚀🚀
-
shouxieai/tensorRT_Pro : C++ library based on tensorrt integration.
-
shouxieai/infer : A new TensorRT integration that makes it easy to integrate many tasks.
-
kalfazed/tensorrt_starter : This repository gives a guideline for learning CUDA and TensorRT from the beginning.
-
hamdiboukamcha/yolov10-tensorrt : YOLOv10 C++ TensorRT : Real-Time End-to-End Object Detection.
-
triple-Mu/YOLOv8-TensorRT : YOLOv8 accelerated with TensorRT.
-
FeiYull/TensorRT-Alpha : 🔥🔥🔥TensorRT for YOLOv8、YOLOv8-Pose、YOLOv8-Seg、YOLOv8-Cls、YOLOv7、YOLOv6、YOLOv5、YOLONAS......🚀🚀🚀CUDA IS ALL YOU NEED.🍎🍎🍎
-
cyrusbehr/YOLOv8-TensorRT-CPP : YOLOv8 TensorRT C++ Implementation. A C++ implementation of YOLOv8 using TensorRT. Supports object detection, semantic segmentation, and body pose estimation.
-
NVIDIA-AI-IOT/torch2trt : An easy to use PyTorch to TensorRT converter.
-
zhiqwang/yolort : yolort is a runtime stack for yolov5 on specialized accelerators such as tensorrt, libtorch, onnxruntime, tvm and ncnn. zhiqwang.com/yolort
-
Linaom1214/TensorRT-For-YOLO-Series : YOLO Series TensorRT Python/C++. tensorrt for yolo series (YOLOv8, YOLOv7, YOLOv6....), nms plugin support.
-
wang-xinyu/tensorrtx : TensorRTx aims to implement popular deep learning networks with tensorrt network definition APIs.
-
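DefTruth/lite.ai.toolkit : 🛠 A lite C++ toolkit of awesome AI models with ONNXRuntime, NCNN, MNN and TNN. YOLOX, YOLOP, YOLOv6, YOLOR, MODNet, YOLOv7, YOLOv5. "🛠Lite.Ai.ToolKit: a lightweight C++ AI model toolbox, user-friendly (more or less) and ready to use out of the box. It already includes 100+ popular open-source models. This is a C++ toolbox curated out of personal interest, covering object detection, face detection, face recognition, semantic segmentation, matting, and more."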
-
PaddlePaddle/FastDeploy : ⚡️An Easy-to-use and Fast Deep Learning Model Deployment Toolkit for ☁️Cloud 📱Mobile and 📹Edge. Including Image, Video, Text and Audio 20+ main stream scenarios and 150+ SOTA models with end-to-end optimization, multi-platform and multi-framework support.
-
enazoe/yolo-tensorrt : TensorRT 8. Supports YOLOv5n/s/m/l/x. darknet -> tensorrt. YOLOv4 and YOLOv3 use raw darknet *.weights and *.cfg files. If the wrapper is useful to you, please star it.
-
guojianyang/cv-detect-robot : 🔥🔥🔥🔥🔥🔥Docker NVIDIA Docker2 YOLOV5 YOLOX YOLO Deepsort TensorRT ROS Deepstream Jetson Nano TX2 NX for high-performance deployment.
-
BlueMirrors/Yolov5-TensorRT : Yolov5 TensorRT Implementations.
-
lewes6369/TensorRT-Yolov3 : TensorRT for Yolov3.
-
CaoWGG/TensorRT-YOLOv4 : tensorrt5, yolov4, yolov3, yolov3-tiny, yolov3-tiny-prn.
-
isarsoft/yolov4-triton-tensorrt : YOLOv4 on Triton Inference Server with TensorRT.
-
TrojanXu/yolov5-tensorrt : A tensorrt implementation of yolov5.
-
tjuskyzhang/Scaled-YOLOv4-TensorRT : Implement yolov4-tiny-tensorrt, yolov4-csp-tensorrt, yolov4-large-tensorrt(p5, p6, p7) layer by layer using TensorRT API.
-
Syencil/tensorRT : TensorRT-7 network library covering common object detection, keypoint detection, face detection, OCR, and more; supports training on your own data.
-
SeanAvery/yolov5-tensorrt : YOLOv5 in TensorRT.
-
Monday-Leo/YOLOv7_Tensorrt : A simple implementation of Tensorrt YOLOv7.
-
ibaiGorordo/ONNX-YOLOv6-Object-Detection : Python scripts performing object detection using the YOLOv6 model in ONNX.
-
ibaiGorordo/ONNX-YOLOv7-Object-Detection : Python scripts performing object detection using the YOLOv7 model in ONNX.
-
triple-Mu/yolov7 : End2end TensorRT YOLOv7.
-
hewen0901/yolov7_trt : C++ TensorRT deployment code for the YOLOv7 object detection algorithm.
-
tsutof/tiny_yolov2_onnx_cam : Tiny YOLO v2 Inference Application with NVIDIA TensorRT.
-
Monday-Leo/Yolov5_Tensorrt_Win10 : A simple implementation of tensorrt yolov5 python/c++🔥
-
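Wulingtian/yolov5_tensorrt_int8 : TensorRT INT8 quantized deployment of the YOLOv5s model, measured at 3.3 ms per frame!
-
Wulingtian/yolov5_tensorrt_int8_tools : TensorRT INT8 quantization of YOLOv5 ONNX models.
-
MadaoFY/yolov5_TensorRT_inference : Records TensorRT quantization and inference code for YOLOv5; tested to run on the Jetson platform.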
-
ibaiGorordo/ONNX-YOLOv8-Object-Detection : Python scripts performing object detection using the YOLOv8 model in ONNX.
-
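we0091234/yolov8-tensorrt : YOLOv8 TensorRT acceleration.
-
FeiYull/yolov8-tensorrt : TensorRT + CUDA accelerated deployment of YOLOv8; the code runs on Windows and Linux.
-
cvdong/YOLO_TRT_SIM : 🐇 One codebase supporting TensorRT inference for YOLOX, V5, V6, V7, and V8 ™️ 🔝, with pre- and post-processing both implemented as CUDA kernels. CPP/CUDA🚀
-
cvdong/YOLO_TRT_PY : 🐰 One codebase supporting TensorRT inference for YOLOv5, V6, V7, and V8 ™️ PYTHON ✈️
-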
Psynosaur/Jetson-SecVision : Person detection for Hikvision DVR with AlarmIO ports, uses TensorRT and yolov4.
-
tatsuya-fukuoka/yolov7-onnx-infer : Inference with yolov7's onnx model.
-
ervgan/yolov5_tensorrt_inference : TensorRT cpp inference for Yolov5 model. Supports yolov5 v1.0, v2.0, v3.0, v3.1, v4.0, v5.0, v6.0, v6.2, v7.0.
-
AlbinZhu/easy-trt : TensorRT for YOLOv10 with CUDA.
-
-
-
-
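WeChat Official Account「NVIDIA英伟达」
-
WeChat Official Account「NVIDIA英伟达企业解决方案」
- 2024-04-24, FP8 Training and Inference on NVIDIA GPU Architectures
- 2024-06-14, Startup Accelerator Program | 国讯芯微 Achieves End-to-End Cerebrum-Cerebellum Coordinated Control on the NVIDIA Jetson Platform
- 2024-06-20, NVIDIA Isaac Sim 4.0 and NVIDIA Isaac Lab Give Robotics Workflows and Simulation a Powerful Boost
- 2024-06-21, Closing the Sim-to-Real Gap: Training Spot Quadruped Locomotion with NVIDIA Isaac Lab
- 2024-07-01, NVIDIA End-to-End Solutions Help Li Auto Build Intelligent Driving Experiences and Personalized In-Cabin Spaces
- 2024-11-27, The NVIDIA TensorRT-LLM Roadmap Is Now Public on GitHub!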
-
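WeChat Official Account「AI不止算法」
-
WeChat Official Account「澎峰科技PerfXLab」
- 2022-10-18, GPU Optimization Explained Simply Series: reduce Optimization
- 2022-10-31, GPU Optimization Explained Simply Series: spmv Optimization
- 2023-05-24, GPU Optimization Explained Simply Series: gemv Optimization
- 2023-05-24, GPU Optimization Explained Simply Series: GEMM Optimization (Part 1)
- 2023-06-02, GPU Optimization Explained Simply Series: GEMM Optimization (Part 2)
- 2023-06-16, GPU Optimization Explained Simply Series: GEMM Optimization (Part 3)
- 2023-06-26, GPU Optimization Explained Simply Series: elementwise Optimization and an Introduction to the CUDA Toolchain
- 2023-06-27, Notes on High-Performance Computing and Performance Optimization: Memory Access
- 2024-07-04, Technical White Paper Released for PerfIPP, the High-Performance Computing Primitives Library Developed by PerfXLab (Download Included)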
-
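WeChat Official Account「大猿搬砖简记」
-
WeChat Official Account「oldpan博客」
- 2024-03-19, A Full-Pipeline Walkthrough of Deploying NVIDIA Large Language Models
- 2024-03-20, A First Look at TensorRT-LLM (Part 2): A Brief Analysis of Its Structure for Clearer Usage
- 2024-03-21, Design and Implementation of a High-Performance LLM Inference Framework
- 2024-04-15, [In-Depth CUTLASS Series] 0x01 CUTLASS Source Code Analysis (Part 0): Software Architecture (with ncu Profiling Methods)
- 2024-04-21, Understanding NVIDIA GPU Performance Metrics: An Easily Confused Concept, Utilization vs Saturation
- 2024-04-22, Quickly Improving Performance: How to Make Better Use of GPUs (Part 1)
- 2024-05-14, Quickly Improving Performance: How to Make Better Use of GPUs (Part 2)
- 2024-05-22, Large Model Precision (FP16, FP32, BF16) Explained, with Practice
- 2024-07-24, Simple CUDA Performance Optimization (Part 1): Background Knowledge
- 2024-08-06, How to Push PyTorch GPU Utilization to 100%?
- 2024-08-13, A First Look at TensorRT-LLM (Part 3): Deployment Best Practices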
-
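WeChat Official Account「DeepPrompting」
-
WeChat Official Account「GiantPandaCV」
- 2024-04-20, An Introduction to Using Tensor Cores
- 2024-05-27, [Parallel Training] Principles and Code Walkthrough of Context Parallelism
- 2024-06-20, FP8 Quantization Explained: The Best Option at 8 Bits? (Part 1)
- 2024-07-01, CUDA-MODE Course Notes, Lecture 1: How to Profile CUDA Kernels in PyTorch
- 2024-07-04, CUDA-MODE Lecture 1 Hands-On Practice (Part 1)
- 2024-07-06, CUDA-MODE Course Notes, Lecture 2: A Quick Pass Through Chapters 1-3 of the PMPP Book
- 2024-07-13, CUDA-MODE Course Notes, Lecture 4: Notes on Chapters 4-5 of the PMPP Book
- 2024-07-18, CUDA-MODE Course Notes, Lecture 6: How to Optimize Optimizers in PyTorch
- 2024-07-19, CUDA-MODE Lecture 1 Hands-On Practice (Part 2)
- 2024-07-23, CUTLASS 2.x & CUTLASS 3.x Intro Study Notes
- 2024-07-28, CUDA-MODE Course Notes, Lecture 7: Quantization Cuda vs Triton
- 2024-08-01, Quantization GEMM in TRT-LLM (Ampere Mixed GEMM): CUTLASS 2.x Course Study Notes
- 2024-08-05, CUDA-MODE Course Notes, Lecture 8: CUDA Performance Checklist
- 2024-09-12, CUDA-MODE Course Notes, Lecture 12: Flash Attention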
-
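WeChat Official Account「GPUS开发者」
-
WeChat Official Account「机器学习研究组订阅」
-
WeChat Official Account「自动驾驶之心」
-
WeChat Official Account「Meet DSA」
-
WeChat Official Account「AI寒武纪」
-
WeChat Official Account「关于NLP那些你不知道的事」
-
WeChat Official Account「InfoQ」
-
WeChat Official Account「机器之心」
-
WeChat Official Account「新智元」
-
WeChat Official Account「GitHubStore」
-
WeChat Official Account「云云众生s」
-
WeChat Official Account「手写AI」
-
WeChat Official Account「美团技术团队」
-
WeChat Official Account「GitHubFun网站」
-
WeChat Official Account「大模型生态圈」
-
WeChat Official Account「苏哲管理咨询」
-
WeChat Official Account「后来遇见AI」
- 2022-08-08, [Machine Learning] Principles of the K-Means Clustering Algorithm
- 2022-08-11, [CUDA Programming] A Simple CUDA-Based Implementation of K-Means
- 2024-01-23, [CUDA Programming] An Advanced CUDA-Based Implementation of K-Means (Part 1)
- 2024-01-24, [CUDA Programming] An Advanced CUDA-Based Implementation of K-Means (Part 2)
- 2024-04-08, [CUDA Programming] CUDA Unified Memory
- 2024-08-06, [CUDA Programming] Parameter Settings for Matrix Multiplication in the cuBLAS Library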
-
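WeChat Official Account「江大白」
-
WeChat Official Account「Tim在路上」
-
WeChat Official Account「潮观世界」
-
WeChat Official Account「DeepDriving」
-
WeChat Official Account「人工智能大讲堂」
-
WeChat Official Account「未来科技潮」
-
WeChat Official Account「AI道上」
-
WeChat Official Account「科技译览」
-
WeChat Official Account「小白学视觉」
-
WeChat Official Account「卡巴斯」
-
WeChat Official Account「码砖杂役」
-
WeChat Official Account「星想法」
-
WeChat Official Account「太极图形」
-
WeChat Official Account「硅星人Pro」
-
WeChat Official Account「3D视觉之心」
-
WeChat Official Account「中国企业家杂志」
-
WeChat Official Account「CSharp与边缘模型部署」
-
WeChat Official Account「NeuralTalk」
-
WeChat Official Account「小吴持续学习AI」
-
WeChat Official Account「大模型新视界」
-
WeChat Official Account「量子位」
-
WeChat Official Account「HPC智能流体大本营」
-
WeChat Official Account「人工智能前沿讲习」
-
WeChat Official Account「AI让生活更美好」
-
WeChat Official Account「NE时代智能车」
-
WeChat Official Account「OpenCV与AI深度学习」
-
WeChat Official Account「InfiniTensor」
-
WeChat Official Account「GeekSavvy」
-
WeChat Official Account「阿木实验室」
-
WeChat Official Account「吃果冻不吐果冻皮」
-
WeChat Official Account「AI大模型实验室」
-
WeChat Official Account「科技最前线」
-
WeChat Official Account「AI范儿」
-
WeChat Official Account「DataFunTalk」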
-
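- 2023-09-02, CUDA (1): CUDA Programming Basics
- 2023-09-09, CUDA (2): The GPU Memory Hierarchy and Its Optimization Guide
- 2023-09-29, CUDA (3): General Matrix Multiplication, from Beginner to Proficient
- 2024-04-29, ops(1): CUDA Implementation and Optimization of the LayerNorm Operator
- 2024-04-30, ops(2): CUDA Implementation of the SoftMax Operator
- 2024-05-01, ops(3): CUDA Implementation of Cross Entropy
- 2024-05-01, ops(4): CUDA Implementation of the AdamW Optimizer
- 2024-05-02, ops(5): CUDA Implementation of Activation Functions and Residual Connections
- 2024-05-03, ops(6): CUDA Implementation of the Embedding Layer and LM Head Layer
- 2024-05-06, ops(7): CUDA Implementation and Optimization of Self-Attention (Part 1)
- 2024-05-08, ops(8): CUDA Implementation and Optimization of Self-Attention (Part 2)
- 2024-05-14, CUDA (4): Implementing the Transformer Architecture with CUDA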
-
-
-
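WeChat Official Account「智源研究院」
-
WeChat Official Account「智源FlagOpen」
-
WeChat Official Account「摩尔线程」
-
WeChat Official Account「HyperAI超神经」
-
WeChat Official Account「InfiniTensor」
-
WeChat Official Account「吃果冻不吐果冻皮」
-
WeChat Official Account「GiantPandaCV」
-
WeChat Official Account「新智元」
-
WeChat Official Account「CV技术指南」
-
WeChat Official Account「AI时代窗口」
-
WeChat Official Account「先进编译实验室」
-
-
- WeChat Official Account「小喵学AI」
-
-
WeChat Official Account「RVBoards」
-
WeChat Official Account「猿禹宙」
-
WeChat Official Account「NeuralTalk」
-
WeChat Official Account「有限元语言与编程」
-
WeChat Official Account「鸟窝聊技术」
-
WeChat Official Account「OpenCV与AI深度学习」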
-
- 2023-03-23,AI’s compute fragmentation: what matrix multiplication teaches us
- 2023-04-20,The world's fastest unified matrix multiplication
- 2023-05-02,A unified, extensible platform to superpower your AI
- 2023-08-18,How Mojo🔥 gets a 35,000x speedup over Python – Part 1
- 2023-08-28,How Mojo🔥 gets a 35,000x speedup over Python – Part 2
- 2023-09-06,Mojo🔥 - A journey to 68,000x speedup over Python - Part 3
- 2024-02-12,Mojo vs. Rust: is Mojo 🔥 faster than Rust 🦀 ?
- 2024-04-10,Row-major vs. column-major matrices: a performance analysis in Mojo and NumPy
-
- bilibili「深圳王哥的科技频道」
- bilibili「HITsz-OSA」
- bilibili「权双」
- WeChat Official Account「大模型生态圈」
- WeChat Official Account「Cver」
- WeChat Official Account「高通内推王」
- Zhihu「Tim在路上」