
PaddlePaddle 2.5.0 Release Note

@XiaoguangHu01 XiaoguangHu01 released this 25 Jul 11:09
· 40 commits to release/2.5 since this release
feff99f

PaddlePaddle 2.5.0 Release Note

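1. Highlights

  • New unified dynamic-static architecture: Implements a new execution mode that combines dynamic-to-static conversion, composition from basic operators, and compiler execution, and completes the full pipeline of dynamic-to-static conversion, composite operators, and neural network compiler optimization and acceleration on the ResNet50 and BERT models. For dynamic-to-static, the core whole-graph fallback feature is complete, so execution falls back to dynamic graph training when conversion fails. For composite operators, a basic operator system of more than 150 basic operators is designed, with a Python-level forward operator decomposition mechanism and a backward operator decomposition mechanism that supports both dynamic and static graphs; more than 70 commonly used forward and backward operators are decomposed. For the CINN compiler, correctness issues are fixed, key passes are developed, manual schedule rules are added, and kernel code is generated automatically, improving ResNet50 performance by 12% and BERT performance by 10%.
  • Unified operator architecture in the PHI operator library: All of the remaining 350+ operator kernels under the original operator system are unified into the PHI operator library, and operator definitions in the original system are unified into the PHI definition form (operators defined through YAML configuration), which improves architectural consistency and lowers the cost of understanding framework development. All Fluid header files that the PHI operator library depended on are decoupled, and the library is compiled independently as a dynamic link library, giving secondary development of the framework a lighter way to reuse the operator library. Non-standard operators and operator kernels in the PaddlePaddle framework continue to be standardized, making them easier for developers to understand and lowering the cost of hardware integration.
  • New static graph executor fully launched: The new static graph executor implements a number of functional and performance optimizations and unifies and replaces the multiple old executors. It is now the default execution engine for the Python-side entry points of static graph single-card and distributed training, and for backends such as dynamic-to-static, control flow, and CINN. Framework scheduling performance improves substantially, the functional architecture is clearer, and secondary development capability is significantly enhanced.
  • Python API support for 0-dimensional tensors: Clear semantics are defined for tensors of shape [1,] and tensors of shape [].
  • New environment adaptation: CUDA 12 is supported, and compilation with gcc12 is supported.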
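
The composite operator idea in the first highlight can be sketched in plain Python. This is only an illustration under stated assumptions: the `prim_*` helpers, the `COMPOSITE_RULES` registry, and `decompose` are hypothetical names, and the tiny list-based "basic operators" stand in for the framework's real basic operator set. The point is that a complex operator such as softmax or silu can be expressed as a registered rule built only from basic operators.

```python
import math

# Hypothetical "basic operators" on flat lists of floats (illustration only).
def prim_exp(xs):
    return [math.exp(x) for x in xs]

def prim_neg(xs):
    return [-x for x in xs]

def prim_add_scalar(xs, s):
    return [x + s for x in xs]

def prim_div(xs, ys):
    return [x / y for x, y in zip(xs, ys)]

def prim_reduce_sum(xs):
    return sum(xs)

def prim_reduce_max(xs):
    return max(xs)

# Registry mapping a composite operator name to its decomposition rule.
COMPOSITE_RULES = {}

def register_composite(name):
    def decorator(rule):
        COMPOSITE_RULES[name] = rule
        return rule
    return decorator

@register_composite("softmax")
def softmax_rule(xs):
    # softmax(x) = exp(x - max(x)) / sum(exp(x - max(x)))
    shifted = prim_add_scalar(xs, -prim_reduce_max(xs))
    exps = prim_exp(shifted)
    total = prim_reduce_sum(exps)
    return prim_div(exps, [total] * len(exps))

@register_composite("silu")
def silu_rule(xs):
    # silu(x) = x / (1 + exp(-x))
    denom = prim_add_scalar(prim_exp(prim_neg(xs)), 1.0)
    return prim_div(xs, denom)

def decompose(op_name, xs):
    """Run an operator through its registered basic-operator rule."""
    return COMPOSITE_RULES[op_name](xs)

print(decompose("softmax", [1.0, 2.0, 3.0]))
print(decompose("silu", [0.0, 1.0]))
```

Because every rule bottoms out in the same small operator set, a compiler backend only has to understand the basic operators to handle all registered composite operators.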

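2. Incompatible Upgrades

  • PaddlePaddle APIs support 0-dimensional tensors. PaddlePaddle previously used a 1-dimensional tensor of shape [1] in place of a 0-dimensional tensor. This differs from current mainstream practice, increases the cost of developing and debugging models, and sometimes causes unexpected errors. This release corrects 376 APIs that need to support 0-dimensional tensors and aligns with tools widely used by the community, such as EinOps. For example, the loss output during model training used to be a 1-dimensional tensor, so taking out or printing the loss often required code such as loss.numpy()[0]. After this change, the loss output during training is a 0-dimensional tensor, and loss.numpy() is enough to take out or print the loss; the code is short, easy to understand, and consistent with industry practice.
  • The paddle.fluid API is fully retired. Following the plan announced in the previous release, 1116 paddle.fluid APIs and related internal interfaces are retired in this release, and the small number of remaining related internal interfaces will be cleaned up in the next release. The fluid API is a set of historical APIs that PaddlePaddle 2.0 had planned to remove but whose cleanup was delayed for compatibility and other reasons. This cleanup does not affect programs developed on PaddlePaddle 2.0, and the PaddlePaddle API system becomes more concise and easier to understand.
  • Cleanup of the old dynamic graph Python-side code is complete. From now on, the Python side uses only the new dynamic graph to call the C++ core logic.
  • To unify the data parallel training method for static graph models, the original single-process multi-card training method is deprecated, including the two APIs paddle.static.ParallelExecutor and paddle.static.CompiledProgram().with_data_parallel(). These APIs support only single-machine multi-card training, not multi-machine multi-card training, and their underlying execution performance is poor. The multi-process multi-card method is recommended throughout, that is, the paddle.distributed.launch API for data parallel distributed training. This upgrade affects only static graphs, not dynamic graphs or dynamic-to-static training. If you use the deprecated APIs, refer to the data parallel documentation to modify your model code. #50351, #50501, #51240, #51701, #51616, #51369, #52671
  • The original adaptation code for Ascend NPU and Cambricon MLU is removed from the framework; both are upgraded to the CustomDevice plug-in adaptation method, and their adaptation code is migrated to the PaddleCustomDevice repository.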
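
For code that has to run both before and after the 0-dimensional tensor change, a small helper can hide the difference. The sketch below is an assumption-laden illustration, not part of PaddlePaddle: `loss_to_float` is a hypothetical name, and it operates on the plain Python value that `loss.numpy().tolist()` yields, which is a float for a 0-dimensional array and a one-element list for an array of shape [1].

```python
def loss_to_float(value):
    """Convert the result of `loss.numpy().tolist()` into a Python float.

    A 0-dimensional tensor gives a bare float; a tensor of shape [1] gives a
    one-element list. Both are accepted so the same logging code works with
    either behavior.
    """
    while isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError("expected a scalar or a single-element sequence")
        value = value[0]
    return float(value)

# New behavior: the loss is 0-dimensional, so tolist() is already a float.
print(loss_to_float(0.25))
# Old behavior: the loss has shape [1], so tolist() is a one-element list.
print(loss_to_float([0.25]))
```

In a training loop this would be called as `loss_to_float(loss.numpy().tolist())`; code that targets only 2.5 and later can use `loss.numpy()` directly, as the note above describes.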

3. Training Framework (Including Distributed)

Python API

API support for 0-dimensional tensors

New APIs

  • Added the jacobian and hessian APIs for scientific computing. #53331
  • Added sparse computing APIs, such as paddle.sparse.reshape, paddle.sparse.sum, and paddle.sparse.slice. #46694, #51513, #53794, #51406
  • Added other APIs, such as paddle.optimizer.LBFGS, paddle.index_put, and paddle.logaddexp. #53314, #51912, #52886, #50843, #47282, #52284
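
As a reference for what a logaddexp operation computes, here is a minimal scalar sketch in pure Python. It is not PaddlePaddle's implementation; it only shows the usual numerically stable form of log(exp(x) + exp(y)), and the function name mirrors the API purely for readability.

```python
import math

def logaddexp(x, y):
    """Numerically stable log(exp(x) + exp(y)) for two Python floats."""
    if x == y:
        # Covers equal finite values and equal infinities without producing NaN.
        return x + math.log(2.0)
    hi, lo = max(x, y), min(x, y)
    # exp(lo - hi) is at most 1, so it cannot overflow.
    return hi + math.log1p(math.exp(lo - hi))

print(logaddexp(1.0, 2.0))
print(logaddexp(1000.0, 1000.0))  # the naive formula would overflow here
```

Factoring out the larger argument keeps the exponent non-positive, which is why the stable form works for inputs where `math.exp(x)` alone would overflow.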

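Dynamic Graph

New features

Function optimization

  • Optimized dynamic graph log printing, including log content, VLog levels, and error messages. PR45783, PR46349, PR46934, PR47724
  • Added FLAGS_auto_growth_chunk_size_in_mb for setting the minimum chunk size of auto_growth_allocator. PR52204

bug fix

Performance optimization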

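Static Graph

New static graph executor fully launched

The new static graph executor implements a number of functional and performance optimizations and unifies and replaces the multiple old executors. It is now the default execution engine for the Python-side entry points of static graph single-card and distributed training, and for backends such as dynamic-to-static, control flow, and CINN. Framework scheduling performance improves substantially, the functional architecture is clearer, and secondary development capability is significantly enhanced. #45913, #46025, #48911, #50239, #45696, #46092, #48158, #51389, #49708, #49275, #48789, #49939, #51149, #52652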

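Operator Library

Enhancements to custom operators and related features

Includes: added a new custom extension mechanism that binds C++ extension functions to the Python side, further improving the secondary development capability of the framework; extended the custom operator mechanism to custom hardware, so hardware vendors can implement operators that Paddle does not already provide; added support in custom operators for advanced mechanisms such as inplace, vector<Tensor> output, and optional<Tensor> input; optimized the scheduling performance of custom operators in dynamic graph mode, improving the performance of operators with multiple input parameters by 25.4%; added common operators and APIs to the Tensor extension for custom operators, supporting chained calls and simplifying code. Optimized the operator kernel selection mechanism; improved the logic, broadened data type support, and optimized performance for some operator kernels; added and improved 100+ XPU kernels; fixed a cumulative 170+ bugs.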
#49222, #51773, #51923, #53080, #50731, #50563, #50840, #50983, #51713, #48733, #50558, #50764, #51973, #52216, #51027, #50745, #50756, #50886, #50813, #50869, #51085, #51646, #51620, #51844, #52421, #52872, #52597, #50582, #52114, #52915, #50928, #48272, #48702, #52191, #52191, #47374, #47375, #47378, #54126, #47638, #47661, #50606, #53528, #50599, #51727, #50825, #50773, #50979, #53336, #53555, #53716, #53753, #53981, #53977, #53980, #54043, #54066, #52866, #53043, #53325, #54323, #54367, #51353, #53749, #50013, #47570, #50997, #51241, #49537

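Unified operator system architecture

Specifically: all of the remaining 350+ operator kernels under the original operator system are unified into the PHI operator library, and operator definitions in the original system are unified into the PHI definition form (operators defined through YAML configuration), which improves architectural consistency and lowers the cost of understanding framework development. All Fluid header files that the PHI operator library depended on are decoupled, and the library is compiled independently as a dynamic link library, giving secondary development of the framework a lighter way to reuse the operator library. Non-standard operators and operator kernels in the PaddlePaddle framework continue to be standardized, making them easier for developers to understand and lowering the cost of hardware integration.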
#47856, #49328, #49138, #52014, #52044, #52116, #52486, #52101, #52882, #53003, #53034, #51914, #49116, #52626, #52878, #52879, #52880, #52875, #51600, #51601, #51590, #51887, #51891, #52036, #52130, #52134, #51951, #51886, #52274, #52263, #51913, #52145, #52347, #52370, #52437, #52424, #52231, #52522, #52529, #52802, #52799, #52855, #52711, #52940, #53309, #47817, #48001, #48063, #48049, #48168, #48415, #48696, #48970, #50183, #50407, #50498, #50419, #50282, #50870, #50911, #50865, #51288, #53735, #47248, #47787, #52202,
#47579, #49444, #45772, #51264, #51634, #51631, #47385, #46342, #47510, #47532, #47702, #47860, #49470, #50358, #49121, #50190, #52374, #52372, #52375, #52371

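Dynamic-to-Static and Composite Operators

New features

  • Added composition rules to composite operators for the dropout, silu, stack, relu, expand, unsqueeze, pow, squeeze, meshgrid, batch_norm, layer_norm, group_norm, instance_norm, full_like, split, split_with_num, gelu, mean, flatten, rsqrt, and hardswish operators #50497, #50838, #50861, #50819, #50810, #51527, #51070, #51539, #51061, #49894, #50422, #51874, #51341, #50295, #50298, #50672, #51432, #51003
  • Added vjp rules to composite operators for the gather_nd, reduce_max, group_norm, relu, reduce_max, gather, topk, sqrt, elementwise_pow, softmax, batch_norm, prod, multiply, expand, div, relu, slice, cumsum, sigmoid, layer_norm, sin, cos, roll, instance_norm, abs, assign, tile, scatter_nd_add, erf, floor, log, silu, leaky_relu, and pad operators #50966, #51653, #52663, #51742, #52203, #50794, #50305, #50786, #50679, #51045, #51230, #51474, #51283, #51238, #49831, #51838, #50771, #50565, #51768, #51750, #51748, #52532, #52935, #50963, #51430, #53141, #52469, #50436, #51059, #51296, #52533, #53374
  • Added second-order differentiation rules to composite operators for matmul, tanh, and elementwise #50452, #52192, #53014
  • Added bf16 data type support in composite operators for exp, reduce_mean, softmax, divide, cast, layer_norm, prod, meshgrid, expand_as, dropout, concat, gather_nd, elementwise_max, elementwise_pow, and reduce_max #54263, #54236, #53865, #54175, #54399
  • Dynamic-to-static: added support for assignment semantics on containers in control flow #51248
  • Dynamic-to-static: added whole-graph fallback; when dynamic-to-static conversion fails, the whole graph falls back to dynamic graph execution. The fallback mechanism adds the set_eval_frame API #50111, #52006
  • Dynamic-to-static: to_static supports the operator composition mechanism and supports using register_hook under the to_static decorator #49836, #52948, #53572
  • Dynamic-to-static: the to_static API adds a backend parameter, which can be set to CINN or None. When set to CINN, the CINN compiler is used to accelerate training and inference #52596
  • Added automatic code generation for primitive APIs: based on the operator definitions in ops.yaml and legacy_ops.yaml, the code for primitive APIs and for Tensor arithmetic APIs is generated automatically #50315, #49654, #50642
  • Added forward operator composition: by registering composition rules for forward operators, forward operators are decomposed into basic operators #49605
  • Added a composite operator switch: environment variables set in the shell control how operators are decomposed #50309
  • Added a composite test feature to OpTest to safeguard operator accuracy; added unit tests for elementwise basic operators; added CINN unit tests for batch_norm #50509, #50807, #52815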
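
The whole-graph fallback behavior listed above can be pictured with a small decorator. This is a conceptual sketch only: `to_static_with_fallback`, `ConversionError`, and the `convert` callback are hypothetical names and do not reflect how PaddlePaddle implements fallback internally. It shows the control flow: try to convert once, and if conversion fails, keep running the original dynamic (eager) function.

```python
import functools

class ConversionError(Exception):
    """Raised by a (hypothetical) converter when a function cannot be converted."""

def to_static_with_fallback(convert):
    """Wrap a function so that a failed conversion falls back to eager execution."""
    def decorator(fn):
        state = {"static_fn": None, "fell_back": False}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if state["fell_back"]:
                return fn(*args, **kwargs)
            if state["static_fn"] is None:
                try:
                    state["static_fn"] = convert(fn)
                except ConversionError:
                    # Conversion failed: run the whole function in dynamic mode.
                    state["fell_back"] = True
                    return fn(*args, **kwargs)
            return state["static_fn"](*args, **kwargs)

        wrapper.state = state
        return wrapper
    return decorator

def always_fails(fn):
    raise ConversionError("unsupported syntax")

@to_static_with_fallback(always_fails)
def add_one(x):
    return x + 1

print(add_one(1), add_one.state["fell_back"])
```

Remembering the failure in `state` avoids retrying an expensive conversion on every call, which is one plausible reason to treat fallback as a whole-graph decision.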

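Function optimization

  • Composite operators support FP16 and AMP O1 computation; added AMP logic for the softmax and layer_norm operators #52397, #52598, #51473
  • Simplified the composition rules and vjp rules of the composite operator batch_norm #54012, #51827, #51933
  • Optimized the composition rules of composite operators, improving the performance of rules that involve scalars; optimized composite operator log printing #51960, #50160
  • Composite operators support the jit.save API; added a custom VJP rule API #52344, #50885
  • Removed the overwrite parameter from the composite operator gather_grad. #52707
  • Dynamic-to-static: cleaned up code style, improved error messages, and standardized logs #48637, #46128, #52527, #46800, #46415
  • Dynamic-to-static: obtain the grad var name by calling append backward, fixing errors in higher-order gradient computation #53250
  • Dynamic-to-static upgrades: clean up the temporary directory of to_static to speed up code conversion; enhance to_static to skip internal APIs automatically; support using the to_static decorator in programs #47102, #50596, #45768
  • Dynamic-to-static: optimized print function conversion to support printing Tensor parameters during the network construction stage; upgraded the parameter collection mechanism #48672, #50336

bug fix

Performance optimization

  • During run_program_op execution in dynamic-to-static, added a scope caching and reuse mechanism to avoid passing in a new scope at every step #45813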
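
The scope caching idea can be illustrated with a tiny cache keyed by program. This is a sketch under assumptions: `ScopeCache` and `program_id` are hypothetical, and a plain dict stands in for a real scope object. It only shows the reuse pattern, where repeated steps of the same program get the same scope instead of a fresh one.

```python
class ScopeCache:
    """Reuse one scope per program instead of creating a new scope every step."""

    def __init__(self):
        self._scopes = {}
        self.created = 0  # counts how many scopes were actually allocated

    def get(self, program_id):
        scope = self._scopes.get(program_id)
        if scope is None:
            scope = {}  # stand-in for a real variable scope
            self._scopes[program_id] = scope
            self.created += 1
        return scope

cache = ScopeCache()
for step in range(100):
    scope = cache.get("train_program")
    scope["step"] = step  # variables live in the reused scope

print(cache.created)
```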

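Distributed Training

Dynamic graph distributed training

Auto parallel

  • Functional improvements to static graph semi-automatic parallel:
    • Added FLOPs calculation functions for multiple operators, and added FLOPs-based computation cost modeling #48083, #47978, #47595, #48083, #48084, #47816
    • Improved API usability: refined the DistAttr, Process Mesh, Engine API, information printing, and input/output modules; the execution Engine adds a cost API that can be used for theoretical analysis of a model's running time and GPU memory overhead #47503, #46416, #46554, #46633, #49214, #53848, #46552, #47043, #49665, #52912, #45776, #47263
    • Improved the generality and usability of optimization passes, supporting more scenarios and reducing pass pre-analysis time #46519, #47358, #46391, #51035
    • Enhanced debugging capability: added a distributed randomness control mechanism and a hybrid-parallel precision alignment tool #52903, #49865
    • Supports automatic partitioning of inference generation task networks, adapting to special usages in generative models such as control flow and conditional block #46771, #54067
    • Improved grad_clip to support load balancing in data parallel scenarios. #49510, #49249
  • Performance improvements to static graph semi-automatic parallel:
    • Added automatic communication fusion and multi-stream communication to the Sharding pass, improving throughput of the GPT 6.7B model on two machines by 26% #48604, #47180, #46180
    • Added tuning for the Recompute optimization strategy, which selects the optimal recompute checkpoint setting based on GPU memory and model size #48608, #47846, #49010
    • Added a 1F1B scheduling optimization pass for pipeline parallel #54260, #45915
    • Data parallel optimization: supports fused communication and communication-computation overlap, improving performance on the GPT 1.3B model by 5% #48092, #45643, #49744, #47578
    • Optimized concat performance in the Reshard module, reducing the number of concat operations in some scenarios. #47809
    • Upgraded the performance of the mixed-precision optimization pass, with support for BF16 low precision and adaptation of automatic hybrid parallel to while-loop control flow #51285, #51147, #49219, #49079
  • Functional improvements to static graph fully automatic parallel:

Parameter Server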

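CUDA

New features

  • Added compilation support for CUDA 12.0 and fixed related unit tests (#49539, #54542)
  • Added compilation support for the cuDNN Frontend API and related unit tests; enable it with the WITH_CUDNN_FRONTEND=ON build option. (#47524, #47612)

Function optimization

bug fix

  • Fixed computation errors and stack overflows in multiple operators such as trace, roll, dropout_nd, and log_softmax, as well as some unit test issues. (#50243, #52012, #53795, #53149, #53654, #51054, #49373, #53038)
  • Fixed the issue that exhaustive search for the conv operator does not take effect in some scenarios. (#47065)
  • Fixed timeouts on A100 in operators such as collective_reduce_scatter. (#54513)
  • Fixed an attribute error in the FusedLinear unit test. (#50359)
  • Fixed issues such as OOM that could occur when using Profiler (#46089)

Performance improvement

Documentation

  • Fixed errors in the index_put documentation (#53727)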

Intermediate Representation

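To address the stability issues of the existing PaddlePaddle IR system and reduce development cost, a new PaddlePaddle IR system was incubated, completing the basic data structure definitions, operator definition generation, and execution system adaptation. To better support the higher-order requirements of scientific computing scenarios, higher-order adaptation of operators such as silu and cast was completed.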

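CINN Compiler

New features

  • Added CINN support for 0D-Tensor. To keep pace with the main framework upgrade, this is currently supported through a temporary solution that adds a pass; this solution will be replaced and upgraded later. (#53382, #53955, #54064, #54118, #54216, #53454)
  • Added CINN support for data types such as int8/uint8/int16/uint16/bf16 (#50566, #53637)
  • Added support for the CINN expand operator (#46776)
  • Added CINN support for PaddleInference. (#45009)

Function optimization

  • CINN compiler: pass the skip_gc_vars attribute to the CINN subgraph; CINN adds fetch operators for skip_gc_vars #49471, #49553
  • CINN compiler: conv2d and conv2d_grad do not use CINN operators by default #51645
  • Added build_cinn_pass to BuildStrategy for use in dynamic-to-static (#49496)
  • Added unit tests for the reshape operator under the composite operator mechanism (#51276)
  • Changed the CINN version built with the main framework from a fixed commit to develop (#49775)
  • Set default Target parameters for CINN (#50182)

bug fix

  • Fixed inconsistent operator order after topological sorting during CINN symbolization. (#52556)
  • Fixed some operator computation errors, accuracy degradation, and unit test related issues (#53859, #54261, #46801, #53676, #53772)
  • Fixed issues with CINN support for the float16 type. (#48249)
  • Fixed issues in build_cinn_pass. (#46843)
  • Fixed the issue that, with composite operators plus dynamic-to-static and CINN enabled, backward variables were garbage-collected by mistake and left without a data area (#50116)
  • Fixed a compiler dropout AMP error, an error when running ResNet with composite operators, and an inplace variable not being found #51688, #52813, #51769

Performance improvement

  • Optimized reshape-related fusion strategies (#53066)
  • Optimized the performance of BuildCINNPass (#49696)
  • Optimized the performance of the subgraph detection module (#45040, #46937)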

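Hardware Integration

CustomDevice

  • On the training side, added support for the distributed strategies MP/Sharding/PP/MoE and for recompute; on the inference side, added support for the distributed strategy MP. Hardware integrated through CustomDevice, such as Ascend NPU and Cambricon MLU, automatically inherits all distributed strategies newly added to CustomDevice without any code changes. #52872, #54384, #53220, #54572, #54573, #54676, #53044, #53719, #53701, #53702, #53703
  • Added the API paddle.device.is_compiled_with_custom_device, so users can check whether the current environment supports the plug-in device backend for a given hardware #49271
  • Added the environment variable CUSTOM_DEVICE_BLACK_LIST; operators in the black list automatically run heterogeneously on the CPU #50409, #50666
  • Optimized CustomDevice performance by reducing the number of calls to the get_device_count API in the runtime #46963

Kunlunxin XPU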
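
The black list behavior described above amounts to a per-operator placement decision. The sketch below illustrates that decision in plain Python. It is not PaddlePaddle code: `parse_black_list` and `choose_place` are hypothetical helpers, and the comma-separated format of the environment variable is an assumption made for this example.

```python
import os

def parse_black_list(raw):
    """Split a comma-separated operator list (assumed format) into a set of names."""
    return {name.strip() for name in raw.split(",") if name.strip()}

def choose_place(op_name, black_list, device="custom_device"):
    """Blacklisted operators run on the CPU; everything else stays on the device."""
    return "cpu" if op_name in black_list else device

# Read the variable if it is set; an unset variable means an empty black list.
black_list = parse_black_list(os.environ.get("CUSTOM_DEVICE_BLACK_LIST", ""))
example_list = parse_black_list("softmax, topk")

for op in ["matmul", "softmax", "topk"]:
    print(op, "->", choose_place(op, example_list))
```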

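4. Deployment Direction (Paddle Inference)

New features

  • Paddle TensorRT supports sharing GPU memory among the TensorRT engines of multiple subgraphs, or among the TensorRT engines of different Predictors, to save GPU memory. #45842, #47631
  • The C++ API adds APIs for obtaining the shape and data type of input Tensors and of output Tensors. The C API adds SetExecStream, EnableMkldnnInt8, and other APIs that already exist in C++, for service deployment. #49758
  • Added the paddle.inference.Predictor.register_output_hook() API, which can print the output of each layer under GPU inference during debugging and can also be used in control flow models such as While. Note that this API does not support Paddle-TensorRT. #54433, #47050, #54254
  • The Predictor API of Paddle Inference supports paddle::Tensor as input and output, so users can directly reuse PaddlePaddle dynamic graph code for pre- and post-processing of inference. (#50445)
  • Enhanced the dynamic shape capability of Paddle TensorRT. When the config.enable_tuned_tensorrt_dynamic_shape() API is called without any arguments, the TensorRT engine is built at runtime, and shape information no longer needs to be collected before running. To avoid rebuilding at runtime, the first few runs should cover the minimum and maximum shapes. #52162
  • Paddle-TensorRT supports model input in NHWC format. #49633
  • Extended the config.Exp_DisableTensorRtOPs API to prevent operators from entering TensorRT by specifying the names of Tensor variables. #49497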

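Function optimization

  • Enhanced GPU mixed-precision inference (non Paddle TensorRT scenarios): Config.enable_use_gpu can now set the precision type. #47993
  • Supports inference with double type input. #51786
  • Because TensorRT operators do not support the INT64 type, models containing INT64 data failed to run. Paddle-TensorRT is enhanced so that when a model contains INT64 data, it is automatically converted down to INT32 for execution. #45547
  • Paddle-TensorRT supports more operators entering TensorRT inference, including:
  • Enhanced the Paddle-TensorRT mapping implementations of the strided_slice, instance_norm, prelu, argmax, cast, nearest_interp_v2, elementwise, and bilinear operators. #46819, #47998, #48043, #48998, #49675, #47495
  • Some Paddle-TensorRT operators (scale, square, sum, swish, expand_as_v2, prelu, gelu, hard_swish, hard_sigmoid, leaky_relu, softmax, stack, clip, cast, flatten_contiguous_range, unary, equal, elementwise_op) support 0-dimensional Tensors. #53660, #53627, #53634, #53714, #53729, #53769, #53506, #53704
  • Supports compilation with GCC12 and CUDA versions below 12.0. #50106
  • The DeformableConv plugin of Paddle-TensorRT supports dynamic shape input. #50698
  • Paddle-TensorRT adds plugin support for the lookup_table operator. #46613
  • Added the config.enable_low_precision_io() API to support low-precision input in Paddle-TensorRT scenarios. #52485
  • The LayerNorm plugin of Paddle-TensorRT supports FP16 computation. #45043
  • The Predictor input data paddle_infer::Tensor supports the bool type. #49388
  • The enhanced Convolution implementation of Paddle-TensorRT uses ConvolutionNd. #47653
  • The conv2d_fusion fused operator supports the NHWC format. #49047
  • Adjusted the directory structure related to Phi operators in the C++ inference library. #53091
  • When the TensorRT serialization and loading versions do not match, the TensorRT engine is rebuilt instead of reporting an error. #50775
  • Optimized the log messages printed by Paddle-TensorRT at runtime. #50181
  • oneDNN-based CPU inference supports 0-dimensional Tensor input for elementwise. #51656
  • Cleaned up and standardized Paddle-TensorRT support for the FC, matmul, and matmul_v2 operators, uniformly upgrading them to TensorRT's IMatrixMultiplyLayer. #52222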
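
The INT64 to INT32 conversion mentioned in this list is only lossless when every value fits in 32 bits. The following sketch shows that range check in plain Python. It is an illustration, not the Paddle-TensorRT implementation: `fits_int32` and `downcast_to_int32` are hypothetical helpers, and raising on overflow is a choice made for this example rather than documented framework behavior.

```python
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def fits_int32(values):
    """Return True when every integer can be represented as a 32-bit signed int."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values)

def downcast_to_int32(values):
    """Return the values unchanged if the downcast is lossless, else raise."""
    if not fits_int32(values):
        raise OverflowError("value does not fit in INT32")
    return list(values)

print(downcast_to_int32([0, 42, -7]))
print(fits_int32([2**31]))
```

Typical INT64 tensors in inference models hold indices or token ids, which usually sit far inside this range, but checking inputs is a cheap safeguard when feeding such models.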

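Performance improvement

  • Supports multiple lookup_tables entering the Embedding+Eltwise+LayerNorm fusion of Paddle-TensorRT. #46243, #46230
  • Added a fused MoE Phi operator to improve the inference performance of MoE models. #48703
  • In INT8 quantized inference scenarios, Paddle-TensorRT plugins fall back to FP16 computation instead of FP32. #50554
  • Optimized memory and GPU memory usage during inference. #49051, #49046, #53930
  • Enhanced the layout optimization pass. #52997
  • Supports caching operator shape inference to improve model inference performance. #48312
  • Optimized the bias+add+relu fusion with half2 instructions. #49048
  • Optimized the multi-input Concat kernel with vectorized operations. #49540
  • Implemented Convolution, Depthwise Convolution, and related fused operators based on CUTLASS to improve inference speed. #47989, #50603, #51792, #50603
  • Paddle-TensorRT supports a FlashAttention plugin, improving the inference speed of models such as StableDiffusion. #49438
  • Added a Transpose+LayerNorm fusion pass, improving the inference speed of models such as StableDiffusion. #50082
  • Added Elementwise+Transpose fusion. #50081
  • Optimized the Group Norm plugin implementation of Paddle-TensorRT. #49160
  • The Config.EnableTensorRtEngine() API adds a use_cuda_graph parameter to enable CUDA Graph, which can reduce runtime latency. Note that the model input shape must stay unchanged when it is used. #53406
  • Supports inplace operation for Reshape, reducing copy time during model execution. #49146
  • Optimized the LayerNorm kernel implementation based on oneDNN. #47782
  • Supports quantize+transpose and transpose+dequantize fusion based on oneDNN. #49509
  • In CPU inference with MKLDNN enabled, FC-related fusion passes are enabled by default to improve performance. #45704
  • oneDNN CPU inference supports squeeze2 + transpose2 fusion. #47592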
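
Caching shape inference, as listed above, pays off because inference often sees the same input shapes repeatedly. A minimal sketch of the idea uses memoization on a conv2d output-shape function. The function name and argument layout are hypothetical; the formula is the standard convolution output-size formula, and `functools.lru_cache` stands in for whatever cache the framework actually uses.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def conv2d_out_shape(in_shape, kernel, stride=1, padding=0, dilation=1):
    """Infer the NCHW output shape of a conv2d; results are cached per argument set.

    in_shape is (N, C, H, W) and kernel is (out_channels, kH, kW), both tuples
    so that they are hashable cache keys.
    """
    n, _, h, w = in_shape
    out_c, k_h, k_w = kernel

    def out_size(size, k):
        return (size + 2 * padding - dilation * (k - 1) - 1) // stride + 1

    return (n, out_c, out_size(h, k_h), out_size(w, k_w))

# The second call with identical arguments is served from the cache.
print(conv2d_out_shape((1, 3, 224, 224), (64, 7, 7), 2, 3))
print(conv2d_out_shape((1, 3, 224, 224), (64, 7, 7), 2, 3))
print(conv2d_out_shape.cache_info().hits)
```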

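XPU Inference Enhancements and Performance Optimization

  • Added the ExpRunWithRuntimeConfig API and XpuRuntimeConfig, which allow setting parameters such as the external stream and L3 cache during inference; the GetExecStream API supports obtaining the Kunlun external stream object; input and output support Kunlun device memory to reduce D2H and H2D overhead. #53334, #52466, #53240
  • Added the multi-encoder and fused_multi_transformer operators and fusion passes to improve the performance of ERNIE and Transformer models. #50570, #51346, #50499, #53982, #50759, #51571, #53144, #53306
  • Optimized BeamSearch performance: when beam_size=1, fine-grained operators such as write_read_array and gather are transformed, removed, and fused to improve model performance. #53130
  • Multiple stack operators with the same input are transformed into unsqueeze operators that support broadcast; unsqueeze/squeeze support inplace computation. #52099
  • Added support for exporting multi-card inference models for Kunlunxin. #50490
  • Added the embedding_with_eltwise_add fusion pass and operator phi kernel, reducing device memory usage and improving inference performance. #50590
  • The phi kernels of interpolate operators support FP16. #52358
  • The argmax operator supports INT32 output. #51303
  • Fixed an error when saving the serialized model with only a model file after enabling mixed-precision inference mode. #52994
  • Fixed a segmentation fault in instance_norm when scale and bias are empty. #52627
  • The conv_transpose operator supports FP16. #53626
  • Added the yolo_box_xpu fusion pass and operator phi kernel to optimize the common substructure of YOLO models. #54163
  • Added the conv2d_xpu fusion pass and operator phi kernel, with FP16 inference support, reducing the inference time of convolution operations. #52247, #53626
  • Added the generic sigmoid_elementmul fusion pass, which fuses into the swish operator to match the conv2d_fusion pass and improve YOLO model inference performance. #53580
  • Added the act_add fusion pass and operator phi kernel to improve inference performance. #53965
  • Added the fold_interp_outsize fusion pass to improve inference performance. #54245
  • Fixed incorrect results caused by repeated fusion when FC has a shared weight. #51108, #51039
  • Removed the op_device attribute, which is used only for training, from operators, to prevent the training-time place from being selected by mistake during inference. #51029
  • Supports saving the optimized model, so that later inference runs can skip pass optimization and reduce the time of the first inference. #53696
  • Fixed computation errors caused by the CPUPlace input of an operator kernel being forcibly copied to XPU. #51306
  • subblock supports copying parameters H2D in advance to improve inference performance. #51876
  • Fixed the storage size of the scale for output activations on second-generation Kunlunxin chips. #53505
  • In the new executor, Kunlunxin D2D copy supports asynchronous execution. #51876
  • Removed concat operators that have only one input. #52304
  • lookup_table_v2 supports FP16, removing redundant cast operators. #52888
  • The control flow While operator supports caching the scope, reducing the overhead of creating a new scope each time. #52628
  • scatter adds FP16 support; removed redundant cast operators and elementwise_mul operators where one input is 1. #52831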

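Model Quantization

  • Comprehensive upgrade of dynamic graph quantization
    • Added paddle.quantization.QAT as the API for quantization-aware training of dynamic graph models; quantization parameters are passed in through configuration, simplifying the quantization training workflow and lowering the difficulty of secondary development (#49398)
    • Added paddle.quantization.PTQ as the API for post-training quantization, which supports exporting the quantized model to a model format supported by inference (#50107)
    • Added the STUB operator, which simulates the actual quantization operation during training (#50510)
  • Supports loading the parameters of a post-training quantized model into a quantization training model, and supports quantization of more operators, including matmul, scale, and conv1d. #47892, #45911, #48912
  • Supports hybrid parallel training for static graph quantization training. #52219
  • Fixed issues in dynamic graph quantization:
    • Quantization nodes were inserted repeatedly when exporting a quantization training model. #48751
    • Fixed the issue of inserting quantization nodes for model input. #49926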
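
Simulating quantization during training is commonly done with a quantize-then-dequantize step, often called fake quantization. The sketch below shows one such step in plain Python. It is an assumption-based illustration, not the behavior of PaddlePaddle's quantization operators: `fake_quantize` is a hypothetical helper that uses symmetric, per-tensor, abs-max scaling to signed integers.

```python
def fake_quantize(values, bits=8):
    """Quantize floats to signed integers and map them back to floats.

    Uses a symmetric abs-max scale (an assumption for this sketch), so the
    returned values carry the rounding error a real integer kernel would see.
    """
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]
    scale = max_abs / qmax
    result = []
    for v in values:
        q = round(v / scale)             # quantize
        q = max(-qmax, min(qmax, q))     # clamp to the integer range
        result.append(q * scale)         # dequantize
    return result

weights = [0.5, -1.0, 0.25]
print(fake_quantize(weights))
```

The output stays in floating point, so the rest of the training graph is unchanged while the model learns to tolerate the rounding error, which is bounded by half the scale per element.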

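5. Environment Adaptation

To improve source compilation efficiency, the setuptools + ninja build method was improved and promoted, raising development efficiency: in the CPU scenario, full compilation time is reduced by 20 min and compilation speed improves by 24.52%; in the GPU scenario, full compilation time is reduced by 22 min and compilation speed improves by 29.31%. To adapt to mainstream development environments, PaddlePaddle source compilation now supports gcc12 and the C++17 standard and is adapted to the latest CUDA 12. For code quality, compilation warnings were cleaned up to improve the build experience. For third-party dependencies, the underlying protobuf version was upgraded to reduce dependency conflicts, deprecated attributes of some older dependency libraries and outdated code formats were cleaned up, and support for Python 2.x was removed.

6. Security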

Thanks to our Contributors

This release contains contributions from:
1want2sleep, 201716010711, 404988613, 5u13, 6clc, Ackeraa, Aganlengzi, ahahahahahaha, Ainavo, Allen Guo, andyj, Asthestarsfalll, Aurelius84, Ayuan, BellaZYL, Bjmw3, Bo Zhang, bukejiyu, caozhou, carryyu, Ccc, ccrrong, ceci3, chalsliu, Chang Xu, CHANGer, Charles-hit, Chen Weihang, chenjian, Chenxiao Niu, chenxiao120660, chenxujun, Chitsing KUI, cifar10, co63oc, CollaborativeFiltering, csy0225, cxxly, cyber-pioneer, cyberslack_lee, czr-gc, Dandelight, danleifeng, Danyang Zhang, dasen, denglianbin, Difer, dongfangshenzhu, DrowFish19, duanboqiang, duanyanhui, engineer, engineer1109, Epsilon Luoo, feifei-111, Feiyu Chan, Feng Ni, feng_shuai, Fisher, FlyingQianMM, Frank Lin, Galaxy1458, GaoYuYang, gaoziyuan, gem5, GGBond8488, Ghost Screaming, gongenlei, gouzil, Guanghua Yu, Guo Sheng, Guoxia Wang, Hamid Zare, Hanchiao, handiz, Haohongxiang, haosicheng, haozi, Happyd99, heliqi, hellockx, hellolllw, heyanru, hg-1099255210, hh-qiao, hjyp, hong, HongyuJia, houj04, hua-zi, Huang Jiyi, Huang Zhengjie, huangjiyi, huangjun12, Hui Zhang, Huihuang Zheng, Hulek, hwa, HydrogenSulfate, Ikko Eltociear Ashimine, iLeGend, Infinity_lee, Infrared1029, Jacek Czaja, jakpiase, james, jameszhang, Jiabin Yang, jiahongyu, jiangcheng, jiangfan06, Jianghai, jiaqianjing, jingsongliu, JingZhuangzhuang, jjyaoao, joanna.wozna.intel, junxiu777, Jx-qi, JYChen, JZ-LIANG, jzhang533, Kai Song, Kai Xing, Kaipeng Deng, Kang Zhao, kangguangli, Kevin吴嘉文, Kim, Kim Yann, knamg, kuizhiqing, lanxianghit, Leding Li, Leo Chen, Leo Guo, levi131, Li Min, Li-fAngyU, Ligoml, lijialin03, lijin23, limingshu, Lin Manhui, LinearTemporalLogic, Linjie Chen, lishicheng1996, Little-chick, littleforest, liu zhengxi, liulinduo, liuruyan, liuzhenhai93, LiYuRio, lj970926, LokeZhou, LoneRanger, lubiu, Lucas, lugimzzz, Lux et Veritas, lxsbupt, LyndonKong, lzy, lzydev, Mahmoud Ashraf, Manan Goel, Maple Xie, Matsumoto Ruko, mayang002, MayYouBeProsperous, megemini, mengziheng, Meteor Liu, mhy, mhy-666, Ming-Xu Huang, ming1753, 
minghaoBD, mjxs, Moqim, Mountagha, Mr.Juice, mrcangye, NetPunk, Netpunk, nihao, niuliling123, Nyakku Shigure, OccupyMars2025, Ouyang Chao, pangengzheng, pangyoki, parap1uie-s, Paulina Gacek, Piotr Paturej, PommesPeter, PPGitub, PPPPzhang, PuQing, Qi Li, Qi Shao, QingshuChen, qipengh, qizhaoaoe, Rayman, RedContritio, RichardWooSJTU, risemeup1, Roc, ronnywang, Ruibiao Chen, Ruibin Cheung, RuohengMa, Ryan, SaltFish11, Sanbu, Scotty, scotty, seemingwang, Shaojie WANG, ShenLiang, shentanyue, Shijie, Shuangchi He, Siming Dai, Sing_chan, sneaxiy, Sonder, sprouteer, Sqhttwl, sunli, superwinner1, supplyout, SylarTiaNII, Sylwester Fraczek, Sławomir Siwek, taixiurong, Tao Luo, Taylor-Layrose, TeFeng Chen, Thomas Young, thunder95, Thunderbrook, Tian, Tian Zheng, tiancaishaonvjituizi, tianshuo78520a, tifa, Tinson Lai, Tomasz Socha, Tony Cao, ucsk, umiswing, ustiniankw, Vegetable dog, Vigi Zhang, Vvsmile, Wang Bojun, Wang Xin, Wang Xinyu, wangfengsheng1999, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangshengxiang, wangxiaoning, wangxinxin08, Wangzheee, WangZhen, wangzhen38, wasupandceacar, wawltor, Wei Shengyu, Weilong Wu, weishengying, Wen Sun, wenbin, wentao yu, wenzhe.wang, westfish, whisky-12, whs, Wilber, will-jl944, winter-wang, Winters Montagne, WJJ1995, wuhuachaocoding, wuyefeilin, wz1qqx, XiangGao, xiaoguoguo626807, xiaohemaikoo, xiaoluomi, xiaoting, xiaoxiaohehe001, Xiaoxu Chen, xiaoyuanzi914, Xinger, Xinyu Chen, xiongkun, xjmxyt, xu98bin, xysheng-baidu, yangguohao, yangjianfengo1, YangQun, YangZhou, yeliang2258, YepKong, Yichen Zhang, yikaikkk, Yiqun Liu, yjphhw, ykkk2333, Young-Flash, yu wentao, Yuang Liu, Yuanle Liu, YuanRisheng, yuchen202, yuehuayingxueluo, YuhangLi, Yulong Ao, YUNSHEN XIE, yunyaoXYY, YuRonan, zachary sun, ZeKai Zhou, Zenghui Yuan, zengshao0622, Zero Rains, Zhan Rongrui, Zhang Jun, Zhang Na, Zhang Ting, Zhang Zheng, zhangbo9674, ZhangDY-6483, zhangkaihuo, zhangxin81, zhangyikun02, zhangyingying520, zhangyuqin1998, zhaocaibei123, 
zhaoyingli, Zhen Wang, Zheng-Bicheng, Zhenghai Zhang, Zheng_Bicheng, zhenyun, Zhibao Li, zhiboniu, Zhong Hui, Zhou Wei, ZhouMengLei1999, zhoutianzi666, zhouzj, zhupengyang, zhurou603, zhuyipin, zhwesky2010, ziyoujiyi, zlsh80826, Zman, zmxdream, zqw_1997, Zuza Gawrysiak, zxcd, zyfncg, ZZK, zzk0, 丁一, 傅剑寒, 六个骨头, 卢林, 周周周, 姜永久, 学渣戊, 张春乔, 张正海, 柠檬味~, 王明冬, 石晓伟, 超级码牛, 陈沧夜, 骑马小猫

PaddlePaddle 2.5.0 Release Note

1. Highlights

  • New unified dynamic-static architecture: Implements a new execution mode that combines dynamic-to-static conversion, composition from basic operators, and compiler execution, and completes the full pipeline of dynamic-to-static conversion, composite operators, and neural network compiler optimization and acceleration on the ResNet50 and BERT models. For dynamic-to-static, the core whole-graph fallback feature is complete, so execution falls back to dynamic graph training when conversion fails. For composite operators, a basic operator system of more than 150 basic operators is designed, with a Python-level forward operator decomposition mechanism and a backward operator decomposition mechanism that supports both dynamic and static graphs; more than 70 commonly used forward and backward operators are decomposed. For the CINN compiler, correctness bugs are fixed, key passes are developed, manual schedule rules are added, and kernel code is generated automatically, improving ResNet50 performance by 12% and BERT performance by 10%.
  • Unified operator architecture in the PHI operator library: All of the remaining 350+ operator kernels under the original operator system are unified into the PHI operator library, and operator definitions in the original system are unified into the PHI definition form (operators defined through YAML configuration), which improves architectural consistency and lowers the cost of understanding framework development. All Fluid header files that the PHI operator library depended on are decoupled, and the library is compiled independently as a dynamic link library, giving secondary development of the framework a lighter way to reuse the operator library. Non-standard operators and operator kernels in the PaddlePaddle framework continue to be standardized, making them easier for developers to understand and lowering the cost of hardware integration.
  • New static graph executor fully launched: The new static graph executor implements a number of functional and performance optimizations and unifies and replaces the multiple old executors. It is now the default execution engine for the Python-side entry points of static graph single-card and distributed training, and for backends such as dynamic-to-static, control flow, and CINN. Framework scheduling performance improves significantly, the functional architecture is clearer, and secondary development capability is significantly enhanced.
  • Python API support for 0-dimensional tensors: Clear semantics are defined for tensors of shape [1,] and tensors of shape [], and many API behaviors, such as paddle.sum, are fixed to support tensors of shape [].
  • New environment adaptation: CUDA 12 is supported, and compilation with gcc12 is supported.

2. Incompatibility Upgrade

  • PaddlePaddle APIs support 0-dimensional tensors. PaddlePaddle previously used a 1-dimensional tensor of shape [1] in place of a 0-dimensional tensor. This differs from current mainstream practice, increases the cost of developing and debugging models, and sometimes causes unexpected errors. This release corrects 376 APIs that need to support 0-dimensional tensors and aligns with tools widely used by the community, such as EinOps. For example, the loss output during model training used to be a 1-dimensional tensor, so taking out or printing the loss often required code such as loss.numpy()[0]. After this change, the loss output during training is a 0-dimensional tensor, and loss.numpy() is enough to take out or print the loss. The code is short, easy to understand, and consistent with industry practice.
  • The paddle.fluid API is fully retired. Following the plan announced in the previous release, 1116 paddle.fluid APIs and related internal interfaces have been retired, and the few remaining related internal interfaces will be cleaned up in the next release. The fluid API is a set of historical APIs that PaddlePaddle 2.0 had planned to remove but whose cleanup was delayed for compatibility and other reasons. This cleanup does not affect programs developed on PaddlePaddle 2.0, and the PaddlePaddle API system becomes more concise and easier to understand.
  • Cleanup of the old dynamic graph Python-side code is complete. From now on, the Python side uses only the new dynamic graph to call the C++ core logic.
  • To unify the data parallel training method for static graph models, the original single-process multi-card training method is deprecated, including the paddle.static.ParallelExecutor and paddle.static.CompiledProgram().with_data_parallel() APIs. These APIs support only single-machine multi-card training, not multi-machine multi-card training, and their underlying execution performance is poor. The multi-process multi-card method is recommended throughout, that is, the paddle.distributed.launch API for data parallel distributed training. This upgrade affects only static graphs, not dynamic graphs or dynamic-to-static training. If you use the deprecated APIs, refer to the data parallel documentation to modify your model code. #50351, #50501, #51240, #51701, #51616, #51369, #52671
  • The original adaptation code for Ascend NPU and Cambricon MLU is removed from the framework; both are upgraded to the CustomDevice plug-in adaptation method, and their adaptation code is migrated to the PaddleCustomDevice repository.

3. more details release note 2.5.0