x is an unknown scalar, and f(x) is a univariate (scalar-valued) function of x
- The derivative of a scalar with respect to a scalar is still a scalar
$df=f'(x)dx$
f(x) is a scalar-valued function of $x=[x_1, x_2, ..., x_n]$; we compute the partial derivative of f(x) with respect to a single component $x_i$
- Componentwise this is again a scalar differentiated with respect to a scalar, so the result is a scalar
$df=\frac{\partial f}{\partial x_i}dx_i$
If $x\in R^n$ is a vector, $x=[x_1, x_2, ..., x_n]^T$, i.e. an $n\times 1$ column vector, and f(x) is a scalar-valued function of x
- The derivative of a scalar with respect to a vector is a vector
$\frac{\partial f}{\partial x}=[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, ..., \frac{\partial f}{\partial x_n}]^T$, $dx=[dx_1, dx_2, ..., dx_n]^T$, and $df=\sum_{i=1}^n\frac{\partial f}{\partial x_i}dx_i=\frac{\partial f}{\partial x}^Tdx$
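As a quick sanity check of $df\approx\frac{\partial f}{\partial x}^Tdx$, here is a minimal NumPy sketch; the test function $f(x)=\sum_i x_i^3$ and its gradient $3x^2$ are illustrative assumptions, not part of the notes above.

```python
import numpy as np

# Illustrative test function f(x) = sum(x_i^3); its gradient is 3*x**2.
def f(x):
    return np.sum(x ** 3)

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
dx = 1e-6 * rng.standard_normal(5)     # a small perturbation

grad = 3 * x ** 2                      # analytic gradient ∂f/∂x
df_exact = f(x + dx) - f(x)            # actual change in f
df_linear = grad @ dx                  # first-order prediction (∂f/∂x)^T dx

print(df_exact, df_linear)             # agree up to second-order terms
```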
If $X\in R^{m\times n}$ is a matrix, $X=\left[ \begin{matrix} x_{11} & x_{12} & ... & x_{1n}\\ x_{21} & x_{22} & ... & x_{2n} \\ & & ... & \\ x_{m1} & x_{m2} & ... & x_{mn} \end{matrix} \right]$, i.e. an $m\times n$ matrix, and f(X) is a scalar-valued function of X
- The derivative of a scalar with respect to a matrix is a matrix
- $\frac{\partial f}{\partial X}=\left[ \begin{matrix} \frac{\partial f}{\partial x_{11}} & \frac{\partial f}{\partial x_{12}} & ... & \frac{\partial f}{\partial x_{1n}}\\ \frac{\partial f}{\partial x_{21}} & \frac{\partial f}{\partial x_{22}} & ... & \frac{\partial f}{\partial x_{2n}} \\ & & ... & \\ \frac{\partial f}{\partial x_{m1}} & \frac{\partial f}{\partial x_{m2}} & ... & \frac{\partial f}{\partial x_{mn}} \end{matrix} \right]$
- $dX=\left[ \begin{matrix} dx_{11} & dx_{12} & ... & dx_{1n}\\ dx_{21} & dx_{22} & ... & dx_{2n} \\ & & ... & \\ dx_{m1} & dx_{m2} & ... & dx_{mn} \end{matrix} \right]$
- $df=\sum_{i=1}^m\sum_{j=1}^n\frac{\partial f}{\partial x_{ij}}dx_{ij}=tr(\frac{\partial f}{\partial X}^TdX)$
- In fact $\mathrm{sum}(\frac{\partial f}{\partial X}\odot dX)$ would suffice here, but for ease of manipulation we rewrite it as a linear-algebra (trace) expression
- Here we use the identity $tr(A^TB)=\sum_{i,j}A_{ij}B_{ij}$
- A quick check by inspection:
$$
A=\left[ \begin{matrix} a_{11} & a_{12} & ... & a_{1n}\\ a_{21} & a_{22} & ... & a_{2n} \\ & & ... & \\ a_{m1} & a_{m2} & ... & a_{mn} \end{matrix} \right],\quad
B=\left[ \begin{matrix} b_{11} & b_{12} & ... & b_{1n}\\ b_{21} & b_{22} & ... & b_{2n} \\ & & ... & \\ b_{m1} & b_{m2} & ... & b_{mn} \end{matrix} \right]
$$
$$
\begin{aligned}
A^T\times B&=\left[ \begin{matrix} a_{11} & a_{21} & ... & a_{m1}\\ a_{12} & a_{22} & ... & a_{m2} \\ & & ... & \\ a_{1n} & a_{2n} & ... & a_{mn} \end{matrix} \right]\times\left[ \begin{matrix} b_{11} & b_{12} & ... & b_{1n}\\ b_{21} & b_{22} & ... & b_{2n} \\ & & ... & \\ b_{m1} & b_{m2} & ... & b_{mn} \end{matrix} \right]\\
&=\left[ \begin{matrix} a_{11}\times b_{11}+...+a_{m1}\times b_{m1} & a_{11}\times b_{12}+...+a_{m1}\times b_{m2} & ... & a_{11}\times b_{1n}+...+a_{m1}\times b_{mn}\\ a_{12}\times b_{11}+...+a_{m2}\times b_{m1} & a_{12}\times b_{12}+...+a_{m2}\times b_{m2} & ... & a_{12}\times b_{1n}+...+a_{m2}\times b_{mn} \\ & & ... & \\ a_{1n}\times b_{11}+...+a_{mn}\times b_{m1} & a_{1n}\times b_{12}+...+a_{mn}\times b_{m2} & ... & a_{1n}\times b_{1n}+...+a_{mn}\times b_{mn} \end{matrix} \right]\\
&=C
\end{aligned}
$$
Reading off the diagonal of $C$ gives $tr(C)=tr(A^TB)=\sum_{i,j}A_{ij}B_{ij}$
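The identity can also be checked numerically; the sketch below (random 4×3 NumPy matrices) is only an illustration, not part of the derivation.

```python
import numpy as np

# Check the identity tr(A^T B) = sum_{i,j} A_ij * B_ij on random matrices.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

lhs = np.trace(A.T @ B)        # tr(A^T B)
rhs = np.sum(A * B)            # elementwise product, then sum

print(np.isclose(lhs, rhs))    # True
```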
Differentiation rules for vectors and matrices:
- Addition and subtraction: $d(X\pm Y)=dX\pm dY$
- Matrix multiplication: $d(XY)=(dX)Y+X\,dY$
- Transpose: $d(X^T)=(dX)^T$
- Trace: $d(tr(X))=tr(dX)$
- Elementwise (Hadamard) product: $d(X\odot Y)=dX\odot Y+X\odot dY$
- Elementwise function: $d\sigma(X)=\sigma'(X)\odot dX$, where $\sigma$ is applied elementwise
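A minimal numerical check of the matrix product rule $d(XY)=(dX)Y+X\,dY$; the shapes and perturbation size below are arbitrary illustrative choices.

```python
import numpy as np

# Numerically check the product rule d(XY) = (dX)Y + X(dY):
# perturb X and Y by small dX, dY and compare the exact change of XY
# with the first-order prediction.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
Y = rng.standard_normal((4, 2))
dX = 1e-6 * rng.standard_normal((3, 4))
dY = 1e-6 * rng.standard_normal((4, 2))

exact = (X + dX) @ (Y + dY) - X @ Y    # actual change of the product
linear = dX @ Y + X @ dY               # first-order differential

print(np.max(np.abs(exact - linear)))  # ~1e-12, i.e. second-order small
```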
Trace tricks:
- Wrapping a scalar in a trace: $x=tr(x)$
- Transpose: $tr(A^T)=tr(A)$
- Linearity: $tr(A\pm B)=tr(A)\pm tr(B)$
- Cyclic commutation: $tr(AB)=tr(BA)$, where $A$ and $B^T$ have the same shape
- Hadamard exchange: $tr(A^T(B\odot C))=tr((A\odot B)^TC)$, where $A$, $B$, $C$ all have the same shape
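The two less obvious tricks (cyclic commutation and the Hadamard exchange) can be verified numerically; the sketch below uses random NumPy matrices chosen purely for illustration.

```python
import numpy as np

# Check two trace tricks on random matrices:
#   tr(AB) = tr(BA)                    (A is m×n, B is n×m)
#   tr(A^T (B ⊙ C)) = tr((A ⊙ B)^T C)  (A, B, C all m×n)
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))           # True

A, B, C = rng.standard_normal((3, 3, 4))                      # three 3×4 matrices
print(np.isclose(np.trace(A.T @ (B * C)),
                 np.trace((A * B).T @ C)))                    # True
```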
Derivation technique: first wrap the scalar $df$ in a trace, then use the trace tricks to transform it into the form $tr((g(X))^TdX)$; $g(X)$ is then the desired derivative $\frac{\partial f}{\partial X}$.
Example: $Y=AXB$, where $f$ is a scalar function of $Y$ with known $\frac{\partial f}{\partial Y}$
$$
\begin{aligned}
df&=tr(\frac{\partial f}{\partial Y}^TdY)\\
&=tr(\frac{\partial f}{\partial Y}^T A\,dX\,B)\quad\text{(substitute }dY=A\,dX\,B)\\
&=tr(B\frac{\partial f}{\partial Y}^TA\,dX)\quad\text{(cyclic trace trick, moving }dX\text{ to the end)}\\
&=tr((A^T\frac{\partial f}{\partial Y}B^T)^TdX)\quad\text{(rearrange into the form }tr(g(X)^TdX))
\end{aligned}
$$
Hence $\frac{\partial f}{\partial X}=A^T\frac{\partial f}{\partial Y}B^T$.
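A numerical check of this result, assuming the illustrative test function $f(Y)=\sum_{ij}Y_{ij}^2$ (so $\frac{\partial f}{\partial Y}=2Y$), compared against a central-difference gradient:

```python
import numpy as np

# Check ∂f/∂X = A^T (∂f/∂Y) B^T for Y = A X B, using the illustrative
# test function f(Y) = sum(Y**2), whose ∂f/∂Y = 2Y.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 2))

def f(X):
    Y = A @ X @ B
    return np.sum(Y ** 2)

Y = A @ X @ B
grad_analytic = A.T @ (2 * Y) @ B.T          # A^T (∂f/∂Y) B^T

# finite-difference gradient, one entry at a time
eps = 1e-6
grad_numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X); E[i, j] = eps
        grad_numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # ~1e-8
```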
Example: $f=a^Texp(Xb)$
$$
\begin{aligned}
df&=a^T(exp(Xb)\odot d(Xb))\\
&=tr(a^T(exp(Xb)\odot (dX\,b)))\quad\text{(wrap the scalar in a trace)}\\
&=tr((a\odot exp(Xb))^T dX\,b)\quad\text{(Hadamard exchange trick)}\\
&=tr(b(a\odot exp(Xb))^T dX)\quad\text{(cyclic trace trick)}\\
&=tr(((a\odot exp(Xb))b^T)^TdX)
\end{aligned}
$$
Hence $\frac{\partial f}{\partial X}=(a\odot exp(Xb))b^T$.
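The same finite-difference check for $\frac{\partial f}{\partial X}=(a\odot exp(Xb))b^T$; all shapes below are arbitrary illustrative choices.

```python
import numpy as np

# Check ∂f/∂X = (a ⊙ exp(Xb)) b^T for f = a^T exp(Xb) by finite differences.
rng = np.random.default_rng(0)
a = rng.standard_normal(4)
b = rng.standard_normal(3)
X = rng.standard_normal((4, 3))

def f(X):
    return a @ np.exp(X @ b)

grad_analytic = np.outer(a * np.exp(X @ b), b)   # (a ⊙ exp(Xb)) b^T

eps = 1e-6
grad_numeric = np.zeros_like(X)
for i in range(4):
    for j in range(3):
        E = np.zeros_like(X); E[i, j] = eps
        grad_numeric[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # ~1e-9
```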
Example: $f=a^TXb$
$$
\begin{aligned}
df&=a^T dX\,b\\
&=tr(a^TdX\,b)\\
&=tr(ba^TdX)\\
&=tr((ab^T)^TdX)
\end{aligned}
$$
Hence $\frac{\partial f}{\partial X}=ab^T$.
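Because $f=a^TXb$ is linear in $X$, the identity $df=tr((ab^T)^TdX)$ holds exactly for any perturbation, which the sketch below (random illustrative data) confirms.

```python
import numpy as np

# Check ∂f/∂X = a b^T for f = a^T X b (a simple bilinear form).
rng = np.random.default_rng(0)
a = rng.standard_normal(4)
b = rng.standard_normal(3)
X = rng.standard_normal((4, 3))
dX = 1e-6 * rng.standard_normal((4, 3))

grad = np.outer(a, b)                            # a b^T
df_exact = a @ (X + dX) @ b - a @ X @ b          # exact change (f is linear in X)
df_trace = np.trace(grad.T @ dX)                 # tr((∂f/∂X)^T dX)

print(np.isclose(df_exact, df_trace))            # True
```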
Example: [Linear regression] $f=||Xw-y||^2=(Xw-y)^T(Xw-y)$
$$
\begin{aligned}
df&=(d(Xw-y))^T(Xw-y)+(Xw-y)^Td(Xw-y)\\
&=(X\,dw)^T(Xw-y)+(Xw-y)^TX\,dw\\
&=dw^TX^T(Xw-y)+(Xw-y)^TX\,dw\\
&=tr(dw^TX^T(Xw-y))+(Xw-y)^TX\,dw\quad\text{(a scalar equals its own trace)}\\
&=tr((X^T(Xw-y))^Tdw)+(Xw-y)^TX\,dw\\
&=(Xw-y)^TX\,dw+(Xw-y)^TX\,dw\\
&=2(Xw-y)^TX\,dw=tr((2X^T(Xw-y))^Tdw)
\end{aligned}
$$
Hence $\frac{\partial f}{\partial w}=2X^T(Xw-y)$. Setting $\frac{\partial f}{\partial w}=0$:
$$
X^T(Xw-y)=0\\
X^TXw=X^Ty\\
w=(X^TX)^{-1}X^Ty
$$
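A quick check that the closed-form solution satisfies the first-order condition and matches NumPy's least-squares solver; the data below are random and purely illustrative.

```python
import numpy as np

# Check the normal-equation solution w = (X^T X)^{-1} X^T y: at this w the
# gradient 2 X^T (Xw - y) vanishes, and it matches np.linalg.lstsq.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
y = rng.standard_normal(50)

w = np.linalg.solve(X.T @ X, X.T @ y)            # (X^T X)^{-1} X^T y
grad = 2 * X.T @ (X @ w - y)                     # ∂f/∂w at the optimum

print(np.max(np.abs(grad)))                                  # ~1e-13
print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0]))  # True
```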
Example: [Softmax regression] $f=-y^T\log Softmax(Wx)$, where $y$ is an $m\times 1$ one-hot vector, $W$ is an $m\times n$ matrix, and $x$ is an $n\times 1$ vector
Points to note:
- For $\log(u/c)$ ($u$ a vector, $c$ a scalar), expand it as $\log u - 1\log c$, where $1$ denotes the all-ones vector of the same shape as $u$
- $Softmax(a)=\frac{exp(a)}{1^Texp(a)}$, where $a$ is a vector and the result is a vector of the same shape
- Since $y$ is one-hot, $1^Ty=\sum_{i=1}^my_i=1$
$$
\begin{aligned}
f&=-y^T(\log exp(Wx)-1\log 1^Texp(Wx))\\
&=-y^TWx+\log 1^Texp(Wx)\quad\text{(using }1^Ty=1)
\end{aligned}
$$
$$
\begin{aligned}
df&=-y^TdW\,x+\frac{1^Td(exp(Wx))}{1^Texp(Wx)}\\
&=-y^TdW\,x+\frac{1^T(exp(Wx)\odot (dW\,x))}{1^Texp(Wx)}\\
&=tr(-y^TdW\,x)+\frac{tr(1^T(exp(Wx)\odot (dW\,x)))}{1^Texp(Wx)}\quad\text{(wrap each scalar in a trace)}\\
&=-tr((yx^T)^TdW)+\frac{tr((1\odot exp(Wx))^TdW\,x)}{1^Texp(Wx)}\quad\text{(cyclic trick; Hadamard exchange trick)}\\
&=-tr((yx^T)^TdW)+\frac{tr((exp(Wx)x^T)^TdW)}{1^Texp(Wx)}\quad\text{(cyclic trick again)}\\
&=tr\left(\left(\frac{exp(Wx)x^T}{1^Texp(Wx)}-yx^T\right)^TdW\right)\\
&=tr\left(((Softmax(Wx)-y)x^T)^TdW\right)\quad\text{(compare with the definition of }Softmax)
\end{aligned}
$$
Hence $\frac{\partial f}{\partial W}=(Softmax(Wx)-y)x^T$.
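Finally, a finite-difference check of $\frac{\partial f}{\partial W}=(Softmax(Wx)-y)x^T$; the shapes, the random data, and the `softmax` helper below are illustrative assumptions.

```python
import numpy as np

# Check ∂f/∂W = (softmax(Wx) - y) x^T for f = -y^T log softmax(Wx),
# with y a one-hot vector, by finite differences.
rng = np.random.default_rng(0)
m, n = 5, 3
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)
y = np.zeros(m); y[2] = 1.0                       # one-hot label

def softmax(a):
    e = np.exp(a - a.max())                       # shift for numerical stability
    return e / e.sum()

def f(W):
    return -y @ np.log(softmax(W @ x))

grad_analytic = np.outer(softmax(W @ x) - y, x)   # (softmax(Wx) - y) x^T

eps = 1e-6
grad_numeric = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        E = np.zeros_like(W); E[i, j] = eps
        grad_numeric[i, j] = (f(W + E) - f(W - E)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))  # ~1e-9
```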