Deriving the Matrix Form of Least Squares
Using the necessary rules of matrix differentiation, we derive the matrix form of the least-squares solution:
\[
\begin{aligned}
{\color{red}{\boldsymbol{\beta}}}&\color{red}{=(\boldsymbol{X}^{\top} \boldsymbol{X})^{-1} \boldsymbol{X}^{\top} \boldsymbol{y}}
\end{aligned}
\]
The matrix form of multiple linear regression
For ease of understanding, take 5 data points as an example. The linear regression model is:
\[
y_1 = \beta_0 + \beta_1x_{11} + \beta_2x_{12} + \epsilon_1\\\
y_2 = \beta_0 + \beta_1x_{21} + \beta_2x_{22} + \epsilon_2\\\
y_3 = \beta_0 + \beta_1x_{31} + \beta_2x_{32} + \epsilon_3\\\
y_4 = \beta_0 + \beta_1x_{41} + \beta_2x_{42} + \epsilon_4\\\
y_5 = \beta_0 + \beta_1x_{51} + \beta_2x_{52} + \epsilon_5
\]
Written in matrix form:
\[
\boldsymbol{y} = \begin{bmatrix} y_1 \\\ y_2 \\\ y_3 \\\ y_4 \\\ y_5 \end{bmatrix},\quad \boldsymbol{X} = \begin{bmatrix} 1 & x_{11} & x_{12} \\\ 1 & x_{21} & x_{22} \\\ 1 & x_{31} & x_{32} \\\ 1 & x_{41} & x_{42} \\\ 1 & x_{51} & x_{52} \end{bmatrix},\quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\\ \beta_1 \\\ \beta_2 \end{bmatrix},\quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\\ \epsilon_2 \\\ \epsilon_3 \\\ \epsilon_4 \\\ \epsilon_5 \end{bmatrix}
\]
\[
\boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}
\]
Note that the first column of \(\boldsymbol{X}\) is all ones; this is the bias column we add. Correspondingly, the first entry \(\beta_0\) of \(\boldsymbol{\beta}\) is the intercept term of the regression model.
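As a concrete illustration, here is a minimal NumPy sketch (the data and variable names are invented for this example) that assembles the design matrix with a leading column of ones and generates \(\boldsymbol{y}\) from the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 observations with 2 features, matching the example above
X_raw = rng.normal(size=(5, 2))           # columns x_{i1}, x_{i2}
X = np.column_stack([np.ones(5), X_raw])  # prepend the bias column of ones

beta_true = np.array([1.0, 2.0, -3.0])    # beta_0 (intercept), beta_1, beta_2
epsilon = 0.1 * rng.normal(size=5)        # noise term
y = X @ beta_true + epsilon               # y = X beta + epsilon

print(X.shape, y.shape)                   # (5, 3) (5,)
```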
The idea of least squares is to find the \(\boldsymbol{\beta}\) that minimizes the sum of squared errors \(\boldsymbol{\epsilon}^{\top} \boldsymbol{\epsilon}\):
\[
\min_{\boldsymbol{\beta}} \quad \boldsymbol{\epsilon}^{\top} \boldsymbol{\epsilon}
\]
\[
\begin{aligned}
\boldsymbol{\epsilon}^{\top} \boldsymbol{\epsilon} &=(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta})^{\top}(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta}) \\\
&=(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta})^{\top} \boldsymbol{y}-(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta})^{\top} \boldsymbol{X} \boldsymbol{\beta} \\\
&=\boldsymbol{y}^{\top} \boldsymbol{y}-(\boldsymbol{X} \boldsymbol{\beta})^{\top} \boldsymbol{y}-\boldsymbol{y}^{\top} \boldsymbol{X} \boldsymbol{\beta}+(\boldsymbol{X} \boldsymbol{\beta})^{\top} \boldsymbol{X} \boldsymbol{\beta} \\\
&=\boldsymbol{y}^{\top} \boldsymbol{y}-\boldsymbol{\beta}^{\top} \boldsymbol{X} ^{\top} \boldsymbol{y}-\boldsymbol{y}^{\top} \boldsymbol{X} \boldsymbol{\beta}+\boldsymbol{\beta}^{\top} \boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}
\end{aligned}
\]
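The expansion in the last line can be sanity-checked numerically; below is a sketch with randomly generated \(\boldsymbol{X}\), \(\boldsymbol{y}\), and a trial \(\boldsymbol{\beta}\) (all invented for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
y = rng.normal(size=5)
beta = rng.normal(size=3)

eps = y - X @ beta
lhs = eps @ eps  # epsilon^T epsilon
# y^T y - beta^T X^T y - y^T X beta + beta^T X^T X beta
rhs = y @ y - beta @ X.T @ y - y @ X @ beta + beta @ X.T @ X @ beta
print(np.isclose(lhs, rhs))  # True
```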
Applying the results derived in the next section, differentiate each term in the last line above with respect to \(\boldsymbol{\beta}\):
- \(\frac{\partial( \boldsymbol{y}^{\top}\boldsymbol{y})}{\partial{\boldsymbol{\beta}}} = 0\)
- \(\frac{\partial( \boldsymbol{\beta}^{\top} \boldsymbol{X} ^{\top} \boldsymbol{y})}{\partial{\boldsymbol{\beta}}} = \boldsymbol{X}^{\top}\boldsymbol{y}\)
- \(\frac{\partial( \boldsymbol{y}^{\top} \boldsymbol{X} \boldsymbol{\beta})}{\partial{\boldsymbol{\beta}}} =(\boldsymbol{y}^{\top}\boldsymbol{X})^{\top}= \boldsymbol{X}^{\top}\boldsymbol{y}\)
- \(\frac{\partial( \boldsymbol{\beta}^{\top} \boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta})}{\partial{\boldsymbol{\beta}}} =\boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}+(\boldsymbol{X} ^{\top} \boldsymbol{X})^{\top} \boldsymbol{\beta}= 2\boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}\)
Therefore:
\[
\begin{aligned}
\frac{\partial( \boldsymbol{\epsilon}^{\top} \boldsymbol{\epsilon})} {\partial{\boldsymbol{\beta}}} &=0-\boldsymbol{X}^{\top} \boldsymbol{y}-\boldsymbol{X}^{\top} \boldsymbol{y}+2\boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}\\\
&=2\boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}-2\boldsymbol{X}^{\top}\boldsymbol{y}
\end{aligned}
\]
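This gradient can be verified against central finite differences; the sketch below (the step size and the random data are arbitrary choices of mine) compares the analytic expression with a numerical estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
y = rng.normal(size=5)
beta = rng.normal(size=3)

def sse(b):
    """epsilon^T epsilon as a function of beta."""
    e = y - X @ b
    return e @ e

# Analytic gradient: 2 X^T X beta - 2 X^T y
grad_analytic = 2 * X.T @ X @ beta - 2 * X.T @ y

# Central finite differences
h = 1e-6
grad_numeric = np.array([(sse(beta + h * d) - sse(beta - h * d)) / (2 * h)
                         for d in np.eye(3)])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```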
Setting this equal to \(0\) and assuming \(\boldsymbol{X}^{\top}\boldsymbol{X}\) is invertible (which holds when \(\boldsymbol{X}\) has full column rank), we obtain:
\[
\begin{aligned}
2 \boldsymbol{X}^{\top} \boldsymbol{X} \boldsymbol{\beta}-2\boldsymbol{X}^{\top} \boldsymbol{y}&=0\\\
\boldsymbol{X}^{\top} \boldsymbol{X} \boldsymbol{\beta}&=\boldsymbol{X}^{\top} \boldsymbol{y}\\\
{\color{red}{\boldsymbol{\beta}}}&\color{red}{=(\boldsymbol{X}^{\top} \boldsymbol{X})^{-1} \boldsymbol{X}^{\top} \boldsymbol{y}}
\end{aligned}
\]
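In code one would typically solve the normal equations rather than form the inverse explicitly, since explicit inversion is slower and less numerically stable. A minimal sketch (with synthetic data invented here) comparing the textbook formula, a direct linear solve, and NumPy's built-in least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=50)

# Textbook formula: beta = (X^T X)^{-1} X^T y
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable: solve X^T X beta = X^T y directly
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Reference: NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_inv, beta_solve))    # True
print(np.allclose(beta_solve, beta_lstsq))  # True
```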
Necessary matrix differentiation rules
A PDF version of this derivation is available here.
\[
\frac{\partial( \boldsymbol{x}^{\top} \boldsymbol{a})}{\partial{\boldsymbol{x}}} = \frac{\partial( \boldsymbol{a}^{\top}\boldsymbol{x})}{\partial{\boldsymbol{x}}} = \boldsymbol{a}
\]
Proof:
\[
\begin{aligned}
\frac{\partial\left(\boldsymbol{x}^{\top} \boldsymbol{a}\right)}{\partial \boldsymbol{x}} &=\frac{\partial\left(\boldsymbol{a}^{\top} \boldsymbol{x}\right)}{\partial \boldsymbol{x}} \\\
&=\frac{\partial\left(a_1 x_1+a_2 x_2+\cdots+a_n x_n\right)}{\partial \boldsymbol{x}} \\\
&=\left[\begin{array}{c}
\frac{\partial\left(a_1 x_1+a_2 x_2+\cdots+a_n x_n\right)}{\partial x_1} \\\
\frac{\partial\left(a_1 x_1+a_2 x_2+\cdots+a_n x_n\right)}{\partial x_2} \\\
\vdots \\\
\frac{\partial\left(a_1 x_1+a_2 x_2+\cdots+a_n x_n\right)}{\partial x_n}
\end{array}\right] \\\
&=\left[\begin{array}{c}
a_1 \\\
a_2 \\\
\vdots \\\
a_n
\end{array}\right] \\\
&=\boldsymbol{a}
\end{aligned}
\]
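A quick numerical spot check of this rule, with a random \(\boldsymbol{a}\) and \(\boldsymbol{x}\) chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
a = rng.normal(size=4)
x = rng.normal(size=4)

f = lambda v: a @ v  # f(x) = x^T a = a^T x

# Central finite differences along each coordinate axis
h = 1e-6
grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])
print(np.allclose(grad, a))  # True: the gradient is a
```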
\[
\frac{\partial( \boldsymbol{x}^{\top} \boldsymbol{x})}{\partial{\boldsymbol{x}}} = 2\boldsymbol{x}
\]
Proof:
\[
\begin{aligned}
\frac{\partial\left(\boldsymbol{x}^{\top} \boldsymbol{x}\right)}{\partial \boldsymbol{x}}& =\frac{\partial\left(x_1^2 +x_2^2 +\cdots+ x_n^2\right)}{\partial \boldsymbol{x}} \\\
& =\left[\begin{array}{c}
\frac{\partial\left(x_1^2 +x_2^2 +\cdots+ x_n^2\right)}{\partial x_1} \\\
\frac{\partial\left(x_1^2 +x_2^2 +\cdots+ x_n^2\right)}{\partial x_2} \\\
\vdots \\\
\frac{\partial\left(x_1^2 +x_2^2 +\cdots+ x_n^2\right)}{\partial x_n}
\end{array}\right] \\\
& =\left[\begin{array}{c}
2 x_1 \\\
2 x_2 \\\
\vdots \\\
2 x_n
\end{array}\right] \\\
& =2\left[\begin{array}{c}
x_1 \\\
x_2 \\\
\vdots \\\
x_n
\end{array}\right] \\\
& =2 \boldsymbol{x} \\\
\end{aligned}
\]
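The same finite-difference check applied to this rule (the data are again arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=4)

f = lambda v: v @ v  # f(x) = x^T x

# Central finite differences along each coordinate axis
h = 1e-6
grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])
print(np.allclose(grad, 2 * x))  # True: the gradient is 2x
```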
\[
\frac{\partial( \boldsymbol{x}^{\top} \boldsymbol{A}\boldsymbol{x})}{\partial{\boldsymbol{x}}} = \boldsymbol{A}\boldsymbol{x}+\boldsymbol{A}^{\top} \boldsymbol{x}
\]
Proof:
\[
\begin{aligned}
\frac{\partial\left(\boldsymbol{x}^{\top} \boldsymbol{A} \boldsymbol{x}\right)}{\partial \boldsymbol{x}}=&\frac{\partial\left(a_{11} x_1 x_1+a_{12} x_1 x_2+\cdots+a_{1 n} x_1 x_n+a_{21} x_2 x_1+a_{22} x_2 x_2+\cdots+a_{2 n} x_2 x_n+\cdots+a_{n 1} x_n x_1+a_{n 2} x_n x_2+\cdots+a_{n n} x_n x_n\right)}{\partial \boldsymbol{x}}\\\
=&\left[\begin{array}{c}
\left(a_{11} x_1+a_{12} x_2+\cdots+a_{1 n} x_n\right)+\left(a_{11} x_1+a_{21} x_2+\cdots+a_{n 1} x_n\right) \\\
\left(a_{21} x_1+a_{22} x_2+\cdots+a_{2 n} x_n\right)+\left(a_{12} x_1+a_{22} x_2+\cdots+a_{n 2} x_n\right) \\\
\vdots \\\
\left(a_{n 1} x_1+a_{n 2} x_2+\cdots+a_{n n} x_n\right)+\left(a_{1 n} x_1+a_{2 n} x_2+\cdots+a_{n n} x_n\right)
\end{array}\right] \\\
=&\left[\begin{array}{c}
a_{11} x_1+a_{12} x_2+\cdots+a_{1 n} x_n \\\
a_{21} x_1+a_{22} x_2+\cdots+a_{2 n} x_n \\\
\vdots \\\
a_{n 1} x_1+a_{n 2} x_2+\cdots+a_{n n} x_n
\end{array}\right]+\left[\begin{array}{c}
a_{11} x_1+a_{21} x_2+\cdots+a_{n 1} x_n \\\
a_{12} x_1+a_{22} x_2+\cdots+a_{n 2} x_n \\\
\vdots \\\
a_{1 n} x_1+a_{2 n} x_2+\cdots+a_{n n} x_n
\end{array}\right]\\\
=&\left[\begin{array}{cccc}
a_{11} & a_{12} & \cdots & a_{1 n} \\\
a_{21} & a_{22} & \cdots & a_{2 n} \\\
\vdots & \vdots & \ddots & \vdots \\\
a_{n 1} & a_{n 2} & \cdots & a_{n n}
\end{array}\right]\left[\begin{array}{c}
x_1 \\\
x_2 \\\
\vdots \\\
x_n
\end{array}\right]+\left[\begin{array}{cccc}
a_{11} & a_{21} & \cdots & a_{n 1} \\\
a_{12} & a_{22} & \cdots & a_{n 2} \\\
\vdots & \vdots & \ddots & \vdots \\\
a_{1 n} & a_{2 n} & \cdots & a_{n n}
\end{array}\right]\left[\begin{array}{c}
x_1 \\\
x_2 \\\
\vdots \\\
x_n
\end{array}\right] \\\
=&\boldsymbol{A} \boldsymbol{x}+\boldsymbol{A}^{\top} \boldsymbol{x}
\end{aligned}
\]
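And a corresponding check for this rule, using a deliberately non-symmetric random \(\boldsymbol{A}\) so that the two terms \(\boldsymbol{A}\boldsymbol{x}\) and \(\boldsymbol{A}^{\top}\boldsymbol{x}\) actually differ:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(4, 4))  # non-symmetric, so A x != A^T x in general
x = rng.normal(size=4)

f = lambda v: v @ A @ v  # f(x) = x^T A x

# Central finite differences along each coordinate axis
h = 1e-6
grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])
print(np.allclose(grad, A @ x + A.T @ x))  # True: the gradient is A x + A^T x
```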