
Matrix-Form Derivation of Least Squares

Using the necessary matrix differentiation rules, we derive the matrix form of the least-squares solution.

\[ \begin{aligned} {\color{red}{\boldsymbol{\beta}}}&\color{red}{=(\boldsymbol{X}^{\top} \boldsymbol{X})^{-1} \boldsymbol{X}^{\top} \boldsymbol{y}} \end{aligned} \]

The Matrix Form of Multiple Linear Regression

For ease of understanding, take 5 data points as an example; the linear regression model is:

\[ y_1 = \beta_0 + \beta_1x_{11} + \beta_2x_{12} + \epsilon_1\\\ y_2 = \beta_0 + \beta_1x_{21} + \beta_2x_{22} + \epsilon_2\\\ y_3 = \beta_0 + \beta_1x_{31} + \beta_2x_{32} + \epsilon_3\\\ y_4 = \beta_0 + \beta_1x_{41} + \beta_2x_{42} + \epsilon_4\\\ y_5 = \beta_0 + \beta_1x_{51} + \beta_2x_{52} + \epsilon_5 \]

Written in matrix form:

\[ \boldsymbol{y} = \begin{bmatrix} y_1 \\\ y_2 \\\ y_3 \\\ y_4 \\\ y_5 \end{bmatrix},\quad \boldsymbol{X} = \begin{bmatrix} 1 & x_{11} & x_{12} \\\ 1 & x_{21} & x_{22} \\\ 1 & x_{31} & x_{32} \\\ 1 & x_{41} & x_{42} \\\ 1 & x_{51} & x_{52} \end{bmatrix},\quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\\ \beta_1 \\\ \beta_2 \end{bmatrix},\quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\\ \epsilon_2 \\\ \epsilon_3 \\\ \epsilon_4 \\\ \epsilon_5 \end{bmatrix}\\\ \]
\[ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

Note that the first column of \(\boldsymbol{X}\) is all \(1\)s; this is the bias column we add. Correspondingly, the first entry of \(\boldsymbol{\beta}\), namely \(\beta_0\), is the intercept term of the regression model.
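The construction of \(\boldsymbol{X}\) above can be sketched in NumPy. This is a minimal illustration with made-up data; the variable names are assumptions, not part of the original text.

```python
import numpy as np

# Hypothetical example data: 5 observations, 2 features, mirroring the 5-row model above.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(5, 2))   # columns hold x_{i1}, x_{i2}

# Prepend a column of ones so that beta_0 acts as the intercept term.
X = np.column_stack([np.ones(5), X_raw])   # shape (5, 3)
```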

The idea of least squares is to find the \(\boldsymbol{\beta}\) that minimizes the sum of squared errors \(\boldsymbol{\epsilon}^{\top} \boldsymbol{\epsilon}\).

\[ \min_{\boldsymbol{\beta}} \quad \boldsymbol{\epsilon}^{\top} \boldsymbol{\epsilon} \]
\[ \begin{aligned} \boldsymbol{\epsilon}^{\top} \boldsymbol{\epsilon} &=(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta})^{\top}(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta}) \\\ &=(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta})^{\top} \boldsymbol{y}-(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta})^{\top} \boldsymbol{X} \boldsymbol{\beta} \\\ &=\boldsymbol{y}^{\top} \boldsymbol{y}-(\boldsymbol{X} \boldsymbol{\beta})^{\top} \boldsymbol{y}-\boldsymbol{y}^{\top} \boldsymbol{X} \boldsymbol{\beta}+(\boldsymbol{X} \boldsymbol{\beta})^{\top} \boldsymbol{X} \boldsymbol{\beta} \\\ &=\boldsymbol{y}^{\top} \boldsymbol{y}-\boldsymbol{\beta}^{\top} \boldsymbol{X} ^{\top} \boldsymbol{y}-\boldsymbol{y}^{\top} \boldsymbol{X} \boldsymbol{\beta}+\boldsymbol{\beta}^{\top} \boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta} \end{aligned} \]

Applying the results derived in the next section, differentiate each term of the last line above with respect to \(\boldsymbol{\beta}\):

  1. \(\frac{\partial( \boldsymbol{y}^{\top}\boldsymbol{y})}{\partial{\boldsymbol{\beta}}} = 0\)

  2. \(\frac{\partial( \boldsymbol{\beta}^{\top} \boldsymbol{X} ^{\top} \boldsymbol{y})}{\partial{\boldsymbol{\beta}}} = \boldsymbol{X}^{\top}\boldsymbol{y}\)

  3. \(\frac{\partial( \boldsymbol{y}^{\top} \boldsymbol{X} \boldsymbol{\beta})}{\partial{\boldsymbol{\beta}}} =(\boldsymbol{y}^{\top}\boldsymbol{X})^{\top}= \boldsymbol{X}^{\top}\boldsymbol{y}\)

  4. \(\frac{\partial( \boldsymbol{\beta}^{\top} \boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta})}{\partial{\boldsymbol{\beta}}} =\boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}+(\boldsymbol{X} ^{\top} \boldsymbol{X})^{\top} \boldsymbol{\beta}= 2\boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}\)

Therefore:

\[ \begin{aligned} \frac{\partial( \boldsymbol{\epsilon}^{\top} \boldsymbol{\epsilon})} {\partial{\boldsymbol{\beta}}} &=0-\boldsymbol{X}^{\top} \boldsymbol{y}-\boldsymbol{X}^{\top} \boldsymbol{y}+2\boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}\\\ &=2\boldsymbol{X} ^{\top} \boldsymbol{X} \boldsymbol{\beta}-2\boldsymbol{X}^{\top}\boldsymbol{y} \end{aligned} \]
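The gradient expression \(2\boldsymbol{X}^{\top}\boldsymbol{X}\boldsymbol{\beta} - 2\boldsymbol{X}^{\top}\boldsymbol{y}\) can be sanity-checked numerically with central finite differences. This is an illustrative sketch with random data; `sse` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
beta = rng.normal(size=3)

def sse(b):
    """Sum of squared errors eps^T eps for coefficient vector b."""
    r = y - X @ b
    return r @ r

# Analytic gradient from the derivation above.
grad_analytic = 2 * X.T @ X @ beta - 2 * X.T @ y

# Central finite-difference approximation, one coordinate at a time.
eps = 1e-6
grad_fd = np.array([
    (sse(beta + eps * e) - sse(beta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
```

Since the objective is quadratic in \(\boldsymbol{\beta}\), the central difference matches the analytic gradient up to floating-point error.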

Setting it equal to \(0\) gives:

\[ \begin{aligned} 2 \boldsymbol{X}^{\top} \boldsymbol{X} \boldsymbol{\beta}-2\boldsymbol{X}^{\top} \boldsymbol{y}&=0\\\ \boldsymbol{X}^{\top} \boldsymbol{X} \boldsymbol{\beta}&=\boldsymbol{X}^{\top} \boldsymbol{y}\\\ {\color{red}{\boldsymbol{\beta}}}&\color{red}{=(\boldsymbol{X}^{\top} \boldsymbol{X})^{-1} \boldsymbol{X}^{\top} \boldsymbol{y}} \end{aligned} \]
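The closed-form solution can be verified against NumPy's built-in least-squares solver. A minimal sketch on synthetic data; the true coefficients and noise level are made up for illustration. Solving the normal equations with `np.linalg.solve` is used here in place of forming the explicit inverse, which is numerically preferable but algebraically the same formula.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Synthetic regression data with an intercept column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])   # hypothetical ground truth
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed-form solution of X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```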

The Necessary Matrix Differentiation Rules

A PDF version of the derivation is available here.

\[ \frac{\partial( \boldsymbol{x}^{\top} \boldsymbol{a})}{\partial{\boldsymbol{x}}} = \frac{\partial( \boldsymbol{a}^{\top}\boldsymbol{x})}{\partial{\boldsymbol{x}}} = \boldsymbol{a} \]

Proof:

\[ \begin{aligned} \frac{\partial\left(\boldsymbol{x}^{\top} \boldsymbol{a}\right)}{\partial \boldsymbol{x}} &=\frac{\partial\left(\boldsymbol{a}^{\top} \boldsymbol{x}\right)}{\partial \boldsymbol{x}} \\\ &=\frac{\partial\left(a_1 x_1+a_2 x_2+\cdots+a_n x_n\right)}{\partial \boldsymbol{x}} \\\ &=\left[\begin{array}{c} \frac{\partial\left(a_1 x_1+a_2 x_2+\cdots+a_n x_n\right)}{\partial x_1} \\\ \frac{\partial\left(a_1 x_1+a_2 x_2+\cdots+a_n x_n\right)}{\partial x_2} \\\ \vdots \\\ \frac{\partial\left(a_1 x_1+a_2 x_2+\cdots+a_n x_n\right)}{\partial x_n} \end{array}\right] \\\ &=\left[\begin{array}{c} a_1 \\\ a_2 \\\ \vdots \\\ a_n \end{array}\right] \\\ &=\boldsymbol{a} \end{aligned} \]
\[ \frac{\partial( \boldsymbol{x}^{\top} \boldsymbol{x})}{\partial{\boldsymbol{x}}} = 2\boldsymbol{x} \]

Proof:

\[ \begin{aligned} \frac{\partial\left(\boldsymbol{x}^{\top} \boldsymbol{x}\right)}{\partial \boldsymbol{x}}& =\frac{\partial\left(x_1^2 +x_2^2 +\cdots+ x_n^2\right)}{\partial \boldsymbol{x}} \\\ & =\left[\begin{array}{c} \frac{\partial\left(x_1^2 +x_2^2 +\cdots+ x_n^2\right)}{\partial x_1} \\\ \frac{\partial\left(x_1^2 +x_2^2 +\cdots+ x_n^2\right)}{\partial x_2} \\\ \vdots \\\ \frac{\partial\left(x_1^2 +x_2^2 +\cdots+ x_n^2\right)}{\partial x_n} \end{array}\right] \\\ & =\left[\begin{array}{c} 2 x_1 \\\ 2 x_2 \\\ \vdots \\\ 2 x_n \end{array}\right] \\\ & =2\left[\begin{array}{c} x_1 \\\ x_2 \\\ \vdots \\\ x_n \end{array}\right] \\\ & =2 \boldsymbol{x} \\\ \end{aligned} \]
\[ \frac{\partial( \boldsymbol{x}^{\top} \boldsymbol{A}\boldsymbol{x})}{\partial{\boldsymbol{x}}} = \boldsymbol{A}\boldsymbol{x}+\boldsymbol{A}^{\top} \boldsymbol{x} \]

Proof:

\[ \begin{aligned} \frac{\partial\left(\boldsymbol{x}^{\top} \boldsymbol{A} \boldsymbol{x}\right)}{\partial \boldsymbol{x}} =&\frac{\partial\left(a_{11} x_1 x_1+a_{12} x_1 x_2+\cdots+a_{1 n} x_1 x_n+a_{21} x_2 x_1+a_{22} x_2 x_2+\cdots+a_{2 n} x_2 x_n+\cdots+a_{n 1} x_n x_1+a_{n 2} x_n x_2+\cdots+a_{n n} x_n x_n\right)}{\partial \boldsymbol{x}}\\\ =&\left[\begin{array}{c} \left(a_{11} x_1+a_{12} x_2+\cdots+a_{1 n} x_n\right)+\left(a_{11} x_1+a_{21} x_2+\cdots+a_{n 1} x_n\right) \\\ \left(a_{21} x_1+a_{22} x_2+\cdots+a_{2 n} x_n\right)+\left(a_{12} x_1+a_{22} x_2+\cdots+a_{n 2} x_n\right) \\\ \vdots \\\ \left(a_{n 1} x_1+a_{n 2} x_2+\cdots+a_{n n} x_n\right)+\left(a_{1 n} x_1+a_{2 n} x_2+\cdots+a_{n n} x_n\right) \end{array}\right] \\\ =&\left[\begin{array}{c} a_{11} x_1+a_{12} x_2+\cdots+a_{1 n} x_n \\\ a_{21} x_1+a_{22} x_2+\cdots+a_{2 n} x_n \\\ \vdots \\\ a_{n 1} x_1+a_{n 2} x_2+\cdots+a_{n n} x_n \end{array}\right]+\left[\begin{array}{c} a_{11} x_1+a_{21} x_2+\cdots+a_{n 1} x_n \\\ a_{12} x_1+a_{22} x_2+\cdots+a_{n 2} x_n \\\ \vdots \\\ a_{1 n} x_1+a_{2 n} x_2+\cdots+a_{n n} x_n \end{array}\right]\\\ =&\left[\begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1 n} \\\ a_{21} & a_{22} & \cdots & a_{2 n} \\\ \vdots & \vdots & \ddots & \vdots \\\ a_{n 1} & a_{n 2} & \cdots & a_{n n} \end{array}\right]\left[\begin{array}{c} x_1 \\\ x_2 \\\ \vdots \\\ x_n \end{array}\right]+\left[\begin{array}{cccc} a_{11} & a_{21} & \cdots & a_{n 1} \\\ a_{12} & a_{22} & \cdots & a_{n 2} \\\ \vdots & \vdots & \ddots & \vdots \\\ a_{1 n} & a_{2 n} & \cdots & a_{n n} \end{array}\right]\left[\begin{array}{c} x_1 \\\ x_2 \\\ \vdots \\\ x_n \end{array}\right] \\\ =&\boldsymbol{A} \boldsymbol{x}+\boldsymbol{A}^{\top} \boldsymbol{x} \end{aligned} \]
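The rule \(\frac{\partial(\boldsymbol{x}^{\top}\boldsymbol{A}\boldsymbol{x})}{\partial\boldsymbol{x}} = \boldsymbol{A}\boldsymbol{x}+\boldsymbol{A}^{\top}\boldsymbol{x}\) can likewise be checked numerically. A minimal sketch with a random, deliberately non-symmetric \(\boldsymbol{A}\), so the two terms really differ:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))   # general (not necessarily symmetric) matrix
x = rng.normal(size=n)

def f(v):
    """Scalar quadratic form x^T A x."""
    return v @ A @ v

# Analytic gradient from the rule above.
grad_analytic = A @ x + A.T @ x

# Central finite-difference approximation.
eps = 1e-6
grad_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
```

Note that when \(\boldsymbol{A}\) is symmetric, the rule collapses to \(2\boldsymbol{A}\boldsymbol{x}\), which is exactly the case \(\boldsymbol{A} = \boldsymbol{X}^{\top}\boldsymbol{X}\) used in the derivation.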
