Deriving and Implementing a BP Network from Scratch
Regression Model
Objective
Learn \(y = 2x\)
Model
A BP neural network with a single hidden layer containing a single node
Strategy
Mean Squared Error (MSE)
\[ MSE = \frac{1}{2}(\hat{y} - y)^2 \]
The model's objective is \(\min \frac{1}{2} (\hat{y} - y)^2\).
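To make the objective concrete, here is a minimal Python sketch (not part of the implementations below; the function names are illustrative) that evaluates this loss and its gradient with respect to the prediction:

def mse_loss(y_hat, y):
    # squared error with the 1/2 factor used throughout the derivation
    return 0.5 * (y_hat - y) ** 2

def mse_grad(y_hat, y):
    # d(MSE)/d(y_hat) = (y_hat - y); this factor reappears in every update rule below
    return y_hat - y

# example: mse_loss(5.0, 6.0) == 0.5 and mse_grad(5.0, 6.0) == -1.0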
Algorithm
Plain gradient descent: within each epoch, the model iterates over all training samples and updates its parameters to minimize the error.
Network Structure
Forward Propagation Derivation
\[ E = \frac{1}{2}(\hat{Y}-Y)^2 \\ \hat{Y} = \beta \\ \beta = W b \\ b = sigmoid(\alpha) \\ \alpha = V x \]
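The forward pass can be transcribed into code almost symbol for symbol. The sketch below is only illustrative (the variable names v, w, alpha, b mirror the symbols above, not the C++ class further down):

from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def forward(x, v, w):
    alpha = v * x          # alpha = V x
    b = sigmoid(alpha)     # b = sigmoid(alpha)
    beta = w * b           # beta = W b
    return beta            # Y_hat = beta

# with the initial weights v = w = 1 and x = 3, forward(3, 1, 1) is roughly 0.95,
# consistent with the initial prediction reported in the C++ results below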
Back Propagation Derivation
The learnable parameters of the model are \(w, v\); their updates follow the same rule as in the perceptron model:
Update rule for parameter w
\[ w \leftarrow w + \Delta w \\ \Delta w = - \eta \frac{\partial E}{\partial w} \\ \frac{\partial E}{\partial w} = \frac{\partial E}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \beta} \frac{\partial \beta}{\partial w} \\ = (\hat{Y} - Y) \cdot 1 \cdot b \]
Update rule for parameter v
\[ v \leftarrow v + \Delta v \\ \Delta v = -\eta \frac{\partial E}{\partial v} \\ \frac{\partial E}{\partial v} = \frac{\partial E}{\partial \hat{Y}} \frac{\partial \hat{Y}}{\partial \beta} \frac{\partial \beta}{\partial b} \frac{\partial b}{\partial \alpha} \frac{\partial \alpha}{\partial v} \\ = (\hat{Y} - Y) \cdot 1 \cdot w \cdot \frac{\partial b}{\partial \alpha} \cdot x \\ \frac{\partial b}{\partial \alpha} = sigmoid(\alpha) [ 1 - sigmoid(\alpha) ] \\ sigmoid(\alpha) = \frac{1}{1+e^{-\alpha}} \]
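These formulas can be sanity-checked numerically before writing the full program. The sketch below (illustrative only, with arbitrarily chosen values for x, y, v, w) compares the analytic gradients against central finite differences:

from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def loss(x, y, v, w):
    # E = 1/2 * (y_hat - y)^2 with y_hat = w * sigmoid(v * x)
    y_hat = w * sigmoid(v * x)
    return 0.5 * (y_hat - y) ** 2

def analytic_grads(x, y, v, w):
    b = sigmoid(v * x)
    y_hat = w * b
    dE_dw = (y_hat - y) * b                      # dE/dw
    dE_dv = (y_hat - y) * w * b * (1 - b) * x    # dE/dv
    return dE_dw, dE_dv

x, y, v, w, eps = 3.0, 6.0, 0.7, 1.3, 1e-5
dw_num = (loss(x, y, v, w + eps) - loss(x, y, v, w - eps)) / (2 * eps)
dv_num = (loss(x, y, v + eps, w) - loss(x, y, v - eps, w)) / (2 * eps)
print(analytic_grads(x, y, v, w))   # should closely match the numerical pair below
print((dw_num, dv_num))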
Code Implementation
C++ Implementation
#include <iostream>
#include <cmath>
using namespace std;

class Network {
public:
    Network(float eta) : eta(eta) {}

    float predict(int x) {
        // forward propagation: alpha = v*x, b = sigmoid(alpha), prediction = w*b
        this->alpha = this->v * x;
        this->b = this->sigmoid(alpha);
        this->beta = this->w * this->b;
        float prediction = this->beta;
        return prediction;
    }

    void step(int x, float prediction, float label) {
        // gradient descent on w: dE/dw = (prediction - label) * b
        this->w = this->w - this->eta * (prediction - label) * this->b;
        // gradient descent on v: dE/dv = (prediction - label) * w * sigmoid'(alpha) * x
        // (note that the freshly updated w is reused here)
        this->alpha = this->v * x;
        this->v = this->v - this->eta * (prediction - label) * this->w
                  * this->sigmoid(this->alpha) * (1 - this->sigmoid(this->alpha)) * x;
    }

private:
    float sigmoid(float x) { return (float)1 / (1 + exp(-x)); }
    float v = 1, w = 1, alpha = 1, beta = 1, b = 1, prediction, eta;
};

int main() {
    // Going to learn the linear relationship y = 2*x
    float loss, pred;
    Network model(0.01);
    cout << "x is " << 3 << " prediction is " << model.predict(3) << " label is " << 2 * 3 << endl;
    for (int epoch = 0; epoch < 500; epoch++) {
        loss = 0;
        for (int i = 0; i < 10; i++) {
            pred = model.predict(i);
            loss += pow((pred - 2 * i), 2) / 2;
            model.step(i, pred, 2 * i);
        }
        loss /= 10;
        cout << "Epoch: " << epoch << " Loss:" << loss << endl;
    }
    cout << "x is " << 3 << " prediction is " << model.predict(3) << " label is " << 2 * 3 << endl;
    return 0;
}
C++ Results
With the initial network weights, the prediction for the sample x = 3, y = 6 is \(\hat{y} = 0.952534\).
After training for 500 epochs, the average loss drops to 7.82519, and the prediction for x = 3, y = 6 is \(\hat{y} = 11.242\).
PyTorch Implementation
# encoding:utf8
# A minimal neural network: single hidden layer, single node, single input, single output
import torch as t
import torch.nn as nn
import torch.optim as optim


class Model(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(Model, self).__init__()
        self.hidden_layer = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        out = self.hidden_layer(x)
        out = t.sigmoid(out)
        return out


if __name__ == '__main__':
    X, Y = [[i] for i in range(10)], [2*i for i in range(10)]
    X, Y = t.Tensor(X), t.Tensor(Y)

    model = Model(1, 1)
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criticism = nn.MSELoss(reduction='mean')

    # prediction before training
    y_pred = model.forward(t.Tensor([[3]]))
    print(y_pred.data)

    for i in range(500):
        optimizer.zero_grad()
        y_pred = model.forward(X)
        loss = criticism(y_pred, Y)
        loss.backward()
        optimizer.step()
        print(loss.data)

    # prediction after training
    y_pred = model.forward(t.Tensor([[3]]))
    print(y_pred.data)
PyTorch Results
With the initial network weights, the prediction for the sample x = 3, y = 6 is \(\hat{y} = 0.5164\).
After training for 500 epochs, the average loss drops to 98.8590, and the prediction for x = 3, y = 6 is \(\hat{y} = 0.8651\).
Conclusion
Surprisingly, the hand-written implementation learns better than the PyTorch one! My first guess was that the gap comes from the choice of learning algorithm, since PyTorch uses SGD. A more direct cause, however, is that the PyTorch model passes its output through a sigmoid, so its predictions are confined to (0, 1) and can never reach labels as large as 6, whereas the hand-written network keeps a linear output weight w.
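If the bounded output is indeed the cause, a variant whose output layer stays linear should behave much more like the hand-written network; with targets ranging up to 18, a sigmoid-capped output cannot push the mean squared error over the ten training points below roughly 97, which is consistent with the final loss reported above. The following is only a hypothetical sketch of such a variant, not something that was run as part of this experiment:

import torch.nn as nn

class LinearOutputModel(nn.Module):
    # same single Linear(1, 1) layer as before, but without the sigmoid on the
    # output, so predictions are no longer confined to (0, 1)
    def __init__(self):
        super().__init__()
        self.hidden_layer = nn.Linear(1, 1)

    def forward(self, x):
        return self.hidden_layer(x)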
Classification Model
Objective
The true target function is unknown: the dataset for this experiment was built by taking the samples of the first two classes of the iris dataset and reducing their four-dimensional features to two dimensions.
A sample of the data:
-1.653443  0.198723  1 0   # the first two columns are features; the trailing "1 0" marks class 1
 1.373162 -0.194633  0 1   # the trailing "0 1" marks class 2
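The text does not say which dimensionality-reduction method was used. Assuming it was PCA, a file in this format could be produced roughly as follows (the sklearn APIs are real; the output filename and whether the features were standardized first are assumptions, so the exact values shown above will not be reproduced):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
mask = iris.target < 2                                    # keep only the first two iris classes
X = PCA(n_components=2).fit_transform(iris.data[mask])    # 4-D features -> 2-D
Y = np.eye(2)[iris.target[mask]]                          # one-hot labels: "1 0" or "0 1"
np.savetxt('iris.txt', np.hstack([X, Y]), fmt='%f %f %d %d')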
Model
A BP classification network with two input nodes and a single hidden layer
Strategy
During optimization, minimize the cross-entropy over the whole training set:
\[ \mathop{\arg\min}_{\theta} H(Y, \hat{Y}) \]
where the cross-entropy is:
\[ \begin{align} H(y, \hat y) & = -\sum_{i=1}^{2} y_i \log \hat{y}_i \\ & = - (y_1 \log \hat{y}_1 + y_2 \log \hat{y}_2) \end{align} \]
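For a one-hot label only one term of the sum survives; for example, with \(y = (1, 0)\) and \(\hat{y} = (0.9, 0.1)\) the cross-entropy is \(-\log 0.9 \approx 0.105\). A minimal helper (illustrative, not taken from the implementation below):

from math import log

def cross_entropy(y, y_hat):
    # H(y, y_hat) = -(y1 * log(y_hat1) + y2 * log(y_hat2))
    return -sum(yi * log(pi) for yi, pi in zip(y, y_hat))

print(cross_entropy([1, 0], [0.9, 0.1]))   # about 0.105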
Algorithm
Gradient descent: within each epoch, the model iterates over all training samples and updates its parameters to minimize the error.
Network Structure
As shown in the figure.
Forward Propagation
The formulas are derived as follows:
\[ a_1 = w_{11}x_1 + w_{21}x_2 \\ a_2 = w_{12}x_1 + w_{22}x_2 \\ b_1 = sigmoid(a_1) \\ b_2 = sigmoid(a_2) \\ \hat{y_1} = \frac{\exp(b_1)}{\exp(b_1) + \exp(b_2)} \\ \hat{y_2} = \frac{\exp(b_2)}{\exp(b_1) + \exp(b_2)} \\ \]
\[ \begin{align} E^{(k)} & = H(y^{(k)}, \hat{y}^{(k)}) \\ & =- (y_1 \log\hat{y}_1 + y_2 \log\hat{y}_2) \end{align} \]
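Transcribed into code, the forward pass and the per-sample error look like the sketch below (illustrative only; note that the Python implementation further down normalizes sigmoid(b) rather than exp(b) in the output layer):

from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def forward(x, w):
    # w = [[w11, w12], [w21, w22]], x = (x1, x2)
    a1 = w[0][0] * x[0] + w[1][0] * x[1]
    a2 = w[0][1] * x[0] + w[1][1] * x[1]
    b1, b2 = sigmoid(a1), sigmoid(a2)
    z = exp(b1) + exp(b2)
    return exp(b1) / z, exp(b2) / z          # y_hat1, y_hat2

def error(y, y_hat):
    # E = -(y1 * log(y_hat1) + y2 * log(y_hat2))
    return -(y[0] * log(y_hat[0]) + y[1] * log(y_hat[1]))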
Back Propagation
\[ \frac{\partial E}{\partial w_{11}} = ( \frac{\partial E}{\partial\hat{y}_1} \frac{\partial\hat{y}_1}{\partial b_1} + \frac{\partial E}{\partial\hat{y}_2} \frac{\partial\hat{y}_2}{\partial b_1}) \frac{\partial b_1}{\partial a_1} \frac{\partial a_1}{\partial w_{11}} \]
where
\[ \frac{\partial E}{\partial\hat{y}_1} = \frac{-y_1}{\hat{y}_1} \\ \frac{\partial E}{\partial\hat{y}_2} = \frac{-y_2}{\hat{y}_2} \\ \frac{\partial \hat{y}_1}{\partial b_1} = \hat{y}_1 (1- \hat{y}_1) \\ \frac{\partial \hat{y}_2}{\partial b_1} = - \hat{y}_1 \hat{y}_2 \\ \frac{\partial b_1}{\partial a_1} = sigmoid(a_1) [1 - sigmoid(a_1)] \\ \frac{\partial a_1}{\partial w_{11}} = x_1 \]
Therefore,
\[ \frac{\partial E}{\partial w_{11}} = (\hat{y}_1 - y_1) sigmoid(a_1) [ 1 - sigmoid(a_1)] x_1 \]
Similarly,
\[ \frac{\partial E}{\partial w_{21}} = (\hat{y}_1 - y_1) sigmoid(a_1) [ 1 - sigmoid(a_1)] x_2 \\ \frac{\partial E}{\partial w_{12}} = (\hat{y}_2 - y_2) sigmoid(a_2) [ 1 - sigmoid(a_2)] x_1 \\ \frac{\partial E}{\partial w_{22}} = (\hat{y}_2 - y_2) sigmoid(a_2) [ 1 - sigmoid(a_2)] x_2 \]
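As in the regression case, the result can be checked with finite differences before writing the training loop. A self-contained, illustrative sketch (the sample values are made up, loosely following the data format shown earlier):

from math import exp, log

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def forward_error(x, y, w):
    # forward pass followed by the cross-entropy error of a single sample
    a1 = w[0][0] * x[0] + w[1][0] * x[1]
    a2 = w[0][1] * x[0] + w[1][1] * x[1]
    b1, b2 = sigmoid(a1), sigmoid(a2)
    z = exp(b1) + exp(b2)
    y_hat1, y_hat2 = exp(b1) / z, exp(b2) / z
    e = -(y[0] * log(y_hat1) + y[1] * log(y_hat2))
    return e, y_hat1, a1

x, y = [-1.65, 0.20], [1, 0]
w = [[0.5, 0.5], [0.5, 0.5]]

# analytic gradient: dE/dw11 = (y_hat1 - y1) * sigmoid(a1) * (1 - sigmoid(a1)) * x1
_, y_hat1, a1 = forward_error(x, y, w)
analytic = (y_hat1 - y[0]) * sigmoid(a1) * (1 - sigmoid(a1)) * x[0]

# numerical gradient for w11 by central differences
eps = 1e-6
w[0][0] += eps
e_plus, _, _ = forward_error(x, y, w)
w[0][0] -= 2 * eps
e_minus, _, _ = forward_error(x, y, w)
print(analytic, (e_plus - e_minus) / (2 * eps))   # the two numbers should agree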
Code Implementation
Python 3 Implementation
# encoding:utf8
from math import exp, log

import numpy as np


def load_data(fname):
    X, Y = list(), list()
    with open(fname, encoding='utf8') as f:
        for line in f:
            line = line.strip().split()
            X.append(line[:2])
            Y.append(line[2:])
    return X, Y


class Network:
    eta = 0.5
    w = [[0.5, 0.5], [0.5, 0.5]]
    b = [0.5, 0.5]
    a = [0.5, 0.5]
    pred = [0.5, 0.5]

    def __sigmoid(self, x):
        return 1 / (1 + exp(-x))

    def forward(self, x):
        self.a[0] = self.w[0][0] * x[0] + self.w[1][0] * x[1]
        self.a[1] = self.w[0][1] * x[0] + self.w[1][1] * x[1]
        self.b[0] = self.__sigmoid(self.a[0])
        self.b[1] = self.__sigmoid(self.a[1])
        # note: the output layer normalizes sigmoid(b) rather than exp(b) as in the derivation
        self.pred[0] = self.__sigmoid(self.b[0]) / (self.__sigmoid(self.b[0]) + self.__sigmoid(self.b[1]))
        self.pred[1] = self.__sigmoid(self.b[1]) / (self.__sigmoid(self.b[0]) + self.__sigmoid(self.b[1]))
        return self.pred

    def step(self, x, label):
        # gradient descent updates for w11, w21, w12, w22
        g = (self.pred[0] - label[0]) * self.__sigmoid(self.a[0]) * (1 - self.__sigmoid(self.a[0])) * x[0]
        self.w[0][0] = self.w[0][0] - self.eta * g
        g = (self.pred[0] - label[0]) * self.__sigmoid(self.a[0]) * (1 - self.__sigmoid(self.a[0])) * x[1]
        self.w[1][0] = self.w[1][0] - self.eta * g
        g = (self.pred[1] - label[1]) * self.__sigmoid(self.a[1]) * (1 - self.__sigmoid(self.a[1])) * x[0]
        self.w[0][1] = self.w[0][1] - self.eta * g
        g = (self.pred[1] - label[1]) * self.__sigmoid(self.a[1]) * (1 - self.__sigmoid(self.a[1])) * x[1]
        self.w[1][1] = self.w[1][1] - self.eta * g


if __name__ == '__main__':
    X, Y = load_data('iris.txt')
    X, Y = np.array(X).astype(float), np.array(Y).astype(float)

    model = Network()
    pred = model.forward(X[0])
    print("Label: %d %d, Pred: %f %f" % (Y[0][0], Y[0][1], pred[0], pred[1]))

    epoch = 100
    loss = 0
    for i in range(epoch):
        loss = 0
        for j in range(len(X)):
            pred = model.forward(X[j])
            loss = loss - Y[j][0] * log(pred[0]) - Y[j][1] * log(pred[1])
            model.step(X[j], Y[j])
        print("Loss: %f" % (loss))

    pred = model.forward(X[0])
    print("Label: %d %d, Pred: %f %f" % (Y[0][0], Y[0][1], pred[0], pred[1]))
Before training, the network's prediction is:
Label: 1 0, Pred: 0.500000 0.500000
Loss: 55.430875
With a learning rate of 0.5, after 100 training epochs:
Label: 1 0, Pred: 0.593839 0.406161
Loss: 52.136626
Conclusion
After training, the loss has decreased and the model's predictions have moved closer to the labels, so this experiment is a success.
However, the model has very few parameters, so its learning capacity is limited.