$$f(x) = \max(0, x)$$
Drawback: the vanishing-gradient problem is not fully solved, and for inputs in the negative region the neuron effectively dies and never recovers.
Advantages:
1. Partially solves the vanishing-gradient problem.
2. Converges faster.
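As a quick illustration of the formula (a minimal NumPy sketch added here; the function names are mine, not from the original notes), the forward pass and its gradient look like this. The gradient is exactly zero for negative inputs, which is where the dead-neuron problem comes from:

```python
import numpy as np

def relu(x):
    """ReLU forward pass: keep positive inputs, zero out the rest."""
    return np.maximum(0, x)

def relu_grad(x):
    """ReLU gradient: 1 where x > 0, 0 elsewhere (the 'dead' region)."""
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.   0.   0.   0.5  2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```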
Leaky ReLU

$$f(x) = \mathbb{1}(x < 0)\,(\alpha x) + \mathbb{1}(x \geq 0)\,(x)$$
- Solves the dying-neuron problem.
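A minimal NumPy sketch of Leaky ReLU (the default slope alpha = 0.01 is an assumption of this example; the original notes only introduce the symbol α):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keep a small slope alpha on the negative side instead of zero."""
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient: 1 for x >= 0, alpha for x < 0, so units never go fully dead."""
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))       # [-0.02  -0.005  0.5    2.   ]
print(leaky_relu_grad(x))  # [0.01  0.01  1.    1.  ]
```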
Maxout

$$\max(w_1^T x + b_1,\ w_2^T x + b_2)$$

It has more parameters, since it essentially adds another layer on top of the output. It overcomes the drawbacks of ReLU and is fairly recommended in practice.
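A minimal NumPy sketch of a two-piece Maxout unit (the shapes and random weights below are illustrative assumptions, not from the original notes):

```python
import numpy as np

def maxout(x, W1, b1, W2, b2):
    """Two-piece Maxout: element-wise max of two affine transforms of x."""
    return np.maximum(W1 @ x + b1, W2 @ x + b2)

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                # input vector (4 features)
W1, W2 = rng.standard_normal((2, 3, 4))   # two weight matrices: 3 outputs x 4 inputs
b1, b2 = rng.standard_normal((2, 3))      # two bias vectors
print(maxout(x, W1, b1, W2, b2))          # 3 outputs, each the max of two linear pieces
```

Because each output is the max of two separate affine functions, a Maxout unit carries roughly twice the parameters of a plain linear unit, which is the parameter cost noted above.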
(4) Parameter update methods
| Method | Formula |
| --- | --- |
| Vanilla update | `x += - learning_rate * dx` |
| Momentum update | `v = mu * v - learning_rate * dx  # integrate velocity`<br>`x += v  # integrate position` |
| Nesterov Momentum | `x_ahead = x + mu * v`<br>`v = mu * v - learning_rate * dx_ahead`<br>`x += v` |
| Adagrad (adaptive: along directions with large gradients the learning rate keeps shrinking, so updates go from fast to slow) | `cache += dx**2`<br>`x += - learning_rate * dx / (np.sqrt(cache) + eps)` |
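The rows above are single update steps. A minimal runnable sketch applying each rule to a toy 1-D objective f(x) = 0.5 * x**2 (the objective and the hyperparameter values are illustrative assumptions, not from the original notes):

```python
import numpy as np

def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * x**2
    return x

learning_rate, mu, eps, steps = 0.1, 0.9, 1e-8, 100

# Vanilla update
x = 5.0
for _ in range(steps):
    x += -learning_rate * grad(x)

# Momentum update
x, v = 5.0, 0.0
for _ in range(steps):
    v = mu * v - learning_rate * grad(x)   # integrate velocity
    x += v                                 # integrate position

# Nesterov Momentum: take the gradient at the looked-ahead position
x, v = 5.0, 0.0
for _ in range(steps):
    x_ahead = x + mu * v
    v = mu * v - learning_rate * grad(x_ahead)
    x += v

# Adagrad: accumulated squared gradients shrink the effective learning rate
x, cache = 5.0, 0.0
for _ in range(steps):
    dx = grad(x)
    cache += dx**2
    x += -learning_rate * dx / (np.sqrt(cache) + eps)
```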