11.9. Adadelta
Adadelta is yet another variant of AdaGrad (Section 11.7). The main difference lies in the fact that it decreases the amount by which the learning rate is adaptive to coordinates. Moreover, Adadelta is traditionally referred to as having no learning rate, since it uses the amount of change itself as calibration for future change. The algorithm was proposed in (Zeiler, 2012).
11.9.1. The Algorithm
In a nutshell, Adadelta uses two state variables: $\mathbf{s}_t$ to store a leaky average of the second moment of the gradient, and $\Delta\mathbf{x}_t$ to store a leaky average of the second moment of the change of the parameters in the model itself.

Here are the technical details of Adadelta. Given the parameter du jour is $\rho$, we obtain the following leaky updates, similar to Section 11.8:

$$\mathbf{s}_t = \rho \mathbf{s}_{t-1} + (1 - \rho) \mathbf{g}_t^2.$$

The difference to Section 11.8 is that we perform updates with the rescaled gradient $\mathbf{g}_t'$, i.e.,

$$\mathbf{x}_t = \mathbf{x}_{t-1} - \mathbf{g}_t'.$$

So what is the rescaled gradient $\mathbf{g}_t'$? We can calculate it as follows:

$$\mathbf{g}_t' = \frac{\sqrt{\Delta\mathbf{x}_{t-1} + \epsilon}}{\sqrt{\mathbf{s}_t + \epsilon}} \odot \mathbf{g}_t,$$

where $\Delta\mathbf{x}_{t-1}$ is the leaky average of the squared rescaled gradients $\mathbf{g}_t'$. We initialize $\Delta\mathbf{x}_{0}$ to be $0$ and update it at each step with $\mathbf{g}_t'$, i.e.,

$$\Delta\mathbf{x}_t = \rho \Delta\mathbf{x}_{t-1} + (1 - \rho) {\mathbf{g}_t'}^2,$$

and $\epsilon$ (a small value such as $10^{-5}$) is added to maintain numerical stability.
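As a minimal numerical sketch of these recursions (plain NumPy, with the toy objective $f(x) = \tfrac{1}{2}x^2$ chosen purely for illustration and not taken from the original text), the snippet below performs a few Adadelta steps on a single scalar parameter. Note how the very first steps are tiny because $\Delta x_0 = 0$.

import numpy as np

rho, eps = 0.9, 1e-5
x, s, delta = 2.0, 0.0, 0.0  # parameter, leaky E[g^2], leaky E[(g')^2]
for t in range(5):
    g = x  # gradient of the toy objective f(x) = 0.5 * x**2
    s = rho * s + (1 - rho) * g ** 2
    g_prime = np.sqrt(delta + eps) / np.sqrt(s + eps) * g  # rescaled gradient
    x -= g_prime  # update without an explicit learning rate
    delta = rho * delta + (1 - rho) * g_prime ** 2
    print(f'step {t + 1}: x = {x:.4f}')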
11.9.2. Implementation
Adadelta needs to maintain two state variables for each variable, $\mathbf{s}_t$ and $\Delta\mathbf{x}_t$. This yields the following implementation.
%matplotlib inline
from mxnet import np, npx
from d2l import mxnet as d2l
npx.set_np()
def init_adadelta_states(feature_dim):
    s_w, s_b = np.zeros((feature_dim, 1)), np.zeros(1)
    delta_w, delta_b = np.zeros((feature_dim, 1)), np.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        # In-place updates via [:]
        s[:] = rho * s + (1 - rho) * np.square(p.grad)
        g = (np.sqrt(delta + eps) / np.sqrt(s + eps)) * p.grad
        p[:] -= g
        delta[:] = rho * delta + (1 - rho) * g * g
%matplotlib inline
import torch
from d2l import torch as d2l
def init_adadelta_states(feature_dim):
    s_w, s_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    delta_w, delta_b = torch.zeros((feature_dim, 1)), torch.zeros(1)
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with torch.no_grad():
            # In-place updates via [:]
            s[:] = rho * s + (1 - rho) * torch.square(p.grad)
            g = (torch.sqrt(delta + eps) / torch.sqrt(s + eps)) * p.grad
            p[:] -= g
            delta[:] = rho * delta + (1 - rho) * g * g
        p.grad.data.zero_()
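As a quick sanity check of the function above (an illustrative sketch on the toy objective $f(x) = \tfrac{1}{2}x^2$, not part of the original notebook), a few steps on a single scalar parameter look as follows; the state list matches the (s, delta) pairs returned by init_adadelta_states.

# Illustrative toy check: a few Adadelta steps on f(x) = 0.5 * x**2
x = torch.tensor([2.0], requires_grad=True)
states = [(torch.zeros(1), torch.zeros(1))]  # (s, delta) for the single parameter
for t in range(3):
    loss = 0.5 * (x ** 2).sum()
    loss.backward()
    adadelta([x], states, {'rho': 0.9})
    print(f'step {t + 1}: x = {x.item():.4f}')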
%matplotlib inline
import tensorflow as tf
from d2l import tensorflow as d2l
def init_adadelta_states(feature_dim):
    s_w = tf.Variable(tf.zeros((feature_dim, 1)))
    s_b = tf.Variable(tf.zeros(1))
    delta_w = tf.Variable(tf.zeros((feature_dim, 1)))
    delta_b = tf.Variable(tf.zeros(1))
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, grads, states, hyperparams):
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta), grad in zip(params, states, grads):
        s[:].assign(rho * s + (1 - rho) * tf.math.square(grad))
        g = (tf.math.sqrt(delta + eps) / tf.math.sqrt(s + eps)) * grad
        p[:].assign(p - g)
        delta[:].assign(rho * delta + (1 - rho) * g * g)
%matplotlib inline
import warnings
from d2l import paddle as d2l
warnings.filterwarnings("ignore")
import paddle
def init_adadelta_states(feature_dim):
    s_w, s_b = paddle.zeros(shape=(feature_dim, 1)), paddle.zeros(shape=(1, ))
    delta_w, delta_b = paddle.zeros(shape=(feature_dim, 1)), paddle.zeros(shape=(1, ))
    return ((s_w, delta_w), (s_b, delta_b))

def adadelta(params, states, hyperparams):
    a = []
    rho, eps = hyperparams['rho'], 1e-5
    for p, (s, delta) in zip(params, states):
        with paddle.no_grad():
            # In-place updates via [:]
            s[:] = rho * s + (1 - rho) * paddle.square(p.grad)
            g = (paddle.sqrt(delta + eps) / paddle.sqrt(s + eps)) * p.grad
            p[:] -= g
            delta[:] = rho * delta + (1 - rho) * g * g
        p.grad.zero_()
        a.append(p)
    return a
For each parameter update, choosing $\rho = 0.9$ amounts to a half-life time of 10. We get the following results:
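The state initializer and update rule are plugged into the training loop used throughout this chapter. A sketch of the invocation, assuming the d2l.get_data_ch11 and d2l.train_ch11 utilities introduced earlier in the chapter:

# Sketch, assuming the chapter's d2l training utilities
data_iter, feature_dim = d2l.get_data_ch11(batch_size=10)
d2l.train_ch11(adadelta, init_adadelta_states(feature_dim),
               {'rho': 0.9}, data_iter, feature_dim)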
loss: 0.243, 0.101 sec/epoch
loss: 0.243, 0.014 sec/epoch
loss: 0.243, 0.148 sec/epoch
For a concise implementation, we simply use the Adadelta algorithm from the high-level APIs.
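In PyTorch, for instance, this amounts to handing torch.optim.Adadelta and the rho hyperparameter to the chapter's concise training routine; a sketch, assuming d2l.train_concise_ch11 and the data_iter from above:

# Sketch of the concise version (PyTorch shown; other frameworks are analogous)
trainer = torch.optim.Adadelta
d2l.train_concise_ch11(trainer, {'rho': 0.9}, data_iter)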
loss: 0.243, 0.013 sec/epoch
loss: 0.244, 0.101 sec/epoch
11.9.3. Summary

Adadelta has no learning rate parameter. Instead, it uses the rate of change in the parameters itself to adapt the learning rate.
Adadelta requires two state variables to store the second moments of the gradient and the change in the parameters.
Adadelta uses leaky averages to keep a running estimate of the appropriate statistics.
11.9.4. Exercises

1. Adjust the value of $\rho$. What happens?
2. Show how to implement the algorithm without the use of $\mathbf{g}_t'$. Why might this be a good idea?
3. Is Adadelta really learning rate free? Can you find optimization problems that break Adadelta?
4. Compare the convergence behavior of Adadelta with that of AdaGrad and RMSProp.