2024-07-12
Stochastic optimization is a crucial component of machine learning, with the stochastic gradient descent (SGD) algorithm at its core, a method that has been widely used since it was first proposed more than 60 years ago. The past eight years have witnessed an exciting new development: variance reduction techniques for stochastic optimization methods. These variance-reduced (VR) methods excel in settings where more than one pass over the training data is allowed, and they have been shown to converge faster than SGD both in theory and in practice. This speedup underlies the growing interest in VR methods and the rapidly accumulating body of research in this area. This article reviews the key principles and major advances in VR methods for optimization over finite datasets, and is aimed at non-expert readers. We focus primarily on the convex optimization setting and provide references for readers interested in extensions to non-convex function minimization.
Key words | Machine learning; optimization; variance reduction
A fundamental question in machine learning research is how to fit a model to a large dataset. For example, consider the typical case of a linear least-squares model:
$$x^* \in \arg\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (a_i^T x - b_i)^2 \quad (1)$$

In this model we have $d$ parameters, collected in the vector $x \in \mathbb{R}^d$, and $n$ data points, each consisting of a feature vector $a_i \in \mathbb{R}^d$ and a target value $b_i \in \mathbb{R}$. Fitting the model means adjusting the parameters so that the prediction $a_i^T x$ is, on average, as close as possible to the target $b_i$.

More generally, we may use a loss function $f_i(x)$ to measure how well the model fits the $i$-th data point:

$$x^* \in \arg\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \quad (2)$$

A large loss $f_i(x)$ indicates that the model's prediction deviates substantially from the data, while $f_i(x) = 0$ means that the model fits the $i$-th data point perfectly. The function $f(x)$ is the average loss of the model over the entire dataset.

Problems of the form (2) arise not only in linear least squares but in many other models studied in machine learning. For example, in the logistic regression model we solve:

$$x^* \in \arg\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-b_i a_i^T x}) + \frac{\lambda}{2} \|x\|_2^2 \quad (3)$$

Here the $b_i \in \{-1, +1\}$ are the labels of a binary classification problem, and the prediction is based on $a_i^T x$. The regularization term $\frac{\lambda}{2}\|x\|_2^2$ is introduced to avoid overfitting the data, where $\|x\|_2^2$ denotes the squared Euclidean norm of $x$.
In most supervised learning models, the training process can be expressed as form (2), including L1 regularized least squares, support vector machine (SVM), principal component analysis, conditional random fields, and deep neural networks.
A key challenge in modern problem instances is that the number of data points $n$ can be extremely large. We routinely deal with datasets whose size reaches or exceeds the terabyte range, coming from sources as diverse as the internet, satellites, remote sensors, financial markets, and scientific experiments. To cope with such large datasets, a common approach is the stochastic gradient descent (SGD) algorithm, which uses only a small number of randomly picked data points in each iteration. In addition, there has recently been a surge of interest in variance-reduced (VR) stochastic gradient methods, which have shown faster convergence rates than traditional stochastic gradient methods.
Figure 1. Gradient descent (GD), accelerated gradient descent (AGD, i.e., accelerated GD in [50]), stochastic gradient descent (SGD), and ADAM [30] methods are compared with the variance reduction (VR) methods SAG and SVRG on the logistic regression problem based on the Mushroom dataset [7], where n = 8124 and d = 112.
Gradient descent (GD) is a classic algorithm used to solve the above problem (2). Its iterative update formula is as follows:
$$x_{k+1} = x_k - \gamma \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k) \quad (4)$$

Here, $\gamma$ is a fixed step size greater than zero. Each iteration of GD requires computing the gradient $\nabla f_i(x_k)$ for every data point $i$, which means that GD makes one complete pass over all $n$ data points per iteration. When $n$ is very large, the cost of each GD iteration becomes very high, which limits its applicability.
As an alternative, we can consider the stochastic gradient descent (SGD) method, which was first proposed by Robbins and Monro, and its iterative update formula is as follows:
$$x_{k+1} = x_k - \gamma \nabla f_{i_k}(x_k) \quad (5)$$

The SGD algorithm uses the gradient $\nabla f_{i_k}(x_k)$ of only one randomly selected data point in each iteration, which keeps the per-iteration cost low. In Figure 1, we can see that SGD makes more rapid progress than GD (including accelerated GD) in the early stages of the optimization. The figure measures progress in epochs, where one epoch is defined as $n$ evaluations of gradients of individual training samples. GD performs one iteration per epoch, while SGD performs $n$ iterations per epoch. We use epochs as the basis for comparing SGD and GD because, when $n$ is very large, the dominant cost of both methods is the computation of the gradients $\nabla f_i(x_k)$.
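To make the comparison concrete, here is a minimal Python/NumPy sketch (not from the original article) of the GD update (4) and the SGD update (5) applied to the least-squares objective (1); the synthetic data, step size, and number of epochs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))                     # rows are the feature vectors a_i
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of f_i(x) = (a_i^T x - b_i)^2, i.e. 2 (a_i^T x - b_i) a_i."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Full gradient (1/n) sum_i grad f_i(x)."""
    return 2.0 * A.T @ (A @ x - b) / n

gamma = 0.005
x_gd, x_sgd = np.zeros(d), np.zeros(d)
for epoch in range(50):
    x_gd -= gamma * full_grad(x_gd)                 # GD: one full-gradient step per epoch
    for _ in range(n):                              # SGD: n cheap steps per epoch
        i = rng.integers(n)                         # uniformly random index i_k
        x_sgd -= gamma * grad_i(x_sgd, i)           # update (5)
```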
Consider the case where the random index $i_k$ is selected uniformly at random from the set $\{1, \ldots, n\}$, so that for every $i$ the probability of choosing $i_k = i$ is $P[i_k = i] = \frac{1}{n}$. In this case, $\nabla f_{i_k}(x_k)$ is an unbiased estimator of $\nabla f(x_k)$, because, by the definition of expectation, we have:
$$\mathbb{E}\left[\nabla f_{i_k}(x_k) \mid x_k\right] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k) = \nabla f(x_k) \quad (6)$$
Although the SGD method does not guarantee that the value of $f$ decreases at every step, on average it moves in the direction of the negative full gradient, which is a descent direction.
However, having an unbiased gradient estimator is not enough to ensure the convergence of the SGD iterates. To illustrate this point, Figure 2 (left) shows the SGD iterates with a constant step size applied to a logistic regression objective on the *fourclass* dataset from LIBSVM [7]. The concentric ellipses in the figure are the contour lines of the function, that is, the sets of points $x$ for which $f(x) = c$ for a given constant $c$; different constants $c$ correspond to different ellipses.
The SGD iterates do not converge to the optimal solution (indicated by the green asterisk) but instead form a point cloud around it. In contrast, Figure 2 (right) shows the iterates of a variance-reduced (VR) method, the stochastic average gradient (SAG) method introduced later, using the same constant step size. The reason SGD fails to converge in this example is that the stochastic gradients themselves do not converge to zero, so the constant-step SGD method (5) never comes to a stop. This is in stark contrast to the gradient descent (GD) method, which naturally stops: as $x_k$ approaches $x^*$, the gradient $\nabla f(x_k)$ tends to zero.
Figure 2. Level-set plots of a 2D logistic regression problem with the SGD (left) and SAG (right) iterates using a fixed step size. The green asterisk marks $x^*$.
There are several classical techniques for addressing the non-convergence caused by the variance of the $\nabla f_i(x_k)$. For example, Robbins and Monro [64] use a sequence of decreasing step sizes $\gamma_k$, ensuring that the products $\gamma_k \nabla f_{i_k}(x_k)$ converge to zero. However, tuning this sequence of decreasing step sizes so that the algorithm does not stop too early or too late is itself a difficult problem.
Another classic technique for reducing variance is to use the average of several $\nabla f_i(x_k)$ as an approximation of the full gradient $\nabla f(x)$. This approach is called minibatching and is particularly useful when multiple gradients can be evaluated in parallel. It leads to iterations of the form:

$$x_{k+1} = x_k - \gamma \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k) \quad (7)$$

where $B_k$ is a set of random indices and $|B_k|$ denotes the size of $B_k$. If $B_k$ is sampled uniformly with replacement, the variance of the gradient estimate is inversely proportional to the batch size $|B_k|$, so the variance can be reduced by increasing the batch size.
However, the cost of this iteration is proportional to the batch size, so this form of variance reduction comes at the expense of increased computational cost.
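As a small illustration (not from the article), the minibatch estimate in (7) can be formed as follows; the data, loss, and batch size are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    # gradient of the least-squares loss f_i(x) = (a_i^T x - b_i)^2
    return 2.0 * (A[i] @ x - b[i]) * A[i]

def minibatch_grad(x, batch_size):
    """Estimate (7): average of grad f_i over a batch B_k sampled uniformly with replacement."""
    batch = rng.integers(n, size=batch_size)
    return np.mean([grad_i(x, i) for i in batch], axis=0)

x, gamma = np.zeros(d), 0.01
for k in range(2000):
    x -= gamma * minibatch_grad(x, batch_size=16)
```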
Another common strategy to reduce variance and improve the empirical performance of SGD is to add "momentum", an extra term based on the direction used in past steps. In particular, SGD with momentum takes the following form:
$$m_k = \beta m_{k-1} + \nabla f_{i_k}(x_k) \quad (8)$$

$$x_{k+1} = x_k - \gamma m_k \quad (9)$$
The momentum parameter $\beta$ lies in the range $(0, 1)$. If the initial momentum is $m_0 = 0$ and we expand the recursion (8) for $m_k$, we find that $m_k$ is a weighted sum of the previous gradients:

$$m_k = \sum_{t=0}^{k} \beta^{k-t} \nabla f_{i_t}(x_t) \quad (10)$$
Thus, $m_k$ is a weighted sum of stochastic gradients. Since $\sum_{t=0}^{k} \beta^{k-t} = \frac{1 - \beta^{k+1}}{1 - \beta}$, we can view $\frac{1 - \beta}{1 - \beta^{k+1}} m_k$ as a weighted average of the stochastic gradients. Comparing this with the expression for the full gradient, $\nabla f(x_k) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x_k)$, we can interpret $\frac{1 - \beta}{1 - \beta^{k+1}} m_k$ (and hence $m_k$) as an estimate of the full gradient. This weighted sum reduces variance, but it also brings a key problem: since the weighted sum (10) gives more weight to the most recently sampled gradients, it does not converge to the full gradient $\nabla f(x_k)$, which is a plain average. The first variance-reduced method we will see in Section II.A resolves this issue by using a plain average instead of a weighted average.
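As a quick illustration (not from the article), the momentum recursion (8)-(9) can be implemented in a few lines; the step size and momentum parameter below are arbitrary placeholder values.

```python
import numpy as np

def sgd_momentum(grad_i, x0, n, gamma=0.01, beta=0.9, num_steps=10_000, seed=0):
    """SGD with momentum: m_k = beta * m_{k-1} + grad f_{i_k}(x_k) and x_{k+1} = x_k - gamma * m_k."""
    rng = np.random.default_rng(seed)
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(num_steps):
        i = rng.integers(n)              # uniformly random index i_k
        m = beta * m + grad_i(x, i)      # recursion (8): a weighted sum of past gradients
        x = x - gamma * m                # update (9)
    return x
```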
Unlike the classical methods, which directly use one or several $\nabla f_i(x_k)$ as an approximation of $\nabla f(x_k)$, modern variance-reduced (VR) methods take a different strategy. These methods use the $\nabla f_i(x_k)$ to update an estimate $g_k$ of the gradient, with the goal that $g_k$ approaches $\nabla f(x_k)$; that is, we want $g_k \approx \nabla f(x_k)$. Based on this gradient estimate, we then take an approximate gradient step of the form:

$$x_{k+1} = x_k - \gamma g_k \quad (11)$$

where $\gamma > 0$ is the step-size parameter.
To ensure that the iteration (11) with a constant step size $\gamma$ can converge, we need the variance of the gradient estimate $g_k$ to tend to zero. Mathematically, this can be expressed as:

$$\mathbb{E}\left[\| g_k - \nabla f(x_k) \|^2\right] \rightarrow 0 \quad \text{as } k \rightarrow \infty \quad (12)$$
The expectation $\mathbb{E}$ here is taken over the randomness of the algorithm up to iteration $k$. Property (12) ensures that a VR method comes to a stop when it reaches the optimal solution. We regard this property as the hallmark of VR methods and hence call it the VR property. It is worth noting that the expression "reduced" variance may be misleading, since in reality the variance tends to zero. Property (12) is the key factor that enables VR methods to achieve faster convergence in theory (under appropriate assumptions) and in practice (as shown in Figure 1).
A simple modification makes the SGD recursion (5) converge without decreasing the step size: shift each gradient by subtracting $\nabla f_i(x^*)$. The resulting method is defined as follows:

$$x_{k+1} = x_k - \gamma \left(\nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*)\right) \quad (13)$$
This method is called SGD⋆ [22]. Although we usually do not know $\nabla f_i(x^*)$ exactly, SGD⋆ is an example that illustrates well the basic properties of variance-reduced methods. Moreover, many variance-reduced methods can be viewed as approximations of SGD⋆: these methods do not rely on knowing each $\nabla f_i(x^*)$, but instead use estimates that approximate $\nabla f_i(x^*)$.
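Purely for illustration (not from the article), the following sketch contrasts SGD with the idealized SGD⋆ update (13) on a least-squares problem where $x^*$, and hence $\nabla f_i(x^*)$, can be computed in closed form; all data and parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]    # minimizer of the least-squares objective

def grad_i(x, i):
    # gradient of f_i(x) = (a_i^T x - b_i)^2
    return 2.0 * (A[i] @ x - b[i]) * A[i]

gamma = 0.005
x = np.zeros(d)
for k in range(20 * n):
    i = rng.integers(n)
    # SGD-star step (13): the shifted stochastic gradient vanishes at x = x_star,
    # so the iteration naturally comes to a stop at the solution.
    x -= gamma * (grad_i(x, i) - grad_i(x_star, i))
```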
It is important to note that SGD⋆ uses an unbiased estimate of the full gradient: since $\nabla f(x^*) = 0$, we have:

$$\mathbb{E}\left[\nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*)\right] = \nabla f(x_k) - \nabla f(x^*) = \nabla f(x_k)$$
In addition, SGD⋆ naturally comes to a stop when it reaches the optimal solution, because for every $i$ we have:

$$\left(\nabla f_i(x) - \nabla f_i(x^*)\right)\Big|_{x = x^*} = 0$$
Furthermore, for $x_k$ near $x^*$ (and continuous $\nabla f_i$), SGD⋆ satisfies the variance-reduction property (12), because:

$$\mathbb{E}\left[\| g_k - \nabla f(x_k) \|^2\right] = \mathbb{E}\left[\| \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*) - \nabla f(x_k) \|^2\right] \leq \mathbb{E}\left[\| \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*) \|^2\right]$$
Here we used Lemma 2 with $X = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*)$, together with the fact that $\mathbb{E}[\nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*)] = \nabla f(x_k)$. This property is what allows SGD⋆ to converge faster than the traditional SGD method, as we explain in detail in Appendix B.
In this section we introduce two standard assumptions used in the analysis of variance reduction (VR) methods and discuss the speedups that can be achieved under these assumptions compared to traditional SGD methods. First, we assume that the gradient has Lipschitz continuity, which means that the gradient can change at a finite rate.
We assume (Assumption 1) that the function $f$ is differentiable and $L$-smooth; that is, for all $x$ and $y$ and some $0 < L < \infty$, the following holds:

$$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad (14)$$
We also assume that each $f_i : \mathbb{R}^d \rightarrow \mathbb{R}$ is differentiable and $L_i$-smooth, and we define $L_{\max} := \max\{L_1, \ldots, L_n\}$.
Although this is generally considered a weak assumption, in later sections we will discuss VR methods for nonsmooth problems. For a twice-differentiable univariate function, $L$-smoothness has an intuitive interpretation: it is equivalent to assuming that the second derivative is bounded by $L$, i.e., $|f''(x)| \leq L$ for all $x \in \mathbb{R}$. For a twice-differentiable function of several variables, it is equivalent to assuming that the singular values of the Hessian matrix $\nabla^2 f(x)$ are bounded above by $L$ for all $x \in \mathbb{R}^d$.
The second assumption we consider (Assumption 2) is that the function $f$ is $\mu$-strongly convex, meaning that for some $\mu > 0$ the function $x \mapsto f(x) - \frac{\mu}{2}\|x\|^2$ is convex. In addition, each $f_i : \mathbb{R}^d \rightarrow \mathbb{R}$, $i = 1, \ldots, n$, is convex.
This is a strong assumption. In the least-squares problem, each $f_i$ is convex, but the overall function $f$ is strongly convex only when the design matrix $A := [a_1, \ldots, a_n]$ has full row rank. The L2-regularized logistic regression problem satisfies this assumption thanks to the regularization term, with $\mu \geq \lambda$.
An important class of problems satisfying these assumptions consists of optimization problems of the form:

$$x^* \in \arg\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(a_i^T x) + \frac{\lambda}{2}\|x\|^2 \quad (15)$$
Each of the "loss" functions
ℓ
i
:
R
→
R
ell_i: mathbb{R} rightarrow mathbb{R}
ℓi:R→R is twice differentiable, and its second-order derivative
ℓ
i
′
′
ell_i''
ℓi′′ is restricted to 0 and some upper bound
M
M
M This includes various loss functions with L2 regularization in machine learning, such as least squares, logistic regression, probit regression, Huber robust regression, etc. In this case, for all
i
i
i,We have
L
i
≤
M
∥
a
i
∥
2
+
λ
L_i leq M|a_i|^2 + lambda
Li≤M∥ai∥2+λ and
μ
≥
λ
mu geq lambda
μ≥λ。
Under these assumptions, the convergence rate of the gradient descent (GD) method is governed by the condition number $\kappa := L/\mu$. The condition number is always greater than or equal to 1. When it is significantly larger than 1, the contours of the function become very elongated ellipses, causing the GD iterates to oscillate; conversely, when $\kappa$ is close to 1, GD converges quickly.
Under Assumptions 1 and 2, VR methods converge at a linear rate. We say that the function values $\{f(x_k)\}$ of a stochastic method converge linearly (in expectation) at rate $0 < \rho \leq 1$ if there exists a constant $C > 0$ such that:

$$\mathbb{E}[f(x_k)] - f(x^*) \leq (1 - \rho)^k C = O(\exp(-k\rho)) \quad \forall k \quad (16)$$
This is in contrast with the classical SGD method, which relies only on an unbiased estimate of the gradient at each iteration and can only achieve a sublinear rate under these assumptions:

$$\mathbb{E}[f(x_k)] - f(x^*) \leq O(1/k)$$
Given a target accuracy $\epsilon$, the smallest number of iterations $k$ that guarantees $\mathbb{E}[f(x_k)] - f(x^*) \leq \epsilon$ is called the iteration complexity of the algorithm. The table below gives the iteration complexity and the cost of one iteration for the basic variants of GD, SGD, and VR methods:

| Algorithm | Iteration complexity | Cost of one iteration |
|---|---|---|
| GD | $O(\kappa \log(1/\epsilon))$ | $O(n)$ |
| SGD | $O(\kappa_{\max}/\epsilon)$ | $O(1)$ |
| VR | $O((\kappa_{\max} + n)\log(1/\epsilon))$ | $O(1)$ |
The total running time of an algorithm is given by the product of its iteration complexity and the cost of one iteration. Here $\kappa_{\max} := \max_i L_i/\mu$. Note that $\kappa_{\max} \geq \kappa$; consequently, the iteration complexity of GD is smaller than that of the VR methods.
However, since each GD iteration is $n$ times more expensive, the VR methods win in terms of total running time.
The advantage of classical SGD is that its running time does not depend on $n$, but its dependence on the tolerance $\epsilon$ is much worse, which explains the poor performance of SGD when the tolerance is small.
In Appendix B, we provide a simple proof showing that the SGD⋆ method has the same iteration complexity as the VR methods.
The development of variance-reduced (VR) methods came in several waves, with the first wave of methods already achieving the improved convergence rates discussed above. This line of work began with the SAG algorithm. It was followed by the stochastic dual coordinate ascent (SDCA) algorithm, the MISO algorithm, the stochastic variance-reduced gradient (SVRG/S2GD) algorithm, and the SAGA algorithm (an "improved" SAG).
In this section, we introduce these seminal VR methods in detail; in Section 4, we explore newer methods that improve on these basic methods in specific settings.
Our first variance-reduced (VR) method is motivated by mimicking the structure of the full gradient: since $\nabla f(x)$ is a plain average of all the $\nabla f_i(x)$, our estimate $g_k$ of the full gradient should likewise be an average of estimates of the individual gradients. This idea gives rise to our first VR method: the stochastic average gradient (SAG) method.
The SAG method [37], [65] is a randomized version of the earlier incremental aggregated gradient (IAG) method [4]. The core idea of SAG is to maintain, for each data point $i$, an estimate $v_i^k \approx \nabla f_i(x_k)$, and then to use the average of these $v_i^k$ as an estimate of the full gradient, that is:

$$\bar{g}_k = \frac{1}{n} \sum_{j=1}^{n} v_j^k \approx \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(x_k) = \nabla f(x_k) \quad (18)$$
In each iteration of SAG, an index $i_k$ is drawn from the set $\{1, \ldots, n\}$, and the $v_j^k$ are then updated according to the rule:

$$v_j^{k+1} = \begin{cases} \nabla f_{i_k}(x_k), & \text{if } j = i_k \\ v_j^k, & \text{if } j \neq i_k \end{cases} \quad (19)$$
Each $v_i^0$ can be initialized to zero or to $\nabla f_i(x_0)$. As the iterates approach the solution $x^*$, the $v_i^k$ gradually converge to $\nabla f_i(x^*)$, so the VR property (12) is satisfied.
To implement SAG efficiently, we must compute $\bar{g}_k$ without re-summing all $n$ vectors from scratch at every iteration, since this would be very expensive when $n$ is large. Fortunately, since only one of the $v_j^k$ terms changes per iteration, we can avoid recomputing the entire sum. Specifically, if index $i_k$ is drawn at iteration $k$, then:

$$\bar{g}_k = \frac{1}{n} \sum_{\substack{j=1 \\ j \neq i_k}}^{n} v_j^k + \frac{1}{n} v_{i_k}^k = \bar{g}_{k-1} - \frac{1}{n} v_{i_k}^{k-1} + \frac{1}{n} v_{i_k}^k \quad (20)$$
Because all the $v_j^k$ except $v_{i_k}^k$ remain unchanged, we only need to store one vector $v_j$ for each $j$. Algorithm 1 gives the resulting implementation of the SAG method.
SAG was the first stochastic method shown to achieve linear convergence, with iteration complexity $O((\kappa_{\max} + n)\log(1/\epsilon))$ using a step size $\gamma = O(1/L_{\max})$. This linear convergence can be observed in Figure 1. It is worth noting that, since an $L_{\max}$-smooth function is also $L'$-smooth for any $L' \geq L_{\max}$, SAG achieves a linear convergence rate for any sufficiently small step size. This is in stark contrast with classical SGD, which achieves only sublinear rates, and only with decreasing step-size sequences that are difficult to tune in practice.
At the time, the linear convergence of SAG was a significant advance, because SAG computes only a single stochastic gradient (processes a single data point) per iteration. However, the convergence proof given by Schmidt et al. [65] is very involved and relies on a computer-verified step. A key reason why SAG is difficult to analyze is that $g_k$ is a biased estimator of the gradient.
Next, we present the SAGA method, a variant of SAG that exploits the concept of covariates to obtain an unbiased gradient estimate; it has similar performance but is much easier to analyze.
Algorithm 1: SAG method
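The algorithm box is not reproduced here; the following is a minimal Python sketch of the SAG iteration described above (updates (19) and (20)), with placeholder parameters rather than the authors' reference implementation.

```python
import numpy as np

def sag(grad_i, x0, n, gamma, num_steps, seed=0):
    """SAG sketch: keep one stored gradient v_j per data point and maintain
    their running average g_bar in O(d) per iteration via update (20)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = np.zeros((n, x0.size))           # stored gradients v_j (initialized to zero)
    g_bar = np.zeros_like(x0)            # running average (1/n) * sum_j v_j
    for _ in range(num_steps):
        i = rng.integers(n)              # sample i_k uniformly
        g_new = grad_i(x, i)
        g_bar += (g_new - v[i]) / n      # update (20): swap the old v_i for the new gradient
        v[i] = g_new                     # update (19)
        x = x - gamma * g_bar            # SAG step: x_{k+1} = x_k - gamma * g_bar_k
    return x
```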
A basic way to reduce the variance of the unbiased gradient estimator $\nabla f_{i_k}(x_k)$ is to use so-called covariates (also known as control variates). For $i = 1, \ldots, n$, let $v_i \in \mathbb{R}^d$ be a vector. Using these vectors, we can rewrite the full gradient $\nabla f(x)$ as:

$$\nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \left(\nabla f_i(x) - v_i + v_i\right) = \frac{1}{n} \sum_{i=1}^{n} \left(\nabla f_i(x) - v_i\right) + \frac{1}{n} \sum_{j=1}^{n} v_j =: \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x, v) \quad (21)$$

where we define $\nabla f_i(x, v) := \nabla f_i(x) - v_i + \frac{1}{n} \sum_{j=1}^{n} v_j$. We can now sample one of the $\nabla f_i(x, v)$ uniformly at random to build an unbiased estimate of the full gradient $\nabla f(x)$. That is, sampling $i_k$ uniformly from $\{1, \ldots, n\}$, we can apply the SGD method with the gradient estimate:

$$g_k = \nabla f_{i_k}(x_k, v) = \nabla f_{i_k}(x_k) - v_{i_k} + \frac{1}{n} \sum_{j=1}^{n} v_j \quad (22)$$
To see how the choice of the $v_i$ affects the variance of $g_k$, we substitute $g_k = \nabla f_{i_k}(x_k, v)$ and use $\mathbb{E}_i[v_i] = \frac{1}{n}\sum_{j=1}^{n} v_j$ (with $i$ sampled uniformly) to compute:

$$\mathbb{E}\left[\left\| \nabla f_i(x_k) - v_i + \mathbb{E}_i\left[v_i - \nabla f_i(x_k)\right] \right\|^2\right] \leq \mathbb{E}\left[\| \nabla f_i(x_k) - v_i \|^2\right] \quad (23)$$
Here Lemma 2 was used with $X = \nabla f_i(x_k) - v_i$. The bound (23) shows that if the $v_i$ approach $\nabla f_i(x_k)$ as $k$ increases, we obtain the VR property (12). This is why we call the $v_i$ covariates: we can choose them so as to reduce the variance.
For example, the SGD⋆ method (13) is an instance of this approach with $v_i = \nabla f_i(x^*)$. This choice is rarely usable in practice, however, because we usually do not know $\nabla f_i(x^*)$. A more practical option is to take as $v_i$ the gradient $\nabla f_i(\bar{x}_i)$ at some known point $\bar{x}_i \in \mathbb{R}^d$ near the current iterate. SAGA maintains, for each $f_i$, a reference point $\bar{x}_i \in \mathbb{R}^d$ and uses the covariates $v_i = \nabla f_i(\bar{x}_i)$, where each $\bar{x}_i$ is the last point at which $f_i$ was evaluated. Using these covariates and following (22), the gradient estimate becomes:
$$g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\bar{x}_{i_k}) + \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\bar{x}_j) \quad (24)$$
To implement SAGA, we can store the gradients $\nabla f_i(\bar{x}_i)$ rather than the $n$ reference points $\bar{x}_i$. That is, we store $v_j = \nabla f_j(\bar{x}_j)$ for $j \in \{1, \ldots, n\}$ and, at each iteration, update a single stochastic gradient $v_j$, exactly as in SAG.
Algorithm 2 SAGA
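The algorithm box is not reproduced here; below is a minimal Python sketch of the SAGA iteration with the unbiased estimate (24), using placeholder parameters rather than the authors' reference implementation.

```python
import numpy as np

def saga(grad_i, x0, n, gamma, num_steps, seed=0):
    """SAGA sketch: store v_j = grad f_j at the last point where f_j was evaluated
    and use the unbiased estimate (24) at every step."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = np.zeros((n, x0.size))           # stored gradients v_j = grad f_j(x_bar_j)
    v_avg = np.zeros_like(x0)            # running average (1/n) * sum_j v_j
    for _ in range(num_steps):
        i = rng.integers(n)
        g_new = grad_i(x, i)
        g_k = g_new - v[i] + v_avg       # unbiased estimate (24)
        x = x - gamma * g_k
        v_avg += (g_new - v[i]) / n      # keep the average of stored gradients up to date
        v[i] = g_new
    return x
```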
The SAGA method has the same iteration complexity as SAG, $O((\kappa_{\max} + n)\log(1/\epsilon))$ with step size $\gamma = O(1/L_{\max})$, but its proof is much simpler. However, like SAG, the SAGA method requires storing $n$ auxiliary vectors $v_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, which amounts to $O(nd)$ memory. This may be infeasible when both $d$ and $n$ are large. In the next section, we detail how this memory requirement can be reduced for common models such as regularized linear models.
When it is feasible to store $n$ auxiliary vectors in memory, SAG and SAGA tend to perform similarly. If this memory requirement is too high, the SVRG method, which we review in the next section, is a good alternative: it achieves the same convergence rate and is often nearly as fast in practice, but it requires only $O(d)$ memory for general problems.
Before the advent of the SAGA method, some earlier works introduced covariates to address the high memory requirement of SAG. These works construct covariates based on a fixed reference point $\bar{x} \in \mathbb{R}^d$ at which the full gradient $\nabla f(\bar{x})$ has already been computed. By storing the reference point $\bar{x}$ and the corresponding full gradient $\nabla f(\bar{x})$, we can implement the update (24) with $\bar{x}_j = \bar{x}$ for all $j$ without storing each $\nabla f_j(\bar{x})$: instead of storing these vectors, we recompute $\nabla f_{i_k}(\bar{x})$ at each iteration from the stored reference point $\bar{x}$. This method was proposed independently by several authors under different names, but it is now commonly referred to as SVRG, following the naming of [28] and [84].
We formalize the SVRG method in Algorithm 3.
Using (23), we can bound the variance of the gradient estimate $g_k$:

$$\mathbb{E}\left[\| g_k - \nabla f(x_k) \|^2\right] \leq \mathbb{E}\left[\| \nabla f_i(x_k) - \nabla f_i(\bar{x}) \|^2\right] \leq L_{\max}^2 \| x_k - \bar{x} \|^2$$

The second inequality uses the $L_i$-smoothness of each $f_i$.
It is worth noting that the closer the reference point $\bar{x}$ is to the current point $x_k$, the smaller the variance of the gradient estimate.
For the SVRG method to be effective, the reference point $\bar{x}$ must be updated regularly. There is a trade-off between the cost of updating the reference point (which requires computing a full gradient) and the benefit of reduced variance: we update the reference point every $t$ iterations so that it stays close to $x_k$ (see line 11 of Algorithm 3). That is, the SVRG method contains two loops: an outer loop over $s$, in which the reference gradient $\nabla f(\bar{x}_{s-1})$ is computed (line 4), and an inner loop in which the reference point is fixed and the inner iterates $x_k$ are updated via the stochastic gradient step (22) (line 10).
Unlike SAG and SAGA, SVRG requires only $O(d)$ memory. Its disadvantages are that (1) there is an extra parameter $t$, the length of the inner loop, that needs to be tuned, and (2) two gradients must be computed per inner iteration, plus a full gradient each time the reference point is changed.
Johnson and Zhang [28] showed that SVRG has iteration complexity $O((\kappa_{\max} + n)\log(1/\epsilon))$, similar to SAG and SAGA. This result assumes that the inner-loop length $t$ is sampled uniformly from the set $\{1, \ldots, m\}$, with certain required relationships between $L_{\max}$, $\mu$, the step size $\gamma$, and $t$. In practice, SVRG tends to perform well with $\gamma = O(1/L_{\max})$ and inner-loop length $t = n$, which is exactly the setting we use in Figure 1.
There are many variants of the original SVRG method. For example, some variants use different choices of the inner-loop length $t$ [32], and some allow step sizes of the form $O(1/L_{\max})$ [27], [33], [35]. Other variants reduce the cost of the full-gradient evaluations by replacing $\nabla f(\bar{x})$ with a mini-batch approximation, increasing the mini-batch size over time to preserve the VR property. There are also variants that repeatedly update $g_k$ within the inner loop according to [54]:

$$g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x_{k-1}) + g_{k-1} \quad (25)$$

which provides a more local approximation. This continuously updated variant (25) has proven particularly advantageous when minimizing non-convex functions, as we briefly discuss in Section 4. Finally, note that SVRG can use the value of $\nabla f(\bar{x}_s)$ to help decide when to terminate the algorithm.
Algorithm 3 SVRG method
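The algorithm box is not reproduced here; the following is a minimal Python sketch of the two-loop SVRG scheme described above (a reference point, a full gradient at that point, and inner stochastic steps), with placeholder parameters rather than the authors' reference implementation.

```python
import numpy as np

def svrg(grad_i, full_grad, x0, n, gamma, num_epochs, inner_len=None, seed=0):
    """SVRG sketch: the outer loop recomputes the full gradient at a reference point
    x_bar; the inner loop uses g_k = grad f_i(x_k) - grad f_i(x_bar) + grad f(x_bar)."""
    rng = np.random.default_rng(seed)
    inner_len = n if inner_len is None else inner_len   # t = n is a common practical choice
    x = x0.copy()
    for s in range(num_epochs):
        x_bar = x.copy()                 # update the reference point
        g_bar = full_grad(x_bar)         # full gradient at the reference point
        for _ in range(inner_len):
            i = rng.integers(n)
            g_k = grad_i(x, i) - grad_i(x_bar, i) + g_bar
            x = x - gamma * g_k
    return x
```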
A disadvantage of the SAG and SVRG methods is that their step size depends on $L_{\max}$. Predating SVRG, the SDCA method [70] was one of the earliest VR methods, extending earlier work on coordinate descent methods to the finite-sum problem. The idea behind SDCA and its variants is that the coordinates of the gradient provide natural variance-reduced gradient estimates. Specifically, for $j \in \{1, \ldots, d\}$, define $\nabla_j f(x) := \left( \frac{\partial f(x)}{\partial x_j} \right) e_j$, where $\frac{\partial f(x)}{\partial x_j}$ is the derivative along the $j$-th coordinate and $e_j \in \mathbb{R}^d$ is the $j$-th unit vector. A key property of coordinate derivatives is that $\nabla_j f(x^*) = 0$, since we know that $\nabla f(x^*) = 0$. This is unlike the gradients of the individual data points, $\nabla f_j$, which need not be zero at $x^*$. Therefore, we have:

$$\| \nabla f(x) - \nabla_j f(x) \|^2 \rightarrow 0 \quad \text{as} \quad x \rightarrow x^* \quad (26)$$
This means that the coordinate derivatives satisfy the variance-reduction property (12). Moreover, we can use $\nabla_j f(x)$ to build an unbiased estimate of $\nabla f(x)$. For example, let $j$ be an index chosen uniformly at random from $\{1, \ldots, d\}$, so that for any $i \in \{1, \ldots, d\}$ we have $P[j = i] = \frac{1}{d}$. Then $d \times \nabla_j f(x)$ is an unbiased estimate of $\nabla f(x)$, because:

$$\mathbb{E}\left[d\, \nabla_j f(x)\right] = d \sum_{i=1}^{d} P[j = i]\, \frac{\partial f(x)}{\partial x_i} e_i = \sum_{i=1}^{d} \frac{\partial f(x)}{\partial x_i} e_i = \nabla f(x)$$
Thus, $\nabla_j f(x)$ has all of the desirable properties we expect from a VR estimate of the full gradient, without the need for covariates. One drawback of using this coordinate gradient is that it is expensive to compute for our sum problem (2): computing $\nabla_j f(x)$ requires a full pass over the dataset, since $\nabla_j f(x) = \frac{1}{n}\sum_{i=1}^{n} \nabla_j f_i(x)$. Coordinate derivatives therefore seem incompatible with the structure of our sum problem. However, we can often rewrite the original problem (2) in a so-called dual formulation in which coordinate derivatives can exploit the inherent structure.
For example, the dual formula of the L2 regularized linear model (15) is:
$$v^* \in \arg\max_{v \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^{n} -\ell_i^*(-v_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda} \sum_{i=1}^{n} v_i a_i \right\|^2 \quad (27)$$
where $\ell_i^*(v)$ is the convex conjugate of $\ell_i$. We can recover the variable $x$ of the original problem (15) through the mapping $x = \frac{1}{\lambda}\sum_{i=1}^{n} v_i a_i$; substituting $v^*$ into the right-hand side of this mapping yields the solution $x^*$ of (15).
Note that this dual problem has $n$ real variables $v_i \in \mathbb{R}$, one for each training sample. Moreover, each dual loss function $\ell_i^*$ is a function of $v_i$ only; that is, the first term of the dual objective is separable across coordinates. This separability, combined with the simple form of the second term, allows coordinate ascent methods to be implemented efficiently. In fact, Shalev-Shwartz and Zhang show that coordinate ascent on this problem has an iteration complexity of $O((\kappa_{\max} + n)\log(1/\epsilon))$, similar to SAG, SAGA, and SVRG.
The iteration cost and algorithm structure are also very similar: by keeping track of the sum $\sum_{i=1}^{n} v_i a_i$ to handle the second term in (27), each dual coordinate-ascent iteration needs to consider only one training example, so its cost does not grow with $n$. Moreover, we can compute the step size efficiently using a one-dimensional line search that maximizes the dual objective as a function of $v_i$. This means that fast worst-case running times for VR methods can be achieved even without knowledge of $L_{\max}$ or related quantities.
In order to implement basic variance reduction (VR) methods and achieve reasonable performance, several implementation issues must be addressed. In this section, we discuss several issues not covered above.
In variance-reduced methods such as the stochastic average gradient (SAG), SAGA, and the stochastic variance-reduced gradient (SVRG) method, setting the step size is a key issue. While for the stochastic dual coordinate ascent (SDCA) method we can use the dual objective to determine the step size, the theory for primal methods such as SAG, SAGA, and SVRG prescribes a step size of $\gamma = O\left(\frac{1}{L_{\max}}\right)$. In practice, however, we often do not know the exact value of $L_{\max}$, and other step sizes may give better performance.
A classic strategy for setting the step size in full-GD is the Armijo line search. Given the current point $x_k$ and a search direction $g_k$, the Armijo line search selects, among candidate step sizes, a $\gamma_k$ for which the function is sufficiently decreased, that is:

$$f(x_k + \gamma_k g_k) < f(x_k) - c\, \gamma_k \|\nabla f(x_k)\|^2$$
However, this approach requires evaluating $f(x_k + \gamma_k g_k)$ for multiple candidate step sizes $\gamma_k$, and evaluating $f(x)$ is too expensive when it requires a pass over the entire dataset.
To address this, we can use a stochastic variant that searches for a $\gamma_k$ satisfying:

$$f_{i_k}(x_k + \gamma_k g_k) < f_{i_k}(x_k) - c\, \gamma_k \|\nabla f_{i_k}(x_k)\|^2$$
This approach tends to work well in practice, at least when $\|\nabla f_{i_k}(x_k)\|$ is not close to zero, although there is currently no theory supporting it.
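For illustration (not from the article), a backtracking version of this stochastic Armijo test might look as follows; the initial step size, the backtracking factor, and the constant `c` are arbitrary choices.

```python
def stochastic_armijo(f_i, grad_f_i, x, i, direction, gamma0=1.0, c=1e-4, shrink=0.5, max_tries=20):
    """Backtracking search: shrink gamma until the single-sample Armijo test holds.
    f_i(x, i) and grad_f_i(x, i) evaluate the loss/gradient of the sampled data point."""
    fx = f_i(x, i)
    g = grad_f_i(x, i)
    g_norm_sq = float(g @ g)
    gamma = gamma0
    for _ in range(max_tries):
        if f_i(x + gamma * direction, i) < fx - c * gamma * g_norm_sq:
            return gamma                 # sufficient decrease on the sampled function f_{i_k}
        gamma *= shrink                  # otherwise backtrack
    return gamma
```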
In addition, Mairal proposed using the "Bottou trick" to set the step size in practice. This method performs a binary search over a small subset of the dataset (for example, 5%) to find the step size that performs best during a single pass over this subset. Like the stochastic Armijo line search, it usually works well in practice but also lacks theoretical support.
However, the SDCA method also has some disadvantages. First, it requires computing the convex conjugates $\ell_i^*$ rather than simple gradients. We do not have an equivalent of automatic differentiation for convex conjugation, so this may increase the implementation effort. Recent work has proposed "dual-free" SDCA variants that avoid the conjugate and instead use the gradient directly, but in these methods it is no longer possible to track the dual objective to set the step size. Second, although SDCA requires only $O(n + d)$ memory to solve problem (15), SAG/SAGA also require only $O(n + d)$ memory for this problem class (see Section 3). SDCA variants that apply to more general problems require $O(nd)$ memory, because the $v_i$ become $d$-dimensional vectors. A final subtle drawback of SDCA is that it implicitly assumes that the strong convexity constant $\mu$ equals $\lambda$; for problems where $\mu$ is larger than $\lambda$, the primal VR methods usually outperform SDCA significantly.
In the analysis of optimization algorithms, we usually rely on theoretical iteration-complexity results to predict the worst-case number of iterations needed to reach a given accuracy. However, these theoretical bounds often depend on constants that we cannot know in advance, and in practice algorithms often reach the desired accuracy in far fewer iterations. We therefore need practical tests for deciding when to stop the algorithm.
In the traditional full-gradient descent (full-GD) method, we usually use the norm of the gradient $\|\nabla f(x_k)\|$, or some related quantity, to decide when to stop iterating. For the SVRG method, we can use the same criterion, but with $\|\nabla f(\bar{x}_s)\|$. For the SAG/SAGA methods, although we never explicitly compute the full gradient, the quantity $\bar{g}_k$ gradually approaches $\nabla f(x_k)$, so using $\|\bar{g}_k\|$ as a stopping criterion is a reasonable heuristic.
In the SDCA method, with some extra bookkeeping, we can track the gradient of the dual objective at no additional asymptotic cost. Alternatively, a more systematic approach is to track the duality gap; computing it adds an $O(n)$ cost, but it provides a termination criterion certified by the duality gap. In addition, based on the optimality conditions of strongly convex objectives, the MISO method adopts a principled approach built on a quadratic lower bound [41].
Although the SVRG algorithm eliminates the memory requirement of the earlier VR methods, in practice the SAG and SAGA algorithms often require fewer iterations than SVRG on many problems. This raises the question of whether there are problem classes for which the $O(nd)$ memory requirement of SAG/SAGA can be reduced. This section explores the class of linear models, for which the memory requirement can be reduced substantially.
Consider a linear model in which each function $f_i(x)$ can be written as $\xi_i(a_i^\top x)$. Differentiating with respect to $x$ gives the gradient:

$$\nabla f_i(x) = \xi_i'(a_i^\top x)\, a_i$$

where $\xi_i'$ denotes the derivative of $\xi_i$. Assuming we have direct access to the feature vectors $a_i$, implementing the SAG/SAGA methods then only requires storing the scalars $\xi_i'(a_i^\top x)$. The memory requirement is thus reduced from $O(nd)$ to $O(n)$. The SVRG algorithm can also exploit this structure of the gradient: by storing these $n$ scalars, the number of gradient evaluations required per "inner" iteration of SVRG is reduced to one for this class of problems.
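As an illustration of this memory saving (not from the article), a SAGA-style update for a linear model can store one scalar per example instead of one $d$-dimensional vector; the logistic loss used below is just one example of a suitable $\xi_i$, and all parameters are placeholders.

```python
import numpy as np

def saga_linear(A, b, gamma, num_steps, seed=0):
    """SAGA for a linear model f_i(x) = xi_i(a_i^T x): since grad f_i(x) = xi_i'(a_i^T x) a_i,
    it suffices to store the n scalars xi_i'(a_i^T x_bar_i) instead of n full gradients."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    s = np.zeros(n)                      # stored scalars xi_i'(a_i^T x_bar_i)
    v_avg = np.zeros(d)                  # (1/n) * sum_j s_j * a_j
    for _ in range(num_steps):
        i = rng.integers(n)
        # logistic loss xi_i(z) = log(1 + exp(-b_i z)); its derivative is -b_i / (1 + exp(b_i z))
        s_new = -b[i] / (1.0 + np.exp(b[i] * (A[i] @ x)))
        g_k = (s_new - s[i]) * A[i] + v_avg   # unbiased SAGA estimate (24), linear-model form
        x = x - gamma * g_k
        v_avg += (s_new - s[i]) * A[i] / n
        s[i] = s_new
    return x
```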
There are other classes of problems, such as probabilistic graphical models, that also offer the potential to reduce memory requirements [66]. With suitable data structures and algorithmic choices, the memory required by the algorithm at run time can be reduced further.
In some problems, the gradients $\nabla f_i(x)$ may contain many zero entries, for example in linear models with sparse features. In this case, the traditional stochastic gradient descent (SGD) algorithm can be implemented efficiently, with a per-iteration cost linear in the number of nonzero elements of the gradient, which is usually much smaller than the problem dimension $d$. However, standard variance-reduced (VR) methods do not exploit this advantage. Fortunately, two known modifications address this.
The first, proposed by Schmidt et al., exploits the simplicity of the update and implements an "on-the-fly" (just-in-time) variant whose per-iteration cost is proportional to the number of nonzero entries. In the case of SAG (although the approach applies to all variants), the idea is not to update the full vector $v_{i_k}$ at each iteration, but to compute only the entries $v_{i_k, j}$ corresponding to nonzero features, with each such entry updated in a delayed fashion based on the number of iterations since it was last nonzero.
The second improvement, proposed by Leblond et al. for SAGA, introduces additional randomness into the update $x_{k+1} = x_k - \gamma(\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\bar{x}_{i_k}) + \bar{g}_k)$. Here, $\nabla f_{i_k}(x_k)$ and $\nabla f_{i_k}(\bar{x}_{i_k})$ are sparse, while $\bar{g}_k$ is dense. In this method, each component $(\bar{g}_k)_j$ of the dense term is replaced by $w_j (\bar{g}_k)_j$, where $w \in \mathbb{R}^d$ is a random sparse vector whose support is contained in the support of $\nabla f_{i_k}(x_k)$ and whose expectation is the all-ones vector. In this way, the update remains unbiased (although now sparse), and the added variance does not affect the convergence rate of the algorithm. Leblond et al. provide further details.