2024-07-12
Stochastic optimization is a crucial component of machine learning, with the stochastic gradient descent (SGD) algorithm at its core, a method that has been widely used since it was first proposed more than 60 years ago. The past eight years have witnessed an exciting new development: variance reduction techniques for stochastic optimization methods. These variance-reduced (VR) methods excel in settings where more than one pass over the training data is allowed, and they have been shown to converge faster than SGD both in theory and in practice. This speedup underlies the growing interest in VR methods and the rapidly accumulating body of research in this area. This article reviews the key principles and major advances in VR methods for optimization over finite datasets, and is aimed at non-expert readers. We focus primarily on the convex optimization setting and provide references for readers interested in extensions to non-convex function minimization.
Key words | Machine learning; optimization; variance reduction
A fundamental question in machine learning research is how to fit a model to a large dataset. For example, consider the typical case of a linear least-squares model:
$$x^* \in \arg\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} (a_i^T x - b_i)^2 \quad (1)$$

In this model we have $d$ parameters, collected in the vector $x \in \mathbb{R}^d$, and $n$ data points, each consisting of a feature vector $a_i \in \mathbb{R}^d$ and a target value $b_i \in \mathbb{R}$. Fitting the model means adjusting the parameters so that the prediction $a_i^T x$ is, on average, as close as possible to the target $b_i$.

More generally, we may use a loss function $f_i(x)$ to measure how well the model fits the $i$-th data point:

$$x^* \in \arg\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \quad (2)$$

A large loss $f_i(x)$ indicates that the model's prediction deviates substantially from the data, while $f_i(x) = 0$ means that the model fits the $i$-th data point perfectly. The function $f(x)$ is the average loss of the model over the entire dataset.

Problems of the form (2) arise not only in linear least squares but in many other models studied in machine learning. For example, in the logistic regression model we solve:

$$x^* \in \arg\min_{x \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^{n} \log(1 + e^{-b_i a_i^T x}) + \frac{\lambda}{2} \|x\|_2^2 \quad (3)$$

Here the $b_i \in \{-1, +1\}$ are the labels of a binary classification problem, and the prediction is based on $a_i^T x$. The regularization term $\frac{\lambda}{2}\|x\|_2^2$ is introduced to avoid overfitting the data, where $\|x\|_2^2$ denotes the squared Euclidean norm of $x$.
In most supervised learning models, the training process can be expressed as form (2), including L1 regularized least squares, support vector machine (SVM), principal component analysis, conditional random fields, and deep neural networks.
A key challenge in modern problem instances is that the number of data points $n$ can be extremely large. We routinely deal with datasets whose size reaches or exceeds the terabyte range, coming from sources as diverse as the internet, satellites, remote sensors, financial markets, and scientific experiments. To cope with such large datasets, a common approach is the stochastic gradient descent (SGD) algorithm, which uses only a small number of randomly picked data points in each iteration. In addition, there has recently been a surge of interest in variance-reduced (VR) stochastic gradient methods, which have shown faster convergence rates than traditional stochastic gradient methods.
Figure 1. Gradient descent (GD), accelerated gradient descent (AGD, i.e., accelerated GD in [50]), stochastic gradient descent (SGD), and ADAM [30] methods are compared with the variance reduction (VR) methods SAG and SVRG on the logistic regression problem based on the Mushroom dataset [7], where n = 8124 and d = 112.
Gradient descent (GD) is a classic algorithm used to solve the above problem (2). Its iterative update formula is as follows:
$$x_{k+1} = x_k - \gamma \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k) \quad (4)$$

Here, $\gamma$ is a fixed step size greater than zero. Each iteration of GD requires computing the gradient $\nabla f_i(x_k)$ for every data point $i$, which means that GD makes one complete pass over all $n$ data points per iteration. When $n$ is very large, the cost of each GD iteration becomes very high, which limits its applicability.
As an alternative, we can consider the stochastic gradient descent (SGD) method, which was first proposed by Robbins and Monro, and its iterative update formula is as follows:
$$x_{k+1} = x_k - \gamma \nabla f_{i_k}(x_k) \quad (5)$$

The SGD algorithm uses the gradient $\nabla f_{i_k}(x_k)$ of only one randomly selected data point in each iteration, which keeps the per-iteration cost low. In Figure 1, we can see that SGD makes more rapid progress than GD (including accelerated GD) in the early stages of the optimization. The figure measures progress in epochs, where one epoch is defined as $n$ evaluations of gradients of individual training samples. GD performs one iteration per epoch, while SGD performs $n$ iterations per epoch. We use epochs as the basis for comparing SGD and GD because, when $n$ is very large, the dominant cost of both methods is the computation of the gradients $\nabla f_i(x_k)$.
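To make the comparison concrete, here is a minimal Python/NumPy sketch (not from the original article) of the GD update (4) and the SGD update (5) applied to the least-squares objective (1); the synthetic data, step size, and number of epochs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))                     # rows are the feature vectors a_i
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(x, i):
    """Gradient of f_i(x) = (a_i^T x - b_i)^2, i.e. 2 (a_i^T x - b_i) a_i."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    """Full gradient (1/n) sum_i grad f_i(x)."""
    return 2.0 * A.T @ (A @ x - b) / n

gamma = 0.005
x_gd, x_sgd = np.zeros(d), np.zeros(d)
for epoch in range(50):
    x_gd -= gamma * full_grad(x_gd)                 # GD: one full-gradient step per epoch
    for _ in range(n):                              # SGD: n cheap steps per epoch
        i = rng.integers(n)                         # uniformly random index i_k
        x_sgd -= gamma * grad_i(x_sgd, i)           # update (5)
```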
Consider the case where the random index $i_k$ is selected uniformly at random from the set $\{1, \ldots, n\}$, so that for every $i$ the probability of choosing $i_k = i$ is $P[i_k = i] = \frac{1}{n}$. In this case, $\nabla f_{i_k}(x_k)$ is an unbiased estimator of $\nabla f(x_k)$, because, by the definition of expectation, we have:
$$\mathbb{E}\left[\nabla f_{i_k}(x_k) \mid x_k\right] = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x_k) = \nabla f(x_k) \quad (6)$$
Although the SGD method does not guarantee that the value of $f$ decreases at every step, on average it moves in the direction of the negative full gradient, which is a descent direction.
However, having an unbiased gradient estimator is not enough to ensure the convergence of the SGD iterates. To illustrate this point, Figure 2 (left) shows the SGD iterates with a constant step size applied to a logistic regression objective on the *fourclass* dataset from LIBSVM [7]. The concentric ellipses in the figure are the contour lines of the function, that is, the sets of points $x$ for which $f(x) = c$ for a given constant $c$; different constants $c$ correspond to different ellipses.
The SGD iterates do not converge to the optimal solution (indicated by the green asterisk) but instead form a point cloud around it. In contrast, Figure 2 (right) shows the iterates of a variance-reduced (VR) method, the stochastic average gradient (SAG) method introduced later, using the same constant step size. The reason SGD fails to converge in this example is that the stochastic gradients themselves do not converge to zero, so the constant-step SGD method (5) never comes to a stop. This is in stark contrast to the gradient descent (GD) method, which naturally stops: as $x_k$ approaches $x^*$, the gradient $\nabla f(x_k)$ tends to zero.
Figure 2. Level-set plots of a 2D logistic regression problem with the SGD (left) and SAG (right) iterates using a fixed step size. The green asterisk marks $x^*$.
There are several classical techniques for addressing the non-convergence caused by the variance of the $\nabla f_i(x_k)$. For example, Robbins and Monro [64] use a sequence of decreasing step sizes $\gamma_k$, ensuring that the products $\gamma_k \nabla f_{i_k}(x_k)$ converge to zero. However, tuning this sequence of decreasing step sizes so that the algorithm does not stop too early or too late is itself a difficult problem.
Another classic technique for reducing variance is to use the average of several $\nabla f_i(x_k)$ as an approximation of the full gradient $\nabla f(x)$. This approach is called minibatching and is particularly useful when multiple gradients can be evaluated in parallel. It leads to iterations of the form:

$$x_{k+1} = x_k - \gamma \frac{1}{|B_k|} \sum_{i \in B_k} \nabla f_i(x_k) \quad (7)$$

where $B_k$ is a set of random indices and $|B_k|$ denotes the size of $B_k$. If $B_k$ is sampled uniformly with replacement, the variance of the gradient estimate is inversely proportional to the batch size $|B_k|$, so the variance can be reduced by increasing the batch size.
However, the cost of this iteration is proportional to the batch size, so this form of variance reduction comes at the expense of increased computational cost.
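As a small illustration (not from the article), the minibatch estimate in (7) can be formed as follows; the data, loss, and batch size are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    # gradient of the least-squares loss f_i(x) = (a_i^T x - b_i)^2
    return 2.0 * (A[i] @ x - b[i]) * A[i]

def minibatch_grad(x, batch_size):
    """Estimate (7): average of grad f_i over a batch B_k sampled uniformly with replacement."""
    batch = rng.integers(n, size=batch_size)
    return np.mean([grad_i(x, i) for i in batch], axis=0)

x, gamma = np.zeros(d), 0.01
for k in range(2000):
    x -= gamma * minibatch_grad(x, batch_size=16)
```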
Another common strategy to reduce variance and improve the empirical performance of SGD is to add "momentum", an extra term based on the direction used in past steps. In particular, SGD with momentum takes the following form:
$$m_k = \beta m_{k-1} + \nabla f_{i_k}(x_k) \quad (8)$$

$$x_{k+1} = x_k - \gamma m_k \quad (9)$$
The momentum parameter $\beta$ lies in the range $(0, 1)$. If the initial momentum is $m_0 = 0$ and we expand the recursion (8) for $m_k$, we find that $m_k$ is a weighted sum of the previous gradients:

$$m_k = \sum_{t=0}^{k} \beta^{k-t} \nabla f_{i_t}(x_t) \quad (10)$$
Thus, $m_k$ is a weighted sum of stochastic gradients. Since $\sum_{t=0}^{k} \beta^{k-t} = \frac{1 - \beta^{k+1}}{1 - \beta}$, we can view $\frac{1 - \beta}{1 - \beta^{k+1}} m_k$ as a weighted average of the stochastic gradients. Comparing this with the expression for the full gradient, $\nabla f(x_k) = \frac{1}{n}\sum_{i=1}^{n} \nabla f_i(x_k)$, we can interpret $\frac{1 - \beta}{1 - \beta^{k+1}} m_k$ (and hence $m_k$) as an estimate of the full gradient. This weighted sum reduces variance, but it also brings a key problem: since the weighted sum (10) gives more weight to the most recently sampled gradients, it does not converge to the full gradient $\nabla f(x_k)$, which is a plain average. The first variance-reduced method we will see in Section II.A resolves this issue by using a plain average instead of a weighted average.
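As a quick illustration (not from the article), the momentum recursion (8)-(9) can be implemented in a few lines; the step size and momentum parameter below are arbitrary placeholder values.

```python
import numpy as np

def sgd_momentum(grad_i, x0, n, gamma=0.01, beta=0.9, num_steps=10_000, seed=0):
    """SGD with momentum: m_k = beta * m_{k-1} + grad f_{i_k}(x_k) and x_{k+1} = x_k - gamma * m_k."""
    rng = np.random.default_rng(seed)
    x, m = x0.copy(), np.zeros_like(x0)
    for _ in range(num_steps):
        i = rng.integers(n)              # uniformly random index i_k
        m = beta * m + grad_i(x, i)      # recursion (8): a weighted sum of past gradients
        x = x - gamma * m                # update (9)
    return x
```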
Unlike the classical methods, which directly use one or several $\nabla f_i(x_k)$ as an approximation of $\nabla f(x_k)$, modern variance-reduced (VR) methods take a different strategy. These methods use the $\nabla f_i(x_k)$ to update an estimate $g_k$ of the gradient, with the goal that $g_k$ approaches $\nabla f(x_k)$; that is, we want $g_k \approx \nabla f(x_k)$. Based on this gradient estimate, we then take an approximate gradient step of the form:

$$x_{k+1} = x_k - \gamma g_k \quad (11)$$

where $\gamma > 0$ is the step-size parameter.
To ensure that the iteration (11) with a constant step size $\gamma$ can converge, we need the variance of the gradient estimate $g_k$ to tend to zero. Mathematically, this can be expressed as:

$$\mathbb{E}\left[\| g_k - \nabla f(x_k) \|^2\right] \rightarrow 0 \quad \text{as } k \rightarrow \infty \quad (12)$$
The expectation $\mathbb{E}$ here is taken over the randomness of the algorithm up to iteration $k$. Property (12) ensures that a VR method comes to a stop when it reaches the optimal solution. We regard this property as the hallmark of VR methods and hence call it the VR property. It is worth noting that the expression "reduced" variance may be misleading, since in reality the variance tends to zero. Property (12) is the key factor that enables VR methods to achieve faster convergence in theory (under appropriate assumptions) and in practice (as shown in Figure 1).
A simple modification makes the SGD recursion (5) converge without decreasing the step size: shift each gradient by subtracting $\nabla f_i(x^*)$. The resulting method is defined as follows:

$$x_{k+1} = x_k - \gamma \left(\nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*)\right) \quad (13)$$
This method is called SGD⋆ [22]. Although we usually do not know $\nabla f_i(x^*)$ exactly, SGD⋆ is an example that illustrates well the basic properties of variance-reduced methods. Moreover, many variance-reduced methods can be viewed as approximations of SGD⋆: these methods do not rely on knowing each $\nabla f_i(x^*)$, but instead use estimates that approximate $\nabla f_i(x^*)$.
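Purely for illustration (not from the article), the following sketch contrasts SGD with the idealized SGD⋆ update (13) on a least-squares problem where $x^*$, and hence $\nabla f_i(x^*)$, can be computed in closed form; all data and parameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]    # minimizer of the least-squares objective

def grad_i(x, i):
    # gradient of f_i(x) = (a_i^T x - b_i)^2
    return 2.0 * (A[i] @ x - b[i]) * A[i]

gamma = 0.005
x = np.zeros(d)
for k in range(20 * n):
    i = rng.integers(n)
    # SGD-star step (13): the shifted stochastic gradient vanishes at x = x_star,
    # so the iteration naturally comes to a stop at the solution.
    x -= gamma * (grad_i(x, i) - grad_i(x_star, i))
```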
It is important to note that SGD⋆ uses an unbiased estimate of the full gradient: since $\nabla f(x^*) = 0$, we have:

$$\mathbb{E}\left[\nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*)\right] = \nabla f(x_k) - \nabla f(x^*) = \nabla f(x_k)$$
In addition, SGD⋆ naturally comes to a stop when it reaches the optimal solution, because for every $i$ we have:

$$\left(\nabla f_i(x) - \nabla f_i(x^*)\right)\Big|_{x = x^*} = 0$$
Furthermore, for $x_k$ near $x^*$ (and continuous $\nabla f_i$), SGD⋆ satisfies the variance-reduction property (12), because:

$$\mathbb{E}\left[\| g_k - \nabla f(x_k) \|^2\right] = \mathbb{E}\left[\| \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*) - \nabla f(x_k) \|^2\right] \leq \mathbb{E}\left[\| \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*) \|^2\right]$$
Here we used Lemma 2 with $X = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*)$, together with the fact that $\mathbb{E}[\nabla f_{i_k}(x_k) - \nabla f_{i_k}(x^*)] = \nabla f(x_k)$. This property is what allows SGD⋆ to converge faster than the traditional SGD method, as we explain in detail in Appendix B.
In this section we introduce two standard assumptions used in the analysis of variance reduction (VR) methods and discuss the speedups that can be achieved under these assumptions compared to traditional SGD methods. First, we assume that the gradient has Lipschitz continuity, which means that the gradient can change at a finite rate.
We assume (Assumption 1) that the function $f$ is differentiable and $L$-smooth; that is, for all $x$ and $y$ and some $0 < L < \infty$, the following holds:

$$\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\| \quad (14)$$
We also assume that each $f_i : \mathbb{R}^d \rightarrow \mathbb{R}$ is differentiable and $L_i$-smooth, and we define $L_{\max} := \max\{L_1, \ldots, L_n\}$.
Although this is generally considered a weak assumption, in later sections we will discuss VR methods for nonsmooth problems. For a twice-differentiable univariate function, $L$-smoothness has an intuitive interpretation: it is equivalent to assuming that the second derivative is bounded by $L$, i.e., $|f''(x)| \leq L$ for all $x \in \mathbb{R}$. For a twice-differentiable function of several variables, it is equivalent to assuming that the singular values of the Hessian matrix $\nabla^2 f(x)$ are bounded above by $L$ for all $x \in \mathbb{R}^d$.
The second assumption we consider (Assumption 2) is that the function $f$ is $\mu$-strongly convex, meaning that for some $\mu > 0$ the function $x \mapsto f(x) - \frac{\mu}{2}\|x\|^2$ is convex. In addition, each $f_i : \mathbb{R}^d \rightarrow \mathbb{R}$, $i = 1, \ldots, n$, is convex.
This is a strong assumption. In the least-squares problem, each $f_i$ is convex, but the overall function $f$ is strongly convex only when the design matrix $A := [a_1, \ldots, a_n]$ has full row rank. The L2-regularized logistic regression problem satisfies this assumption thanks to the regularization term, with $\mu \geq \lambda$.
An important class of problems satisfying these assumptions consists of optimization problems of the form:

$$x^* \in \arg\min_{x \in \mathbb{R}^d} f(x) = \frac{1}{n} \sum_{i=1}^{n} \ell_i(a_i^T x) + \frac{\lambda}{2}\|x\|^2 \quad (15)$$
Each of the "loss" functions
ℓ
i
:
R
→
R
ell_i: mathbb{R} rightarrow mathbb{R}
ℓi:R→R is twice differentiable, and its second-order derivative
ℓ
i
′
′
ell_i''
ℓi′′ is restricted to 0 and some upper bound
M
M
M This includes various loss functions with L2 regularization in machine learning, such as least squares, logistic regression, probit regression, Huber robust regression, etc. In this case, for all
i
i
i,We have
L
i
≤
M
∥
a
i
∥
2
+
λ
L_i leq M|a_i|^2 + lambda
Li≤M∥ai∥2+λ and
μ
≥
λ
mu geq lambda
μ≥λ。
Under these assumptions, the convergence rate of the gradient descent (GD) method is governed by the condition number $\kappa := L/\mu$. The condition number is always greater than or equal to 1. When it is significantly larger than 1, the contours of the function become very elongated ellipses, causing the GD iterates to oscillate; conversely, when $\kappa$ is close to 1, GD converges quickly.
Under Assumptions 1 and 2, VR methods converge at a linear rate. We say that the function values $\{f(x_k)\}$ of a stochastic method converge linearly (in expectation) at rate $0 < \rho \leq 1$ if there exists a constant $C > 0$ such that:

$$\mathbb{E}[f(x_k)] - f(x^*) \leq (1 - \rho)^k C = O(\exp(-k\rho)) \quad \forall k \quad (16)$$
This is in contrast with the classical SGD method, which relies only on an unbiased estimate of the gradient at each iteration and can only achieve a sublinear rate under these assumptions:

$$\mathbb{E}[f(x_k)] - f(x^*) \leq O(1/k)$$
Given a target accuracy $\epsilon$, the smallest number of iterations $k$ that guarantees $\mathbb{E}[f(x_k)] - f(x^*) \leq \epsilon$ is called the iteration complexity of the algorithm. The table below gives the iteration complexity and the cost of one iteration for the basic variants of GD, SGD, and VR methods:

| Algorithm | Iteration complexity | Cost of one iteration |
|---|---|---|
| GD | $O(\kappa \log(1/\epsilon))$ | $O(n)$ |
| SGD | $O(\kappa_{\max}/\epsilon)$ | $O(1)$ |
| VR | $O((\kappa_{\max} + n)\log(1/\epsilon))$ | $O(1)$ |
The total running time of an algorithm is given by the product of its iteration complexity and the cost of one iteration. Here $\kappa_{\max} := \max_i L_i/\mu$. Note that $\kappa_{\max} \geq \kappa$; consequently, the iteration complexity of GD is smaller than that of the VR methods.
However, since each GD iteration is $n$ times more expensive, the VR methods win in terms of total running time.
The advantage of classical SGD is that its running time does not depend on $n$, but its dependence on the tolerance $\epsilon$ is much worse, which explains the poor performance of SGD when the tolerance is small.
In Appendix B, we provide a simple proof showing that the SGD⋆ method has the same iteration complexity as the VR methods.
The development of variance-reduced (VR) methods came in several waves, with the first wave of methods already achieving the improved convergence rates discussed above. This line of work began with the SAG algorithm. It was followed by the stochastic dual coordinate ascent (SDCA) algorithm, the MISO algorithm, the stochastic variance-reduced gradient (SVRG/S2GD) algorithm, and the SAGA algorithm (an "improved" SAG).
In this section, we introduce these seminal VR methods in detail; in Section 4, we explore newer methods that improve on these basic methods in specific settings.
Our first variance-reduced (VR) method is motivated by mimicking the structure of the full gradient: since $\nabla f(x)$ is a plain average of all the $\nabla f_i(x)$, our estimate $g_k$ of the full gradient should likewise be an average of estimates of the individual gradients. This idea gives rise to our first VR method: the stochastic average gradient (SAG) method.
The SAG method [37], [65] is a randomized version of the earlier incremental aggregated gradient (IAG) method [4]. The core idea of SAG is to maintain, for each data point $i$, an estimate $v_i^k \approx \nabla f_i(x_k)$, and then to use the average of these $v_i^k$ as an estimate of the full gradient, that is:

$$\bar{g}_k = \frac{1}{n} \sum_{j=1}^{n} v_j^k \approx \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(x_k) = \nabla f(x_k) \quad (18)$$
In each iteration of SAG, an index $i_k$ is drawn from the set $\{1, \ldots, n\}$, and the $v_j^k$ are then updated according to the rule:

$$v_j^{k+1} = \begin{cases} \nabla f_{i_k}(x_k), & \text{if } j = i_k \\ v_j^k, & \text{if } j \neq i_k \end{cases} \quad (19)$$
Each $v_i^0$ can be initialized to zero or to $\nabla f_i(x_0)$. As the iterates approach the solution $x^*$, the $v_i^k$ gradually converge to $\nabla f_i(x^*)$, so the VR property (12) is satisfied.
To implement SAG efficiently, we must compute $\bar{g}_k$ without re-summing all $n$ vectors from scratch at every iteration, since this would be very expensive when $n$ is large. Fortunately, since only one of the $v_j^k$ terms changes per iteration, we can avoid recomputing the entire sum. Specifically, if index $i_k$ is drawn at iteration $k$, then:

$$\bar{g}_k = \frac{1}{n} \sum_{\substack{j=1 \\ j \neq i_k}}^{n} v_j^k + \frac{1}{n} v_{i_k}^k = \bar{g}_{k-1} - \frac{1}{n} v_{i_k}^{k-1} + \frac{1}{n} v_{i_k}^k \quad (20)$$
Because all the $v_j^k$ except $v_{i_k}^k$ remain unchanged, we only need to store one vector $v_j$ for each $j$. Algorithm 1 gives the resulting implementation of the SAG method.
SAG was the first stochastic method shown to achieve linear convergence, with iteration complexity $O((\kappa_{\max} + n)\log(1/\epsilon))$ using a step size $\gamma = O(1/L_{\max})$. This linear convergence can be observed in Figure 1. It is worth noting that, since an $L_{\max}$-smooth function is also $L'$-smooth for any $L' \geq L_{\max}$, SAG achieves a linear convergence rate for any sufficiently small step size. This is in stark contrast with classical SGD, which achieves only sublinear rates, and only with decreasing step-size sequences that are difficult to tune in practice.
At the time, the linear convergence of SAG was a significant advance, because SAG computes only a single stochastic gradient (processes a single data point) per iteration. However, the convergence proof given by Schmidt et al. [65] is very involved and relies on a computer-verified step. A key reason why SAG is difficult to analyze is that $g_k$ is a biased estimator of the gradient.
Next, we present the SAGA method, a variant of SAG that exploits the concept of covariates to obtain an unbiased gradient estimate; it has similar performance but is much easier to analyze.
Algorithm 1: SAG method
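The algorithm box is not reproduced here; the following is a minimal Python sketch of the SAG iteration described above (updates (19) and (20)), with placeholder parameters rather than the authors' reference implementation.

```python
import numpy as np

def sag(grad_i, x0, n, gamma, num_steps, seed=0):
    """SAG sketch: keep one stored gradient v_j per data point and maintain
    their running average g_bar in O(d) per iteration via update (20)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = np.zeros((n, x0.size))           # stored gradients v_j (initialized to zero)
    g_bar = np.zeros_like(x0)            # running average (1/n) * sum_j v_j
    for _ in range(num_steps):
        i = rng.integers(n)              # sample i_k uniformly
        g_new = grad_i(x, i)
        g_bar += (g_new - v[i]) / n      # update (20): swap the old v_i for the new gradient
        v[i] = g_new                     # update (19)
        x = x - gamma * g_bar            # SAG step: x_{k+1} = x_k - gamma * g_bar_k
    return x
```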
A basic way to reduce the variance of the unbiased gradient estimator $\nabla f_{i_k}(x_k)$ is to use so-called covariates (also known as control variates). For $i = 1, \ldots, n$, let $v_i \in \mathbb{R}^d$ be a vector. Using these vectors, we can rewrite the full gradient $\nabla f(x)$ as:

$$\nabla f(x) = \frac{1}{n} \sum_{i=1}^{n} \left(\nabla f_i(x) - v_i + v_i\right) = \frac{1}{n} \sum_{i=1}^{n} \left(\nabla f_i(x) - v_i\right) + \frac{1}{n} \sum_{j=1}^{n} v_j =: \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(x, v) \quad (21)$$

where we define $\nabla f_i(x, v) := \nabla f_i(x) - v_i + \frac{1}{n} \sum_{j=1}^{n} v_j$. We can now sample one of the $\nabla f_i(x, v)$ uniformly at random to build an unbiased estimate of the full gradient $\nabla f(x)$. That is, sampling $i_k$ uniformly from $\{1, \ldots, n\}$, we can apply the SGD method with the gradient estimate:

$$g_k = \nabla f_{i_k}(x_k, v) = \nabla f_{i_k}(x_k) - v_{i_k} + \frac{1}{n} \sum_{j=1}^{n} v_j \quad (22)$$
To see how the choice of the $v_i$ affects the variance of $g_k$, we substitute $g_k = \nabla f_{i_k}(x_k, v)$ and use $\mathbb{E}_i[v_i] = \frac{1}{n}\sum_{j=1}^{n} v_j$ (with $i$ sampled uniformly) to compute:

$$\mathbb{E}\left[\left\| \nabla f_i(x_k) - v_i + \mathbb{E}_i\left[v_i - \nabla f_i(x_k)\right] \right\|^2\right] \leq \mathbb{E}\left[\| \nabla f_i(x_k) - v_i \|^2\right] \quad (23)$$
Here Lemma 2 was used with $X = \nabla f_i(x_k) - v_i$. The bound (23) shows that if the $v_i$ approach $\nabla f_i(x_k)$ as $k$ increases, we obtain the VR property (12). This is why we call the $v_i$ covariates: we can choose them so as to reduce the variance.
For example, the SGD⋆ method (13) is an instance of this approach with $v_i = \nabla f_i(x^*)$. This choice is rarely usable in practice, however, because we usually do not know $\nabla f_i(x^*)$. A more practical option is to take as $v_i$ the gradient $\nabla f_i(\bar{x}_i)$ at some known point $\bar{x}_i \in \mathbb{R}^d$ near the current iterate. SAGA maintains, for each $f_i$, a reference point $\bar{x}_i \in \mathbb{R}^d$ and uses the covariates $v_i = \nabla f_i(\bar{x}_i)$, where each $\bar{x}_i$ is the last point at which $f_i$ was evaluated. Using these covariates and following (22), the gradient estimate becomes:
$$g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(\bar{x}_{i_k}) + \frac{1}{n} \sum_{j=1}^{n} \nabla f_j(\bar{x}_j) \quad (24)$$
To implement SAGA, we can store the gradients $\nabla f_i(\bar{x}_i)$ rather than the $n$ reference points $\bar{x}_i$. That is, we store $v_j = \nabla f_j(\bar{x}_j)$ for $j \in \{1, \ldots, n\}$ and, at each iteration, update a single stochastic gradient $v_j$, exactly as in SAG.
Algorithm 2 SAGA
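The algorithm box is not reproduced here; below is a minimal Python sketch of the SAGA iteration with the unbiased estimate (24), using placeholder parameters rather than the authors' reference implementation.

```python
import numpy as np

def saga(grad_i, x0, n, gamma, num_steps, seed=0):
    """SAGA sketch: store v_j = grad f_j at the last point where f_j was evaluated
    and use the unbiased estimate (24) at every step."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    v = np.zeros((n, x0.size))           # stored gradients v_j = grad f_j(x_bar_j)
    v_avg = np.zeros_like(x0)            # running average (1/n) * sum_j v_j
    for _ in range(num_steps):
        i = rng.integers(n)
        g_new = grad_i(x, i)
        g_k = g_new - v[i] + v_avg       # unbiased estimate (24)
        x = x - gamma * g_k
        v_avg += (g_new - v[i]) / n      # keep the average of stored gradients up to date
        v[i] = g_new
    return x
```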
The SAGA method has the same iteration complexity as SAG, $O((\kappa_{\max} + n)\log(1/\epsilon))$ with step size $\gamma = O(1/L_{\max})$, but its proof is much simpler. However, like SAG, the SAGA method requires storing $n$ auxiliary vectors $v_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, which amounts to $O(nd)$ memory. This may be infeasible when both $d$ and $n$ are large. In the next section, we detail how this memory requirement can be reduced for common models such as regularized linear models.
When it is feasible to store $n$ auxiliary vectors in memory, SAG and SAGA tend to perform similarly. If this memory requirement is too high, the SVRG method, which we review in the next section, is a good alternative: it achieves the same convergence rate and is often nearly as fast in practice, but it requires only $O(d)$ memory for general problems.
Before the advent of the SAGA method, some earlier works introduced covariates to address the high memory requirement of SAG. These works construct covariates based on a fixed reference point $\bar{x} \in \mathbb{R}^d$ at which the full gradient $\nabla f(\bar{x})$ has already been computed. By storing the reference point $\bar{x}$ and the corresponding full gradient $\nabla f(\bar{x})$, we can implement the update (24) with $\bar{x}_j = \bar{x}$ for all $j$ without storing each $\nabla f_j(\bar{x})$: instead of storing these vectors, we recompute $\nabla f_{i_k}(\bar{x})$ at each iteration from the stored reference point $\bar{x}$. This method was proposed independently by several authors under different names, but it is now commonly referred to as SVRG, following the naming of [28] and [84].
We formalize the SVRG method in Algorithm 3.
Using (23), we can bound the variance of the gradient estimate $g_k$:

$$\mathbb{E}\left[\| g_k - \nabla f(x_k) \|^2\right] \leq \mathbb{E}\left[\| \nabla f_i(x_k) - \nabla f_i(\bar{x}) \|^2\right] \leq L_{\max}^2 \| x_k - \bar{x} \|^2$$

The second inequality uses the $L_i$-smoothness of each $f_i$.
It is worth noting that the closer the reference point $\bar{x}$ is to the current point $x_k$, the smaller the variance of the gradient estimate.
For the SVRG method to be effective, the reference point $\bar{x}$ must be updated regularly. There is a trade-off between the cost of updating the reference point (which requires computing a full gradient) and the benefit of reduced variance: we update the reference point every $t$ iterations so that it stays close to $x_k$ (see line 11 of Algorithm 3). That is, the SVRG method contains two loops: an outer loop over $s$, in which the reference gradient $\nabla f(\bar{x}_{s-1})$ is computed (line 4), and an inner loop in which the reference point is fixed and the inner iterates $x_k$ are updated via the stochastic gradient step (22) (line 10).
Unlike SAG and SAGA, SVRG requires only $O(d)$ memory. Its disadvantages are that (1) there is an extra parameter $t$, the length of the inner loop, that needs to be tuned, and (2) two gradients must be computed per inner iteration, plus a full gradient each time the reference point is changed.
Johnson and Zhang [28] showed that SVRG has iteration complexity $O((\kappa_{\max} + n)\log(1/\epsilon))$, similar to SAG and SAGA. This result assumes that the inner-loop length $t$ is sampled uniformly from the set $\{1, \ldots, m\}$, with certain required relationships between $L_{\max}$, $\mu$, the step size $\gamma$, and $t$. In practice, SVRG tends to perform well with $\gamma = O(1/L_{\max})$ and inner-loop length $t = n$, which is exactly the setting we use in Figure 1.
There are many variants of the original SVRG method. For example, some variants use different choices of the inner-loop length $t$ [32], and some allow step sizes of the form $O(1/L_{\max})$ [27], [33], [35]. Other variants reduce the cost of the full-gradient evaluations by replacing $\nabla f(\bar{x})$ with a mini-batch approximation, increasing the mini-batch size over time to preserve the VR property. There are also variants that repeatedly update $g_k$ within the inner loop according to [54]:

$$g_k = \nabla f_{i_k}(x_k) - \nabla f_{i_k}(x_{k-1}) + g_{k-1} \quad (25)$$

which provides a more local approximation. This continuously updated variant (25) has proven particularly advantageous when minimizing non-convex functions, as we briefly discuss in Section 4. Finally, note that SVRG can use the value of $\nabla f(\bar{x}_s)$ to help decide when to terminate the algorithm.
Algorithm 3 SVRG method
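The algorithm box is not reproduced here; the following is a minimal Python sketch of the two-loop SVRG scheme described above (a reference point, a full gradient at that point, and inner stochastic steps), with placeholder parameters rather than the authors' reference implementation.

```python
import numpy as np

def svrg(grad_i, full_grad, x0, n, gamma, num_epochs, inner_len=None, seed=0):
    """SVRG sketch: the outer loop recomputes the full gradient at a reference point
    x_bar; the inner loop uses g_k = grad f_i(x_k) - grad f_i(x_bar) + grad f(x_bar)."""
    rng = np.random.default_rng(seed)
    inner_len = n if inner_len is None else inner_len   # t = n is a common practical choice
    x = x0.copy()
    for s in range(num_epochs):
        x_bar = x.copy()                 # update the reference point
        g_bar = full_grad(x_bar)         # full gradient at the reference point
        for _ in range(inner_len):
            i = rng.integers(n)
            g_k = grad_i(x, i) - grad_i(x_bar, i) + g_bar
            x = x - gamma * g_k
    return x
```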
A disadvantage of the SAG and SVRG methods is that their step size depends on $L_{\max}$. Predating SVRG, the SDCA method [70] was one of the earliest VR methods, extending earlier work on coordinate descent methods to the finite-sum problem. The idea behind SDCA and its variants is that the coordinates of the gradient provide natural variance-reduced gradient estimates. Specifically, for $j \in \{1, \ldots, d\}$, define $\nabla_j f(x) := \left( \frac{\partial f(x)}{\partial x_j} \right) e_j$, where $\frac{\partial f(x)}{\partial x_j}$ is the derivative along the $j$-th coordinate and $e_j \in \mathbb{R}^d$ is the $j$-th unit vector. A key property of coordinate derivatives is that $\nabla_j f(x^*) = 0$, since we know that $\nabla f(x^*) = 0$. This is unlike the gradients of the individual data points, $\nabla f_j$, which need not be zero at $x^*$. Therefore, we have:

$$\| \nabla f(x) - \nabla_j f(x) \|^2 \rightarrow 0 \quad \text{as} \quad x \rightarrow x^* \quad (26)$$
This means that the coordinate derivatives satisfy the variance-reduction property (12). Moreover, we can use $\nabla_j f(x)$ to build an unbiased estimate of $\nabla f(x)$. For example, let $j$ be an index chosen uniformly at random from $\{1, \ldots, d\}$, so that for any $i \in \{1, \ldots, d\}$ we have $P[j = i] = \frac{1}{d}$. Then $d \times \nabla_j f(x)$ is an unbiased estimate of $\nabla f(x)$, because:

$$\mathbb{E}\left[d\, \nabla_j f(x)\right] = d \sum_{i=1}^{d} P[j = i]\, \frac{\partial f(x)}{\partial x_i} e_i = \sum_{i=1}^{d} \frac{\partial f(x)}{\partial x_i} e_i = \nabla f(x)$$
Thus, $\nabla_j f(x)$ has all of the desirable properties we expect from a VR estimate of the full gradient, without the need for covariates. One drawback of using this coordinate gradient is that it is expensive to compute for our sum problem (2): computing $\nabla_j f(x)$ requires a full pass over the dataset, since $\nabla_j f(x) = \frac{1}{n}\sum_{i=1}^{n} \nabla_j f_i(x)$. Coordinate derivatives therefore seem incompatible with the structure of our sum problem. However, we can often rewrite the original problem (2) in a so-called dual formulation in which coordinate derivatives can exploit the inherent structure.
For example, the dual formula of the L2 regularized linear model (15) is:
$$v^* \in \arg\max_{v \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^{n} -\ell_i^*(-v_i) - \frac{\lambda}{2} \left\| \frac{1}{\lambda} \sum_{i=1}^{n} v_i a_i \right\|^2 \quad (27)$$
where $\ell_i^*(v)$ is the convex conjugate of $\ell_i$. We can recover the variable $x$ of the original problem (15) through the mapping $x = \frac{1}{\lambda}\sum_{i=1}^{n} v_i a_i$; substituting $v^*$ into the right-hand side of this mapping yields the solution $x^*$ of (15).
Note that this dual problem has $n$ real variables $v_i \in \mathbb{R}$, one for each training sample. Moreover, each dual loss function $\ell_i^*$ is a function of $v_i$ only; that is, the first term of the dual objective is separable across coordinates. This separability, combined with the simple form of the second term, allows coordinate ascent methods to be implemented efficiently. In fact, Shalev-Shwartz and Zhang show that coordinate ascent on this problem has an iteration complexity of $O((\kappa_{\max} + n)\log(1/\epsilon))$, similar to SAG, SAGA, and SVRG.
The iteration cost and algorithm structure are also very similar: by keeping track of the sum $\sum_{i=1}^{n} v_i a_i$ to handle the second term in (27), each dual coordinate-ascent iteration needs to consider only one training example, so its cost does not grow with $n$. Moreover, we can compute the step size efficiently using a one-dimensional line search that maximizes the dual objective as a function of $v_i$. This means that fast worst-case running times for VR methods can be achieved even without knowledge of $L_{\max}$ or related quantities.
In order to implement basic variance reduction (VR) methods and achieve reasonable performance, several implementation issues must be addressed. In this section, we discuss several issues not covered above.
In variance-reduced methods such as the stochastic average gradient (SAG), SAGA, and the stochastic variance-reduced gradient (SVRG) method, setting the step size is a key issue. While for the stochastic dual coordinate ascent (SDCA) method we can use the dual objective to determine the step size, the theory for primal methods such as SAG, SAGA, and SVRG prescribes a step size of $\gamma = O\left(\frac{1}{L_{\max}}\right)$. In practice, however, we often do not know the exact value of $L_{\max}$, and other step sizes may give better performance.
A classic strategy for setting the step size in full-GD is the Armijo line search. Given the current point $x_k$ and a search direction $g_k$, the Armijo line search selects, among candidate step sizes, a $\gamma_k$ for which the function is sufficiently decreased, that is:

$$f(x_k + \gamma_k g_k) < f(x_k) - c\, \gamma_k \|\nabla f(x_k)\|^2$$
However, this approach requires evaluating $f(x_k + \gamma_k g_k)$ for multiple candidate step sizes $\gamma_k$, and evaluating $f(x)$ is too expensive when it requires a pass over the entire dataset.
To address this, we can use a stochastic variant that searches for a $\gamma_k$ satisfying:

$$f_{i_k}(x_k + \gamma_k g_k) < f_{i_k}(x_k) - c\, \gamma_k \|\nabla f_{i_k}(x_k)\|^2$$
This approach tends to work well in practice, at least when $\|\nabla f_{i_k}(x_k)\|$ is not close to zero, although there is currently no theory supporting it.
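For illustration (not from the article), a backtracking version of this stochastic Armijo test might look as follows; the initial step size, the backtracking factor, and the constant `c` are arbitrary choices.

```python
def stochastic_armijo(f_i, grad_f_i, x, i, direction, gamma0=1.0, c=1e-4, shrink=0.5, max_tries=20):
    """Backtracking search: shrink gamma until the single-sample Armijo test holds.
    f_i(x, i) and grad_f_i(x, i) evaluate the loss/gradient of the sampled data point."""
    fx = f_i(x, i)
    g = grad_f_i(x, i)
    g_norm_sq = float(g @ g)
    gamma = gamma0
    for _ in range(max_tries):
        if f_i(x + gamma * direction, i) < fx - c * gamma * g_norm_sq:
            return gamma                 # sufficient decrease on the sampled function f_{i_k}
        gamma *= shrink                  # otherwise backtrack
    return gamma
```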
In addition, Mairal proposed using the "Bottou trick" to set the step size in practice. This method performs a binary search over a small subset of the dataset (for example, 5%) to find the step size that performs best during a single pass over this subset. Like the stochastic Armijo line search, it usually works well in practice but also lacks theoretical support.
However, the SDCA method also has some disadvantages. First, it requires computing the convex conjugates $\ell_i^*$ rather than simple gradients. We do not have an equivalent of automatic differentiation for convex conjugation, so this may increase the implementation effort. Recent work has proposed "dual-free" SDCA variants that avoid the conjugate and instead use the gradient directly, but in these methods it is no longer possible to track the dual objective to set the step size. Second, although SDCA requires only $O(n + d)$ memory to solve problem (15), SAG/SAGA also require only $O(n + d)$ memory for this problem class (see Section 3). SDCA variants that apply to more general problems require $O(nd)$ memory, because the $v_i$ become $d$-dimensional vectors. A final subtle drawback of SDCA is that it implicitly assumes that the strong convexity constant $\mu$ equals $\lambda$; for problems where $\mu$ is larger than $\lambda$, the primal VR methods usually outperform SDCA significantly.
In the analysis of optimization algorithms, we usually rely on theoretical iteration-complexity results to predict the worst-case number of iterations needed to reach a given accuracy. However, these theoretical bounds often depend on constants that we cannot know in advance, and in practice algorithms often reach the desired accuracy in far fewer iterations. We therefore need practical tests for deciding when to stop the algorithm.
In the traditional full-gradient descent (full-GD) method, we usually use the norm of the gradient $\|\nabla f(x_k)\|$, or some related quantity, to decide when to stop iterating. For the SVRG method, we can use the same criterion, but with $\|\nabla f(\bar{x}_s)\|$. For the SAG/SAGA methods, although we never explicitly compute the full gradient, the quantity $\bar{g}_k$ gradually approaches $\nabla f(x_k)$, so using $\|\bar{g}_k\|$ as a stopping criterion is a reasonable heuristic.
In the SDCA method, with some extra bookkeeping, we can track the gradient of the dual objective at no additional asymptotic cost. Alternatively, a more systematic approach is to track the duality gap; computing it adds an $O(n)$ cost, but it provides a termination criterion certified by the duality gap. In addition, based on the optimality conditions of strongly convex objectives, the MISO method adopts a principled approach built on a quadratic lower bound [41].
Although the SVRG algorithm eliminates the memory requirement of the earlier VR methods, in practice the SAG and SAGA algorithms often require fewer iterations than SVRG on many problems. This raises the question of whether there are problem classes for which the $O(nd)$ memory requirement of SAG/SAGA can be reduced. This section explores the class of linear models, for which the memory requirement can be reduced substantially.
Consider a linear model in which each function $f_i(x)$ can be written as $\xi_i(a_i^\top x)$. Differentiating with respect to $x$ gives the gradient:

$$\nabla f_i(x) = \xi_i'(a_i^\top x)\, a_i$$

where $\xi_i'$ denotes the derivative of $\xi_i$. Assuming we have direct access to the feature vectors $a_i$, implementing the SAG/SAGA methods then only requires storing the scalars $\xi_i'(a_i^\top x)$. The memory requirement is thus reduced from $O(nd)$ to $O(n)$. The SVRG algorithm can also exploit this structure of the gradient: by storing these $n$ scalars, the number of gradient evaluations required per "inner" iteration of SVRG is reduced to one for this class of problems.
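As an illustration of this memory saving (not from the article), a SAGA-style update for a linear model can store one scalar per example instead of one $d$-dimensional vector; the logistic loss used below is just one example of a suitable $\xi_i$, and all parameters are placeholders.

```python
import numpy as np

def saga_linear(A, b, gamma, num_steps, seed=0):
    """SAGA for a linear model f_i(x) = xi_i(a_i^T x): since grad f_i(x) = xi_i'(a_i^T x) a_i,
    it suffices to store the n scalars xi_i'(a_i^T x_bar_i) instead of n full gradients."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    s = np.zeros(n)                      # stored scalars xi_i'(a_i^T x_bar_i)
    v_avg = np.zeros(d)                  # (1/n) * sum_j s_j * a_j
    for _ in range(num_steps):
        i = rng.integers(n)
        # logistic loss xi_i(z) = log(1 + exp(-b_i z)); its derivative is -b_i / (1 + exp(b_i z))
        s_new = -b[i] / (1.0 + np.exp(b[i] * (A[i] @ x)))
        g_k = (s_new - s[i]) * A[i] + v_avg   # unbiased SAGA estimate (24), linear-model form
        x = x - gamma * g_k
        v_avg += (s_new - s[i]) * A[i] / n
        s[i] = s_new
    return x
```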
There are other classes of problems, such as probabilistic graphical models, that also offer the potential to reduce memory requirements [66]. With suitable data structures and algorithmic choices, the memory required by the algorithm at run time can be reduced further.
In some problems, the gradients $\nabla f_i(x)$ may contain many zero entries, for example in linear models with sparse features. In this case, the traditional stochastic gradient descent (SGD) algorithm can be implemented efficiently, with a per-iteration cost linear in the number of nonzero elements of the gradient, which is usually much smaller than the problem dimension $d$. However, standard variance-reduced (VR) methods do not exploit this advantage. Fortunately, two known modifications address this.
The first, proposed by Schmidt et al., exploits the simplicity of the update and implements an "on-the-fly" (just-in-time) variant whose per-iteration cost is proportional to the number of nonzero entries. In the case of SAG (although the approach applies to all variants), the idea is not to update the full vector $v_{i_k}$ at each iteration, but to compute only the entries $v_{i_k, j}$ corresponding to nonzero features, with each such entry updated in a delayed fashion based on the number of iterations since it was last nonzero.
The second improvement, proposed by Leblond et al. for SAGA, introduces additional randomness into the update $x_{k+1} = x_k - \gamma(\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\bar{x}_{i_k}) + \bar{g}_k)$. Here, $\nabla f_{i_k}(x_k)$ and $\nabla f_{i_k}(\bar{x}_{i_k})$ are sparse, while $\bar{g}_k$ is dense. In this method, each component $(\bar{g}_k)_j$ of the dense term is replaced by $w_j (\bar{g}_k)_j$, where $w \in \mathbb{R}^d$ is a random sparse vector whose support is contained in the support of $\nabla f_{i_k}(x_k)$ and whose expectation is the all-ones vector. In this way, the update remains unbiased (although now sparse), and the added variance does not affect the convergence rate of the algorithm. Leblond et al. provide further details.