Line Search Methods

This is pag Printer: O CHAP T E R3 Line Search Methods Each iteration of a line search method computes a search direction pk and then decides how far to move along that direction. The iteration is given by xk+1 � xk + αk pk, (3.1) where the positive scalar αk is called the step length. The success of a line search method depends on effective choices of both the direction pk and the step length αk . Most line search algorithms require pk to be a descent direction—one for which pTk ∇ fk < 0—because this property guarantees that the function f can be reduced along 3 . 1 . S T E P L E N G T H 31 this direction, as discussed in the previous chapter. Moreover, the search direction often has the form pk � −B−1k ∇ fk, (3.2) where Bk is a symmetric and nonsingular matrix. In the steepest descent method, Bk is simply the identity matrix I , while in Newton’s method, Bk is the exact Hessian ∇2 f (xk). In quasi-Newton methods, Bk is an approximation to the Hessian that is updated at every iteration by means of a low-rank formula. When pk is defined by (3.2) and Bk is positive definite, we have pTk ∇ fk � −∇ f Tk B−1k ∇ fk < 0, and therefore pk is a descent direction. In this chapter, we discuss how to choose αk and pk to promote convergence from remote starting points. We also study the rate of convergence of steepest descent, quasi- Newton, andNewtonmethods. Since thepureNewton iteration is not guaranteed toproduce descent directionswhen the current iterate is not close to a solution,wediscussmodifications in Section 3.4 that allow it to start from any initial point. We now give careful consideration to the choice of the step-length parameter αk . 3.1 STEP LENGTH In computing the step length αk , we face a tradeoff. We would like to choose αk to give a substantial reduction of f , but at the same time we do not want to spend too much time making the choice. The ideal choice would be the global minimizer of the univariate function φ(·) defined by φ(α) � f (xk + αpk), α > 0, (3.3) but in general, it is too expensive to identify this value (see Figure 3.1). To find even a local minimizer of φ to moderate precision generally requires too many evaluations of the objec- tive function f and possibly the gradient ∇ f . More practical strategies perform an inexact line search to identify a step length that achieves adequate reductions in f at minimal cost. Typical line search algorithms try out a sequence of candidate values for α, stopping to accept one of these valueswhen certain conditions are satisfied. The line search is done in two stages: A bracketing phase finds an interval containing desirable step lengths, and a bisection or interpolation phase computes a good step length within this interval. Sophisticated line search algorithms can be quite complicated, so we defer a full description until Section 3.5. 32 C H A P T E R 3 . L I N E S E A R C H M E T H O D S (φ α) point stationary first minimizer localfirst global minimizer α Figure 3.1 The ideal step length is the global minimizer. We now discuss various termination conditions for line search algorithms and show that effective step lengths need not lie near minimizers of the univariate function φ(α) defined in (3.3). A simple condition we could impose on αk is to require a reduction in f , that is, f (xk + αk pk) < f (xk). That this requirement is not enough to produce convergence to x∗ is illustrated in Figure 3.2, for which the minimum function value is f ∗ � −1, but a sequence of iterates {xk} for which f (xk) � 5/k, k � 0, 1, . . . 
3.1 STEP LENGTH

In computing the step length $\alpha_k$, we face a tradeoff. We would like to choose $\alpha_k$ to give a substantial reduction of $f$, but at the same time we do not want to spend too much time making the choice. The ideal choice would be the global minimizer of the univariate function $\phi(\cdot)$ defined by

$$\phi(\alpha) = f(x_k + \alpha p_k), \qquad \alpha > 0, \qquad (3.3)$$

but in general, it is too expensive to identify this value (see Figure 3.1). To find even a local minimizer of $\phi$ to moderate precision generally requires too many evaluations of the objective function $f$ and possibly the gradient $\nabla f$. More practical strategies perform an inexact line search to identify a step length that achieves adequate reductions in $f$ at minimal cost.

Typical line search algorithms try out a sequence of candidate values for $\alpha$, stopping to accept one of these values when certain conditions are satisfied. The line search is done in two stages: a bracketing phase finds an interval containing desirable step lengths, and a bisection or interpolation phase computes a good step length within this interval. Sophisticated line search algorithms can be quite complicated, so we defer a full description until Section 3.5.

Figure 3.1: The ideal step length is the global minimizer of $\phi(\alpha)$; the plot also marks the first stationary point and the first local minimizer.

We now discuss various termination conditions for line search algorithms and show that effective step lengths need not lie near minimizers of the univariate function $\phi(\alpha)$ defined in (3.3).

A simple condition we could impose on $\alpha_k$ is to require a reduction in $f$, that is, $f(x_k + \alpha_k p_k) < f(x_k)$. That this requirement is not enough to produce convergence to $x^*$ is illustrated in Figure 3.2, for which the minimum function value is $f^* = -1$, but a sequence of iterates $\{x_k\}$ for which $f(x_k) = 5/k$, $k = 0, 1, \ldots$, yields a decrease at each iteration but has a limiting function value of zero. The insufficient reduction in $f$ at each step causes it to fail to converge to the minimizer of this convex function. To avoid this behavior we need to enforce a sufficient decrease condition, a concept we discuss next.

Figure 3.2: Insufficient reduction in $f$.

THE WOLFE CONDITIONS

A popular inexact line search condition stipulates that $\alpha_k$ should first of all give sufficient decrease in the objective function $f$, as measured by the following inequality:

$$f(x_k + \alpha p_k) \le f(x_k) + c_1 \alpha \nabla f_k^T p_k, \qquad (3.4)$$

for some constant $c_1 \in (0, 1)$. In other words, the reduction in $f$ should be proportional to both the step length $\alpha_k$ and the directional derivative $\nabla f_k^T p_k$. Inequality (3.4) is sometimes called the Armijo condition.

The sufficient decrease condition is illustrated in Figure 3.3. The right-hand side of (3.4), which is a linear function, can be denoted by $l(\alpha)$. The function $l(\cdot)$ has negative slope $c_1 \nabla f_k^T p_k$, but because $c_1 \in (0, 1)$, it lies above the graph of $\phi$ for small positive values of $\alpha$. The sufficient decrease condition states that $\alpha$ is acceptable only if $\phi(\alpha) \le l(\alpha)$. The intervals on which this condition is satisfied are shown in Figure 3.3. In practice, $c_1$ is chosen to be quite small, say $c_1 = 10^{-4}$.

The sufficient decrease condition is not enough by itself to ensure that the algorithm makes reasonable progress because, as we see from Figure 3.3, it is satisfied for all sufficiently small values of $\alpha$. To rule out unacceptably short steps we introduce a second requirement, called the curvature condition, which requires $\alpha_k$ to satisfy

$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k, \qquad (3.5)$$

for some constant $c_2 \in (c_1, 1)$, where $c_1$ is the constant from (3.4). Note that the left-hand side is simply the derivative $\phi'(\alpha_k)$, so the curvature condition ensures that the slope of $\phi$ at $\alpha_k$ is greater than $c_2$ times the initial slope $\phi'(0)$. This makes sense because if the slope $\phi'(\alpha)$ is strongly negative, we have an indication that we can reduce $f$ significantly by moving further along the chosen direction. On the other hand, if $\phi'(\alpha_k)$ is only slightly negative or even positive, it is a sign that we cannot expect much more decrease in $f$ in this direction, so it makes sense to terminate the line search. The curvature condition is illustrated in Figure 3.4. Typical values of $c_2$ are 0.9 when the search direction $p_k$ is chosen by a Newton or quasi-Newton method, and 0.1 when $p_k$ is obtained from a nonlinear conjugate gradient method.

Figure 3.3: Sufficient decrease condition, showing $\phi(\alpha) = f(x_k + \alpha p_k)$, the line $l(\alpha)$, and the acceptable intervals of $\alpha$.

Figure 3.4: The curvature condition, showing the desired slope (tangent) and the acceptable intervals of $\alpha$.

The sufficient decrease and curvature conditions are known collectively as the Wolfe conditions. We illustrate them in Figure 3.5 and restate them here for future reference:

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k, \qquad (3.6a)$$
$$\nabla f(x_k + \alpha_k p_k)^T p_k \ge c_2 \nabla f_k^T p_k, \qquad (3.6b)$$

with $0 < c_1 < c_2 < 1$. A step length may satisfy the Wolfe conditions without being particularly close to a minimizer of $\phi$, as we show in Figure 3.5. We can, however, modify the curvature condition to force $\alpha_k$ to lie in at least a broad neighborhood of a local minimizer or stationary point of $\phi$. The strong Wolfe conditions require $\alpha_k$ to satisfy

$$f(x_k + \alpha_k p_k) \le f(x_k) + c_1 \alpha_k \nabla f_k^T p_k, \qquad (3.7a)$$
$$|\nabla f(x_k + \alpha_k p_k)^T p_k| \le c_2 |\nabla f_k^T p_k|, \qquad (3.7b)$$

with $0 < c_1 < c_2 < 1$. The only difference with the Wolfe conditions is that we no longer allow the derivative $\phi'(\alpha_k)$ to be too positive. Hence, we exclude points that are far from stationary points of $\phi$.
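To make the conditions (3.6) and (3.7) concrete, here is a minimal sketch that tests whether a candidate step length satisfies the Wolfe or strong Wolfe conditions. The function name `satisfies_wolfe`, the NumPy interface, and the callables `f` and `grad` are illustrative assumptions, not code from the text.

```python
import numpy as np

def satisfies_wolfe(f, grad, xk, pk, alpha, c1=1e-4, c2=0.9, strong=False):
    """Check the (strong) Wolfe conditions (3.6)/(3.7) for a trial step length."""
    fk = f(xk)
    gk_pk = grad(xk) @ pk            # phi'(0), negative for a descent direction
    x_new = xk + alpha * pk
    g_new_pk = grad(x_new) @ pk      # phi'(alpha)

    sufficient_decrease = f(x_new) <= fk + c1 * alpha * gk_pk      # (3.6a)/(3.7a)
    if strong:
        curvature = abs(g_new_pk) <= c2 * abs(gk_pk)               # (3.7b)
    else:
        curvature = g_new_pk >= c2 * gk_pk                         # (3.6b)
    return sufficient_decrease and curvature

# Example on f(x) = x_1^2 + 4 x_2^2 with the steepest descent direction.
f = lambda x: x[0]**2 + 4.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
xk = np.array([1.0, 1.0])
pk = -grad(xk)
print(satisfies_wolfe(f, grad, xk, pk, alpha=0.1))   # True for this trial step
```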
Figure 3.5: Step lengths satisfying the Wolfe conditions, showing $\phi(\alpha) = f(x_k + \alpha p_k)$, the line of sufficient decrease $l(\alpha)$ with the desired slope, and the acceptable intervals of $\alpha$.

It is not difficult to prove that there exist step lengths that satisfy the Wolfe conditions for every function $f$ that is smooth and bounded below.

Lemma 3.1. Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable. Let $p_k$ be a descent direction at $x_k$, and assume that $f$ is bounded below along the ray $\{x_k + \alpha p_k \mid \alpha > 0\}$. Then if $0 < c_1 < c_2 < 1$, there exist intervals of step lengths satisfying the Wolfe conditions (3.6) and the strong Wolfe conditions (3.7).

PROOF. Note that $\phi(\alpha) = f(x_k + \alpha p_k)$ is bounded below for all $\alpha > 0$. Since $0 < c_1 < 1$, the line $l(\alpha) = f(x_k) + \alpha c_1 \nabla f_k^T p_k$ is unbounded below and must therefore intersect the graph of $\phi$ at least once. Let $\alpha' > 0$ be the smallest intersecting value of $\alpha$, that is,

$$f(x_k + \alpha' p_k) = f(x_k) + \alpha' c_1 \nabla f_k^T p_k. \qquad (3.8)$$

The sufficient decrease condition (3.6a) clearly holds for all step lengths less than $\alpha'$. By the mean value theorem (see (A.55)), there exists $\alpha'' \in (0, \alpha')$ such that

$$f(x_k + \alpha' p_k) - f(x_k) = \alpha' \nabla f(x_k + \alpha'' p_k)^T p_k. \qquad (3.9)$$

By combining (3.8) and (3.9), we obtain

$$\nabla f(x_k + \alpha'' p_k)^T p_k = c_1 \nabla f_k^T p_k > c_2 \nabla f_k^T p_k, \qquad (3.10)$$

since $c_1 < c_2$ and $\nabla f_k^T p_k < 0$. Therefore, $\alpha''$ satisfies the Wolfe conditions (3.6), and the inequalities hold strictly in both (3.6a) and (3.6b). Hence, by our smoothness assumption on $f$, there is an interval around $\alpha''$ for which the Wolfe conditions hold. Moreover, since the term in the left-hand side of (3.10) is negative, the strong Wolfe conditions (3.7) hold in the same interval. □

The Wolfe conditions are scale-invariant in a broad sense: multiplying the objective function by a constant or making an affine change of variables does not alter them. They can be used in most line search methods, and are particularly important in the implementation of quasi-Newton methods, as we see in Chapter 6.

THE GOLDSTEIN CONDITIONS

Like the Wolfe conditions, the Goldstein conditions ensure that the step length $\alpha$ achieves sufficient decrease but is not too short. The Goldstein conditions can also be stated as a pair of inequalities, in the following way:

$$f(x_k) + (1 - c)\,\alpha_k \nabla f_k^T p_k \le f(x_k + \alpha_k p_k) \le f(x_k) + c\,\alpha_k \nabla f_k^T p_k, \qquad (3.11)$$

with $0 < c < 1/2$. The second inequality is the sufficient decrease condition (3.4), whereas the first inequality is introduced to control the step length from below; see Figure 3.6.

A disadvantage of the Goldstein conditions vis-à-vis the Wolfe conditions is that the first inequality in (3.11) may exclude all minimizers of $\phi$. However, the Goldstein and Wolfe conditions have much in common, and their convergence theories are quite similar. The Goldstein conditions are often used in Newton-type methods but are not well suited for quasi-Newton methods that maintain a positive definite Hessian approximation.

Figure 3.6: The Goldstein conditions; the acceptable step lengths lie between the lines through $f(x_k)$ with slopes $c\,\nabla f_k^T p_k$ and $(1 - c)\,\nabla f_k^T p_k$.
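The two-sided test (3.11) is equally easy to express in code. The sketch below is an illustration under the same assumed NumPy interface as before (the name `satisfies_goldstein` and the default $c = 0.25$ are assumptions, not from the text).

```python
import numpy as np

def satisfies_goldstein(f, grad, xk, pk, alpha, c=0.25):
    """Check the Goldstein conditions (3.11) with 0 < c < 1/2."""
    fk = f(xk)
    gk_pk = grad(xk) @ pk                       # phi'(0), negative for descent
    f_new = f(xk + alpha * pk)
    lower = fk + (1.0 - c) * alpha * gk_pk      # controls the step from below
    upper = fk + c * alpha * gk_pk              # sufficient decrease, cf. (3.4)
    return lower <= f_new <= upper

# Example on the same quadratic as before.
f = lambda x: x[0]**2 + 4.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
xk = np.array([1.0, 1.0])
pk = -grad(xk)
print(satisfies_goldstein(f, grad, xk, pk, alpha=0.1))   # True for this trial step
```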
SUFFICIENT DECREASE AND BACKTRACKING

We have mentioned that the sufficient decrease condition (3.6a) alone is not sufficient to ensure that the algorithm makes reasonable progress along the given search direction. However, if the line search algorithm chooses its candidate step lengths appropriately, by using a so-called backtracking approach, we can dispense with the extra condition (3.6b) and use just the sufficient decrease condition to terminate the line search procedure. In its most basic form, backtracking proceeds as follows.

Algorithm 3.1 (Backtracking Line Search).
    Choose $\bar\alpha > 0$, $\rho \in (0, 1)$, $c \in (0, 1)$; set $\alpha \leftarrow \bar\alpha$;
    repeat until $f(x_k + \alpha p_k) \le f(x_k) + c\,\alpha \nabla f_k^T p_k$
        $\alpha \leftarrow \rho\,\alpha$;
    end (repeat)
    Terminate with $\alpha_k = \alpha$.

In this procedure, the initial step length $\bar\alpha$ is chosen to be 1 in Newton and quasi-Newton methods, but can have different values in other algorithms such as steepest descent or conjugate gradient. An acceptable step length will be found after a finite number of trials, because $\alpha_k$ will eventually become small enough that the sufficient decrease condition holds (see Figure 3.3). In practice, the contraction factor $\rho$ is often allowed to vary at each iteration of the line search. For example, it can be chosen by safeguarded interpolation, as we describe later. We need only ensure that at each iteration we have $\rho \in [\rho_{lo}, \rho_{hi}]$, for some fixed constants $0 < \rho_{lo} < \rho_{hi} < 1$.

The backtracking approach ensures either that the selected step length $\alpha_k$ is some fixed value (the initial choice $\bar\alpha$), or else that it is short enough to satisfy the sufficient decrease condition but not too short. The latter claim holds because the accepted value $\alpha_k$ is within a factor $\rho$ of the previous trial value, $\alpha_k/\rho$, which was rejected for violating the sufficient decrease condition, that is, for being too long.

This simple and popular strategy for terminating a line search is well suited for Newton methods but is less appropriate for quasi-Newton and conjugate gradient methods.
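A direct transcription of Algorithm 3.1 is sketched below. The NumPy interface, the default parameter values, and the `max_backtracks` safeguard are assumptions added for illustration; they are not part of the algorithm as stated above.

```python
import numpy as np

def backtracking_line_search(f, grad_fk, xk, pk, alpha_bar=1.0, rho=0.5, c=1e-4,
                             max_backtracks=50):
    """Algorithm 3.1: shrink alpha by rho until sufficient decrease (3.6a) holds.

    grad_fk is the gradient at xk; max_backtracks is an extra safeguard that is
    not part of Algorithm 3.1 as stated in the text.
    """
    fk = f(xk)
    slope = grad_fk @ pk                 # directional derivative, negative for descent
    alpha = alpha_bar
    for _ in range(max_backtracks):
        if f(xk + alpha * pk) <= fk + c * alpha * slope:
            return alpha
        alpha *= rho
    return alpha                         # fallback if the safeguard is hit

# Example: one backtracking step on f(x) = x_1^2 + 4 x_2^2 from x = (1, 1).
f = lambda x: x[0]**2 + 4.0 * x[1]**2
grad = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
xk = np.array([1.0, 1.0])
pk = -grad(xk)                           # steepest descent direction
alpha_k = backtracking_line_search(f, grad(xk), xk, pk)
print(alpha_k, f(xk + alpha_k * pk))     # accepted step and the reduced value of f
```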
3.2 CONVERGENCE OF LINE SEARCH METHODS

To obtain global convergence, we must not only have well chosen step lengths but also well chosen search directions $p_k$. We discuss requirements on the search direction in this section, focusing on one key property: the angle $\theta_k$ between $p_k$ and the steepest descent direction $-\nabla f_k$, defined by

$$\cos\theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\|\,\|p_k\|}. \qquad (3.12)$$

The following theorem, due to Zoutendijk, has far-reaching consequences. It quantifies the effect of properly chosen step lengths $\alpha_k$, and shows, for example, that the steepest descent method is globally convergent. For other algorithms, it describes how far $p_k$ can deviate from the steepest descent direction and still produce a globally convergent iteration. Various line search termination conditions can be used to establish this result, but for concreteness we will consider only the Wolfe conditions (3.6). Though Zoutendijk's result appears at first to be technical and obscure, its power will soon become evident.

Theorem 3.2. Consider any iteration of the form (3.1), where $p_k$ is a descent direction and $\alpha_k$ satisfies the Wolfe conditions (3.6). Suppose that $f$ is bounded below in $\mathbb{R}^n$ and that $f$ is continuously differentiable in an open set $N$ containing the level set $L \overset{\mathrm{def}}{=} \{x : f(x) \le f(x_0)\}$, where $x_0$ is the starting point of the iteration. Assume also that the gradient $\nabla f$ is Lipschitz continuous on $N$, that is, there exists a constant $L > 0$ such that

$$\|\nabla f(x) - \nabla f(\tilde{x})\| \le L\,\|x - \tilde{x}\|, \quad \text{for all } x, \tilde{x} \in N. \qquad (3.13)$$

Then

$$\sum_{k \ge 0} \cos^2\theta_k\, \|\nabla f_k\|^2 < \infty. \qquad (3.14)$$

PROOF. From (3.6b) and (3.1) we have that

$$(\nabla f_{k+1} - \nabla f_k)^T p_k \ge (c_2 - 1)\nabla f_k^T p_k,$$

while the Lipschitz condition (3.13) implies that

$$(\nabla f_{k+1} - \nabla f_k)^T p_k \le \alpha_k L \|p_k\|^2.$$

By combining these two relations, we obtain

$$\alpha_k \ge \frac{c_2 - 1}{L}\,\frac{\nabla f_k^T p_k}{\|p_k\|^2}.$$

By substituting this inequality into the first Wolfe condition (3.6a), we obtain

$$f_{k+1} \le f_k - c_1 \frac{1 - c_2}{L}\,\frac{(\nabla f_k^T p_k)^2}{\|p_k\|^2}.$$

From the definition (3.12), we can write this relation as

$$f_{k+1} \le f_k - c \cos^2\theta_k \|\nabla f_k\|^2,$$

where $c = c_1(1 - c_2)/L$. By summing this expression over all indices less than or equal to $k$, we obtain

$$f_{k+1} \le f_0 - c \sum_{j=0}^{k} \cos^2\theta_j \|\nabla f_j\|^2. \qquad (3.15)$$

Since $f$ is bounded below, we have that $f_0 - f_{k+1}$ is less than some positive constant, for all $k$. Hence, by taking limits in (3.15), we obtain

$$\sum_{k=0}^{\infty} \cos^2\theta_k \|\nabla f_k\|^2 < \infty,$$

which concludes the proof. □

Similar results to this theorem hold when the Goldstein conditions (3.11) or strong Wolfe conditions (3.7) are used in place of the Wolfe conditions. For all these strategies, the step length selection implies inequality (3.14), which we call the Zoutendijk condition.

Note that the assumptions of Theorem 3.2 are not too restrictive. If the function $f$ were not bounded below, the optimization problem would not be well defined. The smoothness assumption, Lipschitz continuity of the gradient, is implied by many of the smoothness conditions that are used in local convergence theorems (see Chapters 6 and 7) and are often satisfied in practice.

The Zoutendijk condition (3.14) implies that

$$\cos^2\theta_k \|\nabla f_k\|^2 \to 0. \qquad (3.16)$$

This limit can be used in turn to derive global convergence results for line search algorithms. If our method for choosing the search direction $p_k$ in the iteration (3.1) ensures that the angle $\theta_k$ defined by (3.12) is bounded away from $90^\circ$, there is a positive constant $\delta$ such that

$$\cos\theta_k \ge \delta > 0, \quad \text{for all } k. \qquad (3.17)$$

It follows immediately from (3.16) that

$$\lim_{k \to \infty} \|\nabla f_k\| = 0. \qquad (3.18)$$

In other words, we can be sure that the gradient norms $\|\nabla f_k\|$ converge to zero, provided that the search directions are never too close to orthogonality with the gradient. In particular, the method of steepest descent (for which the search direction $p_k$ is parallel to the negative gradient) produces a gradient sequence that converges to zero, provided that it uses a line search satisfying the Wolfe or Goldstein conditions.

We use the term globally convergent to refer to algorithms for which the property (3.18) is satisfied, but note that this term is sometimes used in other contexts to mean different things. For line search methods of the general form (3.1), the limit (3.18) is the strongest global convergence result that can be obtained: we cannot guarantee that the method converges to a minimizer, but only that it is attracted by stationary points. Only by making additional requirements on the search direction $p_k$, by introducing negative curvature information from the Hessian $\nabla^2 f(x_k)$, for example, can we strengthen these results to include convergence to a local minimum. See the Notes and References at the end of this chapter for further discussion of this point.
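The behavior described by (3.18) is easy to observe numerically. The sketch below runs steepest descent with a simple backtracking search on a smooth convex quadratic and prints the shrinking gradient norms. It is an informal illustration only: backtracking enforces just the sufficient decrease condition rather than the full Wolfe conditions assumed in Theorem 3.2, and the function, parameters, and names are assumptions made for this example.

```python
import numpy as np

# Illustrative only: steepest descent (cos(theta_k) = 1) with backtracking on a
# smooth convex quadratic; the gradient norms shrink toward zero, in the spirit of (3.18).
f = lambda x: 0.5 * (x[0]**2 + 10.0 * x[1]**2)
grad = lambda x: np.array([x[0], 10.0 * x[1]])

def backtrack(f, gk, xk, pk, alpha=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until the sufficient decrease condition (3.6a) holds."""
    while f(xk + alpha * pk) > f(xk) + c * alpha * (gk @ pk):
        alpha *= rho
    return alpha

x = np.array([5.0, 1.0])
for k in range(30):
    g = grad(x)
    p = -g                                   # steepest descent direction
    a = backtrack(f, g, x, p)
    x = x + a * p
    if k % 10 == 0:
        print(k, np.linalg.norm(grad(x)))    # gradient norm decreases toward zero
```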
Consider now the Newton-like method (3.1), (3.2) and assume that the matrices $B_k$ are positive definite with a uniformly bounded condition number. That is, there is a constant $M$ such that

$$\|B_k\|\,\|B_k^{-1}\| \le M, \quad \text{for all } k.$$

It is easy to show from the definition (3.12) that

$$\cos\theta_k \ge 1/M \qquad (3.19)$$

(see Exercise 3.5); a short derivation is sketched at the end of this section. By combining this bound with (3.16) we find that

$$\lim_{k \to \infty} \|\nabla f_k\| = 0. \qquad (3.20)$$

Therefore, we have shown that Newton and quasi-Newton methods are globally convergent if the matrices $B_k$ have a bounded condition number and are positive definite (which is needed to ensure that $p_k$ is a descent direction), and if the step lengths satisfy the Wolfe conditions.

For some algorithms, such as conjugate gradient methods, we will not be able to prove the limit (3.18), but only the weaker result

$$\liminf_{k \to \infty} \|\nabla f_k\| = 0. \qquad (3.21)$$

In other words, just a subsequence of the gradient norms $\|\nabla f_{k_j}\|$ converges to zero, rather than the whole sequence (see Appendix A). This result, too, can be proved by using Zoutendijk's condition (3.14), but instead of a constructive proof, we outline a proof by contradiction. Suppose that (3.21) does not hold, so that the gradients remain bounded away from zero, that is, there exists $\gamma > 0$ such that

$$\|\nabla f_k\| \ge \gamma, \quad \text{for all } k \text{ sufficiently large}. \qquad (3.22)$$

Then from (3.16) we conclude that

$$\cos\theta_k \to 0, \qquad (3.23)$$

that is, the entire sequence $\{\cos\theta_k\}$ converges to 0. To establish (3.21), therefore, it is enough to show that a subsequence $\{\cos\theta_{k_j}\}$ is bounded away from zero. We will use this strategy in Chapter 5 to study the convergence of nonlinear conjugate gradient methods.

By applying this proof technique, we can prove global convergence in the sense of (3.20) or (3.21) for a general class of algorithms. Consider any algorithm for which (i) every iteration produces a decrease in the objective function, and (ii) every $m$th iteration is a steepest descent step, with step length chosen to satisfy the Wolfe or Goldstein conditions. Then, since $\cos\theta_k = 1$ for the steepest descent steps, the result (3.21) holds. Of course, we would design the algorithm so that it does something "better" than steepest descent at the other $m - 1$ iterates. The occasional steepest descent steps may not make much progress, but they at least guarantee overall global convergence.

Note that throughout this section we have used only the fact that Zoutendijk's condition implies the limit (3.16). In later chapters we will make use of the bounded sum condition (3.14), which forces the sequence $\{\cos^2\theta_k \|\nabla f_k\|^2\}$ to converge to zero at a sufficiently rapid rate.
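Returning to the bound (3.19) used above, one way the short derivation can go is sketched below. It assumes the Euclidean norm, so that $\|B_k\|\,\|B_k^{-1}\| = \lambda_{\max}(B_k)/\lambda_{\min}(B_k)$ for symmetric positive definite $B_k$; this is a sketch of the exercise, not the book's worked solution.

```latex
% Sketch of (3.19), writing g = \nabla f_k, B = B_k, p = p_k = -B^{-1} g,
% with B symmetric positive definite and the Euclidean norm throughout.
\begin{align*}
\cos\theta_k
  &= \frac{-g^T p}{\|g\|\,\|p\|}
   = \frac{g^T B^{-1} g}{\|g\|\,\|B^{-1} g\|}
   = \frac{p^T B\, p}{\|B p\|\,\|p\|}
     && \text{since } g = -Bp \text{ and } \|g\| = \|Bp\| \\
  &\ge \frac{\lambda_{\min}(B)\,\|p\|^2}{\lambda_{\max}(B)\,\|p\|\,\|p\|}
   = \frac{\lambda_{\min}(B)}{\lambda_{\max}(B)}
   = \frac{1}{\|B\|\,\|B^{-1}\|}
   \;\ge\; \frac{1}{M}.
\end{align*}
```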