{"title": "A Convergent $O(n)$ Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1609, "page_last": 1616, "abstract": "We introduce the first temporal-difference learning algorithm that is stable with linear function approximation and off-policy training, for any finite Markov decision process, target policy, and exciting behavior policy, and whose complexity scales linearly in the number of parameters. We consider an i.i.d.\\ policy-evaluation setting in which the data need not come from on-policy experience. The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L_2 norm. Our analysis proves that its expected update is in the direction of the gradient, assuring convergence under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without its quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.", "full_text": "A Convergent O(n) Algorithm\n\nfor Off-policy Temporal-difference Learning\n\nwith Linear Function Approximation\n\nRichard S. Sutton, Csaba Szepesv\u00b4ari\u2217, Hamid Reza Maei\nReinforcement Learning and Arti\ufb01cial Intelligence Laboratory\n\nEdmonton, Alberta, Canada T6G 2E8\n\nDepartment of Computing Science\n\nUniversity of Alberta\n\nAbstract\n\nWe introduce the \ufb01rst temporal-difference learning algorithm that is stable with\nlinear function approximation and off-policy training, for any \ufb01nite Markov de-\ncision process, behavior policy, and target policy, and whose complexity scales\nlinearly in the number of parameters. We consider an i.i.d. policy-evaluation set-\nting in which the data need not come from on-policy experience. 
The gradient temporal-difference (GTD) algorithm estimates the expected update vector of the TD(0) algorithm and performs stochastic gradient descent on its L2 norm. We prove that this algorithm is stable and convergent under the usual stochastic approximation conditions to the same least-squares solution as found by the LSTD, but without LSTD’s quadratic computational complexity. GTD is online and incremental, and does not involve multiplying by products of likelihood ratios as in importance-sampling methods.

1 Off-policy learning methods

Off-policy methods have an important role to play in the larger ambitions of modern reinforcement learning. In general, updates to a statistic of a dynamical process are said to be “off-policy” if their distribution does not match the dynamics of the process, particularly if the mismatch is due to the way actions are chosen. The prototypical example in reinforcement learning is the learning of the value function for one policy, the target policy, using data obtained while following another policy, the behavior policy. For example, the popular Q-learning algorithm (Watkins 1989) is an off-policy temporal-difference algorithm in which the target policy is greedy with respect to estimated action values, and the behavior policy is something more exploratory, such as a corresponding ε-greedy policy. Off-policy methods are also critical to reinforcement-learning-based efforts to model human-level world knowledge and state representations as predictions of option outcomes (e.g., Sutton, Precup & Singh 1999; Sutton, Rafols & Koop 2006).

Unfortunately, off-policy methods such as Q-learning are not sound when used with approximations that are linear in the learned parameters, the most popular form of function approximation in reinforcement learning.
Counterexamples have been known for many years (e.g., Baird 1995) in which Q-learning’s parameters diverge to infinity for any positive step size. This is a severe problem insofar as function approximation is widely viewed as necessary for large-scale applications of reinforcement learning. The need is so great that practitioners have often simply ignored the problem and continued to use Q-learning with linear function approximation anyway. Although no instances of absolute divergence in applications have been reported in the literature, the potential for instability is disturbing and probably belies real but less obvious problems.

∗Csaba Szepesvári is on leave from MTA SZTAKI.

The stability problem is not specific to reinforcement learning. Classical dynamic programming methods such as value and policy iteration are also off-policy methods and also diverge on some problems when used with linear function approximation. Reinforcement learning methods are actually an improvement over conventional dynamic programming methods in that at least they can be used stably with linear function approximation in their on-policy form. The stability problem is also not due to the interaction of control and prediction, or to stochastic approximation effects; the simplest counterexamples are for deterministic, expected-value-style, synchronous policy evaluation (see Baird 1995; Sutton & Barto 1998).

Prior to the current work, the possibility of instability could not be avoided whenever four individually desirable algorithmic features were combined: 1) off-policy updates, 2) temporal-difference learning, 3) linear function approximation, and 4) linear complexity in memory and per-time-step computation. If any one of these four is abandoned, then stable methods can be obtained relatively easily.
But each feature brings value and practitioners are loath to give any of them up, as we discuss later in a penultimate related-work section. In this paper we present the first algorithm to achieve all four desirable features and be stable and convergent for all finite Markov decision processes, all target and behavior policies, and all feature representations for the linear approximator. Moreover, our algorithm does not use importance sampling and can be expected to be much better conditioned and of lower variance than importance-sampling methods. Our algorithm can be viewed as performing stochastic gradient descent on a novel objective function whose optimum is the least-squares TD solution. Our algorithm is also incremental and suitable for online use just as are simple temporal-difference learning algorithms such as Q-learning and TD(λ) (Sutton 1988). Our algorithm can be broadly characterized as a gradient-descent version of TD(0), and accordingly we call it GTD(0).

2 Sub-sampling and i.i.d. formulations of temporal-difference learning

In this section we formulate the off-policy policy-evaluation problem for one-step temporal-difference learning such that the data consists of independent, identically distributed (i.i.d.) samples. We start by considering the standard reinforcement learning framework, in which a learning agent interacts with an environment consisting of a finite Markov decision process (MDP). At each of a sequence of discrete time steps, t = 1, 2, . . ., the environment is in a state st ∈ S, the agent chooses an action at ∈ A, and then the environment emits a reward rt ∈ R, and transitions to its next state st+1 ∈ S. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, and on the next state.
The agent process generating the actions is termed the behavior policy. To start, we assume a deterministic target policy π : S → A. The objective is to learn an approximation to its state-value function:

Vπ(s) = Eπ[ ∑_{t=1}^{∞} γ^{t−1} rt | s1 = s ],    (1)

where γ ∈ [0, 1) is the discount rate. The learning is to be done without knowledge of the process dynamics and from observations of a single continuous trajectory with no resets.

In many problems of interest the state set is too large for it to be practical to approximate the value of each state individually. Here we consider linear function approximation, in which states are mapped to feature vectors with fewer components than the number of states. That is, for each state s ∈ S there is a corresponding feature vector φ(s) ∈ Rn, with n ≪ |S|. The approximation to the value function is then required to be linear in the feature vectors and a corresponding parameter vector θ ∈ Rn:

Vπ(s) ≈ θ⊤φ(s).    (2)

Further, we assume that the states st are not visible to the learning agent in any way other than through the feature vectors. Thus this function approximation formulation can include partial-observability formulations such as POMDPs as a special case.

The environment and the behavior policy together generate a stream of states, actions and rewards, s1, a1, r1, s2, a2, r2, . . ., which we can break into causally related 4-tuples, (s1, a1, r1, s′1), (s2, a2, r2, s′2), . . . , where s′t = st+1. For some tuples, the action will match what the target policy would do in that state, and for others it will not. We can discard all of the latter as not relevant to the target policy. For the former, we can discard the action because it can be determined from the state via the target policy.
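For concreteness, this sub-sampling step (keep a transition only when the behavior action matches the target policy’s action, then drop the action) might be sketched as follows; the trajectory format and the names `target` and `phi` are hypothetical stand-ins of ours, not code from the paper:

```python
# Sketch of the sub-sampling construction of Section 2 (an illustration under
# assumed interfaces, not the authors' code). A trajectory is a list of
# (s, a, r, s_next) tuples; `target` maps a state to the deterministic
# target-policy action; `phi` maps a state to its feature vector.

def subsample_on_policy(trajectory, target, phi):
    """Return the triples (phi(s_k), r_k, phi(s'_k)) for on-policy steps."""
    triples = []
    for s, a, r, s_next in trajectory:
        if a == target(s):          # off-policy actions are discarded
            triples.append((phi(s), r, phi(s_next)))
    return triples
```

Only the feature vectors, reward, and next-state features survive, matching the data actually available to the learner.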
With a slight abuse of notation, let sk denote the kth state in which an on-policy action was taken, and let rk and s′k denote the associated reward and next state. The kth on-policy transition, denoted (sk, rk, s′k), is a triple consisting of the starting state of the transition, the reward on the transition, and the ending state of the transition. The corresponding data available to the learning algorithm is the triple (φ(sk), rk, φ(s′k)).

The MDP under the behavior policy is assumed to be ergodic, so that it determines a stationary state-occupancy distribution µ(s) = limk→∞ Pr{sk = s}. For any state s, the MDP and target policy together determine an N × N state-transition-probability matrix P, where pss′ = Pr{s′k = s′ | sk = s}, and an N × 1 expected-reward vector R, where Rs = E[rk | sk = s]. These two together completely characterize the statistics of on-policy transitions, and all the samples in the sequence of (φ(sk), rk, φ(s′k)) respect these statistics. The problem still has a Markov structure in that there are temporal dependencies between the sample transitions. In our analysis we first consider a formulation without such dependencies, the i.i.d. case, and then prove that our results extend to the original case.

In the i.i.d. formulation, the states sk are generated independently and identically distributed according to an arbitrary probability distribution µ. From each sk, a corresponding s′k is generated according to the on-policy state-transition matrix, P, and a corresponding rk is generated according to an arbitrary bounded distribution with expected value Rsk. The final i.i.d. data sequence, from which an approximate value function is to be learned, is then the sequence (φ(sk), rk, φ(s′k)), for k = 1, 2, . . .
Further, because each sample is i.i.d., we can remove the indices and talk about a single tuple of random variables (φ, r, φ′) drawn from µ.

It remains to define the objective of learning. The TD error for the linear setting is

δ = r + γθ⊤φ′ − θ⊤φ.    (3)

Given this, we define the one-step linear TD solution as any value of θ at which

0 = E[δφ] = −Aθ + b,    (4)

where A = E[φ(φ − γφ′)⊤] and b = E[rφ]. This is the parameter value to which the linear TD(0) algorithm (Sutton 1988) converges under on-policy training, as well as the value found by LSTD(0) (Bradtke & Barto 1996) under both on-policy and off-policy training. The TD solution is always a fixed-point of the linear TD(0) algorithm, but under off-policy training it may not be stable; if θ does not exactly satisfy (4), then the TD(0) algorithm may cause it to move away in expected value and eventually diverge to infinity.

3 The GTD(0) algorithm

We next present the idea and gradient-descent derivation leading to the GTD(0) algorithm. As discussed above, the vector E[δφ] can be viewed as an error in the current solution θ. The vector should be zero, so its norm is a measure of how far we are away from the TD solution. A distinctive feature of our gradient-descent analysis of temporal-difference learning is that we use as our objective function the L2 norm of this vector:

J(θ) = E[δφ]⊤E[δφ].    (5)

This objective function is quadratic and unimodal; its minimum value of 0 is achieved when E[δφ] = 0, which can always be achieved. The gradient of this objective function is

∇θJ(θ) = 2(∇θE[δφ])E[δφ] = 2E[φ(∇θδ)⊤]⊤E[δφ] = −2E[φ(φ − γφ′)⊤]⊤E[δφ].    (6)

This last equation is key to our analysis. We would like to take a stochastic gradient-descent approach, in which a small change is made on each sample in such a way that the expected update is the direction opposite to the gradient. This is straightforward if the gradient can be written as a single expected value, but here we have a product of two expected values. One cannot sample both of them because the sample product will be biased by their correlation. However, one could store a long-term, quasi-stationary estimate of either of the expectations and then sample the other. The question is, which expectation should be estimated and stored, and which should be sampled? Both ways seem to lead to interesting learning algorithms.

First let us consider the algorithm obtained by forming and storing a separate estimate of the first expectation, that is, of the matrix A = E[φ(φ − γφ′)⊤]. This matrix is straightforward to estimate from experience as a simple arithmetic average of all previously observed sample outer products φ(φ − γφ′)⊤. Note that A is a stationary statistic in any fixed-policy policy-evaluation problem; it does not depend on θ and would not need to be re-estimated if θ were to change. Let Ak be the estimate of A after observing the first k samples, (φ1, r1, φ′1), . . . , (φk, rk, φ′k). Then this algorithm is defined by

Ak = (1/k) ∑_{i=1}^{k} φi(φi − γφ′i)⊤,    k ≥ 1,    (7)

along with the gradient descent rule:

θk+1 = θk + αk A⊤k δk φk,    (8)

where θ1 is arbitrary, δk = rk + γθ⊤k φ′k − θ⊤k φk, and αk > 0 is a series of step-size parameters, possibly decreasing over time. We call this algorithm A⊤TD(0) because it is essentially conventional TD(0) prefixed by an estimate of the matrix A⊤. Although we find this algorithm interesting, we do not consider it further here because it requires O(n2) memory and computation per time step.

The second path to a stochastic-approximation algorithm for estimating the gradient (6) is to form and store an estimate of the second expectation, the vector E[δφ], and to sample the first expectation, E[φ(φ − γφ′)⊤]. Let uk denote the estimate of E[δφ] after observing the first k − 1 samples, with u1 = 0. The GTD(0) algorithm is defined by

uk+1 = uk + βk(δkφk − uk)    (9)

and

θk+1 = θk + αk(φk − γφ′k)φ⊤k uk,    (10)

where θ1 is arbitrary, δk is as in (3) using θk, and αk > 0 and βk > 0 are step-size parameters, possibly decreasing over time. Notice that if the product is formed right-to-left, then the entire computation is O(n) per time step.

4 Convergence

The purpose of this section is to establish that GTD(0) converges with probability one to the TD solution in the i.i.d. problem formulation under standard assumptions. In particular, we have the following result:

Theorem 4.1 (Convergence of GTD(0)). Consider the GTD(0) iteration (9,10) with step-size sequences αk and βk satisfying βk = ηαk, η > 0, αk, βk ∈ (0, 1], ∑_{k=0}^{∞} αk = ∞, ∑_{k=0}^{∞} αk^2 < ∞. Further assume that (φk, rk, φ′k) is an i.i.d. sequence with uniformly bounded second moments. Let A = E[φk(φk − γφ′k)⊤] and b = E[rkφk] (note that A and b are well-defined because the distribution of (φk, rk, φ′k) does not depend on the sequence index k). Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4).

Proof. We use the ordinary-differential-equation (ODE) approach (Borkar & Meyn 2000). First, we rewrite the algorithm’s two iterations as a single iteration in a combined parameter vector with 2n components, ρ⊤k = (v⊤k, θ⊤k), where vk = uk/√η, and a new reward-related vector with 2n components, g⊤k+1 = (rkφ⊤k, 0⊤):

ρk+1 = ρk + αk√η (Gk+1ρk + gk+1),

where

Gk+1 = [ −√η I, φk(γφ′k − φk)⊤ ; (φk − γφ′k)φ⊤k, 0 ].

Let G = E[Gk] and g = E[gk]. Note that G and g are well-defined because, by assumption, the process {φk, rk, φ′k}k is i.i.d. In particular,

G = [ −√η I, −A ; A⊤, 0 ],    g = [ b ; 0 ].

Further, note that (4) follows from

Gρ + g = 0,    (11)

where ρ⊤ = (v⊤, θ⊤).

Now we apply Theorem 2.2 of Borkar & Meyn (2000). For this purpose we write ρk+1 = ρk + αk√η(Gρk + g + (Gk+1 − G)ρk + (gk+1 − g)) = ρk + α′k(h(ρk) + Mk+1), where α′k = αk√η, h(ρ) = g + Gρ and Mk+1 = (Gk+1 − G)ρk + gk+1 − g. Let Fk = σ(ρ1, M1, . . . , ρk−1, Mk). Theorem 2.2 requires the verification of the following conditions: (i) The function h is Lipschitz and h∞(ρ) = limr→∞ h(rρ)/r is well-defined for every ρ ∈ R2n; (ii-a) The sequence (Mk, Fk) is a martingale difference sequence, and (ii-b) for some C0 > 0, E[‖Mk+1‖^2 | Fk] ≤ C0(1 + ‖ρk‖^2) holds for any initial parameter vector ρ1; (iii) The sequence α′k satisfies 0 < α′k ≤ 1, ∑_{k=1}^{∞} α′k = ∞, ∑_{k=1}^{∞} (α′k)^2 < +∞; and (iv) The ODE ρ̇ = h(ρ) has a globally asymptotically stable equilibrium.

Clearly, h(ρ) is Lipschitz with coefficient ‖G‖ and h∞(ρ) = Gρ. By construction, (Mk, Fk) satisfies E[Mk+1 | Fk] = 0 and Mk ∈ Fk, i.e., it is a martingale difference sequence. Condition (ii-b) can be shown to hold by a simple application of the triangle inequality and the boundedness of the second moments of (φk, rk, φ′k). Condition (iii) is satisfied by our conditions on the step-size sequences αk, βk. Finally, the last condition (iv) will follow from the elementary theory of linear differential equations if we can show that the real parts of all the eigenvalues of G are negative.

First, let us show that G is non-singular. Using the determinant rule for partitioned matrices1 we get det(G) = det(A⊤A) ≠ 0. This indicates that all the eigenvalues of G are non-zero. Now, let λ ∈ C, λ ≠ 0 be an eigenvalue of G with corresponding normalized eigenvector x ∈ C2n; that is, ‖x‖^2 = x*x = 1, where x* is the complex conjugate of x. Hence x*Gx = λ. Let x⊤ = (x⊤1, x⊤2), where x1, x2 ∈ Cn. Using the definition of G, λ = x*Gx = −√η‖x1‖^2 + x*1Ax2 − x*2A⊤x1. Because A is real, A* = A⊤, and it follows that (x*1Ax2)* = x*2A⊤x1. Thus, Re(λ) = Re(x*Gx) = −√η‖x1‖^2 ≤ 0. We are now done if we show that x1 cannot be zero. If x1 = 0, then from λ = x*Gx we get that λ = 0, which contradicts λ ≠ 0.

1According to this rule, if A ∈ Rn×n, B ∈ Rn×m, C ∈ Rm×n, D ∈ Rm×m then for F = [A B; C D] ∈ R(n+m)×(n+m), det(F) = det(A) det(D − CA−1B).

The next result concerns the convergence of GTD(0) when (φk, rk, φ′k) is obtained by the off-policy sub-sampling process described originally in Section 2. We make the following assumption:

Assumption A1. The behavior policy πb (generator of the actions at) selects all actions of the target policy π with positive probability in every state, and the target policy is deterministic.

This assumption is needed to ensure that the sub-sampled process sk is well-defined and that the obtained sample is of “high quality”. Under this assumption it holds that sk is again a Markov chain by the strong Markov property of Markov processes (as the times selected when actions correspond to those of the target policy form Markov times with respect to the filtration defined by the original process st). The following theorem shows that the conclusion of the previous result continues to hold in this case:

Theorem 4.2 (Convergence of GTD(0) with a sub-sampled process). Assume A1. Let the parameters θk, uk be updated by (9,10). Further assume that (φk, rk, φ′k) is such that E[‖φk‖^2 | sk−1], E[rk^2 | sk−1], E[‖φ′k‖^2 | sk−1] are uniformly bounded. Assume that the Markov chain (sk) is aperiodic and irreducible, so that limk→∞ P(sk = s′ | s0 = s) = µ(s′) exists and is unique. Let s be a state randomly drawn from µ, and let s′ be a state obtained by following π for one time step in the MDP from s. Further, let r(s, s′) be the reward incurred. Let A = E[φ(s)(φ(s) − γφ(s′))⊤] and b = E[r(s, s′)φ(s)]. Assume that A is non-singular. Then the parameter vector θk converges with probability one to the TD solution (4), provided that s1 ∼ µ.

Proof. The proof of Theorem 4.1 goes through without any changes once we observe that G = E[Gk+1 | Fk] and g = E[gk+1 | Fk].

The condition that (sk) is aperiodic and irreducible guarantees the existence of the steady state distribution µ. Further, the aperiodicity and irreducibility of (sk) follow from the same property of the original process (st). For further discussion of these conditions cf. Section 6.3 of Bertsekas and Tsitsiklis (1996).

With considerably more work the result can be extended to the case when s1 follows an arbitrary distribution.
This requires an extension of Theorem 2.2 of Borkar and Meyn (2000) to processes of the form ρk+1 = ρk + α′k(h(ρk) + Mk+1 + ek+1), where ek+1 is a fast decaying perturbation (see, e.g., the proof of Proposition 4.8 of Bertsekas and Tsitsiklis (1996)).

5 Extensions to action values, stochastic target policies, and other sample weightings

The GTD algorithm extends immediately to the case of off-policy learning of action-value functions. For this assume that a behavior policy πb is followed that samples all actions in every state with positive probability. Let the target policy to be evaluated be π. In this case the basis functions are dependent on both the states and actions: φ : S × A → Rn. The learning equations are unchanged, except that φt and φ′t are redefined as follows:

φt = φ(st, at),    (12)

φ′t = ∑_a π(st+1, a)φ(st+1, a).    (13)

(We use time indices t denoting physical time.) Here π(s, a) is the probability of selecting action a in state s under the target policy π. Let us call the resulting algorithm “one-step gradient-based Q-evaluation,” or GQE(0).

Theorem 5.1 (Convergence of GQE(0)). Assume that st is a state sequence generated by following some stationary policy πb in a finite MDP. Let rt be the corresponding sequence of rewards and let φt, φ′t be given by the respective equations (12) and (13), and assume that E[‖φt‖^2 | st−1], E[rt^2 | st−1], E[‖φ′t‖^2 | st−1] are uniformly bounded. Let the parameters θt, ut be updated by Equations (9) and (10). Assume that the Markov chain (st) is aperiodic and irreducible, so that limt→∞ P(st = s′ | s0 = s) = µ(s′) exists and is unique. Let s be a state randomly drawn from µ, a be an action chosen by πb in s, let s′ be the next state obtained and let a′ = π(s′) be the action chosen by the target policy in state s′. Further, let r(s, a, s′) be the reward incurred in this transition. Let A = E[φ(s, a)(φ(s, a) − γφ(s′, a′))⊤] and b = E[r(s, a, s′)φ(s, a)]. Assume that A is non-singular. Then the parameter vector θt converges with probability one to a TD solution (4), provided that s1 is selected from the steady-state distribution µ.

The proof is almost identical to that of Theorem 4.2, and hence it is omitted.

Our main convergence results are also readily generalized to stochastic target policies by replacing the sub-sampling process described in Section 2 with a sample-weighting process. That is, instead of including or excluding transitions depending upon whether the action taken matches a deterministic policy, we include all transitions but give each a weight. For example, we might let the weight wt for time step t be equal to the probability π(st, at) of taking the action actually taken under the target policy. We can consider the i.i.d. samples now to have four components (φk, rk, φ′k, wk), with the update rules (9) and (10) replaced by

uk+1 = uk + βk(δkφk − uk)wk,    (14)

and

θk+1 = θk + αk(φk − γφ′k)φ⊤k uk wk.    (15)

Each sample is also weighted by wk in the expected values, such as that defining the TD solution (4). With these changes, Theorems 4.1 and 4.2 go through immediately for stochastic policies. The reweighting is, in effect, an adjustment to the i.i.d. sampling distribution, µ, and thus our results hold because they hold for all µ.
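To make the weighted updates concrete, here is a minimal sketch of a single weighted GTD(0) step implementing (14) and (15); the function name and array-based interface are our own illustration, not code from the paper:

```python
import numpy as np

# Minimal sketch of one weighted GTD(0) step, equations (14) and (15).
# With w = 1.0 this reduces to the unweighted updates (9) and (10).
def gtd0_step(theta, u, phi, r, phi_next, w, alpha, beta, gamma):
    delta = r + gamma * theta @ phi_next - theta @ phi            # TD error (3)
    # Update (15) uses the current u, before u itself is updated.
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ u) * w
    u_new = u + beta * (delta * phi - u) * w                      # update (14)
    return theta_new, u_new
```

Formed right-to-left as shown (the scalar `phi @ u` first), each step costs only O(n) inner products and vector additions.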
The choice wt = π(st, at) is only one possibility, notable for its equivalence to our original case if the target policy is deterministic. Another natural weighting is wt = π(st, at)/πb(st, at), where πb is the behavior policy. This weighting may result in the TD solution (4) better matching the target policy’s value function (1).

6 Related work

There have been several prior attempts to attain the four desirable algorithmic features mentioned at the beginning of this paper (off-policy stability, temporal-difference learning, linear function approximation, and O(n) complexity), but none has been completely successful.

One idea for retaining all four desirable features is to use importance sampling techniques to reweight off-policy updates so that they are in the same direction as on-policy updates in expected value (Precup, Sutton & Dasgupta 2001; Precup, Sutton & Singh 2000). Convergence can sometimes then be assured by existing results on the convergence of on-policy methods (Tsitsiklis & Van Roy 1997; Tadic 2001). However, the importance sampling weights are cumulative products of (possibly many) target-to-behavior-policy likelihood ratios, and consequently they and the corresponding updates may be of very high variance. The use of “recognizers” to construct the target policy directly from the behavior policy (Precup, Sutton, Paduraru, Koop & Singh 2006) is one strategy for limiting the variance; another is careful choice of the target policies (see Precup, Sutton & Dasgupta 2001). However, it remains the case that for all such methods to date there are always choices of problem, behavior policy, and target policy for which the variance is infinite, and thus for which there is no guarantee of convergence.

Residual gradient algorithms (Baird 1995) have also been proposed as a way of obtaining all four desirable features.
These methods can be viewed as gradient descent in the expected squared TD error, E[δ^2]; thus they converge stably to the solution that minimizes this objective for arbitrary differentiable function approximators. However, this solution has always been found to be much inferior to the TD solution (exemplified by (4) for the one-step linear case). In the literature (Baird 1995; Sutton & Barto 1998), it is often claimed that residual-gradient methods are guaranteed to find the TD solution in two special cases: 1) systems with deterministic transitions and 2) systems in which two samples can be drawn for each next state (e.g., for which a simulation model is available). Our own analysis indicates that even these two special requirements are insufficient to guarantee convergence to the TD solution.2

Gordon (1995) and others have questioned the need for linear function approximation. He has proposed replacing linear function approximation with a more restricted class of approximators, known as averagers, that never extrapolate outside the range of the observed data and thus cannot diverge. Rightly or wrongly, averagers have been seen as being too constraining and have not been used on large applications involving online learning. Linear methods, on the other hand, have been widely used (e.g., Baxter, Tridgell & Weaver 1998; Sturtevant & White 2006; Schaeffer, Hlynka & Jussila 2001).

The need for linear complexity has also been questioned. Second-order methods for linear approximators, such as LSTD (Bradtke & Barto 1996; Boyan 2002) and LSPI (Lagoudakis & Parr 2003; see also Peters, Vijayakumar & Schaal 2005), can be effective on moderately sized problems. If the number of features in the linear approximator is n, then these methods require memory and per-time-step computation that is O(n2).
Newer incremental methods such as iLSTD (Geramifard, Bowling & Sutton 2006) have reduced the per-time-step computation to O(n), but are still O(n²) in memory. Sparsification methods may reduce the complexity further, but they do not help in the general case, and they may be applied to O(n) methods as well to reduce their complexity further. Linear function approximation is most powerful when very large numbers of features are used, perhaps millions of features (e.g., as in Silver, Sutton & Müller 2007). In such cases, O(n²) methods are not feasible.

7 Conclusion

GTD(0) is the first off-policy TD algorithm to converge under general conditions with linear function approximation and linear complexity. As such, it breaks new ground in terms of important, absolute abilities not previously available in existing algorithms. We have conducted empirical studies with the GTD(0) algorithm and have confirmed that it converges reliably on standard off-policy counterexamples such as Baird's (1995) "star" problem.

²For a counterexample, consider the one given in Dayan's (1992) Figure 2, except that state A is actually two states, A and A′, which share the same feature vector. The two states occur with 50-50 probability, and when one occurs the transition is always deterministically to B followed by the outcome 1, whereas when the other occurs the transition is always deterministically to the outcome 0. In this case V(A) and V(B) will converge under the residual-gradient algorithm to the wrong answers, 1/3 and 2/3, even though the system is deterministic, and even if multiple samples are drawn from each state (they will all be the same).
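The counterexample in footnote 2 is easy to check numerically. The sketch below is our own construction following the footnote's description, with γ = 1 and a small constant step size; it cycles through the three deterministic transitions and settles near the wrong values 1/3 and 2/3, even though V(B) = 1 and the best shared value for the aliased states A and A′ is 1/2:

```python
# Residual-gradient updates on the aliased-state counterexample: states A and
# A' share one feature weight v_A; state B has its own weight v_B. gamma = 1.
# Transitions: A -> B -> terminal (reward 1), and A' -> terminal (reward 0).
v_A, v_B = 0.0, 0.0
alpha = 0.01
for _ in range(10_000):
    # A -> B, reward 0: delta = v_B - v_A; descend the gradient of delta^2 / 2,
    # which moves both v_A and v_B (the "residual" part of the update).
    delta = v_B - v_A
    v_A, v_B = v_A + alpha * delta, v_B - alpha * delta
    # B -> terminal, reward 1: delta = 1 - v_B
    v_B += alpha * (1.0 - v_B)
    # A' -> terminal, reward 0: delta = 0 - v_A (A' shares v_A with A)
    v_A -= alpha * v_A
print(v_A, v_B)  # close to 1/3 and 2/3, the minimizer of the expected squared TD error
```

The transitions are deterministic and the aliased states produce identical samples, so drawing multiple samples per state changes nothing; the residual-gradient fixed point remains (1/3, 2/3).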
On on-policy problems such as the n-state random walk (Sutton 1988; Sutton & Barto 1998), GTD(0) does not seem to learn as efficiently as classic TD(0), although we are still exploring different ways of setting the step-size parameters, and other variations on the algorithm. It is not clear that the GTD(0) algorithm in its current form will be a fully satisfactory solution to the off-policy learning problem, but it is clear that it breaks new ground and achieves important abilities that were previously unattainable.

Acknowledgments

The authors gratefully acknowledge insights and assistance they have received from David Silver, Eric Wiewiora, Mark Ring, Michael Bowling, and Alborz Geramifard. This research was supported by iCORE, NSERC and the Alberta Ingenuity Fund.

References

Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. Proceedings of the Twelfth International Conference on Machine Learning, pp. 30–37. Morgan Kaufmann.

Baxter, J., Tridgell, A., Weaver, L. (1998). Experiments in parameter learning using temporal differences. International Computer Chess Association Journal, 21, 84–99.

Bertsekas, D. P., Tsitsiklis, J. (1996). Neuro-Dynamic Programming. Athena Scientific.

Borkar, V. S., Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469.

Boyan, J. (2002). Technical update: Least-squares temporal difference learning. Machine Learning, 49:233–246.

Bradtke, S., Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57.

Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8:341–362.

Geramifard, A., Bowling, M., Sutton, R. S.
(2006). Incremental least-squares temporal difference learning. Proceedings of the National Conference on Artificial Intelligence, pp. 356–361.

Gordon, G. J. (1995). Stable function approximation in dynamic programming. Proceedings of the Twelfth International Conference on Machine Learning, pp. 261–268. Morgan Kaufmann, San Francisco.

Lagoudakis, M., Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning Research, 4:1107–1149.

Peters, J., Vijayakumar, S., Schaal, S. (2005). Natural actor-critic. Proceedings of the 16th European Conference on Machine Learning, pp. 280–291.

Precup, D., Sutton, R. S., Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. Proceedings of the 18th International Conference on Machine Learning, pp. 417–424.

Precup, D., Sutton, R. S., Paduraru, C., Koop, A., Singh, S. (2006). Off-policy learning with recognizers. Advances in Neural Information Processing Systems 18.

Precup, D., Sutton, R. S., Singh, S. (2000). Eligibility traces for off-policy policy evaluation. Proceedings of the 17th International Conference on Machine Learning, pp. 759–766. Morgan Kaufmann.

Schaeffer, J., Hlynka, M., Jussila, V. (2001). Temporal difference learning applied to a high-performance game-playing program. Proceedings of the International Joint Conference on Artificial Intelligence, pp. 529–534.

Silver, D., Sutton, R. S., Müller, M. (2007). Reinforcement learning of local shape in the game of Go. Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 1053–1058.

Sturtevant, N. R., White, A. M. (2006). Feature construction for reinforcement learning in hearts. Proceedings of the 5th International Conference on Computers and Games.

Sutton, R. S. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3:9–44.

Sutton, R.
S., Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Precup, D., Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211.

Sutton, R. S., Rafols, E. J., Koop, A. (2006). Temporal abstraction in temporal-difference networks. Advances in Neural Information Processing Systems 18.

Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267.

Tsitsiklis, J. N., Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690.

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University.