In this chapter, we focus on deep reinforcement learning, which is a combination reinforcement learning and deep learning. This combination has enabled machines for solving complex decision problems in robotics, healthcare, finance, automatic vehicle control, to name a few.
Learning outcomes from this chapter:
Markov decision processes
Reinforcement learning
Deep reinforcement learning
Sequential decision making is a key topic in machine learning. In this framework, an agent continuously interacts with and learn from an uncertain environment. A sequence of actions taken by the agent decides how the agent interacts with the environment and learn from it. Sequential decision making finds crucial applications in many domains, including healthcare, robotics, finance, and self-driving cars.
In certain cases decision making processes can be detached from precise details of information retrieval and interaction, and however, in many other practical cases the two are coupled. In the latter case, the agent observes the system, gains information and make decisions simultaneously. This is where techniques such as Markov Decision Processes (MDP), Partially Observable Markov Decision Processes (POMDP), and Reinforcement Learning (RL) play significant roles.
Over the last few years, RL has become increasingly popular in addressing important sequential decision making problems. This success is mostly due to the combination of RL with deep learning techniques thanks to the ability of deep learning to successfully learn different levels of abstractions from data with a lower prior knowledge. Deep RL has created an opportunity for us to train computers to mimic some human solving capabilities.
RL is an interdisciplinary field, addressing problems in
In all these different areas of science, there is a common goal to solve the problem of how to make optimal sequential decisions. RL algorithms are motivated from the human decision making based on “rewards provide a positive reinforcement for an action”. This behavior is studied in Neuro-science as reward systems and in Psychology as conditioning.
Let’s look at some examples where RL plays an important role.
Robitics: Imagine a robot trying to move from point A to point B, starting with a lack of knowledge on its surroundings. In this process, it tries different leg and body moves. From each successful or failed move, it learns and uses this experience in making the next moves.
Vehicle control: Think of you trying to fly a drone. In the beginning you have little or no information on what controls to apply to make a successful flight. As you start applying the controls, you get feedback based on how the drone moving in the air. You use this feedback to decide the next controls.
Learning to play games: This is another major area where RL has been very successful recently. You must have heard about Google Deepmind’s Alpha Go defeating world’s best Go player Lee Sedol, or Gerald Tesaro’s software agent defeating world Backgammon Champion. These agents use RL.
Atari Breakout played by Google Deepmind’s Deep Q-learning. To see a video recording of the game, click here.
Medical treatment planning: RL finds application in planning medical treatment for a patient. The goal is to learn a sequence of treatments based the patient’s reaction to the past treatments and the patient’s current state. The results of the treatments are obtained only at the end, and hence the decisions need to be taken very carefully as it can be a life and death situation.
Chatbots: Think of Microsoft’s chatbots Tay and Zo, and personal assistants like Siri, Google Now, Cortana, and Alexa. All the agents use RL in improving their conversation with a human user.
There are some key characteristics that separate RL from the other optimization paradigms and supervised learning methods in machine learning. Some of them are listed below.
No “supervision”: There is no supervisor, no labels telling the agent what is the best action to take. For example, in the robotics example, there is no supervisor guiding the robot in choosing the next moves.
Delayed feedback: Even though there is a feedback after every action, most often the feedback is delayed. That is, the effect of the actions taken by the agent might not be observed immediately. For example, when flying a drone from one point to another, moving it aggressively may seem to be nice as the drone moving towards the target quickly, however, the agent might realize later that this aggressive move made the drone to crash into something.
Sequential decisions: The notion of time is essential in RL: each action effects the next state of the system and that further influences the agent’s next action.
Effects of actions on observations: Another key characteristic of RL is that the feedback the agent obtains is in fact a function of the actions taken by the agent and the uncertainty in the environment. In contrary, for the supervised learning paradigm of machine learning, samples in the training data are assumed to be independent of each other.
In this section, we establish some preliminaries that are necessary for understanding the methods discussed later.
Let Z0,Z1,Z2,…Z0,Z1,Z2,… be a sequence of random objects. We say that the sequence is Markov if for every t∈{1,2,…}t∈{1,2,…},
P(Zt∈⋅|Z0=z0,Z1=z1,…,Zt−1=zt−1)=P(Zt∈⋅|Zt−1=zt−1),P(Zt∈⋅|Z0=z0,Z1=z1,…,Zt−1=zt−1)=P(Zt∈⋅|Zt−1=zt−1), for any sequence z0,z1,…,zt−1z0,z1,…,zt−1. In other worlds, if tt indicates time steps, given the current state of a Markov chain, its future is independent of its past.
Let FF be a a mapping on a complete metric space (M,D)(M,D) into itself. Any M∈MM∈M such that F(M)=MF(M)=M is called a fixed point of the mapping FF. Further, FF is said to be contraction if there exists a constant γ<1γ<1 such that
D(F(M),F(M′))≤γD(M,M′),D(F(M),F(M′))≤γD(M,M′), for every M,M′∈MM,M′∈M.
Fixed point theorem is also known as the method of successive approximations or the principle of contraction mappings
Theorem 9.1 (Fixed point theorem): Every contraction mapping defined on a complete metric space has a unique fixed point.
Finding the fixed point is easy: start at an arbitrary point M0∈MM0∈M and recursively apply
Mk=F(Mk−1).Mk=F(Mk−1). for each k=1,2,3,…k=1,2,3,…. Then the sequence M0,M1,M2,…M0,M1,M2,… converges to the unique fixed point.
For example consider the following real-valued function f(x)f(x) on M=[1,3.5]M=[1,3.5] with DD being the Euclidean distance. f(x)=12(x+2x).f(x)=12(x+2x). It is easy to show that f(x)f(x) is a contraction mapping with a fixed point at √2√2. The following figure illustrates the convergence of the sequence x0=3,x1=f(x0),x2=f(x1),…x0=3,x1=f(x0),x2=f(x1),… to √2√2.
To understand RL, we first need to understand the framework of Markov Decision Processes. They are the underlying stochastic models in RL for making decisions. Essentially, RL is the problem we address when this underlying model is either too difficult to solve or unknown in order to find the optimal strategy in advance.
As a first example, consider a scenario focusing on the engagement level of a student. Assume that there are LL levels of engagement, 1,2,…,L1,2,…,L, where at level 11 the student is not engaged at all and at the other extreme, at level LL, she is maximally engaged. Our goal is to maintain engagement as high as possible over time. We do this by choosing one of two actions at any time instant. (0): “do nothing” and let the student operate independently. (1): “stimulate” the student. In general, stimulating the student has a higher tendency to increase her engagement level, however this isn’t without cost as it requires resources.
We may denote the engagement level at (discrete) time tt by X(t)X(t), and for simplicity we assume here that at any time, X(t)X(t) either increases or decreases by 11. An exception exists at the extremes of X(t)=1X(t)=1 and X(t)=L.X(t)=L. In these extreme cases the student engagement either stays the same or increases/decreases respectively by 11. The actual transition of engagement level is random, however we assume that if our action is “stimulate” then there is more likely to be an increase of engagement than if we “do nothing”.
The control problem is the problem of deciding when to “do nothing” and when to “stimulate”. For this we formulate a reward function and assume that at any time tt our reward is,
Here κκ is some positive constant and A(t)=0A(t)=0 if the action at time tt is to “do nothing”, while A(t)=1A(t)=1 if the action is to “stimulate”. Hence the constant κκ captures our relative cost of stimulation effort in comparison to the benefit of a unit of student engagement. We see that R(t)R(t) depends both on our action and the state.
The reward is accumulated over time into an optimization objective via,
where γ∈(0,1)γ∈(0,1) and is called the discount factor. Through the presence of γγ, future rewards are discounted with a factor of γtγt, indicating that in general the present is more important than the future. There can also be other types of objectives, for example finite horizon or infinite horizon average reward, however we only focus on this infinite horizon expected discounted reward case.
A control policy is embodied by the sequence of actions {A(t)}∞t=0{A(t)}∞t=0. If these actions are chosen independently of observations then it is an open loop control policy. However, for our purposes, things are more interesting in the closed loop or feedback control case in which at each time tt we observe the state X(t)X(t), or some noisy version of it. This state feedback helps us decide on A(t)A(t).
We encode the effect of an action on the state via a family of Markovian transition probability matrices. For each action aa, we set a transition probability matrix P(a)P(a). In our case, as there are two possible actions, we have two matrices such as for example,
Hence for example in state 11 (first row), if we choose action 00 then the transitions follow [1/2 1/2 0 …0][1/2 1/2 0 …0], whereas if we choose action 11 then the transitions follow [1/4 3/4 0 …0][1/4 3/4 0 …0]. That is for a given state ii, choosing action aa implies that the next state is distributed according to [P(a)i1 P(a)i2 ⋯][P(a)i1 P(a)i2 ⋯] (where most of the entries of this probability vector are 00 in this example).
In the case of MDP and POMDP these matrices are assumed known, however in the case of RL these matrices are unknown. The difference between MDP and POMDP is that in MDP we know exactly in which state we are in, whereas in POMDPs our observations are noisy or partial and only hint at the current state. Hence in general we can treat MDP as the basic case and POMDP and RL can be viewed as two variants of MDP. In the case of POMDP the state isn’t fully observed, while in the case of RL the transition probabilities are not known.
We don’t focus on POMDPs further in this course, but rather we continue with MDP and RL. For this, we first need to understand more technical aspects of MDP.
Formally, an MDP is a discrete stochastic control process where the agent starts at time t=0t=0 in the environment with an observation ω0∈Ωω0∈Ω about the initial state s0∈Ss0∈S of the system, where ΩΩ is the space of observations and SS is the state space. At each time step t∈{1,2,3,…}t∈{1,2,3,…}, (i) the agent obtains a reward with mean rtrt, (ii) the state of the system transits from stst to a new state st+1st+1, and (iii) the agent obtains an observation ωt+1ωt+1.
The key property of MDPs is that they satisfy the Markov property, which essentially states that the future is independent of the past given the current observation, and hence the agent has no interest in looking back at the full history.
Definition 9.1 : A descrete time stochastic control process is said to be Markov if for all t=1,2,…t=1,2,…,
P(ωt+1|ω0,a0,…,ωt,at)=P(ωt+1|ωt,at)P(ωt+1|ω0,a0,…,ωt,at)=P(ωt+1|ωt,at) and P(rt|ω0,a0,…,ωt,at)=P(rt|ωt,at)P(rt|ω0,a0,…,ωt,at)=P(rt|ωt,at)
Definition 9.2 : A descrete time stochastic control process is said to be fully observable if for all t=1,2,…t=1,2,…,
ωt=st.ωt=st.
Definition 9.3 : A Markov Decision Process is a fully observable descrete time stochastic control process which satisfies the Markov property.
An MDP is a 5-tuple (S,A,T,R,γ)(S,A,T,R,γ), here
SS is the state space of the system;
AA is the action space;
T:S×A×S→[0,1]T:S×A×S→[0,1] is the transition function defining the conditional probabilities between the states;
R:S×A×S→RR:S×A×S→R is the reward function such that for some constant RmaxRmax, R(s,a,s′)≤RmaxR(s,a,s′)≤Rmax for all s,s′∈Ss,s′∈S and a∈A;a∈A;
γγ is the discount factor;
The agent selects an action at each time step according to a policy π∈Ππ∈Π. Policies can be divided into stationary and non-stationary. A non-stationary policy depends on time, and they are useful for finite horizon case problems (i.e., there exists a KK such that the agent takes actions only at t=0,1,…,Kt=0,1,…,K). In this chapter, we consider infinite horizons and stationary policies.
Stationary policies can be further divided into two classes:
Deterministic: There is a unique action aa for each state ss specified by a policy map π(s):S→Aπ(s):S→A.
Stochastic: The agent takes an action aa at state ss according to a probability distribution given by the policy function π(s,a):S×A→[0,1]π(s,a):S×A→[0,1]. Note that ∫a∈Aπ(s,a)da=1∫a∈Aπ(s,a)da=1 for all s∈Ss∈S.
With these notions in mind, we can see that
the transition probability T(st,at,st+1)=P(st+1|st,at),T(st,at,st+1)=P(st+1|st,at), and
the average reward rt=E[R(st,a,st+1)],rt=E[R(st,a,st+1)], where at∼π(st,⋅)at∼π(st,⋅) and st+1∼T(st,at,⋅).st+1∼T(st,at,⋅).
The following figure illustrates the transitions in an MDP.
As in Example 9.3.1, the goal of the agent is to maximize the expected total discounted reward defined by the value function Vπ(s):S→RVπ(s):S→R:
Vπ(s):=E[∞∑t=0γtR(st,at,st+1)|s0=s,π],Vπ(s):=E[∞∑t=0γtR(st,at,st+1)|s0=s,π], where conditioning on ππ implies that at each step, action atat is selected according to the policy ππ.
Observe that under the stationary assumption, for all tt,
Vπ(s)=E[∞∑k=0γkR(st+k,at+k,st+k+1)|st=s,π].Vπ(s)=E[∞∑k=0γkR(st+k,at+k,st+k+1)|st=s,π]. Then, the optimal expected return (i.e., optimal value function) is given by
V∗(s)=maxπ∈ΠVπ(s).V∗(s)=maxπ∈ΠVπ(s).
The following result, Theorem 9.2, by (Puterman 1994) states that there exists a deterministic optimal policy that maximizes the value function. That is, the agent has no need to sample action from a probability distribution to achieve maximum expected discounted reward.
Theorem 9.2 (Puterman [1994]): For any infinite horizon discounted MDP, there always exists a deterministic stationary policy ππ that is optimal.
From Theorem 9.2, we can assume that the set of policies ΠΠ contains only deterministic stationary polices.
Further, observe that the action taken by the agent at time tt determines not only the reward at time tt, but also the states after time tt and therefore the rewards after tt. Hence, the agent has to choose the policy strategically in order to maximize the value function. Now onwards we focus on developing such optimization algorithms.
Exercise: Determine SS, AA, TT, and RR for Example 9.3.1.
The following theorem establishes the recursive behavior of the value function: Bellman equations for value functions.
Theorem 9.3 : For any policy π∈Ππ∈Π,
Vπ(s)=∑s′∈ST(s,π(s),s′)[R(s,π(s),s′)+γVπ(s′)].Vπ(s)=∑s′∈ST(s,π(s),s′)[R(s,π(s),s′)+γVπ(s′)].
Proof. From the definition of the value function,
Vπ(s)=E[R(s,a0,s1)+∞∑t=0γtR(st,at,st+1)|s0=s,π]=Es′∼T(s,π(s),⋅)[R(s,a0,s′)]+Es′∼T(s,π(s),⋅)[∞∑t=1γtR(st,at,st+1)|s0=s,a0=π(s)]=∑s′∈ST(s,π(s),s′)R(s,π(s),s′)+γ∑s′∈ST(s,π(s),s′)[∞∑t=1γt−1R(st,at,st+1)|s1=s′,π]=∑s′∈ST(s,π(s),s′)R(s,π(s),s′)+γ∑s′∈ST(s,π(s),s′)Vπ(s′).Vπ(s)=E[R(s,a0,s1)+∞∑t=0γtR(st,at,st+1)|s0=s,π]=Es′∼T(s,π(s),⋅)[R(s,a0,s′)]+Es′∼T(s,π(s),⋅)[∞∑t=1γtR(st,at,st+1)|s0=s,a0=π(s)]=∑s′∈ST(s,π(s),s′)R(s,π(s),s′)+γ∑s′∈ST(s,π(s),s′)[∞∑t=1γt−1R(st,at,st+1)|s1=s′,π]=∑s′∈ST(s,π(s),s′)R(s,π(s),s′)+γ∑s′∈ST(s,π(s),s′)Vπ(s′).
In particular, for an optimal policy π∗π∗, we have:
Theorem 9.4 : For all s∈Ss∈S,
where r(s,a)=Es′∼T(s,a,⋅)[R(s,a,s′)]r(s,a)=Es′∼T(s,a,⋅)[R(s,a,s′)] for every pair (s,a)(s,a).
Proof. Observe that, for each s∈Ss∈S,
V∗(s)=maxπ∈ΠVπ(s)=maxπ∈ΠEa∼π(s),s′∼T(s,a,⋅)[R(s,a,s′)+γVπ(s′)]≤maxa∈A[r(s,a)+γ∑s′∈ST(s,a,s′)maxπ∈ΠVπ(s′)]=maxa∈A[r(s,a)+γ∑s′∈ST(s,a,s′)V∗(s′)].V∗(s)=maxπ∈ΠVπ(s)=maxπ∈ΠEa∼π(s),s′∼T(s,a,⋅)[R(s,a,s′)+γVπ(s′)]≤maxa∈A[r(s,a)+γ∑s′∈ST(s,a,s′)maxπ∈ΠVπ(s′)]=maxa∈A[r(s,a)+γ∑s′∈ST(s,a,s′)V∗(s′)].(2)(3)(4)(5)
If the inequality is strict, we can improve the value of state ss by using a policy (possibly non-stationary) that uses an action from argmaxa∈Ar(s,a)argmaxa∈Ar(s,a) in the first step.
In fact, (Puterman 1994) shows as a consequence of the fixed point theorem that V∗V∗ is a unique solution of (9.5).
In addition to value functions, there are few other functions interests us. Among them most important one is Q-value function Qπ(s,a):S×A→RQπ(s,a):S×A→R defined by
Qπ(s,a)=E[∞∑t=0γtR(st,at,st+1)|s0=s,a0=a].Qπ(s,a)=E[∞∑t=0γtR(st,at,st+1)|s0=s,a0=a]. The Q-value is essentially the expected total discounted reward when started in state ss with initial action aa, and following policy ππ at each step t=1,2,….t=1,2,….
Just like the value function, we can show that the Q-value function also satisfies Bellman equation as follows:
Qπ(s,a)=∑s′∈ST(s,a,s′)[R(s,a,s′)+γQπ(s′,π(s′))].Qπ(s,a)=∑s′∈ST(s,a,s′)[R(s,a,s′)+γQπ(s′,π(s′))]. Also, similar to value functions, for each (s,a)(s,a) pair, the optimal Q-value functions is defined by Q∗(s,a)=maxπ∈ΠQπ(s,a).Q∗(s,a)=maxπ∈ΠQπ(s,a). The main advantage of Q-value function is that the optimal policy can be computed directly from the Q-value function using,
π∗(s)=argmaxa∈AQ∗(s,a),π∗(s)=argmaxa∈AQ∗(s,a), where use some deterministic method to select a solution if argmaxa∈AQ∗(s,a)argmaxa∈AQ∗(s,a) is a set.
From the definitions of Vπ(s)Vπ(s) and Qπ(s,a)Qπ(s,a), it is immediately obvious that an easy way to estimate them is by using Monte Carlo algorithms, i.e., for each state ss or each state-action pair (s,a)(s,a), run several simulations starting from ss or (s,a)(s,a) and take sample mean of the values or the Q-values obtained. This is not possible in practice when limited data is available. Even when large data is available, Monte Carlo algorithms may not be computationally efficient compared to the algorithms discussed below.
Assume that both the state space SS and the action space AA are finite and discrete. Further assume that the discount factor γ<1γ<1.
It is an indirect method that finds an optimal function V∗V∗, but not optimal policy. In this implementation, we assume that it is easy to compute r(s,a)=Es′∼T(s,a,⋅)[R(s,a,s′)]r(s,a)=Es′∼T(s,a,⋅)[R(s,a,s′)] for any pair (s,a)(s,a).
Algorithm : Pseudocode for value iteration |
repeat for t=1,2,…t=1,2,… until ‖vk−vk−1‖∞≤ϵ∥vk−vk−1∥∞≤ϵ
The following theorem establishes that value iteration algorithm convergences to the optimal value. This is again a consequence of the fixed point theorem for contraction mappings.
Theorem 9.5 (Puterman [1994]):
The above algorithm converges linearly at rate γγ:
‖vk−V∗‖∞≤γk1−γ‖v1−v0‖∞,∥vk−V∗∥∞≤γk1−γ∥v1−v0∥∞,
where V∗={V∗(s):s∈S}V∗={V∗(s):s∈S}.Observe that V∗(s)=maxa∈AQ∗(s,a),∀s∈S.V∗(s)=maxa∈AQ∗(s,a),∀s∈S. Therefore, the Bellman equations for the Q-value function can be rewritten as follows: Q∗(s,a)=r(s,a)+γ∑s′∈ST(s,a,s′)(maxa′∈AQ∗(s′,a′)).Q∗(s,a)=r(s,a)+γ∑s′∈ST(s,a,s′)(maxa′∈AQ∗(s′,a′)). The Q-value iteration algorithm below takes advantage of this relation:
Algorithm : Pseudocode for Q-value iteration |
repeat for t=1,2,…t=1,2,… until ‖Qk−Qk−1‖∞≤ϵ∥Qk−Qk−1∥∞≤ϵ
Similar to the value iteration, since Q-values also satisfy contraction mapping, using the fixed value theorem, we can show that QkQk converges to the unique optimal Q-value function Q∗Q∗ linearly.
This is a direct method that finds the optimal policy.
Algorithm : Pseudocode for policy iteration |
repeat for t=1,2,… until πk=πk−1
For every s∈S, πk(s)=argmaxa∈A(r(s,a)+γEs′[vk−1(s′)|s,a]) for all s∈S
Set vk as the value of πk (compute these values using steps in the value iteration algorithm)
For most practical problems, the state space is higher-dimensional (probably continuous), and the transition probabilities and reward function of the underlying MDP model are either unknown or extremely difficult to compute. For instance, consider the engagement level example, Example 9.3.1. The transition matrices (9.3) and (9.4) are a postulated model of reality. In some situations the parameters of such matrices may be estimated from previous experience, however often this isn’t feasible due to changing conditions or lack of data. So the above iteration algorithms are hard to implement. RL algorithms are useful for solving such problems. We now focus on RL algorithms.
In this section, we explore one class of RL algorithms called Q-learning. The main idea of this method is to learn the Q-function without explicitly decomposing Q(s,a) into T, V and R. Observe from the Bellman equation of Q-value function that if we were to know Q(s,a) for every state s and action a, then we can also compute the optimal policy by selecting the a that maximizes Q(s,a) for every s.
The key of Q-learning is to continuously learn Q(⋅,⋅) while using the learned estimates to select actions as we go. First recall Q-value iteration:
Qt+1(s,a)=r(s,a)+γ∑s′∈ST(s,a,s′)(maxa′∈AQt(s′,a′)). In Q-learning, these updates are approximated using sample observations. As we operate our system with Q-learning, after an action at is chosen at time step t, the state of the system changes from state st to st+1, and a reward R(st,at,st+1) is obtained. At that point we update the (st,at) entry of the Q-function estimate as follows:
Here αt is a decaying (or constant) sequence of probabilities. A pseudocode of Q-learning is given below.
Algorithm : Pseudocode for Q-learning |
Input: Given starting state distribution D0;
Initialization: Take ˆQ to be an empty array
repeat for e=1,2,… until change in ˆQ is small
s0∼D0.
Select step sizes α0,α1,…
repeat for t=1,2,… until epoch is terminated
Select an action at.
Observe the next state st+1 and the reward R(st,at,st+1).
Take δt:=(R(st,at,st+1)+γmaxa′∈AˆQ(st+1,a′))−ˆQ(st,at)
ˆQ(st,at)←ˆQ(st,at)+αtδt
We attempt to balance exploration and exploitation. With a high probability, we decide on action a′ that maximizes Qt(st+1,⋅) - this is exploitation. However, we leave some possibility to explore other actions, and occasionally decide on an arbitrary (random) action - this is exploration. The key of the Q-learning update equation (9.6) is a weighted average of the previous estimate Qt(st,at) and a single sample of the right hand side of the Bellman equation of Q-value function. Miraculously as the system progresses under such a control, this scheme is able to estimate the Q-function and hence control the system well.
For the above Q-learning algorithm, the following convergence result holds.
Theorem 9.6 (Watkins and Dayan 1992): Suppose that the rewards are bounded and the learning rates satisfy ∞∑i=1αni(s,a)=∞,and∞∑i=1α2ni(s,a)<∞, for all (s,a)∈S×A. Let ˆQk(s,a) be the estimate of Q(s,a) in round k.
Then, ˆQk(s,a)→Q(s,a) as k→∞ almost surely for all (s,a)∈S×A, where ni(s,a) is the ith instant the action a is selected in state s.
The above theorem essentially states that if all the actions are selected in all the states infinitely often, and the learning rate decreases to zero but not too quickly, then Q-learning converges. This property shows that (at least in principle), systems controlled via Q-learning may still be controlled in an asymptotically optimal manner, even without explicit knowledge of the underlying transitions T(s,a,s′).
Issue: It is difficult to implementation of Q-learning algorithm for many practical applications where the state space is large (possibly continuous). There can be many states to visit and keep track of. The following images illustrate this point.
To overcome the scalability issue highlighted above, instead of updating the Q-value of each state-action pair, we would like use what we have learned about already visited states to approximate Q-values of similar states that not yet visited. This is possible even if already visited states are small in number. Deep RL addresses this issue via function approximation using a deep neural network.
The main idea of such methods is to use function approximation to learn a parametric approximation Q(s,a;θ) of Q(s,a). For instance, the approximate function Q(s,a;θ) can be linear in n parameters θ and n features x(s,a) as follows: Q(s,a;θ)=θ0+θ1x1(s,a)+⋯+θnxn(s,a). Or, more generally, we can use a DNN to have an approximation of the form Q(s,a;θ)=fθ(x(s,a)), where fθ represents the DNN with parameters θ= (weights, biases). Using such function approximation, for a given θ, the Q-value function can be computed approximately for unseen state-action pairs (s,a). The key point to note that instead of learning |S|×|A| matrix of Q-values, function approximation based Q-learning learns θ. That is, we are trying to find a θ such that the Bellman equations Q(s,a;θ)=r(s,a)+γ∑s′∈ST(s,a,s′)[maxa′∈AQ(s′,a′;θ)]=Es′∼T(s,a,s′)[R(s,a,s′)+γmaxa′∈AQ(s′,a′;θ)] can be approximated for every state-action pair (s,a)∈S×A. We do this approximation by minimizing squares loss function:
ℓθ(s,a)=Es′∼T(s,a,⋅)[(Q(s,a;θ)−R(s,a,s′)−γmaxa′∈AQ(s′,a′;θ))2]. Suppose if we take target(s′):=R(s,a,s′)+γmaxa′∈AQ(s′,a′;θ), then we are trying to minimize the expectation of ℓθ(s,a,s′)=(Q(s,a;θ)−target(s′))2. The following Q-learning algorithm uses gradient descent to optimize this loss for the observation s′. When we compare this with supervised learning, target(s′) acts like actual label and Q(s,a;θ) acts like prediction. However, unlike in supervised learning, target(s′) and Q(s,a;θ) are not independent.
Algorithm : Function approximation based Q-learning |
Fix an initial state s0
Initiate the parameters ˆθ
repeat for e=1,2,… until the termination condition is satisfied
Start in s0.
Select step sizes α0,α1,…
repeat for t=1,2,… until st is not a terminal state
Select an action at. Probably using the greedy strategy at∈argmaxaQ(st,a;ˆθ)
Observe the next state st+1 and the reward R(st,at,st+1).
Take δt:=(R(st,at,st+1)+γmaxa′∈AQ(st+1,a′;ˆθ))−Q(st,at;ˆθ)
ˆθ←ˆθ+αtδt∇ˆθQ(st,at;ˆθ)
In the previous chapters, we have seen several deep neural networks that provide function approximations for models in supervised learning, providing an approximation of the mapping from inputs to the corresponding outputs. Deep Q-learning networks use deep neural networks for the function approximation in the last algorithm, where the parameters θ are weights and biases of the deep neural network.
In DQN, the input feature x at time t is formed by the state-action pair (st,at). For a given θt, the output fθt(x) is the prediction for Q(st,at;θt).
Since we do not directly have a target label, we ‘bootstrap’ from the current Q-network (that is, current parameters θt) to generate a target label as R(st,at,st+1)+γmaxa∈AQ(st,a;θt).
Then to update the parameters θ, we use gradient descent on squared loss function.
Exploration: Recall that the goal is to simultaneously minimize the loss functions {ℓθ(s,a):(s,a)∈S×A}. The number of samples available for each (s,a) depends on how much we explore s and a. So, there are different number of samples available for different state-action pairs. As a consequence, depending on the transition dynamics of the system, accuracies are different for different (s,a) pairs. Of course, there is no need to have high accuracy for every state-action pair. For example, rate (s,a) with low reward can low accuracy without effecting the overall reward. Also, if two state-action pairs have similar features, then it is unnecessary to explore both to learn a good parameter. In summary, it is important to have an adaptive exploration scheme to explore state-action pairs so that the converges of the parameters happens quickly.
Stabilizing: Even though the idea of comparing the prediction Q(s,a;θ) to target(s′) is similar to supervised learning, there are several challenges in deep RL that are not present in supervised learning. Minimizing squared loss ℓθ(s,a) is here is different from minimizing loss in supervised learning where ‘true’ labels are present, because target(s′) is also an estimate that can change too quickly with θ. As a consequence, the contraction mapping property of Bellman equation may not be sufficient to guarantee convergence of Q values. There ere experiments suggesting that these errors can propagate quickly, resulting in slow convergence or even making the algorithm unstable. For more details on this, see Section 4 of (François-Lavet et al. 2018) and the references there.
The idea of function approximation using neural networks is first proposed in (Mnih et al. 2013) for ATARI games using pixel values as features.
Essentially, they created a single convolutional neural network (CNN) agent that can successfully learn to play as many ATARI 2600 games as possible. The network was not provided with game specific information or hand-designed visual features. The agent learned to play from nothing but the video input of 210×160RGB at 60Hz. Some specifications of the CNN network used are
Each frame of the video is pre-processed to get an input image of 84×84×4. This is done because working with the original frame size 210×160 pixels with a 128 color palette is computationally demanding. This process includes first converting the raw RGB frame to greyscale and then down-sampling it to 110×84 image. Final cropping to the size 84×84 focuses only on playing area of the image.
The first hidden layer has 16 channels with 8×8 kernel, stride 4 and ReLu activation function.
The second hidden layer has 32 channels with 4×4 kernel, stride 2 and ReLu activation function.
The final hidden layer is fully connected linear layer with 256 neurons.
The output layer is again a fully connected linear layer with a single out for each action a∈A (i.e., the output size is |A|). For the games considered in this paper, the number of actions |A| varied between 4 and 18.
Watch the following video explaining the paper (Mnih et al. 2013) in some more details.
Instead of greedy strategy in selecting an action in each iteration, they use ϵ-greedy strategy: selects an action using greedy strategy with probability 1−ϵ and selects a random action with probability ϵ.
Since the scale of the scores varies from game to game, in addition to the above heuristics, to limit the the scale of error derivatives and make the network easier to use across multiple games, the rewards are clipped as follows: all positive rewards are 1, all negative rewards are −1, and 0 rewards are unchanged.
The max operation in computing target(s′)=R(s,a,s′)+γmaxa′∈AQ(s′,a;θ) uses the same parameter values to select and evaluate an action. This can result in overestimation of Q-learning values. To remedy this issue, in double DQN (DDQN), the target value is computed as target(s′)=R(s,a,s′)+γQ(s′,argmaxa′∈AQ(s,a′;θk);θ−k), where θk is the current update and θ−k is the current batch update of the parameters.
So far, in all the methods, the goal is to approximate the expected total discounted reward (either value function or Q-value function). Another richer approach is to find an optimal distribution of the cumulative reward. Such approach provides more information about the randomness of the rewards and transitions of the agent. Suppose Zπ:S×A→R is a random variable for each state-action pair with the distribution of cumulative reward obtained under the policy π, then it has expectation: Qπ(s,a)=E[Zπ(s,a)] for every state-action pair (s,a). The recursive relationship for the random reward is given by Zπ(s,a)=R(s,a,s′)+γZπ(S′,A′), where S′ and A′ denote the next state and the next action, which are random in nature. Note that A′∼π(⋅|S′).
Recently such distributional DQN are used in practice and they are shown to exhibit additional advantages; refer to (Bellemore et al 2017), (Dabney et al 2017), and (Rowland et al 2018) for more details.
Page built: 2021-03-04 using R version 4.0.3 (2020-10-10)