**THIS UNIT IS STILL UNDER CONSTRUCTION**

Sequential decision making is a key topic in machine learning. It is the task of deciding on a sequence of actions in an uncertain environment, based on experience, to achieve some goal. Sequential decision making finds crucial applications in many domains, including healthcare, robotics, finance, and self-driving cars.

In certain cases decision making processes can be detached from the precise details of information retrieval and inference; in other cases the two are coupled. In the latter case, we observe the system, gain information, and make decisions simultaneously. This is where techniques such as **Markov Decision Processes** (MDP), **Partially Observable Markov Decision Processes** (POMDP), and **Reinforcement Learning** (RL) play a role.

Over the last few years, RL has become increasingly popular in addressing important sequential decision making problems, a success mostly due to the combination of RL with deep learning techniques. As we have seen in the previous chapters, deep learning is able to learn different levels of abstraction from data while requiring less prior knowledge. Deep RL has created an opportunity for us to train computers to mimic some human problem-solving capabilities.

As a first example, consider a scenario focusing on the engagement level of a student. Assume that there are \(L\) levels of engagement, \(1,2,\ldots,L\), where at level \(1\) the student is not engaged at all, while at the other extreme, level \(L\), she is maximally engaged. Our goal is to maintain engagement as high as possible over time. We do this by choosing one of two actions at any time instant: (0) “do nothing” and let the student operate independently, or (1) “stimulate” the student. In general, stimulating the student tends to increase her engagement level; however, this isn’t without cost as it requires resources.

We denote the engagement level at (discrete) time \(t\) by \(X(t)\), and for simplicity we assume here that at any time \(X(t)\) either increases or decreases by \(1\). The exceptions are the extremes: at \(X(t) = 1\) the engagement either stays the same or increases by \(1\), while at \(X(t) = L\) it either stays the same or decreases by \(1\). The actual transition of engagement level is random; however, we assume that an increase of engagement is more likely if our action is “stimulate” than if we “do nothing”.

The control problem is the problem of deciding when to “do nothing” and when to “stimulate”. For this we formulate a **reward function** and assume that at any time \(t\) our reward is,
\[\begin{equation}
\tag{9.1}
R(t) = X(t) - \kappa \, A(t).
\end{equation}\]
Here \(\kappa\) is some positive constant and \(A(t) = 0\) if the action at time \(t\) is to “do nothing”, while \(A(t) = 1\) if the action is to “stimulate”. Hence the constant \(\kappa\) captures our relative cost of stimulation effort in comparison to the benefit of a unit of student engagement. We see that \(R(t)\) depends both on our action and the state.
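As a minimal sketch (the function name and the default value of \(\kappa\) are illustrative assumptions, not part of the text), the reward (9.1) can be computed as:

```python
def reward(x, a, kappa=0.5):
    """Reward (9.1): R(t) = X(t) - kappa * A(t).

    x is the engagement level, a is 0 ("do nothing") or 1 ("stimulate"),
    and kappa is the relative cost of the stimulation effort.
    """
    return x - kappa * a
```

For instance, at engagement level \(3\), stimulating deducts \(\kappa\), so the reward drops from \(3\) to \(3 - \kappa\).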

The reward is accumulated over time into an **optimization objective** via,
\[\begin{equation}
\tag{9.2}
\mathbb{E} \Big[ \sum_{t=0}^\infty \beta^t R(t) \Big],
\end{equation}\]
where \(\beta \in (0,1)\) is called the **discount factor**. Through the presence of \(\beta\), future rewards are discounted with a factor of \(\beta^t\), indicating that in general the present is more important than the future. There can also be other types of objectives, for example **finite horizon** or **infinite horizon average reward**, however we focus on this **infinite horizon expected discounted reward** case here.

A **control policy** is embodied by the sequence of actions \(\{A(t)\}_{t=0}^\infty\). If these actions are chosen independently of observations then it is an **open loop** control policy. However, for our purposes, things are more interesting in the **closed loop** or **feedback control** case in which at each time \(t\) we observe the state \(X(t)\), or some noisy version of it. This state feedback helps us decide on \(A(t)\).

We encode the effect of an action on the state via a family of Markovian transition probability matrices. For each action \(a\), we set a transition probability matrix \(P^{(a)}\). In our case, as there are two possible actions, we have two matrices, for example, \[\begin{align} {\scriptsize P^{(0)} = \begin{bmatrix} 1/2 & 1/2 & & & &&\\ 1/2 & 0 & 1/2 & & &&\\ & 1/2 & 0 & 1/2 & &&\\ & & \ddots & \ddots & \ddots &&\\ & & & \ddots & \ddots & \ddots& \\ & & & & 1/2 & 0& 1/2\\ & & & & &1/2& 1/2 \end{bmatrix},} \tag{9.3}\\ \ \\ {\scriptsize P^{(1)} = \begin{bmatrix} 1/4 & 3/4 & & & &&\\ 1/4 & 0 & 3/4 & & &&\\ & 1/4 & 0 & 3/4 & &&\\ & & \ddots & \ddots & \ddots &&\\ & & & \ddots & \ddots & \ddots& \\ & & & & 1/4 & 0& 3/4\\ & & & & &1/4& 3/4 \end{bmatrix}.\tag{9.4} } \end{align}\] Hence, for example, in state \(1\) (first row), if we choose action \(0\) then the transitions follow \([1/2~~1/2~~0~\ldots 0]\), whereas if we choose action \(1\) then the transitions follow \([1/4~~3/4~~0~\ldots 0 ]\). That is, for a given state \(i\), choosing action \(a\) implies that the next state is distributed according to \([P_{i1}^{(a)}~~ P_{i2}^{(a)}~~ \cdots ]\) (where most of the entries of this probability vector are \(0\) in this example).
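To make the structure of (9.3) and (9.4) concrete, here is a sketch that builds such tridiagonal matrices for an arbitrary number of levels \(L\); the function name and the use of NumPy are assumptions:

```python
import numpy as np

def transition_matrix(L, p):
    """Tridiagonal transition matrix for the engagement chain.

    From an interior state the level goes up with probability p and
    down with probability 1 - p; at the boundary states 1 and L the
    blocked move keeps the level where it is instead.
    """
    P = np.zeros((L, L))
    for i in range(L):
        if i == 0:
            P[i, i] = 1 - p          # at level 1, "down" stays put
            P[i, i + 1] = p
        elif i == L - 1:
            P[i, i] = p              # at level L, "up" stays put
            P[i, i - 1] = 1 - p
        else:
            P[i, i + 1] = p          # move up one level
            P[i, i - 1] = 1 - p      # move down one level
    return P

L = 5
P0 = transition_matrix(L, 1/2)       # "do nothing", as in (9.3)
P1 = transition_matrix(L, 3/4)       # "stimulate", as in (9.4)
```

Each row is a probability vector, so the rows of both matrices sum to one.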

In the case of MDP and POMDP these matrices are assumed known, whereas in the case of RL they are unknown. The difference between MDP and POMDP is that in an MDP we know exactly which state we are in, whereas in a POMDP our observations are noisy or partial and only hint at the current state. Hence in general we can treat MDP as the basic case, with POMDP and RL viewed as two variants of it: in a POMDP the state isn’t fully observed, while in RL the transition probabilities are not known.

We don’t focus on POMDPs further in this book (one can follow QQQQ for a simple tutorial), but rather we continue with an MDP example, and then explore a basic RL example. For this, we first need to understand more technical aspects of MDP.

A sequential decision making problem can be formalized as a discrete time stochastic control process as follows. An agent interacts with its environment by starting at a given initial state \(s_0 \in \mathcal{S}\) within its environment, with an initial observation \(w_0 \in \Omega\). At each step \(t\), the agent takes an action \(a_t \in \mathcal A\). As a consequence, the agent obtains a reward \(r_t \in \mathcal R\), the state of the agent transitions to \(s_{t+1} \in \mathcal S\), and the agent obtains an observation \(w_{t+1} \in \Omega\).

The theory of MDPs (see for example (Puterman 2014) or (Bertsekas 2005)) shows that under general conditions, for such time-homogeneous infinite horizon discounted reward problems, it is enough to consider **stationary deterministic Markov policies**. In this case, a policy \(\pi\) is a function mapping every state to an action. If we denote the set of all such policies by \(\Pi\), then an optimal policy is an element \(\pi \in \Pi\) that maximizes the objective (9.2) for any initial state. As such, the **value function** is a function defined as,
\[
V(i) = \max_{\pi \in \Pi} \mathbb{E} \Big[ \sum_{t=0}^\infty \beta^t R(t) ~|~ X(0) = i \Big],
\]
for any initial state \(i\). It defines the best possible total discounted reward (value) for any particular initial situation (state \(i\)).

We often don’t know the value function for a given problem; nevertheless, it is a tool that helps find optimal policies. This is because the value function appears in the **Bellman equation**:
\[\begin{equation}
\tag{9.5}
V(i) = \max_{a \in{ \cal A}} \{Q(i,a)\},
\quad
\mbox{with}
\quad
Q(i,a) = r(i,a) + \beta \, \sum_{j} P_{i,j}^{(a)} V(j),
\end{equation}\]
where \(r(i,a)\) is the expected reward (with cost deducted) for state \(i\) and action \(a\), and the set \({\cal A}\) is the set of possible actions. In our example, \({\cal A} = \{0,1\}\) and \(r(i,a)\) is based on (9.1) yielding,
\[
r(i,0) = i,
\qquad
\mbox{and}
\qquad
r(i,1) = i- \kappa.
\]

Notice that the value function \(V(\cdot)\) appears on both sides of the Bellman equation. To understand the basic idea, consider first the **Q-function** in (9.5). It measures the “quality” of being in state \(i\) and applying action \(a\). If such a state-action pair occurs then the immediate expected reward \(r(i,a)\) is obtained, followed by a transition to some random state \(j\). This happens with probability \(P_{i,j}^{(a)}\), at which point the problem continues and has value \(V(j)\). However, the transition occurs at the next time step and hence multiplication by the discount factor \(\beta\) presents the value in terms of the current time step.

The Bellman equation uses the **dynamic programming principle** to determine the optimal value via maximization of the Q-function over all actions \(a \in {\cal A}\). It yields an equation where the “unknown” is the value function, \(V(\cdot)\).

Some MDP theory deals with the validity and properties of the Bellman equation. Much of the remaining study of MDP deals with methods of solving the Bellman equation (9.5), or with analyzing properties of the solution. Observe that if we knew the value function \(V(\cdot)\) or the Q-function \(Q(\cdot,\cdot)\), then we would also know an optimal policy: for every state \(i\) we would set \(\pi(i) = a^*\), where \(a^*\) is the action that maximizes the right hand side of the Bellman equation.

In basic MDP, when confronted with a Bellman equation, there are typically several types of methods that can be used to solve it. Solving it implies finding the value function and with it an optimal policy. The main known methods include **value iteration**, **policy iteration** and **linear programming**. Here we only focus on the most basic of these methods: value iteration.

Observing that the Bellman equation (9.5) contains \(V(\cdot)\) both in the left hand side and the right hand side, value iteration iterates over successive value functions, \(V_0(\cdot), V_1(\cdot), V_2(\cdot),\ldots\), until convergence. That is we begin with some arbitrary value function, \(V_0(\cdot)\) and repeatedly apply: \[\begin{equation} \tag{9.6} V_{t+1}(i) = \max_{a \in{ \cal A}} \big\{r(i,a) + \beta \, \sum_{j} P_{i,j}^{(a)} V_t(j)\big\}, \qquad \mbox{for all} ~~i. \end{equation}\] See (Puterman 2014) for more details.

Programmatically, implementing the value iteration in (9.6) is straightforward for small state space examples. We iterate and stop when the difference between iterates under some sensible **norm** is smaller than a prescribed level, say \(\varepsilon\).

QQQQQQ

This is implemented in Listing **??** where we actually consider a collection of problems characterized by the cost parameter \(\kappa\).
For each problem, once the value function is found, we use it to determine the optimal policy. The policies are then plotted in Figure **??** for \(L=10\) and \(\beta = 0.75\).
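A minimal sketch of such a value iteration, for a single cost parameter \(\kappa\) (the helper names and the choice \(\kappa = 2\) are illustrative assumptions; the listing in the text sweeps over several \(\kappa\) values), might look as follows:

```python
import numpy as np

def transition_matrices(L, p0=1/2, p1=3/4):
    """Tridiagonal matrices P^(0), P^(1) of (9.3)-(9.4); at the
    boundary states the blocked move stays in place instead."""
    mats = []
    for p in (p0, p1):
        P = np.zeros((L, L))
        for i in range(L):
            P[i, min(i + 1, L - 1)] += p      # up (stays put at level L)
            P[i, max(i - 1, 0)] += 1 - p      # down (stays put at level 1)
        mats.append(P)
    return mats

def value_iteration(L=10, beta=0.75, kappa=2.0, eps=1e-9):
    """Iterate (9.6) until successive value functions differ by < eps."""
    P0, P1 = transition_matrices(L)
    states = np.arange(1, L + 1)              # engagement levels 1..L
    r = np.stack([states, states - kappa])    # r(i,0) = i, r(i,1) = i - kappa
    V = np.zeros(L)
    while True:
        Q = r + beta * np.stack([P0 @ V, P1 @ V])
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < eps:
            return V_new, Q.argmax(axis=0)    # value function, greedy policy
        V = V_new

V, policy = value_iteration()
```

The returned `policy` maps each engagement level to the maximizing action of the final Q-values, i.e. \(0\) (“do nothing”) or \(1\) (“stimulate”).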

In many practical situations there isn’t a clear model for the transition probability matrices \(P^{(a)}\), \(a \in {\cal A}\). For example in our engagement level example, the matrices (9.3) and (9.4) are a postulated model of reality. In some situations the parameters of such matrices may be estimated from previous experience, however often this isn’t feasible due to changing conditions or lack of data.

Such situations are handled by reinforcement learning, a broad class of methods dealing with the control of systems for which we lack parameter knowledge. In classic control theory this situation falls under the umbrella of **adaptive control**. However, in contemporary robotics, self-driving cars, and artificial neural networks, RL has become the key term.

Here we explore one class of RL algorithms called **Q-learning**. The main idea of this method is to learn the Q-function as in (9.5) without explicitly decomposing \(Q(i,a)\) into \(P\), \(V\), and \(r\). Observe from the Bellman equation that if we knew \(Q(i,a)\) for every state \(i\) and action \(a\), then we could also compute the optimal policy by selecting, for every \(i\), the \(a\) that maximizes \(Q(i,a)\).

The key to Q-learning is to continuously learn \(Q(\cdot,\cdot)\) while using the learned estimates to select actions as we go. For this, denote by \(\hat{Q}_t(\cdot,\cdot)\) the estimate of \(Q(\cdot,\cdot)\) we have at time \(t\). At any given time we attempt to balance **exploration and exploitation**. With high probability, we choose the action \(a\) that maximizes \(\hat{Q}_t(i,a)\) - this is exploitation. However, we leave some possibility to explore other actions, and occasionally choose an arbitrary (random) action \(a\) - this is exploration. In our example, as time progresses we reduce the probability of exploration. Specifically, we use \(t^{-0.2}\) for this probability, which implies that as time evolves we slowly explore less and less.

As we operate our system with Q-learning, after an action is chosen, reward \(r\) is obtained and the system transitions from state \(i\) to state \(j\). At that point we update the \((i,a)\) entry of the Q-function estimate as follows:
\[\begin{equation}
\tag{9.7}
\hat{Q}_{t+1}(i,a) = (1-\alpha_t) \, \hat{Q}_t(i,a) + \alpha_t \, \Big( r + \beta \max_{a' \in{ \cal A}} \hat{Q}_t(j,a') \Big).
\end{equation}\]
Here \(\alpha_t\) is a decaying (or constant) sequence of step sizes, each in \((0,1]\) (in the example below we use \(\alpha_t = t^{-0.2}\)). The key of the **Q-learning update equation** (9.7) is that it takes a weighted average of the previous estimate \(\hat{Q}_t(i,a)\) and a single sample of the right hand side of the Bellman equation (9.5). Remarkably, as the system progresses under such control, this scheme is able to estimate the Q-function and hence control the system well.
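The scheme above can be sketched as follows, where the environment is simulated with the matrices (9.3)-(9.4) but the learner itself only sees states, rewards, and transitions; the helper names and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning(P, r, beta=0.75, T=100_000):
    """Tabular Q-learning (9.7) with decaying exploration probability."""
    L = len(r[0])
    Q = np.zeros((L, 2))
    i = 0                                    # start at engagement level 1
    for t in range(1, T + 1):
        if rng.random() < t ** -0.2:         # explore with probability t^(-0.2)
            a = int(rng.integers(2))
        else:                                # otherwise exploit the estimate
            a = int(Q[i].argmax())
        j = int(rng.choice(L, p=P[a][i]))    # environment makes a transition
        alpha = t ** -0.2                    # step size
        # Update (9.7): weighted average of the old estimate and a
        # one-sample estimate of the Bellman right hand side.
        Q[i, a] = (1 - alpha) * Q[i, a] + alpha * (r[a][i] + beta * Q[j].max())
        i = j
    return Q

# Simulated environment: the matrices of (9.3)-(9.4), hidden from the learner.
L, kappa = 10, 2.0
P = []
for p in (1/2, 3/4):
    M = np.zeros((L, L))
    for s in range(L):
        M[s, min(s + 1, L - 1)] += p         # up (stays put at level L)
        M[s, max(s - 1, 0)] += 1 - p         # down (stays put at level 1)
    P.append(M)
states = np.arange(1, L + 1, dtype=float)
r = [states, states - kappa]                 # r(i,0) = i, r(i,1) = i - kappa
Q_hat = q_learning(P, r)
```

After many steps, reading off the greedy action `Q_hat[i].argmax()` for each state \(i\) gives the learned policy.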

Note that ideally we would set \(\{\alpha_t\}_{t=1}^\infty\) to satisfy,
\[
\sum_{t=1}^\infty \alpha_t = \infty,
\qquad
\mbox{and}
\qquad
\sum_{t=1}^\infty \alpha_t^2 < \infty.
\]
With such a condition, based on the theory of **stochastic approximation**, it is guaranteed that as \(t \to \infty\), \(\hat{Q}_t(i,a) \to Q(i,a)\). This property shows that (at least in principle), systems controlled via Q-learning may still be controlled in an asymptotically optimal manner, even without explicit knowledge of the underlying transition matrices \(P^{(a)}\).
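For step sizes of the form \(\alpha_t = t^{-q}\), both conditions hold exactly when \(1/2 < q \le 1\); the choice \(t^{-0.2}\) used above satisfies the first condition but not the second, trading the asymptotic guarantee for faster adaptation. As a quick numerical illustration for \(q = 1\):

```python
import math

# Partial sums for alpha_t = 1/t: the first grows without bound
# (like log t), while the second converges to pi^2 / 6.
N = 1_000_000
sum_alpha = sum(1.0 / t for t in range(1, N + 1))
sum_alpha_sq = sum(1.0 / t ** 2 for t in range(1, N + 1))
print(sum_alpha)                         # about log(N) + 0.5772, still growing
print(abs(sum_alpha_sq - math.pi ** 2 / 6))  # tiny: essentially converged
```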

For larger problems such tabular methods become infeasible, and this is where deep learning plays a role. The main idea of such deep RL methods is to approximate functions with neural networks and to train the neural networks in parallel to controlling the system. For example, with **deep Q-learning**, the Q-function estimate in (9.7) is represented via a neural network. Then with each application of the update (9.7), the neural network is further trained.

Page built: 2021-01-17