Reinforcement learning is formalized with Markov Decision Processes (MDPs).
A Markov Decision Process is defined by:
- a set of states $s \in \mathcal{S}$ and actions $a \in \mathcal{A}$
- an initial state $s_0$
- the reward distribution $R_t \sim \mathbb{P}(R | s_{t-1}, a_{t-1})$ (stochastic)
- the transition probabilities $s_t \sim \mathbb{P}(s | s_{t-1}, a_{t-1})$ (stochastic)
=> Markov property: the next state and the reward depend only on the previous state and action, not on the whole history.
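To make the definition concrete, here is a minimal sketch of a two-state MDP stored as plain Python dictionaries, with a `step` function that samples the stochastic transition and reward. All state names, actions, probabilities and rewards below are made up for illustration.

```python
import random

# A toy 2-state, 2-action MDP (all numbers are made up for illustration).
# P[(s, a)] maps each next state s' to its probability P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}
# R[(s, a)] is the mean reward received after taking action a in state s.
R = {
    ("s0", "stay"): 0.0, ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0, ("s1", "move"): 0.0,
}

def step(s, a, rng=random):
    """Sample (s_t, R_t) given the previous state and action."""
    next_states = list(P[(s, a)].keys())
    probs = list(P[(s, a)].values())
    s_next = rng.choices(next_states, weights=probs, k=1)[0]
    # Stochastic reward: mean R[(s, a)] plus some noise, for illustration.
    r = R[(s, a)] + rng.gauss(0.0, 0.1)
    return s_next, r

s0 = "s0"                  # initial state
print(step(s0, "move"))    # e.g. ('s1', 1.03)
```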
A policy assigns an action to each state, for a given MDP.
=> a policy can be seen as a strategy: it tells us which action to take in each state. In practice a deterministic policy is a mapping $\pi : s \mapsto a$ (a stochastic policy gives probabilities $\pi(a | s)$); combined with the transition probabilities of the MDP, it induces a plain Markov chain over the states (see the sketch below).
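A deterministic policy can therefore be stored as a simple state-to-action mapping. A minimal sketch, reusing the toy transition probabilities from the previous snippet (all names and numbers are illustrative), showing how fixing the policy yields a Markov chain over states:

```python
# Toy transition probabilities P(s' | s, a), same made-up numbers as before.
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}

# A deterministic policy: one action per state.
pi = {"s0": "move", "s1": "stay"}

# Fixing the policy turns the MDP into a Markov chain over states:
# P_pi(s' | s) = P(s' | s, pi(s)).
P_pi = {s: P[(s, pi[s])] for s in pi}
print(P_pi)   # {'s0': {'s0': 0.2, 's1': 0.8}, 's1': {'s1': 1.0}}
```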
$R_t$ is a random variable; its expectation given the previous state and action is $\mathbb{E}[R_t | s_{t-1}, a_{t-1}] = \sum_{s_t} \mathbb{P}(s_t | s_{t-1}, a_{t-1}) R(s_{t-1}, a_{t-1}, s_t)$, where $R(s_{t-1}, a_{t-1}, s_t)$ denotes the mean reward of that transition.
The gain (return) is calculated as $G = R_0 + \gamma R_1 + \gamma^2 R_2 + \dots = \sum_t \gamma^t R_t$, where $\gamma \in [0, 1)$ is the \textit{discount factor}, which accounts for time decay (it is better to receive rewards as early as possible). Since $G$ is a sum of random variables, it is itself a random variable.
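As a quick numerical illustration (with made-up reward sequences), the gain is just the discounted sum of the observed rewards, and the same rewards are worth less when they arrive later:

```python
def discounted_return(rewards, gamma=0.9):
    """G = sum_t gamma^t * R_t for one observed reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0]))   # 1 + 0.9 + 0.81 = 2.71
print(discounted_return([0.0, 0.0, 3.0]))   # 0.9^2 * 3 = 2.43
print(discounted_return([3.0, 0.0, 0.0]))   # 3.0 (earlier is better)
```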
The value function is the expected gain starting from a given state, for a specific policy: $\forall s_t, \; V_\pi(s_t) = \mathbb{E}_{\pi}[G | s_0 = s_t]$. Note: the conditioning $s_0 = s_t$ means the trajectory is assumed to start in $s_t$.
=> The expected gain takes the transition probabilities (and the reward randomness) into account.
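Because $G$ is a random variable, $V_\pi(s)$ can be estimated by sampling: roll out many trajectories from $s$ under the policy and the transition probabilities, and average their discounted returns. A rough Monte Carlo sketch on the same toy MDP and policy as above (episodes truncated at a fixed horizon; all numbers are illustrative):

```python
import random

# Toy MDP and deterministic policy (same made-up numbers as before).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
pi = {"s0": "move", "s1": "stay"}

def mc_value(s, gamma=0.9, episodes=5000, horizon=50):
    """Monte Carlo estimate of V_pi(s): average discounted return over rollouts."""
    total = 0.0
    for _ in range(episodes):
        state, g, discount = s, 0.0, 1.0
        for _ in range(horizon):
            a = pi[state]
            g += discount * R[(state, a)]       # accumulate gamma^t * R_t
            discount *= gamma
            nxt = P[(state, a)]
            state = random.choices(list(nxt), weights=list(nxt.values()), k=1)[0]
        total += g
    return total / episodes

print(mc_value("s0"))   # ~ 18.7 for these toy numbers (up to Monte Carlo noise)
```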
We can also write $G$ recursively: $G = G_0 = R_0 + \gamma R_1 + \gamma^2 R_2 + \dots = R_0 + \gamma G_1$.
Hence, $V$ can also be written:
\[V_\pi(s_t) = \mathbb{E}_\pi[R_0 + \gamma V_\pi(s_{t+1}) | s_0 = s_t]\]
This expression is called the Bellman equation.
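The Bellman equation turns the infinite discounted sum into a fixed-point relation, so $V_\pi$ can be computed by iterating it until convergence (iterative policy evaluation). A sketch on the toy MDP from above, assuming for simplicity that the mean reward depends only on $(s, a)$:

```python
# Toy MDP and policy (same made-up numbers as before).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
pi = {"s0": "move", "s1": "stay"}
gamma = 0.9

# Iterate the Bellman equation: V(s) <- E[R + gamma * V(s')] under pi.
V = {s: 0.0 for s in pi}
for _ in range(1000):
    V = {s: R[(s, pi[s])]
            + gamma * sum(p * V[s2] for s2, p in P[(s, pi[s])].items())
         for s in pi}

print(V)   # fixed point of the Bellman equation for this policy
```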
We can now look for the best policy, i.e. the actions that maximize the expected gain. The maximum expected gain at each state satisfies the Bellman optimality equation:
\[\forall s_t, \; V(s_t) = \max_a\mathbb{E}[R_0 + \gamma V(s_{t+1}) | s_0 = s_t, a_0 = a]\]
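Iterating this update is known as value iteration; once $V$ has converged, the best policy takes, in each state, the action achieving the max. A sketch, again on the toy MDP with made-up numbers:

```python
# Toy MDP (same made-up numbers as before).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.5, "s1": 0.5},
}
R = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9

def q(s, a, V):
    """Expected gain for taking a in s, then getting the value V of the next state."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

# Value iteration: V(s) <- max_a E[R + gamma * V(s')].
V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: max(q(s, a, V) for a in actions) for s in states}

# The best policy is greedy with respect to the converged V.
pi_star = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
print(V, pi_star)   # e.g. {'s1': 'stay', 's0': 'move'} for these numbers
```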