Online control
Recall that the value function is $\forall s,\ V_\pi(s) = \mathbb{E}_\pi[G | s_0 = s]$
Unknown model <=> "unknown strategy" <=> we don't know the transition probabilities or the rewards in advance (we need to learn online).
=> we cannot compute the expectation $\mathbb{E}$
=> instead of the value function, we will estimate the action-value function thanks to the Bellman equation:
\[\forall s, a,\ Q_\pi(s,a) = \mathbb{E}_\pi[r_0 + \gamma Q_\pi(s_1, a_1) | s_0 = s, a_0 = a]\]
=> for an initial state and a fixed action, we can compute the $Q$ function
=> the action-value function can be seen as the expected gain (expectation taken over the transition probabilities) that we get from a state when taking a specific action
Note: the policy $\pi$ only comes into play from the next state $s_1$ onward (the initial action $a_0$ is fixed)
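As an illustration, here is a minimal tabular sketch of estimating $Q_\pi$ from sampled transitions: since we cannot compute the expectation, each observed transition $(s, a, r, s_1, a_1)$ serves as a sample of the Bellman backup. All names, sizes, and hyper-parameters below are hypothetical, not from the notes.

```python
import numpy as np

# Hypothetical problem sizes and hyper-parameters (illustration only).
n_states, n_actions = 10, 4
gamma, alpha = 0.9, 0.1               # discount factor, learning rate
Q = np.zeros((n_states, n_actions))   # tabular estimate of Q_pi(s, a)

def td_update(s, a, r, s_next, a_next):
    """One sampled Bellman backup for a fixed policy pi:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s_next, a_next) - Q(s,a)),
    where a_next is the action chosen by pi at s_next."""
    target = r + gamma * Q[s_next, a_next]   # sample of r_0 + gamma * Q_pi(s_1, a_1)
    Q[s, a] += alpha * (target - Q[s, a])
```

Averaging many such sampled backups plays the role of the expectation $\mathbb{E}_\pi$ that we cannot compute directly.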
We want to control the policy and find the optimal one: $\pi^*(s) = a^* \in \arg\max_a Q_*(s,a)$
The Bellman optimality equation becomes:
\[\forall s, a,\ Q_*(s,a) = \mathbb{E}[r_0 + \gamma \max_{a'} Q_*(s_1, a') | s_0 = s, a_0 = a]\]
How do we choose the action $a_0$?
–> Pure exploitation: we pick the best action according to the current estimate: $\pi(s) \leftarrow \arg\max_a Q(s,a)$
=> problem: $Q$ is only an estimate, so we may miss good actions
–> Pure exploration: $\pi(s) \leftarrow$ a random action
=> problem: we can waste time on bad actions and thus get poor estimates
Solution: trade-off between exploration and exploitation (SARSA, Q-learning, …).
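A minimal sketch of such a trade-off, assuming a tabular setting with hypothetical sizes and hyper-parameters: $\varepsilon$-greedy action selection (explore with probability $\varepsilon$, exploit otherwise) combined with the Q-learning update, i.e. the sampled Bellman optimality backup. SARSA would instead use the action actually taken at $s_1$ in the target (on-policy) rather than the max (off-policy).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyper-parameters (illustration only).
n_states, n_actions = 10, 4
gamma, alpha, epsilon = 0.9, 0.1, 0.1
Q = np.zeros((n_states, n_actions))

def epsilon_greedy(s):
    """Explore with probability epsilon, otherwise exploit argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # pure exploration: random action
    return int(np.argmax(Q[s]))               # pure exploitation: greedy action

def q_learning_update(s, a, r, s_next):
    """Sampled optimality backup:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s_next, a') - Q(s,a))."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```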