Q-learning
This algorithm is also based on an $\epsilon$-greedy policy for action selection:
\[\pi(s) \leftarrow \begin{cases} a^* \in \operatorname{argmax}_a Q(s,a) & \text{with probability}~1-\epsilon \\ \text{random action} & \text{with probability}~\epsilon \end{cases}\]
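As a minimal sketch (assuming the Q-function is stored as a dense 2D NumPy array indexed by state and action; the names below are illustrative, not taken from the implementation discussed here), this policy can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """Greedy action with probability 1 - epsilon, uniform random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore: random action
    return int(np.argmax(Q[state]))           # exploit: best action under the current estimate
```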
Estimation: unlike SARSA, Q-learning updates the estimator at each iteration using the best action in the next state:
$\forall t, \quad Q(s_t, a_t) \xleftarrow{\alpha} r_t + \gamma \max_{a} Q(s_{t+1}, a)$
In the code, the only change compared to SARSA is the following line:
Q[state_prev, action_prev] = (1 - alpha) * Q[state_prev, action_prev] + alpha * (
    rewards[action_prev] + gamma * np.max(Q[state].data)
)
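For context, here is a minimal self-contained sketch of this update inside an episode loop, assuming a Gymnasium-style environment with discrete states and actions (`env.reset()`, `env.step()`) and a dense NumPy Q-table; the hyperparameter values are illustrative:

```python
import numpy as np

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if rng.random() < epsilon:
                action = int(rng.integers(env.action_space.n))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            # off-policy update: bootstrap on the best action in the next state
            Q[state, action] = (1 - alpha) * Q[state, action] + alpha * (
                reward + gamma * np.max(Q[next_state])
            )
            state = next_state
    return Q
```

With SARSA, the `np.max(Q[next_state])` term would be replaced by the value of the action actually selected by the $\epsilon$-greedy policy in the next state, which is what makes Q-learning off-policy and SARSA on-policy.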