SARSA
This algorithm relies on the $\epsilon$-greedy policy:
\[\pi(s) \leftarrow \begin{cases} a^* \in \operatorname{argmax}_a Q(s,a) & \text{with probability } 1-\epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}\]
With a small probability $\epsilon$ (usually below $5\%$), we take a random action; otherwise we take the best of the known actions.
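As an illustration, here is a minimal sketch of this selection rule, assuming (as in the implementation below) that Q is a SciPy sparse CSR matrix whose nonzero columns in row state are the available actions; the helper name eps_greedy_action is hypothetical and not used later:

import numpy as np

def eps_greedy_action(Q, state, eps=0.05):
    # available actions = columns with an entry in row `state`
    actions = Q[state].indices
    if np.random.random() < eps:
        return np.random.choice(actions)      # explore: random action
    return actions[np.argmax(Q[state].data)]  # exploit: best known action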
The action values are estimated online through TD-learning (temporal differences; see Online Estimation):
\[\forall t,\quad Q(s_t, a_t) \xleftarrow{\alpha} r_t + \gamma\, Q(s_{t+1}, a_{t+1})\]
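Here $\xleftarrow{\alpha}$ denotes the online (exponential smoothing) update $Q \leftarrow (1-\alpha)\,Q + \alpha \times \text{target}$, as implemented below. As a worked example with illustrative values $\alpha = 0.1$, $\gamma = 0.9$, $r_t = 0$, $Q(s_t, a_t) = 1$ and $Q(s_{t+1}, a_{t+1}) = 2$:
\[Q(s_t, a_t) \leftarrow 0.9 \times 1 + 0.1 \times (0 + 0.9 \times 2) = 1.08.\]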
import numpy as np

# Actions are identified with the new state reached after the move.
# Q is the action-value matrix, stored as a sparse (CSR) matrix with the same
# sparsity pattern as the transition matrix: rows = states, columns = actions.
def sarsa(Q, model=model, alpha=0.1, eps=0.1, n_iter=100):
    states = model.states
    terminals = model.terminals
    rewards = model.rewards
    gamma = model.gamma
    # start from a random non-terminal state
    state = np.random.choice(np.setdiff1d(np.arange(len(states)), terminals))
    # random action among the available ones (nonzero entries of row `state`)
    action = np.random.choice(Q[state].indices)
    new_state = action
    for t in range(n_iter):
        state_prev = state
        action_prev = action
        state = new_state
        if state in terminals:
            # the last action led to a terminal state: restart from a new random non-terminal state
            state = np.random.choice(np.setdiff1d(np.arange(len(states)), terminals))
            action = np.random.choice(Q[state].indices)
            # no discounted term here: state_prev -> action_prev was the last step of the episode
            Q[state_prev, action_prev] = (1 - alpha) * Q[state_prev, action_prev] + alpha * rewards[action_prev]
        else:
            # best known action = available action with the highest action value
            best_action = Q[state].indices[np.argmax(Q[state].data)]
            if np.random.random() < eps:
                # with probability epsilon, explore a random action
                action = np.random.choice(Q[state].indices)
            else:
                action = best_action
            Q[state_prev, action_prev] = (1 - alpha) * Q[state_prev, action_prev] \
                + alpha * (rewards[action_prev] + gamma * Q[state, action])
        new_state = action
    return Q
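A usage sketch on a toy problem (the ChainModel container is hypothetical; it simply exposes the attributes that sarsa accesses): a chain of four states where the actions available from a state are its neighbouring states, state 3 is terminal with reward 1, and Q starts with a tiny value on every feasible (state, action) pair so that the sparsity pattern encodes the allowed moves.

import numpy as np
from scipy.sparse import lil_matrix

class ChainModel:
    # hypothetical container exposing the attributes used by sarsa()
    def __init__(self, states, terminals, rewards, gamma):
        self.states, self.terminals = states, terminals
        self.rewards, self.gamma = rewards, gamma

n = 4
moves = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3)]  # allowed transitions (action = next state)
Q0 = lil_matrix((n, n))
for s, a in moves:
    Q0[s, a] = 1e-6                               # tiny value: marks the action as available
model = ChainModel(states=np.arange(n), terminals=[3],
                   rewards=np.array([0., 0., 0., 1.]), gamma=0.9)

Q = sarsa(Q0.tocsr(), model=model, alpha=0.1, eps=0.1, n_iter=1000)
print(np.round(Q.toarray(), 2))                   # learned action values

With enough iterations, the values along the chain should increase towards the terminal state, reflecting the discounted reward.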