📝 Temporal Difference (TD) Control Algorithms Comparison: SARSA, Expected SARSA, and Q-learning #
🎯 Introduction: The TD Target is the Key Differentiator #
These three algorithms are all one-step Temporal Difference (TD) control algorithms used to estimate the optimal action-value function, $q^*$. Their core difference lies in how they compute the TD Target: the immediate reward plus the discounted estimate of the value of the next state-action pair(s).
I. Core Differences Overview #
| Feature | SARSA (On-Policy) | Expected SARSA (On-Policy) | Q-learning (Off-Policy) |
|---|---|---|---|
| Policy Type | On-Policy | On-Policy | Off-Policy |
| TD Target $Y_t$ | Sampled Value | Expected Value | Maximum Value |
| TD Target Formula | $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$ | $R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\hat{q}(S_{t+1}, a, \mathbf{w})$ | $R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a, \mathbf{w})$ |
| Variance | High (depends on the sampled $A'$) | Low (averages over $A'$, eliminating sampling randomness) | Low w.r.t. $A'$ (the max is deterministic), though the max operator introduces maximization bias |
| Learning Goal | The $q_{\pi}$ of the current exploratory policy $\pi$ | The $q_{\pi}$ of the current exploratory policy $\pi$ | The optimal $q^*$ (independent of exploration) |
II. Detailed Differentiation and Update Mechanism #
1. SARSA (State-Action-Reward-State-Action) #
- TD Target: $Y_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$
- Mechanism: The update for $Q(S, A)$ uses the $Q$ value of the actual next action $A_{t+1}$ that was selected and executed by the behavior policy ($\epsilon$-greedy).
- Nature: Since the learning process (update) and the acting process (data generation) use the same policy $\pi$, it is On-Policy. SARSA learns the value function of the policy currently being followed.
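The SARSA update can be sketched as a one-line tabular rule. This is a minimal illustration on a toy $Q$ table; the state/action indices and numbers are made up for demonstration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step SARSA: the TD target uses the Q-value of the
    actually sampled next action a_next (on-policy)."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 3 states, 2 actions, all values zero except one entry.
Q = np.zeros((3, 2))
Q[1, 0] = 4.0                      # value of the sampled next action A'
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
# Q[0, 1] moves toward the target 1 + 0.99 * 4 = 4.96 by one alpha-step.
```

Note that a different sampled `a_next` would give a different target, which is exactly the source of SARSA's higher variance.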
2. Expected SARSA #
- TD Target: $Y_t = R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\hat{q}(S_{t+1}, a, \mathbf{w})$
- Mechanism: The update for $Q(S, A)$ does not use a sampled $A_{t+1}$; instead, it computes the expected value over all possible actions in $S_{t+1}$, weighting each $Q$ value by its probability under the current policy $\pi$.
- Advantage: By taking the expectation, the randomness and variance from the next action $A'$ are eliminated, making learning more stable than standard SARSA. It remains On-Policy.
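The expectation over next actions can be computed directly from the policy's action probabilities. Below is a minimal sketch assuming an $\epsilon$-greedy policy; the helper name and toy numbers are illustrative.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon=0.1):
    """Action probabilities of an epsilon-greedy policy over one state's Q row."""
    n = len(q_row)
    probs = np.full(n, epsilon / n)   # exploration mass spread uniformly
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Expected SARSA: the TD target averages Q(S', a) over all actions,
    weighted by pi(a | S'), instead of sampling A'."""
    probs = epsilon_greedy_probs(Q[s_next], epsilon)
    td_target = r + gamma * np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: with Q[1] = [2, 0] and epsilon = 0.1, the expectation is
# 0.95 * 2 + 0.05 * 0 = 1.9, so the target is 1 + 0.99 * 1.9 = 2.881.
Q = np.zeros((2, 2))
Q[1] = [2.0, 0.0]
expected_sarsa_update(Q, s=0, a=0, r=1.0, s_next=1)
```

Because the target is deterministic given $Q$ and $\pi$, repeated updates from the same transition always move in the same direction.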
3. Q-learning #
- TD Target: $Y_t = R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a, \mathbf{w})$
- Mechanism: The update for $Q(S, A)$ ignores the action chosen by the exploratory behavior policy $\mu$ ($\epsilon$-greedy). Instead, it uses the maximum $Q$ value attainable in $S_{t+1}$, i.e., the value under the greedy policy, as the update target.
- Nature: Since the target policy ($\pi^*$: greedy) differs from the behavior policy ($\mu$: $\epsilon$-greedy), it is Off-Policy. Q-learning directly learns the optimal value function $q^*$.
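The off-policy character shows up directly in code: the update below never looks at which next action the behavior policy took. A minimal tabular sketch with illustrative numbers:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-learning: the TD target takes the max over next actions,
    regardless of the action the (epsilon-greedy) behavior policy
    actually executes in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: with Q[1] = [2, 5], the target is 1 + 0.99 * 5 = 5.95,
# even if the behavior policy explored and actually took action 0 next.
Q = np.zeros((2, 2))
Q[1] = [2.0, 5.0]
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
```

Contrast this with the SARSA update, which would have used `Q[s_next, a_next]` for whichever `a_next` was sampled.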