Temporal Difference (TD) Control Algorithms Comparison: SARSA, Expected SARSA, and Q-learning

Comparative analysis of major one-step Temporal Difference (TD) control algorithms: SARSA, Expected SARSA, and Q-learning, focusing on their policy nature and target construction.


📝 Temporal Difference (TD) Control Algorithms Comparison: SARSA, Expected SARSA, and Q-learning #

🎯 Introduction: The TD Target is the Key Differentiator #

These three algorithms are all one-step Temporal Difference (TD) control algorithms used to estimate the optimal action-value function, $q^*$. Their core difference lies in how they construct the TD target $Y_t$: the reward $R_{t+1}$ plus the discounted bootstrap estimate of the value at the next state.

I. Core Differences Overview #

| Feature | SARSA (On-Policy) | Expected SARSA (On-Policy) | Q-learning (Off-Policy) |
| --- | --- | --- | --- |
| Policy Type | On-Policy | On-Policy | Off-Policy |
| TD Target $Y_t$ | Sampled Value | Expected Value | Maximum Value |
| TD Target Formula | $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$ | $R_{t+1} + \gamma \sum_{a} \pi(a \| S_{t+1})\hat{q}(S_{t+1}, a, \mathbf{w})$ | $R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a, \mathbf{w})$ |
| Variance | High (depends on the sampled $A_{t+1}$) | Low (averages over $A_{t+1}$, eliminating sampling randomness) | Medium-low (takes only the maximum) |
| Learning Goal | $q_{\pi}$ of the current exploratory policy $\pi$ | $q_{\pi}$ of the current exploratory policy $\pi$ | The optimal $q^*$ (independent of exploration) |
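
The table translates almost directly into code. Below is a minimal tabular sketch (assuming a NumPy array `Q` of shape $|S| \times |A|$ and an $\epsilon$-greedy policy; the helper names `epsilon_greedy_probs` and `td_targets` are illustrative, not from any particular library) that computes all three TD targets for a single transition:

```python
import numpy as np

def epsilon_greedy_probs(q_next, epsilon):
    """Action probabilities pi(a | S_{t+1}) of an epsilon-greedy policy."""
    n_actions = len(q_next)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_next)] += 1.0 - epsilon
    return probs

def td_targets(reward, q_next, a_next, probs, gamma=0.99):
    """SARSA, Expected SARSA, and Q-learning targets for one transition."""
    sarsa_target    = reward + gamma * q_next[a_next]         # sampled A_{t+1}
    expected_target = reward + gamma * np.dot(probs, q_next)  # expectation over A_{t+1}
    qlearn_target   = reward + gamma * np.max(q_next)         # greedy max over A_{t+1}
    return sarsa_target, expected_target, qlearn_target
```

Here `q_next` would be the row `Q[next_state]`, and `probs = epsilon_greedy_probs(q_next, epsilon)`; the three targets differ only in how they bootstrap on that row.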

II. Detailed Differentiation and Update Mechanism #

1. SARSA (State-Action-Reward-State-Action) #

  • TD Target: $Y_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$
  • Mechanism: The update for $Q(S, A)$ uses the $Q$ value of the actual next action $A_{t+1}$ that was selected and executed by the behavior policy ($\epsilon$-greedy); see the sketch below.
  • Nature: Since the learning process (update) and the acting process (data generation) use the same policy $\pi$, it is On-Policy. SARSA learns the value function of the policy currently being followed.
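
A minimal on-policy SARSA episode loop, sketched under the same illustrative assumptions as above (a Gym-style `env`, the `epsilon_greedy_probs` helper, and hypothetical hyperparameters `alpha`, `gamma`, `epsilon`):

```python
def sarsa_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    state, _ = env.reset()
    action = np.random.choice(len(Q[state]), p=epsilon_greedy_probs(Q[state], epsilon))
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Select A_{t+1} with the SAME epsilon-greedy policy before updating.
        next_action = np.random.choice(len(Q[next_state]),
                                       p=epsilon_greedy_probs(Q[next_state], epsilon))
        target = reward + gamma * (0.0 if done else Q[next_state][next_action])
        Q[state][action] += alpha * (target - Q[state][action])
        state, action = next_state, next_action
    return Q
```

The defining detail is that `next_action` is both the action used in the target and the action actually executed in the next step, which is exactly what makes SARSA on-policy.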

2. Expected SARSA #

  • TD Target: $Y_t = R_{t+1} + \gamma \sum_{a} \pi(a|S_{t+1})\hat{q}(S_{t+1}, a, \mathbf{w})$
  • Mechanism: The update for $Q(S, A)$ does not use a sampled $A_{t+1}$; instead it computes the expected bootstrap value by summing the $Q$ values of all possible actions in $S_{t+1}$, weighted by their probabilities under the current policy $\pi$ (see the snippet below).
  • Advantage: Taking the expectation removes the randomness and variance introduced by the sampled next action $A_{t+1}$, making learning more stable than standard SARSA. It remains On-Policy.
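
Only the bootstrap term changes relative to the SARSA loop above. A hedged snippet (same illustrative names) showing the Expected SARSA target inside that loop:

```python
# Expected SARSA: replace the sampled bootstrap Q[next_state][next_action]
# with the expectation under the current epsilon-greedy policy.
probs_next = epsilon_greedy_probs(Q[next_state], epsilon)
target = reward + gamma * (0.0 if done else np.dot(probs_next, Q[next_state]))
Q[state][action] += alpha * (target - Q[state][action])
```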

3. Q-learning #

  • TD Target: $Y_t = R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a, \mathbf{w})$
  • Mechanism: The update for $Q(S, A)$ ignores the action chosen by the exploratory behavior policy $\mu$ ($\epsilon$-greedy). Instead, it uses the maximum $Q$ value in $S_{t+1}$, i.e. the value of acting greedily, as the update target (see the sketch below).
  • Nature: Since the target policy (greedy with respect to $\hat{q}$) differs from the behavior policy $\mu$ ($\epsilon$-greedy), it is Off-Policy. Q-learning directly learns the optimal value function $q^*$, independent of how exploration is carried out.
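
A minimal off-policy Q-learning episode loop under the same illustrative assumptions: the behavior policy stays $\epsilon$-greedy, but the target bootstraps on the greedy maximum.

```python
def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    state, _ = env.reset()
    done = False
    while not done:
        # Behavior policy mu: epsilon-greedy exploration.
        action = np.random.choice(len(Q[state]), p=epsilon_greedy_probs(Q[state], epsilon))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Target policy: greedy max over actions in the next state.
        target = reward + gamma * (0.0 if done else np.max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state
    return Q
```

Unlike SARSA, no next action needs to be sampled before the update; the greedy `max` in the target is what decouples the learned values from the exploratory behavior.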