📝 Temporal Difference (TD) Control Algorithms Comparison: SARSA, Expected SARSA, and Q-learning #
🎯 Introduction: The TD Target is the Key Differentiator #
These three algorithms are all one-step Temporal Difference (TD) control algorithms used to estimate the optimal action-value function, $q^*$. Their core difference lies in how they compute the TD Target: the immediate reward plus the discounted estimate of the value of the next state-action pair(s).
I. Core Differences Overview #
| Feature | SARSA (On-Policy) | Expected SARSA (On-Policy) | Q-learning (Off-Policy) |
|---|---|---|---|
| Policy Type | On-Policy | On-Policy | Off-Policy |
| TD Target $Y_t$ | Sampled Value | Expected Value | Maximum Value |
| TD Target Formula | $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$ | $R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\hat{q}(S_{t+1}, a, \mathbf{w})$ | $R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a, \mathbf{w})$ |
| Variance | High (depends on the sampled $A'$) | Low (averages over $A'$, eliminating sampling randomness) | Low w.r.t. $A'$ (the max is deterministic), though the max operator introduces maximization bias |
| Learning Goal | The $q_{\pi}$ of the current exploratory policy $\pi$ | The $q_{\pi}$ of the current exploratory policy $\pi$ | The optimal $q^*$ (independent of exploration) |
II. Detailed Differentiation and Update Mechanism #
1. SARSA (State-Action-Reward-State-Action) #
- TD Target: $Y_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$
- Mechanism: The update for $Q(S, A)$ uses the $Q$ value of the actual next action $A_{t+1}$ that was selected and executed by the behavior policy ($\epsilon$-greedy).
- Nature: Since the learning process (update) and the acting process (data generation) use the same policy $\pi$, it is On-Policy. SARSA learns the value function of the policy currently being followed.
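The SARSA update can be sketched as a one-line tabular rule. This is a minimal illustration on a toy $Q$ table; the state/action indices and numbers are made up for demonstration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One-step SARSA: the TD target uses the Q-value of the
    actually sampled next action a_next (on-policy)."""
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 3 states, 2 actions, all values zero except one entry.
Q = np.zeros((3, 2))
Q[1, 0] = 4.0                      # value of the sampled next action A'
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
# Q[0, 1] moves toward the target 1 + 0.99 * 4 = 4.96 by one alpha-step.
```

Note that a different sampled `a_next` would give a different target, which is exactly the source of SARSA's higher variance.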
2. Expected SARSA #
- TD Target: $Y_t = R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1})\hat{q}(S_{t+1}, a, \mathbf{w})$
- Mechanism: The update for $Q(S, A)$ does not use a sampled $A_{t+1}$; instead, it computes the expected value over all possible actions in $S_{t+1}$, weighting each $Q$ value by its probability under the current policy $\pi$.
- Advantage: By taking the expectation, the randomness and variance from the next action $A'$ are eliminated, making learning more stable than standard SARSA. It remains On-Policy.
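The expectation over next actions can be computed directly from the policy's action probabilities. Below is a minimal sketch assuming an $\epsilon$-greedy policy; the helper name and toy numbers are illustrative.

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon=0.1):
    """Action probabilities of an epsilon-greedy policy over one state's Q row."""
    n = len(q_row)
    probs = np.full(n, epsilon / n)   # exploration mass spread uniformly
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def expected_sarsa_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Expected SARSA: the TD target averages Q(S', a) over all actions,
    weighted by pi(a | S'), instead of sampling A'."""
    probs = epsilon_greedy_probs(Q[s_next], epsilon)
    td_target = r + gamma * np.dot(probs, Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: with Q[1] = [2, 0] and epsilon = 0.1, the expectation is
# 0.95 * 2 + 0.05 * 0 = 1.9, so the target is 1 + 0.99 * 1.9 = 2.881.
Q = np.zeros((2, 2))
Q[1] = [2.0, 0.0]
expected_sarsa_update(Q, s=0, a=0, r=1.0, s_next=1)
```

Because the target is deterministic given $Q$ and $\pi$, repeated updates from the same transition always move in the same direction.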
3. Q-learning #
- TD Target: $Y_t = R_{t+1} + \gamma \max_{a} \hat{q}(S_{t+1}, a, \mathbf{w})$
- Mechanism: The update for $Q(S, A)$ ignores the action chosen by the exploratory behavior policy $\mu$ ($\epsilon$-greedy). Instead, it uses the maximum $Q$ value attainable in $S_{t+1}$, i.e., the value under the greedy policy, as the update target.
- Nature: Since the target policy ($\pi^*$: greedy) differs from the behavior policy ($\mu$: $\epsilon$-greedy), it is Off-Policy. Q-learning directly learns the optimal value function $q^*$.
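The off-policy character shows up directly in code: the update below never looks at which next action the behavior policy took. A minimal tabular sketch with illustrative numbers:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Q-learning: the TD target takes the max over next actions,
    regardless of the action the (epsilon-greedy) behavior policy
    actually executes in s_next."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: with Q[1] = [2, 5], the target is 1 + 0.99 * 5 = 5.95,
# even if the behavior policy explored and actually took action 0 next.
Q = np.zeros((2, 2))
Q[1] = [2.0, 5.0]
q_learning_update(Q, s=0, a=0, r=1.0, s_next=1)
```

Contrast this with the SARSA update, which would have used `Q[s_next, a_next]` for whichever `a_next` was sampled.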