Derivation for Action-Value Function in Off-Policy Learning

Detailed derivation of the action-value function $Q(s, a)$ in off-policy learning using importance sampling, and an explanation of the backward loop implementation in Monte Carlo prediction.


Derivation for Action-Value Function in Off-Policy Learning #

Derivation for Action-Value Function $Q(s, a)$ #

The equation for off-policy estimation of $V_{\pi}(s)$ using importance sampling is:

$$V_{\pi}(s) = \mathbb{E}_{b} [\rho _{t:T-1} G_t | S_t = s]$$

where the importance sampling ratio $\rho_{t:T-1}$ is given by:

$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k | S_k)}{b(A_k | S_k)}$$
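The ratio is just the product, over the remaining steps of the episode, of how likely each taken action is under $\pi$ versus under $b$. A minimal sketch (the function name and the example probabilities are illustrative, not from the text):

```python
import numpy as np

def importance_ratio(pi_probs, b_probs):
    """Compute rho_{t:T-1} = prod_k pi(A_k|S_k) / b(A_k|S_k).

    pi_probs, b_probs: probabilities that the target and behavior
    policies assign to the actions actually taken at steps t, ..., T-1.
    """
    return np.prod(np.asarray(pi_probs) / np.asarray(b_probs))

# Example: pi is deterministic (prob 1) on the taken actions,
# while b explored them each with prob 0.5:
rho = importance_ratio([1.0, 1.0], [0.5, 0.5])  # -> 4.0
```

Note that if $\pi$ assigns probability zero to any taken action, the whole ratio collapses to zero, which is what lets the algorithm below stop early.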

The action-value function under target policy $\pi$ is defined as:

$$Q_{\pi}(s, a) = \mathbb{E}_{b} [\rho _{t+1: T-1}G_t | S_t = s, A_t = a]$$

The weighted importance sampling estimate of the action value is therefore:

$$Q(s,a) \doteq \frac{\sum_{t\in\mathcal{T}(s,a)} \rho_{t+1: T-1}G_t}{\sum_{t\in\mathcal{T}(s,a)} \rho_{t+1: T-1}}$$

where $\mathcal{T}(s,a)$ denotes the set of all time steps at which state $s$ and action $a$ are both visited.

Note the difference in the starting time of $\rho$ between the value function and the action-value function: since the first action $A_t = a$ is given rather than sampled from $b$, it needs no importance-sampling correction, so the ratio for $Q$ starts at $t+1$ instead of $t$.
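The weighted estimate above is a ratio-weighted average of the returns observed at the visits to $(s, a)$. A minimal sketch of that estimator (function name is illustrative):

```python
import numpy as np

def weighted_is_q(returns, ratios):
    """Weighted importance sampling estimate of Q(s, a).

    returns: the return G_t observed for each t in T(s, a)
    ratios:  the corresponding ratio rho_{t+1:T-1} for each t
    """
    returns = np.asarray(returns, dtype=float)
    ratios = np.asarray(ratios, dtype=float)
    denom = ratios.sum()
    # Convention: return 0 when no visit carries any weight yet.
    return (ratios * returns).sum() / denom if denom > 0 else 0.0

# Two visits to (s, a) with returns 10 and 2, weighted 3:1:
# estimate = (3*10 + 1*2) / (3 + 1) = 8.0
q = weighted_is_q([10.0, 2.0], [3.0, 1.0])
```

Because the weights are normalized by their own sum, the estimate is biased but has much lower variance than ordinary importance sampling, which divides by the number of visits instead.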


Algorithmic Implementation and the Design of $W$ #

The previous section established that the weighted importance sampling estimate for the action-value function $Q(s,a)$ requires the importance sampling ratio $\rho_{t+1:T-1}$. The off-policy MC prediction algorithm, as presented in Sutton and Barto’s “Reinforcement Learning: An Introduction” (Chapter 5, page 110), implements this calculation elegantly using a backward loop through the episode’s steps.
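To see why the backward loop produces exactly the ratio $\rho_{t+1:T-1}$ that the $Q$ estimate requires, note the invariant the weight $W$ maintains at the top of iteration $t$, just before the $Q$ update:

$$W = \prod_{k=t+1}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)} = \rho_{t+1:T-1}$$

It holds at $t = T-1$, where $W = 1$ is the empty product, and the update $W \leftarrow W \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$ at the end of iteration $t$ extends the product to $\rho_{t:T-1}$, which is precisely $\rho_{(t-1)+1:T-1}$, the ratio needed at the next iteration $t-1$.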


\begin{algorithm}
\caption{Off-policy MC prediction for estimating $Q \approx q_{\pi}$}
\label{alg:off_policy_mc}
\SetAlgoLined
\KwIn{an arbitrary target policy $\pi$}
Initialize, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$: \;
\quad $Q(s, a) \in \mathbb{R}$ (arbitrarily) \;
\quad $C(s, a) \leftarrow 0$ \;
\BlankLine
\While{forever (for each episode)}{
    $b \leftarrow$ any policy with coverage of $\pi$ \;
    Generate an episode following $b$: $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$ \;
    $G \leftarrow 0$ \;
    $W \leftarrow 1$ \;
    \For{$t = T-1, T-2, \dots, 0$}{
        \If{$W = 0$}{\textbf{break} \;}
        $G \leftarrow \gamma G + R_{t+1}$ \;
        $C(S_t, A_t) \leftarrow C(S_t, A_t) + W$ \;
        $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)} \left[ G - Q(S_t, A_t) \right]$ \;
        $W \leftarrow W \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$ \;
    }
}
\end{algorithm}
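The pseudocode above translates almost line for line into Python. A sketch under some assumptions: episodes are given as lists of `(state, action, reward)` tuples where the reward is $R_{t+1}$, and `pi`/`b` are callables returning each policy's probability for an action in a state (these signatures are hypothetical, not from the book):

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, b, gamma=1.0):
    """Off-policy MC prediction of Q ~= q_pi via weighted importance sampling.

    episodes: list of episodes, each a list of (state, action, reward)
              tuples, reward being R_{t+1} received after the action.
    pi(a, s), b(a, s): probability the target / behavior policy
                       assigns to action a in state s.
    """
    Q = defaultdict(float)
    C = defaultdict(float)  # cumulative sum of the weights W per (s, a)
    for episode in episodes:
        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):  # t = T-1, ..., 0
            G = gamma * G + reward
            C[(state, action)] += W
            Q[(state, action)] += (W / C[(state, action)]) * (G - Q[(state, action)])
            W *= pi(action, state) / b(action, state)
            if W == 0.0:
                break  # every earlier weight would also be zero
    return Q
```

At the $Q$ update, `W` holds $\rho_{t+1:T-1}$; the ratio for step $t$ itself is multiplied in only afterwards, matching the start-time difference discussed earlier.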