Derivation for Action-Value Function in Off-Policy Learning #
Derivation for Action-Value Function $Q(s, a)$ #
The equation for off-policy estimation of $V_{\pi}(s)$ using importance sampling is:
$$V_{\pi}(s) = \mathbb{E}_{b} [\rho _{t:T-1} G_t | S_t = s]$$
where the importance sampling ratio $\rho_{t:T-1}$ is given by:
$$\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k | S_k)}{b(A_k | S_k)}$$
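As a quick numerical check, the product of per-step probability ratios can be computed directly. The probabilities below are made-up illustrative values, not from any particular MDP:

```python
# Hypothetical per-step probabilities of the actions actually taken,
# for k = t, ..., T-1, under the target policy pi and behavior policy b.
pi_probs = [0.9, 0.8, 1.0]  # pi(A_k | S_k)
b_probs = [0.5, 0.5, 0.5]   # b(A_k | S_k)

# rho_{t:T-1} is the product of the per-step ratios pi / b.
rho = 1.0
for p, q in zip(pi_probs, b_probs):
    rho *= p / q
print(rho)  # 1.8 * 1.6 * 2.0 = 5.76
```

A ratio above 1 means this trajectory is more likely under $\pi$ than under $b$, so its return is up-weighted in the off-policy estimate.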
The action-value function under target policy $\pi$ is defined as:
$$Q_{\pi}(s, a) = \mathbb{E}_{b} [\rho _{t+1: T-1}G_t | S_t = s, A_t = a]$$
The weighted importance sampling estimate of the action value is then:
$$Q(s,a) \doteq \frac{\sum_{t\in\mathcal{T}(s,a)} \rho_{t+1: T-1}G_t}{\sum_{t\in\mathcal{T}(s,a)} \rho_{t+1: T-1}}$$
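A minimal sketch of this estimator, assuming hypothetical returns $G_t$ and ratios $\rho_{t+1:T-1}$ have already been collected at the time steps in $\mathcal{T}(s,a)$:

```python
# Hypothetical samples gathered at time steps where (s, a) occurred.
returns = [10.0, 4.0, 8.0]  # G_t for each t in T(s, a)
rhos = [2.0, 0.5, 1.0]      # rho_{t+1:T-1} for each t in T(s, a)

# Weighted importance sampling: ratio-weighted average of the returns.
q_estimate = sum(r * g for r, g in zip(rhos, returns)) / sum(rhos)
print(q_estimate)  # (20 + 2 + 8) / 3.5
```

Because the weights are normalized by their own sum, this estimator is biased but typically has much lower variance than ordinary importance sampling, which divides by $|\mathcal{T}(s,a)|$ instead.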
where $\mathcal{T}(s,a)$ denotes the set of all time steps at which state $s$ is visited and action $a$ is taken.
Note the difference in the starting time of $\rho$ between the value-function and action-value equations: because $A_t = a$ is given in the conditioning, the first factor $\pi(A_t|S_t)/b(A_t|S_t)$ is not needed, so the product starts at $t+1$.
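The two ratios are related by a single per-step factor, which makes the shift in starting time explicit:

$$\rho_{t:T-1} = \frac{\pi(A_t | S_t)}{b(A_t | S_t)} \, \rho_{t+1:T-1}$$

Conditioning on $A_t = a$ fixes the first action regardless of which policy is followed, so only the remaining factors are needed to correct the distribution of the return.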
Algorithmic Implementation and the Design of $W$ #
The previous section established that the weighted importance sampling estimate for the action-value function $Q(s,a)$ requires the importance sampling ratio $\rho_{t+1:T-1}$. The off-policy MC prediction algorithm, as presented in Sutton and Barto’s “Reinforcement Learning: An Introduction” (Chapter 5, page 110), implements this calculation elegantly using a backward loop through the episode’s steps.
\caption{Off-policy MC prediction for estimating $Q \approx q_{\pi}$}
\label{alg:off_policy_mc}
\SetAlgoLined
\KwIn{an arbitrary target policy $\pi$}
Initialize, for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$:
\quad $Q(s, a) \in \mathbb{R}$ (arbitrarily)
\quad $C(s, a) \leftarrow 0$
\BlankLine
\While{true \emph{(for each episode)}}{
$b \leftarrow$ any policy with coverage of $\pi$
Generate an episode following $b$: $S_0, A_0, R_1, \dots, S_{T-1}, A_{T-1}, R_T$
$G \leftarrow 0$
$W \leftarrow 1$
\For{$t = T-1, T-2, \dots, 0$}{
\If{$W = 0$}{Break}
$G \leftarrow \gamma G + R_{t+1}$
$C(S_t, A_t) \leftarrow C(S_t, A_t) + W$
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)} [G - Q(S_t, A_t)]$
$W \leftarrow W \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$
}
}
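The pseudocode above can be sketched in Python. The episode format, the policy call signatures, and the toy episodes in the usage example are assumptions for illustration, not taken from the book:

```python
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, b, gamma=1.0):
    """Incremental weighted importance sampling estimate of Q ~ q_pi.

    episodes -- list of episodes, each a list of (S_t, A_t, R_{t+1}) triples
    pi(a, s) -- probability of action a in state s under the target policy
    b(a, s)  -- probability of action a in state s under the behavior policy
    """
    Q = defaultdict(float)  # Q(s, a), arbitrary initialization (zero here)
    C = defaultdict(float)  # C(s, a), cumulative sum of the weights W
    for episode in episodes:
        G, W = 0.0, 1.0
        for s, a, r in reversed(episode):  # t = T-1, T-2, ..., 0
            G = gamma * G + r
            C[(s, a)] += W
            # Incremental update toward the weighted-average return.
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            # W is updated AFTER the Q update, so Q(S_t, A_t) is weighted
            # by rho_{t+1:T-1}, matching the derivation above.
            W *= pi(a, s) / b(a, s)
            if W == 0.0:  # all earlier weights would also be zero
                break
    return Q

# Toy usage: deterministic target policy, uniform behavior policy,
# three single-step episodes (made-up data).
pi = lambda a, s: 1.0 if a == 0 else 0.0
b = lambda a, s: 0.5
episodes = [[("s", 0, 1.0)], [("s", 1, 5.0)], [("s", 0, 2.0)]]
Q = off_policy_mc_prediction(episodes, pi, b)
```

For single-step episodes $\rho_{t+1:T-1}$ is an empty product (equal to 1), so `Q[("s", 0)]` is simply the average of the two observed returns, 1.5, and `Q[("s", 1)]` is the single observed return, 5.0.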