Formulation #
At each time step $t$, the agent’s environmental state is $S_t \in \mathcal{S}$, and it selects an action $A_t \in \mathcal{A}(s)$.
After the action, the agent’s state is transformed into the afterstate $H_t \in \mathcal{S}^+$. This transition is deterministic: $h = f(s, a)$.
The reward received is split into two parts: an action reward $R^a_{t+1}$ and an environmental reward $R^e_{t+1}$.
In problems that admit an afterstate formulation, the action reward is deterministic, while the environmental process can be stochastic.
The probability of the next state $s' \in \mathcal{S}^+$ and environmental reward $r^e \in \mathcal{R} \subset \mathbb{R}$ occurring, given the current afterstate $h$, is:
$$ p(s', r^e \mid h) \doteq \Pr\{ S_{t+1} = s', R^e_{t+1} = r^e \mid H_t = h \} $$
The state-transition probabilities can be written as:
$$ p(s' \mid h) \doteq \Pr\{ S_{t+1} = s' \mid H_t = h \} = \sum_{r^e \in \mathcal{R}} p(s', r^e \mid h) $$
The expected environmental reward as a function of the afterstate, $\rho_e: \mathcal{S}^+ \rightarrow \mathbb{R}$, is:
$$ \rho_e(h) \doteq \mathbb{E}[R^e_{t+1} \mid H_t = h] = \sum_{r^e \in \mathcal{R}} r^e \sum_{s' \in \mathcal{S}^+} p(s', r^e \mid h) $$
The value function of an afterstate $h$ and the action-value function under policy $\pi$ are:
$$ \begin{align} v_\pi(h) &\doteq \mathbb{E}[R^e_{t+1} + \gamma G_{t+1} \mid H_t = h] \\ &= \sum_{s'}\sum_{r^e} p(s', r^e \mid h)\Big[r^e + \gamma \sum_{a'}\pi(a' \mid s')\big(v_\pi(f(s', a')) + r'_a\big)\Big] \\ q_\pi(s, a) &= v_\pi(f(s, a)) + r_a = v_\pi(h) + r_a \end{align} $$ where $r_a = \rho_a(s, a)$ and $r'_a = \rho_a(s', a')$ are the deterministic action rewards.
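As a concrete illustration, the pieces above can be written down for a tiny made-up chain task. The functions `f`, `rho_a`, `p`, and `rho_e` mirror the symbols in the text; the dynamics themselves (cells 0–3, a goal cell at 3, a 0.2 chance of being pushed back one cell) are invented for the example:

```python
# A toy afterstate model mirroring the symbols above. The dynamics
# (states 0..3, a goal cell at 3, a 0.2 chance of being pushed back)
# are invented for illustration.

def f(s, a):
    """Deterministic afterstate h = f(s, a): move right by a, capped at 3."""
    return min(s + a, 3)

def rho_a(s, a):
    """Deterministic action reward: a small cost per unit moved."""
    return -0.1 * a

def p(h):
    """Environmental dynamics p(s', r^e | h) as {(s', r_e): probability}."""
    r_e = 1.0 if h == 3 else 0.0          # the environment pays 1 at the goal
    back = max(h - 1, 0)
    if back == h:                          # cannot be pushed below cell 0
        return {(h, r_e): 1.0}
    return {(h, r_e): 0.8, (back, r_e): 0.2}

def rho_e(h):
    """Expected environmental reward: sum over s', r^e of r^e * p(s', r^e | h)."""
    return sum(prob * r_e for (s_next, r_e), prob in p(h).items())

assert abs(sum(p(1).values()) - 1.0) < 1e-12   # p(. | h) is a distribution
```

Here $\rho_e(3) = 1$ because every outcome from the goal afterstate carries environmental reward 1, and $\rho_e(h) = 0$ elsewhere.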
Algorithms for DP Methods #
Here are the Dynamic Programming (DP) algorithms formulated using the afterstate value function $V(h)$:
Algorithm 1: Iterative Policy Evaluation, for estimating $V \approx v_{\pi}$ #
Input: $\pi$, the policy to be evaluated
Parameter: A small threshold $\theta > 0$ determining accuracy of estimation
Initialize: $V(h)$, for all $h \in \mathcal{S}^+$, arbitrarily except that $V(\text{terminal}) = 0$
Repeat:
- $\Delta \leftarrow 0$
- For Each $h \in \mathcal{S}^+$ (Afterstate space):
- $v \leftarrow V(h)$
- $V(h) \leftarrow \sum_{s', r^e}p(s', r^e \mid h)[r^e+\gamma \sum_{a'}\pi(a' \mid s')(V(f(s', a'))+r'_a)]$, where $r'_a = \rho_a(s', a')$
- $\Delta \leftarrow \max(\Delta, |v - V(h)|)$
Until $\Delta < \theta$
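A minimal Python sketch of Algorithm 1, run on a made-up chain task under a uniform random policy (the dynamics, cells 0–3 with a rewarding goal cell at 3 and a 0.2 chance of being pushed back, are illustrative assumptions, not from the text):

```python
# Iterative policy evaluation with afterstate values V(h).
GAMMA, THETA = 0.9, 1e-10
AFTERSTATES = [0, 1, 2, 3]
ACTIONS = [1, 2]                          # move right by 1 or 2, capped at 3

def f(s, a):                              # deterministic afterstate h = f(s, a)
    return min(s + a, 3)

def rho_a(s, a):                          # deterministic action reward
    return -0.1 * a

def p(h):                                 # env dynamics {(s', r_e): prob}
    r_e = 1.0 if h == 3 else 0.0          # the environment pays 1 at the goal
    back = max(h - 1, 0)
    if back == h:
        return {(h, r_e): 1.0}
    return {(h, r_e): 0.8, (back, r_e): 0.2}

def pi(a, s):                             # uniform random policy pi(a | s)
    return 1.0 / len(ACTIONS)

V = {h: 0.0 for h in AFTERSTATES}
while True:
    delta = 0.0
    for h in AFTERSTATES:
        v = V[h]
        # V(h) <- sum_{s',r^e} p(s',r^e|h) [r^e + gamma sum_{a'} pi(a'|s') (V(f(s',a')) + r'_a)]
        V[h] = sum(prob * (r_e + GAMMA * sum(pi(a2, s2) * (V[f(s2, a2)] + rho_a(s2, a2))
                                             for a2 in ACTIONS))
                   for (s2, r_e), prob in p(h).items())
        delta = max(delta, abs(v - V[h]))
    if delta < THETA:
        break
```

Afterstates closer to the goal end up with higher values; because this toy task is continuing, the $V(\text{terminal}) = 0$ clause never applies here.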
Algorithm 2: Policy Iteration for estimating $\pi \approx \pi_*$ #
1. Initialization:
- $V(h) \in \mathbb{R}$ arbitrarily for all $h \in \mathcal{S}^+$, and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s \in \mathcal{S}$
2. Policy Evaluation:
- Repeat:
- $\Delta \leftarrow 0$
- For Each $h \in \mathcal{S}^+$:
- $v \leftarrow V(h)$
- $V(h) \leftarrow \sum_{s', r^e}p(s', r^e \mid h)[r^e+\gamma (V(f(s', \pi(s')))+r'_a)]$, where $r'_a = \rho_a(s', \pi(s'))$
- $\Delta \leftarrow \max(\Delta, |v - V(h)|)$
- Until $\Delta < \theta$
3. Policy Improvement:
- policy-stable $\leftarrow \textit{true}$
- For Each $s \in \mathcal{S}$:
- old-action $\leftarrow \pi(s)$
- $\pi(s) \leftarrow \arg\max_a \big[V(f(s,a)) + \rho_a(s,a)\big]$
- If old-action $\neq \pi(s)$, then policy-stable $\leftarrow \textit{false}$
- If policy-stable, then stop and return $V \approx v_*$ and $\pi \approx \pi_*$; else go to 2
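Algorithm 2 in Python, on the same kind of made-up chain task (cells 0–3, goal cell at 3, 0.2 chance of being pushed back; all dynamics are illustrative assumptions):

```python
# Policy iteration with afterstate values: evaluate pi, then improve it
# greedily via pi(s) <- argmax_a [V(f(s,a)) + rho_a(s,a)].
GAMMA, THETA = 0.9, 1e-10
STATES = [0, 1, 2, 3]                     # afterstate space equals STATES here
ACTIONS = [1, 2]

def f(s, a):
    return min(s + a, 3)

def rho_a(s, a):
    return -0.1 * a

def p(h):
    r_e = 1.0 if h == 3 else 0.0
    back = max(h - 1, 0)
    if back == h:
        return {(h, r_e): 1.0}
    return {(h, r_e): 0.8, (back, r_e): 0.2}

V = {h: 0.0 for h in STATES}
pi = {s: 1 for s in STATES}               # arbitrary initial deterministic policy

while True:
    # 2. Policy evaluation for the current deterministic policy pi
    while True:
        delta = 0.0
        for h in STATES:
            v = V[h]
            V[h] = sum(prob * (r_e + GAMMA * (V[f(s2, pi[s2])] + rho_a(s2, pi[s2])))
                       for (s2, r_e), prob in p(h).items())
            delta = max(delta, abs(v - V[h]))
        if delta < THETA:
            break
    # 3. Policy improvement
    policy_stable = True
    for s in STATES:
        old_action = pi[s]
        pi[s] = max(ACTIONS, key=lambda a: V[f(s, a)] + rho_a(s, a))
        if pi[s] != old_action:
            policy_stable = False
    if policy_stable:
        break
```

On this toy task the stable policy turns out to jump by 2 while far from the goal ($s = 0, 1$) and to take the cheaper single step once one move suffices ($s = 2, 3$).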
Algorithm for TD #
Algorithm 3: Afterstate Q-learning (off-policy TD control) for estimating $V \approx v_*$ #
Parameters: step size $\alpha \in (0,1]$, small $\epsilon > 0$
Initialize: $V(h)$ for all $h \in \mathcal{S}^+$, arbitrarily except that $V(\text{terminal}) = 0$
For each episode:
- Initialize $S$
- For Each step of episode:
- Choose $A$ from $S$ using $\epsilon$-greedy policy derived from $V$
- $H \leftarrow f(S,A)$
- Take action $A$ and observe reward $R_e$ and next state $S'$
- $V(H) \leftarrow V(H) + \alpha [R_e + \gamma \max_{a'}(V(f(S', a')) + \rho_a(S', a')) - V(H)]$
- $S\leftarrow S'$
- If $S$ is terminal, then stop.
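Algorithm 3 as a runnable sketch, again on a made-up episodic chain task (reaching cell 3 pays an environmental reward of 1 and ends the episode; the dynamics and constants are illustrative assumptions, not from the text):

```python
import random

# Afterstate Q-learning: a single table V(h) over afterstates replaces Q(s, a).
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
ACTIONS = [1, 2]
TERMINAL = 3

def f(s, a):                              # deterministic afterstate h = f(s, a)
    return min(s + a, 3)

def rho_a(s, a):                          # deterministic action reward
    return -0.1 * a

def env_step(h, rng):
    """Stochastic environment: the goal afterstate pays 1 and terminates;
    otherwise the agent stays at h (p=0.8) or is pushed back one cell (p=0.2)."""
    if h == 3:
        return TERMINAL, 1.0
    return (h if rng.random() < 0.8 else max(h - 1, 0)), 0.0

rng = random.Random(0)
V = {h: 0.0 for h in range(4)}
for _ in range(300):                      # episodes
    s = 0
    while s != TERMINAL:
        # epsilon-greedy action derived from V(f(s, a)) + rho_a(s, a)
        if rng.random() < EPS:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a2: V[f(s, a2)] + rho_a(s, a2))
        h = f(s, a)
        s_next, r_e = env_step(h, rng)
        target = r_e if s_next == TERMINAL else \
            r_e + GAMMA * max(V[f(s_next, a2)] + rho_a(s_next, a2) for a2 in ACTIONS)
        V[h] += ALPHA * (target - V[h])   # TD update toward the greedy target
        s = s_next
```

The greedy action at any state can then be read off as $\arg\max_a V(f(s,a)) + \rho_a(s,a)$, without a separate $Q$ table.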
Algorithm for Tabular Dyna-Q #
Algorithm 4: Afterstate Dyna-Q algorithm for estimating $\pi \approx \pi_*$ #
Initialize: $V(h)$ for all $h \in \mathcal{S}^+$
Initialize: $Model(s,a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$
For each episode:
- Initialize $S$
- For each step of episode:
- $A \leftarrow \epsilon\text{-greedy}(S, V)$
- $H \leftarrow f(S,A)$
- Take action $A$; observe resultant reward $R_e$ and next state $S'$
- $V(H) \leftarrow V(H) + \alpha [R_e + \gamma \max_{a'}(V(f(S', a')) + \rho_a(S', a')) - V(H)]$
- $Model(S,A)\leftarrow R_e,S'$
- Repeat $n$ times (planning):
- $s \leftarrow$ random previously observed state
- $a \leftarrow$ random action previously taken in $s$
- $r_e, s' \leftarrow Model(s,a)$
- $h \leftarrow f(s,a)$
- $V(h) \leftarrow V(h) + \alpha [r_e + \gamma \max_{a'}(V(f(s', a')) + \rho_a(s', a')) - V(h)]$
- $S\leftarrow S'$
- If $S$ is terminal, then stop.
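And Algorithm 4 as a runnable sketch on the same made-up episodic chain task, with a last-outcome table standing in for $Model(s,a)$. Since this toy environment is stochastic, a last-outcome model is only an approximation of the dynamics, which is fine for illustration:

```python
import random

# Afterstate Dyna-Q: direct RL plus n planning backups from a learned model.
ALPHA, GAMMA, EPS, N_PLAN = 0.1, 0.9, 0.1, 10
ACTIONS = [1, 2]
TERMINAL = 3

def f(s, a):
    return min(s + a, 3)

def rho_a(s, a):
    return -0.1 * a

def env_step(h, rng):
    if h == 3:
        return TERMINAL, 1.0              # goal: env reward 1, episode ends
    return (h if rng.random() < 0.8 else max(h - 1, 0)), 0.0

def backup(V, h, r_e, s_next):
    """One afterstate Q-learning update, shared by direct RL and planning."""
    target = r_e if s_next == TERMINAL else \
        r_e + GAMMA * max(V[f(s_next, a)] + rho_a(s_next, a) for a in ACTIONS)
    V[h] += ALPHA * (target - V[h])

rng = random.Random(0)
V = {h: 0.0 for h in range(4)}
model = {}                                # Model(s, a) -> last observed (r_e, s')
for _ in range(100):
    s = 0
    while s != TERMINAL:
        if rng.random() < EPS:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda a2: V[f(s, a2)] + rho_a(s, a2))
        h = f(s, a)
        s_next, r_e = env_step(h, rng)
        backup(V, h, r_e, s_next)         # direct RL step
        model[(s, a)] = (r_e, s_next)
        for _ in range(N_PLAN):           # planning: replay remembered (s, a)
            ps, pa = rng.choice(list(model))
            pr, ps_next = model[(ps, pa)]
            backup(V, f(ps, pa), pr, ps_next)
        s = s_next
```

The planning loop reuses the same backup as the direct-RL step, so afterstate values sharpen between real environment interactions.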