Afterstate Formulation

Formalization of the afterstate concept in Reinforcement Learning, including value functions and Dynamic Programming / Temporal Difference algorithms.


Formulation #

At each time step $t$, the agent is in an environmental state $S_t \in \mathcal{S}$ and selects an action $A_t \in \mathcal{A}(S_t)$.

After the action, the state is transformed into the afterstate $H_t \in \mathcal{S}^+$. This transformation is deterministic: $h = f(s, a)$.

The reward received is split into two parts: an action reward $R^a_{t+1}$ and an environmental reward $R^e_{t+1}$.

In problems that admit an afterstate formulation, the action reward is deterministic, while the environmental transition can be stochastic.
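As a concrete illustration, consider a hypothetical inventory problem (not from the original text): ordering stock is the deterministic part (the afterstate is the stock level right after ordering, and the ordering cost is the action reward), while random demand is the stochastic environmental part. A minimal sketch:

```python
import random

def f(s, a):
    """Deterministic afterstate: stock level right after ordering a units."""
    return s + a

def rho_a(s, a):
    """Deterministic action reward: ordering cost of 2 per unit."""
    return -2 * a

def env_step(h, rng):
    """Stochastic environment: random demand, revenue 3 per unit sold."""
    d = rng.randint(0, 3)           # demand
    sold = min(h, d)
    return h - sold, 3 * sold       # next state s', environmental reward r_e

rng = random.Random(0)
s, a = 1, 2
h = f(s, a)                     # afterstate: 3 units in stock
r_a = rho_a(s, a)               # deterministic action reward: -4
s_next, r_e = env_step(h, rng)  # stochastic environmental transition
```

Everything up to the afterstate $h$ is fully determined by $(s, a)$; only the environmental step draws randomness.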

The probability of the next state $s' \in \mathcal{S}^+$ and environmental reward $r^e \in \mathcal{R} \subset \mathbb{R}$ occurring at time $t$, given the current afterstate $h$, is:

$$ p(s', r^e \mid h) \doteq \Pr\{ S_{t+1} = s', R^e_{t+1} = r^e \mid H_t = h\} $$

The state-transition probabilities can be written as:

$$ p(s' \mid h) \doteq \Pr\{ S_{t+1} = s' \mid H_t=h \} = \sum_{r^e \in \mathcal{R}} p(s', r^e \mid h) $$

The expected environmental reward as a function of the afterstate, $\rho_e: \mathcal{S}^+ \rightarrow \mathbb{R}$, is:

$$ \rho_e(h) \doteq \mathbb{E} [R^e_{t+1} \mid H_t = h] = \sum_{r^e \in \mathcal{R}} r^e \sum_{s' \in \mathcal{S}} p(s', r^e \mid h) $$
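A quick numeric check of this formula, using a made-up two-outcome transition table $p(s', r^e \mid h)$ for a single fixed afterstate $h$:

```python
# Hypothetical p(s', r^e | h) for one fixed afterstate h, as a table
# mapping (s', r_e) pairs to their probabilities.
p = {
    ("s1", 1.0): 0.7,
    ("s2", 0.0): 0.3,
}

# rho_e(h): sum over (s', r_e) of r_e * p(s', r_e | h)
rho_e = sum(prob * r for (s_next, r), prob in p.items())
# 0.7 * 1.0 + 0.3 * 0.0 = 0.7
```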

The value function and action-value function of an afterstate $h$ under policy $\pi$ are:

$$ \begin{align} v_\pi(h) &\doteq \mathbb{E}[R^e_{t+1} + \gamma G_{t+1} \mid H_t = h] \\ &=\sum_{s'}\sum_{r^e}p(s', r^e \mid h)\Big[r^e+\gamma \sum_{a'}\pi(a' \mid s')\big(v_\pi(f(s', a'))+r_{a'}\big)\Big] \\ q_\pi(s, a) &= v_\pi(f(s, a)) + r_a \end{align} $$ where $r_a = \rho_a(s,a)$ and $r_{a'} = \rho_a(s',a')$ are the deterministic action rewards.

Algorithm for DP Methods #

Here are the Dynamic Programming (DP) algorithms formulated using the afterstate value function $V(h)$:


Algorithm 1: Iterative Policy Evaluation, for estimating $V \approx v_{\pi}$ #

Input: $\pi$, the policy to be evaluated
Parameter: A small threshold $\theta > 0$ determining accuracy of estimation
Initialize: $V(h)$, for all $h \in \mathcal{S}^+$, arbitrarily except that $V(\text{terminal}) = 0$
Loop:

  1. $\Delta \leftarrow 0$
  2. For each $h \in \mathcal{S}^+$ (the afterstate space):
    • $v \leftarrow V(h)$
    • $V(h) \leftarrow \sum_{s', r^e}p(s', r^e \mid h)[r^e+\gamma \sum_{a'}\pi(a' \mid s')(V(f(s', a'))+r_{a'})]$
    • $\Delta \leftarrow \max(\Delta, |v - V(h)|)$

Until $\Delta < \theta$
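The sweep above can be sketched in Python. The two-state problem below is hypothetical, chosen so the fixed point is easy to check by hand: from state 0, "go" ends the episode with environmental reward 1, while "stay" loops back to state 0.

```python
gamma, theta = 0.9, 1e-10
actions = ["stay", "go"]

def f(s, a):
    return (s, a)                       # deterministic afterstate

rho_a = {(0, "stay"): 0.0, (0, "go"): -0.5}   # deterministic action rewards

# p(s', r_e | h): list of (probability, s', r_e) triples per afterstate
P = {
    (0, "stay"): [(1.0, 0, 0.0)],       # back to state 0, no environmental reward
    (0, "go"):   [(1.0, "T", 1.0)],     # to the terminal state with reward 1
}

pi = {0: {"stay": 0.5, "go": 0.5}}      # equiprobable policy to evaluate

V = {h: 0.0 for h in P}
while True:
    delta = 0.0
    for h in P:
        v, total = V[h], 0.0
        for prob, s_next, r_e in P[h]:
            if s_next == "T":           # terminal: no continuation value
                total += prob * r_e
            else:                       # expected continuation under pi
                cont = sum(pi[s_next][a] * (V[f(s_next, a)] + rho_a[(s_next, a)])
                           for a in actions)
                total += prob * (r_e + gamma * cont)
        V[h] = total
        delta = max(delta, abs(v - V[h]))
    if delta < theta:
        break
```

The fixed point can be verified analytically: $V(f(0,\text{go})) = 1$, and $V(f(0,\text{stay}))$ solves $V = 0.9\,(0.5\,V + 0.5 \cdot 0.5) = 9/22 \approx 0.409$.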

Algorithm 2: Policy Iteration for estimating $\pi \approx \pi_*$ #

1. Initialization:

  • $V(h) \in \mathbb{R}$ arbitrarily for all $h \in \mathcal{S}^+$, and $\pi(s) \in \mathcal{A}(s)$ arbitrarily for all $s\in \mathcal{S}$

2. Policy Evaluation:

  • Loop:
    • $\Delta \leftarrow 0$
    • For each $h \in \mathcal{S}^+$:
      • $v \leftarrow V(h)$
      • $V(h) \leftarrow \sum_{s', r^e}p(s', r^e \mid h)[r^e+\gamma (V(f(s', \pi(s')))+\rho_a(s', \pi(s')))]$
      • $\Delta \leftarrow \max(\Delta, |v - V(h)|)$
  • Until $\Delta < \theta$

3. Policy Improvement:

  • policy-stable $\leftarrow \textit{true}$
  • For each $s \in \mathcal{S}$:
    • old-action $\leftarrow \pi(s)$
    • $\pi(s) \leftarrow \mathrm{argmax}_a [V(f(s,a)) + \rho_a(s,a)]$
    • If old-action $\neq \pi(s)$, then policy-stable $\leftarrow \textit{false}$
  • If policy-stable, then stop and return $V \approx v_*$ and $\pi \approx \pi_*$; else go to 2
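The policy-improvement step alone is a one-liner: make the policy greedy with respect to afterstate value plus action reward. A sketch on hypothetical toy values:

```python
def improve(states, actions, f, V, rho_a):
    """One policy-improvement sweep: greedy w.r.t. V(f(s,a)) + rho_a(s,a)."""
    return {s: max(actions, key=lambda a: V[f(s, a)] + rho_a[(s, a)])
            for s in states}

# Hypothetical afterstate values for a single state with two actions:
f = lambda s, a: (s, a)
V = {(0, "stay"): 0.41, (0, "go"): 1.0}
rho_a = {(0, "stay"): 0.0, (0, "go"): -0.5}
pi = improve([0], ["stay", "go"], f, V, rho_a)
# "go" scores 1.0 - 0.5 = 0.5, which beats "stay" at 0.41 + 0.0
```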

Algorithm for TD #

Algorithm 3: Afterstate Q-learning (off-policy TD control) for estimating $V \approx v_*$ #

Parameters: step size $\alpha \in (0,1]$, small $\epsilon > 0$
Initialize: $V(h)$ for all $h \in \mathcal{S}^+$, arbitrarily except that $V(\text{terminal}) = 0$

For each episode:

  1. Initialize $S$
  2. For each step of episode:
    • Choose $A$ from $S$ using the $\epsilon$-greedy policy derived from $V$
    • $H \leftarrow f(S,A)$
    • Take action $A$ and observe reward $R_e$ and next state $S'$
    • $V(H) \leftarrow V(H) + \alpha [R_e + \gamma \max_{a'}(V(f(S', a')) + \rho_a(S', a')) - V(H)]$
    • $S\leftarrow S'$
    • If $S$ is terminal, then stop.
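A minimal runnable sketch of this loop, on a hypothetical two-action chain (the environment, rewards, and hyperparameters are illustrative, and the environment is deterministic for simplicity). The values should approach $V(f(0,\text{go})) = 1$ and $V(f(0,\text{stay})) = \gamma \cdot 0.5 = 0.45$:

```python
import random

gamma, alpha, eps = 0.9, 0.5, 0.2
actions = ["stay", "go"]
f = lambda s, a: (s, a)                       # deterministic afterstate
rho_a = {(0, "stay"): 0.0, (0, "go"): -0.5}   # deterministic action rewards

def env(h):
    """Environmental step from an afterstate: 'go' terminates with r_e = 1."""
    s, a = h
    return ("T", 1.0) if a == "go" else (0, 0.0)   # (S', R_e)

def greedy(s, V):
    return max(actions, key=lambda a: V[f(s, a)] + rho_a[(s, a)])

rng = random.Random(0)
V = {f(0, a): 0.0 for a in actions}
for _ in range(2000):
    s = 0
    while s != "T":
        # epsilon-greedy action selection derived from V
        a = rng.choice(actions) if rng.random() < eps else greedy(s, V)
        h = f(s, a)
        s_next, r_e = env(h)
        if s_next == "T":
            target = r_e
        else:
            target = r_e + gamma * max(V[f(s_next, b)] + rho_a[(s_next, b)]
                                       for b in actions)
        V[h] += alpha * (target - V[h])
        s = s_next
```

Note that the max in the target ranges over next-step afterstates, adding back the deterministic action reward, exactly as in the update rule above.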

Algorithm for Tabular Dyna-Q #

Algorithm 4: Afterstate Dyna-Q algorithm for estimating $\pi \approx \pi_*$ #

Initialize: $V(h)$ for all $h \in \mathcal{S}^+$
Initialize: $Model(s,a)$ for all $s \in \mathcal{S}$ and $a \in \mathcal{A}(s)$

For each episode:

  1. Initialize $S$
  2. For each step of episode:
    • $A \leftarrow \epsilon\text{-greedy}(S, V)$
    • $H \leftarrow f(S,A)$
    • Take action $A$; observe resultant reward $R_e$ and next state $S'$
    • $V(H) \leftarrow V(H) + \alpha [R_e + \gamma \max_{a'}(V(f(S', a')) + \rho_a(S', a')) - V(H)]$
    • $Model(S,A)\leftarrow R_e,S'$
    • Repeat $n$ times (planning):
      • $s \leftarrow$ random previously observed state
      • $a \leftarrow$ random action previously taken in $s$
      • $r_e, s' \leftarrow Model(s,a)$
      • $h \leftarrow f(s,a)$
      • $V(h) \leftarrow V(h) + \alpha [r_e + \gamma \max_{a'}(V(f(s', a')) + \rho_a(s', a')) - V(h)]$
    • $S\leftarrow S'$
    • if $S$ is terminal, then stop.
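A sketch of the full loop on the same kind of hypothetical two-action chain as before (environment, rewards, and hyperparameters are illustrative). The only additions over plain afterstate Q-learning are the model table and the planning loop of `n_plan` simulated updates per real step:

```python
import random

gamma, alpha, eps, n_plan = 0.9, 0.5, 0.2, 5
actions = ["stay", "go"]
f = lambda s, a: (s, a)                       # deterministic afterstate
rho_a = {(0, "stay"): 0.0, (0, "go"): -0.5}   # deterministic action rewards

def env(h):
    """Environmental step: 'go' terminates with r_e = 1, 'stay' loops back."""
    s, a = h
    return ("T", 1.0) if a == "go" else (0, 0.0)

def target(r_e, s_next, V):
    """TD target over next-step afterstates, with the action reward added back."""
    if s_next == "T":
        return r_e
    return r_e + gamma * max(V[f(s_next, b)] + rho_a[(s_next, b)] for b in actions)

rng = random.Random(0)
V = {f(0, a): 0.0 for a in actions}
model = {}                                    # Model(s, a) -> (r_e, s')
for _ in range(200):
    s = 0
    while s != "T":
        if rng.random() < eps:
            a = rng.choice(actions)
        else:
            a = max(actions, key=lambda b: V[f(s, b)] + rho_a[(s, b)])
        h = f(s, a)
        s_next, r_e = env(h)
        V[h] += alpha * (target(r_e, s_next, V) - V[h])   # direct RL update
        model[(s, a)] = (r_e, s_next)                     # model learning
        for _ in range(n_plan):                           # planning updates
            sp, ap = rng.choice(list(model))
            rp, snp = model[(sp, ap)]
            V[f(sp, ap)] += alpha * (target(rp, snp, V) - V[f(sp, ap)])
        s = s_next
```

Because the planning updates replay the same backup from simulated experience, the values converge to the same fixed point with far fewer real episodes.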