Dyna-Q+ Algorithm

Detailed pseudo-code for the Dyna-Q+ algorithm, covering both deterministic and non-stationary environments, with a focus on exploration bonuses.


Pseudo Code for Dyna-Q+ Algorithm for Deterministic Environment #


Algorithm 1: Dyna-Q+ Algorithm #

Algorithm parameters: learning rate $\alpha$, discount factor $\gamma$, a small exploration-bonus constant $\kappa$, a small exploration rate $\epsilon \in (0,1)$, and the number of planning steps $n$.
Initialization:

  1. Initialize $Q(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ arbitrarily.
  2. Initialize $\textit{Model}(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ as a dictionary.
  3. Initialize $\textit{last\_visit}(s,a) \leftarrow 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
  4. Initialize step count $t \leftarrow 0$.

Main Loop: Repeat forever:

  1. $S \leftarrow$ current (non-terminal) state
  2. $A \leftarrow \epsilon$-greedy($S, Q$)
  3. Take action $A$; observe resultant reward $R$ and next state $S'$
  4. $t \leftarrow t + 1$
  5. $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_{a}Q(S',a) - Q(S,A)]$
  6. $\textit{last\_visit}(S,A) \leftarrow t$
  7. $\textit{Model}(S,A) \leftarrow (R, S')$
  8. For each action $a \in \mathcal{A}(S)$ with $a \neq A$ that has never been tried (i.e., $\textit{Model}(S,a)$ is still unset):
    • $\textit{Model}(S,a) \leftarrow (0, S) \quad$ (assume untried actions lead back to $S$ with zero reward, so planning can reach them; real experience is never overwritten)
  9. For $i=1$ to $n$: (Planning Steps)
    • $S \leftarrow$ random previously observed state
    • $A \leftarrow$ random action taken in $S$
    • $R, S' \leftarrow \textit{Model}(S, A)$
    • $\textit{bonus} \leftarrow \kappa \sqrt{t - \textit{last\_visit}(S, A)}$
    • $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \textit{bonus} + \gamma \max_{a}Q(S', a) - Q(S, A)]$
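The steps above can be sketched in Python. Note that the environment interface used here (`reset()`, `step(a)`, `actions(s)`) is an assumed, hypothetical API, not part of the pseudocode:

```python
import math
import random
from collections import defaultdict

def dyna_q_plus(env, n_steps=5000, alpha=0.1, gamma=0.95,
                epsilon=0.1, kappa=1e-3, n_planning=20):
    """Tabular Dyna-Q+ for a deterministic environment.

    Assumed env API: reset() -> state, step(a) -> (next_state, reward, done),
    actions(s) -> list of available actions.
    """
    Q = defaultdict(float)         # Q[(s, a)], defaults to 0
    model = {}                     # model[(s, a)] = (reward, next_state)
    last_visit = defaultdict(int)  # last time step on which (s, a) was taken
    t = 0

    S = env.reset()
    while t < n_steps:
        acts = env.actions(S)
        # epsilon-greedy action selection
        if random.random() < epsilon:
            A = random.choice(acts)
        else:
            A = max(acts, key=lambda a: Q[(S, a)])

        S2, R, done = env.step(A)
        t += 1

        # direct RL update (one-step Q-learning)
        target = R if done else R + gamma * max(Q[(S2, a)] for a in env.actions(S2))
        Q[(S, A)] += alpha * (target - Q[(S, A)])

        last_visit[(S, A)] = t
        model[(S, A)] = (R, S2)
        # untried actions are modeled as returning to S with reward 0,
        # so planning can reach them; real experience is never overwritten
        for a in acts:
            if a != A and (S, a) not in model:
                model[(S, a)] = (0.0, S)

        # planning: n simulated updates with the sqrt-of-elapsed-time bonus
        for _ in range(n_planning):
            Sp, Ap = random.choice(list(model))
            Rp, S2p = model[(Sp, Ap)]
            bonus = kappa * math.sqrt(t - last_visit[(Sp, Ap)])
            tgt = Rp + bonus + gamma * max(Q[(S2p, a)] for a in env.actions(S2p))
            Q[(Sp, Ap)] += alpha * (tgt - Q[(Sp, Ap)])

        S = env.reset() if done else S2
    return Q
```

With a small $\kappa$, the bonus barely distorts converged values but steadily raises the planned value of long-unvisited pairs, nudging the greedy policy to re-test them.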

Dyna-Q+ Algorithm for Non-Stationary Environments #

Answer to Exercise 8.5 on page 168 of *Reinforcement Learning: An Introduction* (Sutton & Barto, 2018).


Algorithm 2: Dyna-Q+ Algorithm with Model Forgetting #

Algorithm parameters: learning rate $\alpha$, discount factor $\gamma$, an exploration-bonus constant $\kappa$, a small exploration rate $\epsilon \in (0,1)$, the number of planning steps $n$, and a time window $\tau$ for the model history.
Initialization:

  1. Initialize $Q(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ arbitrarily.
  2. Initialize $\textit{Model}(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ as an empty list.
  3. Initialize $\textit{last\_visit}(s,a) \leftarrow 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
  4. Initialize step count $t \leftarrow 0$.

Main Loop: Repeat forever:

  1. $S \leftarrow$ current (non-terminal) state
  2. $A \leftarrow \epsilon$-greedy($S, Q$)
  3. Take action $A$; observe resultant reward $R$ and next state $S'$
  4. $t \leftarrow t + 1$
  5. $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_{a}Q(S',a) - Q(S,A)]$
  6. $\textit{last\_visit}(S,A) \leftarrow t$
  7. Append $(R, S', t)$ to $\textit{Model}(S,A)$
  8. For $i=1$ to $n$: (Planning Steps)
    • $S \leftarrow$ random previously observed state
    • $A \leftarrow$ random action taken in $S$
    • If $\textit{Model}(S, A)$ is not empty:
      • Filter $\textit{Model}(S, A)$ to the experiences whose timestamp is $\geq t - \tau$
      • If the filtered list is empty, skip this planning step
      • Select $(R, S', t_{exp})$ uniformly at random from the filtered list
      • $\textit{bonus} \leftarrow \kappa \sqrt{t - \textit{last\_visit}(S, A)}$
      • $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \textit{bonus} + \gamma \max_{a}Q(S', a) - Q(S, A)]$
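The forgetting variant can be sketched the same way. It differs from Algorithm 1 only in the model: each $(S,A)$ pair keeps a timestamped list of experiences, and planning samples only from those recorded within the last $\tau$ steps. The env interface is the same assumed API as before:

```python
import math
import random
from collections import defaultdict

def dyna_q_plus_forgetting(env, n_steps=5000, alpha=0.1, gamma=0.95,
                           epsilon=0.1, kappa=1e-3, n_planning=20, tau=500):
    """Dyna-Q+ with model forgetting for non-stationary environments.

    Assumed env API: reset() -> state, step(a) -> (next_state, reward, done),
    actions(s) -> list of available actions.
    """
    Q = defaultdict(float)
    model = defaultdict(list)      # model[(s, a)] = [(reward, next_state, t), ...]
    last_visit = defaultdict(int)
    t = 0

    S = env.reset()
    while t < n_steps:
        acts = env.actions(S)
        if random.random() < epsilon:
            A = random.choice(acts)
        else:
            A = max(acts, key=lambda a: Q[(S, a)])

        S2, R, done = env.step(A)
        t += 1

        # direct RL update
        target = R if done else R + gamma * max(Q[(S2, a)] for a in env.actions(S2))
        Q[(S, A)] += alpha * (target - Q[(S, A)])
        last_visit[(S, A)] = t
        model[(S, A)].append((R, S2, t))  # timestamped experience

        for _ in range(n_planning):
            Sp, Ap = random.choice(list(model))
            # forgetting: plan only from experience inside the time window tau
            recent = [e for e in model[(Sp, Ap)] if e[2] >= t - tau]
            if not recent:
                continue
            Rp, S2p, _ = random.choice(recent)
            bonus = kappa * math.sqrt(t - last_visit[(Sp, Ap)])
            tgt = Rp + bonus + gamma * max(Q[(S2p, a)] for a in env.actions(S2p))
            Q[(Sp, Ap)] += alpha * (tgt - Q[(Sp, Ap)])

        S = env.reset() if done else S2
    return Q
```

Dropping experience older than $\tau$ keeps planning from propagating transitions that a non-stationary environment may no longer produce; a production version would also prune the lists to bound memory, which this sketch omits.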