Dyna-Q+ Algorithm

Detailed pseudo-code for the Dyna-Q+ algorithm, covering both deterministic and non-stationary environments, with a focus on exploration bonuses.


Pseudo Code for Dyna-Q+ Algorithm for Deterministic Environment #


Algorithm 1: Dyna-Q+ Algorithm #

Algorithm parameters: learning rate $\alpha$, discount factor $\gamma$, a small exploration-bonus constant $\kappa$, a small exploration rate $\epsilon \in (0,1)$, and the number of planning steps $n$.
Initialization:

  1. Initialize $Q(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ arbitrarily.
  2. Initialize $\textit{Model}(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ as a dictionary.
  3. Initialize $\textit{last\_visit}(s,a) \leftarrow 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
  4. Initialize step count $t \leftarrow 0$.

Main Loop: Repeat forever:

  1. $S \leftarrow$ current (non-terminal) state
  2. $A \leftarrow \epsilon$-greedy($S, Q$)
  3. Take action $A$; observe resultant reward $R$ and next state $S'$
  4. $t \leftarrow t + 1$
  5. $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_{a}Q(S',a) - Q(S,A)]$
  6. $\textit{last\_visit}(S,A) \leftarrow t$
  7. $\textit{Model}(S,A) \leftarrow (R, S')$
  8. For each action $a \in \mathcal{A}(S)$ with $a \neq A$ that has never been tried (i.e., $\textit{Model}(S,a)$ is still unset):
    • $\textit{Model}(S,a) \leftarrow (0, S) \quad$ (assume untried actions lead back to $S$ with zero reward, so planning can reach them; real experience is never overwritten)
  9. For $i=1$ to $n$: (Planning Steps)
    • $S \leftarrow$ random previously observed state
    • $A \leftarrow$ random action taken in $S$
    • $R, S' \leftarrow \textit{Model}(S, A)$
    • $\textit{bonus} \leftarrow \kappa \sqrt{t - \textit{last\_visit}(S, A)}$
    • $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \textit{bonus} + \gamma \max_{a}Q(S', a) - Q(S, A)]$
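The steps above can be sketched in Python. Note that the environment interface used here (`reset()`, `step(a)`, `actions(s)`) is an assumed, hypothetical API, not part of the pseudocode:

```python
import math
import random
from collections import defaultdict

def dyna_q_plus(env, n_steps=5000, alpha=0.1, gamma=0.95,
                epsilon=0.1, kappa=1e-3, n_planning=20):
    """Tabular Dyna-Q+ for a deterministic environment.

    Assumed env API: reset() -> state, step(a) -> (next_state, reward, done),
    actions(s) -> list of available actions.
    """
    Q = defaultdict(float)         # Q[(s, a)], defaults to 0
    model = {}                     # model[(s, a)] = (reward, next_state)
    last_visit = defaultdict(int)  # last time step on which (s, a) was taken
    t = 0

    S = env.reset()
    while t < n_steps:
        acts = env.actions(S)
        # epsilon-greedy action selection
        if random.random() < epsilon:
            A = random.choice(acts)
        else:
            A = max(acts, key=lambda a: Q[(S, a)])

        S2, R, done = env.step(A)
        t += 1

        # direct RL update (one-step Q-learning)
        target = R if done else R + gamma * max(Q[(S2, a)] for a in env.actions(S2))
        Q[(S, A)] += alpha * (target - Q[(S, A)])

        last_visit[(S, A)] = t
        model[(S, A)] = (R, S2)
        # untried actions are modeled as returning to S with reward 0,
        # so planning can reach them; real experience is never overwritten
        for a in acts:
            if a != A and (S, a) not in model:
                model[(S, a)] = (0.0, S)

        # planning: n simulated updates with the sqrt-of-elapsed-time bonus
        for _ in range(n_planning):
            Sp, Ap = random.choice(list(model))
            Rp, S2p = model[(Sp, Ap)]
            bonus = kappa * math.sqrt(t - last_visit[(Sp, Ap)])
            tgt = Rp + bonus + gamma * max(Q[(S2p, a)] for a in env.actions(S2p))
            Q[(Sp, Ap)] += alpha * (tgt - Q[(Sp, Ap)])

        S = env.reset() if done else S2
    return Q
```

With a small $\kappa$, the bonus barely distorts converged values but steadily raises the planned value of long-unvisited pairs, nudging the greedy policy to re-test them.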

Dyna-Q+ Algorithm for Non-Stationary Environments #

Answer to Exercise 8.5 on page 168 of *Reinforcement Learning: An Introduction* (Sutton & Barto, 2018).


Algorithm 2: Dyna-Q+ Algorithm with Model Forgetting #

Algorithm parameters: learning rate $\alpha$, discount factor $\gamma$, an exploration-bonus constant $\kappa$, a small exploration rate $\epsilon \in (0,1)$, the number of planning steps $n$, and a time window $\tau$ for the model history.
Initialization:

  1. Initialize $Q(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ arbitrarily.
  2. Initialize $\textit{Model}(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ as an empty list.
  3. Initialize $\textit{last\_visit}(s,a) \leftarrow 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
  4. Initialize step count $t \leftarrow 0$.

Main Loop: Repeat forever:

  1. $S \leftarrow$ current (non-terminal) state
  2. $A \leftarrow \epsilon$-greedy($S, Q$)
  3. Take action $A$; observe resultant reward $R$ and next state $S'$
  4. $t \leftarrow t + 1$
  5. $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_{a}Q(S',a) - Q(S,A)]$
  6. $\textit{last\_visit}(S,A) \leftarrow t$
  7. Append $(R, S', t)$ to $\textit{Model}(S,A)$
  8. For $i=1$ to $n$: (Planning Steps)
    • $S \leftarrow$ random previously observed state
    • $A \leftarrow$ random action taken in $S$
    • If $\textit{Model}(S, A)$ is not empty:
      • Filter $\textit{Model}(S, A)$ to the experiences whose timestamp is $\geq t - \tau$
      • If the filtered list is empty, skip this planning step
      • Select $(R, S', t_{exp})$ uniformly at random from the filtered list
      • $\textit{bonus} \leftarrow \kappa \sqrt{t - \textit{last\_visit}(S, A)}$
      • $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \textit{bonus} + \gamma \max_{a}Q(S', a) - Q(S, A)]$
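The forgetting variant can be sketched the same way. It differs from Algorithm 1 only in the model: each $(S,A)$ pair keeps a timestamped list of experiences, and planning samples only from those recorded within the last $\tau$ steps. The env interface is the same assumed API as before:

```python
import math
import random
from collections import defaultdict

def dyna_q_plus_forgetting(env, n_steps=5000, alpha=0.1, gamma=0.95,
                           epsilon=0.1, kappa=1e-3, n_planning=20, tau=500):
    """Dyna-Q+ with model forgetting for non-stationary environments.

    Assumed env API: reset() -> state, step(a) -> (next_state, reward, done),
    actions(s) -> list of available actions.
    """
    Q = defaultdict(float)
    model = defaultdict(list)      # model[(s, a)] = [(reward, next_state, t), ...]
    last_visit = defaultdict(int)
    t = 0

    S = env.reset()
    while t < n_steps:
        acts = env.actions(S)
        if random.random() < epsilon:
            A = random.choice(acts)
        else:
            A = max(acts, key=lambda a: Q[(S, a)])

        S2, R, done = env.step(A)
        t += 1

        # direct RL update
        target = R if done else R + gamma * max(Q[(S2, a)] for a in env.actions(S2))
        Q[(S, A)] += alpha * (target - Q[(S, A)])
        last_visit[(S, A)] = t
        model[(S, A)].append((R, S2, t))  # timestamped experience

        for _ in range(n_planning):
            Sp, Ap = random.choice(list(model))
            # forgetting: plan only from experience inside the time window tau
            recent = [e for e in model[(Sp, Ap)] if e[2] >= t - tau]
            if not recent:
                continue
            Rp, S2p, _ = random.choice(recent)
            bonus = kappa * math.sqrt(t - last_visit[(Sp, Ap)])
            tgt = Rp + bonus + gamma * max(Q[(S2p, a)] for a in env.actions(S2p))
            Q[(Sp, Ap)] += alpha * (tgt - Q[(Sp, Ap)])

        S = env.reset() if done else S2
    return Q
```

Dropping experience older than $\tau$ keeps planning from propagating transitions that a non-stationary environment may no longer produce; a production version would also prune the lists to bound memory, which this sketch omits.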