Pseudocode for the Dyna-Q+ Algorithm in a Deterministic Environment #
Algorithm 1: Dyna-Q+ Algorithm #
Algorithm parameters: learning rate $\alpha$, discount factor $\gamma$, a small exploration bonus constant $\kappa$, a small exploration rate $\epsilon \in (0,1)$, and the number of planning steps $n$.
Initialization:
- Initialize $Q(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ arbitrarily.
- Initialize $\textit{Model}(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ as a dictionary.
- Initialize $\textit{last\_visit}(s,a) \leftarrow 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
- Initialize step count $t \leftarrow 0$.
Main Loop: Repeat (forever):
- $S \leftarrow$ current (non-terminal) state
- $A \leftarrow \epsilon$-greedy($S, Q$)
- Take action $A$; observe resultant reward $R$ and next state $S'$
- $t \leftarrow t + 1$
- $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_{a}Q(S',a) - Q(S,A)]$
- $\textit{last\_visit}(S,A) \leftarrow t$
- $\textit{Model}(S,A) \leftarrow (R, S')$
- For each action $a \in \mathcal{A}(S)$ where $a \neq A$, if $\textit{Model}(S,a)$ has not yet been set:
- $\textit{Model}(S,a) \leftarrow (0, S) \quad$ (Untried actions are assumed to return to the same state with zero reward)
- For $i=1$ to $n$: (Planning Steps)
- $S \leftarrow$ random previously observed state
- $A \leftarrow$ random action taken in $S$
- $R, S' \leftarrow \textit{Model}(S, A)$
- $\textit{bonus} \leftarrow \kappa \sqrt{t - \textit{last\_visit}(S, A)}$
- $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \textit{bonus} + \gamma \max_{a}Q(S', a) - Q(S, A)]$
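The listing above can be sketched in Python. The `env_step(s, a) -> (reward, next_state)` interface and all hyperparameter defaults below are assumptions for illustration, not part of the pseudocode; the sketch treats the task as continuing, matching the "repeat forever" loop:

```python
import math
import random
from collections import defaultdict

def dyna_q_plus(env_step, actions, start_state, alpha=0.1, gamma=0.95,
                kappa=1e-3, epsilon=0.1, n_planning=10, total_steps=2000,
                seed=0):
    """Tabular Dyna-Q+ for a deterministic environment (sketch of Algorithm 1).

    env_step(s, a) -> (reward, next_state) is an assumed interface for a
    continuing (non-episodic) task.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)         # Q[(s, a)], initialized to 0
    model = {}                     # model[(s, a)] = (reward, next_state)
    last_visit = defaultdict(int)  # step of the last real visit to (s, a)
    t = 0
    S = start_state
    for _ in range(total_steps):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            A = rng.choice(actions)
        else:
            A = max(actions, key=lambda a: Q[(S, a)])
        R, S_next = env_step(S, A)
        t += 1
        # direct Q-learning update from the real transition
        Q[(S, A)] += alpha * (R + gamma * max(Q[(S_next, a)] for a in actions)
                              - Q[(S, A)])
        last_visit[(S, A)] = t
        model[(S, A)] = (R, S_next)
        # untried actions are modeled as staying in the state with reward 0
        for a in actions:
            if a != A and (S, a) not in model:
                model[(S, a)] = (0.0, S)
        # planning: n simulated updates, each with the exploration bonus
        for _ in range(n_planning):
            Sp, Ap = rng.choice(list(model))
            Rp, Sp_next = model[(Sp, Ap)]
            bonus = kappa * math.sqrt(t - last_visit[(Sp, Ap)])
            Q[(Sp, Ap)] += alpha * (Rp + bonus
                                    + gamma * max(Q[(Sp_next, a)]
                                                  for a in actions)
                                    - Q[(Sp, Ap)])
        S = S_next
    return Q
```

On a small deterministic chain where moving right eventually pays off, the learned greedy action at every state ends up being "right", with the stale-action bonus doing little once the reward has been found.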
Dyna-Q+ Algorithm for Non-Stationary Environments #
Answer to Exercise 8.5 on page 168 of *Reinforcement Learning: An Introduction* (Sutton and Barto, 2018).
Algorithm 2: Dyna-Q+ Algorithm with Model Forgetting #
Algorithm parameters: learning rate $\alpha$, discount factor $\gamma$, exploration bonus constant $\kappa$, a small exploration rate $\epsilon \in (0,1)$, the number of planning steps $n$, and a time window $\tau$ for the model history.
Initialization:
- Initialize $Q(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ arbitrarily.
- Initialize $\textit{Model}(s,a)$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$ as an empty list.
- Initialize $\textit{last\_visit}(s,a) \leftarrow 0$ for all $s \in \mathcal{S}, a \in \mathcal{A}(s)$.
- Initialize step count $t \leftarrow 0$.
Main Loop: Repeat (forever):
- $S \leftarrow$ current (non-terminal) state
- $A \leftarrow \epsilon$-greedy($S, Q$)
- Take action $A$; observe resultant reward $R$ and next state $S'$
- $t \leftarrow t + 1$
- $Q(S,A) \leftarrow Q(S,A) + \alpha[R + \gamma \max_{a}Q(S',a) - Q(S,A)]$
- $\textit{last\_visit}(S,A) \leftarrow t$
- Append $(R, S', t)$ to $\textit{Model}(S,A)$
- For $i=1$ to $n$: (Planning Steps)
- $S \leftarrow$ random previously observed state
- $A \leftarrow$ random action taken in $S$
- If $\textit{Model}(S, A)$ is not empty:
- Filter $\textit{Model}(S, A)$ to keep the experiences whose timestamp $\geq t - \tau$; if none remain, skip this planning step
- Select $(R, S', t_{exp})$ uniformly at random from the filtered list
- $\textit{bonus} \leftarrow \kappa \sqrt{t - \textit{last\_visit}(S, A)}$
- $Q(S, A) \leftarrow Q(S, A) + \alpha[R + \textit{bonus} + \gamma \max_{a}Q(S', a) - Q(S, A)]$
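Only the model representation and the planning update differ from Algorithm 1. A minimal Python sketch of the modified planning step, assuming `Q` and `last_visit` are dictionaries keyed by `(state, action)` and `model[(s, a)]` is a list of `(reward, next_state, timestamp)` tuples (all names here are illustrative):

```python
import math
import random
from collections import defaultdict

def planning_step(Q, model, last_visit, t, actions, rng,
                  alpha=0.1, gamma=0.95, kappa=1e-3, tau=500):
    """One simulated update for Dyna-Q+ with model forgetting (sketch).

    model[(s, a)] holds (reward, next_state, timestamp) tuples; only
    experiences recorded within the last tau steps are replayed, so
    transitions observed before an environment change age out.
    """
    S, A = rng.choice(list(model))  # random previously observed (s, a)
    recent = [e for e in model[(S, A)] if e[2] >= t - tau]
    if not recent:
        return  # every recorded experience for (S, A) is stale; skip
    R, S_next, _ = rng.choice(recent)
    bonus = kappa * math.sqrt(t - last_visit[(S, A)])
    Q[(S, A)] += alpha * (R + bonus
                          + gamma * max(Q[(S_next, a)] for a in actions)
                          - Q[(S, A)])
```

In the main loop, `model[(S, A)].append((R, S_next, t))` replaces the single-tuple assignment of Algorithm 1, and the planning loop calls `planning_step` $n$ times per real step.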