Blogs

11 Nov 2025

Temporal Difference (TD) Control Algorithms Comparison: SARSA, Expected SARSA, and Q-learning

Comparative analysis of the major one-step Temporal Difference (TD) control algorithms: SARSA, Expected SARSA, and Q-learning, focusing on whether each is on-policy or off-policy and how each constructs its update target.

30 Sep 2025

Derivation for Action-Value Function in Off-Policy Learning

Detailed derivation of the action-value function $Q(s, a)$ in off-policy learning using importance sampling, and an explanation of the backward loop implementation in Monte Carlo prediction.

15 Sep 2025

Reinforcement Learning for Outfit Compatibility

Modeling the outfit compatibility problem as a Markov Decision Process (MDP), defining the state space, action space, and afterstate formulation for sequential item selection.

10 Sep 2025

Dyna-Q+ Algorithm

Detailed pseudocode for the Dyna-Q+ algorithm, covering both deterministic and non-stationary environments, with a focus on exploration bonuses.

1 Sep 2024

Afterstate Formulation

Formalization of the afterstate concept in Reinforcement Learning, including value functions and Dynamic Programming / Temporal Difference algorithms.