Blogs
11 Nov 2025
Temporal Difference (TD) Control Algorithms Comparison: SARSA, Expected SARSA, and Q-learning
Comparative analysis of the major one-step Temporal Difference (TD) control algorithms (SARSA, Expected SARSA, and Q-learning), focusing on their on-policy or off-policy nature and on how each constructs its TD target.
30 Sep 2025
Derivation for Action-Value Function in Off-Policy Learning
Detailed derivation of the action-value function $Q(s, a)$ in off-policy learning using importance sampling, and an explanation of the backward loop implementation in Monte Carlo prediction.
15 Sep 2025
Reinforcement Learning for Outfit Compatibility
Modeling the outfit compatibility problem as a Markov Decision Process (MDP), defining the state space, action space, and afterstate formulation for sequential item selection.
10 Sep 2025
Detailed pseudo-code for the Dyna-Q+ algorithm, covering both deterministic and non-stationary environments, with a focus on exploration bonuses.
1 Sep 2024
Formalization of the afterstate concept in Reinforcement Learning, including value functions and Dynamic Programming / Temporal Difference algorithms.