Off-Policy

30 Sep 2025

Derivation of the Action-Value Function in Off-Policy Learning

A detailed derivation of the importance-sampling estimate of the action-value function $Q(s, a)$ in off-policy learning, and an explanation of the backward loop used to implement Monte Carlo prediction.
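As a rough illustration of the backward loop mentioned above, here is a minimal sketch of off-policy Monte Carlo prediction for $Q(s, a)$. It assumes the weighted importance sampling variant with the incremental update $Q \leftarrow Q + \frac{W}{C}(G - Q)$, which may differ from the variant the derivation uses. The names `off_policy_mc_q`, `target_prob`, and `behavior_prob`, the episode format, and the toy data are all hypothetical and chosen only for this example.

```python
from collections import defaultdict


def off_policy_mc_q(episodes, target_prob, behavior_prob, gamma=1.0):
    """Estimate Q for the target policy from episodes generated by a behavior policy.

    episodes: list of episodes, each a list of (state, action, reward) tuples
    target_prob(s, a): pi(a | s), the target policy (assumed interface)
    behavior_prob(s, a): b(a | s), must be > 0 wherever pi(a | s) > 0
    """
    Q = defaultdict(float)  # action-value estimates
    C = defaultdict(float)  # cumulative importance weights per (state, action)
    for episode in episodes:
        G = 0.0  # return accumulated from the end of the episode
        W = 1.0  # product of importance ratios for the steps after the current one
        # Walking backward lets G and W be updated in O(1) per step.
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            key = (state, action)
            C[key] += W
            Q[key] += (W / C[key]) * (G - Q[key])
            # The ratio for this step only affects earlier steps, because
            # Q(s, a) is conditioned on the action taken at this step.
            W *= target_prob(state, action) / behavior_prob(state, action)
            if W == 0.0:
                break  # earlier steps would receive zero weight
    return Q


# Toy data: the target policy always picks "a", the behavior policy is uniform.
def target_prob(state, action):
    return 1.0 if action == "a" else 0.0


def behavior_prob(state, action):
    return 0.5


episodes = [
    [("s0", "a", 0.0), ("s1", "a", 2.0)],
    [("s0", "a", 0.0), ("s1", "b", 10.0)],
]
Q = off_policy_mc_q(episodes, target_prob, behavior_prob)
```

In the second toy episode the final action is one the target policy never takes, so the weight drops to zero and the loop stops before touching `("s0", "a")`. That pair therefore keeps the estimate it received from the first episode.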