Reinforcement Learning for Outfit Compatibility

Modeling the outfit compatibility problem as a Markov Decision Process (MDP), defining the state space, action space, and afterstate formulation for sequential item selection.


Problem Definition as an MDP #

The outfit compatibility problem is modeled as a finite Markov Decision Process (MDP). The goal is to train an agent that sequentially selects items to construct a high-quality outfit.

State Space $\mathcal{S}$ and Action Space $\mathcal{A}(s)$ #

A state $S_t$ is defined as the current incomplete outfit at time step $t$. It is represented as a fixed-length tuple of $L$ slots: the first $t$ slots hold the item IDs chosen so far, and the remaining slots are padded with $0$, the empty-slot marker.

$$S_{t} = (id_1, id_2, \dots, id_{t}, 0, \dots, 0)$$

Each non-zero element of the tuple is a unique item ID $id_i \in \{1, \dots, N\}$, where $N$ is the total number of items.

The action at time step $t$ selects a new item to add to the outfit. Given the state $S_{t-1}$, the action space $\mathcal{A}(S_{t-1})$ is the set of all item IDs not yet in the outfit, plus the stop action. Let $\mathcal{I} = \{1, 2, \dots, N\}$ be the set of all available item IDs. The action space at time step $t$ is defined as the complement of the item set of $S_{t-1}$ with respect to $\mathcal{I}$, together with $\text{stop}$:

$$\mathcal{A}(S_{t-1}) \doteq \{ a \in \mathcal{I} \mid a \notin \text{set}(S_{t-1}) \} \cup \{\text{stop}\} = (\mathcal{I} \setminus \text{set}(S_{t-1})) \cup \{\text{stop}\}$$

The action space can be dynamically restricted based on domain-specific rules. For instance, a rule like “a dress cannot be paired with pants” would shrink the set of valid actions at each time step, pruning the search space for the agent. For clarity, however, such rules are not included in the current simplified model.
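Under these definitions, the action space can be computed directly from the current state. A minimal sketch in Python (the `STOP` sentinel and the zero-padded tuple representation are assumptions of this sketch):

```python
STOP = "stop"  # sentinel for the stop action (assumed representation)

def action_space(state, N):
    """Valid actions for a zero-padded state tuple: all unused item IDs plus stop."""
    used = {i for i in state if i != 0}
    return (set(range(1, N + 1)) - used) | {STOP}

# Example: with N = 5 items and items 2 and 4 already chosen,
# the agent may pick 1, 3, or 5, or stop.
print(action_space((2, 4, 0), N=5))
```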

The number of non-terminal states is $|\mathcal{S}| = \sum_{k=0}^{L - 1}\binom{N}{k}$, where $\binom{N}{k}$ counts the combinations of $k$ items from a set of $N$. Including the single terminal state, the full state space has size $|\mathcal{S}^+| = |\mathcal{S}| + 1$: any successor state containing $L$ items is treated as terminal.
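The count can be checked numerically with `math.comb`; the values of $N$ and $L$ below are illustrative:

```python
from math import comb

def num_nonterminal_states(N, L):
    # States with 0 .. L-1 items chosen (order-free, so combinations)
    return sum(comb(N, k) for k in range(L))

N, L = 10, 3
print(num_nonterminal_states(N, L))  # 1 + 10 + 45 = 56
```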

Afterstate Formulation #

Because the environment is deterministic, each action leads to exactly one successor outfit, so state–action values can be learned as values of afterstates: the outfit obtained immediately after the chosen item is added. The size of the afterstate space is $|\mathcal{H}|=\sum_{k=1}^{L}\binom{N}{k}$, which is much smaller than a full Q-table of size $|\mathcal{Q}|=\sum_{k=0}^{L-1}(N-k)\cdot\binom{N}{k} = \sum_{k=1}^{L}k\binom{N}{k}$.
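The identity $\sum_{k=0}^{L-1}(N-k)\binom{N}{k} = \sum_{k=1}^{L}k\binom{N}{k}$ follows from $(N-k)\binom{N}{k} = (k+1)\binom{N}{k+1}$, and the savings can be checked numerically (a sketch; $N$ and $L$ are illustrative):

```python
from math import comb

def afterstate_size(N, L):
    # Outfits with 1 .. L items: each is one afterstate
    return sum(comb(N, k) for k in range(1, L + 1))

def q_table_size(N, L):
    # Each non-terminal state with k items has N - k item actions
    return sum((N - k) * comb(N, k) for k in range(L))

N, L = 10, 3
print(afterstate_size(N, L))  # 10 + 45 + 120 = 175
print(q_table_size(N, L))     # 460, matching sum_{k=1}^{L} k * C(N, k)
assert q_table_size(N, L) == sum(k * comb(N, k) for k in range(1, L + 1))
```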

Environment Dynamic Model #

The environment is deterministic.

  • If $A_t \in \{1, \dots, N\}$, let $S_t = (id_1, \dots, id_L)$. Find the smallest index $i$ such that $id_i = 0$; the next state $S_{t+1}$ is formed by replacing $id_i$ with $A_t$. If this fills the last empty slot, the episode terminates.
  • If $A_t = \text{stop}$, the episode terminates immediately, regardless of how many slots are filled.
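These deterministic dynamics can be sketched as a step function (the zero-padded tuple representation and `STOP` sentinel are assumptions carried over from the state definition):

```python
STOP = "stop"  # sentinel for the stop action (assumed representation)

def step(state, action):
    """Deterministic transition: returns (next_state, terminated)."""
    if action == STOP:
        # Stopping ends the episode immediately, regardless of filled slots.
        return state, True
    # Place the new item ID in the first empty (zero) slot.
    slots = list(state)
    slots[slots.index(0)] = action
    next_state = tuple(slots)
    # Terminal once every slot is filled.
    return next_state, all(s != 0 for s in slots)

print(step((2, 4, 0), 7))   # ((2, 4, 7), True)
print(step((2, 0, 0), 5))   # ((2, 5, 0), False)
```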

Reward Function $R(S_t, A_t, S_{t+1})$ #

The reward $R_t$ is sparse, given only at the end of an episode. The episode terminates when the agent chooses to stop or when the outfit is complete (all slots are filled). The reward is the maximum Jaccard similarity between the final outfit and any outfit in the set of known good outfits $\mathcal{G}$.

$$R_t = \begin{cases} \max_{G \in \mathcal{G}} \text{Jaccard}(C(S_{t+1}), G) & \text{if all slots are filled in } S_{t+1} \\ 1.2 \cdot \max_{G \in \mathcal{G}} \text{Jaccard}(C(S_{t}), G) & \text{else if } A_t = \text{stop} \\ 0 & \text{otherwise} \end{cases}$$

where $C(S_{t+1})$ is the set of non-zero item IDs in the final state $S_{t+1}$, and $\text{Jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|} \in [0, 1]$. To encourage the agent to terminate on its own, its reward is multiplied by a factor of $1.2$ when it chooses to stop.
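The reward can be sketched as follows (the candidate outfits in `good_outfits` are illustrative; `C` strips the zero padding as in the definition above):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets, in [0, 1]."""
    return len(a & b) / len(a | b)

def C(state):
    """Set of non-zero item IDs in a zero-padded state tuple."""
    return {i for i in state if i != 0}

def reward(state, action, next_state, good_outfits, stop_bonus=1.2):
    if action == "stop":
        # Stop bonus: reward is boosted when the agent terminates on its own.
        return stop_bonus * max(jaccard(C(state), g) for g in good_outfits)
    if all(s != 0 for s in next_state):  # outfit complete
        return max(jaccard(C(next_state), g) for g in good_outfits)
    return 0.0  # sparse: no intermediate reward

good = [{1, 2, 3}, {4, 5, 6}]
print(reward((1, 2, 0), 3, (1, 2, 3), good))       # 1.0: exact match with {1, 2, 3}
print(reward((1, 2, 0), "stop", (1, 2, 0), good))  # ≈ 0.8, i.e. 1.2 × 2/3
```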