Mastering Markov Decision Processes: A Comprehensive Guide
By CamelEdge
Updated on Sun Sep 08 2024
Introduction
A Markov Decision Process (MDP) is a mathematical framework used to model decision-making in situations where outcomes are partly random and partly under the control of an agent. It is a key concept in reinforcement learning, a subfield of machine learning concerned with training agents to make decisions by learning from experience.
Understanding Markov Decision Processes
An MDP differs from deterministic search algorithms like BFS, DFS, UCS, and A* by accounting for uncertainty in the environment. Consider a robot navigating a 3x4 grid. Starting at cell (1,1) (blue circle), the robot's goal is to reach cell (3,4). It can move up, down, left, or right. Cell (3,4) offers a reward of +1, while cell (2,4) carries a penalty of -1. Consequently, the robot will aim to avoid cell (2,4) and reach cell (3,4). Cells (3,4) and (2,4) are terminal states, meaning the robot cannot move once it reaches them. All other cells are non-terminal states.
If there is no uncertainty, the robot can use a deterministic search algorithm (BFS, DFS, UCS, A*) to find the shortest path to the goal as follows:
To account for uncertainty, we assume the robot has an 80% probability of moving in the intended direction and a 20% probability of moving in a perpendicular direction.
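To make these dynamics concrete, here is a minimal Python sketch of such a transition model. It rests on a few assumptions not spelled out above: the 20% perpendicular probability is split evenly (10% to each side), a move that would leave the grid keeps the robot in place, row 1 is the bottom row, and any interior obstacle shown in the figure is ignored.

```python
# A minimal sketch of the stochastic transition model described above.
ROWS, COLS = 3, 4                       # 3x4 grid, states are (row, col)
TERMINALS = {(3, 4), (2, 4)}            # +1 and -1 exit states

MOVES = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}
PERPENDICULAR = {"up": ("left", "right"), "down": ("left", "right"),
                 "left": ("up", "down"), "right": ("up", "down")}

def step(state, direction):
    """Deterministically apply one move, staying in place at the grid boundary."""
    r, c = state
    dr, dc = MOVES[direction]
    nr, nc = r + dr, c + dc
    return (nr, nc) if 1 <= nr <= ROWS and 1 <= nc <= COLS else state

def transition(state, action):
    """Return a list of (probability, next_state) pairs for taking `action`."""
    if state in TERMINALS:
        return [(1.0, state)]           # terminal states are absorbing
    side_a, side_b = PERPENDICULAR[action]
    return [(0.8, step(state, action)),
            (0.1, step(state, side_a)),
            (0.1, step(state, side_b))]

print(transition((1, 1), "up"))  # e.g. [(0.8, (2, 1)), (0.1, (1, 1)), (0.1, (1, 2))]
```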
The robot cannot follow a predetermined path, as such a path does not account for the uncertainty in its movement. Instead, the robot needs a strategy that tells it what to do no matter which state it finds itself in. This is where MDPs come in: they provide a mathematical framework for making decisions in uncertain environments. The solution is a policy that tells the robot what to do in each state. For example, the policy can be as follows, where the robot knows what action to take from each state.
The goal of an MDP is to find a policy ($\pi$) that maximizes the expected reward over the long term. The policy is a mapping from states to actions that the robot should take, i.e., $\pi: S \rightarrow A$. The policy is optimal if it maximizes the expected reward over the long term.
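In code, a policy for a small grid like this is just a lookup table from states to actions. The mapping below is purely illustrative (it need not match the policy drawn in the figure), but it shows the idea of $\pi(s) = a$:

```python
# A hypothetical policy for the 3x4 grid: a mapping from each non-terminal
# state (row, col) to an action, i.e. pi(s) = a. Terminal cells are omitted.
policy = {
    (1, 1): "up",    (1, 2): "left",  (1, 3): "left",  (1, 4): "left",
    (2, 1): "up",    (2, 2): "up",    (2, 3): "up",
    (3, 1): "right", (3, 2): "right", (3, 3): "right",
}

def pi(state):
    """Look up the action the policy prescribes in a given state."""
    return policy[state]

print(pi((1, 1)))  # "up"
```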
Mathematical Formulation
An MDP is defined by a 5-tuple $(S, A, P, R, \gamma)$, where:
- $S$ is a set of states.
- $A(s)$ is a set of actions available from state $s$.
- $P(s' \mid s, a)$ is the probability of reaching state $s'$ from state $s$ by taking action $a$.
- $R(s)$ is a reward function.
- $\gamma$ is a discount factor.
The solution is the optimal policy $\pi^*$ that maximizes the expected cumulative reward.
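As a sketch, the five components can be bundled into a single container in code. The names below are hypothetical (not part of any standard library); the point is simply that an MDP is fully specified by these five pieces:

```python
# One way to bundle the five components (S, A, P, R, gamma) of an MDP in code.
from dataclasses import dataclass
from typing import Callable, Hashable, Iterable, List, Tuple

State = Hashable
Action = str

@dataclass
class MDP:
    states: List[State]                                               # S
    actions: Callable[[State], Iterable[Action]]                      # A(s)
    transition: Callable[[State, Action], List[Tuple[float, State]]]  # P(s'|s,a)
    reward: Callable[[State], float]                                  # R(s)
    gamma: float                                                      # discount factor
```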
Evaluating a Policy
When the agent follows a policy $\pi$, it generates a trajectory of states: $s_0, s_1, s_2, \dots$
The expected cumulative reward, also known as the utility function $U$, is the sum of the discounted rewards obtained by the agent:
$$U(s_0, s_1, s_2, \dots) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \dots = \sum_{t=0}^{\infty} \gamma^t R(s_t)$$
For example, if the reward of all the non-terminal states is $-0.04$, the utility of the path (dotted line) is $5 \times (-0.04) + 1 = 0.8$ (assuming $\gamma = 1$).
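A quick sketch of that computation, assuming the dotted path visits five non-terminal states (each with reward $-0.04$) before reaching the $+1$ terminal state, with $\gamma = 1$:

```python
# Utility of the example path: sum of discounted rewards along the trajectory.
gamma = 1.0
path_rewards = [-0.04] * 5 + [1.0]          # R(s_0), ..., R(s_5)

utility = sum(gamma ** t * r for t, r in enumerate(path_rewards))
print(round(utility, 2))                    # 0.8
```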
The value of a policy, $V^\pi(s)$, is the expected utility obtained by following the policy $\pi$ from state $s$.
To evaluate a policy, we can start from a given state and follow the policy multiple times to compute the average utility. This process involves simulating the policy repeatedly to estimate the expected utility for each state. However, this approach can be computationally expensive and inefficient, especially for larger problems with many states.
To address this inefficiency, the Bellman equation is used, which provides a recursive decomposition of the value function. The Bellman equation allows us to compute the value of a policy more efficiently by breaking the problem down into smaller subproblems: the value of a state under a policy can be expressed in terms of the values of subsequent states.
The Bellman equation for the value of a policy is given by:
$$V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')$$
The Bellman equation establishes a direct relationship between the value of a state and the values of its neighboring states. This relationship is fundamental in the process of policy evaluation, where the value of a policy is iteratively computed. This iterative algorithm, known as policy evaluation, updates the value of each state based on the expected rewards and the values of subsequent states. Through repeated iterations, the algorithm converges to the true value function for the given policy, providing a more efficient and scalable method for policy evaluation in Markov Decision Processes (MDPs).
Algorithm: Policy Evaluation
Initialize $V_0^{\pi}(s) = 0$ for all states $s$
For $k = 1, 2, \dots$ until convergence:
    For each state $s$:
        $V_k^{\pi}(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V_{k-1}^{\pi}(s')$
Applying this algorithm yields the following values (with $\gamma = 1$ and a reward of $-0.04$ for the non-terminal states):
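Below is a minimal Python sketch of this policy-evaluation loop on the grid world, restating the grid helpers from the earlier sketch so it runs on its own. The same assumptions apply (0.8/0.1/0.1 transition split, $\gamma = 1$, living reward $-0.04$, no interior obstacle), and the policy being evaluated is an arbitrary example, not the one from the figure.

```python
# Iterative policy evaluation: V_k(s) = R(s) + gamma * sum_s' P(s'|s,pi(s)) V_{k-1}(s')
ROWS, COLS = 3, 4
TERMINALS = {(3, 4): 1.0, (2, 4): -1.0}   # terminal rewards
LIVING_REWARD, GAMMA = -0.04, 1.0
MOVES = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}
STATES = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)]

def step(s, d):
    """Move one cell in direction d, staying put at the grid boundary."""
    nr, nc = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    return (nr, nc) if 1 <= nr <= ROWS and 1 <= nc <= COLS else s

def transition(s, a):
    """P(s'|s,a): 80% intended direction, 10% for each perpendicular direction."""
    return [(0.8, step(s, a)), (0.1, step(s, PERP[a][0])), (0.1, step(s, PERP[a][1]))]

def reward(s):
    return TERMINALS.get(s, LIVING_REWARD)

def evaluate_policy(policy, iterations=100):
    """Repeatedly apply the Bellman backup for a fixed policy."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        V = {s: reward(s) + (0.0 if s in TERMINALS else
                             GAMMA * sum(p * V[t] for p, t in transition(s, policy[s])))
             for s in STATES}
    return V

# An arbitrary example policy: go "up" in the first column, "right" elsewhere.
example_policy = {s: ("up" if s[1] == 1 else "right")
                  for s in STATES if s not in TERMINALS}
print(round(evaluate_policy(example_policy)[(1, 1)], 3))
```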
Now, let's explore how changing the policy affects the value function. By altering the policy, we can observe how the expected utility of each state is impacted. This process involves recalculating the value function using the new policy and comparing it to the previous results.
By comparing the values, we can conclude that the first policy is superior as it yields a higher expected reward.
Finding the Optimal Policy
The previous algorithm helps us evaluate a given policy, but how do we find the optimal policy?
To determine the best policy, we need to identify the actions that maximize the expected utility in each state. This involves not just following a fixed policy but dynamically choosing the best possible action at each state based on the expected future rewards. The algorithm used for this purpose is called value iteration.
Value iteration is an iterative algorithm that computes the optimal policy by repeatedly updating the value of each state until the values converge to the optimal values. The process involves the following steps:
- Initialization: Start with an initial guess for the value of each state. This can be an arbitrary value, often initialized to zero for simplicity.
- Value Update: For each state, update its value by taking the highest expected utility over all possible actions, computed from the transition probabilities and the current value estimates of the resulting states. This is achieved using the Bellman optimality equation:
$$V(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a) \, V(s')$$
Here, $V(s)$ is the value of state $s$, $R(s)$ is the reward in state $s$, $\gamma$ is the discount factor, $P(s' \mid s, a)$ is the transition probability from state $s$ to state $s'$ given action $a$, and the summation is over all possible next states $s'$.
- Policy Extraction: Once the values have converged, derive the optimal policy by choosing, for each state, the action that maximizes the expected utility:
$$\pi^*(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a) \, V(s')$$
Here, $\pi^*(s)$ is the optimal action for state $s$.
Algorithm: Value Iteration
Initialize $V_0(s) = 0$ for all states $s$
For $k = 1, 2, \dots$ until convergence:
    For each state $s$:
        $V_k(s) = R(s) + \gamma \max_{a} \sum_{s'} P(s' \mid s, a) \, V_{k-1}(s')$
By following these steps, value iteration efficiently computes the optimal policy for a Markov Decision Process, ensuring that the chosen actions maximize the expected utility for each state.
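Here is a self-contained Python sketch of value iteration and policy extraction on the same grid, under the same assumptions as the earlier sketches (0.8/0.1/0.1 transition split, $\gamma = 1$, living reward $-0.04$, no interior obstacle). The helper names are illustrative.

```python
# Value iteration: V_k(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) V_{k-1}(s')
ROWS, COLS = 3, 4
TERMINALS = {(3, 4): 1.0, (2, 4): -1.0}   # terminal rewards
GAMMA = 1.0
ACTIONS = {"up": (1, 0), "down": (-1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}
STATES = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)]

def step(s, d):
    """Move one cell in direction d, staying put at the grid boundary."""
    nr, nc = s[0] + ACTIONS[d][0], s[1] + ACTIONS[d][1]
    return (nr, nc) if 1 <= nr <= ROWS and 1 <= nc <= COLS else s

def transition(s, a):
    """P(s'|s,a): 80% intended direction, 10% for each perpendicular direction."""
    return [(0.8, step(s, a)), (0.1, step(s, PERP[a][0])), (0.1, step(s, PERP[a][1]))]

def q_value(V, s, a):
    """Expected value of the next state when taking action a in state s."""
    return sum(p * V[t] for p, t in transition(s, a))

def value_iteration(living_reward=-0.04, iterations=100):
    """Repeatedly apply the Bellman optimality backup until (near) convergence."""
    reward = lambda s: TERMINALS.get(s, living_reward)
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        V = {s: reward(s) + (0.0 if s in TERMINALS else
                             GAMMA * max(q_value(V, s, a) for a in ACTIONS))
             for s in STATES}
    return V

def extract_policy(V):
    """pi*(s) = argmax_a sum_s' P(s'|s,a) V(s')."""
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a))
            for s in STATES if s not in TERMINALS}

V_star = value_iteration()
pi_star = extract_policy(V_star)
for s in sorted(STATES):
    print(s, round(V_star[s], 3), pi_star.get(s, "exit"))
```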
By applying the value iteration algorithm, we can determine the optimal policy for our scenario. Using $\gamma = 1$ and assigning a reward of $-0.04$ to the non-terminal states, we observe that the policy in the first row opts for a longer path to reach the +1 state. This detour is intentional to avoid the cell with a reward of -1. Consequently, this strategy yields a higher value compared to the previous policy, establishing it as the optimal policy.
Effect of Discount Factor on Optimal Policy
The discount factor, denoted as $\gamma$, plays a crucial role in Markov Decision Processes by determining the importance of future rewards. It is a value between 0 and 1, where a higher discount factor places more emphasis on future rewards, while a lower discount factor prioritizes immediate rewards.
High Discount Factor ($\gamma$ close to 1):
- Long-term Planning: A high discount factor encourages the agent to consider long-term rewards, making it more patient and willing to wait for larger rewards in the future.
- Stability: It often leads to more stable policies as the agent values future states more consistently.
Low Discount Factor ($\gamma$ close to 0):
- Short-term Focus: A low discount factor makes the agent more myopic, focusing on immediate rewards rather than future benefits.
- Quick Gains: It can lead to policies that prioritize quick gains, even if they are smaller.
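A quick numeric illustration of this effect: the present value of a +1 reward received $k$ steps in the future is $\gamma^k$, which shrinks much faster for small $\gamma$.

```python
# How much a +1 reward k steps away is worth today, for two discount factors.
for gamma in (0.95, 0.5):
    print(gamma, [round(gamma ** k, 3) for k in (1, 5, 10, 20)])
# gamma = 0.95 -> [0.95, 0.774, 0.599, 0.358]  (distant rewards still matter)
# gamma = 0.5  -> [0.5, 0.031, 0.001, 0.0]     (distant rewards vanish quickly)
```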
Effect of Reward on Optimal Policy
The reward structure in a Markov Decision Process (MDP) significantly influences the optimal policy. Rewards are the immediate returns received after transitioning from one state to another due to an action. The way these rewards are assigned can drastically change the behavior of the agent. The figure below illustrates the optimal policy for four different values of the reward $R(s)$ of the non-terminal states. When the reward is strongly negative ($R(s) = -1.6$), the situation is so dire that the agent heads straight for the nearest exit, even if that exit has a reward of $-1$. When the environment is quite unpleasant ($R(s) = -0.3$), the agent takes the shortest route to the $+1$ state, risking a fall into the $-1$ state by accident, in particular by taking the shortcut past it. When life is only slightly dreary ($R(s) = -0.01$), the optimal policy avoids risks entirely: in the states adjacent to the $-1$ state, the agent moves directly away from it to avoid accidental falls, even if it means hitting obstacles repeatedly. Finally, if the reward is positive ($R(s) = 0.1$), the environment is positively enjoyable, and the agent avoids both exits.
Reward: -1.6
Reward: -0.3
Reward: -0.01
Reward: 0.1
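As a usage sketch, the `value_iteration` and `extract_policy` helpers from the value-iteration snippet above can be rerun with each of these living rewards. Because that sketch omits any interior obstacle from the figure and keeps $\gamma = 1$, the resulting policies may differ in detail from the ones shown, but it demonstrates how the reward setting is swept.

```python
# Rerun value iteration for each living reward from the figure and compare the
# greedy policies (reuses value_iteration / extract_policy defined earlier).
for r in (-1.6, -0.3, -0.01, 0.1):
    pi = extract_policy(value_iteration(living_reward=r))
    print(f"living reward {r:>5}: {pi}")
```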
Conclusion
In conclusion, Markov Decision Processes (MDPs) provide a robust framework for modeling decision-making in environments with inherent uncertainty. By understanding the effects of the discount factor and reward structure on the optimal policy, we can better design agents that make informed decisions over time. Whether the focus is on long-term planning or immediate gains, MDPs offer the flexibility to tailor strategies according to specific goals and constraints. The insights gained from studying MDPs are invaluable in various fields, including robotics, economics, and artificial intelligence.
FAQs
Q1: What is a Markov Decision Process (MDP)? A Markov Decision Process (MDP) is a mathematical framework used to model decision-making in situations where outcomes are partly random and partly under the control of an agent. It is widely used in reinforcement learning to train agents to make decisions by learning from experience.
Q2: What is the discount factor in MDPs? The discount factor, denoted as $\gamma$, determines the importance of future rewards in an MDP. It is a value between 0 and 1, where a higher discount factor places more emphasis on future rewards, while a lower discount factor prioritizes immediate rewards.
Q3: How does the reward structure affect the optimal policy in an MDP? The reward structure in an MDP significantly influences the optimal policy. Rewards are the immediate returns received after transitioning from one state to another due to an action. The way these rewards are assigned can drastically change the behavior of the agent, affecting whether it focuses on long-term gains or immediate rewards.
Q4: What is the role of the discount factor in long-term planning? A high discount factor (close to 1) encourages the agent to consider long-term rewards, making it more patient and willing to wait for larger rewards in the future. This often leads to more stable policies as the agent values future states more consistently.
Q5: Can MDPs be used for short-term decision-making? Yes, a low discount factor (close to 0) makes the agent more myopic, focusing on immediate rewards rather than future benefits. This can lead to policies that prioritize quick gains, even if they are smaller.
Q6: Why are MDPs important in reinforcement learning? MDPs provide a structured way to model environments with uncertainty, allowing agents to learn optimal policies through experience. This is fundamental in reinforcement learning, where the goal is to train agents to make decisions that maximize cumulative rewards over time.