
Mastering Markov Decision Processes: A Comprehensive Guide


    By CamelEdge

    Updated on Sun Sep 08 2024


    A Markov Decision Process (MDP) is a mathematical framework used to model decision-making in situations where outcomes are partly random and partly under the control of an agent. It is a key concept in reinforcement learning, a subfield of machine learning concerned with training agents to make decisions by learning from experience.

    Understanding Markov Decision Processes

    An MDP differs from deterministic search algorithms like BFS, DFS, UCS, and A* by considering the uncertainty of the environment. Consider a robot navigating a 3x4 grid. Starting at cell (1,1) (blue circle), the robot's goal is to reach cell (3,4). It can move up ($\uparrow$), down ($\downarrow$), left ($\leftarrow$), or right ($\rightarrow$). Cell (3,4) offers a reward of +1, while cell (2,4) has a penalty of -1. Consequently, the robot will aim to avoid cell (2,4) and reach cell (3,4). Cells (3,4) and (2,4) are terminal states, which means the robot cannot move from there. All other cells are non-terminal states.

    grid world with +1 and -1 terminal states

    If there is no uncertainty, the robot can use a deterministic search algorithm (BFS, DFS, UCS, A*) to find the shortest path to the goal as follows:

    grid world with the deterministic path to the +1 state

    To account for uncertainty, we assume the robot has an 80% probability of moving in the intended direction and a 20% probability of moving in a perpendicular direction.

    transition model: 0.8 intended direction, 0.1 for each perpendicular direction

    The robot cannot follow a predetermined path, as such a path does not account for the uncertainty in its movement. In such a case, the robot must have a strategy to decide its next action no matter what state it is in. This is where MDPs come in: an MDP provides a mathematical framework for making decisions in uncertain environments. The solution is to find a policy that tells the robot what to do in each state. For example, the policy can be as follows, where the robot knows what action to take from each state.

    policy visualization

    The goal of an MDP is to find a policy ($\pi$) that maximizes the expected reward over the long term. The policy is a mapping from states to the actions the robot should take, i.e., $\pi(s) = a$. The policy is optimal if it maximizes the expected reward over the long term.

    Mathematical Formulation

    An MDP is a 5-tuple $(S, A, T, R, \gamma)$, where:

    • $S$ is the set of states.
    • $A(s)$ is the set of actions available in state $s$.
    • $T(s, a, s')$ is the probability of reaching state $s'$ from state $s$ by taking action $a$.
    • $R(s, a, s')$ is the reward function.
    • $\gamma$ is the discount factor.

    The solution is the optimal policy $\pi^*(s)$ that maximizes the expected cumulative reward.
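    To make the grid-world example concrete, here is a minimal Python sketch of this 5-tuple. The obstacle position (2,2), the "blocked moves bounce back" rule, and rows increasing upward are assumptions not stated explicitly above; the reward is written as the state-based $R(s)$ used in the rest of this article.

```python
# Minimal sketch of the 3x4 grid-world MDP (assumptions noted in the comments).

GAMMA = 1.0            # discount factor
LIVING_REWARD = -0.04  # reward of every non-terminal state

ROWS, COLS = 3, 4
WALL = {(2, 2)}                          # assumed obstacle position
TERMINALS = {(3, 4): 1.0, (2, 4): -1.0}  # terminal states and their rewards
STATES = [(r, c) for r in range(1, ROWS + 1) for c in range(1, COLS + 1)
          if (r, c) not in WALL]

# Rows are assumed to increase upward, so 'up' adds 1 to the row index.
ACTIONS = {'up': (1, 0), 'down': (-1, 0), 'left': (0, -1), 'right': (0, 1)}
PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                 'left': ('up', 'down'), 'right': ('up', 'down')}

def move(s, a):
    """Deterministic effect of action a in state s; blocked moves leave s unchanged."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 1 <= r <= ROWS and 1 <= c <= COLS and (r, c) not in WALL else s

def T(s, a):
    """Transition model as (next_state, probability) pairs:
    0.8 for the intended direction, 0.1 for each perpendicular direction."""
    if s in TERMINALS:
        return []                        # no moves out of a terminal state
    side1, side2 = PERPENDICULAR[a]
    return [(move(s, a), 0.8), (move(s, side1), 0.1), (move(s, side2), 0.1)]

def R(s):
    """State reward: +1 / -1 at the terminals, the living reward everywhere else."""
    return TERMINALS.get(s, LIVING_REWARD)
```

    The later snippets in this article build on these definitions.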

    Evaluating a Policy

    When the agent follows a policy $\pi$, it generates a trajectory of states and actions:

    $s_0, a_0, s_1, a_1, s_2, a_2, s_3, \ldots$

    The expected cumulative reward, also known as the utility function, is the sum of the discounted rewards obtained by the agent:

    $U_{\pi}(s) = R(s_0) + \gamma R(s_1) + \gamma^{2} R(s_2) + \ldots$

    For example, if the reward of every non-terminal state is $-0.04$ and $\gamma = 1$, the utility of the path shown (dotted line) is $0.8$: the path passes through five non-terminal states at $-0.04$ each before collecting the terminal reward of $+1$, giving $1 - 5 \times 0.04 = 0.8$.
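    As a quick sanity check of that number, here is a small helper (hypothetical, not part of the article's figures) that sums a discounted reward sequence:

```python
def utility(rewards, gamma=1.0):
    """Discounted return of a reward sequence: R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Five non-terminal states at -0.04 each, then the +1 terminal reward:
print(utility([-0.04] * 5 + [1.0]))   # ~0.8 with gamma = 1 (up to floating-point rounding)
```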

    policy visualization

    The value of a policy is the expected utility obtained by following the policy from each state.

    $V_{\pi}(s) = E_{\pi}[U_{\pi}(s)]$

    To evaluate a policy, we can start from a given state and follow the policy multiple times to compute the average utility. This process involves simulating the policy repeatedly to estimate the expected utility for each state. However, this approach can be computationally expensive and inefficient, especially for larger problems with many states.
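    A minimal Monte Carlo sketch of this simulation-based approach, reusing T, R, STATES, TERMINALS and GAMMA from the grid-world snippet above (the episode count and rollout horizon are arbitrary choices):

```python
import random

def sample_next(s, a):
    """Sample a successor state according to the transition model T(s, a)."""
    next_states, probs = zip(*T(s, a))
    return random.choices(next_states, probs)[0]

def monte_carlo_value(policy, s, episodes=1000, horizon=100, gamma=GAMMA):
    """Estimate V_pi(s) by averaging the discounted returns of many rollouts."""
    total = 0.0
    for _ in range(episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):
            ret += discount * R(state)
            if state in TERMINALS:            # episode ends at a terminal state
                break
            discount *= gamma
            state = sample_next(state, policy[state])
        total += ret
    return total / episodes

# Example with a hypothetical policy that always tries to move up:
always_up = {s: 'up' for s in STATES if s not in TERMINALS}
print(monte_carlo_value(always_up, (1, 1)))
```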

    To address this inefficiency, the Bellman equation is used, which provides a recursive decomposition of the value function. The Bellman equation allows us to compute the value of a policy more efficiently by breaking down the problem into smaller subproblems. The value of a state under a policy can be expressed in terms of the value of subsequent states.

    The Bellman equation for the value of a policy $\pi$ is given by:

    $V_{\pi}(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') V_{\pi}(s')$

    The Bellman Equation establishes a direct relationship between the value of a state and the value of its neighboring states. This relationship is fundamental in the process of policy evaluation, where the value of a policy is iteratively computed. This iterative algorithm, known as Policy Evaluation, updates the value of each state based on the expected rewards and the values of subsequent states. Through repeated iterations, the algorithm converges to the true value function for the given policy, providing a more efficient and scalable method for policy evaluation in Markov Decision Processes (MDPs).

    Algorithm: Policy Evaluation

    Initialize $V_\pi^0(s) = 0$ for all states $s$
    For $t = 1:N$
        For each state $s$:

            $V_\pi^t(s) := R(s) + \gamma \sum_{s'} T(s, \pi(s), s') V_\pi^{t-1}(s')$
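    A direct translation of this algorithm into Python might look as follows, reusing STATES, TERMINALS, T, R and GAMMA from the grid-world sketch above (a fixed iteration count is used instead of a convergence test):

```python
def policy_evaluation(policy, iterations=100, gamma=GAMMA):
    """Iteratively apply V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') * V(s')."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        V_new = {}
        for s in STATES:
            if s in TERMINALS:
                V_new[s] = R(s)   # terminal states have no successors
            else:
                V_new[s] = R(s) + gamma * sum(p * V[s2] for s2, p in T(s, policy[s]))
        V = V_new
    return V

# Example with a hypothetical policy that always tries to move right:
always_right = {s: 'right' for s in STATES if s not in TERMINALS}
values = policy_evaluation(always_right)
```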

    Applying this algorithm yields the following values ($\gamma = 1$ and the reward of non-terminal states is $-0.04$):

    policy evaluation

    Now, let's explore how changing the policy affects the value function. By altering the policy, we can observe how the expected utility of each state is impacted. This process involves recalculating the value function using the new policy and comparing it to the previous results.

    policy evaluation

    By comparing the values, we can conclude that the first policy is superior as it yields a higher expected reward.

    Finding the Optimal Policy

    The previous algorithm helps us evaluate a given policy, but how do we find the optimal policy?

    To determine the best policy, we need to identify the actions that maximize the expected utility for each state. This involves not just following a fixed policy but dynamically choosing the best possible action at each state based on the expected future rewards. The algorithm used for this purpose is called value iteration.

    Value iteration is an iterative algorithm that computes the optimal policy by repeatedly updating the value of each state until the values converge to the optimal values. The process involves the following steps:

    1. Initialization: Start with an initial guess for the value of each state. This can be an arbitrary value, often initialized to zero for simplicity.

    2. Value Update: For each state, update its value by taking the highest expected utility over all possible actions, computed from the current value estimates of the resulting states. This is achieved using the Bellman optimality equation:

      $V(s) = R(s) + \max_{a \in A(s)} \left[ \gamma \sum_{s'} T(s, a, s') V(s') \right]$

      Here, $V(s)$ is the value of state $s$, $R(s)$ is the reward in state $s$, $\gamma$ is the discount factor, $T(s, a, s')$ is the transition probability from state $s$ to state $s'$ given action $a$, and the summation is over all possible next states $s'$.

    3. Policy Extraction: Once the values have converged, derive the optimal policy by choosing the action that maximizes the expected utility for each state:

      $\pi^*(s) = \arg\max_{a \in A(s)} \left[ \gamma \sum_{s'} T(s, a, s') V(s') \right]$

      Here, $\pi^*(s)$ is the optimal policy for state $s$.

    Algorithm: Value Iteration

    Initialize $V^0(s) = 0$ for all states $s$
    For $t = 1:N$
        For each state $s$:

            $V^t(s) := R(s) + \max_{a \in A(s)} \left[ \gamma \sum_{s'} T(s, a, s') V^{t-1}(s') \right]$
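    As a sketch, the same update in Python, together with the policy-extraction step, reusing STATES, ACTIONS, TERMINALS, T, R and GAMMA from the earlier snippets:

```python
def value_iteration(iterations=100, gamma=GAMMA):
    """Bellman optimality update: V(s) <- R(s) + max_a gamma * sum_s' T(s, a, s') * V(s')."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iterations):
        V_new = {}
        for s in STATES:
            if s in TERMINALS:
                V_new[s] = R(s)
            else:
                V_new[s] = R(s) + gamma * max(
                    sum(p * V[s2] for s2, p in T(s, a)) for a in ACTIONS)
        V = V_new
    return V

def extract_policy(V):
    """Greedy policy extraction: argmax_a sum_s' T(s, a, s') * V(s')
    (the discount factor does not change the argmax, so it is omitted here)."""
    return {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for s2, p in T(s, a)))
            for s in STATES if s not in TERMINALS}

V_opt = value_iteration()
pi_opt = extract_policy(V_opt)
```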

    By following these steps, value iteration efficiently computes the optimal policy for a Markov Decision Process, ensuring that the chosen actions maximize the expected utility for each state.

    By applying the value iteration algorithm, we can determine the optimal policy for our scenario. Using $\gamma = 1$ and assigning a reward of $-0.04$ to non-terminal states, we observe that the policy in the first row opts for a longer path to reach the +1 state. This detour is intentional to avoid the cell with a reward of -1. Consequently, this strategy yields a higher value compared to the previous policy, establishing it as the optimal policy.

    value iteration

    Effect of Discount Factor on Optimal Policy

    The discount factor, denoted as $\gamma$, plays a crucial role in Markov Decision Processes by determining the importance of future rewards. It is a value between 0 and 1, where a higher discount factor places more emphasis on future rewards, while a lower discount factor prioritizes immediate rewards. A short experiment sketch follows the two lists below.

    High Discount Factor ($\gamma \approx 1$):

    • Long-term Planning: A high discount factor encourages the agent to consider long-term rewards, making it more patient and willing to wait for larger rewards in the future.
    • Stability: It often leads to more stable policies as the agent values future states more consistently.

    Low Discount Factor ($\gamma \approx 0$):

    • Short-term Focus: A low discount factor makes the agent more myopic, focusing on immediate rewards rather than future benefits.
    • Quick Gains: It can lead to policies that prioritize quick gains, even if they are smaller.
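
    Using the value-iteration sketch above, the effect of the discount factor can be explored directly; the particular $\gamma$ values below are arbitrary:

```python
# Rerun value iteration with different discount factors and compare the greedy policies.
for gamma in (0.99, 0.5, 0.1):
    V = value_iteration(gamma=gamma)
    print(gamma, extract_policy(V))
```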

    Effect of Reward on Optimal Policy

    The reward structure $R(s)$ in a Markov Decision Process (MDP) significantly influences the optimal policy. Rewards are the immediate returns received after transitioning from one state to another due to an action. The way these rewards are assigned can drastically change the behavior of the agent. The figure below illustrates optimal policies for four different ranges of $R(s)$ for the non-terminal states. When $R(s) \leq -1.6284$, the situation is so dire that the agent heads straight for the nearest exit, even if that exit has a reward of $-1$. When $-0.4278 \leq R(s) \leq -0.0850$, the environment is quite unpleasant; the agent takes the shortest route to the $+1$ state, risking falling into the $-1$ state by accident, particularly by taking the shortcut from $(2,3)$. When life is only slightly dreary ($-0.0221 < R(s) < 0$), the optimal policy avoids risks entirely. In states $(2,3)$ and $(1,4)$, the agent moves directly away from the $-1$ state to avoid accidental falls, even if it means hitting obstacles repeatedly. Finally, if $R(s) > 0$, the environment is positively enjoyable, and the agent avoids both exits. A short sketch for reproducing these regimes follows the figures below.

    value iteration with reward -1.6

    Reward: -1.6

    value iteration with reward -0.3

    Reward: -0.3

    value iteration with reward -0.01

    Reward: -0.01

    value iteration with reward 0.1

    Reward: 0.1
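
    These four regimes can be reproduced approximately with the earlier sketches by rerunning value iteration under different living rewards; rebinding the module-level LIVING_REWARD constant is just a shortcut for this sketch:

```python
# Sweep the living reward of non-terminal states and compare the greedy policies.
# (With gamma = 1 and a positive living reward the values keep growing, but the
# extracted policy still avoids both exits, as described above.)
for LIVING_REWARD in (-1.6, -0.3, -0.01, 0.1):   # R(s) reads this module-level constant
    print(LIVING_REWARD, extract_policy(value_iteration()))
```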

    Conclusion

    In conclusion, Markov Decision Processes (MDPs) provide a robust framework for modeling decision-making in environments with inherent uncertainty. By understanding the effects of the discount factor and reward structure on the optimal policy, we can better design agents that make informed decisions over time. Whether the focus is on long-term planning or immediate gains, MDPs offer the flexibility to tailor strategies according to specific goals and constraints. The insights gained from studying MDPs are invaluable in various fields, including robotics, economics, and artificial intelligence.

    FAQs

    Q1: What is a Markov Decision Process (MDP)? A Markov Decision Process (MDP) is a mathematical framework used to model decision-making in situations where outcomes are partly random and partly under the control of an agent. It is widely used in reinforcement learning to train agents to make decisions by learning from experience.

    Q2: What is the discount factor in MDPs? The discount factor, denoted as $\gamma$, determines the importance of future rewards in an MDP. It is a value between 0 and 1, where a higher discount factor places more emphasis on future rewards, while a lower discount factor prioritizes immediate rewards.

    Q3: How does the reward structure affect the optimal policy in an MDP? The reward structure $R(s)$ in an MDP significantly influences the optimal policy. Rewards are the immediate returns received after transitioning from one state to another due to an action. The way these rewards are assigned can drastically change the behavior of the agent, affecting whether it focuses on long-term gains or immediate rewards.

    Q4: What is the role of the discount factor in long-term planning? A high discount factor ($\gamma \approx 1$) encourages the agent to consider long-term rewards, making it more patient and willing to wait for larger rewards in the future. This often leads to more stable policies as the agent values future states more consistently.

    Q5: Can MDPs be used for short-term decision-making? Yes, a low discount factor ($\gamma \approx 0$) makes the agent more myopic, focusing on immediate rewards rather than future benefits. This can lead to policies that prioritize quick gains, even if they are smaller.

    Q6: Why are MDPs important in reinforcement learning? MDPs provide a structured way to model environments with uncertainty, allowing agents to learn optimal policies through experience. This is fundamental in reinforcement learning, where the goal is to train agents to make decisions that maximize cumulative rewards over time.