Deep Learning PyTorch Course: Markov Decision Process

The Markov Decision Process (MDP) is a key mathematical framework underlying reinforcement learning. An MDP models how an agent chooses optimal actions in a given environment. In this post, we will delve into the concept of an MDP and how to implement one using PyTorch.

1. Overview of Markov Decision Process (MDP)

An MDP consists of the following components:

  • State space (S): A set of all possible states the agent can be in.
  • Action space (A): A set of all possible actions the agent can take in a specific state.
  • Transition probabilities (P): Defines the probability of transitioning to the next state based on the current state and action.
  • Reward function (R): The reward given when the agent takes a specific action in a specific state.
  • Discount factor (γ): A value in [0, 1) that discounts future rewards relative to immediate ones, reflecting the assumption that a reward received later is worth less than the same reward received now.

2. Mathematical Modeling of MDP

An MDP is mathematically defined by its state space, action space, transition probabilities, reward function, and discount factor, and is written as the tuple:

  • MDP = (S, A, P, R, γ).
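To make the tuple concrete, here is a minimal sketch of a tiny two-state MDP encoded as plain Python data structures (the states, actions, probabilities, and rewards are all made up for illustration):

states = ["s0", "s1"]        # S: state space
actions = ["stay", "move"]   # A: action space

# P[(s, a)] maps each next state s' to its transition probability P(s' | s, a).
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.9, "s1": 0.1},
}

# R[(s, a)]: expected reward for taking action a in state s.
R = {
    ("s0", "stay"): 0.0, ("s0", "move"): 1.0,
    ("s1", "stay"): 0.0, ("s1", "move"): 0.0,
}

gamma = 0.99  # discount factor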

Now, let’s explain each component in more detail:

State Space (S)

The state space is the set of all states the agent can be in. For example, in a game of Go, the state space could consist of all possible board configurations.

Action Space (A)

The action space includes all actions the agent can take based on its state. For instance, in a Go game, the agent can place a stone at a specific position.

Transition Probabilities (P)

Transition probabilities describe the dynamics of the environment: given the current state and the chosen action, they specify how likely each possible outcome is. In its most general form, this is written as the joint probability of the next state and the reward:

P(s', r | s, a)

Here, s' is the next state, r is the reward, s is the current state, and a is the chosen action. The probability of simply landing in s', written P(s' | s, a), is obtained by summing P(s', r | s, a) over all possible rewards r.
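As a concrete (made-up) illustration, the outcome distribution for a single (s, a) pair can be stored as a dictionary and then marginalized or sampled:

import random

# Joint distribution P(s', r | s, a) for one (s, a) pair; values are illustrative.
# Keys are (next_state, reward) outcomes, values are probabilities.
outcomes = {
    ("s1", 1.0): 0.9,  # the move usually succeeds and earns a reward
    ("s0", 0.0): 0.1,  # occasionally it fails and the agent stays put
}

# Marginal transition probability P(s1 | s, a), summed over all rewards:
p_s1 = sum(p for (s_next, r), p in outcomes.items() if s_next == "s1")
print(p_s1)  # 0.9

# Sample one (next_state, reward) outcome according to P(s', r | s, a):
pairs, probs = zip(*outcomes.items())
next_state, reward = random.choices(pairs, weights=probs, k=1)[0]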

Reward Function (R)

The reward function represents the reward given when the agent takes a specific action in a specific state. Rewards are a critical factor defining the agent’s goals.

Discount Factor (γ)

The discount factor γ (0 ≤ γ < 1) reflects the impact of future rewards on the present value. The closer γ is to 0, the more the agent focuses on immediate rewards, and the closer it is to 1, the more the agent focuses on long-term rewards.
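For intuition, under discounting a reward sequence r0, r1, r2, … is worth r0 + γ·r1 + γ²·r2 + …. A quick numeric sketch (the values are illustrative):

# Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for a short episode.
rewards = [0.0, 0.0, 1.0]  # the agent earns +1 two steps from now
gamma = 0.9

G = sum(gamma ** t * r for t, r in enumerate(rewards))
print(G)  # 0.81 -- the same +1 is worth less because it arrives later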

3. Examples of MDP

Now that we understand the concept of an MDP, let’s see how to apply it to reinforcement learning problems through examples. Next, we will train a reinforcement learning agent on a simple MDP.

3.1 Simple Grid World Example

The grid world models a world composed of a 4×4 grid. The agent occupies one grid cell at a time and can move through specific actions (up, down, left, right). The agent’s goal is to reach the bottom-right cell (the goal position).

Definition of States and Actions

In this grid world:

  • State: Represented by numbers from 0 to 15, one per grid cell (4×4 grid); the numbering scheme is sketched after this list
  • Actions: Up (0), Down (1), Left (2), Right (3)
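Concretely, the cell in row x and column y (both zero-based) gets state number 4·x + y; this is the same mapping the GridWorld implementation below recovers with divmod:

grid_size = 4

# Map (row, column) coordinates to a state number and back.
def to_state(x, y):
    return x * grid_size + y

def to_coords(state):
    return divmod(state, grid_size)  # (row, column)

print(to_state(0, 0))  # 0  (top-left, the start)
print(to_state(3, 3))  # 15 (bottom-right, the goal)
print(to_coords(5))    # (1, 1)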

Definition of Rewards

The agent receives a reward of +1 for reaching the goal state and 0 for any other state.

4. Implementing MDP with PyTorch

Now let’s implement the reinforcement learning agent using PyTorch. We will primarily use the Q-learning algorithm.
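For reference, the rule the code below approximates with a neural network is the standard Q-learning update, where α is the learning rate:

Q(s, a) ← Q(s, a) + α · [r + γ · max_a' Q(s', a') − Q(s, a)]

In tabular form this is a one-line update; here is a minimal sketch for our 16-state, 4-action grid world (α and the surrounding episode loop are illustrative details, not part of the example that follows):

import numpy as np

# Tabular Q-learning update for a 16-state, 4-action grid world (illustrative).
Q = np.zeros((16, 4))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])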

4.1 Environment Initialization

First, let’s define a class for creating the grid world:

import numpy as np

class GridWorld:
    def __init__(self, grid_size=4):
        self.grid_size = grid_size
        self.state = 0  # agent starts in the top-left cell
        self.goal_state = grid_size * grid_size - 1  # bottom-right cell
        self.actions = [0, 1, 2, 3]  # Up, Down, Left, Right
        self.rewards = np.zeros((grid_size * grid_size,))
        self.rewards[self.goal_state] = 1  # +1 for reaching the goal, 0 elsewhere

    def reset(self):
        self.state = 0  # back to the starting state
        return self.state

    def step(self, action):
        # Decode the state number into (row, column) coordinates.
        x, y = divmod(self.state, self.grid_size)
        # Apply the action; a move that would leave the grid leaves the position unchanged.
        if action == 0 and x > 0:   # Up
            x -= 1
        elif action == 1 and x < self.grid_size - 1:  # Down
            x += 1
        elif action == 2 and y > 0:  # Left
            y -= 1
        elif action == 3 and y < self.grid_size - 1:  # Right
            y += 1
        self.state = x * self.grid_size + y
        return self.state, self.rewards[self.state]
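As a quick sanity check of the environment (illustrative only), alternating Down (1) and Right (3) should walk the agent along the diagonal from state 0 to the goal at state 15:

env = GridWorld()
state = env.reset()

for action in [1, 3, 1, 3, 1, 3]:
    state, reward = env.step(action)
    print(state, reward)  # ends at state 15 with reward 1.0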

4.2 Implementing the Q-learning Algorithm

We will train the agent using Q-learning, approximating the Q-function with a small neural network (a simple deep Q-learning setup). Here is the code:

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

def train_agent(episodes, max_steps, gamma=0.99, epsilon=0.1):
    env = GridWorld()
    state_size = env.grid_size * env.grid_size
    action_size = len(env.actions)

    q_network = QNetwork(state_size, action_size)
    optimizer = optim.Adam(q_network.parameters(), lr=0.001)
    criterion = nn.MSELoss()

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            # One-hot encode the state number as the network input.
            state_tensor = torch.eye(state_size)[state]
            q_values = q_network(state_tensor)

            # Epsilon-greedy policy: explore with probability epsilon, otherwise exploit.
            if np.random.rand() < epsilon:
                action = int(np.random.randint(action_size))
            else:
                action = int(torch.argmax(q_values).item())
            next_state, reward = env.step(action)
            total_reward += reward

            # Q-learning target: r + gamma * max_a' Q(s', a'),
            # with no bootstrapping from the terminal (goal) state.
            next_state_tensor = torch.eye(state_size)[next_state]
            with torch.no_grad():
                max_next_q = torch.max(q_network(next_state_tensor))
            target = torch.tensor(float(reward))
            if next_state != env.goal_state:
                target = target + gamma * max_next_q
            loss = criterion(q_values[action], target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if next_state == env.goal_state:
                break

            state = next_state
        print(f"Episode {episode+1}: Total Reward = {total_reward}")

5. Conclusion

In this post, we explored the concept of the Markov Decision Process (MDP) and implemented a simple grid-world environment and Q-learning agent using PyTorch. The MDP is the foundational framework of reinforcement learning, and understanding it is essential for solving real reinforcement learning problems. I hope you gain deeper insights into MDPs and reinforcement learning through practice.

Additionally, I encourage you to explore more complex MDP problems and learning algorithms. Using tools like PyTorch, try implementing various environments, training agents, and building your own reinforcement learning models.

I hope this post was helpful. If you have any questions, please leave a comment!