A Markov Decision Process (MDP) is the mathematical framework that underlies reinforcement learning. It models how an agent chooses optimal actions in an environment whose dynamics depend only on the current state and action. In this post, we will delve into the concept of MDP and how to implement it using PyTorch.
1. Overview of Markov Decision Process (MDP)
MDP consists of the following components:
- State space (S): A set of all possible states the agent can be in.
- Action space (A): A set of all possible actions the agent can take in a specific state.
- Transition probabilities (P): Defines the probability of transitioning to the next state based on the current state and action.
- Reward function (R): The reward given when the agent takes a specific action in a specific state.
- Discount factor (γ): A value that weights future rewards relative to immediate ones, so that a reward received later counts for less than the same reward received now.
2. Mathematical Modeling of MDP
MDP is mathematically defined using the state space, action space, transition probabilities, reward function, and discount factor. MDP can be expressed as:
- MDP = (S, A, P, R, γ).
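As a concrete, purely illustrative example, these five components can be written out directly as Python data. The sketch below describes a hypothetical two-state MDP; the state names, actions, probabilities, and rewards are invented for illustration only:

# A minimal, hypothetical two-state MDP written out as plain Python data.
# All names and numbers here are illustrative, not part of the grid world used later.
S = ["s0", "s1"]           # state space
A = ["stay", "move"]       # action space

# P[(s, a)] maps each possible next state s' to its transition probability
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 1.0},
}

# R[(s, a)] is the expected immediate reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 0.0,
    ("s1", "move"): 0.0,
}

gamma = 0.9  # discount factor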
Now, let’s explain each component in more detail:
State Space (S)
The state space is the set of all states the agent can be in. For example, in a game of Go, the state space could consist of all possible board configurations.
Action Space (A)
The action space includes all actions the agent can take based on its state. For instance, in a Go game, the agent can place a stone at a specific position.
Transition Probabilities (P)
Transition probabilities represent the likelihood of transitioning to the next state based on the current state and the chosen action. This is mathematically expressed as:
P(s', r | s, a)
Here, s' is the next state, r is the reward, s is the current state, and a is the chosen action. In a deterministic environment such as the grid world used later in this post, each (s, a) pair assigns probability 1 to a single next state and reward.
Reward Function (R)
The reward function represents the reward given when the agent takes a specific action in a specific state. Rewards are a critical factor defining the agent’s goals.
Discount Factor (γ)
The discount factor γ (0 ≤ γ < 1) reflects the impact of future rewards on the present value. The closer γ is to 0, the more the agent focuses on immediate rewards; the closer it is to 1, the more the agent focuses on long-term rewards.
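To make the role of γ concrete, the quantity the agent tries to maximize is the expected discounted return:
G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … = Σ_{k=0}^{∞} γ^k · R_{t+k+1}
For example, with γ = 0.9 a reward received three steps in the future is weighted by 0.9³ ≈ 0.73, while with γ = 0.1 the same reward is weighted by only 0.001.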
3. Examples of MDP
Now that we understand the concept of MDP, let’s explore how to apply it to a reinforcement learning problem through an example. Below, we will train a reinforcement learning agent on a simple MDP.
3.1 Simple Grid World Example
The grid world models a world composed of a 4×4 grid. The agent occupies one grid cell at a time and can move between cells through specific actions (up, down, left, right). The agent’s goal is to reach the bottom-right cell (the goal position).
Definition of States and Actions
In this grid world:
- State: Represented by numbers from 0 to 15 for each grid cell (4×4 grid)
- Actions: Up (0), Down (1), Left (2), Right (3)
Definition of Rewards
The agent receives a reward of +1 for reaching the goal state and 0 for any other state.
4. Implementing MDP with PyTorch
Now let’s implement the reinforcement learning agent using PyTorch. We will primarily use the Q-learning algorithm.
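Before diving into the code, recall the tabular Q-learning update that motivates the implementation:
Q(s, a) ← Q(s, a) + α · [r + γ · max_{a'} Q(s', a') − Q(s, a)]
where α is the learning rate. In the code below, the table Q is replaced by a small neural network, and the target r + γ · max_{a'} Q(s', a') is what the network is regressed toward with a mean-squared-error loss.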
4.1 Environment Initialization
First, let’s define a class for creating the grid world:
import numpy as np

class GridWorld:
    def __init__(self, grid_size=4):
        self.grid_size = grid_size
        self.state = 0                               # agent starts in the top-left cell
        self.goal_state = grid_size * grid_size - 1  # bottom-right cell
        self.actions = [0, 1, 2, 3]                  # Up, Down, Left, Right
        self.rewards = np.zeros((grid_size * grid_size,))
        self.rewards[self.goal_state] = 1            # reward for reaching the goal

    def reset(self):
        self.state = 0  # starting state
        return self.state

    def step(self, action):
        # Convert the flat state index into (row, column) coordinates
        x, y = divmod(self.state, self.grid_size)
        if action == 0 and x > 0:                    # Up
            x -= 1
        elif action == 1 and x < self.grid_size - 1: # Down
            x += 1
        elif action == 2 and y > 0:                  # Left
            y -= 1
        elif action == 3 and y < self.grid_size - 1: # Right
            y += 1
        self.state = x * self.grid_size + y
        return self.state, self.rewards[self.state]
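Here is a quick, illustrative sanity check of the environment (the action sequence is arbitrary):

env = GridWorld()
state = env.reset()
print(state)                 # 0 (top-left cell)
state, reward = env.step(3)  # move Right
print(state, reward)         # 1 0.0
state, reward = env.step(1)  # move Down
print(state, reward)         # 5 0.0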
4.2 Implementing the Q-learning Algorithm
We will train the agent using Q-learning, with a small neural network approximating the Q-function. Here is the code for the network and the training loop:
import torch
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        # Small fully connected network: one-hot state in, one Q-value per action out
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
def train_agent(episodes, max_steps, epsilon=0.1):
    env = GridWorld()
    state_size = env.grid_size * env.grid_size
    action_size = len(env.actions)
    q_network = QNetwork(state_size, action_size)
    optimizer = optim.Adam(q_network.parameters(), lr=0.001)
    criterion = nn.MSELoss()

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0

        for step in range(max_steps):
            # One-hot encode the current state and compute its Q-values
            state_tensor = torch.eye(state_size)[state]
            q_values = q_network(state_tensor)

            # Epsilon-greedy policy: explore with probability epsilon, otherwise act greedily
            if np.random.rand() < epsilon:
                action = np.random.choice(env.actions)
            else:
                action = int(torch.argmax(q_values).item())

            next_state, reward = env.step(action)
            reward = float(reward)
            total_reward += reward

            # Q-learning target: r + gamma * max_a' Q(s', a'), with no bootstrapping at the goal
            done = next_state == env.goal_state
            next_state_tensor = torch.eye(state_size)[next_state]
            if done:
                target = torch.tensor(reward, dtype=torch.float32)
            else:
                target = reward + 0.99 * torch.max(q_network(next_state_tensor)).detach()

            loss = criterion(q_values[action], target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if done:
                break
            state = next_state

        print(f"Episode {episode+1}: Total Reward = {total_reward}")
5. Conclusion
In this post, we explored the concept of Markov Decision Process (MDP) and how to implement it using PyTorch. MDP is the foundational framework of reinforcement learning, and understanding it is essential for solving real reinforcement learning problems. I hope you gain deeper insights into MDP and reinforcement learning through practice.
Additionally, I encourage you to explore more complex MDP problems and learning algorithms. Using tools like PyTorch, try implementing various environments, training agents, and building your own reinforcement learning models.