This course covers the basics of deep learning and introduces the Markov Decision Process (MDP), explaining how to
implement it using PyTorch. The MDP is a central concept in reinforcement learning: it is the mathematical model
used to describe how an agent should choose actions in order to achieve its goals.
1. What is a Markov Decision Process?
A Markov Decision Process (MDP) is a mathematical framework that defines the elements an agent (the acting entity)
must consider in order to make optimal decisions in a given environment. An MDP consists of the following five
key elements (a small worked example follows the list):
- State Set (S): A set that represents all possible states of the environment.
- Action Set (A): A set of all possible actions that the agent can take in each state.
- Transition Probability (P): Represents the probability of transitioning to the next state after taking a specific action in the current state.
- Reward Function (R): Defines the reward obtained through a specific action in a specific state.
- Discount Factor (γ): A value that determines how important future rewards are compared to current rewards.
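To make these elements concrete, the five components of a hypothetical two-state MDP can be written down directly as plain Python data structures. This toy example is only for illustration and is separate from the PyTorch code later in this course:

# A hypothetical two-state MDP, written out explicitly for illustration.
states = ["s0", "s1"]       # State set S
actions = ["stay", "move"]  # Action set A

# Transition probabilities P[s][a] -> {next state: probability}
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}

# Reward function R[s][a] -> immediate reward
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.0, "move": 0.0},
}

gamma = 0.9  # Discount factor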
2. Mathematical Definition of MDP
An MDP is generally defined as the tuple (S, A, P, R, γ), and an agent learns a policy (a rule for selecting actions) on top of this structure. The goal is to find the optimal policy that maximizes the long-term reward, i.e. the expected discounted return R(s₀, a₀) + γ·R(s₁, a₁) + γ²·R(s₂, a₂) + ….
Relationship Between States and Actions
When the agent takes action a ∈ A in state s ∈ S, the probability of transitioning to the next state s′ ∈ S is written P(s′ | s, a). The reward function is written R(s, a) and denotes the immediate reward the agent receives for taking action a in state s.
Policy π
The policy π(a | s) defines the probability of taking action a in state s. Learning a good policy is what allows the agent to choose the best action for each state.
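As an illustration, a stochastic policy can be represented as a table of action probabilities per state. The sketch below reuses the hypothetical two-state MDP from Section 1 and samples an action according to π(a | s); the names policy and sample_action are illustrative and not part of the later example:

import numpy as np

# A hypothetical stochastic policy pi(a|s) for the toy two-state MDP.
policy = {
    "s0": {"stay": 0.2, "move": 0.8},
    "s1": {"stay": 0.5, "move": 0.5},
}

rng = np.random.default_rng(0)

def sample_action(policy, state):
    # Sample an action according to the probabilities pi(a|state).
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return rng.choice(actions, p=probs)

print(sample_action(policy, "s0"))  # "move" with probability 0.8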
3. Implementing MDP with PyTorch
Now, let’s implement the Markov Decision Process using PyTorch. The code below defines the MDP and shows the process
by which the agent learns a policy. In this example, we simulate an agent learning to reach the goal point in a
simple grid environment.
Installing Required Libraries
pip install torch numpy matplotlib
Code Example
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
# Environment Definition
class GridWorld:
    def __init__(self, grid_size):
        self.grid_size = grid_size
        self.state = (0, 0)  # Initial state
        self.goal = (grid_size - 1, grid_size - 1)  # Goal state
        self.actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # Right, Left, Down, Up

    def step(self, action):
        next_state = (self.state[0] + action[0], self.state[1] + action[1])
        # If exceeding boundaries, state remains unchanged
        if 0 <= next_state[0] < self.grid_size and 0 <= next_state[1] < self.grid_size:
            self.state = next_state
        # Reward and completion condition
        if self.state == self.goal:
            return self.state, 1, True  # Goal reached
        return self.state, 0, False

    def reset(self):
        self.state = (0, 0)
        return self.state
# Q-Network Definition
class QNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, 24)   # First hidden layer
        self.fc2 = nn.Linear(24, 24)          # Second hidden layer
        self.fc3 = nn.Linear(24, output_dim)  # Output layer

    def forward(self, x):
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        return self.fc3(x)
# Q-learning Learner
class QLearningAgent:
    def __init__(self, state_space, action_space):
        self.q_network = QNetwork(state_space, action_space)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.001)
        self.criterion = nn.MSELoss()
        self.gamma = 0.99   # Discount factor
        self.epsilon = 1.0  # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995

    def choose_action(self, state):
        if np.random.rand() <= self.epsilon:
            return np.random.randint(0, 4)  # Random action
        q_values = self.q_network(torch.FloatTensor(state)).detach().numpy()
        return np.argmax(q_values)  # Return optimal action

    def train(self, state, action, reward, next_state, done):
        target = reward
        if not done:
            target = reward + self.gamma * np.max(
                self.q_network(torch.FloatTensor(next_state)).detach().numpy())
        target_f = self.q_network(torch.FloatTensor(state)).detach().numpy()
        target_f[action] = target
        # Learning
        self.optimizer.zero_grad()
        output = self.q_network(torch.FloatTensor(state))
        loss = self.criterion(output, torch.FloatTensor(target_f))
        loss.backward()
        self.optimizer.step()
        # Decay exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
# Main Loop
def main():
    env = GridWorld(grid_size=5)
    agent = QLearningAgent(state_space=2, action_space=4)
    episodes = 1000
    rewards = []

    for episode in range(episodes):
        state = env.reset()
        done = False
        total_reward = 0
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done = env.step(env.actions[action])
            agent.train(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
        rewards.append(total_reward)

    # Visualization of results
    plt.plot(rewards)
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('Training Rewards over Episodes')
    plt.show()

if __name__ == "__main__":
    main()
4. Code Explanation
The code above is an example of formulating an MDP for a 5×5 grid environment and solving it with a simple Q-learning agent.
The GridWorld class defines the grid environment in which the agent moves. The agent moves according to the four actions
in its action set and receives a reward of 1 when it reaches the goal point; every other step yields a reward of 0.
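If you want to sanity-check the environment on its own before training, a few manual steps against the GridWorld class above behave as follows (the printed values assume the default 5×5 grid):

env = GridWorld(grid_size=5)
state = env.reset()                      # (0, 0)
state, reward, done = env.step((0, 1))   # Move right
state, reward, done = env.step((1, 0))   # Move down
print(state, reward, done)               # (1, 1) 0 False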
The QNetwork class defines the neural network model used in Q-learning.
It takes the state (a 2-dimensional coordinate) as input and returns the Q-values for each of the four actions as output.
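For example, a single forward pass through the network (using the imports and the QNetwork class from the code above) maps a 2-dimensional state to four Q-values, one per action:

net = QNetwork(input_dim=2, output_dim=4)
q_values = net(torch.FloatTensor([0.0, 0.0]))
print(q_values.shape)  # torch.Size([4]) -- one Q-value per action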
The QLearningAgent class represents the agent that carries out the learning. It chooses actions with an ε-greedy policy
(a random action with probability ε, otherwise the action with the highest predicted Q-value) and updates the network toward the Q-learning target r + γ · max_a′ Q(s′, a′).
The main function initializes the environment and the agent and runs the main loop over the episodes.
In each episode, the agent selects actions based on the current state, observes the next state and reward from the environment,
and updates its Q-network accordingly. Once training is complete, the per-episode rewards are plotted to assess the agent’s performance.
5. Analysis of Learning Results
Observing the learning process, we can see that the agent explores the environment and gradually learns a path to the goal.
The reward plot shows how the return per episode evolves as training progresses. Note that in this environment the only
reward is the 1 obtained at the goal, so every completed episode ends with a total reward of 1; the number of steps the
agent needs to reach the goal is therefore a more informative indicator of learning progress, as sketched below.
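One simple way to make progress visible in this sparse-reward setting is to also record how many steps each episode takes. The sketch below is a possible variation of the training loop; it assumes the env, agent, episodes, and plt objects from the main function above, and the name steps_per_episode is illustrative:

steps_per_episode = []

for episode in range(episodes):
    state = env.reset()
    done = False
    steps = 0
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(env.actions[action])
        agent.train(state, action, reward, next_state, done)
        state = next_state
        steps += 1
    steps_per_episode.append(steps)

plt.plot(steps_per_episode)
plt.xlabel('Episode')
plt.ylabel('Steps to reach the goal')
plt.title('Episode Length over Training')
plt.show()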
6. Conclusion and Future Directions
In this course, we have explained the basic concepts of deep learning, PyTorch, and the Markov Decision Process,
and implemented an MDP with a simple Q-learning agent in PyTorch. Working through this implementation should give
you a more concrete understanding of the related concepts.
Reinforcement learning is a broad field with many algorithms and application environments.
Future courses will cover more complex environments and a wider range of policy-learning algorithms (e.g., DQN, Policy Gradients).