This course covers the basics of deep learning and introduces the Markov Decision Process (MDP), explaining how to
implement it using PyTorch. The MDP is a central concept in reinforcement learning: it is the mathematical model
used to describe how an agent should choose actions in order to achieve its goals.
1. What is a Markov Decision Process?
A Markov Decision Process (MDP) is a mathematical framework that defines the elements an agent (the acting entity)
must consider in order to make optimal decisions in a given environment. An MDP consists of the following five
key elements (a small worked example follows the list):
- State Set (S): A set that represents all possible states of the environment.
- Action Set (A): A set of all possible actions that the agent can take in each state.
- Transition Probability (P): Represents the probability of transitioning to the next state after taking a specific action in the current state.
- Reward Function (R): Defines the reward obtained through a specific action in a specific state.
- Discount Factor (γ): A value that determines how important future rewards are compared to current rewards.
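To make these elements concrete, the five components of a hypothetical two-state MDP can be written down directly as plain Python data structures. This toy example is only for illustration and is separate from the PyTorch code later in this course:

# A hypothetical two-state MDP, written out explicitly for illustration.
states = ["s0", "s1"]       # State set S
actions = ["stay", "move"]  # Action set A

# Transition probabilities P[s][a] -> {next state: probability}
P = {
    "s0": {"stay": {"s0": 1.0}, "move": {"s1": 0.9, "s0": 0.1}},
    "s1": {"stay": {"s1": 1.0}, "move": {"s0": 0.9, "s1": 0.1}},
}

# Reward function R[s][a] -> immediate reward
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.0, "move": 0.0},
}

gamma = 0.9  # Discount factor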
2. Mathematical Definition of MDP
An MDP is generally defined as the tuple (S, A, P, R, γ), and an agent learns a policy (a rule for selecting actions) on top of this structure. The goal is to find the optimal policy that maximizes the long-term reward, i.e. the expected discounted return R(s₀, a₀) + γ·R(s₁, a₁) + γ²·R(s₂, a₂) + ….
Relationship Between States and Actions
When the agent takes action a ∈ A in state s ∈ S, the probability of transitioning to the next state s′ ∈ S is written P(s′ | s, a). The reward function is written R(s, a) and denotes the immediate reward the agent receives for taking action a in state s.
Policy π
The policy π(a | s) defines the probability of taking action a in state s. Learning a good policy is what allows the agent to choose the best action for each state.
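As an illustration, a stochastic policy can be represented as a table of action probabilities per state. The sketch below reuses the hypothetical two-state MDP from Section 1 and samples an action according to π(a | s); the names policy and sample_action are illustrative and not part of the later example:

import numpy as np

# A hypothetical stochastic policy pi(a|s) for the toy two-state MDP.
policy = {
    "s0": {"stay": 0.2, "move": 0.8},
    "s1": {"stay": 0.5, "move": 0.5},
}

rng = np.random.default_rng(0)

def sample_action(policy, state):
    # Sample an action according to the probabilities pi(a|state).
    actions = list(policy[state])
    probs = [policy[state][a] for a in actions]
    return rng.choice(actions, p=probs)

print(sample_action(policy, "s0"))  # "move" with probability 0.8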
3. Implementing MDP with PyTorch
Now, let’s implement the Markov Decision Process using PyTorch. The code below defines the MDP and shows the process
by which the agent learns a policy. In this example, we simulate an agent learning to reach the goal point in a
simple grid environment.
Installing Required Libraries
pip install torch numpy matplotlib
Code Example
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
# Environment Definition
class GridWorld:
    def __init__(self, grid_size):
        self.grid_size = grid_size
        self.state = (0, 0)  # Initial state
        self.goal = (grid_size - 1, grid_size - 1)  # Goal state
        self.actions = [(0, 1), (0, -1), (1, 0), (-1, 0)]  # Right, Left, Down, Up

    def step(self, action):
        next_state = (self.state[0] + action[0], self.state[1] + action[1])
        # If exceeding boundaries, state remains unchanged
        if 0 <= next_state[0] < self.grid_size and 0 <= next_state[1] < self.grid_size:
            self.state = next_state
        # Reward and completion condition
        if self.state == self.goal:
            return self.state, 1, True  # Goal reached
        return self.state, 0, False

    def reset(self):
        self.state = (0, 0)
        return self.state
# Q-Network Definition
class QNetwork(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(input_dim, 24)   # First hidden layer
        self.fc2 = nn.Linear(24, 24)          # Second hidden layer
        self.fc3 = nn.Linear(24, output_dim)  # Output layer

    def forward(self, x):
        x = nn.functional.relu(self.fc1(x))
        x = nn.functional.relu(self.fc2(x))
        return self.fc3(x)
# Q-learning Learner
class QLearningAgent:
    def __init__(self, state_space, action_space):
        self.q_network = QNetwork(state_space, action_space)
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=0.001)
        self.criterion = nn.MSELoss()
        self.gamma = 0.99   # Discount factor
        self.epsilon = 1.0  # Exploration rate
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995

    def choose_action(self, state):
        if np.random.rand() <= self.epsilon:
            return np.random.randint(0, 4)  # Random action
        q_values = self.q_network(torch.FloatTensor(state)).detach().numpy()
        return np.argmax(q_values)  # Return optimal action

    def train(self, state, action, reward, next_state, done):
        target = reward
        if not done:
            target = reward + self.gamma * np.max(
                self.q_network(torch.FloatTensor(next_state)).detach().numpy())
        target_f = self.q_network(torch.FloatTensor(state)).detach().numpy()
        target_f[action] = target
        # Learning
        self.optimizer.zero_grad()
        output = self.q_network(torch.FloatTensor(state))
        loss = self.criterion(output, torch.FloatTensor(target_f))
        loss.backward()
        self.optimizer.step()
        # Decay exploration rate
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay
# Main Loop
def main():
    env = GridWorld(grid_size=5)
    agent = QLearningAgent(state_space=2, action_space=4)
    episodes = 1000
    rewards = []

    for episode in range(episodes):
        state = env.reset()
        done = False
        total_reward = 0
        while not done:
            action = agent.choose_action(state)
            next_state, reward, done = env.step(env.actions[action])
            agent.train(state, action, reward, next_state, done)
            state = next_state
            total_reward += reward
        rewards.append(total_reward)

    # Visualization of results
    plt.plot(rewards)
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('Training Rewards over Episodes')
    plt.show()

if __name__ == "__main__":
    main()
4. Code Explanation
The code above is an example of formulating an MDP for a 5×5 grid environment and solving it with a simple Q-learning agent.
The GridWorld class defines the grid environment in which the agent moves. The agent moves according to the four actions
in its action set and receives a reward of 1 when it reaches the goal point; every other step yields a reward of 0.
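If you want to sanity-check the environment on its own before training, a few manual steps against the GridWorld class above behave as follows (the printed values assume the default 5×5 grid):

env = GridWorld(grid_size=5)
state = env.reset()                      # (0, 0)
state, reward, done = env.step((0, 1))   # Move right
state, reward, done = env.step((1, 0))   # Move down
print(state, reward, done)               # (1, 1) 0 False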
The QNetwork class defines the neural network model used in Q-learning.
It takes the state (a 2-dimensional coordinate) as input and returns the Q-values for each of the four actions as output.
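For example, a single forward pass through the network (using the imports and the QNetwork class from the code above) maps a 2-dimensional state to four Q-values, one per action:

net = QNetwork(input_dim=2, output_dim=4)
q_values = net(torch.FloatTensor([0.0, 0.0]))
print(q_values.shape)  # torch.Size([4]) -- one Q-value per action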
The QLearningAgent class represents the agent that carries out the learning. It chooses actions with an ε-greedy policy
(a random action with probability ε, otherwise the action with the highest predicted Q-value) and updates the network toward the Q-learning target r + γ · max_a′ Q(s′, a′).
The main function initializes the environment and the agent and runs the main loop over the episodes.
In each episode, the agent selects actions based on the current state, observes the next state and reward from the environment,
and updates its Q-network accordingly. Once training is complete, the per-episode rewards are plotted to assess the agent’s performance.
5. Analysis of Learning Results
Observing the learning process, we can see that the agent explores the environment and gradually learns a path to the goal.
The reward plot shows how the return per episode evolves as training progresses. Note that in this environment the only
reward is the 1 obtained at the goal, so every completed episode ends with a total reward of 1; the number of steps the
agent needs to reach the goal is therefore a more informative indicator of learning progress, as sketched below.
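One simple way to make progress visible in this sparse-reward setting is to also record how many steps each episode takes. The sketch below is a possible variation of the training loop; it assumes the env, agent, episodes, and plt objects from the main function above, and the name steps_per_episode is illustrative:

steps_per_episode = []

for episode in range(episodes):
    state = env.reset()
    done = False
    steps = 0
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(env.actions[action])
        agent.train(state, action, reward, next_state, done)
        state = next_state
        steps += 1
    steps_per_episode.append(steps)

plt.plot(steps_per_episode)
plt.xlabel('Episode')
plt.ylabel('Steps to reach the goal')
plt.title('Episode Length over Training')
plt.show()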
6. Conclusion and Future Directions
In this course, we have explained the basic concepts of deep learning, PyTorch, and the Markov Decision Process,
and implemented an MDP with a simple Q-learning agent in PyTorch. Working through this implementation should give
you a more concrete understanding of the related concepts.
Reinforcement learning is a broad field with many algorithms and application environments.
Future courses will cover more complex environments and a wider range of policy-learning algorithms (e.g., DQN, Policy Gradients).