<h1>Deep Learning PyTorch Course, Markov Decision Process</h1>
<p>The Markov Decision Process (MDP) is the mathematical framework underlying reinforcement learning. An MDP models how an agent chooses optimal actions in a given environment. In this post, we will examine the concept of an MDP and how to implement one using PyTorch.</p>
<h2>1. Overview of the Markov Decision Process (MDP)</h2>
<p>An MDP consists of the following components:</p>
<ul>
<li><strong>State space (S)</strong>: the set of all states the agent can be in.</li>
<li><strong>Action space (A)</strong>: the set of all actions the agent can take in a given state.</li>
<li><strong>Transition probabilities (P)</strong>: the probability of moving to each next state, given the current state and action.</li>
<li><strong>Reward function (R)</strong>: the reward the agent receives for taking a given action in a given state.</li>
<li><strong>Discount factor (γ)</strong>: a value that weighs future rewards against present ones, so that future rewards count for less than immediate rewards.</li>
</ul>
<h2>2. Mathematical Modeling of an MDP</h2>
<p>Formally, an MDP is defined by its state space, action space, transition probabilities, reward function, and discount factor:</p>
<ul>
<li>MDP = (S, A, P, R, γ)</li>
</ul>
<p>Let's look at each component in more detail.</p>
<h3>State Space (S)</h3>
<p>The state space is the set of all states the agent can be in.
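As a quick illustration of the tuple (S, A, P, R, γ), a toy two-state MDP can be written down as plain Python data. Every state, probability, and reward below is invented purely for illustration:

```python
# A toy 2-state MDP written out as the tuple (S, A, P, R, gamma).
S = [0, 1]
A = [0, 1]                      # e.g., action 0 = "stay", action 1 = "move"
# P[(s, a)] = list of (next_state, probability) pairs
P = {(0, 0): [(0, 1.0)], (0, 1): [(1, 0.9), (0, 0.1)],
     (1, 0): [(1, 1.0)], (1, 1): [(0, 0.9), (1, 0.1)]}
R = {(0, 1): 1.0}               # reward 1 for attempting to move out of state 0
gamma = 0.9

def expected_next_value(s, a, V):
    """One Bellman backup term: R(s,a) + gamma * sum_s' P(s'|s,a) * V(s')."""
    return R.get((s, a), 0.0) + gamma * sum(p * V[s2] for s2, p in P[(s, a)])

V = {0: 0.0, 1: 0.0}            # a value estimate per state, initialized to zero
print(expected_next_value(0, 1, V))  # prints 1.0
```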
For example, in the game of Go, the state space consists of all possible board configurations.</p>
<h3>Action Space (A)</h3>
<p>The action space contains all actions available to the agent in its current state. In Go, for instance, an action is placing a stone at a particular position.</p>
<h3>Transition Probabilities (P)</h3>
<p>The transition probabilities describe how likely each next state is, given the current state and the chosen action. Mathematically:</p>
<pre><code>P(s', r | s, a)</code></pre>
<p>Here, <code>s'</code> is the next state, <code>r</code> is the reward, <code>s</code> is the current state, and <code>a</code> is the chosen action.</p>
<h3>Reward Function (R)</h3>
<p>The reward function specifies the reward the agent receives for taking a particular action in a particular state. Rewards are what define the agent's goal.</p>
<h3>Discount Factor (γ)</h3>
<p>The discount factor <code>γ (0 ≤ γ &lt; 1)</code> determines how strongly future rewards count toward present value. The closer <code>γ</code> is to 0, the more the agent focuses on immediate rewards; the closer it is to 1, the more it weighs long-term rewards.</p>
<h2>3. An MDP Example</h2>
<p>Now that we understand the concept, let's apply it to a reinforcement learning problem. We will build and train a reinforcement learning agent on a simple MDP.</p>
<h3>3.1 A Simple Grid World</h3>
<p>The grid world is a 4×4 grid of cells. The agent occupies one cell at a time and can move with four actions (up, down, left, right).
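To make the discount factor concrete, here is a minimal sketch computing the discounted return G = r₀ + γ·r₁ + γ²·r₂ + … for a reward sequence (the rewards are invented for illustration):

```python
def discounted_return(rewards, gamma):
    """Fold the reward sequence from the back: G_t = r_t + gamma * G_{t+1}."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A reward of 1 arriving two steps in the future is worth gamma**2 = 0.81 today.
print(discounted_return([0, 0, 1], 0.9))
```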
The agent's goal is to reach the bottom-right cell (the goal state).</p>
<h4>States and Actions</h4>
<p>In this grid world:</p>
<ul>
<li>States: the grid cells, numbered 0 to 15 (4×4 grid)</li>
<li>Actions: up (0), down (1), left (2), right (3)</li>
</ul>
<h4>Rewards</h4>
<p>The agent receives a reward of +1 for reaching the goal state and 0 everywhere else.</p>
<h2>4. Implementing the MDP with PyTorch</h2>
<p>Now let's implement the reinforcement learning agent using PyTorch, based on the Q-learning algorithm.</p>
<h3>4.1 Environment Initialization</h3>
<p>First, a class for the grid world:</p>
<pre><code>import numpy as np

class GridWorld:
    def __init__(self, grid_size=4):
        self.grid_size = grid_size
        self.state = 0                                # current cell index
        self.goal_state = grid_size * grid_size - 1   # bottom-right cell
        self.actions = [0, 1, 2, 3]                   # up, down, left, right
        self.rewards = np.zeros(grid_size * grid_size)
        self.rewards[self.goal_state] = 1             # reward for reaching the goal

    def reset(self):
        self.state = 0  # starting cell
        return self.state

    def step(self, action):
        x, y = divmod(self.state, self.grid_size)
        if action == 0 and x &gt; 0:                     # up
            x -= 1
        elif action == 1 and x &lt; self.grid_size - 1:  # down
            x += 1
        elif action == 2 and y &gt; 0:                   # left
            y -= 1
        elif action == 3 and y &lt; self.grid_size - 1:  # right
            y += 1
        self.state = x * self.grid_size + y
        return self.state, self.rewards[self.state]
</code></pre>
<h3>4.2 Implementing the Q-learning Algorithm</h3>
<p>We will train the agent with Q-learning.
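Before the neural-network version, it helps to see the core update rule, Q(s,a) ← Q(s,a) + α·[r + γ·maxₐ′ Q(s′,a′) − Q(s,a)], in a self-contained tabular sketch of the same grid world. The hyperparameters (α, ε, episode count) are illustrative assumptions, not tuned values:

```python
import numpy as np

# Tabular Q-learning on the 4x4 grid world described above.
GRID = 4
N_STATES, N_ACTIONS = GRID * GRID, 4
GOAL = N_STATES - 1

def step(state, action):
    """Same dynamics as the grid world: 0=up, 1=down, 2=left, 3=right."""
    x, y = divmod(state, GRID)
    if action == 0 and x > 0:
        x -= 1
    elif action == 1 and x < GRID - 1:
        x += 1
    elif action == 2 and y > 0:
        y -= 1
    elif action == 3 and y < GRID - 1:
        y += 1
    next_state = x * GRID + y
    return next_state, (1.0 if next_state == GOAL else 0.0)

rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
alpha, gamma, eps = 0.5, 0.99, 0.1   # illustrative hyperparameters

for _ in range(500):
    s = 0
    for _ in range(100):
        if rng.random() < eps:       # epsilon-greedy exploration...
            a = int(rng.integers(N_ACTIONS))
        else:                        # ...with random tie-breaking when greedy
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))
        s2, r = step(s, a)
        # Core update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)),
        # with no bootstrap term at the terminal (goal) state.
        bootstrap = 0.0 if s2 == GOAL else gamma * Q[s2].max()
        Q[s, a] += alpha * (r + bootstrap - Q[s, a])
        if s2 == GOAL:
            break
        s = s2

print(Q[0].argmax())  # greedy first move from the start state (1=down or 3=right)
```

The same update appears inside the network-based trainer later; there, the table lookup Q[s, a] is replaced by a forward pass through a small MLP.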
Here is the code:</p>
<pre><code>import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

def train_agent(episodes, max_steps, gamma=0.99, epsilon=0.1):
    env = GridWorld()
    state_size = env.grid_size * env.grid_size
    action_size = len(env.actions)

    q_network = QNetwork(state_size, action_size)
    optimizer = optim.Adam(q_network.parameters(), lr=0.001)
    criterion = nn.MSELoss()

    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            state_tensor = torch.eye(state_size)[state]  # one-hot encoding of the state
            q_values = q_network(state_tensor)

            # epsilon-greedy action selection
            if np.random.rand() &lt; epsilon:
                action = np.random.choice(env.actions)
            else:
                action = int(torch.argmax(q_values).item())
            next_state, reward = env.step(action)
            total_reward += reward

            # TD target: no bootstrap term at the terminal (goal) state
            next_state_tensor = torch.eye(state_size)[next_state]
            if next_state == env.goal_state:
                target = torch.tensor(reward, dtype=torch.float32)
            else:
                target = reward + gamma * torch.max(q_network(next_state_tensor)).detach()
            loss = criterion(q_values[action], target)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            if next_state == env.goal_state:
                break

            state = next_state
        print(f"Episode {episode+1}: Total Reward = {total_reward}")
</code></pre>
<h2>5. Conclusion</h2>
<p>In this post, we explored the Markov Decision Process (MDP) and how to implement it with PyTorch.
MDP is the foundational framework of reinforcement learning, and understanding it is essential for solving real reinforcement-learning problems. I hope this practice gives you deeper insight into MDPs and reinforcement learning.</p>
<p>Beyond this, I encourage you to explore more complex MDP problems and learning algorithms. Using tools like PyTorch, try implementing various environments, training agents, and building your own reinforcement-learning models.</p>
<footer>
<p>I hope this post was helpful. If you have any questions, please leave a comment!</p>
</footer>