A Simple Intro to Q-Learning in R: Floor Plan Navigation

The question to be answered here is: What’s the best way to get from Room 2 to Room 5 (outside)? Notice that by answering this question using reinforcement learning, we will also know how to find optimal routes from any room to outside. And if we run the iterative algorithm again for a new target state, we can find out the optimal route from any room to that new target state.

Since Q-Learning is model-free, we don’t need to know how likely it is that our agent will move between any room and any other room (the transition probabilities). If you had observed the behavior in this system over time, you might be able to estimate that information, but in many cases it just isn’t available. So the key for this problem is to construct a rewards matrix that describes the benefit (or penalty!) of attempting to go from one state (room) to another.

Assigning the rewards is somewhat arbitrary, but you should give a large positive value to your target state and negative values to states that are impossible or highly undesirable. Here’s the guideline we’ll use for this problem:

  • -1 if “you can’t get there from here”
  • 0 if the destination is not the target state
  • 100 if the destination is the target state
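
These three cases can be wrapped in a small helper function if you like. This is just a sketch to make the guideline concrete, not code from the original post:

reward <- function(reachable, is.target) {
  if (!reachable) return(-1)   # "you can't get there from here"
  if (is.target) return(100)   # the destination is the target state
  0                            # reachable, but not the target
}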

We’ll start constructing our rewards matrix by listing the states we’ll come FROM down the rows, and the states we’ll go TO in the columns. First, let’s fill the diagonal with -1 rewards, because we don’t want our agent to stay in the same place (that is, move from Room 1 to Room 1, or from Room 2 to Room 2, and so forth).
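
If you want to build the matrix up in R as you read (a sketch; the complete matrix is given in one piece further down), start with -1 everywhere, which takes care of the diagonal and every blocked move at once:

R <- matrix(-1, nrow = 6, ncol = 6)                  # -1 everywhere to start
rownames(R) <- colnames(R) <- paste0("Room", 0:5)    # rows = FROM, columns = TO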

Next, let’s move across the first row. Starting in Room 0, we only have one choice: go to Room 4. All other possibilities are blocked (-1).
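
In code, continuing the sketch above (Room 0 is row 1 and Room 4 is column 5, because R matrices are indexed from 1):

R["Room0", "Room4"] <- 0   # allowed, but Room 4 is not the target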

Now let’s fill in the row labeled 1. From Room 1, you have two choices: go to Room 3 (which is not great but permissible, so give it a 0) or go to Room 5 (the target, worth 100)!
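
Continuing the sketch:

R["Room1", "Room3"] <- 0     # allowed, but not the target
R["Room1", "Room5"] <- 100   # allowed, and Room 5 (outside) is the target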

Continue moving row by row, determining whether you can’t get there from here (-1), you can but it’s not the target (0), or it is the target (100). You’ll end up with the final rewards matrix shown below.

Now create this rewards matrix in R:

R <- matrix(c(-1, -1, -1, -1,  0,  -1,    # from Room 0: only Room 4 is reachable
              -1, -1, -1,  0, -1, 100,    # from Room 1: Room 3, or Room 5 (the target)
              -1, -1, -1,  0, -1,  -1,    # from Room 2: only Room 3
              -1,  0,  0, -1,  0,  -1,    # from Room 3: Rooms 1, 2, and 4
               0, -1, -1,  0, -1, 100,    # from Room 4: Rooms 0, 3, and 5 (the target)
              -1,  0, -1, -1,  0, 100),   # from Room 5: Rooms 1, 4, or stay outside (100)
            nrow = 6, ncol = 6, byrow = TRUE)

And run the code. Notice that we’re passing 6 as the target state instead of 5: the rooms are labeled 0 through 5, but R’s matrix rows and columns are numbered starting at 1, so Room 5 (outside) corresponds to state 6.
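
If that off-by-one bookkeeping bothers you, a pair of tiny (hypothetical) helpers makes the conversion explicit:

room.to.state <- function(room) room + 1    # rooms are labeled 0-5, states 1-6
state.to.room <- function(state) state - 1
room.to.state(5)                            # 6, the target state used below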

source("https://raw.githubusercontent.com/NicoleRadziwill/R-Functions/master/qlearn.R")

# the arguments here (10,000 training episodes, learning rate 0.1, discount 0.8,
# target state 6) are assumed; check qlearn.R for the exact function name and signature
results <- q.learn(R, 10000, alpha = 0.1, gamma = 0.8, tgt.state = 6)
round(results)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    0    0    0    0   80    0
[2,]    0    0    0   64    0  100
[3,]    0    0    0   64    0    0
[4,]    0   80   51    0   80    0
[5,]   64    0    0   64    0  100
[6,]    0   80    0    0   80  100

You can read this table of state-action values to obtain policies. A policy is a “path” through the states of the system:

  • Start at Room 0 (first row, labeled 1): Choose Room 4 (80), then from Room 4 choose Room 5 (100)
  • Start at Room 1: Choose Room 5 (100)
  • Start at Room 2: Choose Room 3 (64), from Room 3 choose Room 1 or Room 4 (80); from 1 or 4 choose 5 (100)
  • Start at Room 3: Choose Room 1 or Room 4 (80), then Room 5 (100)
  • Start at Room 4: Choose Room 5 (100)
  • Start at Room 5: Stay at Room 5 (100)

To answer the original problem: starting in Room 2, we would take route 2-3-1-5 or 2-3-4-5 to get outside the quickest. That’s easy to see on a simple floor plan like this one, but it becomes much harder to work out by hand as the maps get bigger.
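
Here is a short sketch (a hypothetical helper, assuming the normalized results matrix above) that follows the highest-valued action from any starting room until it reaches the outside:

best.route <- function(Q, start.room, target.room = 5) {
  route <- start.room
  room <- start.room
  while (room != target.room) {
    room <- which.max(Q[room + 1, ]) - 1   # +1/-1 convert between room labels (0-5) and matrix indices (1-6)
    route <- c(route, room)
  }
  route
}
best.route(results, 2)   # 2 3 1 5 (which.max takes the first of the tied 80s; 2 3 4 5 is just as good)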
