RL 101 – Tic Tac Toe

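Tic-tac-toe can be regarded as a simplified version of Gomoku (five in a row):

Two players, one drawing circles (◯) and the other drawing crosses (✗), take turns placing their symbol on a 3×3 grid. The first to complete a horizontal, vertical, or diagonal line wins. If both sides play without error, the game ends in a draw.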

Tic-tac-toe is a classic reinforcement learning example and can be classified as a two-player zero-sum game. A tabular RL solution is implemented in reinforcement-learning-an-introduction/tic_tac_toe.py:

def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        # Play one complete game
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        # Once the game is over the reward is known; update the state values iteratively
        player1.backup()
        player2.backup()
        # Reset for the next game
        judger.reset()
    # Save value(state); it is used later to select actions
    player1.save_policy()
    player2.save_policy()


# The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        # Play against the trained agent
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")
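
The `backup()` calls in `train` update the value table once a game has finished. A minimal sketch of such a temporal-difference update, V(s) ← V(s) + α[V(s') − V(s)], applied backwards over the states seen in one game, might look as follows. The function name `td_backup`, the dictionary-based table, the default value 0.5, and the step size are assumptions for illustration, not the code of the referenced file.

```python
def td_backup(values, visited_states, final_reward, step_size=0.1, default=0.5):
    """Update `values` in place from the states one player saw during a finished game.

    values: dict mapping a hashable state to its estimated value (hypothetical table).
    visited_states: states in the order they occurred; the last one is terminal.
    final_reward: value assigned to the terminal state, e.g. 1 win, 0.5 tie, 0 loss.
    """
    if not visited_states:
        return values
    values[visited_states[-1]] = final_reward
    # Walk backwards so each state moves toward the already-updated value of its successor.
    for i in range(len(visited_states) - 2, -1, -1):
        state, next_state = visited_states[i], visited_states[i + 1]
        current = values.get(state, default)
        target = values.get(next_state, default)
        values[state] = current + step_size * (target - current)
    return values
```

Iterating in reverse lets the final outcome influence every earlier state of the same game within a single pass.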

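Similarly, openai/gym and hill-a/stable-baselines can be used to model tic-tac-toe as an environment, but because the game has two players, the reward and related details need to be adapted accordingly.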
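
One way to handle the second player in a single-agent framing is to fold the opponent into the environment, so that the reward is always reported from the learning agent's point of view. The sketch below follows the Gym-style `reset`/`step` convention but deliberately does not import gym. The class name `TicTacToeEnv`, the random opponent, and the reward values (+1 win, -1 loss or illegal move, 0 otherwise) are all assumptions.

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]


class TicTacToeEnv:
    """Hypothetical Gym-style environment: the agent plays 1, a random opponent plays -1."""

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.board = [0] * 9

    def reset(self):
        self.board = [0] * 9
        return tuple(self.board)

    def _wins(self, mark):
        return any(all(self.board[i] == mark for i in line) for line in LINES)

    def step(self, action):
        """Return (observation, reward, done, info) from the agent's point of view."""
        if self.board[action] != 0:
            # An illegal move ends the episode with a penalty.
            return tuple(self.board), -1.0, True, {"illegal": True}
        self.board[action] = 1
        if self._wins(1):
            return tuple(self.board), 1.0, True, {}
        empty = [i for i, cell in enumerate(self.board) if cell == 0]
        if not empty:
            return tuple(self.board), 0.0, True, {}  # draw
        # The opponent's reply is part of the environment transition.
        self.board[self.rng.choice(empty)] = -1
        if self._wins(-1):
            return tuple(self.board), -1.0, True, {}
        # The agent moved first, so the board cannot be full after the opponent's reply.
        return tuple(self.board), 0.0, False, {}
```

Replacing the random opponent with a stronger policy, or with an earlier copy of the agent itself, changes what the learned policy is a best response to.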

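Beyond tabular methods, a neural network can be introduced to estimate the state-value function, with the board state as its input. For example, the AlphaZero implementation in deepmind/open_spiel can be used to play tic-tac-toe: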

# https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md
# Installation: https://github.com/deepmind/open_spiel/blob/master/docs/install.md
# Train the model, then play against it
$ az_path=exp/tic_tac_toe_alpha_zero
$ python3 open_spiel/python/examples/tic_tac_toe_alpha_zero.py --path ${az_path}
$ python3 open_spiel/python/examples/mcts.py --game=tic_tac_toe --player1=human --player2=az --az_path=${az_path}/checkpoint-25
2020-12-26 21:26:57.202819: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-26 21:26:57.221343: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcfa55182a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-26 21:26:57.221356: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
I1226 21:26:57.446192 4433972672 saver.py:1293] Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
Initial state:
...
...
...
Choose an action (empty to print legal actions): 4
Player 0 sampled action: x(1,1)
Next state:
...
.x.
...
Player 1 sampled action: o(2,2)
Next state:
...
.x.
..o
Choose an action (empty to print legal actions): 7
Player 0 sampled action: x(2,1)
Next state:
...
.x.
.xo
Player 1 sampled action: o(0,1)
Next state:
.o.
.x.
.xo
Choose an action (empty to print legal actions): 3
Player 0 sampled action: x(1,0)
Next state:
.o.
xx.
.xo
Player 1 sampled action: o(1,2)
Next state:
.o.
xxo
.xo
Choose an action (empty to print legal actions): 2
Player 0 sampled action: x(0,2)
Next state:
.ox
xxo
.xo
Player 1 sampled action: o(2,0)
Next state:
.ox
xxo
oxo
Choose an action (empty to print legal actions): 0
Player 0 sampled action: x(0,0)
Next state:
xox
xxo
oxo
Returns: 0.0 0.0 , Game actions: x(1,1) o(2,2) x(2,1) o(0,1) x(1,0) o(1,2) x(0,2) o(2,0) x(0,0)
Number of games played: 1
Number of distinct games played: 1
Players: human az
Overall wins [0, 0]
Overall returns [0.0, 0.0]
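
In the transcript above, the integer typed at each prompt corresponds to a cell in row-major order: 4 became x(1,1), 7 became x(2,1), and 3 became x(1,0). A small sketch that decodes these indices and replays the recorded game is shown here. The helper names `decode` and `replay` are made up for illustration, and open_spiel itself is not used.

```python
def decode(action):
    """Map a cell index 0..8 to (row, col), assuming row-major order as in the transcript."""
    return divmod(action, 3)


def replay(moves):
    """Apply (symbol, row, col) moves to an empty board and return it as three strings."""
    board = [["."] * 3 for _ in range(3)]
    for symbol, row, col in moves:
        assert board[row][col] == ".", "cell already taken"
        board[row][col] = symbol
    return ["".join(r) for r in board]


# Game actions copied from the "Returns:" line of the transcript.
game = [("x", 1, 1), ("o", 2, 2), ("x", 2, 1), ("o", 0, 1), ("x", 1, 0),
        ("o", 1, 2), ("x", 0, 2), ("o", 2, 0), ("x", 0, 0)]
final_board = replay(game)
print("\n".join(final_board))
```

The replayed board matches the last "Next state" in the transcript, a full board with no completed line, which is consistent with the reported returns of 0.0 for both players.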

In addition, deepmind/open_spiel provides an example in which a DQN agent and tabular agents learn by playing against each other:

# DQN agent vs Tabular Q-Learning agents trained on Tic Tac Toe.
$ python3 open_spiel/python/examples/tic_tac_toe_dqn_vs_tabular.py
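
For reference, tabular Q-learning is built on the standard update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]. The sketch below shows that single update step in plain Python. It is not open_spiel's implementation, and the function name, table layout, state labels, and hyperparameter values are assumptions.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_legal_actions,
             step_size=0.1, discount=1.0):
    """One tabular Q-learning step; `q` maps (state, action) to a value.

    Pass an empty `next_legal_actions` when `next_state` is terminal.
    """
    best_next = max((q[(next_state, a)] for a in next_legal_actions), default=0.0)
    target = reward + discount * best_next
    q[(state, action)] += step_size * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)
q_update(q, "s0", 4, 0.0, "s1", [0, 1])    # no reward yet, all values still 0
q_update(q, "s1", 0, 1.0, "terminal", [])  # winning move
q_update(q, "s0", 4, 0.0, "s1", [0, 1])    # value now propagates back to s0
```

In a two-player game, `next_state` would typically be the position at the agent's next turn, that is, after the opponent has replied.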

AlphaZero also applies to two-player games other than Go.

Cover image taken from Welcome to Spinning Up in Deep RL!
