RL[0] - First Encounter

Outline

Background
Q-Learning with table
Q-Learning with network
Afterword

Background

RL is short for reinforcement learning, a subfield of machine learning. A rigorous definition reads:

Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

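My own reading is that RL is a search for an optimal solution: through various tricks, the computer learns to take the most advantageous action in an interactive setting, such as playing a game.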

Q-Learning with table

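Q-learning is one branch of RL algorithms. The definition, lifted from Wikipedia, is: Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP).
I started out with Playing Atari with Deep Reinforcement Learning and Simple Reinforcement Learning with Tensorflow; the paper is mainly about DQN.
The paper describes the MDP and the Bellman equation in detail. To summarize briefly:
In each situation our agent can pick one of a set of actions (A = {1, . . . , K}), and each action earns a corresponding reward. Which action to take depends on two factors:

How good the resulting position is
The reward the current action brings
The reward can be returned by the system or observed by the agent (as when a person plays a game by watching the screen). When choosing an action we need to consider not only which action yields the largest immediate reward but also which state it leads to, because the subsequent state determines the subsequent rewards. So the value of taking action a in state s should satisfy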

Q(s,a) = r + γ * max(Q(s',a'))

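Here γ is a discount coefficient that sets how much the future counts relative to the present, s is the current state, and s' is the state after the transition. This relation is the Bellman equation, and the agent picks the action with the largest Q(s,a).
The whole mathematical model and framework of environment, actions, and rewards is an MDP (Markov decision process).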
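To make the role of γ concrete, here is a small sketch that evaluates the right-hand side of the Bellman equation for a single transition. The function name `bellman_target` and all numbers are invented for illustration; they do not come from the paper.

```python
def bellman_target(reward, next_q_values, gamma):
    """Right-hand side of Q(s,a) = r + gamma * max(Q(s',a'))."""
    return reward + gamma * max(next_q_values)


# Hypothetical transition: no immediate reward, but the next state looks promising.
next_q = [0.0, 0.5, 1.0, 0.2]
patient = bellman_target(0.0, next_q, gamma=0.95)  # the future counts almost fully
myopic = bellman_target(0.0, next_q, gamma=0.10)   # the future is heavily discounted
print(patient, myopic)
```

The same transition is worth much more to an agent with a large γ, which is why γ is described as weighing the present against the future.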


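If the system returns the reward for the current action, then as long as we know the optimal value of the subsequent steps, every step can simply follow the Bellman equation. So the problem becomes how to solve for the optimal value of every state, store those values, and let the agent look them up in a table at run time.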
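The "look it up in a table" step can be sketched as follows. This is a minimal illustration with a made-up 3-state, 2-action table; `greedy_action` is a hypothetical helper name, not something from the referenced posts.

```python
def greedy_action(q_table, state):
    """Return the index of the highest-valued action for `state`."""
    row = q_table[state]
    return max(range(len(row)), key=lambda a: row[a])


# Hypothetical table; the values are invented for illustration.
toy_q_table = [
    [0.1, 0.9],  # state 0: action 1 looks best
    [0.7, 0.2],  # state 1: action 0 looks best
    [0.0, 0.0],  # state 2: a tie, so the first action wins
]
policy = [greedy_action(toy_q_table, s) for s in range(len(toy_q_table))]
print(policy)
```

Once the table is filled in, acting is just one row lookup and one argmax per step; all the work is in computing the table.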
Below we use FrozenLake as an example to see how the table of Q values is computed.

All the examples in this series come from Simple Reinforcement Learning with Tensorflow.

With OpenAI gym we can easily simulate many toy games.

The FrozenLake environment consists of a 4x4 grid of blocks, each one either being the start block, the goal block, a safe frozen block, or a dangerous hole. The objective is to have an agent learn to navigate from the start to the goal without moving onto a hole. At any given time the agent can choose to move either up, down, left, or right.
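The grid cells are exposed to the agent as plain integer states. The sketch below shows one way to flatten the grid, assuming the default 4x4 map that ships with gym's FrozenLake and row-major numbering; the helper names are hypothetical.

```python
# Assumed layout: gym's default 4x4 FrozenLake map
# (S = start, F = frozen, H = hole, G = goal).
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N_COLS = len(GRID[0])


def to_state(row, col):
    """Flatten a (row, col) grid position into an integer state index."""
    return row * N_COLS + col


def to_position(state):
    """Inverse of to_state: recover (row, col) from the state index."""
    return divmod(state, N_COLS)


holes = [to_state(r, c)
         for r, line in enumerate(GRID)
         for c, cell in enumerate(line) if cell == "H"]
goal = to_state(3, 3)
print(holes, goal)
```

Under these assumptions the 4x4 grid yields state indices 0 to 15, which is where the 16 rows of the Q table come from.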

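Each state has 4 actions and there are 16 states in total, so the table is 16x4. Action selection is partly random and partly greedy; this style of exploration is often called ε-greedy (the table version below adds decaying random noise to the Q values rather than using a fixed ε). I refactored the original author's variable names.
The code is as follows (lines starting with # are the original author's comments; lines starting with '# #' or wrapped in """ are comments I added)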

# coding=utf-8
import numpy as np
import gym

env = gym.make('FrozenLake-v0')

q_table = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
rewards = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    s = env.reset()
    reward_episode = 0
    game_over = False
    j = 0
    # The Q-Table learning algorithm
    while j < 99:
        j += 1
        # Choose an action by greedily (with noise) picking from Q table
        """
        randn draws samples from the standard normal distribution. The noise is scaled by 1 / (i + 1), so as i grows
        it has less and less influence on the decision; at the very beginning the choice is essentially random
        """
        action_to_be_taken = np.argmax(q_table[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
        # Get new state and reward from environment
        new_state, reward, game_over, _ = env.step(action_to_be_taken)
        # Update Q-Table with new knowledge
        """
        The Bellman equation gives the target Q(s,a) = r + γ*max(Q(s',a')), i.e.
        target = reward + y*max(q_table[new_state, :]). max(q_table[new_state, :]) keeps changing as training goes on,
        so instead of overwriting the old value q we move it a fraction lr towards the target:
        q + lr*(target - q) = (1 - lr)*q + lr*target.
        With lr = 1 this is exactly the Bellman target; with lr < 1 it blends the old estimate with the target
        """
        q_table[s, action_to_be_taken] = q_table[s, action_to_be_taken] + lr * (reward + y * np.max(q_table[new_state, :]) - q_table[s, action_to_be_taken])
        reward_episode += reward
        s = new_state
        if game_over:
            break
    rewards.append(reward_episode)

print("Score over time: " + str(sum(rewards) / num_episodes))
print("Final Q-Table Values")
print(q_table)

The original code uses 2000 episodes. I also tried 10000, 20000, and 30000; the Q-table values tend to stabilize as the number of episodes grows.
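The stabilizing effect can be reproduced without gym on a toy problem. The sketch below is not the FrozenLake experiment: it assumes a hypothetical deterministic four-cell corridor (reward 1 on reaching the last cell), purely random exploration, and the same lr and discount as the listing above. It compares the table after a short and a long run.

```python
import random

N_STATES, GOAL = 4, 3   # cells 0..3 on a line; reaching cell 3 ends the episode
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic toy dynamics: move one cell, reward 1 only at the goal."""
    nxt = max(state - 1, 0) if action == LEFT else state + 1
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done


def train(num_episodes, lr=0.8, gamma=0.95, seed=0):
    """Tabular Q-learning with uniformly random action selection."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(num_episodes):
        s = 0
        for _ in range(99):                      # step cap, as in the listing above
            a = rng.choice([LEFT, RIGHT])
            s2, r, done = step(s, a)
            q[s][a] += lr * (r + gamma * max(q[s2]) - q[s][a])
            s = s2
            if done:
                break
    return q


q_short, q_long = train(200), train(2000)
drift = max(abs(a - b)
            for row_s, row_l in zip(q_short, q_long)
            for a, b in zip(row_s, row_l))
print(drift)
print(q_long)
```

In this deterministic setting the extra episodes barely move the table, and the entries for moving right approach 1, 0.95, and 0.95 squared going back from the goal. FrozenLake is stochastic, so its table settles more slowly, but the pattern is the same.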

Q-Learning with network

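The table approach is efficient, but for real-world problems the table can become frighteningly large and may not fit in memory. This leads to another idea: instead of storing every Q value, compute an approximate Q value for each action from the current state s; the action with the largest Q value is the most advantageous choice.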

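In the FrozenLake example, we represent the current state as a 1x16 one-hot vector, and the output is the Q values of the 4 actions, so the network is a single 16x4 weight matrix. We train this matrix with TensorFlow.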
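With a one-hot input, multiplying by the weight matrix simply picks out one row, so this particular network behaves like a table whose entries are learned by gradient descent. The sketch below checks this in plain Python; the weight values and helper names are made up for illustration.

```python
def one_hot(index, size):
    """Row vector with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def forward(x, weights):
    """Compute x @ weights for a row vector x and a len(x)-by-n_actions matrix."""
    n_actions = len(weights[0])
    return [sum(x[i] * weights[i][a] for i in range(len(x)))
            for a in range(n_actions)]


# Hypothetical weights: entry [s][a] is s + a / 10 so each row is easy to recognise.
W = [[s + a / 10.0 for a in range(4)] for s in range(16)]
q_values = forward(one_hot(5, 16), W)
print(q_values == W[5])
```

A table-like network is no smaller than the table itself; the approach pays off once the input is something richer than a one-hot index.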
The loss function is:

loss = ∑(Q-target - Q)²
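For a linear model with a one-hot input, one gradient-descent step on this loss can be written out by hand. The sketch below assumes plain gradient descent (W minus learning rate times gradient) and uses hypothetical names; it is an illustration of the arithmetic, not the TensorFlow implementation.

```python
def sgd_step(weights, state, target, learning_rate):
    """One gradient-descent step on sum((target - q)**2) for a one-hot input.

    With a one-hot input, q is just weights[state], and the gradient is zero
    for every other row, so only that row is updated.
    """
    q = weights[state]
    loss = sum((t - v) ** 2 for t, v in zip(target, q))
    # d(loss) / d(weights[state][a]) = -2 * (target[a] - q[a])
    weights[state] = [v + learning_rate * 2.0 * (t - v) for t, v in zip(target, q)]
    return loss


W = [[0.0] * 4 for _ in range(16)]
target = list(W[3])
target[2] = 1.0                      # hypothetical target for action 2 in state 3
loss_before = sgd_step(W, 3, target, learning_rate=0.1)
print(loss_before, W[3])
```

Under these assumptions a learning rate of 0.1 moves the chosen entry 20% of the way to its target, so the step looks like the tabular update with lr = 0.2.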

The code is as follows:

# coding=utf-8
import matplotlib.pyplot as plt
import numpy as np
import gym
import tensorflow as tf

env = gym.make('FrozenLake-v0')

tf.reset_default_graph()


# These lines establish the feed-forward part of the network used to choose actions
input_state = tf.placeholder(shape=[1, 16], dtype=tf.float32)
xavier_init = tf.contrib.layers.xavier_initializer()
W = tf.Variable(xavier_init([16, 4]))
q_out = tf.matmul(input_state, W)
predict = tf.argmax(q_out, 1)[0]

# Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
target_q = tf.placeholder(shape=[1, 4], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(target_q - q_out))
trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
update_model = trainer.minimize(loss)

init = tf.global_variables_initializer()

# Set learning parameters
y = .99
e = 0.1
num_episodes = 2000
# create lists to contain total rewards and steps per episode
rewards = []
counts = []
with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        # Reset environment and get first new observation
        s = env.reset()
        reward_episode = 0
        d = False
        j = 0
        # The Q-Network
        # # cap the number of steps per episode so a game that never terminates cannot loop forever
        while j < 99:
            j += 1
            # Choose an action by greedily (with e chance of random action) from the Q-network
            # # np.identity(16)[s:s + 1] is a 1x16 one-hot row vector: entry k is 1 if k == s, otherwise 0
            a, q_out_value = sess.run([predict, q_out], feed_dict={input_state: np.identity(16)[s:s + 1]})
            # # ε-greedy selection
            if np.random.rand(1) < e:
                a = env.action_space.sample()
            # Get new state and reward from environment
            new_state, r, d, _ = env.step(a)
            # Obtain the Q' values by feeding the new state through our network
            new_q = sess.run(q_out, feed_dict={input_state: np.identity(16)[new_state:new_state + 1]})
            # Obtain maxQ' and set our target value for chosen action.
            new_max_q = np.max(new_q)
            target_value = q_out_value
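            # # only the chosen action's entry becomes the Bellman target; the others equal the prediction and add no loss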
            target_value[0, a] = r + y * new_max_q
            # Train our network using target and predicted Q values
            _, W1 = sess.run([update_model, W], feed_dict={input_state: np.identity(16)[s:s + 1], target_q: target_value})
            reward_episode += r
            s = new_state
            if d:
                # Reduce chance of random action as we train the model.
                e = 1. / ((i / 50) + 10)
                break
        rewards.append(reward_episode)
        counts.append(j)
print(W1)
print("Percent of successful episodes: " + str(100 * sum(rewards) / num_episodes) + "%")
plt.plot(rewards)
plt.plot(counts)
plt.show()

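The network in this example is so simple that it runs fine on a CPU, and it reaches a decent score after about 750 episodes. The plot is borrowed from the original post.

[Figure: image.png]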
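The exploration schedule in the listing, e = 1. / ((i / 50) + 10), can be looked at in isolation. The sketch below assumes integer division for i / 50 (Python 2 behaviour) and spells it as // so it behaves the same under Python 3; the function names are hypothetical.

```python
import random


def epsilon_for_episode(i):
    """Exploration rate set when episode i ends, mirroring e = 1. / ((i / 50) + 10)."""
    return 1.0 / ((i // 50) + 10)   # // reproduces Python 2's integer division


def choose_action(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
print(epsilon_for_episode(0), epsilon_for_episode(1999))
print(choose_action([0.0, 1.0, 0.0, 0.0], epsilon=0.0, rng=rng))
```

Under this assumption ε starts at 0.1 and falls to 1/49 by the last of the 2000 episodes, so the agent still explores a little at the end of training.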

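Afterword

This was a first encounter with RL; more interesting topics will follow. For example, when the agent reacts to the game images a human would see, a convolutional neural network should extract the image features instead of feeding in a one-hot-style array. The agent can also cache past transitions and randomly sample batches from them, which greatly improves training (experience replay). Further variants include Double DQN, which trains with two networks, and Dueling DQN, which separates the advantage (A) and value (V) streams within one network.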
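As a preview of experience replay, here is a minimal sketch of the buffer idea: keep recent transitions and draw random batches from them. `ReplayBuffer` and its methods are hypothetical names, and this is not the implementation from the DQN paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (s, a, r, s2, done) transitions; the oldest are evicted."""

    def __init__(self, capacity, seed=None):
        self._items = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, s, a, r, s2, done):
        self._items.append((s, a, r, s2, done))

    def sample(self, batch_size):
        """Uniformly sample a batch of distinct stored transitions."""
        return self._rng.sample(list(self._items), batch_size)

    def __len__(self):
        return len(self._items)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):                       # made-up transitions t -> t + 1
    buffer.add(t, 0, 0.0, t + 1, False)
batch = buffer.sample(2)
print(len(buffer), batch)
```

Sampling at random breaks the correlation between consecutive steps of one episode, which is the usual motivation given for the technique.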

Thanks to the Medium author for the painstaking explanations and to DeepMind for generously publishing the paper.
