PS: a quick aside on some RL terminology.
(You’ll see in papers that the RL process is called the Markov Decision Process (MDP).)
Comparing Monte Carlo and Temporal Difference Learning
- Monte Carlo: learning at the end of the episode. It waits until a full episode is finished, then updates the value function toward the actual return Gt.
The corresponding update rule:
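The formula itself did not survive here; the standard tabular Monte Carlo update it most likely refers to is:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]
```

where $G_t$ is the full return observed at the end of the episode and $\alpha$ is the learning rate.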
- Temporal Difference Learning: learning at each step.
The TD target is the learning target of this temporal difference update.
The corresponding update rule:
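Again the formula is missing here; the standard TD(0) update it most likely refers to is:

```latex
V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]
```

The bracketed term $R_{t+1} + \gamma V(S_{t+1})$ is the TD target: instead of waiting for the full return $G_t$, it bootstraps from the current estimate of the next state's value.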
on-policy vs. off-policy
... ...
value-based vs. policy-based
“Value-based method”: it finds its optimal policy indirectly, by training a value function or action-value function that tells us the value of each state or each state-action pair.
“Policy-based method”: it learns the policy directly, mapping states to actions without training a value function.
TD approach
“Uses a TD approach”: updates its action-value function at each step.
state-value function vs. action-value function
... ...
Q-learning
Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function:
Explanations of all the terms above can be found in the linked article; this is just a record for reference.
The Bellman-style update used to fill in the Q-table is:
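The equation did not survive extraction; the standard Q-learning update it refers to is:

```latex
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]
```

Note the $\max_a$ in the TD target: the update bootstraps from the greedy action in the next state, regardless of which action the behaviour policy actually takes next, which is what makes Q-learning off-policy.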
In practice, when designing the environment for an experiment, the rewards, penalties, and termination conditions all have to be assigned by hand in the simulation: the values at these special locations are fixed in advance, so these parameters can be treated as known. What we then need to do is tune the values in the final Q-table so that, under the corresponding policy and across the different states and actions, the final cumulative reward is maximized as far as possible.
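To make this concrete, here is a minimal, self-contained tabular Q-learning sketch. The 1-D gridworld environment, the +1 reward at the goal, and all hyperparameters are illustrative assumptions of mine, not taken from the original article.

```python
import random

# Toy environment: states 0..4 on a line; reaching state 4 (the goal)
# gives reward +1 and ends the episode. These "special locations" and
# their values are fixed in advance, as the text above describes.
N_STATES, GOAL = 5, 4
ACTIONS = (0, 1)                 # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = (nxt == GOAL)
    return nxt, (1.0 if done else 0.0), done

def choose_action(q, s):
    """Epsilon-greedy behaviour policy with random tie-breaking."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(q[s])
    return random.choice([a for a in ACTIONS if q[s][a] == best])

def train(episodes=500, seed=0):
    random.seed(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]   # the Q-table
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = choose_action(q, s)
            s2, r, done = step(s, a)
            # Off-policy TD update: bootstrap with the max over next
            # actions, regardless of what the behaviour policy does next.
            q[s][a] += ALPHA * (r + GAMMA * max(q[s2]) - q[s][a])
            s = s2
    return q

q_table = train()
# The greedy policy read off the Q-table should move right everywhere.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
print(policy)
```

After training, reading the greedy action out of each row of the Q-table recovers the policy that maximizes the cumulative reward, which is exactly the role the Q-table plays in the paragraph above.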