Chapter 7: n-step Bootstrapping
n-step TD methods span a spectrum with MC methods at one end and one-step TD methods at the other.
n-step TD Prediction
The target of the value-function update in n-step TD combines the first $n$ sampled rewards with a bootstrap from the value estimate of the state reached after $n$ steps:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}).$$

Correspondingly, the update rule is

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right].$$

Note that only the value of $S_t$ changes at step $t+n$, while the values of all the other states remain unchanged, i.e., $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \neq S_t$. An issue here is that the value estimate of $S_{t+n}$ may have been updated long ago and may not reflect the true value of $S_{t+n}$ under the current policy in expectation. This is not covered in the RL book, and I'm not sure whether it causes any problem in RL.
One-step TD (as introduced in the last chapter) can be seen as the special case of n-step TD with $n = 1$, i.e., $G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$. MC can be seen as the extreme of n-step TD in the opposite direction, reached when $n$ covers the rest of the episode, i.e., $G_{t:t+n} = G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$ whenever $t + n \geq T$.
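To make the mechanics concrete, here is a minimal Python sketch (my own illustration, not the book's pseudocode) of a single n-step TD prediction update; the function name, the episode-as-lists layout, and the convention that `rewards[i]` is the reward received on entering `states[i]` are all assumptions of this sketch.

```python
def n_step_td_update(V, states, rewards, tau, n, alpha, gamma):
    """One n-step TD update of V for the state visited at time tau.

    V          : mapping from state to value estimate (dict or array)
    states[t]  : state visited at time t, for t = 0, ..., T
    rewards[i] : reward received on the transition into states[i], i = 1, ..., T
    """
    T = len(states) - 1
    # G = R_{tau+1} + gamma R_{tau+2} + ..., truncated at the end of the episode
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # Bootstrap with the current estimate of the state reached n steps later
    if tau + n < T:
        G += gamma ** n * V[states[tau + n]]
    # Only V(S_tau) changes; every other entry of V is left untouched
    V[states[tau]] += alpha * (G - V[states[tau]])
```

With $n = 1$ this reduces to the one-step TD update, and with $t + n \geq T$ it reduces to a Monte Carlo update, matching the two extremes above.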
n-step Sarsa
n-step Sarsa is a natural generalization of one-step Sarsa, with the target estimate

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}),$$

and the corresponding update rule

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right].$$

The RL book gives a gridworld example, as shown below, to illustrate the advantage of n-step Sarsa over one-step Sarsa: when the reward is sparse, n-step Sarsa helps propagate the reward back to earlier states more quickly.
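Here is a sketch of the corresponding action-value update, under the same assumed episode layout as the prediction sketch above (again my own illustration, not the book's pseudocode).

```python
def n_step_sarsa_update(Q, states, actions, rewards, tau, n, alpha, gamma):
    """One n-step Sarsa update of Q for the state-action pair visited at time tau.

    Q is a dict mapping (state, action) -> value; T = len(states) - 1.
    """
    T = len(states) - 1
    # Sampled rewards for (up to) n steps ...
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # ... plus a bootstrap from the current estimate of Q(S_{tau+n}, A_{tau+n})
    if tau + n < T:
        G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
    Q[(states[tau], actions[tau])] += alpha * (G - Q[(states[tau], actions[tau])])
```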
n-step Off-policy Control
For n-step off-policy control, we need to take importance sampling into consideration, which leads to the update rule

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right],$$

where

$$\rho_{t:h} = \prod_{k=t}^{\min(h,\, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

Note that the ratio starts from step $t+1$ because we do not have to care how likely we were to select the action $A_t$; now that we have selected it, we want to learn fully from what happens, with importance sampling only for subsequent actions. This also explains why one-step Q-learning does not have the ratio term: its target $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ bootstraps with an expectation under the greedy target policy rather than with the sampled action $A_{t+1}$, so the ratio reduces to the empty product $\rho_{t+1:t} = 1$ for $n = 1$.
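As an illustration of where the ratio enters, here is a sketch of the off-policy n-step Sarsa update under the same assumptions as the sketches above; `pi(a, s)` and `b(a, s)` are assumed to return the target- and behavior-policy probabilities of action `a` in state `s`.

```python
def off_policy_n_step_sarsa_update(Q, pi, b, states, actions, rewards,
                                   tau, n, alpha, gamma):
    """One off-policy n-step Sarsa update with importance sampling."""
    T = len(states) - 1
    # rho_{tau+1 : tau+n}: starts one step later than the state-value case,
    # because A_tau is already fixed by the pair being updated
    rho = 1.0
    for k in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    # Same n-step return as in on-policy n-step Sarsa
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
    Q[(states[tau], actions[tau])] += alpha * rho * (G - Q[(states[tau], actions[tau])])
```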
Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
Importance sampling is required in the off-policy n-step Sarsa update above because it is a sample-update method: we multiply by the importance sampling ratio to make the expected update unbiased w.r.t. the target policy $\pi$. A natural way to avoid importance sampling is thus to perform an expected update w.r.t. the target policy $\pi$ itself, which is the idea of the n-step tree-backup algorithm.
In its simplest case with $n = 1$, tree backup is exactly Expected Sarsa, i.e.,

$$G_{t:t+1} = R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a).$$
For $n = 2$, we can further expand the term for the action actually taken, $A_{t+1}$, in the equation above to get a two-step target:

$$G_{t:t+2} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}.$$

Recursively, we can get the general n-step tree-backup target

$$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n},$$

for $t < T - 1$ and $n \geq 2$, with $G_{T-1:T-1+n} = R_T$. And the update rule, without importance sampling, is

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right].$$
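Finally, a recursive sketch of the tree-backup target above (my own illustration; `pi(a, s)` is the target-policy probability and `action_space` is assumed to enumerate all available actions).

```python
def tree_backup_return(Q, pi, action_space, states, actions, rewards, t, n, gamma):
    """Recursively compute the n-step tree-backup return G_{t:t+n}.

    Q is a dict mapping (state, action) -> value; T = len(states) - 1.
    """
    T = len(states) - 1
    if t == T - 1:
        # Last step before termination: G_{T-1:T-1+n} = R_T
        return rewards[T]
    if n == 1:
        # One-step case: exactly the Expected Sarsa target
        return rewards[t + 1] + gamma * sum(
            pi(a, states[t + 1]) * Q[(states[t + 1], a)] for a in action_space)
    # Actions not taken at t+1 contribute their current estimates, weighted by pi;
    # the taken action A_{t+1} is followed one level deeper in the tree.
    a_taken = actions[t + 1]
    leaves = sum(pi(a, states[t + 1]) * Q[(states[t + 1], a)]
                 for a in action_space if a != a_taken)
    return (rewards[t + 1]
            + gamma * leaves
            + gamma * pi(a_taken, states[t + 1])
            * tree_backup_return(Q, pi, action_space, states, actions, rewards,
                                 t + 1, n - 1, gamma))
```

The update itself is then the usual n-step form, `Q[(states[t], actions[t])] += alpha * (G - Q[(states[t], actions[t])])`, with no importance sampling ratio anywhere.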