Chapter 7: n-step Bootstrapping
n-step TD methods span a spectrum with MC methods at one end and one-step TD methods at the other.
n-step TD Prediction
The target of the value-function update in n-step TD combines the first $n$ sampled rewards with a bootstrap from the value estimate of the state reached after $n$ steps:

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}).$$

Correspondingly, the update rule is

$$V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \left[ G_{t:t+n} - V_{t+n-1}(S_t) \right].$$

Note that only the value of $S_t$ changes at step $t+n$, while the values of all the other states remain unchanged, i.e., $V_{t+n}(s) = V_{t+n-1}(s)$ for all $s \neq S_t$. An issue here is that the value estimate of $S_{t+n}$ may have been updated long ago and may not reflect the true value of $S_{t+n}$ under the current policy in expectation. This is not covered in the RL book, and I'm not sure whether it causes any problem in RL.
One-step TD (as introduced in the last chapter) can be seen as the special case of n-step TD with $n = 1$, i.e., $G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1})$. MC can be seen as the extreme of n-step TD in the opposite direction, reached when $n$ covers the rest of the episode, i.e., $G_{t:t+n} = G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T$ whenever $t + n \geq T$.
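To make the mechanics concrete, here is a minimal Python sketch (my own illustration, not the book's pseudocode) of a single n-step TD prediction update; the function name, the episode-as-lists layout, and the convention that `rewards[i]` is the reward received on entering `states[i]` are all assumptions of this sketch.

```python
def n_step_td_update(V, states, rewards, tau, n, alpha, gamma):
    """One n-step TD update of V for the state visited at time tau.

    V          : mapping from state to value estimate (dict or array)
    states[t]  : state visited at time t, for t = 0, ..., T
    rewards[i] : reward received on the transition into states[i], i = 1, ..., T
    """
    T = len(states) - 1
    # G = R_{tau+1} + gamma R_{tau+2} + ..., truncated at the end of the episode
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # Bootstrap with the current estimate of the state reached n steps later
    if tau + n < T:
        G += gamma ** n * V[states[tau + n]]
    # Only V(S_tau) changes; every other entry of V is left untouched
    V[states[tau]] += alpha * (G - V[states[tau]])
```

With $n = 1$ this reduces to the one-step TD update, and with $t + n \geq T$ it reduces to a Monte Carlo update, matching the two extremes above.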
n-step Sarsa
n-step Sarsa is a natural generalization of one-step Sarsa, with the target estimate

$$G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n}),$$

and the corresponding update rule

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right].$$

The RL book gives a gridworld example, as shown below, to illustrate the advantage of n-step Sarsa over one-step Sarsa: when the reward is sparse, n-step Sarsa helps propagate the reward back to earlier states more quickly.
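Here is a sketch of the corresponding action-value update, under the same assumed episode layout as the prediction sketch above (again my own illustration, not the book's pseudocode).

```python
def n_step_sarsa_update(Q, states, actions, rewards, tau, n, alpha, gamma):
    """One n-step Sarsa update of Q for the state-action pair visited at time tau.

    Q is a dict mapping (state, action) -> value; T = len(states) - 1.
    """
    T = len(states) - 1
    # Sampled rewards for (up to) n steps ...
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # ... plus a bootstrap from the current estimate of Q(S_{tau+n}, A_{tau+n})
    if tau + n < T:
        G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
    Q[(states[tau], actions[tau])] += alpha * (G - Q[(states[tau], actions[tau])])
```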
n-step Off-policy Control
For n-step off-policy control, we need to take importance sampling into consideration, which leads to the update rule

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right],$$

where

$$\rho_{t:h} = \prod_{k=t}^{\min(h,\, T-1)} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}.$$

Note that the ratio starts from step $t+1$ because we do not have to care how likely we were to select the action $A_t$; now that we have selected it, we want to learn fully from what happens, with importance sampling only for subsequent actions. This also explains why one-step Q-learning does not have the ratio term: its target $R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ bootstraps with an expectation under the greedy target policy rather than with the sampled action $A_{t+1}$, so the ratio reduces to the empty product $\rho_{t+1:t} = 1$ for $n = 1$.
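As an illustration of where the ratio enters, here is a sketch of the off-policy n-step Sarsa update under the same assumptions as the sketches above; `pi(a, s)` and `b(a, s)` are assumed to return the target- and behavior-policy probabilities of action `a` in state `s`.

```python
def off_policy_n_step_sarsa_update(Q, pi, b, states, actions, rewards,
                                   tau, n, alpha, gamma):
    """One off-policy n-step Sarsa update with importance sampling."""
    T = len(states) - 1
    # rho_{tau+1 : tau+n}: starts one step later than the state-value case,
    # because A_tau is already fixed by the pair being updated
    rho = 1.0
    for k in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    # Same n-step return as in on-policy n-step Sarsa
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[(states[tau + n], actions[tau + n])]
    Q[(states[tau], actions[tau])] += alpha * rho * (G - Q[(states[tau], actions[tau])])
```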
Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
Importance sampling is required in the off-policy n-step Sarsa update above because it is a sample-update method: we multiply by the importance sampling ratio to make the expected update unbiased w.r.t. the target policy $\pi$. A natural way to avoid importance sampling is thus to perform an expected update w.r.t. the target policy $\pi$ itself, which is the idea of the n-step tree-backup algorithm.
In its simplest case with $n = 1$, tree backup is exactly Expected Sarsa, i.e.,

$$G_{t:t+1} = R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) Q_t(S_{t+1}, a).$$
For $n = 2$, we can further expand the term for the action actually taken, $A_{t+1}$, in the equation above to get a two-step target:

$$G_{t:t+2} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+2}.$$

Recursively, we can get the general n-step tree-backup target

$$G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a \mid S_{t+1}) Q_{t+n-1}(S_{t+1}, a) + \gamma \pi(A_{t+1} \mid S_{t+1}) G_{t+1:t+n},$$

for $t < T - 1$ and $n \geq 2$, with $G_{T-1:T-1+n} = R_T$. And the update rule, without importance sampling, is

$$Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \left[ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) \right].$$
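Finally, a recursive sketch of the tree-backup target above (my own illustration; `pi(a, s)` is the target-policy probability and `action_space` is assumed to enumerate all available actions).

```python
def tree_backup_return(Q, pi, action_space, states, actions, rewards, t, n, gamma):
    """Recursively compute the n-step tree-backup return G_{t:t+n}.

    Q is a dict mapping (state, action) -> value; T = len(states) - 1.
    """
    T = len(states) - 1
    if t == T - 1:
        # Last step before termination: G_{T-1:T-1+n} = R_T
        return rewards[T]
    if n == 1:
        # One-step case: exactly the Expected Sarsa target
        return rewards[t + 1] + gamma * sum(
            pi(a, states[t + 1]) * Q[(states[t + 1], a)] for a in action_space)
    # Actions not taken at t+1 contribute their current estimates, weighted by pi;
    # the taken action A_{t+1} is followed one level deeper in the tree.
    a_taken = actions[t + 1]
    leaves = sum(pi(a, states[t + 1]) * Q[(states[t + 1], a)]
                 for a in action_space if a != a_taken)
    return (rewards[t + 1]
            + gamma * leaves
            + gamma * pi(a_taken, states[t + 1])
            * tree_backup_return(Q, pi, action_space, states, actions, rewards,
                                 t + 1, n - 1, gamma))
```

The update itself is then the usual n-step form, `Q[(states[t], actions[t])] += alpha * (G - Q[(states[t], actions[t])])`, with no importance sampling ratio anywhere.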