Chapter 4: Dynamic Programming
Dynamic programming computes optimal policies given a perfect model of the environment as an MDP.
Policy Iteration
- Policy evaluation: compute the value function $v_\pi$ of a given policy $\pi$ iteratively as
  $$v_{k+1}(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr].$$
  The existence and uniqueness of $v_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$, and the sequence $\{v_k\}$ converges to $v_\pi$ as $k \to \infty$ under the same condition. (A code sketch follows this list.)
  These value updates are called expected updates because they are based on an expectation over all possible next states rather than on a single sample next state.
  The updates can also be computed in place, which lets new state values be used as soon as they are available. Convergence is still guaranteed for in-place updates, and the order in which states have their values updated during a sweep has a significant influence on the rate of convergence.
- Policy improvement: improve the policy by making it greedy w.r.t. the value function of the original policy. The policy improvement theorem guarantees that this gives a strictly better policy except when the original policy is already optimal.
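As a concrete illustration, here is a minimal sketch of iterative policy evaluation with in-place expected updates. The transition format `P[s][a] = [(prob, next_state, reward, done), ...]`, the function name `policy_evaluation`, and all parameter names are assumptions made for this sketch, not something prescribed by the chapter.

```python
import numpy as np

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation with in-place expected updates.

    Assumes P[s][a] is a list of (prob, next_state, reward, done) tuples
    and policy[s][a] is the probability of taking action a in state s.
    """
    n_states = len(P)
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = 0.0
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward, done in P[s][a]:
                    # Expected update: average over all possible next states,
                    # bootstrapping from the current estimate V[s_next].
                    v_new += pi_sa * prob * (reward + gamma * V[s_next] * (not done))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place: later states in this sweep already see the new value
        if delta < theta:
            return V
```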
Policy iteration alternates between policy evaluation and policy improvement until the policy converges to an optimal one.
More generally, generalized policy iteration (GPI) refers to the idea of letting policy evaluation and policy improvement interact, competing and cooperating with each other to reach the optimum.
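To show how the two processes fit together, the following sketch adds greedy policy improvement and the outer policy iteration loop on top of the `policy_evaluation` routine above. It keeps the same assumed `P[s][a]` format; all names are illustrative.

```python
def greedy_policy(P, V, gamma=0.9):
    """Policy improvement: return the policy that is greedy w.r.t. the value estimate V."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.zeros((n_states, n_actions))
    for s in range(n_states):
        q = np.zeros(n_actions)
        for a in range(n_actions):
            for prob, s_next, reward, done in P[s][a]:
                # One-step look-ahead through the model to get action values.
                q[a] += prob * (reward + gamma * V[s_next] * (not done))
        policy[s, np.argmax(q)] = 1.0  # deterministic greedy choice
    return policy

def policy_iteration(P, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    n_states, n_actions = len(P), len(P[0])
    policy = np.ones((n_states, n_actions)) / n_actions  # start from the uniform random policy
    while True:
        V = policy_evaluation(P, policy, gamma)  # evaluate the current policy
        new_policy = greedy_policy(P, V, gamma)  # make it greedy w.r.t. that value function
        if np.array_equal(new_policy, policy):   # a stable greedy policy is optimal in a finite MDP
            return new_policy, V
        policy = new_policy
```

With a small tabular environment expressed in this format (e.g., a toy gridworld), `policy_iteration(P)` would return a deterministic optimal policy together with its value function.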
Variants with Approximations for Acceleration
- Value iteration: update the value function as
  $$v_{k+1}(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr],$$
  which can be seen as policy iteration with the policy evaluation step truncated to a single sweep.
  It can also be seen as directly solving the nonlinear Bellman optimality equation with an iterative method. (A code sketch follows this list.)
- Asynchronous DP: update the values of only some states instead of sweeping the whole state set.
  Convergence is guaranteed only if the algorithm continues to update the values of all the states, so asynchronous DP does not necessarily mean less computation.
  But it provides more flexibility and good ways to incorporate domain knowledge. First, we can try to order the updates so that value information propagates from state to state efficiently. Second, we might skip updating some states entirely if they are not relevant to optimal behavior. Third, we can use the real-time interaction trajectory to decide which states to update, which helps the DP algorithm focus on the subset of states most relevant to the agent.
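For comparison, here is a sketch of value iteration in the same assumed `P[s][a]` format, reusing the hypothetical `greedy_policy` helper from the previous block. Restricting the sweep in the inner loop to a chosen subset of states would turn it into a simple asynchronous variant.

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Value iteration: a single evaluation sweep per iteration, with a max over actions."""
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):  # an asynchronous variant would sweep only a chosen subset here
            q = np.zeros(n_actions)
            for a in range(n_actions):
                for prob, s_next, reward, done in P[s][a]:
                    q[a] += prob * (reward + gamma * V[s_next] * (not done))
            v_new = q.max()  # Bellman optimality update
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    return V, greedy_policy(P, V, gamma)  # read off the greedy policy from the converged values
```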
Efficiency of DP
Given an MDP with $n$ states and $k$ actions, direct search requires on the order of $k^n$ steps to exhaustively examine all possible deterministic policies. A DP method is guaranteed to find an optimal policy in time polynomial in $n$ and $k$.
I think the efficiency improvement mainly comes from utilizing the system dynamics, which help model the structure of the state space.
DP can be seen as a model-based method, so even if we follow a deterministic policy, we can still "imagine" how other actions would turn out without explicit exploration.
Approximations in DP
- Policy evaluation may be noisy, and it depends on the value estimates of successor states (bootstrapping).
- Even if the policy evaluation is accurate, the policy is improved greedily with respect to the value function of the original policy, not of the improved one.