1.Policy Gradient
1.1.Expected reward
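$\bar{R}_\theta = \sum_\tau R(\tau)\, p_\theta(\tau)$
$R(\tau)$: reward
$\theta$: actor (parameter)
$\tau$: trajectory
in which:
$p_\theta(\tau) = p(s_1) \prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$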
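As a hedged illustration, the expected reward can be computed exactly for a toy problem by enumerating every trajectory and summing reward times probability. The sketch assumes a single state, so the initial-state and transition terms are all 1 and the trajectory probability is just the product of action probabilities. The softmax policy, the function names, and the reward function are assumptions made for the example.

```python
import itertools
import numpy as np

def softmax(theta):
    """Turn unnormalized preferences into action probabilities."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def expected_reward(theta, horizon, reward_fn):
    """Exact sum over tau of R(tau) * p_theta(tau) for a stateless toy problem.

    Assumption: one state only, so p_theta(tau) reduces to the product of
    the action probabilities along the trajectory.
    """
    probs = softmax(np.asarray(theta, dtype=float))
    total = 0.0
    for actions in itertools.product(range(len(probs)), repeat=horizon):
        p_tau = np.prod([probs[a] for a in actions])  # p_theta(tau)
        total += reward_fn(actions) * p_tau           # R(tau) * p_theta(tau)
    return total

# Hypothetical reward: number of times action 1 is chosen in the episode.
count_ones = lambda actions: float(sum(a == 1 for a in actions))
r_bar = expected_reward([0.0, 0.0], horizon=3, reward_fn=count_ones)
```

With equal preferences each action has probability 0.5, so `r_bar` is 3 × 0.5 = 1.5. Enumeration is only feasible for tiny problems, which is why the gradient is estimated from sampled trajectories in practice.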
1.2.Maximize Expected Reward
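optimize (gradient ascent on the expected reward $\bar{R}_\theta$ with learning rate $\eta$):
$\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$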
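A minimal sketch of one gradient ascent step, assuming the usual sample-based estimate $\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n} \sum_{t} R(\tau^n) \nabla \log p_\theta(a_t^n \mid s_t^n)$. The stateless softmax policy, the function names, and the two hand-written trajectories are assumptions for illustration, not part of the notes.

```python
import numpy as np

def softmax(theta):
    """Turn unnormalized preferences into action probabilities."""
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def policy_gradient_estimate(theta, trajectories):
    """Sample estimate (1/N) sum_n R(tau^n) sum_t grad log p_theta(a_t^n).

    Assumption: a stateless softmax policy, for which
    grad_theta log p_theta(a) = onehot(a) - softmax(theta).
    `trajectories` is a list of (actions, total_reward) pairs.
    """
    theta = np.asarray(theta, dtype=float)
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for actions, total_reward in trajectories:
        for a in actions:
            score = -probs.copy()
            score[a] += 1.0                # onehot(a) - probs
            grad += total_reward * score
    return grad / len(trajectories)

theta = np.zeros(2)
# Two hand-written trajectories (hypothetical data): actions and R(tau).
batch = [([1, 1], 2.0), ([0, 1], 1.0)]
grad = policy_gradient_estimate(theta, batch)
eta = 0.1                                   # learning rate
theta_new = theta + eta * grad              # gradient ascent step
```

Here the first trajectory (reward 2, action 1 twice) pushes the preference toward action 1, while the second contributes nothing because its two score terms cancel. The estimate is `[-1, 1]` and the updated parameters are `[-0.1, 0.1]`.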
1.3.Tips
1.3.1.Add a Baseline (reduce variance)
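A hedged reconstruction of this tip, assuming the usual sample-based gradient estimate over $N$ sampled trajectories $\tau^n$: subtracting a baseline $b$ (for example $b \approx E[R(\tau)]$, an assumed choice) lets the weight on each term be negative as well as positive, which reduces variance when every reward is positive.

$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( R(\tau^n) - b \right) \nabla \log p_\theta(a_t^n \mid s_t^n)$$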
1.3.2.Assign Suitable Credit
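A hedged sketch of this tip, assuming the usual sample-based gradient estimate with per-step rewards $r_{t'}^n$ and a baseline $b$ (symbols assumed here): an action taken at step $t$ cannot affect rewards received before $t$, so the whole-trajectory reward is replaced by the reward collected from step $t$ onward.

$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( \sum_{t'=t}^{T_n} r_{t'}^n - b \right) \nabla \log p_\theta(a_t^n \mid s_t^n)$$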
1.3.3.Add Discount Factor
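A minimal sketch of the per-step weight once a baseline, reward-to-go credit, and a discount factor are combined, assuming per-step rewards are available: step $t$ is weighted by $\sum_{t' \ge t} \gamma^{t'-t} r_{t'} - b$ with $\gamma < 1$, so later rewards count less. The function name `credit_weights` and the numbers are illustrative.

```python
import numpy as np

def credit_weights(rewards, gamma=0.99, baseline=0.0):
    """Discounted reward-to-go minus a baseline, one weight per time step.

    weights[t] = sum_{t' >= t} gamma**(t' - t) * rewards[t'] - baseline
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.zeros_like(rewards)
    running = 0.0
    # Walk backward so each step reuses the discounted sum of the next one.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        weights[t] = running - baseline
    return weights

weights = credit_weights([1.0, 0.0, 2.0], gamma=0.5, baseline=1.0)
```

For these rewards the discounted rewards-to-go are 1.5, 1.0, and 2.0, so `weights` is `[0.5, 0.0, 1.0]` after subtracting the baseline of 1.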
2.Proximal Policy Optimization
2.1.Advantage Function
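A hedged reconstruction, since the notes give only the heading: in the standard formulation the weight on each log-probability term is written as an advantage function, which measures how much better action $a_t$ is than the policy's average behavior in state $s_t$. The symbols $Q^\theta$ (action value) and $V^\theta$ (state value) are assumptions introduced here.

$$A^\theta(s_t, a_t) = Q^\theta(s_t, a_t) - V^\theta(s_t)$$

$$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} A^\theta(s_t^n, a_t^n)\, \nabla \log p_\theta(a_t^n \mid s_t^n)$$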
2.2.On-policy vs. Off-policy
Importance Sampling
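A small sketch of the importance sampling identity $E_{x \sim p}[f(x)] = E_{x \sim q}\left[f(x) \frac{p(x)}{q(x)}\right]$, assuming a finite set of outcomes and $q(x) > 0$ wherever $p(x) > 0$. Samples are drawn from $q$ only, then reweighted by $p(x)/q(x)$. The distributions, values, and function name are made up for illustration.

```python
import numpy as np

def importance_sampling_estimate(f_values, p, q, num_samples, rng):
    """Estimate E_{x~p}[f(x)] using samples drawn from q, reweighted by p/q."""
    f_values, p, q = (np.asarray(v, dtype=float) for v in (f_values, p, q))
    xs = rng.choice(len(q), size=num_samples, p=q)    # x ~ q
    return float(np.mean(f_values[xs] * p[xs] / q[xs]))

f_values = [1.0, 2.0, 4.0]          # f(x) for three outcomes
p = [0.2, 0.3, 0.5]                 # target distribution
q = [0.5, 0.25, 0.25]               # sampling distribution
direct = float(np.dot(p, f_values)) # exact E_{x~p}[f(x)]
rng = np.random.default_rng(0)
estimate = importance_sampling_estimate(f_values, p, q, 200_000, rng)
```

The exact value `direct` is 2.8 and the reweighted estimate lands close to it. The estimate gets noisier as $p$ and $q$ move apart, which is the reason the later sections keep the two policies close.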
Off-policy means sampling data from a separate policy $\pi_{\theta'}$ and using that data to train $\theta$.
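Objective function:
$J^{\theta'}(\theta) = E_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)} A^{\theta'}(s_t, a_t) \right]$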
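A minimal sketch of the importance-weighted objective as a sample average, assuming each sampled state-action pair comes with its probability under the new policy, its probability under the old (sampling) policy, and an advantage estimate from the old policy. The function name and numbers are hypothetical.

```python
import numpy as np

def surrogate_objective(new_probs, old_probs, advantages):
    """Mean of [p_theta(a|s) / p_theta'(a|s)] * A(s, a) over sampled pairs."""
    new_probs = np.asarray(new_probs, dtype=float)
    old_probs = np.asarray(old_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    ratio = new_probs / old_probs          # importance weight per sample
    return float(np.mean(ratio * advantages))

old_probs = [0.5, 0.25, 0.5]    # p_theta'(a_t | s_t), the sampling policy
new_probs = [0.6, 0.2, 0.5]     # p_theta(a_t | s_t), the policy being trained
advantages = [1.0, -2.0, 0.5]   # advantage estimates from the sampling policy
j = surrogate_objective(new_probs, old_probs, advantages)
j_same = surrogate_objective(old_probs, old_probs, advantages)
```

When the two policies are identical every ratio is 1 and the objective is just the mean advantage (`j_same`). Raising the probability of the positive-advantage action and lowering that of the negative-advantage action makes `j` larger.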
2.3.Trust Region Policy Optimization (TRPO)
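Constrained optimization:
$\max_\theta J^{\theta'}(\theta)$ subject to $KL(\theta, \theta') < \delta$ ($J^{\theta'}(\theta)$ is the objective function from 2.2)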
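A hedged sketch of the trust-region check, assuming categorical policies with strictly positive probabilities and taking the KL term as the average over states of KL(old || new). The direction of the KL and the helper names are assumptions. This only tests whether a candidate policy satisfies the constraint; it does not solve the constrained problem.

```python
import numpy as np

def mean_kl(old_dists, new_dists):
    """Average over states of KL(old(.|s) || new(.|s)) for categorical policies."""
    old_dists = np.asarray(old_dists, dtype=float)
    new_dists = np.asarray(new_dists, dtype=float)
    per_state = np.sum(old_dists * np.log(old_dists / new_dists), axis=1)
    return float(np.mean(per_state))

def within_trust_region(old_dists, new_dists, delta):
    """True if the candidate policy stays within KL distance delta of the old one."""
    return mean_kl(old_dists, new_dists) < delta

# Each row is the action distribution in one state (hypothetical numbers).
old = [[0.5, 0.5], [0.5, 0.5]]
candidate = [[0.5, 0.5], [0.25, 0.75]]
kl = mean_kl(old, candidate)
```

For these numbers `kl` is about 0.072, so the candidate passes with $\delta = 0.1$ and fails with $\delta = 0.05$.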
2.4.PPO
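Unconstrained optimization:
$J_{PPO}^{\theta'}(\theta) = J^{\theta'}(\theta) - \beta\, KL(\theta, \theta')$ ($J^{\theta'}(\theta)$ is the objective function from 2.2; the KL constraint of TRPO becomes a penalty with weight $\beta$)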
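A minimal sketch of the KL-penalized objective, assuming categorical policies with strictly positive probabilities, one full action distribution per sampled state, and KL(old || new) averaged over the samples. The function name, the KL direction, and the data are assumptions for illustration.

```python
import numpy as np

def ppo_penalty_objective(new_dists, old_dists, actions, advantages, beta):
    """Importance-weighted surrogate minus beta times the mean KL(old || new)."""
    new_dists = np.asarray(new_dists, dtype=float)
    old_dists = np.asarray(old_dists, dtype=float)
    actions = np.asarray(actions)
    advantages = np.asarray(advantages, dtype=float)
    idx = np.arange(len(actions))
    # Probability ratio of the action actually taken in each sampled state.
    ratio = new_dists[idx, actions] / old_dists[idx, actions]
    surrogate = np.mean(ratio * advantages)
    kl = np.mean(np.sum(old_dists * np.log(old_dists / new_dists), axis=1))
    return float(surrogate - beta * kl)

# Two sampled states; each row is the action distribution in that state.
old = [[0.5, 0.5], [0.5, 0.5]]
new = [[0.5, 0.5], [0.25, 0.75]]
actions = [0, 1]                 # actions taken under the old policy
advantages = [1.0, 2.0]          # advantage estimates from the old policy
unpenalized = ppo_penalty_objective(new, old, actions, advantages, beta=0.0)
penalized = ppo_penalty_objective(new, old, actions, advantages, beta=1.0)
```

With `beta=0.0` the value is the plain surrogate, 2.0. With `beta=1.0` the mean KL of about 0.072 is subtracted, so a larger $\beta$ discourages moving far from the old policy without needing a hard constraint.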