The book Artificial Intelligence: A Modern Approach is an excellent reference for the basics. Here, I want to note down concise definitions of the four bins RL algorithms are usually classified into.

The on-policy vs. off-policy distinction stems from the fact that the data used to train an RL agent is potentially influenced by the agent's own policy. This is an important difference between RL and supervised learning.

On-Policy

Data for every SGD minibatch is collected using the current policy.

  • Data collection is tightly embedded in the SGD loop
  • Needs a fast simulator, which therefore tends to be low-fidelity
  • Does not suffer from covariate shift
  • Can visit dangerous states during training
  • Suitable for problems with large action spaces because covering all possible (s, a) pairs for off-policy learning would be intractable
  • Training is sample-inefficient because of noisy policy gradients and non-iid training samples. These issues are addressed by reducing the variance of the policy gradient through advantage functions, and by “trust region” methods (TRPO, PPO) that discourage each SGD step from making large changes to the policy (see the sketch after this list)
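
To make “data collection embedded in the SGD loop” concrete, here is a minimal REINFORCE-style sketch. The 5-state chain environment, the tabular softmax policy, and the learning rate are hypothetical choices for illustration, not anything from the notes above; the point is only the structure: every update uses trajectories freshly collected with the current policy.

```python
import numpy as np

# Minimal on-policy REINFORCE sketch on a hypothetical 5-state chain.
# Key structural point: each gradient step uses trajectories collected
# with the CURRENT policy parameters, then throws them away.

N_STATES, N_ACTIONS, GAMMA, LR = 5, 2, 0.99, 0.1
rng = np.random.default_rng(0)
theta = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters


def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()


def env_step(s, a):
    # Hypothetical chain: action 1 moves right, reward on reaching the last state.
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1


for iteration in range(200):
    # --- data collection with the current policy, inside the training loop ---
    s, trajectory = 0, []
    for t in range(50):
        a = rng.choice(N_ACTIONS, p=policy(s))
        s_next, r, done = env_step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break

    # --- policy-gradient update on that freshly collected trajectory ---
    G = 0.0
    for s, a, r in reversed(trajectory):
        G = r + GAMMA * G                 # return-to-go
        grad_log = -policy(s)             # grad of log softmax w.r.t. theta[s]
        grad_log[a] += 1.0
        theta[s] += LR * G * grad_log     # REINFORCE update (no baseline)
    # The trajectory is discarded; the next iteration collects new data
    # with the updated policy -- this is what makes the method on-policy.
```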

Examples

  • “Vanilla” policy gradient
  • TD-Gammon
  • SARSA
  • PPO

Off-Policy

  • Needs large memory for the “replay buffer”
  • (s, a, r, s') transitions are collected offline according to a “behavior policy”, typically a mix of greedy and random action selection
  • The offline simulator can be slow, and therefore high-fidelity
  • Suitable for problems with small action spaces
  • Suffers from compounding “covariate shift” between the distribution of the training data and the distribution encountered during deployment; e.g., a robot may be unable to recover from a bad state it never saw during training
  • Training can be data-efficient because randomly sampling the replay buffer reduces the non-iidness of training samples (see the sketch after this list)
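
For contrast, here is a minimal tabular Q-learning sketch with a replay buffer and an epsilon-greedy behavior policy, on the same hypothetical chain environment as the sketch above. The buffer size, epsilon, and learning rate are arbitrary illustrative choices.

```python
import numpy as np
from collections import deque

# Minimal off-policy tabular Q-learning sketch with a replay buffer,
# on the same hypothetical chain environment as the on-policy sketch.

N_STATES, N_ACTIONS, GAMMA, EPS, LR, BATCH = 5, 2, 0.99, 0.2, 0.1, 32
rng = np.random.default_rng(0)
Q = np.zeros((N_STATES, N_ACTIONS))
replay_buffer = deque(maxlen=10_000)


def env_step(s, a):
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, reward, s_next == N_STATES - 1


s = 0
for t in range(5000):
    # --- behavior policy: epsilon-greedy mix of random and greedy actions ---
    a = int(rng.integers(N_ACTIONS)) if rng.random() < EPS else int(Q[s].argmax())
    s_next, r, done = env_step(s, a)
    replay_buffer.append((s, a, r, s_next, done))
    s = 0 if done else s_next

    # --- learn from a randomly sampled minibatch, not the latest transition ---
    if len(replay_buffer) >= BATCH:
        for i in rng.choice(len(replay_buffer), size=BATCH):
            bs, ba, br, bs_next, bdone = replay_buffer[int(i)]
            target = br + (0.0 if bdone else GAMMA * Q[bs_next].max())
            Q[bs, ba] += LR * (target - Q[bs, ba])
    # The max over actions in the target means we learn about the greedy
    # policy while acting with the epsilon-greedy behavior policy --
    # that mismatch is what makes the method off-policy.
```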

Examples

  • Q learning
  • DQN

Model-based

Uses the state transition function P(s' | s, a) as a part of training; the transition model itself might even be learned during training.
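
As a concrete (if simplistic) model-based example, here is value iteration on the same hypothetical chain, where the transition model P and reward table R are written down by hand; in other model-based methods they would be learned from data first and then used the same way.

```python
import numpy as np

# Model-based sketch: value iteration on the same hypothetical 5-state chain,
# using an explicit transition model P[s, a, s'] and reward table R[s, a].

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.99
P = np.zeros((N_STATES, N_ACTIONS, N_STATES))
R = np.zeros((N_STATES, N_ACTIONS))
for s in range(N_STATES):
    P[s, 0, max(s - 1, 0)] = 1.0             # action 0: move left
    P[s, 1, min(s + 1, N_STATES - 1)] = 1.0  # action 1: move right
R[N_STATES - 2, 1] = 1.0                     # reward for entering the goal state
P[N_STATES - 1] = 0.0                        # make the goal state absorbing
P[N_STATES - 1, :, N_STATES - 1] = 1.0

V = np.zeros(N_STATES)
for _ in range(200):
    # Bellman optimality backup: uses P and R directly, no environment samples.
    Q = R + GAMMA * (P @ V)   # expected one-step returns, shape (N_STATES, N_ACTIONS)
    V = Q.max(axis=1)

greedy_policy = Q.argmax(axis=1)  # greedy policy: move right toward the goal
```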

Examples

  • Value iteration (planning with a known transition model)

Model-free

Does not use P(s' | s, a); instead, it relies on a large number of sampled transitions from the simulator. Algorithms with temporal-difference (TD) training objectives (e.g., Q-learning, DQN) are model-free. Model-free methods seem to be more popular than model-based ones currently (I don’t know why).
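
To contrast with the value-iteration sketch above, here is a minimal TD(0) sketch that estimates state values purely from sampled transitions under a fixed random policy on the same hypothetical chain; the learning rate and step count are arbitrary, and P(s' | s, a) is never evaluated.

```python
import numpy as np

# Model-free sketch: TD(0) value estimation on the same hypothetical chain,
# under a fixed uniformly random policy. Unlike the value-iteration sketch,
# it only ever sees sampled transitions -- P(s' | s, a) is never evaluated.

N_STATES, GAMMA, LR = 5, 0.99, 0.1
rng = np.random.default_rng(0)
V = np.zeros(N_STATES)

s = 0
for t in range(5000):
    a = int(rng.integers(2))  # random behavior: 0 = left, 1 = right
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    # TD(0) update: bootstrap from the single sampled next state.
    V[s] += LR * (r + GAMMA * V[s_next] - V[s])
    s = 0 if s_next == N_STATES - 1 else s_next
```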

Examples

  • Q-learning and DQN (listed above under off-policy)
  • Policy-gradient methods such as PPO (listed above under on-policy)
