Reinforcement Learning: Terminology and Examples
The book Artificial Intelligence: A Modern Approach is an excellent reference for the basics. Here, I want to note down concise definitions of the four bins RL algorithms are usually classified into.
The on-policy vs. off-policy distinction stems from the fact that the data used to train an RL agent is potentially influenced by the agent's own policy. This is an important difference between RL and supervised learning.
On-Policy
Data for every SGD minibatch is collected using the current policy.
- Data collection is tightly embedded in the SGD loop
- Needs a fast simulator, which usually means settling for a low-quality one
- Does not suffer from covariate shift
- Can visit dangerous states during training
- Suitable for problems with large action spaces because covering all possible (s, a) pairs for off-policy learning would be intractable
- Training is inefficient because policy gradients are noisy and training samples are not i.i.d. These issues are addressed by reducing the variance of the policy gradient with advantage functions, and by “trust region” methods (TRPO, PPO) that discourage each SGD step from making large changes to the policy (see the policy-gradient sketch at the end of this section)
Examples
- “Vanilla” policy gradient
- TD-Gammon
- SARSA
- PPO
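To make the on-policy loop concrete, here is a minimal sketch of a “vanilla” policy gradient (REINFORCE) on a made-up 5-state toy MDP with a linear softmax policy; the environment, the policy parameterization, and the hyperparameters are all my own assumptions for illustration, and the sketch omits the advantage / trust-region machinery mentioned above. Note how data collection (step 1) sits inside the training loop and always uses the current policy.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 5, 2
theta = np.zeros((N_STATES, N_ACTIONS))   # policy parameters

def policy(s):
    """Softmax over action preferences for state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step(s, a):
    """Toy MDP: action 1 moves right, action 0 resets; reward at the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else 0
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, r, done

alpha, gamma = 0.1, 0.99
for episode in range(2000):
    # 1. Collect a trajectory with the CURRENT policy (this is what makes it on-policy).
    s, traj, done = 0, [], False
    while not done and len(traj) < 50:
        p = policy(s)
        a = rng.choice(N_ACTIONS, p=p)
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next

    # 2. Compute the discounted return G_t for each step of the trajectory.
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # 3. Policy-gradient update: grad log pi(a|s) * G_t for each visited (s, a).
    for (s, a, _), G in zip(traj, returns):
        p = policy(s)
        grad_log = -p
        grad_log[a] += 1.0          # gradient of log softmax w.r.t. theta[s]
        theta[s] += alpha * G * grad_log
```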
Off-Policy
- Needs large memory for the “replay buffer”
- (s, a, r, s’) data collection can happen offline, driven by a “behavior policy” that mixes greedy and random actions (e.g., ε-greedy)
- Because the simulator sits outside the SGD loop, it can be slow, and therefore high-quality
- Suitable for problems with small action spaces
- Suffers from compounding “covariate shift” between the distribution of training data and the distribution encountered during deployment, e.g., a robot that cannot recover from a bad state it never visited during training
- Training can be data-efficient because randomly sampling from the replay buffer reduces the non-i.i.d.-ness of training samples (see the Q-learning sketch at the end of this section)
Examples
- Q learning
- DQN
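For contrast, here is a minimal off-policy sketch: tabular Q-learning with an ε-greedy behavior policy and a replay buffer, on the same kind of toy MDP. The environment and hyperparameters are again my own assumptions. Each update uses a transition sampled uniformly from the buffer, which was often generated by an older behavior policy; the learner does not care, which is exactly what “off-policy” means.

```python
import random
from collections import deque

import numpy as np

rng = random.Random(0)
N_STATES, N_ACTIONS = 5, 2
Q = np.zeros((N_STATES, N_ACTIONS))
replay = deque(maxlen=10_000)          # the "replay buffer"

def step(s, a):
    """Toy MDP: action 1 moves right, action 0 resets; reward at the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else 0
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r, s_next == N_STATES - 1

alpha, gamma, eps = 0.1, 0.99, 0.2
s = 0
for t in range(20_000):
    # Behavior policy: a mix of greedy and random actions (epsilon-greedy).
    a = rng.randrange(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())
    s_next, r, done = step(s, a)
    replay.append((s, a, r, s_next, done))
    s = 0 if done else s_next

    # Learn from a transition sampled uniformly from the buffer; it may have been
    # generated by an old behavior policy -- the Q-learning target does not mind.
    bs, ba, br, bs_next, bdone = replay[rng.randrange(len(replay))]
    td_target = br if bdone else br + gamma * Q[bs_next].max()
    Q[bs, ba] += alpha * (td_target - Q[bs, ba])
```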
Model-based
Uses the state transition function P(s’ | s, a) during training; the transition model itself may even be learned as part of training (see the value-iteration sketch below).
Examples
- Value iteration / policy iteration with a known transition model
- Dyna-Q
- AlphaZero / MuZero
- Model-predictive control (MPC)
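As a minimal illustration of “uses P(s’ | s, a)”, here is value iteration on a made-up 2-state MDP with a known transition matrix and reward table (both are my own assumptions, not from any real problem). The Bellman backup reads the model directly and needs no samples at all.

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = probability of landing in s' after taking action a in state s (made up).
P = np.array([
    [[0.9, 0.1],     # action 0
     [0.2, 0.8]],
    [[0.1, 0.9],     # action 1
     [0.8, 0.2]],
])
# R[a, s] = expected immediate reward (made up).
R = np.array([
    [0.0, 1.0],
    [0.5, 0.0],
])

V = np.zeros(2)
for _ in range(100):
    # Bellman optimality backup uses the model P directly -- no samples needed.
    Q = R + gamma * (P @ V)          # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
    V = Q.max(axis=0)
print(V)                             # optimal state values of the toy MDP
```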
Model-free
Does not use P(s’ | s, a); instead it relies on a large number of samples from the simulator. Algorithms with temporal-difference (TD) training objectives (Q-learning, DQN) are model-free. Model-free methods currently seem to be more popular than model-based ones (I don’t know why). The TD(0) sketch below contrasts this with the model-based example above.
Examples
- The model-free algorithms already listed above: “vanilla” policy gradient, SARSA, PPO, Q-learning, DQN
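For the model-free counterpart, here is TD(0) policy evaluation on the same made-up 2-state MDP. The simulator internally samples from P, but the learner only ever sees (s, a, r, s’) samples and never touches P itself; it estimates the value of the (uniform random) behavior policy from those samples alone.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.05
# Same made-up 2-state MDP as in the model-based sketch above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[0.0, 1.0], [0.5, 0.0]])

V = np.zeros(2)
s = 0
for t in range(50_000):
    a = rng.integers(2)                  # uniform random behavior policy, for exploration
    s_next = rng.choice(2, p=P[a, s])    # the simulator samples from P internally...
    r = R[a, s]
    # ...but the learner only sees (s, a, r, s') and applies the TD(0) update,
    # estimating the value of the random behavior policy.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    s = s_next
print(V)
```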
Additional Links
- Stable Baselines: Implementations of famous model-free RL algorithms
- awesome-rl-envs: List of simulators suitable for model-free RL training
- A good blog post explaining the taxonomy of RL algorithms