Reinforcement Learning: Terminology and Examples
The book Artificial Intelligence: A Modern Approach is an excellent reference for the basics. Here, I want to note down concise definitions of the four bins RL algorithms are usually classified into.
The on-policy vs. off-policy distinction stems from the fact that the data used to train an RL agent is potentially influenced by the agent's own policy. This is an important difference between RL and supervised learning.
On-Policy
Data for every SGD minibatch is collected using the current policy.
- Data collection is tightly embedded in the SGD loop
- Needs a fast simulator, which usually means settling for a low-quality one
- Does not suffer from covariate shift
- Can visit dangerous states during training
- Suitable for problems with large action spaces because covering all possible (s, a) pairs for off-policy learning would be intractable
- Training is inefficient because policy gradients are noisy and training samples are not i.i.d. These issues are addressed by reducing the variance of the policy gradient with advantage functions, and by “trust region” methods (TRPO, PPO) that discourage each SGD step from making large changes to the policy (see the policy-gradient sketch at the end of this section)
Examples
- “Vanilla” policy gradient
- TD-Gammon
- SARSA
- PPO
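To make the on-policy loop concrete, here is a minimal sketch of a “vanilla” policy gradient (REINFORCE) on a made-up 5-state toy MDP with a linear softmax policy; the environment, the policy parameterization, and the hyperparameters are all my own assumptions for illustration, and the sketch omits the advantage / trust-region machinery mentioned above. Note how data collection (step 1) sits inside the training loop and always uses the current policy.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 5, 2
theta = np.zeros((N_STATES, N_ACTIONS))   # policy parameters

def policy(s):
    """Softmax over action preferences for state s."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def step(s, a):
    """Toy MDP: action 1 moves right, action 0 resets; reward at the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else 0
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    done = s_next == N_STATES - 1
    return s_next, r, done

alpha, gamma = 0.1, 0.99
for episode in range(2000):
    # 1. Collect a trajectory with the CURRENT policy (this is what makes it on-policy).
    s, traj, done = 0, [], False
    while not done and len(traj) < 50:
        p = policy(s)
        a = rng.choice(N_ACTIONS, p=p)
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next

    # 2. Compute the discounted return G_t for each step of the trajectory.
    G, returns = 0.0, []
    for (_, _, r) in reversed(traj):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # 3. Policy-gradient update: grad log pi(a|s) * G_t for each visited (s, a).
    for (s, a, _), G in zip(traj, returns):
        p = policy(s)
        grad_log = -p
        grad_log[a] += 1.0          # gradient of log softmax w.r.t. theta[s]
        theta[s] += alpha * G * grad_log
```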
Off-Policy
- Needs large memory for the “replay buffer”
- (s, a, r, s’) data collection can happen offline, driven by a “behavior policy” that mixes greedy and random actions (e.g., ε-greedy)
- Because the simulator sits outside the SGD loop, it can be slow, and therefore high-quality
- Suitable for problems with small action spaces
- Suffers from compounding “covariate shift” between the distribution of training data and the distribution encountered during deployment, e.g., a robot that cannot recover from a bad state it never visited during training
- Training can be data-efficient because randomly sampling from the replay buffer reduces the non-i.i.d.-ness of training samples (see the Q-learning sketch at the end of this section)
Examples
- Q learning
- DQN
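For contrast, here is a minimal off-policy sketch: tabular Q-learning with an ε-greedy behavior policy and a replay buffer, on the same kind of toy MDP. The environment and hyperparameters are again my own assumptions. Each update uses a transition sampled uniformly from the buffer, which was often generated by an older behavior policy; the learner does not care, which is exactly what “off-policy” means.

```python
import random
from collections import deque

import numpy as np

rng = random.Random(0)
N_STATES, N_ACTIONS = 5, 2
Q = np.zeros((N_STATES, N_ACTIONS))
replay = deque(maxlen=10_000)          # the "replay buffer"

def step(s, a):
    """Toy MDP: action 1 moves right, action 0 resets; reward at the last state."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else 0
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return s_next, r, s_next == N_STATES - 1

alpha, gamma, eps = 0.1, 0.99, 0.2
s = 0
for t in range(20_000):
    # Behavior policy: a mix of greedy and random actions (epsilon-greedy).
    a = rng.randrange(N_ACTIONS) if rng.random() < eps else int(Q[s].argmax())
    s_next, r, done = step(s, a)
    replay.append((s, a, r, s_next, done))
    s = 0 if done else s_next

    # Learn from a transition sampled uniformly from the buffer; it may have been
    # generated by an old behavior policy -- the Q-learning target does not mind.
    bs, ba, br, bs_next, bdone = replay[rng.randrange(len(replay))]
    td_target = br if bdone else br + gamma * Q[bs_next].max()
    Q[bs, ba] += alpha * (td_target - Q[bs, ba])
```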
Model-based
Uses the state transition function P(s’ | s, a) during training; the transition model itself may even be learned as part of training (see the value-iteration sketch below).
Examples
- Value iteration / policy iteration with a known transition model
- Dyna-Q
- AlphaZero / MuZero
- Model-predictive control (MPC)
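As a minimal illustration of “uses P(s’ | s, a)”, here is value iteration on a made-up 2-state MDP with a known transition matrix and reward table (both are my own assumptions, not from any real problem). The Bellman backup reads the model directly and needs no samples at all.

```python
import numpy as np

gamma = 0.9
# P[a, s, s'] = probability of landing in s' after taking action a in state s (made up).
P = np.array([
    [[0.9, 0.1],     # action 0
     [0.2, 0.8]],
    [[0.1, 0.9],     # action 1
     [0.8, 0.2]],
])
# R[a, s] = expected immediate reward (made up).
R = np.array([
    [0.0, 1.0],
    [0.5, 0.0],
])

V = np.zeros(2)
for _ in range(100):
    # Bellman optimality backup uses the model P directly -- no samples needed.
    Q = R + gamma * (P @ V)          # Q[a, s] = R[a, s] + gamma * sum_s' P[a, s, s'] V[s']
    V = Q.max(axis=0)
print(V)                             # optimal state values of the toy MDP
```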
Model-free
Does not use P(s’ | s, a); instead it relies on a large number of samples from the simulator. Algorithms with temporal-difference (TD) training objectives (Q-learning, DQN) are model-free. Model-free methods currently seem to be more popular than model-based ones (I don’t know why). The TD(0) sketch below contrasts this with the model-based example above.
Examples
- The model-free algorithms already listed above: “vanilla” policy gradient, SARSA, PPO, Q-learning, DQN
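For the model-free counterpart, here is TD(0) policy evaluation on the same made-up 2-state MDP. The simulator internally samples from P, but the learner only ever sees (s, a, r, s’) samples and never touches P itself; it estimates the value of the (uniform random) behavior policy from those samples alone.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.05
# Same made-up 2-state MDP as in the model-based sketch above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[0.0, 1.0], [0.5, 0.0]])

V = np.zeros(2)
s = 0
for t in range(50_000):
    a = rng.integers(2)                  # uniform random behavior policy, for exploration
    s_next = rng.choice(2, p=P[a, s])    # the simulator samples from P internally...
    r = R[a, s]
    # ...but the learner only sees (s, a, r, s') and applies the TD(0) update,
    # estimating the value of the random behavior policy.
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    s = s_next
print(V)
```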
Additional Links
- Stable Baselines: Implementations of famous model-free RL algorithms
- awesome-rl-envs: List of simulators suitable for model-free RL training
- A good blog post explaining the taxonomy of RL algorithms