Reinforcement Learning (RL) is one of the most promising subfields of AI, with applications as diverse as self-driving cars and stock trading. A well-known weakness of the RL approach is that researchers have to define a reward function corresponding to an agent’s goal. For complex goals, this can be hard, and misspecified rewards may result not only in bad performance but also in unsafe behaviour. Hence, various organisations, including Google’s DeepMind, OpenAI and UC Berkeley’s CHAI, have aimed to make the reward function part of the learning process rather than a hyperparameter that is specified before training. However, just because a goal is learned does not mean that it is aligned with human intentions.

