This is the first post in a series on rewards and values.
The reward most familiar to us is probably food. We often use treats to train animals, and eating is pleasurable for most people. These rewards are clearly an external thing, aren’t they? This idea is, in some ways, echoed in machine reinforcement learning, as shown in a diagram (pictured below) from the introductory book by Richard Sutton and Andrew Barto. Intuitively this makes sense. We get something from the environment that is pleasurable; the reward feels as though its origin is external. But we can, in the case of animals and people, trace reward and pleasure to internal brain locations and processes. Machines could potentially benefit from reworking reinforcement learning in the same way, making explicit that the reward comes from within the agent.
So let’s trace the sensations of a food “reward”. The animal smells and tastes the food; the olfactory and gustatory receptors transmit signals to the brain, which then identifies the odour and taste. A process within the brain decides whether the food was pleasurable or unpleasant. This response is learned and causes impulses to seek or avoid the food in future.
Nothing in the food is inherently rewarding. It is the brain that processes the sensations of the food and the brain that produces reward chemicals. For a more detailed article on pleasure and reward in the brain see Berridge and Kringelbach (2008). Choosing the right food when training animals is a process of finding something that their brain responds to as a reward. Once a good treat has been found, the animal knows what it wants, and training is the process of teaching the animal what to do to get the rewarding treat.
We can consider standard implementations of reinforcement learning in machines as a similar process: the machine searches the environment (or “state-space”) and, if it performs the right actions to get to the right state, it gets a reward. There are notable differences: the agent might not know anything about the environment, how actions move it from one state to another, or what state gives the reward. Animals, on the other hand, come with some knowledge of the environment and themselves. They have some sense of causality and sequences of events, and they very quickly recognise treats that cause reward.
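To make the standard setup concrete, here is a minimal sketch of an agent blindly searching a toy environment. Everything in it is an assumption for illustration: the `Corridor` class, its `reset` and `step` methods, and the `random_search` function are hypothetical names, not taken from any library. The point to notice is that the reward is computed inside the environment and handed to the agent.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..length-1, treat at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.goal = length - 1
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.goal
        # The reward is decided by the environment, not by the agent
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def random_search(env, rng, max_steps=1000):
    """Blind search: the agent has no idea where the reward is."""
    env.reset()
    for t in range(1, max_steps + 1):
        action = rng.choice([-1, 1])
        _, reward, done = env.step(action)
        if done:
            return t, reward
    return max_steps, 0.0


steps, reward = random_search(Corridor(length=5), random.Random(0))
print(f"steps taken: {steps}, reward received: {reward}")
```

An animal in the same corridor would probably smell the treat and walk straight to it; the random agent only finds out that the last position is rewarding by stumbling onto it.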
Another subtle difference is that the machine doesn’t usually know what the target or objective is; the agent performs a blind search. Reinforcement learning works by simulating the agent exploring some (usually simplified) environment until it finds a reward, and then calculating increases in the value of the states and actions that preceded the reward. Computers can crunch the numbers in simulation, but the complexity of the environment and large numbers of available actions are the enemy. Each extra state “dimension” multiplies the number of states to be evaluated, so the required computation grows exponentially with the number of dimensions, and each extra action enlarges the set of choices at every state (see “curse of dimensionality”). This sounds different from animals, which have very simple associations with objects or actions as the targets of rewards. More on this later!
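As a rough illustration of how value is passed back to the states and actions that preceded a reward, here is a sketch using tabular Q-learning, one common reinforcement learning algorithm (I am choosing it for illustration; the argument above is not tied to it). The five-position corridor, the parameter values, and the function names `q_learning_corridor` and `table_size` are all assumptions of mine.

```python
import random


def q_learning_corridor(length=5, episodes=200, alpha=0.5, gamma=0.9,
                        epsilon=0.2, seed=0):
    """Tabular Q-learning on a corridor with a reward at the right-hand end."""
    rng = random.Random(seed)
    actions = (-1, 1)  # step left or step right
    goal = length - 1
    q = {(s, a): 0.0 for s in range(length) for a in actions}
    for _ in range(episodes):
        s = 0
        while s != goal:
            if rng.random() < epsilon:
                a = rng.choice(actions)  # explore
            else:
                best = max(q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if q[(s, b)] == best])
            s2 = min(max(s + a, 0), length - 1)
            r = 1.0 if s2 == goal else 0.0
            # Value of the best next action; zero once the episode has ended
            future = 0.0 if s2 == goal else max(q[(s2, b)] for b in actions)
            # Move the estimate towards reward plus discounted future value
            q[(s, a)] += alpha * (r + gamma * future - q[(s, a)])
            s = s2
    return q


def table_size(values_per_dimension, dimensions, n_actions):
    """Number of state-action entries a full table would need."""
    return values_per_dimension ** dimensions * n_actions


q = q_learning_corridor()
for s in range(4):
    print(s, round(q[(s, -1)], 3), round(q[(s, 1)], 3))
print(table_size(10, 2, 2), table_size(10, 6, 2))
```

After training, stepping right is valued above stepping left at every position, with the values fading the further a position is from the reward. The `table_size` helper shows the other half of the story: with ten values per dimension and two actions, going from two dimensions to six takes the table from 200 entries to 2,000,000.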
An extension of the machine reinforcement learning problem is the case where the agent doesn’t know what environment state it is in. Rather than getting the environment state, the agent only makes “observations” in this model, known as a “partially observable Markov decision process” or POMDP. From these observations the agent can infer the state and predict the action that should be taken, but the agent typically has reduced certainty. Nevertheless, the rewards it receives are still a function of the true state and action. The agent is not generating rewards from its observations, but receiving them from some genie (the trainer or experimenter) that knows the state and gives it the reward. This is a disconnect between what the agent actually senses (the observations) and the rewards it receives, and it is relevant for autonomous agents, including robots.
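The disconnect can be sketched in a few lines. This is a simplified illustration, not a full POMDP solver: the state does not change between observations, and the sensor model (correct most of the time, otherwise reporting a uniformly random state) is an assumption of mine, as are the names `noisy_observation`, `external_reward`, and `update_belief`.

```python
import random


def noisy_observation(true_state, n_states, rng, error_rate=0.3):
    """Assumed sensor: usually correct, otherwise reports a random state."""
    if rng.random() < error_rate:
        return rng.randrange(n_states)
    return true_state


def external_reward(true_state, goal):
    """The 'genie': computed from the true state, which the agent never sees."""
    return 1.0 if true_state == goal else 0.0


def update_belief(belief, observation, error_rate=0.3):
    """Bayes update of the agent's belief over states after one observation."""
    n = len(belief)
    # P(observation | state) under the assumed sensor model
    likelihood = [
        (1 - error_rate) + error_rate / n if s == observation else error_rate / n
        for s in range(n)
    ]
    unnormalised = [l * b for l, b in zip(likelihood, belief)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


rng = random.Random(1)
true_state, goal, n_states = 3, 3, 4
belief = [1.0 / n_states] * n_states  # the agent starts with no idea where it is
for _ in range(3):
    belief = update_belief(belief, noisy_observation(true_state, n_states, rng))
print([round(b, 3) for b in belief], external_reward(true_state, goal))
```

The agent only ever holds `belief`, yet `external_reward` is evaluated on `true_state`. The two functions never share an input, which is exactly the gap described above.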
These implementations of reinforcement learning mimic the training of an animal with treats, where the whole animal is an agent and the trainer is part of the environment that gives rewards. But this doesn’t seem a good model of reward originating in internal brain processes. Without sensing the food the brain wouldn’t know that it had just been rewarded; it could be argued that the brain (and hence the agent) wasn’t rewarded. How much uncertainty in sensations can there be before the brain doesn’t recognise that it has been rewarded? In a computer, where the environment and the agent are all simulated, the distinction between reward coming from the environment or being self-generated in the agent may not matter. But an autonomous robot, with no trainer giving it rewards, must sense the environment and decide only from its own observations whether it should be rewarded.
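One way to picture a self-generated reward is to compute it inside the agent from the observation alone, then count how often it disagrees with what an all-knowing trainer would have given. The sketch below is only a thought experiment in code: the sensor model, the five-state world, and the names `sense`, `internal_reward`, and `disagreement_rate` are all assumptions for illustration.

```python
import random


def sense(true_state, n_states, rng, error_rate):
    """Assumed sensor: usually correct, otherwise reports a random state."""
    if rng.random() < error_rate:
        return rng.randrange(n_states)
    return true_state


def internal_reward(observation, treat_state):
    """Reward generated inside the agent, from what it sensed."""
    return 1.0 if observation == treat_state else 0.0


def disagreement_rate(error_rate, n_states=5, treat_state=4,
                      trials=10000, seed=0):
    """Fraction of trials where internal and trainer-given rewards differ."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(trials):
        true_state = rng.randrange(n_states)
        # What a trainer who knows the true state would hand out
        external = 1.0 if true_state == treat_state else 0.0
        observation = sense(true_state, n_states, rng, error_rate)
        internal = internal_reward(observation, treat_state)
        disagreements += external != internal
    return disagreements / trials


for rate in (0.0, 0.1, 0.5):
    print(rate, disagreement_rate(rate))
```

With a perfect sensor the two rewards always agree, so the distinction is invisible. As the sensor gets noisier the agent increasingly rewards itself for treats that are not there and misses ones that are, which is one crude way of putting a number on the uncertainty question raised above.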
The implementation of reinforcement learning for autonomous agents and robots will be the topic of a later post. In the next post, however, I will cover the problem of machines “observing” the world. How do we represent the world as “states” and the robot’s capabilities as “actions”? I will discuss how animals appear to solve the problem, along with recent advances in reinforcement learning.