Self-rewarding autonomous machines

MarsPerceptionThis is the third post in a series about rewards and values. I’ve written about rewards being generated from within the brain, rather than actually being external, and how autonomous robots must be able to calculate their own rewards in a similar way. I’ve also written about a way in which the world is represented for robots – the “state space” – and the benefits of using object-oriented representations. Combining these two things will be the order of this post: a robot that perceives the world as objects and that rewards itself. But how exactly should this be done and what is the use of learning this way?

First, let’s consider another feature of reinforcement learning. I’ve described a state space, but I didn’t go into “state-action values”. In a world that is divided up into states, and each state has a range of possible actions. The value (state-action value) of any of those actions in that state is the reward we can expect from performing it. Consider the problem of balancing an inverted pendulum: we only get a negative reward (punishment) if the pendulum tips too far to the left or right, or if we reach the limit of movement left or right. If we are in a state of the pendulum tipping to the left, moving in the opposite direction will speed the descent and the punishment. The action with more value is the one that keeps the pendulum balanced. It would be common for the angles and the positions of the base of the inverted pendulum to be partitioned into increments. Smaller partitions gives finer control, but also means that more time needs to be spent to explore and calculate the value of actions in each of those partitions. This is an inverted pendulum version of the “gridworld”.

Now consider a robot that sees the scene in front of it. The robot has inbuilt perceptual algorithms and detects objects in the environment. Each of these detected objects affords a set of complex actions – moving towards, moving away, moving around, inspecting, etc. Again these actions might have values, but how does the robot determine the object-action value, and how does it decide what reward it should receive? Remember that, in an autonomous agent, rewards are internal responses to sensory feedback.  Therefore the rewards the autonomous agent receives is what it gives itself, and must come from its own sensory feedback. The reward could come from visually detecting the object, or from any other sensor information detected from the object. An internal “reward function”, objective, or external instruction determines whether the sensory and perceptual feedback is rewarding.

Now the reward is no longer directly telling the robot what path it should be taking or how it should be controlling its motors, the reward is telling it what it should be seeking or doing at a higher level of abstraction. The act of finding an object may be rewarding, otherwise the reward may come from interacting with the object and detecting its properties. Imagine a robot on Mars, inspecting rocks, but only knowing whether it found the chemicals it was looking for in the rock after probing it with tools. Visually detecting a rock and then detecting the chemicals the robot is looking associates the reward and value with visual qualities of the rock – seeking more rocks with that appearance. If the objective of the robot is to complete a movement, such as grasping a cup, the inclusion of self-perception allows the robot to monitor and detect its success. With detailed sensory feedback, telling the robot where it is in relation to its target, feedback controllers can be used – typically a much more efficient and generalised process for controlling movements than the random search for a trajectory using gridworld-style reinforcement learning.

So if this is such a great way of applying reinforcement learning in robots, why aren’t we? The simple matter is that our current algorithms and processes for perception and perceptual learning just aren’t good enough to recognise objects robustly. So what good is this whole idea if we can’t visually recognise objects? Planning out how to create a robot with robust autonomy, and that is capable of learning, can point us in a direction to focus our efforts in research and development. Perception is still a major stumbling block in the application of autonomous robotics. Biological clues and recent successes suggest that deep convolutional neural networks might be the way to go, but new faster ways of creating and training them are probably necessary. Multiple sensor modalities and active interaction and learning will likely also be important. Once we have more powerful perceptual abilities in robots they can do more effective monitoring of their own motion and adapt their feedback controllers to produce successful movements. With success being determined by the application of reinforcement learning, the learning cycle can be self-adapting and autonomous.

More can still be said of the development of the reward function that decides what sensory states and percepts are rewarding (pleasurable) and what are punishing (painful). The next post will speculate on the biological evolution of rewards and values – which comes first? – and how this might have relevance to a robot deciding what sensory states it should find rewarding.

State space: Quantities and qualities

GridWorldThis is the second post in a series on rewards and values, the previous post discussed whether rewards are external stimuli or internal brain activity. This post discusses the important issue of representing the world in a computer or robot, and the practice of describing the world as discrete quantities or abstract qualities.

The world we live in can usually be described as being in a particular state, e.g., the sky is cloudy, the door is closed, the car is travelling at 42km/h. To be absolutely precise about the state of the world we need to use quantities and measurements: position, weight, number, volume, and so on. But how often do we people know the precise quantitative state of the world we live in? Animals and people often get by without quantifying the exact conditions of the world around them, instead perceiving the qualities of the world – recognising and categorising things, and making relative judgements about position, weight, speed, etc. But then why are robots often programmed to use quantitative descriptions of the world rather than qualitative descriptions? This is a complex issue that won’t be comprehensively answered in this post, but some differences in computer-based representation with quantities and with qualities will be described.

For robots, the world can be represented as a state space. When dealing with measurable quantities the state space often divides up the world into partitions. A classic example in reinforcement learning is navigating a “gridworld“. In the gridworld, the environment the agent finds itself is literally a square grid, and the agent can only move in the four compass directions (north, south, east and west). In the computer these actions and states would usually represented as numbers: state 1, state 2, …, state n, and action 1, action 2, …, action m. The “curse of dimesionality” appears because to store the value of every state-action pair the number of states multiplied by the number of actions. If we add another dimension to the environment with another k possible values, our number of states is multiplied by k. A ten by ten grid, with another dimension of 10 values goes from having 100 states to 1000 states. With four different movements available the agent has 4 actions, so there would be 4000 state-action pairs.

While this highlights one serious problem of representing the world quantitatively, an equally serious problem is deciding how fine should our quantity divisions be? If the agent is a mobile robot driving around a laboratory with only square obstacles, we could probably get by dividing the world up into 50cm x 50cm squares. But if the robot was required to pick something up from a tabletop, it might need to has accuracy down to the centimetre. If it drives around the lab as well as picking things up from tabletops, dividing up the world gets trickier. The grid that makes up the world can’t just describe occupancy, areas of the grid occupied by objects of interest need to be specified as those objects, adding more state dimensions to the representation.

When we people make a choice to do something, like walk to the door, we don’t typically update that choice each time we move 50cm. We collapse all the steps along the way into a single action. Hierarchical reinforcement learning does just this, with algorithms coming under this banner collecting low level actions into high level actions, hierarchically. One popular framework collects actions into “options”, a method of selecting actions (e.g., ‘go north’ 100 times) and evaluating end-conditions (e.g., hit a wall or run out of time) that allow for a reduction in the number of times an agent needs to make a choice (e.g., choose ‘go north’ 100 times) to see how things pan out. This simplifies the process of choosing actions that the agent performs, but it doesn’t simplify the representation of the environment.

When we look around we see the objects – right now you are looking at some sort of computer screen – we also see objects that make up any room we’re in: the door, the floor, the walls, tables and chairs. In our minds, we would seem to represent the world around us as a combined visual and spatial collection of objects. Describing the things in the world as the objects they are in the minds of people allows our “unit of representation” to be any size, and can dramatically simplify the way the world is described. And that is what is happening in more recent developments in machine learning, specifically with relational reinforcement learning and object-oriented reinforcement learning.

In relational reinforcement learning, things in the world are described by their relationship to other things. For example, the coffee cup is on the table and the coffee is in the coffee cup. These relations can usually be described using simple logic statements. Similar to relational abstraction of the world, object-oriented reinforcement learning allows objects to have properties and have associated actions, much like classes in object-oriented programming. Given that object-oriented programming was designed partly because it was related to how we people describe the world, viewing the world as objects has a lot of conceptual benefits. The agent considers the state of objects and learns the effects of actions with those objects. In the case of a robot, we reduce the problem of having large non-meaningful state spaces, but then run into the challenge of recognising objects – a serious hurdled in the world of robotics that isn’t yet solved.

A historical reason for ‘why were quantitative divisions for state space used in the first place?’ is because some problems, such as balancing a broom or gathering momentum to get up a slope, were designed to use as little prior information and as little sensory feedback as possible. This challenge turned into how to get a system to efficiently learn to solve these problems when having to blindly search for a reward or avoid a punishment. Generally speaking, many of the tasks requiring the discrete division of a continuous range are ones that involve some sort of motor control. The same sort of tasks that people perform using vision and touch to provide much more detailed feedback than plain success or failure. The same sort of tasks that we couldn’t feel our success or failure unless we could sense what was happening and had hard-wired responses or some goal in mind (or had someone watching to give feedback). This might mean that reinforcement learning is really the wrong tool for learning low-level motor control, unless that is, we don’t care to give our robots eyes.

This leads me to the topic of the next post in this series on rewards and values: “Self-rewarding autonomous machines“. I’ll discuss how a completely autonomous machine will need to have perceptual capabilities of detecting “good” and “bad” events and reward themselves. I’ll also discuss how viewing the world as “objects” that drive actions will lead to a natural analogy with how animals and people function in the world.

Rewards: External or internal?

This is the first post in a series on rewards and values.

The reward that would be most familiar is probably food. We often use treats to train animals, and eating is pleasurable for most people. These rewards are clearly an external thing, aren’t they? This idea is, in some ways, echoed in machine reinforcement learning, as shown in a diagram (pictured below) from the introductory book by Richard Sutton and Andrew Barto. Intuitively this makes sense. We get something from the environment that is pleasurable; the reward feels as though its origin is external. But we can, in the case of animals and people, trace reward and pleasure to internal brain locations and processes. And machines can potentially benefit from this reworking of reinforcement learning, to make explicit that the reward comes from within the agent.

Agent-environment interaction in reinforcement learning

Figure 3.1 from Sutton and Barto, 1998, Reinforcement Learning: An Introduction, MIT Press. Online:

So let’s trace the sensations of a food “reward”. The animal smells and tastes the food; the olfactory and gustatory receptors transmit a signal to the brain that then identifies the odour and taste. A process is performed within the brain deciding whether the food was a pleasurable or unpleasant. This response is learned and causes impulses to seek or avoid the food in future.

Nothing in the food is inherently rewarding. It is the brain that processes the sensations of the food and the brain that produces reward chemicals. For a more detailed article on pleasure and reward in the brain see Berridge and Kringelbach (2008). Choosing the right food when training animals is a process of finding something that their brain responds to as a reward. Once a good treat has been found the animal knows what it wants, and training it is the processes of teaching the animal what to do to get the rewarding treat.

Agent environment interation with internal reward.

Modified Figure 3.1 from Sutton and Barto, 1998, Reinforcement Learning: An Introduction, MIT Press. Online:

We can consider standard implementations of reinforcement learning in machines as a similar process: the machine searches the environment (or “state-space“) and if it performs the right actions to get to the right state it gets a reward. Differences are notable: the agent might not know anything about the environment, how actions move it from one state to another, or what state gives the reward. Animals, on the other hand, come with some knowledge of the environment and themselves, they have some sense of causality and sequences of events, and animals very quickly recognise treats that cause reward.

Another subtle difference is that the machine doesn’t usually know what the target or objective is; the agent performs a blind search. Reinforcement learning works by simulating the agent exploring some (usually simplified) environment until it finds a reward, and then calculating increases in value of states and actions that preceded the reward. Computers can crunch the numbers in simulation, but complexity of the environment and large numbers of available actions are the enemy. Each extra state “dimension” and action adds an exponential increase in the amount of required computation (see “curse of dimensionality“). This sounds different from an animal, that have very simple associations with objects or actions as the targets of rewards. More on this later!

An extension of the machine reinforcement learning problem is the case where the agent doesn’t know what environment state it is in. Rather than getting the environment state the agent only makes “observations” in this model, known as a “partially observable Markov decision process” or POMDP. From these observations the agent can infer the state and predict the action that should be taken, but the agent typically has reduced certainty. Nevertheless, the rewards it receives are still a function of the true state and action. The agent is not generating rewards from its observations, but receiving them from some genie (the trainer or experimenter) that knows the state and gives it the reward. This is a disconnect between what the agent actually senses (the observations) and the rewards that is relevant for autonomous agents including robots.

These implementations of reinforcement learning mimic the training of an animal with treats, where the whole animal is an agent and the trainer is part of the environment that gives rewards. But it doesn’t seem a good model of reward originating in the internal brain processes. Without sensing the food the brain wouldn’t know that it had just been rewarded—it could be argued that brain (and hence the agent) wasn’t rewarded. How much uncertainty in sensations can there be before the brain doesn’t recognise that it has been rewarded? In a computer, where the environment and the agent are all simulated, the distinction between reward coming from the environment or self-generated in the agent may not matter. But in an autonomous robot, where no trainer is giving it rewards, it must sense the environment and decide only from its own observations whether it should be rewarded.

The implementation of reinforcement learning for autonomous agents and robots will be a topic of a later post. Next post, however, I will cover the problem of machines “observing” the world. How do we representing the world as “states” and the robot capabilities as “actions”? I will discuss how animals appear to solve the problem and recent advances in reinforcement learning.

Rewards and values: Introduction

Reward functions are a fundamental part of reinforcement learning for machines. Based partly on Pavlovian, or classical conditioning, exemplified by the pairing of ringing a bell (conditioned stimulus) with the presentation of food (unconditioned stimulus) to a dog repeatedly, resulting in the ringing of the bell alone to cause the dog to salivate (conditioned response).

More recently, developments in reinforcement learning, particularly temporal difference learning, have been compared to the function of reward learning parts of the brain. Pathologies of these reward producing parts of the brain, particularly Parkinson’s disease and Huntington’s disease, show the importance of the reward neurotransmitter dopamine in brain functions for controlling movement and impulses, as well as seeking pleasure.

The purpose and function of these reward centres in the basal ganglia of the brain, could have important implications in way in which we apply reinforcement learning. Especially in autonomous agents and robots. An understanding of the purpose of rewards, and their impact on the development of values in machines and people, also has some interesting philosophical implications that will be discussed

This post introduces what may become a spiral of related posts on concepts of rewards and values covering:

Hopefully this narrowing of post topics results in giving me focus to write and some interesting discourse on the each of the themes of this blog. Suggestions and comments are welcome!