I spy with my computer vision eye… Wally? Waldo?

Lately I’ve been devoting a bit of my attention to image processing and computer vision. It’s interesting to see so many varied processes applied to the problem over the last 50 or so years, especially when computer vision was once thought to be solvable in a single summer’s work. We humans perceive things with such apparent ease that it was probably expected to be a much simpler problem than playing chess. Now, after decades of focused attention, the attempts that appear most successful at image recognition, whether of handwritten digits, street signs, toys, or even thousands of real-world images, are those that, in some way, model the networks of connections and processes of the brain.

You may have heard about the Google learning system that learned to recognise the faces of cats and people from YouTube videos. This is part of a revolution in artificial neural networks known as deep learning. Deep learning architectures include networks of many stochastically activating units trained with clever learning rules (e.g., stochastic gradient descent and contrastive divergence). These networks can be trained to perform image classification at state-of-the-art levels of accuracy. Perhaps another interesting thing about these developments, a number of which have come from Geoffrey Hinton and his associates, is that some of them are “generative”. That is, while learning to classify images, these networks can be “turned around” or “unfolded” to create images, compress and cluster images, or perform image completion. This has obvious parallels to the human ability to imagine scenes, and to the current understanding of the mammalian primary visual cortex, which appears to essentially recreate images received at the retina.

A related type of artificial neural network that has had considerable success is the convolutional neural network. Convolution here is just a fancy term for sliding a small patch of network connections across the entire image to find the result at all locations. These networks also typically use many layers of neurons, and have achieved similar success in image recognition. Convolutional networks may model known processes in the visual cortices, such as simple cells that detect edges of certain orientations. Outlines in images are combined into complex sets of features and classified. An earlier learning system, known as the neocognitron, used layers of simple cell-like filters without the convolution.
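To make “sliding a small patch across the image” concrete, here is a minimal NumPy sketch of a 2D convolution. The 3×3 kernel is a toy vertical-edge detector of my own choosing, loosely analogous to an orientation-selective simple cell, not a filter taken from any particular network.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small kernel over every location of a 2D image (no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # The same weights are applied at every position: the "weight sharing"
            # that makes convolutional layers efficient to learn.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Toy vertical-edge detector, loosely analogous to an orientation-selective simple cell.
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])

image = np.random.rand(28, 28)        # stand-in for a small greyscale image
edge_map = convolve2d(image, edge_kernel)
print(edge_map.shape)                 # (26, 26): the filter's response at every location
```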

The process of applying the same edge-detection filter over the whole image is similar to the parallel processing that occurs in the brain. The thousands of neurons functioning simultaneously are, of course, practically different from the sequential computation performed in the hardware of a computer; however, GPUs with many processor cores now allow parallel processing in machines too. If, rather than using direction-selective simple cells to detect edges, we use image features (such as a loop in a handwritten digit, or the dark circle representing the wheel of a vehicle), we might say the convolution process is similar to scanning an image with our eyes.

Even when we humans are searching for something hidden in a scene, such as our friend Wally (or Waldo), our attention typically centres on one thing at a time. Scanning large, detailed images for Wally often takes us a long time. A computer trained to find Wally using a convolutional network could, with current hardware, methodically scan the image a lot faster than us. It mightn’t be hard to get a computer to beat us in this challenge for many Where’s Wally images using biologically-inspired image recognition systems (rather than the more common, but brittle, image processing techniques).
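As a rough sketch of that methodical scan, the loop below slides a window over a large scene and keeps the best-scoring location. The `wally_score` function is a hypothetical stand-in for whatever trained classifier we might have; the window and stride sizes are arbitrary.

```python
import numpy as np

def wally_score(patch):
    """Hypothetical stand-in for a trained classifier's 'is this Wally?' score."""
    return float(patch.mean())          # placeholder only

def scan_for_wally(scene, window=64, stride=16):
    """Methodically slide a window over the scene, keeping the best-scoring spot."""
    best_score, best_xy = -np.inf, None
    height, width = scene.shape
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            score = wally_score(scene[y:y + window, x:x + window])
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score

scene = np.random.rand(512, 768)        # stand-in for a large Where's Wally scene
print(scan_for_wally(scene))
```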

Even though I think these advances are great, it seems there are things missing from what we are trying to do with these computer vision systems and how we’re trying to train them. We are still throwing information at these learning systems as the disembodied number-crunching machines they are. Consider how our visual perception allows us to recognise objects in images with little regard for scale, translation, shear, rotation, or even colour and illumination; these are major hurdles for computer vision systems, but for us they just provide more information about the scene. These are things we learn to do. Most of the focus of computer vision seems to relate to the “what pathway”, rather than the “how pathway”, of the two-streams hypothesis of vision processing in the brain. Maybe researchers could start looking at ways of making these deep networks take that next step. Though extracting information from a scene, such as locating sources of illumination or the motion of objects relative to the camera, might be hard to fit into the current trend of trying to perform unsupervised learning from enormous amounts of unlabelled data.

I think there may be significant advantages to treating the learning system as embodied, and to making the real-world property of object permanence something the learning system can latch onto. It’s certainly something that provides a great deal of leverage in our own learning about objects and how our interactions influence them. It is worth mentioning that machine learning practitioners already commonly create numerous new modified training images from their given set and see measurable improvements. This is similar to what happens when a person or animal is exposed to an object and given the chance to view it from multiple angles and under different lighting conditions. Having a series of contiguous viewpoints is likely to make it easier for parts of our brain to learn to compensate for the different perspectives that scale, shear, rotate and translate the view of objects. It may even be important to learning to predict and recreate different perspectives in our imagination.
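For concreteness, here is a minimal sketch of that augmentation practice: generating a few modified copies of each training image with NumPy flips, shifts and noise. Real pipelines would typically use a library and a much richer set of transforms; the specific operations below are just illustrative.

```python
import numpy as np

def augment(image):
    """Generate a few modified copies of one training image (flips, shifts, noise)."""
    variants = [image,
                np.fliplr(image),                    # mirror image
                np.roll(image, shift=3, axis=0),     # small vertical translation
                np.roll(image, shift=-3, axis=1)]    # small horizontal translation
    noisy = image + np.random.normal(0, 0.05, image.shape)   # illumination-like jitter
    variants.append(np.clip(noisy, 0.0, 1.0))
    return variants

training_set = [np.random.rand(32, 32)]              # stand-in for labelled images
augmented = [view for img in training_set for view in augment(img)]
print(len(augmented))                                 # 5 views of each original image
```

Whether such synthetic views help as much as genuinely embodied, contiguous experience of an object is exactly the open question raised above.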

Wars of Ideas and killer robots

The book Wired for War, by P. W. Singer, is a fairly broad-ranging account of the history of how technology changes war, right up to the current rise in robotics. A number of “revolutions in military affairs” have occurred over the course of history, and there has been a very poor record of world powers making the transition before getting usurped by the early adopters. Nonetheless, the use of robots and drones in combat is the source of an ongoing debate. Still, war has existed for a long time, and as we approach a time when people may no longer be directly involved in the combat, the focus should probably be turned back to the reasons we fight.

I’m no historian, but I have some passing knowledge of major conflicts in recorded history, and I have the internet at my fingertips, retinas and cochleas. Nation states and empires began to appear in ancient times, and the Greeks and the Romans were happy to conquer as much as they could. In the Middle Ages the Islamic Empires expanded, and the European Christians fought them for lands both considered holy. The Cold War between the Soviet Union and the United States of America was fought largely on the grounds of different societal and economic ideals: capitalism versus communism.

Though probably an oversimplification, I’m going to fit a trend to these events. The underlying basis for these wars could be described as strong convictions: “we are people of a different ethnicity or nation–we want resources, power and glory”; “we have religious beliefs–we want to spread our religion and we want what is holy to us”; and “we have different beliefs regarding how our society and economy should be structured–we want to subjugate the intellectually and morally inferior”.

Even more generally, these thoughts boil down to ideas of national identity, religion and ethnicity, and ethics and morality. These beliefs often combine dangerously with ideas of infallibility, self-importance and superiority. If people didn’t take these ideas so seriously, many wars might not have occurred. There has recently been a reduction in wars among developed nations, most likely a result of the spread of democracy and moderate views. Nevertheless, the ideas of nationalism, ethnicity and religion are still deeply ingrained in many places and are significant factors in current tensions and wars all over the world. If there were strong enough economic incentives, developed nations would likely still enter into conflicts.

Recent conflicts have been complicated by the lack of clear boundaries between sides. With the ideas underlying conflicts often coming from ethnicity and religion, the boundaries become blurry and the groups of people diffuse. Non-state actors emerge across larger areas and populations. As military technology gets more powerful and accessible, people holding fringe ideas can exert more threat, force and damage than they ever could before. Explosives are a glaring example of this.

Robots are the source of the current debate, though, even though access to advanced robots is still mostly limited to advanced militaries and corporations. The main concerns surrounding the use of robots are that wars will likely be easier to start and more common when countries don’t risk their own casualties, and that autonomous robots might be worse at discriminating civilians from combatants.

Robots will almost certainly make wars less unattractive, but whether this reduced reluctance to take part in wars is actually a bad thing depends somewhat on the wars and conflicts that are entered into. Peacekeeping would be a great use of robots, though perhaps not robots of the “killer” variety. Horrific conflicts are happening right now, and developed countries intervene minimally or not at all because of issues such as low economic incentives, UN vetoes, and the certain loss of life they would sustain.

No doubt it would also be easier to start wars; probably a less noble practice than intervening in civil wars and genocides. However, initiating a war is no longer an easy thing to do secretively. The proliferation of digital media recording devices and the internet makes it much harder for wars not to draw international attention. But perhaps more important is that most developed countries that possess robots are liberal democracies, where there is more to the opposition to war than just the loss of soldiers’ lives. This opposition to war is a large source of the negative sentiment people have towards killer robots in the first place.

Even though the “more wars” issue is far from resolved, let’s turn our attention to the use of killer robots in the conflict itself.

First, from a technical perspective, robots will one day almost certainly be more capable and more objective than human soldiers in determining the combatant/non-combatant status of people. Also, because robots aren’t at risk of dying in the same way as a person, the need to rush decisions and retaliate with lethal force is reduced. But let’s return to the idea-centric view of conflict, and consider the use of robots in conflicts such as the “War on Terror”.

The drones being used in Pakistan and Afghanistan are being used against people who believe in the oppression of women and death sentences for blasphemers–people who oppose many things considered universal rights by the West. It seems that to many it’s a foregone conclusion that the “bad guys” need to be killed, and the main issue with using robots and drones is civilian casualties. However, a real problem is that many “civilians” share the beliefs underlying the conflict, and at any moment the only difference between a civilian and a combatant might be whether they are firing a weapon or carrying a bomb.

Robotic war technology may get to the point of perfect accuracy and discrimination, but the fact will remain that the “combatants” are regular people fighting for their beliefs. If “perfect” robotic weapons were created that were capable of immediately killing any person who plants a bomb or shoots a rifle, this would be an incredible tool for war, or rather, oppression. I think that kind of oppression would deserve a lot of concern.

Even faced with something as oppressive as a ubiquitous army of perfect killer robots, people in possession of the right (or wrong) mixture of ideas, and strong enough conviction, aren’t likely to give up. Suicide-bombers don’t let death dissuade them. Is oppression and violence even the best response to profoundly incompatible beliefs and ideas? Even ideas that, themselves, advocate oppression and violence?

Counter-insurgencies are not conventional wars. Beliefs and ideas are central to their cause–the combatants aren’t going to give in because their leader is killed or their land taken. The conflict is unlikely to end if the fighting only targets people; it needs to target their beliefs and ideas. Hence the strategy of winning the “hearts and minds” of the people. Ideas are not “defeated” until there aren’t any people who still dogmatically follow them.

While robotics looks to be the next revolution in military affairs, in conflicts between nation states and counter-insurgencies a better revolution might come from improvements in the technology and techniques for influencing the beliefs that cause wars. To that end, rather than having robots that kill, a productive use of robots could be to safely educate, debate with, and persuade violent opponents to change their beliefs and come to a peaceful resolution. Making robots capable of functioning as diplomats might be a bigger technical challenge than making robots that can distinguish civilians from combatants. But let’s be fanciful.

It continues to be a great tragedy that the ideas that give rise to conflict are themselves rarely put to the test. It’s unfortunate, but I think it’s no coincidence. Many of the most persistent ideas–the ideas people fight to defend–are put on pedestals: challenging the idea is treason, blasphemy or, even worse, politically incorrect. 😐

Evolving rewards and values

This is the fourth post in a series on rewards and values. Previous posts have discussed how rewards are internally generated in the brain and how machines might be able to learn autonomously the same way. But the problem still exists: how is the “reward function” created? In the biological case, evolution is a likely candidate. However, the increasing complexity of behaviour seen in humans seems to have led to an interesting flexibility in what are perceived as rewards and punishments. In robots a similar flexibility might be necessary.

Note: here the phrase “reward function” is used to describe the process of taking some input (e.g. the perceived environment) and calculating “reward” as output. A similar phrase and meaning is used for “value function”.

Let’s start with the question posed in the previous post: what comes first – values or rewards? The answer might be different depending on whether we are talking about machine or biological reinforcement learning. A robot or a simulated agent will usually be given a reward function by a designer. The agent will explore the environment and receive rewards and punishments, and it will learn a “value function”. So we could say that, to the agent, the rewards precede the values. At the very least, the rewards precede the learning of values. But the designer knew what the robot should be rewarded for – knew what result the agent should value. The designers had some valuable state in mind when they designed the reward function. To the designer, the value informs the reward.

How about rewards in animals and humans? The reward centres of the brain are not designed in the sense that they have a designer. Instead they are evolved. What we individually value and desire as animals and humans is largely determined by what we feel is pleasurable and what is not. This value is translated into learned drives to perform certain actions. The process of genetic recombination and mutation, a key component of evolution, produces different animal anatomies (including digestive systems) and different pleasure responses to the environment. Animals that find pleasure in eating food that is readily available and compatible with their digestive systems will have a much greater chance of survival than animals that only find pleasure in eating things that are rare or poisonous. Through natural selection, pleasure could be expected to converge on what is valuable to the animal.

In answer to the question – what comes first, rewards or values? – it would seem that value comes first. Of course this definition of “value” is related to the objective fact of what the animal or agent must do to achieve its goals of survival or some given purpose. But what of humans? Evolutionary psychology and evolutionary neuroscience make a reasonable case that, along with brain size and structure, many human behaviours and their underlying neural processes developed through natural selection. While such hypotheses are difficult to test, people seem to have evolved to feel pleasure from socialising – driving us to make social bonds and form groups. And people seem to have evolved feelings of social discomfort – displeasure from embarrassment and being rejected. Although the circumstances that caused the selection of social behaviours aren’t clear, many of our pleasure and displeasure responses seem able to be rationalised in terms of evolution.

An interesting aspect of the human pleasure response is the pleasure from achievements. It would certainly be normal for Olympic gold medallists to feel elation at winning. But even small or common victories, such as our first unaided steps as a child or managing to catch a ball, can elicit varying amounts of pleasure and satisfaction. Is this pleasure due to the adulation and praise of onlookers that we have been wired to enjoy? Or is there a more fundamental effect, where success at any self-determined goal causes pleasure? This could be related to the loss of pleasure and enjoyment that is often associated with Parkinson’s disease. Areas of the brain related to inhibiting and coordinating movement, which deteriorate as part of Parkinson’s disease, are also strongly associated with reward and pleasure generation.

And we can bring this back to an autonomous robot that generates its own reward: a robot that has multiple purposes will need different ways of valuing the objects and states of the environment depending on what its current goal is. When crossing the road the robot needs to avoid cars; when cleaning a car the robot might even need to enter the car. This kind of flexibility in determining what the current goal is, combined with reward feedback that determines on one level whether the goal has been reached, and on another whether the goal was as good as it was thought to be, could be an important process in the development of “intelligent” robots.
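A minimal sketch of that goal-dependent flexibility might look like the function below, where the same percept is rewarding or punishing depending on the robot’s current goal. The goal names and percept fields are hypothetical, purely for illustration.

```python
def internal_reward(goal, percept):
    """Sketch: the same percept is valued differently depending on the current goal.

    `goal` and the keys of `percept` are hypothetical, for illustration only.
    """
    if goal == "cross_road":
        # Being near a car is dangerous while crossing the road.
        return -1.0 if percept["distance_to_car"] < 2.0 else 0.1
    if goal == "clean_car":
        # The same proximity is exactly what we want while cleaning the car.
        return 1.0 if percept["distance_to_car"] < 0.5 else -0.01
    return 0.0

print(internal_reward("cross_road", {"distance_to_car": 1.5}))   # -1.0
print(internal_reward("clean_car",  {"distance_to_car": 0.3}))   #  1.0
```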

However, before I conclude, let’s consider one fallout from the planetary dominance of humans – our selection pressures have nearly disappeared. We are, after all, the current dominant species on this planet. Perception-reward centres that did not evolve to deal with newly discovered and human-manufactured stimuli aren’t likely to be strongly selected against. And through our ingenuity we have found powerful ways to “game” our evolved pleasure centres – finding and manufacturing super-normal stimuli.

Dan Dennett: Cute, sweet, sexy, funny (TED talk video available on YouTube).

The video linked above features Dan Dennett describing how evolution has influenced our feelings of what is “cute, sweet, sexy and funny”. The result is possibly the opposite of what we intuitively feel and think: there is nothing inherently cute, sweet, sexy or funny; these sensations and feelings evolved in us to find value in our surroundings and in each other. We have evolved to find babies and young animals cute, we have evolved to find food that is high in energy tasty, and we have evolved to find healthy members of the opposite sex attractive. Funny wasn’t explained as clearly – Dan Dennett described a hypothesis that it is related to making boring or unpleasant jobs bearable. I would speculate that humour might also have been selected for making people more socially attractive, or making them, at the very least, more bearable. 🙂

Building on this understanding of pleasure and other feelings being evolved, the topic of the next post in this series will be super-normal stimuli and how they influence our views on human values and ethics. Let’s begin the adventure into the moral minefield.

Self-rewarding autonomous machines

This is the third post in a series about rewards and values. I’ve written about rewards being generated from within the brain, rather than actually being external, and how autonomous robots must be able to calculate their own rewards in a similar way. I’ve also written about a way in which the world is represented for robots – the “state space” – and the benefits of using object-oriented representations. Combining these two things will be the order of this post: a robot that perceives the world as objects and that rewards itself. But how exactly should this be done and what is the use of learning this way?

First, let’s consider another feature of reinforcement learning. I’ve described a state space, but I didn’t go into “state-action values”. In a world that is divided up into states, each state has a range of possible actions. The value (state-action value) of any of those actions in that state is the reward we can expect to follow from performing it. Consider the problem of balancing an inverted pendulum: we only get a negative reward (punishment) if the pendulum tips too far to the left or right, or if we reach the limit of movement left or right. If we are in a state where the pendulum is tipping to the left, moving in the opposite direction will hasten the fall and the punishment. The action with more value is the one that keeps the pendulum balanced. It would be common for the angles and the positions of the base of the inverted pendulum to be partitioned into increments. Smaller partitions give finer control, but also mean that more time needs to be spent exploring and calculating the value of actions in each of those partitions. This is an inverted pendulum version of the “gridworld”.
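As a rough sketch of what this looks like in code, below is a tabular state-action value table over a coarsely discretised pendulum, updated with the standard one-step Q-learning rule. The bin counts, value ranges and actions are arbitrary choices for illustration, not from any particular implementation.

```python
import numpy as np

# Discretise the pendulum's angle and base position into coarse bins.
n_angle_bins, n_pos_bins, n_actions = 20, 20, 3      # actions: push left, stay, push right
q_table = np.zeros((n_angle_bins, n_pos_bins, n_actions))

def discretise(angle, position, angle_range=(-0.5, 0.5), pos_range=(-2.0, 2.0)):
    """Map continuous readings into the coarse grid of states."""
    a = (angle - angle_range[0]) / (angle_range[1] - angle_range[0]) * n_angle_bins
    p = (position - pos_range[0]) / (pos_range[1] - pos_range[0]) * n_pos_bins
    return int(np.clip(a, 0, n_angle_bins - 1)), int(np.clip(p, 0, n_pos_bins - 1))

def q_update(state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """Standard one-step Q-learning update of a state-action value."""
    target = reward + gamma * np.max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])

# Example: one update after the pendulum tipped too far (reward = -1).
s = discretise(angle=0.4, position=0.0)
s_next = discretise(angle=0.55, position=0.0)
q_update(s, action=2, reward=-1.0, next_state=s_next)
```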

Now consider a robot that sees the scene in front of it. The robot has inbuilt perceptual algorithms and detects objects in the environment. Each of these detected objects affords a set of complex actions – moving towards, moving away, moving around, inspecting, etc. Again these actions might have values, but how does the robot determine the object-action value, and how does it decide what reward it should receive? Remember that, in an autonomous agent, rewards are internal responses to sensory feedback.  Therefore the rewards the autonomous agent receives are what it gives itself, and must come from its own sensory feedback. The reward could come from visually detecting the object, or from any other sensor information detected from the object. An internal “reward function”, objective, or external instruction determines whether the sensory and perceptual feedback is rewarding.

Now the reward is no longer directly telling the robot what path it should be taking or how it should be controlling its motors; the reward is telling it what it should be seeking or doing at a higher level of abstraction. The act of finding an object may be rewarding, or the reward may come from interacting with the object and detecting its properties. Imagine a robot on Mars, inspecting rocks, but only knowing whether it found the chemicals it was looking for after probing a rock with its tools. Visually detecting a rock and then detecting the chemicals the robot is looking for associates the reward and value with the visual qualities of the rock – leading it to seek more rocks with that appearance. If the objective of the robot is to complete a movement, such as grasping a cup, the inclusion of self-perception allows the robot to monitor and detect its success. With detailed sensory feedback telling the robot where it is in relation to its target, feedback controllers can be used – typically a much more efficient and generalised process for controlling movements than the random search for a trajectory using gridworld-style reinforcement learning.
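A toy sketch of that Mars-rover idea: the reward is generated internally when the probe detects the target chemical, and that delayed reward then becomes associated with the rock’s visual appearance. The field names, the target chemical and the simple running-average update are all assumptions made for illustration.

```python
def probe_reward(rock, target_chemical="sulphates"):
    """Internal reward, generated only from the robot's own sensor feedback."""
    return 1.0 if target_chemical in rock["chemicals"] else 0.0

# A simple running estimate of how valuable rocks of a given appearance are.
appearance_value = {}

def update_appearance_value(appearance, reward, alpha=0.2):
    """Associate the delayed probe reward with the rock's visual qualities."""
    old = appearance_value.get(appearance, 0.0)
    appearance_value[appearance] = old + alpha * (reward - old)

rock = {"appearance": "dark and layered", "chemicals": ["sulphates", "silicates"]}
update_appearance_value(rock["appearance"], probe_reward(rock))
print(appearance_value)   # rocks that look like this become worth seeking out
```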

So if this is such a great way of applying reinforcement learning in robots, why aren’t we doing it? The simple matter is that our current algorithms and processes for perception and perceptual learning just aren’t good enough to recognise objects robustly. So what good is this whole idea if we can’t visually recognise objects? Planning out how to create a robot with robust autonomy, and that is capable of learning, can point us in a direction to focus our efforts in research and development. Perception is still a major stumbling block in the application of autonomous robotics. Biological clues and recent successes suggest that deep convolutional neural networks might be the way to go, but new, faster ways of creating and training them are probably necessary. Multiple sensor modalities and active interaction and learning will likely also be important. Once we have more powerful perceptual abilities in robots, they can more effectively monitor their own motion and adapt their feedback controllers to produce successful movements. With success being determined by the application of reinforcement learning, the learning cycle can be self-adapting and autonomous.

More can still be said of the development of the reward function that decides what sensory states and percepts are rewarding (pleasurable) and what are punishing (painful). The next post will speculate on the biological evolution of rewards and values – which comes first? – and how this might have relevance to a robot deciding what sensory states it should find rewarding.

State space: Quantities and qualities

This is the second post in a series on rewards and values; the previous post discussed whether rewards are external stimuli or internal brain activity. This post discusses the important issue of representing the world in a computer or robot, and the practice of describing the world as discrete quantities or abstract qualities.

The world we live in can usually be described as being in a particular state, e.g., the sky is cloudy, the door is closed, the car is travelling at 42km/h. To be absolutely precise about the state of the world we need to use quantities and measurements: position, weight, number, volume, and so on. But how often do we people know the precise quantitative state of the world we live in? Animals and people often get by without quantifying the exact conditions of the world around them, instead perceiving the qualities of the world – recognising and categorising things, and making relative judgements about position, weight, speed, etc. But then why are robots often programmed to use quantitative descriptions of the world rather than qualitative descriptions? This is a complex issue that won’t be comprehensively answered in this post, but some differences in computer-based representation with quantities and with qualities will be described.

For robots, the world can be represented as a state space. When dealing with measurable quantities the state space often divides up the world into partitions. A classic example in reinforcement learning is navigating a “gridworld“. In the gridworld, the environment the agent finds itself in is literally a square grid, and the agent can only move in the four compass directions (north, south, east and west). In the computer these actions and states would usually be represented as numbers: state 1, state 2, …, state n, and action 1, action 2, …, action m. The “curse of dimensionality” appears because storing the value of every state-action pair requires the number of states multiplied by the number of actions. If we add another dimension to the environment with another k possible values, our number of states is multiplied by k. A ten by ten grid, with another dimension of 10 values, goes from having 100 states to 1000 states. With four different movements available the agent has 4 actions, so there would be 4000 state-action pairs.
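The arithmetic is simple enough to sketch directly: the size of a tabular representation grows multiplicatively with every state dimension added.

```python
def table_size(state_dims, n_actions):
    """Number of states and state-action values a tabular learner must store."""
    n_states = 1
    for d in state_dims:
        n_states *= d
    return n_states, n_states * n_actions

grid = (10, 10)          # a ten-by-ten gridworld
n_actions = 4            # north, south, east, west

print(table_size(grid, n_actions))           # (100, 400)
print(table_size(grid + (10,), n_actions))   # (1000, 4000): one extra 10-valued dimension
```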

While this highlights one serious problem of representing the world quantitatively, an equally serious problem is deciding how fine our quantity divisions should be. If the agent is a mobile robot driving around a laboratory with only square obstacles, we could probably get by dividing the world up into 50cm x 50cm squares. But if the robot were required to pick something up from a tabletop, it might need accuracy down to the centimetre. If it drives around the lab as well as picking things up from tabletops, dividing up the world gets trickier. The grid that makes up the world can’t just describe occupancy; areas of the grid occupied by objects of interest need to be specified as those objects, adding more state dimensions to the representation.

When we people make a choice to do something, like walk to the door, we don’t typically update that choice each time we move 50cm. We collapse all the steps along the way into a single action. Hierarchical reinforcement learning does just this, with algorithms coming under this banner collecting low-level actions into high-level actions, hierarchically. One popular framework collects actions into “options”: a method of selecting actions (e.g., ‘go north’) and evaluating end-conditions (e.g., hitting a wall or running out of time) that reduces the number of times an agent needs to make a choice (choosing the option once rather than choosing ‘go north’ 100 separate times) before seeing how things pan out. This simplifies the process of choosing the actions that the agent performs, but it doesn’t simplify the representation of the environment.
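Here is a minimal sketch of what such an “option” might look like: a policy over primitive actions plus a termination condition, so that the agent only makes its next decision when the option finishes. The class and the ‘go north until a wall’ example are illustrative assumptions, not the canonical options-framework code.

```python
class Option:
    """A temporally extended action: a policy plus a termination condition."""
    def __init__(self, name, policy, is_terminated):
        self.name = name
        self.policy = policy                  # maps state -> primitive action
        self.is_terminated = is_terminated    # maps state -> True/False

def run_option(option, state, step, max_steps=100):
    """Execute primitive actions until the option says it is finished."""
    for _ in range(max_steps):
        if option.is_terminated(state):
            break
        state = step(state, option.policy(state))
    return state   # the agent only makes its next choice here

# Illustrative 'go north until a wall is hit' option on a simple (x, y) grid.
go_north = Option("go_north",
                  policy=lambda s: "north",
                  is_terminated=lambda s: s[1] >= 9)

def step(state, action):
    x, y = state
    return (x, y + 1) if action == "north" else (x, y)

print(run_option(go_north, (3, 0), step))   # (3, 9)
```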

When we look around, we see objects – right now you are looking at some sort of computer screen – and we also see the objects that make up whatever room we’re in: the door, the floor, the walls, tables and chairs. In our minds, we seem to represent the world around us as a combined visual and spatial collection of objects. Describing the things in the world as the objects they are in our minds allows our “unit of representation” to be any size, and can dramatically simplify the way the world is described. And that is what is happening in more recent developments in machine learning, specifically with relational reinforcement learning and object-oriented reinforcement learning.

In relational reinforcement learning, things in the world are described by their relationship to other things. For example, the coffee cup is on the table and the coffee is in the coffee cup. These relations can usually be described using simple logic statements. Similar to relational abstraction of the world, object-oriented reinforcement learning allows objects to have properties and associated actions, much like classes in object-oriented programming. Given that object-oriented programming was designed partly around how we people describe the world, viewing the world as objects has a lot of conceptual benefits. The agent considers the state of objects and learns the effects of actions with those objects. In the case of a robot, we reduce the problem of having large non-meaningful state spaces, but then run into the challenge of recognising objects – a serious hurdle in the world of robotics that isn’t yet solved.
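A small sketch of the flavour of such a representation: objects with properties, plus logic-style relations between them. The classes, property names and query are illustrative assumptions rather than any specific relational or object-oriented RL formalism.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """An object-centred unit of representation: properties, not grid cells."""
    name: str
    properties: dict = field(default_factory=dict)

cup = SceneObject("coffee_cup", {"contains": "coffee", "graspable": True})
table = SceneObject("table", {"supports_objects": True})

# Logic-style relations, in the spirit of relational reinforcement learning.
relations = [("on", cup.name, table.name), ("in", "coffee", cup.name)]

def holds(relation, relations):
    """Simple query: is this relation true in the current state?"""
    return relation in relations

print(holds(("on", "coffee_cup", "table"), relations))   # True
```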

A historical reason why quantitative divisions of the state space were used in the first place is that some problems, such as balancing a broom or gathering momentum to get up a slope, were designed to use as little prior information and as little sensory feedback as possible. The challenge became how to get a system to efficiently learn to solve these problems when it has to blindly search for a reward or avoid a punishment. Generally speaking, many of the tasks requiring the discrete division of a continuous range are ones that involve some sort of motor control. The same sort of tasks that people perform using vision and touch to provide much more detailed feedback than plain success or failure. The same sort of tasks in which we couldn’t feel our success or failure unless we could sense what was happening and had hard-wired responses or some goal in mind (or had someone watching to give feedback). This might mean that reinforcement learning is really the wrong tool for learning low-level motor control; unless, that is, we don’t care to give our robots eyes.

This leads me to the topic of the next post in this series on rewards and values: “Self-rewarding autonomous machines“. I’ll discuss how a completely autonomous machine will need the perceptual capabilities to detect “good” and “bad” events and to reward itself. I’ll also discuss how viewing the world as “objects” that drive actions leads to a natural analogy with how animals and people function in the world.

Rewards: External or internal?

This is the first post in a series on rewards and values.

The reward that would be most familiar is probably food. We often use treats to train animals, and eating is pleasurable for most people. These rewards are clearly an external thing, aren’t they? This idea is, in some ways, echoed in machine reinforcement learning, as shown in a diagram (pictured below) from the introductory book by Richard Sutton and Andrew Barto. Intuitively this makes sense. We get something from the environment that is pleasurable; the reward feels as though its origin is external. But we can, in the case of animals and people, trace reward and pleasure to internal brain locations and processes. And machines can potentially benefit from this reworking of reinforcement learning, to make explicit that the reward comes from within the agent.

Agent-environment interaction in reinforcement learning

Figure 3.1 from Sutton and Barto, 1998, Reinforcement Learning: An Introduction, MIT Press. Online: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node28.html

So let’s trace the sensations of a food “reward”. The animal smells and tastes the food; the olfactory and gustatory receptors transmit a signal to the brain, which then identifies the odour and taste. A process within the brain decides whether the food was pleasurable or unpleasant. This response is learned and causes impulses to seek or avoid the food in future.

Nothing in the food is inherently rewarding. It is the brain that processes the sensations of the food and the brain that produces reward chemicals. For a more detailed article on pleasure and reward in the brain see Berridge and Kringelbach (2008). Choosing the right food when training animals is a process of finding something that their brain responds to as a reward. Once a good treat has been found the animal knows what it wants, and training it is the process of teaching the animal what to do to get the rewarding treat.

Agent-environment interaction with internal reward.

Modified Figure 3.1 from Sutton and Barto, 1998, Reinforcement Learning: An Introduction, MIT Press. Online: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node28.html

We can consider standard implementations of reinforcement learning in machines as a similar process: the machine searches the environment (or “state-space“) and if it performs the right actions to get to the right state it gets a reward. Differences are notable: the agent might not know anything about the environment, how actions move it from one state to another, or what state gives the reward. Animals, on the other hand, come with some knowledge of the environment and themselves, they have some sense of causality and sequences of events, and they very quickly recognise treats that cause reward.

Another subtle difference is that the machine doesn’t usually know what the target or objective is; the agent performs a blind search. Reinforcement learning works by simulating the agent exploring some (usually simplified) environment until it finds a reward, and then calculating increases in the value of the states and actions that preceded the reward. Computers can crunch the numbers in simulation, but complexity of the environment and large numbers of available actions are the enemy. Each extra state “dimension” and action adds an exponential increase in the amount of required computation (see “curse of dimensionality“). This sounds different from animals, which form very simple associations with the objects or actions that are the targets of rewards. More on this later!

An extension of the machine reinforcement learning problem is the case where the agent doesn’t know what environment state it is in. Rather than getting the environment state, the agent only makes “observations” in this model, known as a “partially observable Markov decision process” or POMDP. From these observations the agent can infer the state and predict the action that should be taken, but the agent typically has reduced certainty. Nevertheless, the rewards it receives are still a function of the true state and action. The agent is not generating rewards from its observations, but receiving them from some genie (the trainer or experimenter) that knows the state and gives it the reward. This disconnect between what the agent actually senses (the observations) and the rewards it receives is relevant for autonomous agents, including robots.
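A toy sketch of that disconnect: the environment computes the reward from the hidden true state, while the agent only ever sees a noisy observation. The dynamics, the noise and the rewarding state are arbitrary assumptions, just to make the separation explicit.

```python
import random

def pomdp_step(true_state, action):
    """Toy POMDP step: the environment knows the true state; the agent does not."""
    next_state = (true_state + action) % 10                  # hidden dynamics
    observation = next_state + random.choice([-1, 0, 1])     # noisy view of the state
    reward = 1.0 if next_state == 7 else 0.0                 # reward depends on the TRUE state...
    return next_state, observation, reward                   # ...not on the observation

state = 0
for _ in range(8):
    state, obs, reward = pomdp_step(state, action=1)
    # The agent must infer where it is from `obs`; the reward comes from a
    # "genie" that can see `state` directly.
    print(obs, reward)
```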

These implementations of reinforcement learning mimic the training of an animal with treats, where the whole animal is an agent and the trainer is part of the environment that gives rewards. But it doesn’t seem a good model of reward originating in internal brain processes. Without sensing the food, the brain wouldn’t know that it had just been rewarded—it could be argued that the brain (and hence the agent) wasn’t rewarded. How much uncertainty in sensations can there be before the brain doesn’t recognise that it has been rewarded? In a computer, where the environment and the agent are all simulated, the distinction between reward coming from the environment or being self-generated in the agent may not matter. But in an autonomous robot, where no trainer is giving it rewards, it must sense the environment and decide only from its own observations whether it should be rewarded.

The implementation of reinforcement learning for autonomous agents and robots will be the topic of a later post. In the next post, however, I will cover the problem of machines “observing” the world. How do we represent the world as “states” and the robot’s capabilities as “actions”? I will discuss how animals appear to solve the problem and recent advances in reinforcement learning.

Rewards and values: Introduction

Reward functions are a fundamental part of reinforcement learning for machines. The approach is based partly on Pavlovian, or classical, conditioning, exemplified by repeatedly pairing the ringing of a bell (conditioned stimulus) with the presentation of food (unconditioned stimulus) to a dog, until the ringing of the bell alone causes the dog to salivate (conditioned response).

More recently, developments in reinforcement learning, particularly temporal difference learning, have been compared to the function of the reward-learning parts of the brain. Pathologies of these reward-producing parts of the brain, particularly Parkinson’s disease and Huntington’s disease, show the importance of the reward neurotransmitter dopamine in brain functions for controlling movement and impulses, as well as seeking pleasure.
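For readers who haven’t met temporal difference learning, here is a minimal sketch of the TD(0) update, whose prediction-error term is the quantity often compared with phasic dopamine signals. The bell-and-food values are a toy illustration of conditioning, not a model from the literature.

```python
def td_update(value, state, next_state, reward, alpha=0.1, gamma=0.9):
    """TD(0): nudge the value of `state` towards reward plus discounted next value."""
    td_error = reward + gamma * value[next_state] - value[state]   # prediction error
    value[state] += alpha * td_error
    return td_error

value = {"bell": 0.0, "food": 1.0}   # assume food already has value to the animal
# Repeated pairing shifts value onto the predictive stimulus (the bell).
for _ in range(20):
    td_update(value, "bell", "food", reward=0.0)
print(round(value["bell"], 2))        # approaches gamma times the value of food
```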

The purpose and function of these reward centres in the basal ganglia of the brain could have important implications for the way in which we apply reinforcement learning, especially in autonomous agents and robots. An understanding of the purpose of rewards, and their impact on the development of values in machines and people, also has some interesting philosophical implications that will be discussed.

This post introduces what may become a spiral of related posts on concepts of rewards and values, covering: whether rewards are external stimuli or internal brain activity; how the world is represented as a “state space” of quantities or qualities; self-rewarding autonomous machines; the evolution of rewards and values; and super-normal stimuli and their implications for human values and ethics.

Hopefully this narrowing of post topics gives me the focus to write, and results in some interesting discourse on each of the themes of this blog. Suggestions and comments are welcome!

Mind the Leap: Introduction

It’s been a long time since I created this blog.  I wrote a lot of draft posts, but never edited or posted them; until now.  The best place to start is probably a more detailed description of the things that I want to cover in this space.  Hopefully it will not only inform potential readers of what they might expect from this blog, but also keep me on track to writing on the main topics I want to share ideas on.

First: My day job (although I’m not currently getting paid) is postgraduate research on robot intelligence.  As one of the few PhD students who hasn’t become jaded after working on the same research topic for years, I still find studying robotics and artificial intelligence really engaging and enjoyable.  A part of this blog will be devoted to talking about these topics, but usually at a non-technical, conceptual level.

Second: Intelligence is such a fraught term, though, that I have spent a lot of time looking into the underlying neuroscience and thinking about biological intelligence, consciousness, the mind and the brain.  This continues to be a big influence on my approach to robot intelligence.  While some additions along the evolutionary path to the human brain might not be necessary for functional robot intelligence, people are the primary example of the general intelligence we want in our robots.  Some of this blog will discuss how neuroscience and cognitive science might translate into AI and robotics.

Third: As the brain becomes less of a mystery, the soul is no longer a necessary hypothesis.  Physicalism, the belief that the world is only matter and energy and without a spiritual dimension, is a starting point for a lot of my thoughts about the world.  A significant amount of what I would like to discuss is more philosophical in nature.  While I usually try to have a scientific underpinning—or use a thought experiment as an intuition pump—philosophical, moral and ethical issues often remain disputable.  Nonetheless, I think about these issues, and I think they are important enough that another voice can’t hurt.

Those are the main themes and topics this blog will cover.  The style of writing is something I want to be conscious of too.  There are fine lines between entertaining and obfuscating; informative and long-winded; and concise and plain.  Many of my drafts were possibly drifting towards long-winded attempts to be entertaining.  With a personal credo of trying to improve at all things I do, I’ll look for a balance.  Humour, like morality, is subjective.  But that doesn’t mean there aren’t ways of doing these things better.  Potential readers beware: there’s no telling what you’ll be subjected to.  Even, sentences that a preposition they end in.  Yoda would be proud.  Or really disappointed.  Or just confused… I’m not sure.  ( Lame grammar joke, Star Wars reference, and smiley face: check. 😀 )