Learning algorithms for people: Supervised learning

Access to education is widely considered a human right, and, as such, many people spend years at school learning. Many of these people also spend a lot of time practising sport, musical instruments and other hobbies and skills. But how exactly do people go about trying to learn? In machine learning, algorithms are clearly defined procedures for learning. Strangely, though the human brain is a machine of sorts, we don’t really consider experimenting with “algorithms” for our own learning. Perhaps we should.

Machine learning is typically divided into three paradigms: supervised learning, reinforcement learning, and unsupervised learning. These roughly translate into “learning with detailed feedback”, “learning with rewards and punishments” and “learning without any feedback” respectively. These types of learning have some close relationships to the learning that people and animals already do.

Many people already do supervised learning, although probably much more haphazardly than a machine algorithm might dictate. Supervised learning is good when the answers are available. So when practising for a quiz, or practising a motor skill, we make attempts, then try to adjust based on the errors we observe. A basic algorithm for people to perform supervised learning to memorise discrete facts could be written as:

given quiz questions, Q, correct answers, A, and stopping criteria, S
    do
        for each quiz question q in Q
            record predicted answer p
        for each predicted answer p
            compare p with correct answer, a
            record error, e
    while stopping criteria, S, are not met
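
As a rough illustration only, the quiz procedure above could be sketched in Python. The names `practise_quiz`, `predict` and `review` are made up for this sketch; `predict` stands in for the learner answering a question, and the `review` step (studying the answers that were wrong) is an assumption about how the recorded errors get used.

```python
def practise_quiz(questions, answers, predict, review,
                  target_accuracy=1.0, max_rounds=10):
    """Repeat the quiz until the target accuracy is reached or rounds run out."""
    history = []
    for _ in range(max_rounds):
        # Record a predicted answer for every question.
        predictions = [predict(q) for q in questions]
        # Compare each prediction with the correct answer and record the errors.
        errors = [p != a for p, a in zip(predictions, answers)]
        accuracy = 1 - sum(errors) / len(questions)
        history.append(accuracy)
        # Assumed adjustment step: study the answers that were wrong.
        for q, a, wrong in zip(questions, answers, errors):
            if wrong:
                review(q, a)
        # Stopping criteria: accuracy target (the round limit is the loop bound).
        if accuracy >= target_accuracy:
            break
    return history


# A toy "learner" that simply remembers every answer it has reviewed.
memory = {}
questions = ["capital of France", "7 x 8", "chemical symbol for gold"]
answers = ["Paris", "56", "Au"]
history = practise_quiz(questions, answers, memory.get, memory.__setitem__)
print(history)
```

A real person forgets, of course, so the accuracy would climb more slowly than it does for this perfect-memory stand-in.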

Anyone could use this procedure for rote memorisation of facts, using a certain percentage of correct answers and a set time as the stopping criteria. However, this algorithm supposes the existence of questions associated with the facts to memorise. Memorisation can be difficult without a context to prompt recall, and questions can also help link these facts together. Similarly, people commonly find recall easier when knowledge is presented in visual, aural and tactile formats. The machine learning equivalent would be adding extra input dimensions to associate with the output. Supervised learning also makes sense for trying to learn motor skills; this is roughly what many people do already when practising skills for sports or musical instruments.

It makes sense to use slightly different procedures for practising motor skills compared to doing quizzes. In addition to getting the desired outcome, gaining proficiency also requires practising the technique of the skill. Good outcomes can often be achieved with poor technique, and poor outcomes might occur with good technique. But to attain a high proficiency, technique is very important. To learn a skill well, it is necessary to pay attention not only to errors in the outcome, but also to errors in the technique. For this reason, it is good to first spend time focusing practice on the technique. Once the technique is correct, focus can then be more effectively directed toward achieving the desired outcome.

given correct skill technique, T, and stopping criteria, S
    do
        attempt skill
        compare attempt technique to correct technique, T
        note required adjustments to technique
    while stopping criteria, S, are not met

given desired skill outcome, O, and stopping criteria, S
    do
        attempt skill
        compare attempt outcome to desired outcome, O
        note required adjustments to skill
    while stopping criteria, S, are not met

These basic, general algorithms spell out what many people already do: learn through repeated phases of attempts, evaluations and adjustments. It’s possible to continue to describe current methods of teaching and learning as algorithms. And it’s also possible to search for optimal learning processes, characterising the learning algorithms we use, and the structure of education, to discover what is most effective. It may be that different people learn more effectively using different algorithms, or that some people could benefit from practising these algorithms to get better at learning. In future, I will try to write some further posts about learning topics and skills, and applications for different paradigms of learning, as well as algorithms describing systems of education.

I spy with my computer vision eye… Wally? Waldo?

Lately I’ve been devoting a bit of my attention to image processing and computer vision. It’s interesting to see so many varied processes applied to the problem over the last 50 or so years, especially when computer vision was once thought to be solvable in a single summer’s work. We humans perceive things with such apparent ease, it was probably thought that it would be a much simpler problem than playing chess. Now, after decades of focused attention, the attempts that appear most successful at image recognition of handwritten digits, street signs, toys, or even thousands of real-world images, are those that, in some way, model the networks of connections and processes of the brain.

You may have heard about the Google learning system that learned to recognise the faces of cats and people from YouTube videos. This is part of a revolution in artificial neural networks known as deep learning. Among deep learning architectures are ones that use many units that activate stochastically and clever learning rules (e.g., stochastic gradient descent and contrastive divergence). The networks can be trained to perform image classification to state-of-the-art levels of accuracy. Perhaps another interesting thing about these developments, a number of which have come from Geoffrey Hinton and his associates, is that some of them are “generative”. That is, while learning to classify images, these networks can be “turned around” or “unfolded” to create images, compress and cluster images, or perform image completion. This has obvious parallels to the human ability to imagine scenes, and the current understanding of the mammalian primary visual cortex that appears to essentially recreate images received at the retina.

A related type of artificial neural network that has had considerable success is the convolutional neural network. Convolution here is just a fancy term for sliding a small patch of network connections across the entire image to find the result at all locations. These networks also typically use many layers of neurons, and have achieved similar success in image recognition. These convolutional networks may model known processes in the visual cortices, such as simple cells that detect edges of certain orientations. Outlines in images are combined into complex sets of features and classified. An earlier learning system, known as the neocognitron, used layers of simple cell-like filters without the convolution.
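
To make the "sliding a small patch" idea concrete, here is a minimal sketch in plain Python. The function name `convolve2d`, the tiny image and the edge-detecting patch are all made up for illustration. Strictly speaking it computes a cross-correlation (the patch is not flipped), which is what neural network libraries usually do under the name convolution.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over every position where it fits inside `image`."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 4 x 4 image that is dark (0) on the left and bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A patch that responds where brightness increases from left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
response = convolve2d(image, kernel)
print(response)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The response is non-zero only in the middle column, which is exactly where the vertical edge sits.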

The process of applying the same edge-detection filter over the whole image is similar to the parallel processing that occurs in the brain. Thousands of neurons functioning simultaneously is obviously different in practice from the sequential computation performed in the hardware of a computer; however, GPUs with many processor cores now allow parallel processing in machines. If, rather than using direction-selective simple cells to detect edges, we use image features (such as a loop in a handwritten digit, or the dark circle representing the wheel of a vehicle), we might say the convolution process is similar to scanning an image with our eyes.

Even when we humans are searching for something hidden in a scene, such as our friend Wally (or Waldo), our attention typically centres on one thing at a time. Scanning large, detailed images for Wally often takes us a long time. A computer trained to find Wally in an image using a convolutional network could methodically scan the image a lot faster than us with current hardware. It mightn’t be hard to get a computer to beat us in this challenge for many Where’s Wally images with biologically-inspired image recognition systems (rather than more common, but brittle, image processing techniques).

Even though I think these advances are great, it seems there are things missing from what we are trying to do with these computer vision systems and how we’re trying to train them. We are still throwing information at these learning systems as the disembodied number-crunching machines they are. Consider, though, how our visual perception abilities allow us to recognise objects in images with little regard for scale, translation, shear, rotation or even colour and illumination; these things are major hurdles for computer vision systems, but for us, they just provide more information about the scene. These are things we learn to do. Most of the focus of computer vision seems to be related to the concept of the “what pathway”, rather than the “how pathway”, of the two-streams hypothesis of vision processing in the brain. Maybe researchers could start looking at ways of making these deep networks take that next step. Extracting information from a scene, such as locating sources of illumination or the motion of objects relative to the camera, might, however, be hard to fit into the current trend of trying to perform unsupervised learning from enormous amounts of unlabelled data.

I think there may be significant advantages to treating the learning system as embodied, and making the real-world property of object permanence something the learning system can latch onto. It’s certainly something that can provide a great deal of leverage in our own learning about objects and how our interactions influence them. It is worth mentioning that machine learning practitioners already commonly create numerous new modified training images from their given set and see measurable improvements. This is similar to what happens when a person or animal is exposed to an object and given the chance to view it from multiple angles and under different lighting conditions. Having a series of contiguous view-points is likely to more easily allow parts of our brain to learn to compensate for different perspectives that scale, shear, rotate and translate the view of objects. It may even be important to learning to predict and recreate different perspectives in our imagination.
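
As a small sketch of what creating modified training images can look like, the plain-Python functions below produce a mirrored and a shifted copy of an image stored as a list of rows. The function names and the choice of transformations are illustrative assumptions; real pipelines typically also rotate, scale and change the lighting.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in image]


def shift_right(image, pixels, fill=0):
    """Translate the image to the right, padding the left edge with `fill`."""
    width = len(image[0])
    return [([fill] * pixels + list(row))[:width] for row in image]


def augment(image):
    """Return the original image plus two modified copies of it."""
    return [image, flip_horizontal(image), shift_right(image, 1)]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
training_images = augment(image)
print(training_images[1])  # [[3, 2, 1], [6, 5, 4]]
print(training_images[2])  # [[0, 1, 2], [0, 4, 5]]
```

Each copy keeps the label of the original, so one labelled image becomes three.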

Self-rewarding autonomous machines

This is the third post in a series about rewards and values. I’ve written about rewards being generated from within the brain, rather than actually being external, and how autonomous robots must be able to calculate their own rewards in a similar way. I’ve also written about a way in which the world is represented for robots – the “state space” – and the benefits of using object-oriented representations. Combining these two things will be the order of this post: a robot that perceives the world as objects and that rewards itself. But how exactly should this be done and what is the use of learning this way?

First, let’s consider another feature of reinforcement learning. I’ve described a state space, but I didn’t go into “state-action values”. The world is divided up into states, and each state has a range of possible actions. The value (state-action value) of any of those actions in that state is the reward we can expect from performing it. Consider the problem of balancing an inverted pendulum: we only get a negative reward (punishment) if the pendulum tips too far to the left or right, or if we reach the limit of movement left or right. If we are in a state of the pendulum tipping to the left, moving in the opposite direction will speed the descent and the punishment. The action with more value is the one that keeps the pendulum balanced. It would be common for the angles and the positions of the base of the inverted pendulum to be partitioned into increments. Smaller partitions give finer control, but also mean that more time needs to be spent exploring and calculating the value of actions in each of those partitions. This is an inverted pendulum version of the “gridworld”.
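
A heavily simplified sketch of state-action values follows. It is not a physical pendulum: the tilt is cut into five partitions, each action just nudges the tilt one partition left or right, and tipping past either end gives a reward of -1. To keep it deterministic, the values are computed by sweeping the standard one-step backup over every state-action pair of this known toy model, rather than by learning from trial and error. All names and numbers are assumptions made for illustration.

```python
TILTS = [-2, -1, 0, 1, 2]   # partitioned tilt: negative = left, positive = right
ACTIONS = [-1, +1]          # nudge the tilt one partition left or right
GAMMA = 0.9                 # discount applied to future rewards


def step(tilt, action):
    """Toy model: return (next_tilt, reward, fallen)."""
    next_tilt = tilt + action
    if abs(next_tilt) > 2:
        return next_tilt, -1.0, True   # tipped too far: punishment
    return next_tilt, 0.0, False


def compute_values(sweeps=20):
    """Back up the value of every state-action pair repeatedly."""
    q = {(s, a): 0.0 for s in TILTS for a in ACTIONS}
    for _ in range(sweeps):
        for s in TILTS:
            for a in ACTIONS:
                nxt, reward, fallen = step(s, a)
                # No future reward after falling; otherwise assume the best
                # available action is taken in the next state.
                future = 0.0 if fallen else max(q[(nxt, b)] for b in ACTIONS)
                q[(s, a)] = reward + GAMMA * future
    return q


q = compute_values()
best_at_right_edge = max(ACTIONS, key=lambda a: q[(2, a)])
print(q[(2, +1)], q[(2, -1)], best_at_right_edge)  # -1.0 0.0 -1
```

With five tilt partitions and two actions there are only ten values to compute; halving the partition width doubles that, which is the cost of finer control mentioned above.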

Now consider a robot that sees the scene in front of it. The robot has inbuilt perceptual algorithms and detects objects in the environment. Each of these detected objects affords a set of complex actions – moving towards, moving away, moving around, inspecting, etc. Again these actions might have values, but how does the robot determine the object-action value, and how does it decide what reward it should receive? Remember that, in an autonomous agent, rewards are internal responses to sensory feedback. Therefore the rewards the autonomous agent receives are those it gives itself, and they must come from its own sensory feedback. The reward could come from visually detecting the object, or from any other sensor information detected from the object. An internal “reward function”, objective, or external instruction determines whether the sensory and perceptual feedback is rewarding.

Now the reward is no longer directly telling the robot what path it should be taking or how it should be controlling its motors; instead, the reward is telling it what it should be seeking or doing at a higher level of abstraction. The act of finding an object may be rewarding; otherwise the reward may come from interacting with the object and detecting its properties. Imagine a robot on Mars, inspecting rocks, but only knowing whether it found the chemicals it was looking for in the rock after probing it with tools. Visually detecting a rock and then detecting the chemicals the robot is looking for associates the reward and value with the visual qualities of the rock, leading the robot to seek more rocks with that appearance. If the objective of the robot is to complete a movement, such as grasping a cup, the inclusion of self-perception allows the robot to monitor and detect its success. With detailed sensory feedback, telling the robot where it is in relation to its target, feedback controllers can be used – typically a much more efficient and generalised process for controlling movements than the random search for a trajectory using gridworld-style reinforcement learning.

So if this is such a great way of applying reinforcement learning in robots, why aren’t we doing it? The simple answer is that our current algorithms and processes for perception and perceptual learning just aren’t good enough to recognise objects robustly. So what good is this whole idea if we can’t visually recognise objects? Planning out how to create a robot with robust autonomy, and that is capable of learning, can point us in a direction to focus our efforts in research and development. Perception is still a major stumbling block in the application of autonomous robotics. Biological clues and recent successes suggest that deep convolutional neural networks might be the way to go, but new faster ways of creating and training them are probably necessary. Multiple sensor modalities and active interaction and learning will likely also be important. Once we have more powerful perceptual abilities in robots they can do more effective monitoring of their own motion and adapt their feedback controllers to produce successful movements. With success being determined by the application of reinforcement learning, the learning cycle can be self-adapting and autonomous.

More can still be said of the development of the reward function that decides what sensory states and percepts are rewarding (pleasurable) and what are punishing (painful). The next post will speculate on the biological evolution of rewards and values – which comes first? – and how this might have relevance to a robot deciding what sensory states it should find rewarding.

State space: Quantities and qualities

This is the second post in a series on rewards and values; the previous post discussed whether rewards are external stimuli or internal brain activity. This post discusses the important issue of representing the world in a computer or robot, and the practice of describing the world as discrete quantities or abstract qualities.

The world we live in can usually be described as being in a particular state, e.g., the sky is cloudy, the door is closed, the car is travelling at 42km/h. To be absolutely precise about the state of the world we need to use quantities and measurements: position, weight, number, volume, and so on. But how often do we people know the precise quantitative state of the world we live in? Animals and people often get by without quantifying the exact conditions of the world around them, instead perceiving the qualities of the world – recognising and categorising things, and making relative judgements about position, weight, speed, etc. But then why are robots often programmed to use quantitative descriptions of the world rather than qualitative descriptions? This is a complex issue that won’t be comprehensively answered in this post, but some differences in computer-based representation with quantities and with qualities will be described.

For robots, the world can be represented as a state space. When dealing with measurable quantities the state space often divides up the world into partitions. A classic example in reinforcement learning is navigating a “gridworld”. In the gridworld, the environment the agent finds itself in is literally a square grid, and the agent can only move in the four compass directions (north, south, east and west). In the computer these actions and states would usually be represented as numbers: state 1, state 2, …, state n, and action 1, action 2, …, action m. The “curse of dimensionality” appears because storing the value of every state-action pair requires a table whose size is the number of states multiplied by the number of actions. If we add another dimension to the environment with another k possible values, our number of states is multiplied by k. A ten by ten grid, with another dimension of 10 values, goes from having 100 states to 1000 states. With four different movements available the agent has 4 actions, so there would be 4000 state-action pairs.
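
The arithmetic above is easy to check with a few lines of Python. The helper name `table_size` is made up for this sketch; it simply multiplies the number of values along each state dimension and then multiplies by the number of actions.

```python
from math import prod


def table_size(dimension_sizes, n_actions):
    """Return (number of states, number of state-action pairs)."""
    n_states = prod(dimension_sizes)
    return n_states, n_states * n_actions


print(table_size([10, 10], 4))      # (100, 400)
print(table_size([10, 10, 10], 4))  # (1000, 4000)
```

Every extra dimension multiplies the table rather than adding to it, which is why the growth gets out of hand so quickly.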

While this highlights one serious problem of representing the world quantitatively, an equally serious problem is deciding how fine our quantity divisions should be. If the agent is a mobile robot driving around a laboratory with only square obstacles, we could probably get by dividing the world up into 50cm x 50cm squares. But if the robot was required to pick something up from a tabletop, it might need to have accuracy down to the centimetre. If it drives around the lab as well as picking things up from tabletops, dividing up the world gets trickier. The grid that makes up the world can’t just describe occupancy; areas of the grid occupied by objects of interest need to be specified as those objects, adding more state dimensions to the representation.

When we people make a choice to do something, like walk to the door, we don’t typically update that choice each time we move 50cm. We collapse all the steps along the way into a single action. Hierarchical reinforcement learning does just this, with algorithms coming under this banner collecting low level actions into high level actions, hierarchically. One popular framework collects actions into “options”, a method of selecting actions (e.g., ‘go north’ 100 times) and evaluating end-conditions (e.g., hit a wall or run out of time) that allow for a reduction in the number of times an agent needs to make a choice (e.g., choose ‘go north’ 100 times) to see how things pan out. This simplifies the process of choosing actions that the agent performs, but it doesn’t simplify the representation of the environment.
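
A minimal sketch of the idea, not the formal options framework itself: one high-level choice ('go north') is expanded into repeated low-level steps until an end-condition is met. The grid size, the function names and the end-condition are assumptions made for this example.

```python
GRID_SIZE = 10  # assumed 10 x 10 gridworld, positions are (x, y)

MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def move(position, action):
    """Take one low-level step, staying inside the grid."""
    dx, dy = MOVES[action]
    x = min(max(position[0] + dx, 0), GRID_SIZE - 1)
    y = min(max(position[1] + dy, 0), GRID_SIZE - 1)
    return (x, y)


def run_option(position, action, should_stop, max_steps=100):
    """Repeat `action` until the end-condition holds or time runs out."""
    steps = 0
    while steps < max_steps and not should_stop(position):
        position = move(position, action)
        steps += 1
    return position, steps


def at_north_wall(position):
    """End-condition: the agent has reached the northern edge of the grid."""
    return position[1] == GRID_SIZE - 1


# One choice by the agent, seven low-level steps carried out on its behalf.
end, steps_taken = run_option((3, 2), "north", at_north_wall)
print(end, steps_taken)  # (3, 9) 7
```

Note that the grid itself is unchanged: the agent chooses less often, but the world is still represented square by square.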

When we look around we see the objects – right now you are looking at some sort of computer screen – we also see objects that make up any room we’re in: the door, the floor, the walls, tables and chairs. In our minds, we would seem to represent the world around us as a combined visual and spatial collection of objects. Describing the things in the world as the objects they are in the minds of people allows our “unit of representation” to be any size, and can dramatically simplify the way the world is described. And that is what is happening in more recent developments in machine learning, specifically with relational reinforcement learning and object-oriented reinforcement learning.

In relational reinforcement learning, things in the world are described by their relationship to other things. For example, the coffee cup is on the table and the coffee is in the coffee cup. These relations can usually be described using simple logic statements. Similar to relational abstraction of the world, object-oriented reinforcement learning allows objects to have properties and have associated actions, much like classes in object-oriented programming. Given that object-oriented programming was designed partly because it was related to how we people describe the world, viewing the world as objects has a lot of conceptual benefits. The agent considers the state of objects and learns the effects of actions with those objects. In the case of a robot, we reduce the problem of having large non-meaningful state spaces, but then run into the challenge of recognising objects – a serious hurdle in the world of robotics that isn’t yet solved.
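
To give a feel for the difference from a grid of numbered states, here is a loose sketch of the coffee cup example. The class `WorldObject`, its fields, the property values and the `related` helper are all invented for illustration and are not taken from any particular relational or object-oriented reinforcement learning system.

```python
from dataclasses import dataclass, field


@dataclass
class WorldObject:
    """An object with properties and the actions it affords."""
    name: str
    properties: dict = field(default_factory=dict)
    actions: tuple = ()


cup = WorldObject("coffee cup", {"full": True}, ("grasp", "lift", "drink from"))
table = WorldObject("table", {"height_cm": 75}, ("approach", "move around"))

# Relations written as simple logic-like statements: (relation, subject, object).
relations = {
    ("on", "coffee cup", "table"),
    ("in", "coffee", "coffee cup"),
}


def related(relation, target):
    """Return the names of everything standing in `relation` to `target`."""
    return sorted(subject for rel, subject, obj in relations
                  if rel == relation and obj == target)


print(related("on", "table"))       # ['coffee cup']
print(related("in", "coffee cup"))  # ['coffee']
```

The state here is a handful of objects and statements, whatever the size of the room, rather than one entry per grid square.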

A historical reason why quantitative divisions of state space were used in the first place is that some problems, such as balancing a broom or gathering momentum to get up a slope, were designed to use as little prior information and as little sensory feedback as possible. The challenge became how to get a system to efficiently learn to solve these problems when having to blindly search for a reward or avoid a punishment. Generally speaking, many of the tasks requiring the discrete division of a continuous range are ones that involve some sort of motor control. These are the same sort of tasks that people perform using vision and touch to provide much more detailed feedback than plain success or failure. They are also tasks in which we couldn’t feel our success or failure unless we could sense what was happening and had hard-wired responses or some goal in mind (or had someone watching to give feedback). This might mean that reinforcement learning is really the wrong tool for learning low-level motor control, unless, that is, we don’t care to give our robots eyes.

This leads me to the topic of the next post in this series on rewards and values: “Self-rewarding autonomous machines”. I’ll discuss how a completely autonomous machine will need the perceptual capability to detect “good” and “bad” events and reward itself. I’ll also discuss how viewing the world as “objects” that drive actions will lead to a natural analogy with how animals and people function in the world.

Rewards and values: Introduction

Reward functions are a fundamental part of reinforcement learning for machines. Reinforcement learning is based partly on Pavlovian, or classical, conditioning, exemplified by repeatedly pairing the ringing of a bell (conditioned stimulus) with the presentation of food (unconditioned stimulus) to a dog, until the ringing of the bell alone causes the dog to salivate (conditioned response).

More recently, developments in reinforcement learning, particularly temporal difference learning, have been compared to the function of reward learning parts of the brain. Pathologies of these reward producing parts of the brain, particularly Parkinson’s disease and Huntington’s disease, show the importance of the reward neurotransmitter dopamine in brain functions for controlling movement and impulses, as well as seeking pleasure.

The purpose and function of these reward centres in the basal ganglia of the brain could have important implications for the way in which we apply reinforcement learning, especially in autonomous agents and robots. An understanding of the purpose of rewards, and their impact on the development of values in machines and people, also has some interesting philosophical implications that will be discussed.

This post introduces what may become a spiral of related posts on concepts of rewards and values covering:

Hopefully this narrowing of post topics gives me focus to write, and results in some interesting discourse on each of the themes of this blog. Suggestions and comments are welcome!