Learning Algorithms for People: Reinforcement Learning

In a previous post I described some simple ways in which learning could be performed algorithmically, replicating the basic process of supervised learning in machines. Of course it’s no secret that repetition is important for learning many things, but the exact way in which we repeat a training set or trial depends on what it is we are trying to learn or teach. I wrote a series of posts on values and reinforcement learning in machines and animals, so here I will describe a process for applying reinforcement learning to developing learning strategies. Perhaps more importantly, I will discuss a significant notion in machine learning and its relationship to the psychological results of conditioning: the value function.

Let’s start with some pseudo-code for a human reinforcement learning algorithm that might be representative of certain styles of learning:

given learning topic, S, a set of assessments, T, and study plan, Q
    for each assessment, t in T
        study learning topic, S, using study plan, Q
        answer test questions in t
        record grade feedback, r
        from feedback, r, update study plan, Q 

This algorithm is vague on the details, but the approach of updating a study plan fits into the common system of education: one where people are given material to learn and receive grades as feedback on their responses to assignments and tests.

Let’s come back to the basic definitions of computer-based reinforcement learning. The typical components are the state-space, S, which describes the environment; the action-space, A, which gives the options for attempting to transition to different states; and the value function, Q, which is used to pick an action in any given state. The reinforcement feedback, reward and punishment, can be treated as coming from the environment, and is used to update the value function.
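To make those components concrete, here is a minimal sketch of tabular Q-learning, one common way the pieces fit together. The env object (with reset(), step() and actions() methods) and the particular parameter values are assumptions for illustration, not a fixed recipe.

import random
from collections import defaultdict

def q_learning(env, episodes=100, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Value function Q: maps (state, action) pairs to an estimated value.
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()                      # start somewhere in the state-space S
        done = False
        while not done:
            # Choose from the action-space A: usually the best-valued action, occasionally explore.
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)   # reinforcement feedback from the environment
            # Update the value function towards the reward plus the discounted value of the next state.
            future = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * future - Q[(state, action)])
            state = next_state
    return Q

The important part is the update line: the reward nudges the value of the state-action pair that led to it, which in turn changes which actions look attractive later.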

The algorithm above doesn’t easily fit into this structure. Nevertheless, we could consider the combination of the learning topic, S, and the grade-system as the environment. Each assessment, t, is a trial of the study plan, Q, with grades, r, providing an evaluation of the effectiveness of study. The study plan is closely related to the value function — it directs the choices of how to traverse the state-space (learning topic).

This isn’t a perfect analogy, but it leads us to the point of reinforcement feedback: to adjust what is perceived as valuable. We could try to use a reinforcement learning algorithm whenever we are trying to search for the best solution for a routine or a skill, and all we receive as feedback is a success-failure measurement.

Coming back to the example algorithm, though, considering grades as the only reinforcement feedback in education is a terrible over-simplification. For example, consider the case of a school in a low socio-economic area where getting a good grade will actually get you punished by your peers. Or consider the case of a child that is given great praise for being “smart”. In a related situation, consider the case of praising a young girl for “looking pretty”. How is the perception of value, particularly self-worth, affected by this praise?

Children, and people in general, feel that the acceptance and approval of their peers is a reward. Praise is a form of approval, and criticism is a form of punishment, and each is informative of what should be valued. If children are punished for looking smart, they will probably value the acceptance of their peers over learning. If children are praised for being smart, they may end up simply avoiding anything that makes them look unintelligent. If children are praised for looking pretty, they may end up valuing looking pretty over being interesting and “good” people.

A solution could be to try to be more discerning about what we praise and criticise. The article linked above makes a good point about praising children for “working hard” rather than “being smart”. Children who feel that their effort is valued are more likely to try hard, even in the face of failure. Children who feel that praise will only come when they are successful will try to avoid failure. Trying to give children self-esteem by praising them for being smart or pretty is, in fact, making their self-esteem contingent on the very thing they are being praised for.

It may seem manipulative, but praise and criticism are ways we can reward and punish people. Most people will adjust their “value-function”, their perception of what is valuable, and, as a result, they will adjust their actions to try to attain further praise or avoid further punishment. What we praise and criticise ourselves for is a reflection of what we value in ourselves. And our self-praise and self-criticism can also be used to influence our values and self-esteem, and hence our actions.

Learning algorithms for people: Supervised learning

Access to education is widely considered a human right, and, as such, many people spend years at school learning. Many of these people also spend a lot of time practising sport, musical instruments and other hobbies and skills. But how exactly do people go about trying to learn? In machine learning, algorithms are clearly defined procedures for learning. Strangely, though the human brain is a machine of sorts, we don’t really consider experimenting with “algorithms” for our own learning. Perhaps we should.

Machine learning is typically divided into three paradigms: supervised learning, reinforcement learning, and unsupervised learning. These roughly translate into “learning with detailed feedback”, “learning with rewards and punishments” and “learning without any feedback” respectively. These types of learning have some close relationships to the learning that people and animals already do.

Many people already do supervised learning, although probably much more haphazardly than a machine algorithm might dictate. Supervised learning is a good fit when the answers are available. So when practising for a quiz, or practising a motor skill, we make attempts and then try to adjust based on the errors we observe. A basic algorithm for people to perform supervised learning to memorise discrete facts could be written as:

given quiz questions, Q, correct answers, A, and stopping criteria, S
    do
        for each quiz question q in Q
            record predicted answer p
        for each predicted answer p
            compare p with correct answer, a
            record error, e
        review errors, e, and adjust recall
    while stopping criteria, S, are not met

Anyone could use this procedure for rote memorisation of facts, using a certain percentage of correct answers and a set time as the stopping criteria. However, this algorithm supposes the existence of questions associated with the facts to memorise. Memorisation can be difficult without a context to prompt recall, and questions also help link these facts together, much like recall is often better when knowledge is presented visually, aurally and in tactile formats. The machine learning equivalent would be adding extra input dimensions to associate with the output. Supervised learning also makes sense for motor skills; it is roughly what many people already do when practising skills for sports or musical instruments.
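As a rough Python version of the memorisation algorithm above: the quiz is a dictionary of question-answer pairs, ask() stands in for however the learner produces an answer, and the stopping criteria are a target accuracy and a maximum number of passes. All of these are assumptions for illustration, not a prescription.

def drill(quiz, ask, target_accuracy=0.9, max_passes=10):
    # quiz: dict mapping question -> correct answer
    # ask:  function taking a question and returning a predicted answer
    for passes in range(1, max_passes + 1):
        errors = []
        for question, correct in quiz.items():
            predicted = ask(question)                # record predicted answer
            if predicted != correct:                 # compare with the correct answer
                errors.append((question, predicted, correct))
        accuracy = 1 - len(errors) / len(quiz)
        for question, predicted, correct in errors:  # review errors before the next pass
            print(f"{question}: answered {predicted!r}, correct answer is {correct!r}")
        if accuracy >= target_accuracy:              # stopping criteria met
            break
    return accuracy, passes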

It makes sense to use slightly different procedures for practising motor skills compared to doing quizzes. In addition to getting the desired outcome, gaining proficiency also requires practising the technique of the skill. Good outcomes can often be achieved with poor technique, and poor outcomes might occur with good technique. But to attain a high proficiency, technique is very important. To learn a skill well, it is necessary to pay attention not only to errors in the outcome, but also to errors in the technique. For this reason, it is good to first spend time focusing practice on the technique. Once the technique is correct, focus can then be more effectively directed toward achieving the desired outcome.

given correct skill technique, T, and stopping criteria, S
    do
        attempt skill
        compare attempt technique to correct technique, T
        note required adjustments to technique
    while stopping criteria, S, are not met

given desired skill outcome, O, and stopping criteria, S
    do
        attempt skill
        compare attempt outcome to desired outcome, O
        note required adjustments to skill
    while stopping criteria, S, are not met
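Combined as a rough Python sketch, with fixed repetition counts standing in for the stopping criteria, and attempt(), evaluate_technique() and evaluate_outcome() as purely illustrative placeholders:

def practise_skill(attempt, evaluate_technique, evaluate_outcome,
                   technique_reps=20, outcome_reps=20):
    notes = []
    # Phase 1: focus practice on technique.
    for _ in range(technique_reps):
        result = attempt()
        notes.append(evaluate_technique(result))   # note required adjustments to technique
    # Phase 2: once technique is in place, focus on the desired outcome.
    for _ in range(outcome_reps):
        result = attempt()
        notes.append(evaluate_outcome(result))     # note required adjustments to the skill
    return notes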

These basic, general algorithms spell out something many people already do: learn through repeated cycles of attempts, evaluation and adjustment. It’s possible to continue describing current methods of teaching and learning as algorithms. It’s also possible to search for optimal learning processes, characterising the learning algorithms we use, and the structure of education, to discover what is most effective. It may be that different people learn more effectively using different algorithms, or that some people could benefit from practising these algorithms to get better at learning. In future posts, I will try to write further about learning topics and skills, applications of the different paradigms of learning, and algorithms describing systems of education.

Writing this blog in style (or not)

It’s been a long time since I’ve posted on this blog. I’ve written numerous lengthy drafts that haven’t seen (and may never see) the light of day. This has prompted me to consider what I want to achieve with this blog, and how I might change the style and content of my writing. And that’s what I’ll write about here: the style that my posts might start to take moving forward, and the rationale for how and why that might work better.

Going back to the beginning, I wanted a platform to share my ideas and opinions on a broad range of topics. A blog seemed to be a good alternative to more formal publishing, and the pseudo-anonymity the internet still affords means that expressing some of my more unusual ideas might be less likely to come back and haunt me. However, incorporating recreational writing into everyday life, when a lot of other things need doing, is harder to manage than I’d hoped. I’m sure many bloggers know this feeling.

The posts I’ve written in the past have been of a moderate length, and have been reasonably planned out rather than spontaneous. I’ve been aiming for word limits, but under the current circumstances, time limits might be more effective. This would probably push me to write in a different format and style. Most ideas and topics can, and probably should, be broken down into smaller elements. This was done for the first series of blog posts I wrote on values, but my posts could probably be broken down even further. Long and winding posts might be useful for drawing links, painting larger stories with vivid similes and poetic phrases, but they are less concise and sometimes lose clarity.

Part of the beauty of blogs is the incremental nature in which posts can be made that build a story or a concept. Leveraging the blog as a central store of writing, posts can be very short and reference previous or future posts rather than cover the material all in one place. And that is the aim for the immediate future: to write shorter posts, not necessarily self-contained, but focused on a single idea, concept or topic. I’ll ignore the gaps that will almost certainly exist at the beginning, but hopefully identify them, point to references, or deal with them in future posts or comments as they become apparent.

This is almost a necessity when I hope to be dealing with topics that cross philosophy, ethics, psychology, cognitive science, neuroscience, technology, artificial intelligence and robotics. The boundaries blur and merge, as the horizons lie within each others’ borders. I hope that, if I still have anyone reading, people will challenge these ideas, point out when I’ve made unfounded assumptions and prompt me to find references and evidence.

I’m always open to revising my point of view in light of sufficiently compelling evidence and arguments. I’ll see what I can do to write arguments that are thought-provoking and compelling too.

Rewards: External or internal?

This is the first post in a series on rewards and values.

The most familiar reward is probably food. We often use treats to train animals, and eating is pleasurable for most people. These rewards are clearly an external thing, aren’t they? This idea is, in some ways, echoed in machine reinforcement learning, as shown in a diagram (pictured below) from the introductory book by Richard Sutton and Andrew Barto. Intuitively this makes sense. We get something from the environment that is pleasurable; the reward feels as though its origin is external. But in the case of animals and people we can trace reward and pleasure to internal brain locations and processes. And machines can potentially benefit from this reworking of reinforcement learning, to make explicit that the reward comes from within the agent.

Agent-environment interaction in reinforcement learning

Figure 3.1 from Sutton and Barto, 1998, Reinforcement Learning: An Introduction, MIT Press. Online: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node28.html

So let’s trace the sensations of a food “reward”. The animal smells and tastes the food; the olfactory and gustatory receptors transmit signals to the brain, which then identifies the odour and taste. A process within the brain decides whether the food was pleasurable or unpleasant. This response is learned and creates impulses to seek or avoid the food in future.

Nothing in the food is inherently rewarding. It is the brain that processes the sensations of the food and the brain that produces reward chemicals. For a more detailed article on pleasure and reward in the brain see Berridge and Kringelbach (2008). Choosing the right food when training animals is a process of finding something that their brain responds to as a reward. Once a good treat has been found, the animal knows what it wants, and training it is the process of teaching the animal what to do to get the rewarding treat.

Agent-environment interaction with internal reward.

Modified Figure 3.1 from Sutton and Barto, 1998, Reinforcement Learning: An Introduction, MIT Press. Online: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/node28.html

We can consider standard implementations of reinforcement learning in machines as a similar process: the machine searches the environment (or “state-space”) and, if it performs the right actions to get to the right state, it gets a reward. The differences are notable: the agent might not know anything about the environment, how actions move it from one state to another, or which state gives the reward. Animals, on the other hand, come with some knowledge of the environment and themselves, have some sense of causality and sequences of events, and very quickly recognise treats that cause reward.

Another subtle difference is that the machine doesn’t usually know what the target or objective is; the agent performs a blind search. Reinforcement learning works by simulating the agent exploring some (usually simplified) environment until it finds a reward, and then calculating increases in the value of states and actions that preceded the reward. Computers can crunch the numbers in simulation, but complexity of the environment and large numbers of available actions are the enemy. Each extra state “dimension” and action multiplies the amount of required computation, so the total grows exponentially (see the “curse of dimensionality”). This sounds different from animals, which form very simple associations with the objects or actions that are the targets of rewards. More on this later!
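To put rough numbers on that growth, here is a tiny illustration; the ten discretisation levels per state dimension and five actions are arbitrary assumptions:

# Size of a tabular value function grows exponentially with the number of state dimensions.
levels_per_dimension = 10
num_actions = 5
for dims in range(1, 7):
    entries = (levels_per_dimension ** dims) * num_actions
    print(f"{dims} state dimension(s) -> {entries} value-table entries")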

An extension of the machine reinforcement learning problem is the case where the agent doesn’t know what environment state it is in. Rather than being given the environment state, the agent only makes “observations”; this model is known as a “partially observable Markov decision process”, or POMDP. From these observations the agent can infer the state and predict the action that should be taken, but typically with reduced certainty. Nevertheless, the rewards it receives are still a function of the true state and action. The agent is not generating rewards from its observations, but receiving them from some genie (the trainer or experimenter) that knows the state and gives it the reward. This disconnect between what the agent actually senses (the observations) and the rewards it receives is relevant for autonomous agents, including robots.

These implementations of reinforcement learning mimic the training of an animal with treats, where the whole animal is the agent and the trainer is part of the environment that gives rewards. But it doesn’t seem a good model of reward originating in internal brain processes. Without sensing the food, the brain wouldn’t know that it had just been rewarded; it could be argued that the brain (and hence the agent) wasn’t rewarded. How much uncertainty in sensations can there be before the brain doesn’t recognise that it has been rewarded? In a computer, where the environment and the agent are all simulated, the distinction between reward coming from the environment or being self-generated in the agent may not matter. But an autonomous robot, where no trainer is giving it rewards, must sense the environment and decide only from its own observations whether it should be rewarded.
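One way to sketch the difference in code: in the standard setup the environment returns the reward, while an internally rewarded agent has to compute something reward-like from its own observations. The agent and environment methods below, particularly observation_to_reward(), are hypothetical placeholders rather than an established API.

# Standard setup: the reward comes back from the environment (the "genie").
def external_reward_step(env, agent, state):
    action = agent.act(state)
    next_state, reward, done = env.step(action)    # reward is a function of the true state
    agent.update(state, action, reward, next_state)
    return next_state, done

# Internal-reward setup: the agent only receives an observation and must
# generate its own reward signal from what it senses.
def internal_reward_step(env, agent, observation):
    action = agent.act(observation)
    next_observation, done = env.step(action)      # no reward is handed back
    reward = agent.observation_to_reward(next_observation)   # hypothetical internal appraisal
    agent.update(observation, action, reward, next_observation)
    return next_observation, done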

The implementation of reinforcement learning for autonomous agents and robots will be a topic of a later post. In the next post, however, I will cover the problem of machines “observing” the world. How do we represent the world as “states” and a robot’s capabilities as “actions”? I will discuss how animals appear to solve the problem, and recent advances in reinforcement learning.

Rewards and values: Introduction

Reward functions are a fundamental part of reinforcement learning for machines. They are based partly on Pavlovian, or classical, conditioning, exemplified by repeatedly pairing the ringing of a bell (conditioned stimulus) with the presentation of food (unconditioned stimulus) to a dog, until the ringing of the bell alone causes the dog to salivate (conditioned response).

More recently, developments in reinforcement learning, particularly temporal difference learning, have been compared to the function of the reward-learning parts of the brain. Pathologies of these reward-producing parts of the brain, particularly Parkinson’s disease and Huntington’s disease, show the importance of the reward neurotransmitter dopamine in brain functions for controlling movement and impulses, as well as seeking pleasure.
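For readers who haven’t met temporal difference learning, the comparison with dopamine rests on the TD error: the difference between what was received (the reward plus the discounted value of the next state) and what was predicted. A minimal TD(0) sketch, with made-up state names and parameters, might look like this:

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    # One temporal-difference update of a state-value table V.
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # prediction error (the dopamine-like signal)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

# Example: a cue (like the bell) followed by food with reward 1.
V = {}
for _ in range(50):
    td0_update(V, "cue", 0.0, "food")
    td0_update(V, "food", 1.0, "end")

After enough repetitions the “cue” state acquires value of its own, much as the bell comes to predict the food.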

The purpose and function of these reward centres in the basal ganglia of the brain could have important implications for the way in which we apply reinforcement learning, especially in autonomous agents and robots. An understanding of the purpose of rewards, and their impact on the development of values in machines and people, also has some interesting philosophical implications that will be discussed in later posts.

This post introduces what may become a spiral of related posts on concepts of rewards and values covering:

Hopefully this narrowing of post topics will give me the focus to write, and lead to some interesting discourse on each of the themes of this blog. Suggestions and comments are welcome!