Learning Algorithms for People: Reinforcement Learning

In a previous post I described some simple ways in which learning could be performed algorithmically, replicating the basic process of supervised learning in machines. Of course it’s no secret that repetition is important for learning many things, but the exact way in which we repeat a training set or trial depends on what it is we are trying to learn or teach. I have written a series of posts on values and reinforcement learning in machines and animals, so here I will describe a process for applying reinforcement learning to developing learning strategies. Perhaps more importantly, I will discuss a significant notion in machine learning and its relationship to the psychology of conditioning – introducing the value function.

Let’s start with some pseudo-code for a human reinforcement learning algorithm that might be representative of certain styles of learning:

given learning topic, S, a set of assessments, T, and study plan, Q
    for each assessment, t in T
        study learning topic, S, using study plan, Q
        answer test questions in t
        record grade feedback, r
        from feedback, r, update study plan, Q 

This algorithm is vague on the details, but the approach of updating a study plan fits the common system of education: one where people are given material to learn and receive grades as feedback on their responses to assignments and tests.
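To make the pseudo-code concrete, here is a toy, runnable sketch. The topics, the hidden assessment weights and the grading scheme are all invented for illustration; only the update-the-plan-from-feedback loop mirrors the algorithm above.

```python
import random

random.seed(0)

TOPICS = ["algebra", "geometry", "calculus"]
# Hidden "true" emphasis each topic receives in the assessments
# (invented for this example; a real learner never sees these).
TRUE_WEIGHT = {"algebra": 0.5, "geometry": 0.2, "calculus": 0.3}

def normalise(plan):
    """Scale study-effort fractions so they sum to one."""
    total = sum(plan.values())
    return {t: w / total for t, w in plan.items()}

def grade(plan):
    """Grade feedback, r: higher the closer the study plan matches
    what the assessments actually emphasise."""
    return 1.0 - sum(abs(plan[t] - TRUE_WEIGHT[t]) for t in TOPICS) / 2.0

plan = normalise({t: 1.0 for t in TOPICS})  # initial study plan, Q
best = grade(plan)

for assessment in range(200):                  # for each assessment, t in T
    candidate = dict(plan)
    candidate[random.choice(TOPICS)] += 0.1    # try shifting study effort
    candidate = normalise(candidate)
    r = grade(candidate)                       # record grade feedback, r
    if r > best:                               # from feedback, r, update Q
        plan, best = candidate, r
```

The update rule here is simple hill-climbing: keep any change to the study plan that improved the grade, discard the rest.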

Let’s come back to the basic definitions of computer-based reinforcement learning. The typical components are the state-space, S, which describes the environment; the action-space, A, which contains the options for attempting to transition between states; and the value function, Q, which is used to pick an action in any given state. The reinforcement feedback – reward and punishment – can be treated as coming from the environment, and is used to update the value function.
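These components can be sketched in a few lines of standard tabular Q-learning. The environment here – a five-state corridor with a reward at one end – is invented purely to make the pieces concrete:

```python
import random

random.seed(1)

N_STATES = 5        # state-space S: positions 0..4; the rewarded state is 4
ACTIONS = [-1, +1]  # action-space A: move left, move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

# Value function Q: one value per (state, action) pair.
Q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}

def step(state, action):
    """Environment dynamics: move along the corridor, reward at the end."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # The value function picks an action in the current state
        # (epsilon-greedy: occasionally explore at random).
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[s][act])
        s2, r = step(s, a)
        # Reinforcement feedback from the environment updates Q.
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2].values()) - Q[s][a])
        s = s2
```

After training, the greedy action in every non-goal state is “move right”, i.e. the value function encodes the path to reward.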

The algorithm above doesn’t easily fit into this structure. Nevertheless, we could consider the combination of the learning topic, S, and the grade-system as the environment. Each assessment, t, is a trial of the study plan, Q, with grades, r, providing an evaluation of the effectiveness of study. The study plan is closely related to the value function — it directs the choices of how to traverse the state-space (learning topic).

This isn’t a perfect analogy, but it leads us to the point of reinforcement feedback: to adjust what is perceived as valuable. We could try to use a reinforcement learning algorithm whenever we are trying to search for the best solution for a routine or a skill, and all we receive as feedback is a success-failure measurement.
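When the only feedback is a success-failure measurement, the setting resembles a multi-armed bandit. A minimal sketch, with hidden success probabilities invented for illustration:

```python
import random

random.seed(2)

# Three candidate routines; the learner only ever observes success (1)
# or failure (0), never these underlying probabilities.
P_SUCCESS = [0.2, 0.5, 0.8]
counts = [0, 0, 0]
values = [0.0, 0.0, 0.0]      # estimated value of each routine

for trial in range(2000):
    if random.random() < 0.1:                         # explore
        i = random.randrange(3)
    else:                                             # exploit current best
        i = max(range(3), key=lambda j: values[j])
    r = 1.0 if random.random() < P_SUCCESS[i] else 0.0
    counts[i] += 1
    values[i] += (r - values[i]) / counts[i]          # incremental mean
```

The running success-rate estimates play the role of a value function, and the routine with the highest estimate ends up being chosen most often.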

Coming back to the example algorithm, though, treating grades as the only reinforcement feedback in education is a terrible over-simplification. For example, consider the case of a school in a low socio-economic area where getting a good grade will actually get you punished by your peers. Or consider the case of a child that is given great praise for being “smart”. In a related situation, consider the case of praising a young girl for “looking pretty”. How is the perception of value, particularly self-worth, affected by this praise?

Children, and people in general, feel that the acceptance and approval of their peers is a reward. Praise is a form of approval, and criticism is a form of punishment, and each is informative of what should be valued. If children are punished for looking smart, they will probably value the acceptance of their peers over learning. If children are praised for being smart, they may end up simply avoiding anything that makes them look unintelligent. If children are praised for looking pretty, they may end up valuing looking pretty over being interesting and “good” people.

A solution could be to try to be more discerning about what we praise and criticise. The article linked above makes a good point about praising children for “working hard” rather than “being smart”. Children who feel that their effort is valued are more likely to try hard, even in the face of failure. Children who feel that praise will only come when they are successful will try to avoid failure. Trying to give children self-esteem by praising them for being smart or pretty in fact makes their self-esteem contingent on the very thing they are being praised for.

It may seem manipulative, but praise and criticism are ways we can reward and punish people. Most people will adjust their “value-function”, their perception of what is valuable, and, as a result, they will adjust their actions to try to attain further praise, or avoid further punishment. What we praise and criticise ourselves for is a reflection of what we value in ourselves. And our self-praise and self-criticism can also be used to influence our values and self-esteem, and hence our actions.

Evolving rewards and values

This is the fourth post in a series on rewards and values. Previous posts have discussed how rewards are internally generated in the brain and how machines might be able to learn autonomously in the same way. But a problem still remains: how is the “reward function” created? In the biological case, evolution is a likely candidate. However, the increasing complexity of behaviour seen in humans seems to have led to an interesting flexibility in what is perceived as reward and punishment. In robots, a similar flexibility might be necessary.

Note: here the phrase “reward function” is used to describe the process of taking some input (e.g. the perceived environment) and calculating “reward” as output. The phrase “value function” is used analogously.

Let’s start with the question posed in the previous post: what comes first – values or rewards? The answer might be different depending on whether we are talking about machine or biological reinforcement learning. A robot or a simulated agent will usually be given a reward function by a designer. The agent will explore the environment and receive rewards and punishments, and it will learn a “value function”. So we could say that, to the agent, the rewards precede the values. At the very least, the rewards precede the learning of values. But the designer knew what the robot should be rewarded for – knew what result the agent should value. The designers had some valuable state in mind when they designed the reward function. To the designer, the value informs the reward.

How about rewards in animals and humans? The reward centres of the brain are not designed – they are evolved. What we individually value and desire as animals and humans is largely determined by what we find pleasurable and what we do not, and this is translated into learned drives to perform certain actions. Genetic recombination and mutation, key components of evolution, produce different animal anatomies (including digestive systems) and different pleasure responses to the environment. Animals that find pleasure in eating food that is readily available and compatible with their digestive systems will have a much greater chance of survival than animals that only find pleasure in eating things that are rare or poisonous. Through natural selection, pleasure could be expected to converge to what is valuable to the animal.

In answer to the question – what comes first, rewards or values? – it would seem that value comes first. Of course this definition of “value” relates to the objective fact of what the animal or agent must do to achieve its goals of survival or some given purpose. But what of humans? Evolutionary psychology and evolutionary neuroscience make a reasonable case that, along with brain size and structure, many human behaviours and underlying neural processes have developed through natural selection. While such hypotheses are difficult to test, people seem to have evolved to feel pleasure from socialising – driving us to make social bonds and form groups – and to have evolved feelings of social discomfort: displeasure from embarrassment and rejection. Although the circumstances that caused the selection of social behaviours aren’t clear, many of our pleasure and displeasure responses can be rationalised in terms of evolution.

An interesting aspect of the human pleasure response is the pleasure we take in achievement. It is certainly normal for an Olympic gold medallist to feel elation at winning, but even small or common victories, such as a child’s first unaided steps or managing to catch a ball, can elicit varying amounts of pleasure and satisfaction. Is this pleasure due to the adulation and praise of onlookers, which we have been wired to enjoy? Or is there a more fundamental case of success at any self-determined goal causing pleasure? This could be related to the loss of pleasure and enjoyment often associated with Parkinson’s disease: areas of the brain related to inhibiting and coordinating movement, which deteriorate as part of the disease, are also strongly associated with reward and pleasure generation.

And we can bring this back to an autonomous robot that generates its own reward: a robot that has multiple purposes will need different ways of valuing the objects and states of the environment depending on its current goal. When crossing the road the robot needs to avoid cars; when cleaning a car the robot might even need to enter one. This flexibility in determining the goal – with reward feedback that determines, on one level, whether the goal has been reached and, on another, whether the goal was as good as it was thought to be – could be an important process in the development of “intelligent” robots.
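One way to read that flexibility is as a reward function parameterised by the current goal. In this sketch the goals, observations and reward values are all invented for illustration; it only schematically shows how the same observation can be valued differently under different goals:

```python
def reward(goal, observation):
    """Goal-dependent reward: success at the current goal, plus a
    shaping term for goal-relevant features of the observation."""
    if goal == "cross_road":
        # Reaching the far kerb is success; being near a car is bad.
        r = 1.0 if observation["at_far_kerb"] else 0.0
        r -= 0.5 if observation["near_car"] else 0.0
    elif goal == "clean_car":
        # Here being near (even inside) the car is necessary, not bad.
        r = 1.0 if observation["car_clean"] else 0.0
        r += 0.1 if observation["near_car"] else 0.0
    else:
        r = 0.0
    return r

# The same observation is punished under one goal, rewarded under another.
obs = {"at_far_kerb": False, "near_car": True, "car_clean": False}
```

Standing next to a car is punished while crossing the road but rewarded while cleaning the car, which is the kind of goal-dependent valuation described above.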

However, before I conclude, let’s consider one fallout of the planetary dominance of humans – our selection pressures have nearly disappeared. We are, after all, the current dominant species on this planet. Perception-reward centres that did not evolve to deal with newly discovered and manufactured stimuli are unlikely to be strongly selected against. And through our ingenuity we have found powerful ways to “game” our evolved pleasure centres – finding and manufacturing super-normal stimuli.

Dan Dennett: Cute, sweet, sexy, funny (TED talk video available on YouTube).

The video linked above features Dan Dennett describing how evolution has influenced our feelings of what is “cute, sweet, sexy and funny”. The result is possibly the opposite of what we intuitively feel and think: there is nothing inherently cute, sweet, sexy or funny; these sensations and feelings evolved in us to find value in our surroundings and each other. We have evolved to find babies and young animals cute, to find food that is high in energy tasty, and to find healthy members of the opposite sex attractive. Funny wasn’t explained as clearly – Dan Dennett described a hypothesis that humour is related to making boring or unpleasant jobs bearable. I would speculate that humour might also have been selected for making people more socially attractive, or at the very least more bearable. 🙂

Building on this understanding of pleasure and other feelings being evolved, the topic of the next post in this series will be super-normal stimuli and how they influence our views on human values and ethics. Let’s begin the adventure into the moral minefield.