Learning Algorithms for People: Reinforcement Learning

In a previous post I described some simple ways in which learning could be performed algorithmically, replicating the basic process of supervised learning in machines. Of course it’s no secret that repetition is important for learning many things, but the exact way in which we repeat a training set or trial depends on what it is we are trying to learn or teach. I wrote a series of posts on values and reinforcement learning in machines and animals, so here I will describe a process for applying reinforcement learning to developing learning strategies. Perhaps more importantly, I will discuss a significant notion in machine learning and its relationship to psychological findings on conditioning: the value function.

Let’s start with some pseudo-code for a human reinforcement learning algorithm that might be representative of certain styles of learning:

given learning topic, S, a set of assessments, T, and study plan, Q
    for each assessment, t in T
        study learning topic, S, using study plan, Q
        answer test questions in t
        record grade feedback, r
        from feedback, r, update study plan, Q 

This algorithm is vague on the details, but the approach of updating a study plan fits the common system of education: one where people are given material to learn and receive grades as feedback on their responses to assignments and tests.
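To make the loop concrete, here is a minimal Python sketch of the same algorithm. The helper functions (study, take_assessment, update_plan) are hypothetical stand-ins for activities a person would perform, not any real library:

def reinforcement_study(topic, assessments, plan,
                        study, take_assessment, update_plan):
    """Run one study/assess/update cycle per assessment, as in the pseudo-code."""
    grades = []
    for assessment in assessments:
        study(topic, plan)                   # study learning topic, S, using study plan, Q
        grade = take_assessment(assessment)  # answer the questions in t, record feedback, r
        grades.append(grade)
        plan = update_plan(plan, grade)      # from feedback, r, update study plan, Q
    return plan, grades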

Let’s come back to the basic definitions of computer-based reinforcement learning. The typical components are: the state-space, S, which describes the environment; the action-space, A, which contains the options for attempting to transition to different states; and the value function, Q, which is used to pick an action in any given state. The reinforcement feedback, reward and punishment, can be treated as coming from the environment, and is used to update the value function.
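For concreteness, here is a minimal tabular Q-learning sketch built from these components. The environment interface (env.reset() and env.step(action) returning the next state, a reward, and a done flag) is an assumed Gym-style convention, not something prescribed by this post:

import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # value function Q(s, a), zero-initialised
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: mostly pick the action the value function rates highest
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # reinforcement feedback from the environment updates the value function
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q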

The study algorithm given earlier doesn’t fit easily into this structure. Nevertheless, we could consider the combination of the learning topic, S, and the grade-system as the environment. Each assessment, t, is a trial of the study plan, Q, with grades, r, providing an evaluation of the effectiveness of study. The study plan is closely related to the value function: it directs the choices of how to traverse the state-space (the learning topic).

This isn’t a perfect analogy, but it leads us to the point of reinforcement feedback: to adjust what is perceived as valuable. We could reach for a reinforcement learning algorithm whenever we are searching for the best version of a routine or a skill and the only feedback we receive is a success-or-failure measurement.
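As a sketch of that idea, suppose each candidate routine can only be tried and then judged a success or a failure. A simple bandit-style learner keeps a running success rate per routine, a crude value function, and favours whichever routine has worked best so far. The try_routine callback is a hypothetical stand-in that returns True on success:

import random

def learn_routine(routines, try_routine, trials=1000, epsilon=0.1):
    successes = {r: 0 for r in routines}
    attempts = {r: 0 for r in routines}

    def success_rate(r):
        return successes[r] / attempts[r] if attempts[r] else 0.0

    for _ in range(trials):
        if random.random() < epsilon:
            routine = random.choice(routines)          # explore an alternative
        else:
            routine = max(routines, key=success_rate)  # exploit the best so far
        attempts[routine] += 1
        if try_routine(routine):                       # the only feedback: success or failure
            successes[routine] += 1
    return max(routines, key=success_rate)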

Coming back to the example algorithm, though, treating grades as the only reinforcement feedback in education is a terrible over-simplification. For example, consider the case of a school in a low socio-economic area where getting a good grade will actually get you punished by your peers. Or consider the case of a child that is given great praise for being “smart”. In a related situation, consider the case of praising a young girl for “looking pretty”. How is the perception of value, particularly self-worth, affected by this praise?

Children, and people in general, feel that the acceptance and approval of their peers is a reward. Praise is a form of approval, and criticism is a form of punishment, and each is informative of what should be valued. If children are punished for looking smart, they will probably value the acceptance of their peers over learning. If children are praised for being smart, they may end up simply avoiding anything that makes them look unintelligent. If children are praised for looking pretty, they may end up valuing looking pretty over being interesting and “good” people.

A solution could be to try to be more discerning about what we praise and criticise. The article linked above makes a good point about praising children for “working hard” rather than “being smart”. Children who feel that their effort is valued are more likely to try hard, even in the face of failure. Children who feel that praise will only come when they are successful will try to avoid failure. Trying to give children self-esteem by praising them for being smart or pretty is, in fact, making their self-esteem contingent on the very thing they are being praised for.

It may seem manipulative, but praise and criticism are ways we can reward and punish people. Most people will adjust their “value-function”, their perception of what is valuable, and, as a result, they will adjust their actions to try to attain further praise or avoid further punishment. What we praise and criticise ourselves for is a reflection of what we value in ourselves. And our self-praise and self-criticism can also be used to influence our values and self-esteem, and hence our actions.

Simulating stimuli and moral values

This is the fifth post in a series about rewards and values. Previous posts touched on the neurological origins of pleasure and reward in biological organisms, and mentioned the evolution of pleasure and the discovery of supernormal stimuli. This post highlights some issues surrounding happiness and pleasure as ends to be sought.

First let’s refresh: we have evolved sensations and feelings, including pleasure and happiness. These feelings are designed to enhance our survival in the world in which they developed: the prehistoric world, where survival was tenuous and selection favoured the “fittest”. This process of evolving first the base feelings of pleasure, wanting and desire, and later the warm social feelings of friendship, attachment and social contact, couldn’t account for the facility we now have for tricking these neural systems into strong, but ‘false’, positives. Things like drugs, pornography and Facebook can all deliver large doses of pleasure, either by directly stimulating the brain or by simulating what evolved to be pleasurable experiences.

So where does that get us? The various forms of utilitarianism usually aim to maximise some value. By my understanding, in plain utilitarianism the aim is to maximise happiness (sometimes described as increasing pleasure and reducing suffering), in hedonism it is sensual pleasure, and in preference utilitarianism it is the satisfaction of preferences. Pleasure may once have seemed like a good pursuit, but now that we have methods of creating pleasure at the push of a button, being hooked up to a machine hardly seems like a “good” way to live. And if we consider our life-long search for pleasure as an ineffective process of trying to find out how to push our biological buttons, pleasure may seem like a fairly poor yardstick for measuring “good”.

Happiness is also a mental state that people have varying degrees of success in attaining. Just because we haven’t had the same success in creating happiness “artificially” doesn’t mean that it is a much better end to seek. Of course the difficulty of living with depression is undesirable, but if we all could become happy at the push of a button the feeling might lose some value. Even the more abstract idea of satisfying preferences might not get us much further, since many of our preferences are for avoiding suffering and attaining pleasure and happiness.

Of course, in all this we might be forgetting (or ignoring the perspective) that pleasure and pain are evolved responses that inform us how to survive. And here comes a leap:

Instead of valuing feelings we could value an important underlying result of the feelings: learning about ourselves and the world.

The general idea of valuing learning and experience might not be entirely new; Buddhism has long been about seeking enlightenment to relieve suffering and find happiness. However, treating learning and gaining experience as valuable ends, with the pleasure, pain or happiness they might arouse as additional aspects of those experiences, isn’t something I’ve seen in discussions of moral values. Clearly there are sources of pleasure and suffering that are debilitating or don’t result in any “useful” learning, e.g., drug abuse and bodily mutilation, so these should be avoided. But where would a system of ethics and morality based on valuing learning and experience take us?

This idea will be extended and fleshed out in much more detail in a new blog post series starting soon. To conclude this series on rewards and values, I’ll describe an interesting thought experiment for evaluating systems of value: what would an (essentially) omnipotent artificial intelligence do if maximising those values?

Evolving rewards and values

This is the fourth post in a series on rewards and values. Previous posts have discussed how rewards are internally generated in the brain and how machines might be able to learn autonomously the same way. But the problem still exists: how is the “reward function” created? In the biological case, evolution is a likely candidate. However, the increasing complexity of behaviour seen in humans seems to have led to an interesting flexibility in what are perceived as rewards and punishments. In robots a similar flexibility might be necessary.

Note: here the phrase “reward function” is used to describe the process of taking some input (e.g. the perceived environment) and calculating “reward” as output. The phrase “value function” is used with a similar meaning.
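As a sketch of that distinction, a designer-supplied reward function maps a percept to a scalar, while a value function is what the agent learns from those rewards. The percept fields used here (food_found, damage) are invented purely for illustration:

def reward_function(percept):
    """Designer-given: maps the perceived environment to a scalar reward."""
    if percept.get("food_found"):
        return 1.0
    if percept.get("damage"):
        return -1.0
    return 0.0

# A value function, by contrast, is learned from experience: it estimates the
# long-run reward expected from a state or action, not the immediate payoff.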

Let’s start with the question posed in the previous post: what comes first – values or rewards? The answer might be different depending on whether we are talking about machine or biological reinforcement learning. A robot or a simulated agent will usually be given a reward function by a designer. The agent will explore the environment and receive rewards and punishments, and it will learn a “value function”. So we could say that, to the agent, the rewards precede the values. At the very least, the rewards precede the learning of values. But the designer knew what the robot should be rewarded for – knew what result the agent should value. The designers had some valuable state in mind when they designed the reward function. To the designer, the value informs the reward.

How about rewards in animals and humans? The reward centres of the brain are not designed, in the sense of having a designer; instead they evolved. What we individually value and desire as animals and humans is largely determined by what we feel is pleasurable and what is not. This value is translated into learned drives to perform certain actions. The process of genetic recombination and mutation, a key component of evolution, produces different animal anatomies (including digestive systems) and different pleasure responses to the environment. Animals that find pleasure in eating food that is readily available and compatible with their digestive system will have a much greater chance of survival than animals that only find pleasure in eating things that are rare or poisonous. Through natural selection, pleasure could be expected to converge towards what is valuable to the animal.

In answer to the question of what comes first, rewards or values, it would seem that value comes first. Of course, this definition of “value” relates to the objective fact of what the animal or agent must do to achieve its goals of survival or some given purpose. But what of humans? Evolutionary psychology and evolutionary neuroscience give us reasonable grounds to say that, along with brain size and structure, many human behaviours and their underlying neural processes developed through natural selection. While such hypotheses are difficult to test, people seem to have evolved to feel pleasure from socialising, driving us to make social bonds and form groups. And people seem to have evolved feelings of social discomfort: displeasure from embarrassment and being rejected. Although the circumstances that caused the selection of social behaviours aren’t clear, many of our pleasure and displeasure responses seem able to be rationalised in terms of evolution.

An interesting aspect of the human pleasure response is the pleasure taken in achievements. It is certainly normal for Olympic gold medallists to feel elation at winning. But even small or common victories, such as our first unaided steps as a child or managing to catch a ball, can elicit varying amounts of pleasure and satisfaction. Is this a pleasure due to the adulation and praise of onlookers that we have been wired to enjoy? Or is there a more fundamental mechanism by which success at any self-determined goal causes pleasure? This could be related to the loss of pleasure and enjoyment that is often associated with Parkinson’s disease: areas of the brain involved in inhibiting and coordinating movement, which deteriorate as part of the disease, are also strongly associated with reward and pleasure generation.

And we can bring this back to an autonomous robot that generates its own reward: a robot that has multiple purposes will need different ways of valuing the objects and states of the environment depending on its current goal. When crossing the road the robot needs to avoid cars; when cleaning a car the robot might even need to enter the car. This kind of flexibility in determining the current goal, with reward feedback that signals on one level whether the goal has been reached, and on another whether the goal was as good as it was thought to be, could be an important ingredient in the development of “intelligent” robots.
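A minimal sketch of that kind of goal-dependent valuation might look as follows; the goal names and percept fields (near_car, inside_car) are hypothetical illustrations:

def goal_conditioned_reward(goal, percept):
    """The same percept is valued differently depending on the current goal."""
    if goal == "cross_road":
        return -10.0 if percept["near_car"] else 1.0   # cars are hazards to avoid
    if goal == "clean_car":
        return 1.0 if percept["inside_car"] else 0.0   # here the car is the target
    return 0.0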

However, before I conclude, let’s consider one fallout from the planetary dominance of humans: our selection pressures have nearly disappeared. We are, after all, the current dominant species on this planet. Perception-reward centres that did not evolve to deal with newly discovered and human-manufactured stimuli are unlikely to be strongly selected against. And through our ingenuity we have found powerful ways to “game” our evolved pleasure centres, finding and manufacturing super-normal stimuli.

Dan Dennett: Cute, sweet, sexy, funny (TED talk video available on YouTube).

The video linked above features Dan Dennett describing how evolution has influenced our feelings of what is “cute, sweet, sexy and funny”. The result is possibly the opposite of what we intuitively feel and think: there is nothing inherently cute, sweet, sexy or funny; these sensations and feelings evolved in us to find value in our surroundings and each other. We have evolved to find babies and young animals cute, to find food that is high in energy tasty, and to find healthy members of the opposite sex attractive. Funny wasn’t explained as clearly; Dan Dennett described a hypothesis that it is related to making boring or unpleasant jobs bearable. I would speculate that humour might also have been selected for making people more socially attractive, or at the very least more bearable. 🙂

Building on this understanding of pleasure and other feelings being evolved, the topic of the next post in this series will be super-normal stimuli and how they influence our views on human values and ethics. Let’s begin the adventure into the moral minefield.