Learning Algorithms for People: Reinforcement Learning

In a previous post I described some simple ways in which learning could be performed algorithmically, replicating the basic process of supervised learning in machines. Of course it’s no secret that repetition is important for learning many things, but the exact way in which we repeat a training set or trial depends on what it is we are trying to learn or teach. I wrote a series of posts on values and reinforcement learning in machines and animals, so here I will describe a process for applying reinforcement learning to developing learning strategies. Perhaps more importantly, though, I will discuss a significant notion in machine learning, the value function, and its relationship to psychological results on conditioning.

Let’s start with some pseudo-code for a human reinforcement learning algorithm that might be representative of certain styles of learning:

given learning topic, S, a set of assessments, T, and study plan, Q
    for each assessment, t in T
        study learning topic, S, using study plan, Q
        answer test questions in t
        record grade feedback, r
        from feedback, r, update study plan, Q 

This algorithm is vague on the details, but this approach of updating a study plan fits into the common system of education, one where people are given material to learn and receive grades as feedback on their responses to assignments and tests.
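To make the loop a little more concrete, here is a minimal Python sketch of it. Everything specific in it is my own assumption for illustration: the study plan Q is reduced to a running average grade per study strategy, the strategy names and their average grades are invented, each strategy is assumed promising (value 1.0) until it has been tried, and the learner occasionally tries a strategy at random. It is one possible reading of the pseudo-code, not a definitive implementation.

```python
import random

def run_study_loop(strategies, grade_fn, n_assessments, epsilon=0.1, seed=0):
    """Update a study plan Q from grade feedback r over a series of assessments."""
    rng = random.Random(seed)
    # Study plan Q: estimated grade for each strategy. Starting at 1.0 is an
    # assumption (untried strategies look promising until proven otherwise).
    Q = {s: 1.0 for s in strategies}
    tries = {s: 0 for s in strategies}
    for _ in range(n_assessments):             # for each assessment t in T
        if rng.random() < epsilon:             # occasionally try something different
            choice = rng.choice(strategies)
        else:                                  # otherwise study the way that looks best so far
            choice = max(strategies, key=Q.get)
        r = grade_fn(choice, rng)              # answer test questions, record grade feedback r
        tries[choice] += 1
        Q[choice] += (r - Q[choice]) / tries[choice]   # running average of grades
    return Q

# Hypothetical grading model: an assumed average grade per strategy, plus noise.
ASSUMED_MEAN_GRADE = {"reread notes": 0.55, "practice problems": 0.80, "cram": 0.40}

def noisy_grade(strategy, rng):
    grade = ASSUMED_MEAN_GRADE[strategy] + rng.uniform(-0.1, 0.1)
    return min(1.0, max(0.0, grade))

plan = run_study_loop(list(ASSUMED_MEAN_GRADE), noisy_grade, n_assessments=200)
best = max(plan, key=plan.get)
print(best, plan)
```

With these invented grades the plan ends up favouring practice problems, simply because that is the strategy the assumed grading model rewards most. The point is only that the plan changes in response to feedback, not that these numbers mean anything.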

Let’s come back to the basic definitions of computer-based reinforcement learning. The typical components are the state-space, S, which describes the environment; the action-space, A, which gives the options for attempting to transition to different states; and the value function, Q, which is used to pick an action in any given state. The reinforcement feedback, reward and punishment, can be treated as coming from the environment, and is used to update the value function.
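As a small illustration of how these components fit together, here is a sketch using tabular Q-learning, a standard algorithm that I am choosing for the example. The environment is a made-up corridor of five positions with a reward only at the far end, and the learning rate, discount and exploration rate are arbitrary assumptions.

```python
import random

STATES = range(5)               # state-space S: positions 0..4 (hypothetical corridor)
ACTIONS = ("left", "right")     # action-space A
GOAL = 4

def step(state, action):
    """Environment: return (next_state, reward); reward is 1 only on reaching the goal."""
    nxt = max(0, state - 1) if action == "left" else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0)

def greedy(Q, state, rng):
    """Pick the highest-valued action in this state, breaking ties at random."""
    top = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == top])

def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}   # value function Q
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            explore = rng.random() < epsilon
            action = rng.choice(ACTIONS) if explore else greedy(Q, state, rng)
            nxt, reward = step(state, action)
            # Move Q toward the reward plus the discounted value of the next state.
            best_next = max(Q[(nxt, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = nxt
    return Q

Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(policy)
```

The only feedback the agent ever receives is the reward at the goal, yet the value function ends up ranking "right" above "left" in every position, because each update passes a discounted share of that reward back to the states that lead to it.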

The study algorithm at the start of this post doesn’t easily fit into this structure. Nevertheless, we could consider the combination of the learning topic, S, and the grade-system as the environment. Each assessment, t, is a trial of the study plan, Q, with grades, r, providing an evaluation of the effectiveness of study. The study plan is closely related to the value function: it directs the choices of how to traverse the state-space (the learning topic).

This isn’t a perfect analogy, but it leads us to the point of reinforcement feedback: to adjust what is perceived as valuable. We could try to use a reinforcement learning algorithm whenever we are trying to search for the best solution for a routine or a skill, and all we receive as feedback is a success-failure measurement.

Coming back to the example algorithm, though, considering grades as the only reinforcement feedback in education is a terrible over-simplification. For example, consider the case of a school in a low socio-economic area where getting a good grade will actually get you punished by your peers. Or consider the case of a child who is given great praise for being “smart”. In a related situation, consider the case of praising a young girl for “looking pretty”. How is the perception of value, particularly self-worth, affected by this praise?

Children, and people in general, feel that the acceptance and approval of their peers is a reward. Praise is a form of approval, and criticism is a form of punishment, and each is informative of what should be valued. If children are punished for looking smart, they will probably value the acceptance of their peers over learning. If children are praised for being smart, they may end up simply avoiding anything that makes them look unintelligent. If children are praised for looking pretty, they may end up valuing looking pretty over being interesting and “good” people.

A solution could be to try to be more discerning about what we praise and criticise. The article linked above makes a good point about praising children for “working hard” rather than “being smart”. Children who feel that their effort is valued are more likely to try hard, even in the face of failure. Children who feel that praise will only come when they are successful will try to avoid failure. Trying to give children self-esteem by praising them for being smart or pretty is, in fact, making their self-esteem contingent on the very thing they are being praised for.

It may seem manipulative, but praise and criticism are ways we can reward and punish people. Most people will adjust their “value-function”, their perception of what is valuable, and, as a result, they will adjust their actions to try to attain further praise or avoid further punishment. What we praise and criticise ourselves for is a reflection of what we value in ourselves. And our self-praise and self-criticism can also be used to influence our values and self-esteem, and hence our actions.

