Learning Algorithms for People: Reinforcement Learning

In a previous post I described some simple ways in which learning could be performed algorithmically, replicating the basic process of supervised learning in machines. Of course it’s no secret that repetition is important for learning many things, but the exact way in which we repeat a training set or trial depends on what it is we are trying to learn or teach. I have written a series of posts on values and reinforcement learning in machines and animals, so here I will describe a process for applying reinforcement learning to developing learning strategies. Perhaps more importantly, I will discuss a significant notion in machine learning and its relationship to the psychology of conditioning: the value function.

Let’s start with some pseudo-code for a human reinforcement learning algorithm that might be representative of certain styles of learning:

given learning topic, S, a set of assessments, T, and study plan, Q
    for each assessment, t in T
        study learning topic, S, using study plan, Q
        answer test questions in t
        record grade feedback, r
        from feedback, r, update study plan, Q 

This algorithm is vague on the details, but the approach of updating a study plan fits the common system of education: one where people are given material to learn, and grades as feedback on their responses to assignments and tests.
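
To make that loop a little more concrete, here is a minimal Python sketch of the same idea. The study, take_assessment and update_study_plan callables, and the assumption that an assessment can grade a set of answers, are hypothetical placeholders for illustration, not any real API:

def reinforcement_study_loop(topic, assessments, study_plan,
                             study, take_assessment, update_study_plan):
    # One pass over the assessments: study, get graded, adjust the plan.
    # The three callables are hypothetical placeholders, not a real API.
    for assessment in assessments:
        study(topic, study_plan)                      # study topic S using plan Q
        answers = take_assessment(assessment, topic)  # answer the test questions in t
        grade = assessment.grade(answers)             # record grade feedback, r
        study_plan = update_study_plan(study_plan, grade)  # update plan Q from r
    return study_plan

The important part is the shape of the loop: act, receive graded feedback, adjust the plan.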

Let’s come back to the basic definitions of computer-based reinforcement learning. The typical components in reinforcement learning are the state-space, S, which describes the environment; the action-space, A, which describes the options for attempting to transition to different states; and the value function, Q, which is used to pick an action in any given state. The reinforcement feedback (reward and punishment) can be treated as coming from the environment, and is used to update the value function.
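
As a reference point, the standard tabular Q-learning update shows how the reinforcement feedback adjusts the value function. A minimal sketch, assuming a small discrete state and action space with arbitrary sizes, a learning rate alpha and a discount factor gamma:

import numpy as np

n_states, n_actions = 100, 4         # arbitrary sizes for the sketch
alpha, gamma = 0.1, 0.9              # learning rate and discount factor
Q = np.zeros((n_states, n_actions))  # value function: one value per state-action pair

def q_update(s, a, r, s_next):
    # One step of the standard tabular Q-learning update:
    # move Q(s, a) towards r + gamma * max over a' of Q(s', a')
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])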

The algorithm above doesn’t easily fit into this structure. Nevertheless, we could consider the combination of the learning topic, S, and the grade-system as the environment. Each assessment, t, is a trial of the study plan, Q, with grades, r, providing an evaluation of the effectiveness of study. The study plan is closely related to the value function — it directs the choices of how to traverse the state-space (learning topic).

This isn’t a perfect analogy, but it leads us to the point of reinforcement feedback: to adjust what is perceived as valuable. We could try to use a reinforcement learning algorithm whenever we are searching for the best version of a routine or a skill, and all we receive as feedback is a success or failure measurement.

Coming back to the example algorithm, though, treating grades as the only reinforcement feedback in education is a terrible over-simplification. For example, consider the case of a school in a low socio-economic area where getting a good grade will actually get you punished by your peers. Or consider the case of a child that is given great praise for being “smart”. In a related situation, consider the case of praising a young girl for “looking pretty”. How is the perception of value, particularly self-worth, affected by this praise?

Children, and people in general, feel that the acceptance and approval of their peers is a reward. Praise is a form of approval, and criticism is a form of punishment, and each is informative of what should be valued. If children are punished for looking smart, they will probably value the acceptance of their peers over learning. If children are praised for being smart, they may end up simply avoiding anything that makes them look unintelligent. If children are praised for looking pretty, they may end up valuing looking pretty over being interesting and “good” people.

A solution could be to try to be more discerning about what we praise and criticise. The article linked above makes a good point about praising children for “working hard” rather than “being smart”. Children who feel that their effort is valued are more likely to try hard, even in the face of failure. Children who feel that praise will only come when they are successful, will try to avoid failure. Trying to give children self-esteem by praising them for being smart or pretty is, in fact, making their self-esteem contingent on that very thing they are being praised for.

It may seem manipulative, but praise and criticism are ways we can reward and punish people. Most people will adjust their “value-function”, their perception of what is valuable, and, as a result, they will adjust their actions to try to attain further praise, or avoid further punishment. What we praise and criticise ourselves for is a reflection of what we value in ourselves. And our self-praise and self-criticism can also be used to influence our values and self-esteem, and hence our actions.

State space: Quantities and qualities

This is the second post in a series on rewards and values; the previous post discussed whether rewards are external stimuli or internal brain activity. This post discusses the important issue of representing the world in a computer or robot, and the practice of describing the world as discrete quantities or abstract qualities.

The world we live in can usually be described as being in a particular state, e.g., the sky is cloudy, the door is closed, the car is travelling at 42km/h. To be absolutely precise about the state of the world we need to use quantities and measurements: position, weight, number, volume, and so on. But how often do we actually know the precise quantitative state of the world we live in? Animals and people often get by without quantifying the exact conditions of the world around them, instead perceiving the qualities of the world – recognising and categorising things, and making relative judgements about position, weight, speed, etc. But then why are robots often programmed to use quantitative descriptions of the world rather than qualitative descriptions? This is a complex issue that won’t be comprehensively answered in this post, but I will describe some of the differences between computer-based representations built on quantities and those built on qualities.

For robots, the world can be represented as a state space. When dealing with measurable quantities, the state space often divides the world into partitions. A classic example in reinforcement learning is navigating a “gridworld”. In a gridworld, the environment the agent finds itself in is literally a square grid, and the agent can only move in the four compass directions (north, south, east and west). In the computer these states and actions would usually be represented as numbers: state 1, state 2, …, state n, and action 1, action 2, …, action m. The “curse of dimensionality” appears because storing a value for every state-action pair requires as many entries as the number of states multiplied by the number of actions. If we add another dimension to the environment with k possible values, our number of states is multiplied by k. A ten-by-ten grid with another dimension of 10 values goes from having 100 states to 1000 states. With four different movements available the agent has 4 actions, so there would be 4000 state-action pairs.
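
Spelling that arithmetic out as a quick calculation (the numbers are just the ones from the example above, and the third dimension is purely illustrative):

grid_width, grid_height = 10, 10
extra_values = 10      # a hypothetical third dimension with 10 possible values
n_actions = 4          # north, south, east, west

states_2d = grid_width * grid_height     # 100 states
states_3d = states_2d * extra_values     # 1000 states
print(states_2d * n_actions)             # 400 state-action pairs
print(states_3d * n_actions)             # 4000 state-action pairs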

While this highlights one serious problem of representing the world quantitatively, an equally serious problem is deciding how fine our divisions should be. If the agent is a mobile robot driving around a laboratory with only square obstacles, we could probably get by dividing the world up into 50cm x 50cm squares. But if the robot was required to pick something up from a tabletop, it might need accuracy down to the centimetre. If it drives around the lab as well as picking things up from tabletops, dividing up the world gets trickier. The grid that makes up the world can’t just describe occupancy; areas of the grid occupied by objects of interest need to be identified as those objects, adding more state dimensions to the representation.

When we people make a choice to do something, like walk to the door, we don’t typically update that choice each time we move 50cm. We collapse all the steps along the way into a single action. Hierarchical reinforcement learning does just this, with algorithms under this banner collecting low-level actions into high-level actions, hierarchically. One popular framework collects actions into “options”: each option bundles a way of selecting actions (e.g., keep going north) with end-conditions (e.g., hit a wall or run out of time), so the agent makes one choice (pick the ‘go north’ option) rather than a hundred small ones to see how things pan out. This simplifies the process of choosing the actions that the agent performs, but it doesn’t simplify the representation of the environment.
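
Here is a rough sketch of the idea, assuming states are plain dictionaries and the environment is reduced to a transition function; this is a simplification of the options framework, which also defines things like an initiation set:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    # A simplified 'option': a fixed low-level policy plus an end-condition.
    policy: Callable[[dict], str]       # maps a state to a low-level action
    terminate: Callable[[dict], bool]   # e.g., hit a wall or ran out of time

def run_option(env_step, state, option, max_steps=100):
    # Execute one high-level choice by repeating low-level actions until the
    # option terminates; env_step is a hypothetical (state, action) -> state function.
    for _ in range(max_steps):
        if option.terminate(state):
            break
        state = env_step(state, option.policy(state))
    return state

# e.g., a 'go north until blocked' option, chosen once rather than 100 times
go_north = Option(policy=lambda s: "north",
                  terminate=lambda s: s.get("wall_to_north", False))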

When we look around we see objects. Right now you are looking at some sort of computer screen, and you can also see the objects that make up whatever room you’re in: the door, the floor, the walls, tables and chairs. In our minds, we seem to represent the world around us as a combined visual and spatial collection of objects. Describing the things in the world as the objects we perceive them to be allows our “unit of representation” to be any size, and can dramatically simplify the way the world is described. And that is what is happening in more recent developments in machine learning, specifically with relational reinforcement learning and object-oriented reinforcement learning.

In relational reinforcement learning, things in the world are described by their relationship to other things. For example, the coffee cup is on the table and the coffee is in the coffee cup. These relations can usually be described using simple logic statements. Similar to relational abstraction of the world, object-oriented reinforcement learning allows objects to have properties and associated actions, much like classes in object-oriented programming. Given that object-oriented programming was designed partly to reflect how we people describe the world, viewing the world as objects has a lot of conceptual benefits. The agent considers the state of objects and learns the effects of actions with those objects. In the case of a robot, we reduce the problem of having large, non-meaningful state spaces, but then run into the challenge of recognising objects – a serious hurdle in the world of robotics that isn’t yet solved.
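
To illustrate the difference, here is a hedged sketch of what the two styles of state description might look like; the predicates and the CoffeeCup class are invented for this example rather than taken from any particular relational or object-oriented reinforcement learning system:

# Relational-style state: logic-like facts relating objects to one another
relational_state = {
    ("on", "coffee_cup", "table"),
    ("in", "coffee", "coffee_cup"),
}

# Object-oriented-style state: objects with properties and associated actions,
# much like classes in object-oriented programming
class CoffeeCup:
    def __init__(self, location, contents=None):
        self.location = location
        self.contents = contents

    def pick_up(self, gripper):
        # an action associated with the object; the agent learns its effects
        self.location = gripper

cup = CoffeeCup(location="table", contents="coffee")
cup.pick_up("robot_gripper")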

A historical reason why quantitative divisions of the state space were used in the first place is that some problems, such as balancing a broom or gathering momentum to get up a slope, were designed to use as little prior information and as little sensory feedback as possible. The challenge became how to get a system to efficiently learn to solve these problems while blindly searching for a reward or avoiding a punishment. Generally speaking, many of the tasks requiring the discrete division of a continuous range are ones that involve some sort of motor control. These are the same sort of tasks that people perform using vision and touch, which provide much more detailed feedback than plain success or failure; the same sort of tasks in which we couldn’t feel our success or failure unless we could sense what was happening and had hard-wired responses or some goal in mind (or had someone watching to give feedback). This might mean that reinforcement learning is really the wrong tool for learning low-level motor control, unless, that is, we don’t care to give our robots eyes.

This leads me to the topic of the next post in this series on rewards and values: “Self-rewarding autonomous machines”. I’ll discuss how a completely autonomous machine will need the perceptual capability to detect “good” and “bad” events and reward itself. I’ll also discuss how viewing the world as “objects” that drive actions leads to a natural analogy with how animals and people function in the world.