This is the second post in a series on rewards and values. The previous post discussed whether rewards are external stimuli or internal brain activity. This post discusses the important issue of representing the world in a computer or robot, and the practice of describing the world as discrete quantities or abstract qualities.
The world we live in can usually be described as being in a particular state, e.g., the sky is cloudy, the door is closed, the car is travelling at 42km/h. To be absolutely precise about the state of the world we need to use quantities and measurements: position, weight, number, volume, and so on. But how often do we people know the precise quantitative state of the world we live in? Animals and people often get by without quantifying the exact conditions of the world around them, instead perceiving the qualities of the world – recognising and categorising things, and making relative judgements about position, weight, speed, etc. But then why are robots often programmed to use quantitative descriptions of the world rather than qualitative descriptions? This is a complex issue that won’t be comprehensively answered in this post, but some differences in computer-based representation with quantities and with qualities will be described.
For robots, the world can be represented as a state space. When dealing with measurable quantities the state space often divides up the world into partitions. A classic example in reinforcement learning is navigating a “gridworld”. In the gridworld, the environment the agent finds itself in is literally a square grid, and the agent can only move in the four compass directions (north, south, east and west). In the computer these actions and states would usually be represented as numbers: state 1, state 2, …, state n, and action 1, action 2, …, action m. The “curse of dimensionality” appears because, to store the value of every state-action pair, the agent needs as many entries as the number of states multiplied by the number of actions, and the number of states itself multiplies with every added dimension. If we add another dimension to the environment with another k possible values, our number of states is multiplied by k. A ten by ten grid, with another dimension of 10 values, goes from having 100 states to 1000 states. With four different movements available the agent has 4 actions, so there would be 4000 state-action pairs.
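The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `q_table_size` is made up for illustration, and it assumes a plain tabular agent that stores one value per state-action pair.

```python
from math import prod


def q_table_size(dimension_sizes, n_actions):
    """Number of values a tabular agent stores: states multiplied by actions."""
    n_states = prod(dimension_sizes)  # each added dimension multiplies the state count
    return n_states * n_actions


grid_only = q_table_size([10, 10], n_actions=4)            # 100 states
with_extra_dim = q_table_size([10, 10, 10], n_actions=4)   # 1000 states
print(grid_only, with_extra_dim)  # 400 4000
```

Adding one more 10-valued dimension would multiply the table by ten again, which is why the table quickly becomes too large to fill in by trial and error.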
While this highlights one serious problem of representing the world quantitatively, an equally serious problem is deciding how fine our quantity divisions should be. If the agent is a mobile robot driving around a laboratory with only square obstacles, we could probably get by dividing the world up into 50cm x 50cm squares. But if the robot was required to pick something up from a tabletop, it might need to have accuracy down to the centimetre. If it drives around the lab as well as picking things up from tabletops, dividing up the world gets trickier. The grid that makes up the world can’t just describe occupancy: areas of the grid occupied by objects of interest need to be specified as those objects, adding more state dimensions to the representation.
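To get a feel for how much the choice of resolution matters, here is a rough sketch in Python. The 10m x 10m lab size is an assumption picked purely for illustration, and the helper names `cells_per_side` and `to_cell` are hypothetical.

```python
LAB_SIDE_CM = 1000  # assumed lab size: 10m x 10m (illustrative only)


def cells_per_side(side_cm, cell_cm):
    """Number of grid cells along one side, rounding up to cover the whole side."""
    return -(-side_cm // cell_cm)  # ceiling division on integers


def to_cell(x_cm, y_cm, cell_cm):
    """Map a position in centimetres to the (column, row) of the cell containing it."""
    return (x_cm // cell_cm, y_cm // cell_cm)


coarse_states = cells_per_side(LAB_SIDE_CM, 50) ** 2  # 50cm squares for driving
fine_states = cells_per_side(LAB_SIDE_CM, 1) ** 2     # 1cm squares for grasping
print(coarse_states, fine_states)  # 400 1000000

# The same physical position lands in very different cells at each resolution.
print(to_cell(120, 275, 50))  # (2, 5)
print(to_cell(120, 275, 1))   # (120, 275)
```

Under these assumptions, the resolution that suits driving gives 400 position states, while the resolution that suits picking things up gives a million, and that is before any extra dimensions for the objects themselves.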
When we people make a choice to do something, like walk to the door, we don’t typically update that choice each time we move 50cm. We collapse all the steps along the way into a single action. Hierarchical reinforcement learning does just this, with algorithms coming under this banner collecting low level actions into high level actions, hierarchically. One popular framework collects actions into “options”. An option combines a method of selecting actions (e.g., keep going north) with end-conditions (e.g., hit a wall or run out of time). This reduces the number of times an agent needs to make a choice: instead of choosing ‘go north’ 100 times, it chooses once and sees how things pan out. This simplifies the process of choosing actions that the agent performs, but it doesn’t simplify the representation of the environment.
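A stripped-down sketch of the idea in Python might look like the following. It is not the full options framework: it only keeps an action-selection rule and an end-condition, and every name here (`Option`, `run_option`, `step`, the 10 x 10 grid) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

State = Tuple[int, int]  # (column, row) in the grid

GRID_SIZE = 10  # assumed 10 x 10 gridworld
MOVES = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def step(state: State, action: str) -> State:
    """Move one cell; walking into the boundary leaves the agent where it is."""
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), GRID_SIZE - 1)
    y = min(max(state[1] + dy, 0), GRID_SIZE - 1)
    return (x, y)


@dataclass
class Option:
    policy: Callable[[State], str]             # which low-level action to take
    should_stop: Callable[[State, int], bool]  # end-condition: (state, steps so far)


def run_option(option: Option, state: State) -> Tuple[State, int]:
    """Repeat the option's low-level actions until its end-condition is met."""
    steps = 0
    while not option.should_stop(state, steps):
        state = step(state, option.policy(state))
        steps += 1
    return state, steps


# 'Go north' until the agent reaches the top wall or has taken 100 steps.
go_north = Option(
    policy=lambda s: "north",
    should_stop=lambda s, n: s[1] == GRID_SIZE - 1 or n >= 100,
)

final_state, primitive_steps = run_option(go_north, (3, 2))
print(final_state, primitive_steps)  # (3, 9) 7
```

The agent makes one choice (`go_north`) in place of seven separate ones, but notice that the state is still a pair of grid coordinates: the representation of the environment hasn’t changed at all.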
When we look around we see objects. Right now you are looking at some sort of computer screen, and we also see the objects that make up any room we’re in: the door, the floor, the walls, tables and chairs. In our minds, we would seem to represent the world around us as a combined visual and spatial collection of objects. Describing the things in the world as objects, the way people think of them, allows our “unit of representation” to be any size, and can dramatically simplify the way the world is described. And that is what is happening in more recent developments in machine learning, specifically with relational reinforcement learning and object-oriented reinforcement learning.
In relational reinforcement learning, things in the world are described by their relationship to other things. For example, the coffee cup is on the table and the coffee is in the coffee cup. These relations can usually be described using simple logic statements. Similar to relational abstraction of the world, object-oriented reinforcement learning allows objects to have properties and have associated actions, much like classes in object-oriented programming. Given that object-oriented programming was designed partly because it was related to how we people describe the world, viewing the world as objects has a lot of conceptual benefits. The agent considers the state of objects and learns the effects of actions with those objects. In the case of a robot, we reduce the problem of having large non-meaningful state spaces, but then run into the challenge of recognising objects – a serious hurdle in the world of robotics that isn’t yet solved.
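As a rough illustration of the two styles, here is how the coffee cup and a door might be written down in Python. This is only a sketch of the representations, not of any particular learning algorithm, and the names (`relations`, `holds`, `Door`) are invented for the example.

```python
from dataclasses import dataclass

# Relational view: the state is a set of simple logical facts,
# each written as (predicate, thing, other_thing).
relations = {
    ("on", "coffee_cup", "table"),
    ("in", "coffee", "coffee_cup"),
}


def holds(predicate, a, b):
    """True if the fact predicate(a, b) is part of the current state."""
    return (predicate, a, b) in relations


# Object-oriented view: an object carries its own properties and the
# actions that can be performed with it, much like a class.
@dataclass
class Door:
    is_open: bool = False

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False


print(holds("on", "coffee_cup", "table"))  # True
print(holds("on", "coffee", "table"))      # False

door = Door()
door.open()
print(door.is_open)  # True
```

Neither description mentions grid cells or centimetres. The state is whatever facts and object properties currently hold, which is what lets the “unit of representation” be any size.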
One historical reason why quantitative divisions of the state space were used in the first place is that some problems, such as balancing a broom or gathering momentum to get up a slope, were designed to use as little prior information and as little sensory feedback as possible. The challenge then became how to get a system to efficiently learn to solve these problems when having to blindly search for a reward or avoid a punishment. Generally speaking, many of the tasks requiring the discrete division of a continuous range are ones that involve some sort of motor control. These are the same sort of tasks that people perform using vision and touch to provide much more detailed feedback than plain success or failure. They are also tasks in which we couldn’t feel our success or failure unless we could sense what was happening and had hard-wired responses or some goal in mind (or had someone watching to give feedback). This might mean that reinforcement learning is really the wrong tool for learning low-level motor control, unless, that is, we don’t care to give our robots eyes.
This leads me to the topic of the next post in this series on rewards and values: “Self-rewarding autonomous machines”. I’ll discuss how a completely autonomous machine will need the perceptual capability to detect “good” and “bad” events and reward itself. I’ll also discuss how viewing the world as “objects” that drive actions will lead to a natural analogy with how animals and people function in the world.