This is the third post in a series about rewards and values. I’ve written about rewards being generated from within the brain, rather than actually being external, and how autonomous robots must be able to calculate their own rewards in a similar way. I’ve also written about a way in which the world is represented for robots – the “state space” – and the benefits of using object-oriented representations. Combining these two things is the aim of this post: a robot that perceives the world as objects and that rewards itself. But how exactly should this be done, and what is the use of learning this way?
First, let’s consider another feature of reinforcement learning. I’ve described a state space, but I didn’t go into “state-action values”. The world is divided up into states, and each state has a range of possible actions. The value (state-action value) of any of those actions in that state is the reward we can expect, now and in the future, from performing it. Consider the problem of balancing an inverted pendulum: we only get a negative reward (punishment) if the pendulum tips too far to the left or right, or if the base reaches the limit of its movement left or right. If the pendulum is tipping to the left, moving the base in the opposite direction will speed the descent and bring on the punishment. The action with more value is the one that keeps the pendulum balanced. It would be common for the angles and the positions of the base of the inverted pendulum to be partitioned into increments. Smaller partitions give finer control, but also mean that more time needs to be spent exploring and calculating the value of actions in each of those partitions. This is an inverted pendulum version of the “gridworld”.
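To make the partitioning concrete, here is a minimal Python sketch of a table of state-action values for the pendulum. Everything in it is an assumption for illustration: the limits, the number of partitions, the two actions, and the use of a one-step Q-learning update, which is one common way of learning such values and not something this series has committed to.

```python
import math
from collections import defaultdict

# Hypothetical limits and partition counts, chosen only for illustration.
ANGLE_LIMIT = math.radians(12)   # pendulum counts as fallen beyond this tilt
POSITION_LIMIT = 2.4             # base cannot travel further than this from centre
ANGLE_BINS = 6
POSITION_BINS = 4
ACTIONS = ("left", "right")


def to_bin(value, limit, bins):
    """Map a value in [-limit, limit] to a partition index in 0..bins-1."""
    clipped = max(-limit, min(limit, value))
    fraction = (clipped + limit) / (2 * limit)
    return min(bins - 1, int(fraction * bins))


def discretise(angle, position):
    """Turn continuous readings into one cell of the pendulum 'gridworld'."""
    return (to_bin(angle, ANGLE_LIMIT, ANGLE_BINS),
            to_bin(position, POSITION_LIMIT, POSITION_BINS))


def reward(angle, position):
    """Punish only when the pendulum falls or the base hits its limit."""
    failed = abs(angle) > ANGLE_LIMIT or abs(position) > POSITION_LIMIT
    return -1.0 if failed else 0.0


# State-action values, all starting at zero.
q_values = defaultdict(float)


def update(state, action, r, next_state, done, alpha=0.1, gamma=0.95):
    """One-step Q-learning update (an assumed choice of learning rule)."""
    best_next = 0.0 if done else max(q_values[(next_state, a)] for a in ACTIONS)
    target = r + gamma * best_next
    q_values[(state, action)] += alpha * (target - q_values[(state, action)])


def best_action(state):
    """Pick the action with the highest learned value in this state."""
    return max(ACTIONS, key=lambda a: q_values[(state, a)])
```

With 6 angle partitions, 4 position partitions and 2 actions there are only 48 values to learn. Doubling the number of partitions in each dimension quadruples that, which is the trade-off between finer control and longer exploration.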
Now consider a robot that sees the scene in front of it. The robot has inbuilt perceptual algorithms and detects objects in the environment. Each of these detected objects affords a set of complex actions – moving towards, moving away, moving around, inspecting, etc. Again these actions might have values, but how does the robot determine the object-action value, and how does it decide what reward it should receive? Remember that, in an autonomous agent, rewards are internal responses to sensory feedback. Therefore the rewards the autonomous agent receives are what it gives itself, and must come from its own sensory feedback. The reward could come from visually detecting the object, or from any other sensor information detected from the object. An internal “reward function”, objective, or external instruction determines whether the sensory and perceptual feedback is rewarding.
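As a rough sketch of what this could look like, the Python below has the robot compute its own reward from a percept and keep a table of object-action values. The `Percept` fields, the goal label, the reward sizes and the simple averaging update are all hypothetical choices made for the example, not a description of any particular robot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percept:
    """What perception reports about one detected object (hypothetical fields)."""
    label: str
    distance: float
    inspected: bool = False


def internal_reward(percept, goal_label="rock"):
    """Reward the robot gives itself, computed only from its own sensory feedback."""
    if percept.label != goal_label:
        return 0.0
    # Assumed sizes: a small reward for finding the object, a larger one for inspecting it.
    return 1.0 if percept.inspected else 0.1


class ObjectActionValues:
    """Table of values indexed by (object label, action) instead of (state, action)."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.values = {}

    def update(self, label, action, reward):
        """Move the stored value part of the way towards the reward just received."""
        key = (label, action)
        old = self.values.get(key, 0.0)
        self.values[key] = old + self.learning_rate * (reward - old)

    def best_action(self, label, actions):
        """Choose the afforded action with the highest learned value."""
        return max(actions, key=lambda a: self.values.get((label, a), 0.0))


values = ObjectActionValues()
values.update("rock", "inspect", internal_reward(Percept("rock", 1.0, inspected=True)))
values.update("rock", "move_away", internal_reward(Percept("rock", 5.0)))
choice = values.best_action("rock", ["move_towards", "move_away", "inspect"])
```

The point of the sketch is that nothing outside the robot hands it a number: `internal_reward` looks only at what the robot itself perceived.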
Now the reward is no longer directly telling the robot what path it should be taking or how it should be controlling its motors. Instead, the reward is telling it what it should be seeking or doing at a higher level of abstraction. The act of finding an object may be rewarding; alternatively, the reward may come from interacting with the object and detecting its properties. Imagine a robot on Mars, inspecting rocks, but only knowing whether it found the chemicals it was looking for in the rock after probing it with tools. Visually detecting a rock and then detecting the chemicals the robot is looking for associates the reward and value with the visual qualities of the rock, so the robot seeks more rocks with that appearance. If the objective of the robot is to complete a movement, such as grasping a cup, the inclusion of self-perception allows the robot to monitor and detect its success. With detailed sensory feedback, telling the robot where it is in relation to its target, feedback controllers can be used – typically a much more efficient and generalised process for controlling movements than the random search for a trajectory using gridworld-style reinforcement learning.
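The contrast with random search can be sketched with the simplest kind of feedback controller. The Python below assumes a one-dimensional reach, a proportional controller with a made-up gain, and a self-administered reward of 1.0 once the robot perceives that it is within tolerance of the target; all of these are illustrative assumptions.

```python
def proportional_step(position, target, gain=0.5):
    """Move a fraction of the sensed error towards the target."""
    error = target - position
    return position + gain * error


def reach(position, target, tolerance=0.01, max_steps=100):
    """Run the feedback loop until success is perceived.

    Returns (final position, number of control steps taken, self-given reward).
    """
    for step in range(max_steps):
        # Self-perception: the robot checks its own distance to the target.
        if abs(target - position) <= tolerance:
            return position, step, 1.0
        position = proportional_step(position, target)
    return position, max_steps, 0.0


final_position, steps, reward = reach(0.0, 1.0)
```

Because each step uses the sensed error, the same controller works for any start and target without learning a separate value for every cell of a grid. Reinforcement learning is then left with the higher-level job of judging whether the movement succeeded.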
So if this is such a great way of applying reinforcement learning in robots, why aren’t we doing it? The simple answer is that our current algorithms and processes for perception and perceptual learning just aren’t good enough to recognise objects robustly. So what good is this whole idea if we can’t visually recognise objects? Planning out how to create a robot with robust autonomy, and that is capable of learning, can point us in a direction to focus our efforts in research and development. Perception is still a major stumbling block in the application of autonomous robotics. Biological clues and recent successes suggest that deep convolutional neural networks might be the way to go, but new, faster ways of creating and training them are probably necessary. Multiple sensor modalities and active interaction and learning will likely also be important. Once we have more powerful perceptual abilities in robots, they can monitor their own motion more effectively and adapt their feedback controllers to produce successful movements. With success being determined by the application of reinforcement learning, the learning cycle can be self-adapting and autonomous.
More can still be said of the development of the reward function that decides what sensory states and percepts are rewarding (pleasurable) and what are punishing (painful). The next post will speculate on the biological evolution of rewards and values – which comes first? – and how this might have relevance to a robot deciding what sensory states it should find rewarding.