I spy with my computer vision eye… Wally? Waldo?

Lately I’ve been devoting a bit of my attention to image processing and computer vision. It’s interesting to see so many varied processes applied to the problem over the last 50 or so years, especially when computer vision was once thought to be solvable in a single summer’s work. We humans perceive things with such apparent ease, it was probably thought that it would be a much simpler problem than playing chess. Now, after decades of focused attention, the attempts that appear most successful at image recognition of handwritten digits, street signs, toys, or even thousands of real-world images, are those that, in some way, model the networks of connections and processes of the brain.

You may have heard about the Google learning system that learned to recognise the faces of cats and people from YouTube videos. This is part of a revolution in artificial neural networks known as deep learning. Among deep learning architectures are ones that use many units that activate stochastically and clever learning rules (e.g., stochastic gradient descent and contrastive divergence). The networks can be trained to perform image classification to state-of-the-art levels of accuracy. Perhaps another interesting thing about these developments, a number of which have come from Geoffrey Hinton and his associates, is that some of them are “generative”. That is, while learning to classify images, these networks can be “turned around” or “unfolded” to create images, compress and cluster images, or perform image completion. This has obvious parallels to the human ability to imagine scenes, and the current understanding of the mammalian primary visual cortex that appears to essentially recreate images received at the retina.

A related type of artificial neural network that has had considerable success is the convolutional neural network. Convolution here is just a fancy term for sliding a small patch of network connections across the entire image to find the result at all locations. These networks also typically uses many layers of neurons, and has achieved similar success in image recognition. These convolutional networks may model known processes in the visual cortices, such as simple cells that detect edges of certain orientations. Outlines in images are combined into complex sets of features and classified. An earlier learning system, known as the neocognitron, used layers of simple cell-like filters without the convolution.

The process of applying the same edge-detection filter over the whole image is similar to the parallel processing that occurs in the brain. Though the thousands of neurons functioning simultaneously has an obvious practical difference to the sequential computation performed in the hardware of a computer; however, GPUs with many processor cores now allow parallel processing in machines. If rather than using direction selective simple cells to detect edges we use image features (such as a loop in a handwritten digit, or the dark circle representing the wheel of a vehicle), we might say the convolution process is similar to scanning an image with our eyes.

Even when we humans are searching for something hidden in a scene, such as our friend Wally (or Waldo), our attention typically centres on one thing at a time. Scanning large, detailed images for Wally often takes us a long time. A computer trained to find Wally in an image using a convolutional network could methodically scan the image a lot faster than us with current hardware. It mightn’t be hard to get a computer to beat us in this challenge for many Where’s Wally images with biologically-inspired image recognition systems (rather than more common, but brittle, image processing techniques).

Even though I think these advances are great, it seems there are things missing from what we are trying to do with these computer vision systems and how we’re trying to train them. We are still throwing information at these learning systems as the disembodied number-crunching machines they are. Though consider how our visual perception abilities allow us to recognise objects in images with little regard for scale, translation, shear, rotation or even colour and illumination; these things are major hurdles for computer vision systems, but for us, they just provide us more information about the scene. These are things we learn to do. Most of the focus of computer vision seems to be related to concept of the “what pathway”, rather than the “how pathway”, of two-streams hypothesis of vision processing in the brain. Maybe researchers could start looking at ways of making these deep networks take that next step. Though extracting information from a scene, such as locating sources of illumination or the motion of objects relative to the camera, might be hard to fit into the current trends of trying to perform unsupervised learning from enormous amounts of unlabelled data.

I think there may be significant advantages to treating the learning system as embodied, and make the real-world property of object permanence something the learning system can latch onto. It’s certainly something that can provide a great deal of leverage in our own learning about objects and how our interactions influence them. It is worth mentioning that machine learning practitioners already commonly create new numerous modified training images from their given set and see measurable improvements. This is similar to what happens when a person or animal is exposed to an object and given the chance to view it from multiple angles and under different lighting conditions. Having a series of contiguous view-points is likely to more easily allow parts of our brain to learn to compensate for different perspectives that scale, shear, rotate and translate the view of objects. It may even be important to learning to predict and recreate different perspectives in our imagination.