I spy with my computer vision eye… Wally? Waldo?

Lately I’ve been devoting a bit of my attention to image processing and computer vision. It’s interesting to see so many varied approaches applied to the problem over the last 50 or so years, especially when computer vision was once thought to be solvable in a single summer’s work. We humans perceive things with such apparent ease that it was probably expected to be a much simpler problem than playing chess. Now, after decades of focused attention, the attempts that appear most successful at recognising handwritten digits, street signs, toys, or even thousands of real-world images are those that, in some way, model the networks of connections and processes of the brain.

You may have heard about the Google learning system that learned to recognise the faces of cats and people from YouTube videos. This is part of a revolution in artificial neural networks known as deep learning. Among deep learning architectures are ones that use many stochastically activating units and clever learning rules (e.g., stochastic gradient descent and contrastive divergence). These networks can be trained to perform image classification to state-of-the-art levels of accuracy. Perhaps another interesting thing about these developments, a number of which have come from Geoffrey Hinton and his associates, is that some of them are “generative”. That is, while learning to classify images, these networks can be “turned around” or “unfolded” to create images, compress and cluster images, or perform image completion. This has obvious parallels to the human ability to imagine scenes, and to the current understanding of the mammalian primary visual cortex, which appears to essentially recreate images received at the retina.
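
To make the contrastive divergence idea a little more concrete, here is a minimal sketch of a restricted Boltzmann machine trained with one-step contrastive divergence (CD-1), written in plain NumPy. The layer sizes, learning rate and toy data are assumptions made purely for illustration, not taken from any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 64, 32            # e.g. 8x8 binary image patches (an assumption)
W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
b_v = np.zeros(n_visible)               # visible-unit biases
b_h = np.zeros(n_hidden)                # hidden-unit biases
lr = 0.1

# Toy data: random binary "images" standing in for real training patches.
data = (rng.random((100, n_visible)) < 0.2).astype(float)

for epoch in range(10):
    v0 = data
    # Up pass: hidden probabilities and a stochastic sample, given the data.
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down pass: reconstruct the visible layer ("turning the network around").
    p_v1 = sigmoid(h0 @ W.T + b_v)
    # Up again, from the reconstruction.
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # CD-1 update: correlations under the data minus correlations under the reconstruction.
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / n
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
```

The same down pass that produces `p_v1` during training is what lets a trained network be used generatively, for example to fill in missing or noisy parts of an input image.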

A related type of artificial neural network that has had considerable success is the convolutional neural network. Convolution here is just a fancy term for sliding a small patch of network connections across the entire image to compute the result at every location. These networks also typically use many layers of neurons, and have achieved similar success in image recognition. Convolutional networks may model known processes in the visual cortices, such as simple cells that detect edges of particular orientations. Outlines in images are combined into complex sets of features and classified. An earlier learning system, known as the neocognitron, used layers of simple-cell-like filters without the convolution.
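
As a rough illustration of that “sliding patch” idea, here is a small NumPy sketch that applies a 3x3 vertical-edge kernel (a Sobel-style filter, chosen only as an example) at every position of a random test image. Strictly speaking this computes a cross-correlation, which is what most convolutional networks implement anyway.

```python
import numpy as np

def slide_filter(image, kernel):
    """Apply the same small kernel at every valid position of the image (no padding)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * kernel)   # identical weights at every location
    return out

# A vertical-edge detector, loosely analogous to an orientation-selective simple cell.
sobel_x = np.array([[-1.0, 0.0, 1.0],
                    [-2.0, 0.0, 2.0],
                    [-1.0, 0.0, 1.0]])

image = np.random.rand(32, 32)             # stand-in for a small grey-scale image
edge_map = slide_filter(image, sobel_x)    # strong responses where intensity changes left to right
```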

The process of applying the same edge-detection filter over the whole image is similar to the parallel processing that occurs in the brain. Admittedly, thousands of neurons functioning simultaneously is practically very different from the sequential computation performed in the hardware of a computer, though GPUs with many processor cores now allow parallel processing in machines too. If, rather than using direction-selective simple cells to detect edges, we use image features (such as a loop in a handwritten digit, or the dark circle representing the wheel of a vehicle), we might say the convolution process is similar to scanning an image with our eyes.

Even when we humans are searching for something hidden in a scene, such as our friend Wally (or Waldo), our attention typically centres on one thing at a time. Scanning large, detailed images for Wally often takes us a long time. A computer trained to find Wally in an image using a convolutional network could methodically scan the image much faster than us on current hardware. With biologically inspired image recognition systems (rather than the more common, but brittle, image processing techniques), it mightn’t be hard to get a computer to beat us at this challenge for many Where’s Wally images.
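
As a toy sketch of that methodical scan, the snippet below slides a fixed-size window over a large image and keeps the location with the highest score. The `wally_score` function is a hypothetical stand-in for a trained convolutional network’s output, and the window size and stride are arbitrary assumptions.

```python
import numpy as np

def wally_score(window):
    # Placeholder: a real system would run a trained convolutional network here.
    return float(window.mean())

def scan_for_wally(image, win=64, stride=16):
    """Return the top-left corner of the window with the highest score."""
    best_score, best_pos = -np.inf, (0, 0)
    for y in range(0, image.shape[0] - win + 1, stride):
        for x in range(0, image.shape[1] - win + 1, stride):
            score = wally_score(image[y:y + win, x:x + win])
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos

scene = np.random.rand(512, 768)   # stand-in for a large Where's Wally scene
print(scan_for_wally(scene))
```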

Even though I think these advances are great, it seems there are things missing from what we are trying to do with these computer vision systems and how we’re trying to train them. We are still throwing information at these learning systems as the disembodied number-crunching machines they are. Consider how our visual perception abilities allow us to recognise objects in images with little regard for scale, translation, shear, rotation or even colour and illumination; these things are major hurdles for computer vision systems, but for us they just provide more information about the scene. These are things we learn to do. Most of the focus of computer vision seems to be related to the concept of the “what pathway”, rather than the “how pathway”, of the two-streams hypothesis of vision processing in the brain. Maybe researchers could start looking at ways of making these deep networks take that next step. Though extracting information from a scene, such as locating sources of illumination or the motion of objects relative to the camera, might be hard to fit into the current trend of performing unsupervised learning on enormous amounts of unlabelled data.

I think there may be significant advantages to treating the learning system as embodied, and to making the real-world property of object permanence something the learning system can latch onto. It’s certainly something that provides a great deal of leverage in our own learning about objects and how our interactions influence them. It is worth mentioning that machine learning practitioners already commonly create numerous modified copies of their training images and see measurable improvements. This is similar to what happens when a person or animal is exposed to an object and given the chance to view it from multiple angles and under different lighting conditions. Having a series of contiguous viewpoints is likely to make it easier for parts of our brain to learn to compensate for the different perspectives that scale, shear, rotate and translate the view of objects. It may even be important for learning to predict and recreate different perspectives in our imagination.
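
For a sense of what that common practice looks like, here is a small sketch using NumPy and SciPy’s ndimage module to generate randomly rotated, shifted and re-lit copies of a training image. The particular transforms, their ranges and the toy image are illustrative assumptions rather than a prescribed recipe.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def augment(image):
    """Return a randomly rotated, shifted and re-lit copy of a grey-scale image."""
    angle = rng.uniform(-15, 15)                               # small rotation, in degrees
    out = ndimage.rotate(image, angle, reshape=False, mode="nearest")
    shift = rng.uniform(-3, 3, size=2)                         # small translation, in pixels
    out = ndimage.shift(out, shift, mode="nearest")
    out = out * rng.uniform(0.8, 1.2)                          # crude illumination change
    return np.clip(out, 0.0, 1.0)

original = np.random.rand(28, 28)                              # stand-in for one training image
extra_examples = [augment(original) for _ in range(10)]        # several modified copies
```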

4 responses to “I spy with my computer vision eye… Wally? Waldo?”

  1. Great post, Toby! I’ve been eager to learn more about deep learning ever since I read a recent article in the NY Times that cited it as the source of recent and surprising advances in AI.

    • Thanks! Deep learning is certainly turning some heads. Artificial neural networks are now being shown to be applicable to challenging perceptual problems and this has lots of interesting implications. If you’d like to read more, browse through some of the links I’ve put in this post. Or if you’re after anything in particular let me know, I can try to point you in the right direction.

      • Those articles do a good job of summarising the positive side of what has been achieved (improved classification and prediction) and presenting the point that there is a lot more to intelligence than has been demonstrated so far.

        Though, as I mention above, a great thing about Hinton’s deep learning systems is that they can generate images (and probably sounds) from incomplete or noisy input images, or by exciting high-level neurons that represent a particular class. That said, being able to recognise images or sounds alone makes for pretty shallow “understanding”.

        I think further advances might start to occur when researchers start trying to mix multiple sensory streams: particularly vision and sound, but eventually tactile senses as well. This could be achieved with minimal changes to the current designs of deep learning architectures.

        If we could show the machine an image and have it spontaneously generate speech saying what object it recognises, some people would find that a bit eerie. This would approximate playing the naming game with a child. If we could tell the robot a story, and in its “mind” it spontaneously starts generating images that represent the story, we could say that the machine is actually on the way to having a genuine “understanding” of the story.

        As I describe in this post, advances might also be achieved by expanding the perceptual applications to include things like motion perception: often visible as objects changing position, scale or rotation. One thing that is less obvious is how to apply deep learning to motion control of robots. That’s something I’m still thinking about.

        Even though I can see great potential in deep learning, I can still see some significant obstacles. Deep learning still uses iterative training techniques, which, despite being quicker than they were, are still a far cry from being able to briefly see a person or object once and then recognise that person or object from obscure angles and distances moments later. That is something people can (sometimes) do very well. Though this might be easily taken care of with some complementary memory processes (much as it seems to be in the brain).

        And even if current systems for deep learning turn out to have some fatal flaw that stops them being applicable to simulating intelligence, I think the cat is very nearly out of the bag. The neocortex has largely the same structure all over the brain, and we have reasonably good approximations for many of the less regular areas of the brain. So, in my opinion, it is only a matter of time before we have intelligent machines. Or machines “intelligent” enough to do most of what we’ve ever wanted them to, certain physical constraints notwithstanding.
