This article originally appeared on the Australian Centre for Robotic Vision website and was written by Tim Macuga.
Facebook automatically finds and tags friends in your photos. Google DeepMind’s AlphaGo computer program trounced champions at the ancient game of Go last year. Skype translates spoken conversations in real time — and pretty accurately too.
Behind all this is a type of artificial intelligence called deep learning. But what is deep learning and how does it work?
Deep learning is a subset of machine learning — a field that examines computer algorithms that learn and improve on their own.
Machine learning is by no means a recent phenomenon; it has its roots in the mid-20th century. In the 1950s, British mathematician Alan Turing proposed his artificially intelligent “learning machine”. And in the decades since, various machine learning techniques have come into and fallen out of favour.
One of these is neural networks — the algorithms that underpin deep learning and play a central part in image recognition and robotic vision.
Inspired by the nerve cells (neurons) that make up the human brain, neural networks are made up of layers of artificial “neurons”, with each layer connected to the layers adjacent to it. The more layers, the “deeper” the network.
A single neuron in the brain receives signals — as many as 100,000 — from other neurons. When those other neurons fire, they exert either an excitatory or inhibitory effect on the neurons they connect to. And if our first neuron’s inputs add up to a certain threshold voltage, it will fire too.
In an artificial neural network, signals also travel between ‘neurons’. But instead of firing an electrical signal, a neural network assigns weights to various neurons. A neuron weighted more heavily than another will exert more of an effect on the next layer of neurons. The final layer puts together these weighted inputs to come up with an answer.
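To make that concrete, here is a minimal sketch of a single artificial neuron (not taken from any real system): each input is multiplied by its weight, the results are summed, and the neuron “fires” if the total clears a threshold.

```python
# A minimal, illustrative artificial neuron: weighted inputs are summed and
# passed through a simple step activation. All numbers here are made up.

def neuron(inputs, weights, bias):
    """Combine weighted inputs and 'fire' if the total crosses zero."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 if total > 0 else 0.0   # step activation: fire or don't

# Three input signals; the second carries the largest weight, so it has the
# biggest say in whether this neuron fires.
print(neuron(inputs=[0.2, 0.9, 0.1], weights=[0.5, 1.5, -0.3], bias=-0.6))
```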
Let’s say we want a neural network to recognise photos that contain at least one cat. But cats don’t all look exactly alike — consider a shaggy old Maine Coon and a white Siamese kitten. Nor do photos necessarily show them in the same light, at the same angle and at the same size.
So we need to compile a training set of images — thousands of examples of cat faces, which we (humans) label “cat”, and pictures of objects that aren’t cats, labelled (you guessed it) “not cat”.
These images are fed into the neural network. And if this were a sports drama film, the training montage would look something like this: an image is converted into data which moves through the network and various neurons assign weights to different elements. A slightly curved diagonal line could be more heavily weighted than a perfect 90-degree angle, for instance.
At the end, the final output layer puts together all the pieces of information — pointed ears, whiskers, black nose — and spits out an answer: cat.
The neural network compares this answer to the real, human-generated label. If it matches, great! If not — if the image was of a corgi, for instance — the neural network makes note of the error and goes back and adjusts its neurons’ weightings. The neural network then takes another image and repeats the process, thousands of times, adjusting its weightings and improving its cat-recognition skills — all this despite never being explicitly told what “makes” a cat.
This training technique is called supervised learning.
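As a rough illustration of that supervised loop (guess, compare with the human label, nudge the weights), here is a toy sketch. The “images” are invented feature vectors and the update rule is a simple perceptron-style adjustment, far cruder than what real deep networks use, but the guess-check-adjust rhythm is the same.

```python
# A toy supervised training loop: predict, compare with the label, adjust.
# The feature vectors and labels below are invented for illustration.

import random

def predict(features, weights):
    score = sum(f * w for f, w in zip(features, weights))
    return 1 if score > 0 else 0                      # 1 = "cat", 0 = "not cat"

# Hypothetical training set: (features, label) pairs.
training_set = [
    ([0.9, 0.8, 0.1], 1),   # pointy ears, whiskers -> cat
    ([0.2, 0.1, 0.9], 0),   # long snout            -> not cat (a corgi)
    ([0.8, 0.9, 0.2], 1),
    ([0.1, 0.3, 0.8], 0),
]

weights = [random.uniform(-0.5, 0.5) for _ in range(3)]
learning_rate = 0.1

for _ in range(1000):                                 # repeat thousands of times
    for features, label in training_set:
        error = label - predict(features, weights)    # 0 when the guess matched
        # Nudge each weight in the direction that would have reduced the error.
        weights = [w + learning_rate * error * f
                   for w, f in zip(weights, features)]

print(weights)   # weights that now separate "cat" from "not cat" features
```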
Unsupervised learning, on the other hand, uses unlabelled data. Neural networks must recognise patterns in the data to teach themselves what parts of any photo might be relevant.
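A tiny sketch of that idea, using made-up one-dimensional data rather than photos: a simple clustering pass (here, two-centre k-means) groups similar measurements together without ever being told what the groups mean.

```python
# Unsupervised grouping of unlabelled data: a minimal two-cluster k-means pass.
# The data points are invented; real networks learn far richer structure.

import random

data = [0.1, 0.2, 0.15, 0.9, 0.85, 0.95]   # unlabelled measurements
centres = random.sample(data, 2)           # two guessed cluster centres

for _ in range(10):
    clusters = {0: [], 1: []}
    for x in data:
        nearest = min((0, 1), key=lambda c: abs(x - centres[c]))
        clusters[nearest].append(x)
    # Move each centre to the average of the points assigned to it.
    centres = [sum(points) / len(points) if points else centres[c]
               for c, points in clusters.items()]

print(centres)   # the two groups the data naturally falls into
```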
A self-learning machine sounds terrific. But until recently, neural networks were, by and large, ignored by machine learning researchers, because they were plagued by a number of seemingly insurmountable problems. One was that they were prone to ‘local minima’: they would settle on weightings that appeared to give the fewest errors, when better weightings in fact existed elsewhere.
Other machine learning techniques took off, particularly in the realm of computer vision and facial recognition. In 2001, Paul Viola and Michael Jones from Mitsubishi Electric Research Laboratories, in the US, used a machine learning algorithm called adaptive boosting, or AdaBoost, to detect faces in an image in real time.
Rather than weighted interconnected neurons, AdaBoost filtered an image through a set of simple decisions. Does the image have a bright spot between dark patches, which might signify the bridge of a nose? Are there two dark areas above broad paler smears, as eyes and cheeks appear in black and white photos?
As the data cascaded down the decision tree, the likelihood of correctly picking out a face from an image grew. “It’s a very simple idea, but it’s very elegant — and very powerful,” says Prof Ian Reid, Deputy Director of the Australian Centre for Robotic Vision (ACRV) and a computer vision scientist. It seemed like the final nail in the coffin for neural networks. Most of the artificial intelligence community thought they had well and truly been left in the dust.
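To give a flavour of that cascade of simple decisions, here is a toy sketch. It is not the real Viola-Jones algorithm; the tests, names and thresholds are invented purely to show how a chain of cheap yes/no checks can throw out most non-faces almost immediately.

```python
# A toy cascade of cheap tests, in the spirit of (but much simpler than)
# Viola-Jones face detection. Region values and thresholds are invented.

def bright_bridge_between_dark_patches(region):
    # A bright strip between darker patches might be the bridge of a nose.
    return region["nose_bridge_brightness"] - region["eye_patch_brightness"] > 0.2

def dark_eyes_above_pale_cheeks(region):
    # Two dark areas above broader pale smears, like eyes above cheeks.
    return region["cheek_brightness"] - region["eye_patch_brightness"] > 0.3

CASCADE = [bright_bridge_between_dark_patches, dark_eyes_above_pale_cheeks]

def looks_like_a_face(region):
    # Every stage must pass; most non-face regions fail an early, cheap test,
    # which is what makes the cascade so fast.
    return all(test(region) for test in CASCADE)

print(looks_like_a_face({"nose_bridge_brightness": 0.9,
                         "eye_patch_brightness": 0.3,
                         "cheek_brightness": 0.7}))   # True: passes both tests
```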
“But not everyone gave up on neural networks,” Reid says. In the past decade or so, a confluence of events thrust neural networks to the forefront of the field.
A group at the University of Toronto in Canada, headed by 1980s neural network pioneer Geoff Hinton, came up with a way of training a neural network that meant it didn’t fall into the local minimum trap.
Powerful graphics processing units, or GPUs, burst onto the scene, meaning researchers could run, manipulate and process images on desktop computers rather than supercomputers.
But what gave neural networks the biggest leg-up, Reid says, was the advent of a mammoth amount of labelled data. In 2007, a pair of computer scientists — Fei-Fei Li at Stanford University and Kai Li at Princeton University — launched ImageNet, a database of millions of labelled images from the internet. The long and arduous labelling task was crowdsourced using services such as Amazon Mechanical Turk, which paid users a couple of cents for each image they labelled.
Now ImageNet provides neural networks with about 10 million images and 1000 different labels. “The jump from 10 years ago to now — it’s massive,” says Gustavo Carneiro, another computer scientist from the Centre. And in the past five years, neural networks have become a central tool of robot vision.
While modern neural networks contain many layers — Google Photos has around 30 layers — a big step has been the emergence of convolutional neural networks, Reid says.
As with traditional neural networks, their convolutional counterparts are made up of layers of weighted neurons. But they’re not just modelled on the workings of the brain; they, appropriately enough, take inspiration from the visual system itself.
Every layer within a convolutional neural network applies a filter across the image to pick up specific patterns or features. The first few layers detect simple features, such as edges and diagonal lines, while later layers combine these into ever more complex features, such as an ear, Reid says.
The final output layer, as in an ordinary neural network, is fully connected (that is, all neurons in that layer are connected to all neurons in the previous layer). It puts together highly specific features — which could include slit-shaped pupils, almond-shaped eyes and eye-to-nose distance — to produce an ultra-precise classification: cat.
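Here is a minimal sketch of the filtering step a single convolutional layer performs, using a made-up 5×5 image and a hand-built filter that responds to a diagonal line. Real networks learn their filters from data rather than having them written by hand, and stack many such layers.

```python
# Sliding a small filter across an image and recording how strongly each patch
# matches the pattern it encodes. The image and filter are hand-made examples.

import numpy as np

image = np.array([
    [1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
], dtype=float)

# A 3x3 filter that responds strongly to a top-left-to-bottom-right diagonal.
diagonal_filter = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
], dtype=float)

def convolve(img, kernel):
    kh, kw = kernel.shape
    out_h, out_w = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)  # patch-filter match score
    return feature_map

print(convolve(image, diagonal_filter))   # peaks along the diagonal line
```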
In 2012, Google trained a convolutional neural network with thousands of unlabelled thumbnails of YouTube clips to see what it would come up with. Unsurprisingly, it became adept at finding cat videos.
So what’s going on in a neural network’s hidden layers? This is largely a mystery, says Prof Peter Corke, director of the Centre. But as networks get deeper and researchers unwrap the secrets of the human brains on which they’re modelled, they’ll become ever-more nuanced and sophisticated.
“And as we learn more about the algorithms coded in the human brain and the tricks evolution has given us to help us understand images,” Corke says, “we’ll be reverse engineering the brain and stealing them.”