Trends in Machine Intelligence: Feature Learning

Feature learning will allow machine learning to revolutionise many industries, without needing years of domain expertise.
“What is a good feature?” This is a three-decade-old question in machine learning. But before we can understand the question, we must first understand what a feature is.
In machine learning, features are any observations of a sample that are used to make a prediction or classification. For example, consider the problem of estimating the age of a child (our sample) using a set of measurements that consist of their height, weight and a photo. These are all features. The height measurement is one feature, as is the weight measurement. However, the photo contains millions of individual pixels, each of which is a feature.
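To make this concrete, here is a rough sketch (hypothetical numbers, using NumPy) of how such a sample’s measurements could be assembled into a single feature vector; the photo alone contributes millions of entries.
```python
import numpy as np

height_cm, weight_kg = 120.0, 23.5   # two scalar measurements, each a feature
photo = np.zeros((1000, 1000, 3))    # placeholder for a 1000x1000 RGB photo

# Every pixel channel becomes its own feature, so the photo alone
# contributes millions of entries to the sample's feature vector.
features = np.concatenate(([height_cm, weight_kg], photo.ravel()))
print(features.shape)  # (3000002,)
```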
While the height and weight may correlate directly with the age of a child, an individual pixel in the photo is unlikely to correlate with age. However, there is a wealth of data in raw measurements such as photos, and to make use of it the raw data must be transformed into a ‘good feature’: a measurement that tells us something about the sample.
The classical approach is to handcraft features that experts believe are discriminative. This approach is known as feature engineering. For example, consider the problem of identifying a ripe apple from a photo. Here an expert-created feature might be to segment the foreground pixels, assuming they contain the apple, then transform these pixel values into an illumination-invariant colour space, and finally measure the average hue of all these pixels.
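As a sketch of what such a handcrafted feature might look like in code (assuming OpenCV and an 8-bit BGR image; the brightness-threshold segmentation is a deliberately crude stand-in for a real foreground segmentation):
```python
import cv2
import numpy as np

def average_foreground_hue(image_bgr):
    """Handcrafted 'ripeness' feature: mean hue of the assumed foreground pixels."""
    # Naive foreground segmentation: Otsu threshold on brightness.
    # This assumes the apple is the bright foreground object, which may not hold.
    grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(grey, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Transform to a colour space whose hue channel is (approximately)
    # invariant to illumination changes.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0]

    # Average the hue over the assumed apple pixels only.
    return float(np.mean(hue[mask > 0]))
```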
This was the area I worked in for well over a decade: finding good features that transformed raw data into discriminative observations. I looked for physical characteristics of the objects of interest that would help with a prediction or classification problem. To do this I became a domain expert in both the objects of interest and the sensors that made the observations. What I found was that expert features often failed. For example, the assumption that the apple is in the foreground of the photo may not hold true. Another problem is that there is a myriad of possible handcrafted features: for example, would the HSB or LAB colour space correlate better with the ripeness of an apple?
One approach that was popularised in the mid-2000s was feature selection. Here, dozens, hundreds or even thousands of features are created, and statistical methods are then used to select the best of them. In classification problems, a good feature is one that discriminates the object of interest from all other objects. Feature selection allows the data scientist to determine whether the HSB or LAB colour space is better for discerning ripe apples. The primary driver for feature selection is reducing the number of feature dimensions, to mitigate the curse of dimensionality.
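A minimal sketch of this kind of filter-style selection, using scikit-learn’s SelectKBest on synthetic data that stands in for a large bank of candidate handcrafted features, might look like this:
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for hundreds of candidate handcrafted features.
X, y = make_classification(n_samples=500, n_features=200,
                           n_informative=10, random_state=0)

# Score each feature individually (ANOVA F-score) and keep the top 10.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(selector.get_support(indices=True))  # indices of the retained features
```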
An interesting sub-problem in feature selection is not just deciding between two features, but selecting the best subset of K features. The K individually best features may not form the most discriminative set of K features. For example, a set containing two colour-space features may not be as useful as a set containing one colour-space feature and one size feature when determining the ripeness of an apple.
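The sketch below illustrates this on synthetic data: every pair of features is scored with cross-validation, and the winning pair need not be the two features with the best individual scores, because complementary features beat redundant ones. (Exhaustive search is only feasible for small K and small feature banks.)
```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=4, n_redundant=2, random_state=0)

# Exhaustively evaluate every pair of features with a simple classifier.
clf = LogisticRegression(max_iter=1000)
scores = {
    pair: cross_val_score(clf, X[:, list(pair)], y, cv=5).mean()
    for pair in combinations(range(X.shape[1]), 2)
}

best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
```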
A more recent technique is feature learning. The rise of deep learning has given birth to powerful generative (unsupervised) and discriminative (supervised) feature learning methods, such as the restricted Boltzmann machine and the convolutional neural network.
For example, to build our ripe-apple detector, we would use thousands of photos of ripe apples and unripe apples, as well as photos of other objects that aren’t ripe apples but might appear in our scene. Using a discriminative learning approach, such as a convolutional neural network, we also need a label for the content of each photo: ripe apple or not-ripe-apple. Starting with randomly initialised weights, the network is trained to reduce the difference between its predicted output and the correct label for a given input image. The learnt features are then contained within the layers of the neural network, with the layers closest to the input containing the simplest features, such as edge detectors.
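A deliberately tiny PyTorch-style sketch of this discriminative setup is shown below; the random images and labels are placeholders for a real set of labelled apple photos, and the architecture is illustrative rather than prescriptive.
```python
import torch
import torch.nn as nn

# A tiny CNN: the early convolutional layers learn simple features
# (edge- and blob-like filters); later layers combine them.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),  # two classes: ripe apple / not-ripe-apple
)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A dummy batch stands in for a real loader of labelled apple photos.
images = torch.randn(8, 3, 64, 64)    # 8 RGB photos, 64x64 pixels
labels = torch.randint(0, 2, (8,))    # 0 = not-ripe-apple, 1 = ripe apple

optimiser.zero_grad()
loss = loss_fn(model(images), labels)  # difference between prediction and label
loss.backward()
optimiser.step()                       # adjust the randomly initialised weights
```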
However, many data sets don’t have all the labels that we need. This is where generative learning methods can help. For example, using a convolutional restricted Boltzmann machine we can learn a set of features (convolutional filters) that attempt to describe an image in a compact form. These features are a great starting point for initialising a deep convolutional neural network, which can now be fine-tuned to perform a discriminative task, such as deciding if an apple is ripe.
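The following is a minimal sketch of the generative idea: a small non-convolutional binary restricted Boltzmann machine trained with one step of contrastive divergence (CD-1) on synthetic unlabelled data. A convolutional RBM additionally ties the weights into filters shared across image locations, but the update rule is the same in spirit.
```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny binary RBM: 64 visible units (e.g. an 8x8 binary patch), 16 hidden units.
n_visible, n_hidden, lr = 64, 16, 0.05
W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)
b_h = np.zeros(n_hidden)

def cd1_step(v0):
    """One CD-1 update on a batch of binary visible vectors v0."""
    global W, b_v, b_h
    # Up pass: hidden probabilities and samples given the data.
    h0_prob = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Down pass: reconstruct the visibles, then infer the hiddens again.
    v1_prob = sigmoid(h0 @ W.T + b_v)
    h1_prob = sigmoid(v1_prob @ W + b_h)
    # Approximate gradient: data statistics minus reconstruction statistics.
    W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / len(v0)
    b_v += lr * (v0 - v1_prob).mean(axis=0)
    b_h += lr * (h0_prob - h1_prob).mean(axis=0)

# Unlabelled binary 'image patches' stand in for real photos.
data = (rng.random((128, n_visible)) < 0.3).astype(float)
for _ in range(100):
    cd1_step(data)

# The columns of W are the learnt features; they could be used to
# initialise a network before supervised fine-tuning.
```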
This ability to learn good features directly from the data means that the slow process of ‘inventing’ new features using domain expertise is no longer required. Instead, the data itself is used to learn good features, which tend to be more robust and generalise better than handcrafted features.
Feature learning allows the data scientist to quickly gain insight into physical processes and develop automated solutions for any domain.