Learning the art of data science
Mark Lee and Matthew Evans provide an introduction to data science for actuaries
Face recognition, fraud detection and spam filters are just a few examples of the applications of data science, a catch-all term encompassing big data, machine learning, data mining and predictive analytics.
An ever-increasing supply of data, and powerful modern computers that are able to exploit and analyse it, have led to the growth of the data science field. At its core is the concept of gaining insight from data, be it big or small. Data science techniques are employed in a wide variety of industries, from fashion retail to hedge funds.
We use the term ‘big data’ to refer to large collections of data, potentially from diverse sources, that are often unstructured, relying on text, pictures or geographical positions rather than the fixed fields found in more traditional data sets. But data science is not just about big data. Having big data may well require the use of machine learning technology to extract useful information; however, a lack of big data does not preclude the use of machine learning algorithms.
‘Machine learning’ is the process by which a computer learns by being exposed to data, generally by using an algorithm that optimises some mathematical function of that data. Once the domain of computer scientists in large research organisations, machine learning is now available to everyone through free, open-source toolboxes provided for programming languages such as R and Python. These languages have a comparatively gentle learning curve and come with many functions that are built in or available to download, enabling the user to perform sophisticated tasks with ease. This functionality is invaluable to actuaries, as it means the exercise is more one of data manipulation and analysis of the output than of computer programming. With courses available that provide the fundamentals needed to explore the field, data science has never been so accessible to actuaries.
There are two fundamental categories of machine learning. ‘Supervised’ learning algorithms are in the business of prediction, while ‘unsupervised’ learning focuses on understanding the structure behind a data set.
Imagine trying to categorise pictures of cats and dogs. Starting from a database of such photos, each labelled either ‘cat’ or ‘dog’, supervised learning involves the creation of a predictive model that exploits the information contained in the labels.
The model will make predictions by taking an unlabelled, previously unseen pet photo and deciding whether it is a picture of a cat or a dog.
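The idea can be sketched in a few lines of Python. The example below is a deliberately simple supervised learner, a one-nearest-neighbour classifier, and the numeric ‘features’ standing in for photo content (weight and ear length) are hypothetical; in practice, an image would first be converted into a feature vector.

```python
# A minimal sketch of supervised learning: a 1-nearest-neighbour classifier.
# The features (weight in kg, ear length in cm) are made up for illustration.

def predict(labelled_data, new_point):
    """Return the label of the closest labelled example."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    closest = min(labelled_data, key=lambda item: distance(item[0], new_point))
    return closest[1]

# Each training example pairs (weight_kg, ear_length_cm) with a label.
training = [
    ((4.0, 6.0), "cat"),
    ((3.5, 5.5), "cat"),
    ((25.0, 10.0), "dog"),
    ((30.0, 12.0), "dog"),
]

print(predict(training, (4.2, 5.8)))  # nearest labelled examples are cats
```

The labels do the work here: the algorithm never needs to know what a ‘cat’ is, only which labelled examples a new point most resembles.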
Unsupervised learning takes a different approach. Running an unsupervised algorithm on a set of unlabelled photos groups together the photos that are most similar. That grouping might be a separation into pictures of cats and dogs, but equally could be a separation by pet size or colour. The exact results will be determined by the parameters governing the algorithm. Although the unsupervised learning algorithm may not be able to predict pet species, this is not a failing of the algorithm, since it was not supplied with the information contained in the labels. A successful unsupervised algorithm will provide information about the relationships between the pictures. It is then up to the user to interpret the information appropriately – after all, pet species is not the only information in the pictures, and for some uses, maybe pet size is more important.
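The contrast with the supervised case can be illustrated with k-means clustering, a standard unsupervised algorithm. The sketch below groups pets by a single hypothetical feature (weight) with no labels supplied: the algorithm finds two groups, but it is the user who must decide what those groups mean.

```python
# A minimal sketch of unsupervised learning: k-means clustering on one
# hypothetical feature (pet weight in kg), with no labels supplied.

def kmeans(points, k=2, iterations=20):
    """Group points around k centres; return the cluster index of each point."""
    centres = points[:k]  # naive initialisation: first k points
    for _ in range(iterations):
        # Assign each point to its nearest centre.
        labels = [min(range(k), key=lambda c: abs(p - centres[c]))
                  for p in points]
        # Move each centre to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels

weights = [3.5, 4.0, 4.2, 24.0, 27.5, 31.0]
print(kmeans(weights))  # the light and heavy pets fall into separate clusters
```

The output separates light pets from heavy ones; whether that split corresponds to species, breed or something else entirely is a question the data alone cannot answer.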
Figure 1: Some common machine-learning algorithms
Figure 1 shows some machine-learning algorithms. One such algorithm, the generalised linear model (GLM), has been used by actuaries in personal lines pricing for years. GLMs can be thought of as prototypical supervised learning algorithms. Given a set of prior claim frequencies and severities, a GLM algorithm creates a model that predicts, for a new policy, how likely it is that a claim will occur and how much it will cost. With modern computing power, these methods can be taken further with the use of algorithms such as decision forests or neural networks. The flexibility of these algorithms allows the fitting of non-linear trends without having to make manual assumptions. Such techniques also have the ability to identify interactions between data items that are not seen by the human eye or through the use of linear models. These ‘hidden’ interactions can then potentially be used to predict claims more effectively, leading to more competitive pricing.
While GLMs are often used to price personal lines, specialty lines in the London Market rely on the expertise of underwriters. Marine pricing is one such example where there is a wealth of data, in this case ship positions and weather records. This is big data, which lacks clear structure and so can be difficult to analyse. However, supervised learning algorithms such as neural networks could extract features predictive of claim patterns. The information could add an extra dimension for underwriters and may offer a competitive advantage.
There are also many potential applications for unsupervised learning techniques. Unsupervised algorithms can augment and replace labour-intensive data sorting and visualisation, particularly when the number of data fields is large. For example, grouping accounts by prior loss ratio performance, enabling quick identification of common trends or dependencies, may offer management teams valuable insights into the company.
Figure 2: Trade-off between flexibility of algorithm and transparency of resulting model
Beyond the black box
A common criticism levelled at machine learning is that the resulting models are too much like a ‘black box’. Although a simple linear model is straightforward to understand and communicate, it is not very flexible when dealing with general data that may involve non-linear relationships. In contrast, very flexible supervised learning algorithms, such as decision forests or neural networks, can fit quite general data patterns, but at the cost of a less transparent model (see figure 2). While it might be true that these models can be complicated, the resulting model can usually be communicated sufficiently clearly by plotting the model predictions against the various predictive features. Furthermore, a variety of statistical techniques exist that can provide the user with the comfort that the model is robust and appropriate.
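One way of plotting model predictions against a predictive feature can be sketched as follows: hold every feature at a baseline value and sweep one feature through a range, recording the model's output at each point. The `model` function here is an invented stand-in; in practice it would be the fitted black-box model itself.

```python
# A sketch of opening up a 'black box': vary one feature at a time while
# holding the others at baseline values, and record the model's predictions.

def model(features):
    """Stand-in for a fitted black-box model (e.g. a neural network)."""
    age, vehicle_value = features
    return 0.05 + 0.002 * max(0, 60 - age) + 0.00001 * vehicle_value

def prediction_curve(model, baseline, feature_index, values):
    """Model output as one feature sweeps through `values`, others fixed."""
    curve = []
    for v in values:
        features = list(baseline)
        features[feature_index] = v
        curve.append((v, model(features)))
    return curve

baseline = [40, 10_000]  # age 40, vehicle value 10,000
for age, pred in prediction_curve(model, baseline, 0, [20, 30, 40, 50, 60]):
    print(age, round(pred, 3))
```

The resulting curve shows, for example, predicted claim frequency falling with driver age, which is exactly the kind of one-dimensional view that makes a complex model communicable to a non-technical audience.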
Data science is not a new field. It contains a multitude of tried and tested algorithms that have already been proven to be beneficial in other industries. With the development of technology giving the everyday user the computing power to use these processes, and with the tools to use these methods being easily accessible, actuaries can now apply techniques that were once only available to data specialists. In today’s competitive environment, data science could be used to supplement the tools that actuaries already have at their disposal and provide companies with that all-important edge.
Mark Lee is a consultant at Insight Risk Consulting
Matthew Evans is an actuarial director at Insight Risk Consulting