Learning the art of data science

Mark Lee and Matthew Evans provide an introduction to data science for actuaries

Face recognition, fraud detection and spam filters are just a few examples of the applications of data science, a catch-all term encompassing big data, machine learning, data mining and predictive analytics.

An ever-increasing supply of data, and powerful modern computers that are able to exploit and analyse it, has led to the growth of the data science field. At its core is the concept of gaining insight from data, be it big or small. Data science techniques are employed in a wide variety of industries, from fashion retail to hedge funds.

We use the term ‘big data’ to refer to large collections of data, potentially from diverse sources, that are often unstructured, relying on text, pictures or geographical positions rather than the fixed fields found in more traditional data sets. But data science is not just about big data. Having big data may well require the use of machine learning technology to extract useful information; however, a lack of big data does not preclude the use of machine learning algorithms.

Machine learning
‘Machine learning’ is the process by which a computer learns by being exposed to data, generally by using an algorithm that optimises some mathematical function of that data. Once the domain of computer scientists in large research organisations, machine learning is now available to everyone through free, open-source toolboxes provided for programming languages such as R and Python. These languages have a comparatively gentle learning curve and come with many functions that are built in or available to download, enabling the user to perform sophisticated tasks with ease. This functionality is invaluable to actuaries, as it means the exercise is more one of data manipulation and analysis of the output than of computer programming. With courses available that provide the fundamentals needed to explore the field, data science has never been so accessible to actuaries.

There are two fundamental categories of machine learning. ‘Supervised’ learning algorithms are in the business of prediction, while ‘unsupervised’ learning focuses on understanding the structure behind a data set.

Imagine trying to categorise pictures of cats and dogs. Starting from a database of such photos, each labelled either ‘cat’ or ‘dog’, supervised learning involves the creation of a predictive model that exploits the information contained in the labels.

The model will make predictions by taking an unlabelled, previously unseen pet photo and deciding whether it is a picture of a cat or a dog.

Unsupervised learning takes a different approach. Running an unsupervised algorithm on a set of unlabelled photos returns a grouping of photos that are most similar. That grouping might be a separation into pictures of cats and dogs, but equally could be a separation by pet size or colour. The exact results will be determined by the parameters governing the algorithm. Although the unsupervised learning algorithm may not be able to predict pet species, this is not a failing of the algorithm, since it was not supplied with the information contained in the labels. A successful unsupervised algorithm will provide information about the relationships between the pictures. It is then up to the user to interpret the information appropriately – after all, pet species is not the only information in the pictures, and for some uses, maybe pet size is more important.
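The contrast between the two approaches can be sketched in a few lines of Python using scikit-learn. The data set here is entirely hypothetical – two made-up measurements (weight and ear length) standing in for the information in a pet photo:

```python
# A minimal sketch contrasting supervised and unsupervised learning.
# The features and values below are hypothetical, not from the article.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Hypothetical features: [weight in kg, ear length in cm]
X = np.array([[4.0, 6.5], [3.5, 7.0], [4.5, 6.0],        # cats
              [25.0, 12.0], [30.0, 11.0], [22.0, 13.0]])  # dogs
y = np.array(['cat', 'cat', 'cat', 'dog', 'dog', 'dog'])

# Supervised: the algorithm learns from the labels, then predicts
# the label of a previously unseen animal.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[28.0, 12.5]]))  # -> ['dog']

# Unsupervised: no labels supplied; the algorithm simply groups
# the most similar points. Interpreting the groups is up to the user.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

Note that the unsupervised output is just a pair of group numbers – whether those groups correspond to species, size or something else is a question for the analyst.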

Figure 1: Some common machine-learning algorithms

Actuarial applications
Figure 1 shows some machine-learning algorithms. One such algorithm, the generalised linear model (GLM), has been used by actuaries in personal lines pricing for years. GLMs can be thought of as prototypical supervised learning algorithms. Given a set of prior claim frequencies and severities, a GLM algorithm creates a model that predicts, for a new policy, how likely a claim is to occur and how much it will cost. With modern computing power, these methods can be taken further through algorithms such as decision forests or neural networks. The flexibility of these algorithms allows the fitting of non-linear trends without manual assumptions. Such techniques can also identify interactions between data items that are invisible to the human eye or to linear models. These ‘hidden’ interactions can then potentially be used to predict claims more effectively, leading to more competitive pricing.

While GLMs are often used to price personal lines, specialty lines in the London Market rely on the expertise of underwriters. Marine pricing is one such example where there is a wealth of data, in this case on ship position and weather records. This is big data, which lacks clear structure and so can be difficult to analyse. However, supervised learning algorithms such as neural networks could extract features predictive of claim patterns. The information could add an extra dimension for underwriters and may offer a competitive advantage.

There are also many potential applications for unsupervised learning techniques. Unsupervised algorithms can augment and replace human-labour intensive data sorting and visualisation, particularly when the number of data fields is large. For example, grouping accounts by prior loss ratio performance, enabling quick identification of common trends or dependencies, may offer management teams a valuable insight into the company.
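As a concrete sketch of the account-grouping idea, hierarchical clustering can sort accounts by the shape of their prior loss ratio history. The accounts and loss ratios below are invented for illustration:

```python
# A minimal sketch, assuming hypothetical account-level loss ratios:
# unsupervised clustering groups accounts with similar prior-year
# performance for quick review, with no labels supplied in advance.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Rows: accounts; columns: loss ratios for the last three years (hypothetical)
loss_ratios = np.array([
    [0.55, 0.60, 0.58],   # consistently profitable
    [0.52, 0.57, 0.61],
    [0.95, 1.10, 1.05],   # consistently poor
    [1.02, 0.98, 1.12],
    [0.60, 0.59, 0.56],
])

labels = AgglomerativeClustering(n_clusters=2).fit_predict(loss_ratios)
for account, group in enumerate(labels):
    print(f"Account {account}: group {group}")
```

With only three data fields the grouping is obvious by eye; the value of the technique appears when accounts carry dozens of fields that no spreadsheet sort can summarise.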

Figure 2: Trade-off between flexibility of algorithm and transparency of resulting model

Beyond the black box
A common criticism levelled at machine learning is that the resulting models are too much like a ‘black box’. Although a simple linear model is straightforward to understand and communicate, it is not very flexible when dealing with general data that may involve non-linear relationships. In contrast, very flexible supervised learning algorithms, such as decision forests or neural networks, can fit quite general data patterns, but at the cost of a less transparent model (see figure 2). While such models can be complicated, they can usually be communicated sufficiently clearly by plotting the model predictions against the various predictive features. Furthermore, a variety of statistical techniques exist that can give the user comfort that the model is robust and appropriate.
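The prediction-versus-feature idea can be sketched as a simple partial-dependence view: fit a flexible model, then tabulate its predictions across a grid of one feature while holding the others fixed. The data and the non-linear relationship below are simulated for illustration:

```python
# A sketch of one way to open the 'black box': vary one feature over
# a grid while holding the other at its mean, and inspect how the
# fitted model's predictions respond. Data here are simulated.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
# A non-linear relationship the forest should capture
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(0, 0.1, 300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Vary feature 0 over a grid, hold feature 1 at its mean
grid = np.linspace(0, 10, 5)
X_plot = np.column_stack([grid, np.full(5, X[:, 1].mean())])
preds = forest.predict(X_plot)
for x0, pred in zip(grid, preds):
    print(f"feature_0 = {x0:4.1f}  ->  predicted = {pred:5.2f}")
```

Plotting such a table reveals the sine-shaped trend the forest has learned, even though the forest itself is made of hundreds of opaque trees. Libraries such as scikit-learn also provide this directly via partial-dependence utilities.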

Data science is not a new field. It contains a multitude of tried and tested algorithms that have already been proven to be beneficial in other industries. With the development of technology giving the everyday user the computing power to use these processes, and with the tools to use these methods being easily accessible, actuaries can now apply techniques that were once only available to data specialists. In today’s competitive environment, data science could be used to supplement the tools that actuaries already have at their disposal and provide companies with that all-important edge.

Mark Lee is a consultant at Insight Risk Consulting
Matthew Evans is an actuarial director at Insight Risk Consulting
