Do Data Lakes hide Loch Ness Monsters?

I had a discussion with a client recently about the virtues of ensuring that data written into a data warehouse is rock solid, understood and well defined.

My training and experience have given me high confidence that this is the right way forward for typical actuarial data. Here I’m talking about in-force policy data files, movements, transactions, and so on. This is really well structured data that will be used many times by different people. It can easily be processed once, “on write”, and stored in the data warehouse to be retrieved reliably and simply whenever necessary.

The particular issue was effectively about a single transaction, a single claim, being recorded as two separate items because of administration system limitations that required the claim to be processed as two equal amounts rather than a single claim. I can understand also wanting to store the two separate transactions for audit trail purposes. But for movements, analysis runs and checks against policy rules, the claim needs to reflect the product, business, actuarial model and customer reality of a single claim.

If the data is permanently stored as two separate, equal transactions, the end consumer of that data will always need to know to combine them, and that they’re not duplicates (well, they are, but they’re not; see how confusing it can be?). It really is standard practice for data warehouses to have “schema on write” in place. This speaks to the structure of the data, but also to the cleaning and transformation of the data, and the application of business and validation rules so that it fits the data definitions exactly.
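To make the split-claim example concrete, here is a minimal sketch in Python of what “on write” processing might look like. It assumes, purely for illustration, that the administration system gives both halves of the claim the same `claim_id`; the record types, field names and validation rules are hypothetical and not taken from any real system.

```python
from collections import defaultdict
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class RawClaimTxn:
    """One row as the admin system emits it (hypothetical layout)."""
    claim_id: str    # assumed to be shared by both halves of a split claim
    policy_id: str
    amount: Decimal


@dataclass(frozen=True)
class Claim:
    """One row as the warehouse stores it: a single claim per claim_id."""
    claim_id: str
    policy_id: str
    amount: Decimal
    source_txn_count: int  # kept so the split stays visible for audit


def load_claims_on_write(raw_txns):
    """Validate raw transactions and combine split ones before storing."""
    grouped = defaultdict(list)
    for txn in raw_txns:
        # Illustrative validation rule: claim amounts must be positive.
        if txn.amount <= 0:
            raise ValueError(f"non-positive amount on claim {txn.claim_id}")
        grouped[txn.claim_id].append(txn)

    claims = []
    for claim_id, parts in grouped.items():
        # Illustrative business rule: one claim belongs to one policy.
        if len({p.policy_id for p in parts}) != 1:
            raise ValueError(f"claim {claim_id} spans several policies")
        total = sum((p.amount for p in parts), Decimal("0"))
        claims.append(Claim(claim_id, parts[0].policy_id, total, len(parts)))
    return claims
```

The consumer of the resulting `Claim` rows never has to know that the claim arrived in two pieces, while `source_txn_count` leaves a trace for anyone who does want to go back to the raw transactions.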

And then I heard about Data Lakes. At first, this seemed a little like a new buzzword for data warehouse. But in fact it is quite different. The idea behind a data lake is to store large amounts of unstructured data, as it comes, and figure out what to do with it later.

This is “schema on read”.

Imagine an actual lake, full of different things: sand and mud and fish and seaweed, but also boats and piers and old anchors and cool water and warm water and maybe the occasional Loch Ness Monster. Some of it may be structured, most of it not, and you don’t need to define what you will store ahead of time. You just throw it all in the data lake and “figure out what to do with it later”.
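By contrast, a rough sketch of “schema on read” might look like the following. Everything here is an assumption made for illustration: the lake is just a Python list of raw strings, some of them JSON and some not, and the `type`, `claim_id` and `amount` fields are hypothetical names. Nothing is checked when data goes in; every reader has to decide for itself what counts as a claim.

```python
import json
from decimal import Decimal


def write_raw(lake, record_text):
    """Schema on read: store whatever arrives, exactly as it arrives."""
    lake.append(record_text)


def read_claim_totals(lake):
    """Interpret the raw records at read time and total amounts per claim.

    Anything that does not parse, or does not look like a claim under
    this reader's own rules, is skipped rather than rejected.
    """
    totals = {}
    for text in lake:
        try:
            record = json.loads(text)
        except json.JSONDecodeError:
            continue  # free text, logs and so on stay in the lake untouched
        if not isinstance(record, dict) or record.get("type") != "claim":
            continue
        claim_id = record.get("claim_id")
        amount = record.get("amount")
        if claim_id is None or amount is None:
            continue
        # Convert via str so that float inputs do not carry binary noise.
        totals[claim_id] = totals.get(claim_id, Decimal("0")) + Decimal(str(amount))
    return totals
```

The flexibility is real, since a different reader could later pull something else entirely out of the same raw strings. The cost is that the rule for combining the two half-claims now lives in every reader instead of being applied once.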

The value of a data lake is that it is flexible. Nothing is lost from the original data. If a brand new purpose for the data arises that wasn’t even imagined when the data was originally stored, no avenues have been closed off. This is not true for a data warehouse, where anything transformed away or discarded on write is no longer available.

However, the types of data used for a simple actuarial valuation call for a rigid, reliable data source. The exact data to be used must be locked down so that results are consistent, all valid policies are valued, and the analyses of experience and surplus work correctly.

If a data lake is to be used, then another store of the highly structured, cleaned, transformed and validated data will be required to support the time-sensitive standard valuation processes. Whether this is a data warehouse feeding off the data lake, or just a simple database, is, I suppose, up for debate. But a data lake on its own holds too many monsters to be used raw.

Published by David Kirk

The opinions expressed on this site are those of the author and other commenters and are not necessarily those of his employer or any other organisation. David Kirk runs Milliman’s actuarial consulting practice in Africa. He is an actuary and is the creator of New Business Margin on Revenue. He specialises in risk and capital management, regulatory change and insurance strategy. He also has extensive experience in embedded value reporting, insurance-related IFRS and share option valuation.