Do Data Lakes hide Loch Ness Monsters?

I had a discussion with a client recently about the virtues of ensuring that data written into a data warehouse is rock solid, well understood and well defined.

My training and experience have given me high confidence that this is the right way forward for typical actuarial data.  Here I’m talking about in-force policy data files, movements, transactions, and so on.  This is really well structured data that will be used many times by different people and can easily be processed once, “on write”, then stored in the data warehouse to be reliably and simply retrieved whenever necessary. Continue reading “Do Data Lakes hide Loch Ness Monsters?”

ENID not Blyton

ENID is a term widely used, just generally not in South Africa. For some reason we didn’t import the term along with most of Solvency II.

This has nothing to do with the Famous Five. While it is most common in the general insurance space, it is relevant across the spectrum of risk management and assumption setting.

Events Not In Data or “ENID” is the forgotten cousin of “what to do with outliers in your data”.

Outliers and where to find them

Outliers are observed values substantially different from others in a sample. Some more formal definitions include:

“An outlier is an observation that lies an abnormal distance from other values in a random sample from a population”

“an outlier is an observation point that is distant from other observations”

Not these sorts of outliers. Entertaining book, though.


How to deal with outliers?

Simple question, complex answer. It depends a great deal on the context.
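
Before any judgement call, a mechanical first pass can at least flag the candidates. Here is a minimal sketch using the conventional 1.5 × IQR rule on made-up claim amounts; both the rule and the numbers are purely illustrative, not a recommendation.

```python
import numpy as np

def flag_outliers_iqr(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] as candidate outliers."""
    values = np.asarray(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Made-up claim amounts: one value sits far from the rest.
claims = [1200, 950, 1100, 1300, 1050, 980, 25000]
print(flag_outliers_iqr(claims))  # [25000]
```

Whether that 25,000 should be kept as-is, capped, or supplemented by events not in the data at all is the real question.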

Ultimately you need to make the judgement call: “are these outliers under- or over-represented in the data?” Continue reading “ENID not Blyton”

Claims analysis, inflation and discounting (part 1)

I’ve had the privilege to straddle life insurance and non-life insurance (P&C, general, short term insurance, take your pick of terms) in my career.  On balance, I think having significant exposure to both has increased my knowledge in each rather than lessened the depth of my knowledge in either.  I’ve been able to transport concepts and take learnings from one side to the other.

A recent example relates to the common non-life practice of not discounting claims reserves.  Solvency II, SAM and IFRS 17 moves to require discounting aside, it is still more common under GAAP to leave claims reserves undiscounted than to discount them.
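
To make the difference concrete, here is a minimal sketch comparing an undiscounted reserve with a discounted one for a single, made-up payment pattern; the cash flows and the flat 8% rate are purely illustrative, not anyone’s actual basis.

```python
# Expected future claim payments at the end of each year (made-up figures, in thousands)
expected_payments = [400, 300, 200, 100]
discount_rate = 0.08  # flat annual rate, illustrative only

undiscounted = sum(expected_payments)
discounted = sum(cf / (1 + discount_rate) ** (t + 1)
                 for t, cf in enumerate(expected_payments))

print(f"Undiscounted reserve: {undiscounted:.0f}")  # 1000
print(f"Discounted reserve:   {discounted:.0f}")    # ~860
```

The gap between the two views is exactly what starts to matter in the analyses below.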

Discounting, or fiddling with inflation, has some obvious implications for actual versus expected analysis, reserve run-offs and reserve adequacy analysis. These are implications that some non-life reserving actuaries trip over, because the thinking is more natural in the life space.

But, first, why are non-life reserves so often not discounted? There are several reasons typically given: Continue reading “Claims analysis, inflation and discounting (part 1)”

Modest data

I’m as excited as the next guy about the possibilities of “Big Data!” but possibly more excited about the opportunities presented by plain old “Modest Data”. I believe there is plenty of scope for useful analysis on fairly moderate data sets with the right approach and tools.

I’d go as far as to say that many of the “Big Data!” stories and analyses currently produced are really plain old statistical analysis with a few new touches from the ever-expanding list of R libraries.

For example, it seems that papers with shorter titles get more citations from other researchers.  Although the research considered 140,000 papers, there is nothing especially “Big Data!” about the analysis. The paper’s authors suggest several possible causes, related to the quality of the journal, the period of time covered and so on. Disappointingly, they don’t seem to have modelled these possible effects directly to understand whether there is any residual title-length effect.
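
Testing for a residual effect is exactly the kind of exercise ordinary tools handle. A hedged sketch of one way to do it, assuming a data set with hypothetical columns citations, title_length, journal and year, read from a placeholder file (not the authors’ actual data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder file: one row per paper with hypothetical columns
# citations, title_length, journal, year.
df = pd.read_csv("papers.csv")

# Poisson regression of citation counts on title length,
# controlling for journal and publication year.
model = smf.poisson("citations ~ title_length + C(journal) + C(year)", data=df)
print(model.fit().summary())
```

A negative binomial model might suit over-dispersed citation counts better, but the point stands: the residual title-length effect is testable with very ordinary statistics.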

There is scope for great analysis without “Big Data!” and plenty of scope for poor analysis with all the data in the world.

Economic growth during and after Apartheid and the real problem with 1%

I read a letter from Pali Lehohla on News24 this weekend. Lehohla, the head of StatsSA, disagreed with a report by DaMina Advisors on economic growth in South Africa during and after the apartheid era.

To paraphrase Lehohla, he disagreed with their methodology, their data and their values and ethics:

First, I need to engage the author on methods. Second, I address the facts. Third, I focus on the morality of political systems and, finally, I question the integrity of the luminaries of DaMina and ask them to come clean.

This wasn’t data I had looked at before, but some of Lehohla’s criticisms seemed valid. Using nominal GDP growth data is close to meaningless over periods of different inflation.

Second, comparisons of economic growth should use real growth rather than nominal growth, because nominal growth carries differing inflation rates with it. Using real growth standardises the rates across high and low inflation periods.

I haven’t confirmed the DaMina calculations, but the labels in their table do say “current USD prices” which suggests they have used nominal data. It’s little wonder any period including the 1970s looks great from a nominal growth perspective with nominal USD GDP growth in 1973 and 1974 being 34% and 23%, compared to real growth of 2.2% and 3.8%. The high inflation of the 1970s arising from oil shocks and breakdown of the gold standard distorts this analysis completely.
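
The arithmetic is straightforward: the gap between nominal and real growth is the change in the price level (plus, for USD figures, exchange-rate movements). A quick check using the 1973 figures quoted above:

```python
nominal_growth = 0.34  # 1973 nominal USD GDP growth, as quoted above
real_growth = 0.022    # 1973 real GDP growth, as quoted above

# Implied combined price and exchange-rate effect for the year
price_and_fx_effect = (1 + nominal_growth) / (1 + real_growth) - 1
print(f"{price_and_fx_effect:.1%}")  # ~31.1% - almost all of the headline 34%
```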

Lehohla’s other complaint is also important, but less straightforward to my mind –

The methods that underpin any comparison for a given country cannot be based on a currency other than that of the country concerned. The reason is that exchange-rate fluctuations exaggerate the changes beyond what they actually are.

Two problems here – one is that purchasing power adjusted GDP indices are not typically available going far back in history. The other is that if one is using real GDP, the worst of the problems of currency fluctuations are already ironed out. (The worst, certainly not all and it would still be a factor that should be analysed rather than completely overlooked.)

I was disappointed that neither piece mentioned anything at all about real GDP per capita. Does it really matter how much more we produce as a country if the income per person is declining? Income inequality aside, important as it is, more GDP per capita means more earning power per person, more income per person, more things per person. It is a far more useful measure of prosperity for a country, and particularly for comparing economic growth across countries with different population growth rates.

My own analysis, based on World Bank data (available from 1960 to 2013):

Period       Real GDP growth (annual %)   Real GDP per capita growth (annual %)
1961-1969    6.1%                          3.5%
1970-1979    3.2%                          1.0%
1980-1989    2.2%                         -0.3%
1990-1999    1.4%                         -0.8%
2000-2009    3.6%                          2.0%
2010-2013    2.7%                          1.1%
1961-1990    3.6%                          1.2%
1971-1990    2.4%                          0.1%
1991-2010    2.6%                          1.3%
1991-2013    2.6%                          0.8%
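
Period averages like those above can be computed from the annual World Bank series; a compound (geometric) average is the more defensible choice, though a simple mean gives similar answers at these growth rates. A minimal sketch, assuming a pandas Series annual_pct of annual growth rates in percent indexed by year (the series itself would be pulled from the World Bank data and is not shown here):

```python
import pandas as pd

def compound_average_growth(annual_pct: pd.Series, start: int, end: int) -> float:
    """Compound (geometric) average annual growth in % over [start, end] inclusive."""
    g = annual_pct.loc[start:end] / 100.0
    return ((1.0 + g).prod() ** (1.0 / len(g)) - 1.0) * 100.0

# Hypothetical usage, with annual_pct holding e.g. real GDP per capita growth:
# compound_average_growth(annual_pct, 1971, 1990)
# compound_average_growth(annual_pct, 1991, 2010)
```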


I’ve put these numbers out without much analysis. However, it’s pretty clear that on the most sensible measure (real GDP per capita) over the periods the DaMina study considered, post-apartheid growth has been better than during the 1971-1990 period of Apartheid.

If one includes the 1960s Apartheid economy and the latest data to 2013, the conclusion is reversed on both measures.

This, above all else, should speak to the dangers of selecting data to suit the outcome.

This analysis doesn’t speak to the impact of the gold standard, the lower cost of gold mining when the gold was closer to the surface than it is now, the technological catch-up South Africa should have benefited from more in the past, the impact of international sanctions, expenditure on the old SADF and who knows what else. There are much bigger monsters lurking there that I am not equipped to begin to analyse.

My overall conclusion? The Apartheid days were not “economically better”, even leaving aside the millions of lives damaged. Unfortunately, our economic growth has for decades been too low to grow our economy enough to provide a better life for all.

Here is the problem:

                Real GDP growth (1961-2013)   Real per capita GDP growth (1961-2013)
South Africa    3.2%                          1.0%
Kenya           4.6%                          1.3%
Brazil          4.3%                          2.3%
USA             3.1%                          2.0%

Despite the theory of “Convergence”, the US has had double South Africa’s per capita GDP growth for over five decades.  Real GDP per capita increased by 72% in South Africa over the entire period from 1960 to 2013, which sounds impressive until you realise that the US managed 189%. That is more than 2.5 times our growth. Brazil has done even better at 237%. “Even Kenya” outperformed us over this period.
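
Those cumulative figures line up with the annual rates in the table above: converting each into a compound annual rate over the 53 years from 1960 to 2013 (the 53-year horizon is my reading of the stated window):

```python
years = 53  # 1960 to 2013

for country, cumulative in [("South Africa", 0.72), ("USA", 1.89), ("Brazil", 2.37)]:
    annual = (1 + cumulative) ** (1 / years) - 1
    print(f"{country}: {annual:.1%} per annum")
# South Africa ~1.0%, USA ~2.0%, Brazil ~2.3% - matching the table above
```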

1% per annum real per capita GDP growth is just not good enough.

Open mortality data

The Continuous Statistical Investigation (CSI) Committee of the Actuarial Society does fabulous work at gathering industry data and analysing it for broad use and consumption by actuaries and others.

I can only begin to imagine the data horrors of dealing with multiple insurers, multiple sources and multiple different data problems. The analysis they do is critically useful and, in technical terms, helluva interesting. I enjoyed the presentations at both the Cape Town and Johannesburg #LACseminar2013 sessions, just because there is such a rich data set and the analysis is fascinating.

I do hope they agree to my suggestion to make the entire cleaned, anonymised data set available on the web. Different parties will want to analyse the data in different ways; there is simply no way the CSI Committee can perform every analysis and every piece of investigation that everyone might want. Making the data publicly available gives actuaries, students, academics and more the ability to perform their own analysis, and at basically no cost.

The other, slightly more defensive, reason is that mistakes do happen from time to time. I’m very aware of the topical Reinhart-Rogoff paper that was based on flawed analysis of underlying data. Mistakes happen all the time, and allowing anyone who wants access to the data to repeat or disprove calculations and analysis only makes the results more robust.

So, here’s hoping for open access mortality investigation data for all! And here’s thanking the CSI committee (past and current) for everything they have already done.

The virtual irrelevancy of population size to required sample size

Statistics and sampling are fundamental to almost all of our understanding of the world. The world is too big to measure directly. Measuring representative samples is a way to understand the entire picture.

Popular and academic literature are both full of examples of poor sample selection resulting in flawed conclusions about the population. Some of the most famous examples relied on sampling from telephone books (in the days when phone books still mattered and only relatively wealthy people had telephones) resulting in skewed samples.

This post is not about bias in sample selection but rather the simpler matter of sample sizes.

Population size is usually irrelevant to sample size

I’ve read too often the quote: “Your sample was only 60 people from a population of 100,000.  That’s not statistically relevant.”  Which is of course plain wrong and frustratingly widespread.

Required Sample Size is dictated by:

  • How accurate one needs the estimate to be
  • The standard deviation of the population
  • The homogeneity of the population

Only in exceptional circumstances does population size matter at all. To demonstrate this, consider the graph of the standard error of the mean estimate as the sample size increases, for a population of 1,000 with a population standard deviation of 25.

[Figure: standard error of the sample mean as sample size increases, for a population of 1,000]
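
The curve in that graph is just the standard error formula. A minimal sketch, including the finite population correction so the (small) effect of population size is visible; σ = 25 and N = 1,000 follow the example above:

```python
import numpy as np

def standard_error(sigma, n, population=None):
    """Standard error of the sample mean, with an optional finite population correction."""
    se = sigma / np.sqrt(n)
    if population is not None:
        se *= np.sqrt((population - n) / (population - 1))
    return se

sigma, N = 25, 1_000
for n in (10, 30, 60, 100):
    print(n, round(standard_error(sigma, n), 2),
          round(standard_error(sigma, n, population=N), 2))
# Even at n = 100 (10% of the population) the correction only nudges
# the standard error from 2.50 down to about 2.37.
```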

The standard error drops very quickly at first, then decreases very gradually thereafter even for a large sample of 100. Let’s see how this compares to a larger population of 10,000. Continue reading “The virtual irrelevancy of population size to required sample size”

Coffee as the thin edge

Pick n Pay is starting to gain some useful insights into customer behaviour and purchasing decisions at different stores. They’re using coffee as a key product to better understand who buys what, where and when.  They’re tossing out (more likely de-emphasising) LSMs as a method of categorising customers and moving to more sophisticated measures (including whether the purchaser has children or not, but also, I’d expect, location, purchase frequency, average basket size, mix of goods and so on).

Pick n Pay had to spend a fortune on the Smart Shopper system and has ongoing expenses in terms of rewards and analysis. The curious thing for me is how many loyalty card schemes incur the system and reward costs for retailers without gaining the full benefit of analysis, and thus insight into customers.

I don’t get tailored book suggestions from Exclusive Books. They also haven’t tried to entice me back to their stores since I started buying first from Bookfinder.com and then almost exclusively ebooks from Amazon. They’ve basically lost a customer and haven’t done anything about it.

Even my friend’s St Elmos offers sweet deals to customers who haven’t ordered in a while to entice them back. Pick n Pay turned sub-R100 pm customers into R350 pm customers (at least while the special was on) by specifically targeting customers who are familiar with Pick n Pay but need a push to become regular, high-spending customers.

I haven’t had a movie card with Ster Kinekor in a while, but I always use the same email address and credit card when I purchase tickets online (which I do almost universally). There have been periods of several months where I haven’t gone to the movies, but no attempt from Ster Kinekor to woo me back with free popcorn or a careful movie recommendation.

Retailers are missing a trick to get an edge over their competitors.