Calibration is not Validation

I am an actuary, not an economist, although I hold economics in high regard as a field. It tackles questions that are fascinating, ambitious, and sometimes messy in ways that make me slightly jealous. Still, I remain an actuary, and perhaps that makes me a little more sceptical of certain techniques. Maybe it is because they do not fit neatly into my actuarial toolkit. Or maybe it is because actuaries are trained to test, back-test, and analyse actual versus expected results, to understand confidence intervals and likelihood, and to ask, above all, whether the model works.

I once read that meteorologists are better calibrated forecasters than stock pickers. By calibrated, I do not mean they are better or worse at forecasting, but that their level of confidence matches their accuracy. When they believe they are 80% likely to be right, they are right about 80% of the time. When they are uncertain, their forecasts are indeed less accurate. Stock pickers, and I suspect some economists, tend to be less well calibrated. They are not necessarily poor forecasters, but they are often overconfident, and that is a different problem altogether.

The point came from Superforecasting by Philip Tetlock and Dan Gardner, which explored how some individuals become remarkably well-calibrated forecasters through constant feedback and disciplined self-assessment. One reason meteorologists perform better is feedback. Stock pickers and economists can always point to exceptional events that explain why forecasts went wrong. Who knew there would be a drought, or that Russia would invade Ukraine, or that COVID-19 would shut down economies? There is always a convenient exogenous shock. Weather forecasters do not have that luxury. They live by feedback. They issue forecasts every day and are judged immediately by outcomes they cannot explain away. Their models evolve constantly because their results are tested constantly.
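As an aside, checking calibration is mechanically simple once forecasts and outcomes are recorded. Here is a minimal sketch in Python, using invented data in which outcomes are drawn consistently with the stated probabilities, so this forecaster is well calibrated by construction:

```python
import numpy as np

def calibration_check(stated_prob, outcome, n_bins=5):
    """For each confidence bucket, compare the average stated probability
    with the fraction of forecasts that actually came true."""
    stated_prob = np.asarray(stated_prob, dtype=float)
    outcome = np.asarray(outcome, dtype=float)
    # Bucket forecasts by stated probability (with n_bins=5: 0-0.2, 0.2-0.4, ...)
    bucket = np.minimum((stated_prob * n_bins).astype(int), n_bins - 1)
    for b in range(n_bins):
        mask = bucket == b
        if mask.any():
            print(f"stated ~{stated_prob[mask].mean():.2f}  ->  "
                  f"observed {outcome[mask].mean():.2f}  (n={mask.sum()})")

# Invented data: outcomes drawn consistently with the stated probabilities.
rng = np.random.default_rng(0)
p = rng.uniform(0.5, 1.0, 2000)
y = (rng.random(2000) < p).astype(float)
calibration_check(p, y)
```

The interesting version of this table is, of course, the one built from a real forecaster's track record rather than simulated data.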

That contrast came to mind when I first encountered Computable General Equilibrium (CGE) models about a decade ago. They were intriguing. A little too clever, perhaps, and impressive in their scale. Thousands of equations, countless parameters, and from all of that emerges the predicted effect of an intervention on GDP, employment, or wages—sometimes quoted to two decimal places. At the time, it seemed remarkable, but I could not see the transparency I wanted. There was little evidence of how these models were calibrated, how reliable they were, or how confident one should be in their results.

I came across them again recently. I am not saying CGE models are unhelpful or that their results are wrong. They can be useful for exploring direction and mechanism – how substitution between sectors or between labour and capital might unfold. They are elegant frameworks for thinking through equilibrium interactions. What concerns me is the way they are sometimes used. There is often little effort to quantify uncertainty, measure accuracy, or report diagnostics that would tell us how well the model is performing. That makes me uneasy.

I would like to see genuinely robust studies that take a CGE model, generate predictions, and check them against what actually happens. Yes, there will always be surprises, but if the models barely outperform a simple trend line, the enormous complexity and opacity seem hard to justify.
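Such a study does not need anything elaborate. Here is a minimal sketch of the comparison, using invented numbers for the actual outcomes and the model's year-ahead forecasts, with a naive "no change" trend as the baseline:

```python
import numpy as np

# Invented series: actual GDP growth, the model's year-ahead forecasts, and a
# naive baseline that simply carries forward last year's observed growth.
actual = np.array([2.1, 1.8, 0.9, 1.5, 2.3, 1.1])
model  = np.array([2.4, 2.0, 1.6, 1.2, 2.0, 1.9])
naive  = np.roll(actual, 1)

# Drop the first year, for which the naive baseline has no prior value.
actual, model, naive = actual[1:], model[1:], naive[1:]

mae = lambda forecast: np.mean(np.abs(forecast - actual))
skill = 1 - mae(model) / mae(naive)   # > 0 means the model beats the naive trend
print(f"model MAE {mae(model):.2f}, naive MAE {mae(naive):.2f}, skill {skill:.2f}")
```

A skill score near zero would mean the model adds little over the trend line, and that is exactly the kind of number I would want to see reported.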

Even before that, I would want to see what the model predicts with no intervention at all. CGE models are built to find a general equilibrium, but there is no guarantee the economy we feed in is itself in equilibrium. The first step should surely be to see what the model does on its own. If its natural equilibrium is very different from the starting point, that tells us something. Either the real economy is far from equilibrium, or the model’s structure is not reflecting it well.

I am not suggesting this problem occurs frequently, but the diagnostic results are rarely reported, and that makes me wonder whether it sometimes does. If the model is not allowed to reach its internal equilibrium before the intervention is applied, the change in GDP or other results could be partly driven by the model’s own adjustment toward equilibrium. It would be useful to know.
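To make the diagnostic concrete, here is a deliberately tiny illustration, nothing like a real CGE model: a single market with constant-elasticity demand and supply, calibrated to a base-year benchmark and then solved with no intervention to check whether the benchmark is reproduced. All numbers are invented.

```python
import numpy as np
from scipy.optimize import brentq

# Base-year benchmark and elasticities "imported from the literature" (all invented).
p0, q0 = 1.0, 100.0
eps_d, eps_s = -0.7, 0.4

# Calibrate the intercepts so both curves pass through the benchmark point.
a_d = q0 / p0 ** eps_d
a_s = q0 / p0 ** eps_s

excess_demand = lambda p: a_d * p ** eps_d - a_s * p ** eps_s
p_star = brentq(excess_demand, 0.01, 100.0)   # no-intervention equilibrium price
q_star = a_d * p_star ** eps_d

print(f"benchmark (p={p0}, q={q0})  vs  solved (p={p_star:.3f}, q={q_star:.3f})")
# With exact calibration the solved equilibrium reproduces the benchmark; any
# gap in a larger model would suggest the data and structure are not consistent.
```

In a full model the same question applies at scale: run it with no shock and report how far the solved equilibrium sits from the data that went in.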

I would also like to see how stable the results are when the model is run with successive years’ data. If we input the economy as it stood each year, do the parameters and balances remain consistent? If not, that raises questions about the model’s stability. For example, can it predict the next year’s social accounting matrix (the SAM) with any meaningful accuracy? If it cannot, then I am not sure how much confidence we can place in the counterfactual results.
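The check itself is straightforward once the next year's data arrive. A minimal sketch, with invented three-by-three matrices standing in for the predicted and observed SAM:

```python
import numpy as np

def sam_error(predicted, actual):
    """Mean and worst absolute percentage error over the non-zero cells of a SAM."""
    predicted, actual = np.asarray(predicted, float), np.asarray(actual, float)
    mask = actual != 0
    ape = np.abs(predicted[mask] - actual[mask]) / np.abs(actual[mask])
    return ape.mean(), ape.max()

# Invented 3x3 matrices standing in for the predicted and observed SAM.
sam_predicted = np.array([[0, 50, 30], [40, 0, 25], [35, 30, 0]])
sam_observed  = np.array([[0, 55, 28], [42, 0, 24], [33, 36, 0]])

mape, worst = sam_error(sam_predicted, sam_observed)
print(f"mean cell error {mape:.1%}, worst cell error {worst:.1%}")
```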

As I understand it, CGE models are calibrated rather than statistically estimated. They rely on observed base-year data, some parameters solved so that the model reproduces that base year in equilibrium, and elasticities imported from previous studies. Those elasticities differ across the literature, and the uncertainty around them is often considerable. How confident are we in those numbers? What happens if we vary them within plausible ranges? If small changes produce very different equilibria even before any intervention, then those elasticities are critical. If the results remain fairly stable, that is reassuring.

We could also test how much the estimated impact of a policy or intervention depends on these elasticity choices. If the results swing widely, confidence in the model’s precision should be low. And if the model’s internal uncertainty interval for GDP, wages, or productivity is wider than the intervention effect, that should be stated clearly.
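A minimal sketch of that kind of sensitivity test, assuming a function that re-solves the model for a given set of elasticities (here just a made-up placeholder so the example runs, with invented ranges for two elasticities):

```python
import numpy as np

rng = np.random.default_rng(42)

def policy_impact(sigma_trade, sigma_labour):
    """Stand-in for a full model run. A real test would re-solve the CGE model
    with the sampled elasticities and return the estimated change in GDP (%);
    here the response is a made-up function so the sketch executes."""
    return 0.8 + 0.6 * (sigma_trade - 2.0) - 0.9 * (sigma_labour - 0.5)

# Plausible ranges for two key elasticities (values invented for illustration).
draws = [policy_impact(rng.uniform(1.0, 4.0), rng.uniform(0.2, 1.0))
         for _ in range(2000)]

low, mid, high = np.percentile(draws, [5, 50, 95])
print(f"GDP impact: median {mid:.2f}%, 90% interval [{low:.2f}%, {high:.2f}%]")
# If that interval is wide relative to the headline number, the reported
# precision owes more to the chosen elasticities than to the intervention.
```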

In actuarial work, we routinely perform analysis of surplus to understand why results differ from expectations. We ask whether deviations stem from data, assumptions, or the model itself. That discipline drives improvement and professional scepticism.

Actuaries are not perfect. There are areas of our own work where experience analysis or model validation is weak. Some solvency projections (ORSA AvE notably, but there are others) are never properly reconciled to outcomes. That is disappointing, but at least the mindset of testing and validation is embedded in the profession. The actuarial control cycle is still taught.

We are also learning from "new" data science, which takes validation seriously. Models are trained and tested on different data, evaluated out of sample and out of time, using techniques such as Monte Carlo cross-validation (a new discovery for me) or k-fold cross-validation. These approaches help ensure that models perform reliably, not just on the data that built them but on data they have never seen.
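For readers unfamiliar with those terms, here is a minimal sketch using scikit-learn and purely synthetic data; KFold gives k-fold cross-validation and ShuffleSplit gives the Monte Carlo (repeated random split) version:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Purely synthetic data: 200 observations, 5 features, a noisy linear signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -0.5, 0.0, 2.0, 0.3]) + rng.normal(scale=1.0, size=200)

model = Ridge(alpha=1.0)

# k-fold: every observation is held out exactly once across the k splits.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
print("k-fold R^2:        ", cross_val_score(model, X, y, cv=kfold).mean())

# Monte Carlo cross-validation: many repeated random train/test splits.
mc = ShuffleSplit(n_splits=50, test_size=0.25, random_state=0)
print("Monte Carlo CV R^2:", cross_val_score(model, X, y, cv=mc).mean())
```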

I would like to see economists apply similar rigour to CGE modelling. Calibration is not validation. Without testing, backtesting, and clear communication of uncertainty, the apparent precision of these models can be misleading.

CGE models can be valuable tools, but their uncertainties and dependencies are often poorly understood, poorly expressed, and underappreciated.

Perhaps I am missing something. Perhaps these diagnostic checks are done quietly somewhere. I would be glad to be corrected by economists who understand these models better than I do. I am genuinely open to learning more.

Because if these models can withstand that kind of scrutiny, they deserve confidence. If they cannot, they still have value, but we should be honest about what they are: structured thought experiments that tell useful stories. Not forecasts – and those second decimal places should never be shown.

