An actuary I know once made me cringe by saying “It doesn’t matter how an Economic Scenario Generator is constructed, if it meets all the calibration tests then it’s fine. A machine learning black box is as good as any other model with the same scores.” The idea is that if the model outputs correctly reflect the calibration inputs, the martingale test passes, and the number of simulations produces an acceptably low standard error, then the model is fit for purpose and is as good as any other model with the same “test scores”.
This is an example of actuarial sloppiness and is of course quite wrong.
There are at least three clear reasons why the model could still be dangerously specified and inferior to a more coherently structured model with worse “test scores”.
The least concerning of the three is probably interpolation. We rarely have a complete set of calibration inputs. We might have equity volatility at 1, 3 and 5 years and an assumed long-term point at 30 years as calibration inputs to our model. We will be using the model outputs at many other points, and confirming that the outputs are consistent with the calibration inputs says nothing about whether the 2 year or 10 year volatility is appropriate.
The second reason is related: extrapolation. We may well be using model outputs beyond the 30 year point, the last point for which we have a specific calibration. A second example would be the volatility skew implied by the model even if none were specified, which is a more subtle form of extrapolation.
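To make the first two concerns concrete, here is a minimal sketch rather than anything resembling a real ESG. The volatility figures are invented for illustration, and the two interpolation rules (`vol_linear` and `vol_linear_variance`) are hypothetical stand-ins for two models. Both reproduce the four calibration points exactly, yet they disagree at 2 and 10 years and disagree badly beyond 30 years, where `np.interp` simply holds the last value flat.

```python
import numpy as np

# Hypothetical calibration inputs: term in years and equity volatility.
terms = np.array([1.0, 3.0, 5.0, 30.0])
vols = np.array([0.20, 0.22, 0.23, 0.25])

def vol_linear(t):
    """Stand-in model 1: interpolate linearly in volatility.
    Beyond 30 years np.interp holds the last volatility flat."""
    return np.interp(t, terms, vols)

def vol_linear_variance(t):
    """Stand-in model 2: interpolate linearly in total variance (vol^2 * t).
    Beyond 30 years np.interp holds total variance flat, so the implied
    annualised volatility quietly decays with term."""
    total_var = np.interp(t, terms, vols**2 * terms)
    return np.sqrt(total_var / t)

# Both models score perfectly on the calibration test...
assert np.allclose(vol_linear(terms), vols)
assert np.allclose(vol_linear_variance(terms), vols)

# ...but disagree at the points we never tested.
for t in [2.0, 10.0, 40.0]:
    print(f"{t:>4.0f}y  linear vol: {vol_linear(t):.4f}  "
          f"linear variance: {vol_linear_variance(t):.4f}")
```

Neither rule is being put forward as the right one. The point is only that the calibration test cannot tell them apart.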
A typical counter to these first two concerns is to use a more comprehensive set of calibration tests. Consider the smoothness of the volatility surface and ensure that extrapolations beyond the last calibration point are sensible. Good ideas both, but already we are veering away from a simplified calibration test score world and introducing judgment (a good thing!) into the evaluation.
There are limits to the “expanded test” solution. A truly comprehensive set of tests might well be impossibly large, if not infinite, and the cost of this brute force approach grows with every test added.
The third reason is a function of how the ESG is used. Most likely, the model is being used to value a complex guarantee or exotic derivative with a set of pay-offs based on the results of the ESG. Two ESGs could have the same calibration test results, even producing similar at-the-money option values, but value path-dependent or otherwise more exotic products very differently because of different serial correlations or untested higher moments.
It is unlikely that a far out-of-the-money binary option was part of the calibration inputs and tests. If it were, it is certain that some other instrument, one that would add information in a complete market, was left out. The set of calibration inputs and tests can never be exhaustive.
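A toy illustration of this third point, under loudly stated assumptions: zero interest rates, Gaussian annual log-returns, a 10 year horizon, and two made-up "ESGs" that differ only in serial correlation. Model A has independent annual returns; model B has AR(1)-style correlated returns (`rho = 0.5`, an arbitrary choice) rescaled so that the 10 year variance matches model A. All function names (`simulate_index`, `european_put`, `annual_ratchet`) are hypothetical. Both models pass a simple mean check and give the same 10 year at-the-money put value up to sampling noise, but they value an annual ratchet (the sum of positive annual returns) very differently.

```python
import numpy as np

def simulate_index(cov, n_paths, s0=100.0, seed=0):
    """Simulate index paths whose annual log-return shocks have covariance `cov`.
    A per-step drift correction makes E[S_t] = s0 at every step (zero rates
    assumed), so the simple mean check passes by construction."""
    n_steps = cov.shape[0]
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_paths, n_steps))
    shocks = z @ np.linalg.cholesky(cov).T        # correlated annual shocks
    log_path = np.cumsum(shocks, axis=1)          # cumulative log return X_t
    cum = np.tril(np.ones((n_steps, n_steps)))    # cumulative-sum operator
    var_t = np.diag(cum @ cov @ cum.T)            # Var[X_t] at each step
    return s0 * np.exp(log_path - 0.5 * var_t)

def european_put(paths, strike=100.0):
    """Undiscounted at-the-money put on the terminal index value."""
    return np.mean(np.maximum(strike - paths[:, -1], 0.0))

def annual_ratchet(paths, s0=100.0):
    """Path-dependent pay-off: sum of the positive annual returns."""
    prev = np.hstack([np.full((paths.shape[0], 1), s0), paths[:, :-1]])
    return np.mean(np.sum(np.maximum(paths / prev - 1.0, 0.0), axis=1))

n_steps, sigma, rho, n_paths = 10, 0.20, 0.5, 200_000

# Model A: independent annual returns.
cov_a = sigma**2 * np.eye(n_steps)

# Model B: serially correlated returns, rescaled to the same 10 year variance.
lags = np.abs(np.subtract.outer(np.arange(n_steps), np.arange(n_steps)))
corr_b = rho ** lags
cov_b = corr_b * (sigma**2 * n_steps / corr_b.sum())

paths_a = simulate_index(cov_a, n_paths)
paths_b = simulate_index(cov_b, n_paths)

put_a, put_b = european_put(paths_a), european_put(paths_b)
ratchet_a, ratchet_b = annual_ratchet(paths_a), annual_ratchet(paths_b)

print(f"mean terminal index : A {paths_a[:, -1].mean():.2f}  B {paths_b[:, -1].mean():.2f}")
print(f"10y ATM put         : A {put_a:.2f}  B {put_b:.2f}")
print(f"annual ratchet      : A {ratchet_a:.3f}  B {ratchet_b:.3f}")
```

If the calibration tests only looked at the 10 year point, both models would earn the same score, and only the ratchet valuation would reveal that they describe very different worlds.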
It turns out there is an easier way to decrease the risk that the interpolated and extrapolated financial outcomes of using the ESG are nonsense: start with a coherent model structure for the ESG. By using a logical underlying model incorporating drivers and interactions that reflect what we know of the market, we bring much of what we would otherwise need to add to that enormous set of calibration tests into the model itself, and increase the likelihood of usable ESG results.