In the past I had some comments on Neil Ferguson’s disease model and have repeatedly noted its poor quality. This model was used, last spring, as the basis for setting government policies to respond to Covid-19. Like many disease models, its output was garbage, unfit for any purpose.
The following item noted that the revision history, since last spring, is available and shows that ICL has not been truthful about the changes made to the original model code.
THIS! Many academic models including disease models and climate models, average the outputs from multiple runs, some how imaginatively thinking that this produces a reliable projection – uh, no, it does not work that way.
An average of wrong is wrong. There appears to be a seriously concerning issue with how British universities are teaching programming to scientists. Some of them seem to think hardware-triggered variations don’t matter if you average the outputs (they apparently call this an “ensemble model”).
Averaging samples to eliminate random noise works only if the noise is actually random. The mishmash of iteratively accumulated floating point uncertainty, uninitialised reads, broken shuffles, broken random number generators and other issues in this model may yield unexpected output changes but they are not truly random deviations, so they can’t just be averaged out.
Software quality assurance is often missing in academic projects that are used for public policy:
For standards to improve academics must lose the mentality that the rules don’t apply to them. In a formal petition to ICL to retract papers based on the model you can see comments “explaining” that scientists don’t need to unit test their code, that criticising them will just cause them to avoid peer review in future, and other entirely unacceptable positions. Eventually a modeller from the private sector gives them a reality check. In particular academics shouldn’t have to be convinced to open their code to scrutiny; it should be a mandatory part of grant funding.
The deeper question here is whether Imperial College administrators have any institutional awareness of how out of control this department has become, and whether they care. If not, why not? Does the title “Professor at Imperial” mean anything at all, or is the respect it currently garners just groupthink?
When a software model – such as a disease model – is used to set public policies that impact people’s lives – literally life or death – these models should adhere to standards for life-safety critical software systems. There are standards for, say, medical equipment, or nuclear power plant monitoring systems, or avionics – because they may put people’s lives at risk. A disease model has similar effects – and hacked models that adhere to no standards have no business being used to establish life safety critical policies!
I and another software engineer had an interaction with Gavin Schmidt of NASA regarding software quality assurance of their climate model or paleoclimate histories. He noted they only had funding for 1/4 of a full time equivalent person to work on SQA – in other words, they had no SQA. Instead, their position was that the model’s output should be compared to others. This would be like – instead of testing, Microsoft would judge its software quality by comparing the output of MS Word to the output of another word processor. In other words, sort of a quailty-via-proxy analogy. Needless to say, this is not how SQA works.
Similarly, the climate model community always averages multiple runs from multiple models to create projections. They do this even when some of the model projections are clearly off the rails. Averaging many wrongs does not make a right.
 Note that NASA does open source their software which enables more eyes to see the code, and I do not mean to pick on NASA or Schmidt here. They are doing what they can within their funding limitations. The point, however is that SQA is frequently given short shrift in academic-like settings.