I will not comment on Covid-19 and only make a few comments on the publicly available source code.
- This is not a comment about whether models should be used – or not.
- This is not a comment about whether this model’s output is correct – or not (we have no way of knowing either way). Even with the model output being off by very large amounts, we still have no way of knowing.
- This is not a comment on whether there should be a lockdown – or not.
- This is not a comment on whether a lockdown is effective – or not.
- This is a review of a software project.
- The review findings are typical of what is often seen in academic software projects and other “solo contributor” projects (versus modern “production code” projects). The issues that often arise in academic projects are due to the nature of individuals or small groups, not trained in software, tinkering with software code until it grows out of control. This likely occurs in other fields but seldom do such works become major components of public policy.
- When software is used for public policy it needs to be publicly reviewed by independent parties. Until the past month, this code had apparently not been reviewed outside of the ICL team.
- Models are a valuable tool when they are properly used and their limitations are understood. A reasonable model can enable planners to play “what if” scenarios and to adjust input parameters and see what might occur. For example, consider a model for complex manufacturing – we might look at productivity measures, inventory, defect rates, costs of defect repairs, costs of dissatisfied customers, impacts on profits and revenue, supplier issues and so on. If we choose to optimize for profit, then we use our model to find optimal values for each parameter to achieve maximum profit. Or perhaps we optimize for customer satisfaction instead – what happens to our profits if we do that? That is a what-if question. For this purpose the model need not be perfect but at least needs to be “within the ball park”. The key is “within the ball park”. If the model flies off the rails in many cases, it is not a good and accurate model and there is a risk we make seriously wrong decisions.
- A model may also be used to compare scenarios. We may not need precise future projections for that – instead, if we, say, increase X, our model shows high profits, but in another run, we decrease X and it shows losses. We may not need to know the exact dollar value – only that one path leads to profit and one leads to losses. In this way, precise projections are not always essential.
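To make the scenario-comparison idea concrete, here is a minimal sketch in Python. It is a made-up toy profit model: the function name, the parameters and every number are assumptions for illustration and are not taken from any real model. The point is only that the two runs differ in one input and we look at the sign of the result, not its exact value.

```python
# Hypothetical toy model: all names and numbers are invented for illustration.
def projected_profit(units_sold, price, unit_cost, defect_rate, repair_cost):
    """Return revenue minus production cost minus defect repair cost."""
    revenue = units_sold * price
    production = units_sold * unit_cost
    repairs = units_sold * defect_rate * repair_cost
    return revenue - production - repairs

# Inputs held constant across both scenarios.
baseline = dict(units_sold=10_000, price=25.0, unit_cost=18.0, repair_cost=40.0)

# The only thing that changes between runs is X (here, the defect rate).
low_defects = projected_profit(defect_rate=0.02, **baseline)
high_defects = projected_profit(defect_rate=0.20, **baseline)

for label, profit in (("low defects", low_defects), ("high defects", high_defects)):
    print(label, "->", "profit" if profit > 0 else "loss")
```

A comparison like this is only trustworthy if the model gives the same answer every time it is run on the same inputs, which matters for the rest of this review.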
This code – placed on GitHub – is apparently a revision released by the University of Edinburgh, based on the original source code by Neil Ferguson of Imperial College London. They are said to have asked Microsoft and others to help clean it up and fix defects. Consequently, this is not the exact same code that Neil Ferguson was using to create his models two months ago, but code that has since been updated by others.
This code is thousands of lines of very old C programming.
First thing I noticed was how so much code has been placed in one gigantic source file – 5,400 lines in a single source file. Ouch.
This explains much:
The “undocumented C” argument comes when the author is the only one working on a project and sees no need to document their work. After all, it’s just me! There are two problems with this thinking: (1) over time, even personal projects like this one grow in size until they become thousands of lines of code. Years later, our understanding of our own original code may not be as good as we think it is. We forget why we made particular design choices. We forget why we assumed certain conditions or values. Bottom line: over time, we forget. And (2) personal projects like the Covid-19 simulation eventually become the basis of major public policy and others are asked to review, check or modify the code base. No documentation puts the entire model at risk. This is not the right way to do these kinds of software projects, particularly when this is the basis for advising world leaders on major public policies that impact billions of people.
Academic projects are rarely industrial strength and not intended for the public policy choices they get used for. In 2008, I reviewed some of the ClimateGate source code – and there were crazy oddities. A different climate scientist, Gavin Schmidt at NASA, replied to a question I asked about quality assurance back around the time of ClimateGate and he said they had only 1/4 of a full time equivalent worker to do SQA in his NASA office (to be clear, the ClimateGate code I reviewed was not done by NASA, and NASA has been good about sharing their code publicly). The issue is that even NASA was not funded to do SQA. Schmidt said they relied instead on ensuring that their model output matched other model outputs… uh, okay. That would be somewhat like testing Microsoft Word by comparing its output to LibreOffice. That is not how software testing works, though, and it would be seen, at best, as only a partial software quality metric.
A similar problem occurs in economic models – which are often complex, opaque and unverified – but are used for setting critical public policies. Similarly with the UW IHME Covid-19 statistical model – whose projections have been garbage (I will substantiate that if anyone does not understand how bad the IHME models have been). Yet the IHME model has been relied upon by states and the Federal government.
The Covid-19 simulation code has what appear to be hundreds of parameter variables (someone else counted over 450), each of which can have an assumed or asserted value. This lets someone experiment with ideas or possibilities – but there are so many variables – many of which appear unlikely to have any particular basis for a value assignment – that the resulting output seems questionable.
A former Google software engineer has taken a look at the code base in more detail and indicates numerous problems. Someone else also took a look and counted over 450 input parameters. Others in the software field are also beginning to look into this. This is the sort of review that should have occurred long ago.
Multiple runs of the model, using identical input parameters, are said to lead to widely different outputs. Literally, two runs of what should be the same model and one produces 80,000 more deaths in 80 days than the other – yet nothing changed in the input parameters. Some argue that it’s a Monte Carlo model but I do not believe that is the case or the source of these discrepancies between model runs. Everything indicates this is a simulation-based model.
A model that is run twice in a row with the same input parameters but produces widely different results raises serious questions. With different results, its output can never be subject to verification.
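For comparison, here is a minimal sketch of what repeatable behavior looks like, even for a simulation that does draw random numbers. This is a toy written for this review, not the ICL code: the function name, parameters and values are all assumptions. The random generator is created from an explicit seed and kept private to the run, so the same seed and the same inputs always give the same output.

```python
import random

# Hypothetical toy simulation; not the ICL model. All values are invented.
def simulate_outbreak(seed, days=80, initial_cases=10, spread_prob=0.05):
    """Return the total case count after `days` of simple random spread."""
    rng = random.Random(seed)  # private generator, no hidden global state
    cases = initial_cases
    for _ in range(days):
        # Each existing case independently causes one new case with spread_prob.
        new_cases = sum(1 for _ in range(cases) if rng.random() < spread_prob)
        cases += new_cases
    return cases

first_run = simulate_outbreak(seed=42)
second_run = simulate_outbreak(seed=42)
print("identical runs match:", first_run == second_run)

# Averaging is also repeatable when the list of seeds is fixed and recorded.
seeds = [1, 2, 3, 4, 5]
average = sum(simulate_outbreak(seed=s) for s in seeds) / len(seeds)
```

If two runs with the same seed and inputs disagree, that points to a defect such as uninitialized data or a race between threads, not to the randomness of the method.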
The code produced so many anomalous outputs that the ICL team would run the model several times and average the results, assuming that the average would represent the correct answer. ICL argued that running the code single threaded on a single processor would fix the problem. But the University of Edinburgh found this assertion not true – which implies the ICL team does not understand how their code works nor understands why it is broken.
The team had no consistent regression testing program in effect. That means code updates can break old code – and those broken bits will not be detected via a standardized test regime. In other words, broken pieces lie hidden until something goes haywire in the future – or just adds a few thousand extra deaths – and no one notices.
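A regression test regime does not need to be elaborate. A minimal sketch, assuming a deterministic toy model and a small table of previously accepted outputs (the function, the inputs and the "golden" values below are all invented for illustration and have nothing to do with the ICL code):

```python
# Hypothetical toy model standing in for the real simulation.
def project_cases(initial, growth_rate, days):
    """Compound growth, rounded to whole cases."""
    return round(initial * (1 + growth_rate) ** days)

# Previously accepted outputs, recorded once and kept under version control.
GOLDEN = {
    (100, 0.0, 10): 100,
    (100, 1.0, 3): 800,
    (10, 0.5, 4): 51,
}

def find_regressions(model=project_cases):
    """Re-run every recorded input and report any output that changed."""
    failures = []
    for args, expected in GOLDEN.items():
        actual = model(*args)
        if actual != expected:
            failures.append((args, expected, actual))
    return failures

print("regressions:", find_regressions())

# A careless "update" with an off-by-one in the day count is caught at once.
def broken_update(initial, growth_rate, days):
    return round(initial * (1 + growth_rate) ** (days + 1))

print("regressions after bad update:", find_regressions(broken_update))
```

Run on every code change, a check like this turns a silently shifted output into an immediate, visible failure. It only works, of course, if the model is repeatable in the first place.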
Model outputs are said to be used as model inputs. Ouch.
Single character variable names. Miles long nested IF statements. Code like that is near impossible to maintain, hence, each update likely breaks a few things that are not caught.
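To show what I mean, here is a made-up before and after. Neither function is from the ICL code base, and the rates are invented numbers. Both versions compute the same thing. Only the second can be read, reviewed and safely changed.

```python
# Hypothetical example; not taken from the ICL code. Rates are invented.

# Before: single-character names and nested IFs.
def f(a, b, c):
    if a > 0:
        if b:
            if c < 65:
                return a * 0.1
            else:
                return a * 0.3
        else:
            return 0.0
    else:
        return 0.0

# After: descriptive names, named constants, and early returns.
UNDER_65_RATE = 0.1
OVER_65_RATE = 0.3

def expected_hospitalizations(infected, has_symptoms, age):
    if infected <= 0 or not has_symptoms:
        return 0.0
    rate = UNDER_65_RATE if age < 65 else OVER_65_RATE
    return infected * rate

# Check that the rewrite did not change behavior on a grid of inputs.
same = all(
    f(a, b, c) == expected_hospitalizations(a, b, c)
    for a in (-5, 0, 1, 200)
    for b in (True, False)
    for c in (10, 64, 65, 90)
)
print("behavior preserved:", same)
```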
This code looks bad, really bad. This does not mean the policy selected is the right or wrong policy – only that this model has sufficient software-related difficulties as to be highly questionable and not suitable for making those policy decisions.
Ferguson’s past models – notably for “Mad Cow”, “Swine Flu epidemic” and other disease outbreaks – also produced wildly off target outputs – off by one to two orders of magnitude in their projections of deaths. This is similar to the output of the SARS-CoV-2 model projections – off in space.
The Dallas Morning News has some comments about models – good first part but a weak ending.
Bottom line: the ICL model is of questionable accuracy and uses software development methodologies that are insufficient for a model of this importance. The model needs to be significantly cleaned up, documented, and have a set of regression tests applied. The code also needs to be thoroughly reviewed by independent parties. Again, this model is important – and since it is used for public policy purposes, the public needs to have confidence that its output is valid and useful. The ICL model may even have gotten the correct results – merely by randomness! The model may be a good model – as long as all of the limitations are fully disclosed (which they had not been up to now). Still, there’s enough worrisome stuff going on that asking questions and seeking answers would be wise.
Addendum: Defenses of Ferguson’s model include that this shows why more funding is needed for this type of work. Or that it doesn’t matter that the model was bad as the policy was effective anyway. Anyone criticizing him or his model or his coding skills is a Nazi (sorry, but that argument comes up with just about everything nowadays). It was the best available model even if it’s not very good. Models are “just models” so no biggie. His code modeled human behavior, and humans’ behavior changed, therefore, the initial order of magnitude error is not reflected in the post event data.