Assumptions, data, models and Covid-19: a conversation with Simon Wood
Edinburgh alumnus Benjamin Skuse interviews Professor Simon Wood, Chair of Computational Statistics.
Benjamin Skuse: You started out in physics before developing an interest in statistics – what inspired this change?
Simon Wood: During my PhD in Strathclyde’s applied physics department working on ecological and biological modelling, I fairly soon came to the conclusion that you couldn't really do mathematical modelling in biology without a bunch of statistics – otherwise, you were simply writing down fairy tales. But I found that actually people weren't terribly interested in making biological modelling better by doing statistics with it. And I sort of drifted into mainstream statistics as a result.
BS: One of your primary research interests is extending the application of generalised additive models (GAMs). What benefits can GAMs bring to modelling complex, high-dimensional data over other methods?
SW: A regression model relates a bunch of different variables you've measured to a response variable that you're interested in. GAMs just allow more flexibility in those relationships. They’re useful where you want to understand a little bit more about the structure within your data. So, for example, one of the reasons that GAMs get used for electricity load prediction is because you can understand what the components of a particular prediction look like. That's really good in situations in which something unusual happens, where you have a strike or you've got a period of unusual weather.
If you've got a machine learning model that's just spat out a prediction, you have no way of going inside and finding out the why of your prediction. But if you've got a GAM or other classical statistical model, you can look at the reasons the model is making the prediction and decide whether you think it's a sensible prediction or not.
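To make the additive idea concrete, here is a minimal sketch in Python of the core ingredient of a GAM – a penalised regression spline fitted by penalised least squares. The data, basis size and smoothing parameter are illustrative choices; Wood's own mgcv package for R is the reference implementation of these methods, and would also choose the smoothing parameter automatically by GCV or REML.

```python
# A minimal sketch of the idea behind a GAM: the response is modelled as a
# sum of smooth functions of covariates. Here a single penalised B-spline
# smooth is fitted by penalised least squares; data and tuning values are toy.
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)  # illustrative data

# Cubic B-spline basis with evenly spaced interior knots.
k, degree = 12, 3
knots = np.concatenate([[0.0] * degree,
                        np.linspace(0, 1, k - degree + 1),
                        [1.0] * degree])
B = BSpline.design_matrix(x, knots, degree).toarray()

# A second-order difference penalty discourages wiggliness in the fit.
D = np.diff(np.eye(k), n=2, axis=0)
lam = 1.0  # smoothing parameter; mgcv would select this by GCV or REML

# Penalised least squares: (B'B + lam D'D) beta = B'y
beta = np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ y)
smooth = B @ beta  # the fitted smooth component of the model
```

Because each fitted smooth is an explicit function of its covariate, it can be plotted and examined term by term – exactly the kind of inspection Wood describes for the electricity load example.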
BS: Another topic of interest for you is statistical ecology and epidemiology. What have been your main foci in this area?
SW: I’ve been taking classical mathematical models – most often written down as differential equations describing how a population changes with time or how an epidemic progresses over time – and figuring out how you can better validate, calibrate and test them. The reason that not a lot of calibration and validation is often done with these models is that they're fairly complicated; they're nasty, nonlinear objects, and they're just difficult to handle in a standard statistical framework.
One methodology I have developed for this is synthetic likelihood, which is useful when you have really chaotic dynamics, where it can be quite difficult to do standard statistical analyses on the model as a result. The idea of synthetic likelihood is essentially to throw away a bit of information in order to get to a better core of information from which to extract what you're actually interested in. That’s where the interest is for me, in developing methods that enable you to handle problems perhaps a little bit more honestly.
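The recipe is simple to sketch: simulate the model many times at a candidate parameter value, reduce each simulated series to a handful of summary statistics, fit a multivariate normal to those summaries, and score the observed summaries under that normal. The Python sketch below uses the stochastic Ricker map, a standard chaotic test case in this literature; the particular summaries and parameter values are illustrative, not the published implementation.

```python
# Sketch of synthetic likelihood: replace the intractable likelihood of a
# noisy chaotic model with a Gaussian likelihood on simulated summary stats.
import numpy as np
from scipy.stats import multivariate_normal

def ricker(log_r, sigma, n=100, rng=None):
    """Simulate a stochastic Ricker population model (chaotic for large r)."""
    rng = rng or np.random.default_rng()
    N = np.empty(n)
    N[0] = 1.0
    for t in range(1, n):
        N[t] = N[t - 1] * np.exp(log_r - N[t - 1] + sigma * rng.normal())
    return N

def summaries(N):
    """Reduce a trajectory to a few summary statistics (toy choices)."""
    return np.array([N.mean(), N.std(),
                     np.corrcoef(N[:-1], N[1:])[0, 1],  # lag-1 autocorrelation
                     (N < 0.1).mean()])                  # fraction near zero

def synthetic_loglik(theta, s_obs, n_sim=200, rng=None):
    """Gaussian log-likelihood of the observed summaries at parameter theta."""
    rng = rng or np.random.default_rng()
    S = np.array([summaries(ricker(*theta, rng=rng)) for _ in range(n_sim)])
    mu, Sigma = S.mean(axis=0), np.cov(S, rowvar=False)
    return multivariate_normal(mu, Sigma, allow_singular=True).logpdf(s_obs)

# Compare two candidate parameter values against 'observed' data.
rng = np.random.default_rng(0)
s_obs = summaries(ricker(3.8, 0.3, rng=rng))
print(synthetic_loglik((3.8, 0.3), s_obs, rng=rng))  # near the truth
print(synthetic_loglik((2.0, 0.3), s_obs, rng=rng))  # far from the truth
```

The information thrown away is everything in the trajectory not captured by the summaries; what remains is robust enough for a Gaussian approximation to work even when the raw dynamics are chaotic.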
BS: What did you discover when you independently modelled and conducted replication studies on COVID-19 in England?
SW: What happened early on in the epidemic is that people took some really very simplistic epidemic models and fitted them to data using standard off-the-shelf statistical technology. To get that to work, they had to really grossly oversimplify the models and build in some really very, very strong assumptions. And the difficulty when you build strong assumptions into these nonlinear models is you can end up with those assumptions just driving the conclusions completely.
Several of these models made assumptions about the R-number – the average number of people each infected person goes on to infect. They assumed that this number was basically fixed and only changed when the government made an announcement, neglecting all sorts of spontaneous changes in people's behaviour. These very, very simplistic models came to specific conclusions; in particular, that the incidence of new infections per day was surging up until the eve of the full lockdown, and then it crashed.
But if you suppose that the R-number was just a smooth function of time, you get a very different answer – that the R-number and incidence were dropping rapidly before the full lockdown.
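A toy forward simulation shows how strongly that one assumption shapes the picture. The renewal model, generation-interval weights and R values below are made up purely for illustration – they are not the models or data from the actual analyses.

```python
# Contrast a step-change R(t) that only falls at lockdown with a smooth R(t)
# already declining beforehand, driving the same toy renewal model.
import numpy as np

def renewal(R):
    """Simulate daily incidence I_t = R_t * sum_s w_s * I_{t-s}."""
    w = np.array([0.1, 0.3, 0.3, 0.2, 0.1])  # toy generation-interval weights
    I = np.zeros(len(R))
    I[0] = 100.0
    for t in range(1, len(R)):
        past = I[max(0, t - len(w)):t][::-1]  # most recent day first
        I[t] = R[t] * (w[:len(past)] @ past)
    return I

days = np.arange(60)
lockdown = 30
R_step = np.where(days < lockdown, 3.0, 0.7)            # fixed until the announcement
R_smooth = 0.7 + 2.3 / (1 + np.exp(0.3 * (days - 18)))  # falling well before it

# Under the step assumption incidence surges until the eve of lockdown and
# then crashes; under the smooth assumption it peaks days before lockdown.
print(np.argmax(renewal(R_step)), np.argmax(renewal(R_smooth)))
```

Both assumptions can be made broadly consistent with the same noisy data, which is why they need to be scrutinised rather than simply built in.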
BS: Why wasn’t this message communicated broadly at the time?
SW: Very early on, I tried on my own and with other people to get articles into The Guardian; we tried letters to The Times; I would talk to BBC researchers; but nothing would come of it. It was extremely difficult to get scientific papers published too. If you said anything against the narrative, it would be bounced repeatedly by editors, or just take forever in the review process.
Things were really suppressed, and I think people thought they were doing the right thing by refusing to publish anything contrary to the standard narrative, but much wider scientific debate should have taken place.
BS: What do you think should be the key lessons learned from COVID-19 in terms of reporting and acting upon statistics?
SW: Much more emphasis should be placed on direct measurement, rather than on highly nonlinear mathematical models, which were never really designed for the kind of predictive task they were used for.
Another thing is just much less focus on a single risk whilst neglecting other risks. It was portrayed as being ‘lives versus economy’. But it's really ‘lives versus lives’, because there's a lot of evidence that economic effects and social disruption have a big effect on people's health and lives.
Simon Wood
Professor Simon Wood is Chair of Computational Statistics in the School of Mathematics, University of Edinburgh.