Centre for Statistics

Tapping into Big Data transforming lives

In conversation with Edinburgh alumnus Benjamin Skuse, Professor Cathie Sudlow OBE, Chair of Neurology and Clinical Epidemiology, discusses opportunities for statisticians to tap into whole population health data from across the UK.

Benjamin Skuse: Can we begin with a brief introduction to you and your career? 

Cathie Sudlow: I trained in medicine, and I don't think in the early days I saw myself as an academic or a researcher – I just wanted to be a doctor. But then, about two years after qualifying, I started doing a job in neurology up in Edinburgh. The big interest in the department at the time was stroke, particularly in the epidemiology of the condition and in doing large-scale studies and large-scale trials. So I got interested in the large-scale public health epidemiology of common conditions at the same time as I got interested in neurology.

Up until about 2011, my career was split: half research, half clinical coalface work with an emphasis on stroke epidemiology. But then in 2011, I took up the post of Chief Scientist for UK Biobank, which allowed me to get into large-scale population studies that are relevant for studying a broad range of diseases.

I next started to work with Health Data Research UK. For a while, I set up and led the Scottish component of that and then I flipped over to lead the new British Heart Foundation Data Science Centre, which was a data science investment within Health Data Research UK. That was really what allowed me to take forward the agenda of linking together datasets from healthcare at whole population scale. It had started to be done in Scotland and in Wales, but it had never been done in England for a number of reasons. I was very involved in co-developing the setups and the secure environments to allow these datasets to be brought together, de-identified and made available for research.

BS: What have been the big technical challenges in bringing together these vast datasets from across England and the UK?

CS: There were billions of rows of data to be analysed and these data were not collected for research purposes. They’re pretty messy, there are lots of gaps, there's lots of missingness and the data dictionaries were in a pretty woeful state. So we built a team of health data scientists to manage the data and to run a curation pipeline so that researchers coming to use the data didn't have to keep starting from scratch.

Gradually, over the months and years since, we've cleaned them. It's possible now to run studies across the whole English population, the whole Scottish population, the whole Welsh population and now increasingly the Northern Irish population as well. We can study 67 million people simultaneously if we want to, but the splicing together of the results from the four nations is still very challenging, because the environments in which these analyses are run are still separate, and the data are not optimally harmonised.

BS: You have said that ‘we are only just starting to tap into the potential that data has in transforming lives’. Can you elaborate on this?

CS: I think there is huge potential in AI and machine learning. But actually, there's massive potential in just running traditional, simple, straightforward epidemiological analyses on these very large datasets because of the richness of the data. We can now ask questions about health disparities that it didn't use to be possible to ask, and we can ask questions that are relevant to all age groups and all ethnicities and people in all geographies.

There are also still many other datasets that could, in theory, be brought into the fold to add more value. For example, there are lots of data on expensive medicines that could be linked to the data from the other sources, which we don't do yet, which would give huge amounts of information of relevance to NHS productivity and NHS spending. There’s lots of data on people’s genomics that's generated in NHS labs that when linked to other sources could again provide huge insights about the causes and consequences of rare and common genetic diseases. Those are just a couple of examples of where we've moved the dial, but we've actually got a long way yet to go.

BS: What would be your advice to any statistician looking to use the wealth of health data now available in the UK in their studies?

CS: Before I answer that, I want to say that we need more quantitatively able people. It's an area where it's been quite difficult to attract and retain people who can work well with data, because they tend to get pulled off into working in other areas. Also, researchers from statistics who are interested in health need to know that this is a team sport, it's really interdisciplinary. It’s about collaborating and connecting with people with other skills, so clinicians, epidemiologists, computer scientists, patients, members of the public – this sort of stuff only happens when all of these different perspectives and skills are brought together.

From a very practical point of view, the BHF Data Science Centre has set up a consortium [CVD-COVID-UK/COVID-IMPACT Consortium] which is completely inclusive. We can sign people up to that consortium, they can come to our meetings, they can meet others who are working with the data, they can learn about how it works, and they can join and contribute to a project.

Many statisticians may not quite know what the best questions to address are, but they may have the skills that other teams most need to help them work out how best to analyse the data. So there are fantastic opportunities for people with quantitative skills to join our consortium and no doubt others as well. We would love people with quantitative skills to hook in, make connections and find projects.

