This is a guest blogpost by Matt Jones, Lead Analytics Strategist at Tessella.
From drug design to supply chains, predictive algorithms built using yesterday’s data no longer work in our strange new world. Here’s how to build new ones quickly.
Finding and testing new drugs, predicting disease spread, and scaling up manufacture of vaccines all rely on predictive models and lots of varied and potentially complex data. But their data capture mechanisms, and the models themselves, were designed for a different world.
Modelling every eventuality is impossible, so models include assumptions based on data trends and scientific principles, such as what time of year people get colds, global manufacturing capacity, how certain proteins bind, or how certain particles spread in an airstream. All of these areas now have a huge number of new and poorly understood variables thanks to SARS-CoV-2.
This means that the experts creating new drugs, vaccines, tracking apps, etc., need to spend a lot of time acquiring and re-engineering data, and rebuilding and validating models. This may not be their core expertise and progress may be slow. If they get it wrong, they will cause costly problems down the line.
We propose four principles for rebuilding models to ensure they get good results at speed.
To maintain focus, we will discuss this in the context of challenges facing global healthcare industries, but the recommendations are relevant to wherever data has been disrupted by Covid-19, from FMCG supply chain management, to reopening factories safely, to sales forecasting.
- Accessing the right data
First, ensure the data coming in is reliable.
Disease spread models need data on things like population density, prevalence of infections, and new behaviours adopted under lockdown. Diagnostics models need training on the full range of disease manifestations – a lung scan will look different in a healthy 20-year-old vs an asthmatic 60-year-old. Accessing such reliable data is often a pain point.
Good data must come from a trusted source, which means having experts check it for errors, bias and confounding elements before it goes into your corporate system. Many diagnostic AIs fail because the model can learn to spot a label in the data (eg a lung scan with a circle drawn around an infected area), rather than the disease indicator itself.
Once captured, adequate IT infrastructure and data pipelines are needed to ensure data is consistent and accessible across the organisation. There must also be a consistent taxonomy for naming data, and metadata should be added to allow different groups to find it easily in the system and understand its context. This is critical for allowing modellers to move quickly.
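As a minimal sketch of what consistent tagging can look like in practice, the snippet below attaches a metadata record to each dataset at ingest and retrieves datasets by taxonomy term. The schema fields and dataset names are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

# Hypothetical metadata record attached to each dataset on ingest.
# Field names are illustrative, not a standard schema.
@dataclass
class DatasetMetadata:
    dataset_id: str
    source: str          # trusted origin, e.g. "hospital-a"
    taxonomy_term: str   # term from the shared vocabulary
    capture_date: str    # ISO 8601 date
    description: str

def find_by_term(catalogue, term):
    """Return all datasets tagged with a given taxonomy term."""
    return [m for m in catalogue if m.taxonomy_term == term]

catalogue = [
    DatasetMetadata("ds-001", "hospital-a", "lung-ct-scan", "2020-04-01",
                    "Annotated CT scans, adults 20-70"),
    DatasetMetadata("ds-002", "mobility-provider", "population-movement",
                    "2020-04-15", "Anonymised movement data under lockdown"),
]

scans = find_by_term(catalogue, "lung-ct-scan")
print([m.dataset_id for m in scans])  # → ['ds-001']
```

Because every group tags with the same vocabulary, a modeller can locate relevant data without knowing which team captured it.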
- Choosing the right models
Second, design models for this new reality. Some may be repurposed, but many will need to be rebuilt from the ground up.
Start by defining what you want to do. Understand the type of analytics problem. Is it classification or regression, supervised or unsupervised, predictive, statistical, physics-based, etc?
Don’t rush straight in. Use subject matter experts and data scientists to screen data to understand what is possible and what is not. Perform rapid and agile early explorations using simple techniques to spot the correlations that will guide your plan.
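A simple technique for that early screening is a correlation check between candidate variables. The sketch below computes a Pearson coefficient from scratch over toy, made-up figures (regional population density vs case counts); real screening would use your own captured data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy series: population density vs case counts in five regions.
density = [100, 250, 400, 800, 1200]
cases   = [12,   30,  55, 110,  160]

r = pearson(density, cases)
print(f"correlation: {r:.3f}")
```

A strong correlation here would flag density as a variable worth carrying into the candidate models; a weak one would save you from building around a dead end.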
Based on this analysis, identify candidate modelling techniques (eg empirical, physical, stochastic, hybrid). Decide on the most suitable model for the problem. Check implementation requirements such as user interface, required processing speed, architecture, etc to ensure it will be usable.
‘Most powerful’ is not the same as ‘most suitable’. Popular techniques such as machine learning, which need lots of well-understood data, may not be appropriate for Covid-19 challenges. Alternative approaches such as Bayesian uncertainty quantification may be better for scenarios where limited trusted data is available.
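To illustrate why a Bayesian approach suits scarce data, here is a minimal Beta-Binomial sketch: a prior belief about an infection rate is updated with a tiny sample, and the result carries an explicit uncertainty interval rather than a single point estimate. The prior and the counts are invented for illustration.

```python
import math

def beta_mean_and_ci(alpha, beta, z=1.96):
    """Posterior mean and approximate 95% interval of a Beta(alpha, beta)."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    sd = math.sqrt(var)
    return mean, max(0.0, mean - z * sd), min(1.0, mean + z * sd)

# Weakly informative prior Beta(1, 9): encodes a belief the rate is low.
prior_a, prior_b = 1, 9

# Scarce observed data: 3 positives out of 20 tests.
positives, tests = 3, 20

# Conjugate update: posterior is Beta(prior + positives, prior + negatives).
post_a = prior_a + positives
post_b = prior_b + (tests - positives)

mean, lo, hi = beta_mean_and_ci(post_a, post_b)
print(f"posterior rate ≈ {mean:.2f}, 95% interval ≈ [{lo:.2f}, {hi:.2f}]")
```

With only 20 observations the interval is wide, and honestly so; a machine learning model trained on the same 20 points would give an answer with no such warning attached.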
- Ensuring your answers are trusted
Third, bake in trust. The best model in the world is ultimately useless if users don’t trust and use it.
Trust requires people to be able to see that the model works, but this is not the end of the story. Over-complicated or frustrating user interfaces, or models which break after a few months, undermine trust and slow down uptake. So do privacy and ethical concerns – as we see with track and trace apps.
Trusted models are explainable. If people can clearly see from their app that they spent an hour talking to an infected person, they are likely to take the result seriously and isolate. If they get an alert with no context, they may decide the model is being oversensitive and dismiss it.
- Deploying models at scale
Finally, models must work and scale in the real world.
That means putting the right infrastructure around them to capture incoming data and deliver the resulting insight. In most cases, that involves engineering them into robust software systems and integrating them into either a mobile or web app, or a piece of technology such as a diagnostics machine.
It may mean wrapping models in software (‘containers’) which translates incoming and outgoing data into a common format, allowing them to slot into the wider IT ecosystem. It will require allocating compute and power appropriate to the application. It also means planning for ongoing maintenance, support, and retraining.
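As a sketch of that translation layer, the adapter below accepts a common JSON wire format, converts it into the model's native inputs, and converts the output back to JSON. The model itself is a toy stand-in; the field names and scoring logic are assumptions made for illustration only.

```python
import json

def risk_model(age: int, exposure_minutes: float) -> float:
    """Toy risk score; stands in for a real trained model."""
    return min(1.0, 0.01 * age / 10 + 0.02 * exposure_minutes / 15)

def model_service(request_json: str) -> str:
    """Adapter layer: translate the common JSON wire format into the
    model's native inputs, and its output back into JSON."""
    payload = json.loads(request_json)
    score = risk_model(payload["age"], payload["exposure_minutes"])
    return json.dumps({"risk_score": round(score, 3)})

response = model_service('{"age": 60, "exposure_minutes": 60}')
print(response)  # → {"risk_score": 0.14}
```

Because every model behind such an adapter speaks the same wire format, the surrounding IT systems can swap or retrain models without changing their own integration code.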
If all goes well, the user is presented with a clear interface. They enter the relevant inputs – symptoms, workstation positioning, desired drug properties, etc – and the model runs and presents the resulting insight in an easy-to-understand way that the user is happy to act on.
- Bringing it all together for rapid results
Time can be saved by being laser-focussed on your end objective, reducing the time needed to find and manage relevant data.
But beyond that, rigour is always needed. Speed isn’t about cutting corners; it is about doing things right first time. Doing this quickly means efficient allocation of resources: data experts to handle the data, experienced modellers to correctly build the right model, and software engineers to develop the final software solution.
Critically it means freeing up the Covid-19 experts to focus on public health measures, developing drugs and diagnostics, or getting people safely back to work.
This article is based on Tessella’s whitepaper, COVID-19: Effective Use of Data and Modelling to Deliver Rapid Responses, developed with input from a range of modelling experts.