Data is a vital component in helping governments, healthcare organisations and other sectors battle the Covid-19 coronavirus pandemic.
But it’s important to remember that not all data is created equal and datasets are often incomplete. The players fighting the spread of the coronavirus are not aligned on what is being measured and compared. All of that contributes to uncertainty and inconsistency, which in turn compounds mistrust and fear.
Comparing apples and oranges and avoiding conflict
We’re all in this together. Yet some countries are finger-pointing and telling others that their numbers related to coronavirus infection rates and fatalities are wrong. It all boils down to how people collect data and on what they base measurements.
There are many ways a country could account for the number of coronavirus fatalities. It could just count anyone who died with coronavirus-like symptoms, but unless a person was tested, it’s unclear whether that individual succumbed to coronavirus directly. And even if a patient had the virus, the cause of death could have been coronavirus combined with something else.
It’s a classic problem, but it’s probably the first time we’re seeing it on a worldwide scale.
Part of the solution is for those who are measuring cases to come together and identify the similarities and differences in their approaches. That provides a fundamental layer of trust and alignment. Without it, it’s impossible to share numbers effectively.
Everybody in accounting knows this. You have to keep what you’re comparing comparable.
Understanding that not all data is created equal
Those who are using data to understand situations and make informed decisions also should understand that not all data is created equal. Inaccurate data is everywhere.
In business, customer relationship management (CRM) systems often contain inaccurate data because they rely on salespeople typing in notes. In the coronavirus fight, efforts that rely on coronavirus self-reporting can create inaccurate data because people may not tell the truth or they might misinterpret signals.
You might also get inaccurate data coming from automated systems. Say a country uses an automated system that connects with smartphones to check users’ temperatures. Maybe it’s warm where the person is and they are spending time in the sun, so that person’s temperature is elevated. Or maybe the person has symptoms that do not stem from the coronavirus. There are a hundred reasons why the measurements in automated solutions could display variability, leading to inaccurate data.
Data is an important strategic asset but only when we ask and answer the following questions: What does it mean? Where does it come from? Who knows something about it? Who is the owner? Who is a user and why? What sources are certified for which purposes?
Taking this approach enables leaders to ensure the data they are using can be trusted.
Data scoring can also help. This technique uses trained models to allow people to understand that one dataset is better – or more complete – than another dataset.
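As a rough illustration of comparing datasets by score, the sketch below uses a simple rule-based completeness score, not a trained model, so it is a stand-in for the technique rather than an implementation of it. The function name `completeness_score`, the field names and the sample records are all hypothetical.

```python
def completeness_score(rows, fields):
    """Share of expected field values that are actually filled in (0.0 to 1.0).

    Assumption: a value counts as missing when it is None or an empty string.
    """
    if not rows or not fields:
        return 0.0
    filled = sum(
        1
        for row in rows
        for field in fields
        if row.get(field) not in (None, "")
    )
    return filled / (len(rows) * len(fields))


# Hypothetical sample datasets, invented for this example.
fields = ["age", "test_result"]
dataset_a = [
    {"age": 34, "test_result": "positive"},
    {"age": 71, "test_result": "negative"},
]
dataset_b = [
    {"age": None, "test_result": "positive"},
    {"age": 52, "test_result": ""},
]

score_a = completeness_score(dataset_a, fields)
score_b = completeness_score(dataset_b, fields)
print(f"dataset_a: {score_a:.2f}, dataset_b: {score_b:.2f}")
```

A score like this only captures completeness. A trained scoring model could also weigh accuracy, freshness or source reliability.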
Addressing the challenge of incomplete data
The incompleteness of data is not unique to the efforts related to fighting the spread of coronavirus. This is a classic problem you run into with anything related to data.
Data profiling addresses incomplete data by identifying basic things. For example, it can spot that a dataset does not include ages of the patients or that 70% of the ages are missing.
Perhaps these details are missing due to privacy laws. But if you’re going to build a model for coronavirus that doesn’t include age information, particularly for the elderly, that model is likely to be overly optimistic compared with one that relies on datasets containing age details for the patients.
If you do use a data profiler, be sure to program it with relevant rules. If you don’t, it could create problems. For example, we all know there’s no such thing as a 200-year-old person or a minus-10-year-old person, but unless you set a rule for that, the data profiler will not know it.
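The profiling checks described above can be sketched in a few lines. This is a minimal illustration, not the behaviour of any particular profiling tool: the function `profile_ages`, the record layout and the 0 to 120 age range are assumptions made for the example.

```python
# Hypothetical patient records; None marks a missing age.
records = [
    {"patient_id": "p1", "age": 34},
    {"patient_id": "p2", "age": None},
    {"patient_id": "p3", "age": 200},
    {"patient_id": "p4", "age": -10},
    {"patient_id": "p5", "age": 71},
]


def profile_ages(rows, min_age=0, max_age=120):
    """Report how many ages are missing and which ones break the range rule.

    The rule (min_age <= age <= max_age) has to be supplied explicitly;
    without it the profiler has no way to know 200 or -10 is impossible.
    """
    total = len(rows)
    missing = [r["patient_id"] for r in rows if r.get("age") is None]
    invalid = [
        r["patient_id"]
        for r in rows
        if r.get("age") is not None and not (min_age <= r["age"] <= max_age)
    ]
    return {
        "total": total,
        "missing_share": len(missing) / total if total else 0.0,
        "missing_ids": missing,
        "invalid_ids": invalid,
    }


report = profile_ages(records)
print(f"{report['missing_share']:.0%} of ages missing")
print("ages failing the range rule:", report["invalid_ids"])
```

Keeping the range as parameters makes the rule visible and adjustable, which is the point of the paragraph above: the profiler only knows what it has been told.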
Using data matching to spot commonalities at scale
In today’s world, hundreds of hospitals are testing and treating people for coronavirus. If you had just two hospitals, data comparisons between them would be relatively easy. But once you get to 50, 100 or more datasets, the data becomes impossible to compare manually. Data matching can help.
Data matching identifies what is common among different datasets. In this example, what’s common among them might be patients who visited more than one of the hospitals.
Data matching can also show that one dataset contains details about 80% of the patients in the other datasets. That indicates this one dataset might be a good place to start. You can use it to form hypotheses and use the other datasets to validate them.
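A small sketch of this kind of matching follows. It assumes every hospital uses the same patient identifier, which real datasets often do not, so production matching usually needs fuzzier comparison on names, dates of birth and similar fields. The hospital names, patient IDs and the functions `shared_patients` and `coverage` are hypothetical.

```python
# Hypothetical datasets: each hospital maps to the set of patient IDs it holds.
hospitals = {
    "hospital_a": {"p1", "p2", "p3", "p4", "p5"},
    "hospital_b": {"p1", "p2", "p6"},
    "hospital_c": {"p3", "p4"},
}


def shared_patients(datasets):
    """Return patients that appear in more than one dataset, with the sources."""
    seen = {}
    for name, ids in datasets.items():
        for patient_id in ids:
            seen.setdefault(patient_id, set()).add(name)
    return {pid: names for pid, names in seen.items() if len(names) > 1}


def coverage(datasets, candidate):
    """Share of patients in all other datasets that also appear in `candidate`."""
    others = set().union(
        *(ids for name, ids in datasets.items() if name != candidate)
    )
    if not others:
        return 0.0
    return len(datasets[candidate] & others) / len(others)


shared = shared_patients(hospitals)
scores = {name: coverage(hospitals, name) for name in hospitals}
best = max(scores, key=scores.get)

print("patients seen at more than one hospital:", sorted(shared))
print(f"best starting dataset: {best} ({scores[best]:.0%} coverage)")
```

In this made-up example, `hospital_a` covers four of the five patients found in the other two datasets, so it is the natural starting point, with the other datasets held back for validation.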
Coronavirus is teaching people the importance of trusting in data to make the best decisions. I believe we’ll see countries that do this really well and bend the curve. Countries that do not do this as well, or use data incorrectly, are likely to be less successful at bending the curve.
Stan Christiaens is the co-founder and chief technology officer of Collibra, a data intelligence company that provides a cloud-based platform which enables IT and the business to build a data-driven culture for the digital enterprise.