Data is not neutral

Enterprise Applications Editor

This is a guest blogpost by Triveni Gandhi, Senior Data Scientist, Dataiku

We like to think of data and numbers as indisputable, but the reality is that every piece of information is a product of the context in which it was built. Data is inherently biased, and all too often it teaches AI systems to make critical business and ethical mistakes.

Take, for example, the situation that came to light with the Apple Card, when a prominent software developer discovered that he had received a much higher credit line than his wife, even though she had the higher credit score. After he tweeted his shock, other couples reported similar differences in credit limits, despite their shared assets and similar credit histories.

How did this blatant discrimination make it to production? It’s unlikely that the Apple team set out to build a model that would discriminate against women. More likely, the “black-box” algorithm that determined credit lines was trained on a set of data with inherent biases, which the AI application learned and reproduced.

What did the bias in this data look like? It’s likely that Apple used a set of data pertaining to historic credit lines for many individual borrowers and various information about them such as credit history, income levels, and more. You may think that this data should be seen as a reflection of the world, and as a result neutral or unbiased.

However, the world is rife with biases in existing lending practices, pay disparities, and access to credit. If the data used to train a model is reflecting the world, it is inherently reflecting those biases, and any AI system trained on this data will only reinforce them.

Confront the truths of AI

The good news is that researchers and ethicists are working together to develop techniques that can mitigate the negative impacts of biased data and generate more positive impacts through responsible AI. At the same time, many companies are moving to an AI strategy that is consciously designed and reflective of their intentions. Importantly, these organisations are building pipelines based on awareness of the relevant historical and social context that matters most.

They are also dedicated to data ingestion, cleaning and transformation, so that they can be confident their data is sufficient to build models that meet their explicitly stated goals. Beyond this, they follow responsible practices in the development, implementation, monitoring and testing of their AI products.

Some of the biggest challenges in responsible AI and the best, most granular approaches to maintaining it have been seen in life sciences, particularly in the healthcare industry, where there have been many examples of medical biases replicated in algorithms.

For example, in 2019, an algorithm created and sold by a leading health services company in the US underestimated the health needs of the black patients with the poorest health. The algorithm, which was designed to predict which patients would benefit from extra medical care, ended up nearly always deeming white patients more at risk, and in need of extra medical attention.

What the algorithm didn’t consider was that in the US, healthcare costs are closely related to race: people of colour consistently access healthcare less frequently and, in turn, incur lower costs. A model that treats cost as its measure of need will therefore score them as healthier than they are. In this instance, the developers of the algorithm may have thought they were being equitable by omitting race as a variable, when in reality they were only making bias more automated and more efficient. It’s worth noting that this wasn’t an isolated incident, and it’s quite likely this wasn’t the only company that had created a model that focused on cost and omitted race.
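To make the mechanism concrete, here is a minimal sketch in Python of how ranking on a proxy such as cost can skew who gets selected, even when the sensitive attribute is never used as an input. Everything in it is an assumption for illustration: the records are made up, and the names `patients`, `top_k_ids` and `groups_of` are hypothetical. It is not a reconstruction of the algorithm described above.

```python
# Toy, made-up records for illustration only; not real patient data.
# "need" is the quantity we actually care about; "cost" is the proxy being ranked on.
patients = [
    {"id": 1, "group": "X", "need": 8, "cost": 8000},
    {"id": 2, "group": "Y", "need": 7, "cost": 5000},
    {"id": 3, "group": "X", "need": 5, "cost": 5500},
    {"id": 4, "group": "Y", "need": 9, "cost": 5200},
]


def top_k_ids(records, score_key, k):
    """Return the ids of the k records with the highest value of score_key."""
    ranked = sorted(records, key=lambda r: r[score_key], reverse=True)
    return [r["id"] for r in ranked[:k]]


def groups_of(records, ids):
    """Look up the group of each selected id (used only for auditing the result)."""
    by_id = {r["id"]: r["group"] for r in records}
    return [by_id[i] for i in ids]


# The group column is never used to rank, only to inspect the outcome afterwards.
selected_by_cost = top_k_ids(patients, "cost", 2)
selected_by_need = top_k_ids(patients, "need", 2)

print("ranked on cost:", groups_of(patients, selected_by_cost))
print("ranked on need:", groups_of(patients, selected_by_need))
```

In this toy data, ranking on cost selects only patients from group X, while ranking on need selects one patient from each group. The group label plays no part in the ranking itself, which is the point: leaving a variable out does not remove its influence on the proxy.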

What can we do?

In life sciences in particular, companies are now working differently to find biases in their data before they begin modelling so that they can do everything possible to avoid replicating bias.

This often starts at the point of data collection, because data collection itself may be where bias begins. To investigate this, organisations are examining their datasets using datasheets, which consist of a series of questions such as: how was this data collected, when was it collected, what key groups may be missing from it, or simply, is this data really reflective of reality as we want to know it?
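As a rough illustration, the questions above could be captured in a small structure that travels with the dataset and flags what has not yet been answered. This is a minimal sketch in Python, not a standard datasheet schema: the `Datasheet` class, its field names and the example answers are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class Datasheet:
    """Hypothetical, minimal datasheet; one field per question in the text."""

    collection_method: Optional[str] = None     # How was this data collected?
    collection_period: Optional[str] = None     # When was it collected?
    missing_groups: Optional[List[str]] = None  # Which key groups may be missing?
    reflects_reality: Optional[str] = None      # Is it reflective of reality?

    def unanswered(self) -> List[str]:
        """Return the names of the questions that have no answer yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Made-up answers, purely for illustration.
sheet = Datasheet(
    collection_method="Exported from a loan application system (hypothetical)",
    missing_groups=["applicants without a prior credit file"],
)

print(sheet.unanswered())
```

Using `None` for "not yet answered" keeps an empty list available as a legitimate answer to the missing-groups question, so a team can distinguish "we checked and found none" from "nobody has looked".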

Datasheets for datasets were proposed in 2018 in a research paper published on arXiv, the preprint server hosted by Cornell University. Its authors recognised that the machine learning community didn’t have a standardised process for documenting datasets. They argued that datasheets would “facilitate better communication between dataset creators and dataset consumers, and encourage the machine learning community to prioritise transparency and accountability.”

Using datasheets is becoming an increasingly common and standardised way to approach data, especially in the life sciences, banking, retail and insurance industries. Datasheets promote transparency about possible sources of bias, which in turn encourages responsible AI practice around composition, labelling, pre-processing, intended use and distribution. These industries are also applying Exploratory Data Analysis (EDA), which helps teams to search for underlying biases in data. EDA can help teams to better understand and summarise their data samples, to visualise the structure of a dataset, and to draw well-founded conclusions about the underlying population a dataset represents.
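One simple EDA check along these lines is to compare how well each group is represented in a dataset and how an outcome differs between groups. The sketch below uses only the Python standard library and is an illustration under stated assumptions: the function name `summarise_by_group`, the column names and the records are all hypothetical, and a real analysis would go much further than counts and means.

```python
from collections import defaultdict
from statistics import mean


def summarise_by_group(rows, group_key, value_key):
    """Return {group: {"count", "share", "mean"}} for one numeric column."""
    values = defaultdict(list)
    for row in rows:
        values[row[group_key]].append(row[value_key])
    total = sum(len(v) for v in values.values())
    return {
        group: {"count": len(v), "share": len(v) / total, "mean": mean(v)}
        for group, v in values.items()
    }


# Toy, made-up records purely for illustration; not real lending data.
rows = [
    {"group": "A", "credit_limit": 10000},
    {"group": "A", "credit_limit": 12000},
    {"group": "A", "credit_limit": 14000},
    {"group": "B", "credit_limit": 6000},
]

summary = summarise_by_group(rows, "group", "credit_limit")
for group, stats in summary.items():
    print(group, stats)
```

A skewed share suggests a group may be under-represented, and a large gap in means is a prompt to ask why before any model is trained. Neither number proves bias on its own; both are starting points for the questions a datasheet asks.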

Overall, whilst there is no silver bullet for responsible AI, it is entirely possible if organisations start from the right place. If we work to question our data at all times in order to minimise and eventually eliminate unconscious bias, we can improve both new and existing AI lifecycles and mitigate the risks of misuse and unintended consequences.