Beware the 1% view of data science

This is a guest blogpost by Shaun McGirr, AI Evangelist, Dataiku

As data science and AI become more widely used, two separate avenues of innovation are becoming clear. One avenue, written about and discussed publicly by individuals working at Google, Facebook and peer companies, depends on access to effectively infinite resources.

This generates a problem for further democratisation of AI: success stories told by the top echelon of data companies drown out the second avenue of innovation. There, smaller-scale data teams deliver stellar work in their own right, without the benefit of unlimited resources, and they deserve a share of the glory too.

One thing is certain: a whole class of legacy IT issues doesn’t plague global technology companies at anywhere near the scale it plagues traditional enterprises. Some of these technology companies even staff entire data engineering teams to deliver ready-for-machine-learning data to data scientists, which is enough to make the other 99% of data scientists in the world salivate with envy.

Access to the right data, in a reasonable time frame, is still a top barrier to success for most data scientists in traditional companies, and so the 1% served by dedicated data engineering teams might as well be from another planet!

“Proudly analogue companies need to go on their own data journey on their own terms,” said Henrik Göthberg, Founder and CEO of Dairdux, on the AI After Dark podcast. This highlights that what is right and good for the 1% of data scientists working at internet giants is unlikely to work for those having to innovate from the ground up, with limited resources. This 99% of data scientists must extract data, experiment, iterate and productionise all by themselves, often with inadequate tooling they must stitch together themselves based on the research projects of the 1%.

For example, one European retailer spent many months developing machine learning models written in Python (.py files) and run on the data scientists’ local machines. But eventually, the organisation needed a way to prevent interruptions or failure of the machine learning deployments.

As a first solution, they moved these .py files to Google Cloud Platform (GCP), and the outcome was well received by the business and technical teams in the organisation. However, once the number of models in production grew from one to three or more, the team quickly realised the burden involved in maintaining them. There were too many disconnected datasets and Python files running on the virtual machine, and the team had no way to check or stop the machine learning pipeline.

Beyond these data scientists doing the “hard yards” to create value in traditional organisations, there is also the latent data population — capable but hidden away — who have real-world problems to solve but who are even further from being able to directly leverage the latest innovations. If these people can be empowered to create even a fraction of the value of the 1% of data scientists, their sheer number would mean the total value created for organisations and society would massively outweigh the latest technical innovations.

Achieving this massive scale, across many smaller victories, is the real value of data science to almost every individual and company.

Organisations don’t need to “be a Facebook” to get started on an innovative and advanced data science or AI project. There is still a whole chunk of the data science world (and its respective innovations) that is going unseen, and it’s time to give this second avenue of innovation its due.
