The data science toolkit (DST) is fast becoming the secret sauce behind many of the world’s top businesses. One of the early leaders in using data is Tesco, which integrated Met Office data to ensure it had the right stock in the best locations, based on weather conditions.
Another example is e-commerce retailer Yoox, which runs websites for luxury brands such as Armani, Diesel and Dolce & Gabbana. Yoox analyses its 15 years’ worth of transactional data to provide insights into shopping behaviour, which it then utilises to shape its delivery systems and, more importantly, its business strategy.
For many businesses, access to real-time information from their DSTs offers real benefits. For example, The Guardian website uses Elastic – previously Elasticsearch – in its in-house analytics system Ophan, which ensures web content is presented properly and reaches its five million readers.
Ophan processes 40 million documents a day to decide what hyperlinks to place in articles. It attempts to give each article the right exposure at the right time on social media platforms and also diagnoses website performance issues.
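Ophan's actual ranking logic is not public, but the idea of giving each article "the right exposure at the right time" can be illustrated with a toy attention score – here, pageviews decayed by article age. All names, values and the half-life parameter are illustrative assumptions, not the real system:

```python
def score_article(pageviews: int, published_ts: float, now: float,
                  half_life: float = 6 * 3600) -> float:
    """Toy attention score: raw pageviews decayed by article age.

    The half-life (6 hours here) is an arbitrary illustrative choice.
    """
    age = max(now - published_ts, 0.0)
    return pageviews * 0.5 ** (age / half_life)

def pick_promotions(articles, now, k=2):
    """Return the ids of the k articles with the highest decayed score."""
    ranked = sorted(
        articles,
        key=lambda a: score_article(a["views"], a["published"], now),
        reverse=True,
    )
    return [a["id"] for a in ranked[:k]]
```

A fresh article with modest traffic can outrank an older one with more total views – the kind of time-sensitive trade-off a real-time DST has to make continuously.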
What is a DST?
The first challenge is to define DST. The software industry is happy to label absolutely anything that even vaguely sniffs data as a DST. Data warehouse, data acquisition, data integration and data visualisation systems are all variously called DSTs.
Such tools are also widely used by big data companies, although large volumes of data are not necessary to benefit from a DST. For the purposes of this article, data science tools are defined as toolsets that enable businesses to take any sort of data from a number of different sources and allow that data to be manipulated to find a particular answer or answers.
A typical DST is not standalone. It is used alongside standard extract transform load (ETL) tools to acquire and clean data, and summarising tools that visualise the information.
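As a minimal sketch of that pipeline – extract raw data, clean it in a transform step, and load it somewhere queryable – the following uses a toy CSV feed and an in-memory SQLite store. Both the data and the schema are illustrative assumptions, not any particular ETL product's API:

```python
import csv
import io
import sqlite3

RAW = """date,store,units
2016-03-01,Leeds,12
2016-03-01,,9
2016-03-02,York,
2016-03-02,Leeds,15
"""

def extract(text):
    """Extract: parse the raw feed into rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with missing fields, cast units to int."""
    return [
        (r["date"], r["store"], int(r["units"]))
        for r in rows
        if r["store"] and r["units"]
    ]

def load(rows):
    """Load: insert the cleaned rows into a queryable store."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sales (date TEXT, store TEXT, units INTEGER)")
    db.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return db
```

Note that two of the four raw rows are discarded in the transform step – a small-scale echo of why data cleaning dominates the project timings quoted below.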
The DST is a relatively small part of the process from data to actionable insight. According to US data analytics specialist Datascience, businesses should expect the data acquisition phase of a DST project to take 25% of the project time, data cleaning up to 35% and summarising 20%.
Particular use case
Implementation can be speeded up by choosing a DST that is specific to a sector or a particular use case – such as Fuzzy Logix for campaign management, Adara’s Magellan platform for travel, Lithium Technologies for social media and Lumiata’s MedicalGraph in healthcare – rather than a more general-purpose DST, because much of the tweaking and development work has already been done for the user.
But whatever you choose, it isn’t something you will be able to build quickly, and your current team may lack the necessary skills. Various factors are driving the current huge interest in data science and its associated toolsets:
Cloud platforms enable businesses to cut the cost and time it takes to crunch huge datasets;
Wide adoption of open source standards for DST development;
New agile and DevOps methodologies have encouraged and supported rapid change;
A changing competitive landscape – any user, large or small, can now access the same toolsets. The only differentiator is data and how fast the insight can be extracted from it.
To get the best from DSTs, they must be embraced throughout the business. The C-suite has to understand the benefits of DSTs and move away from regarding them as “fiddling around with data” to tech that will give their business an edge.
One way to combat this reluctance is to look for any quick wins that can be gained and publicised around the business early on.
There is no need to wait until every last piece of data is cleansed to see big trends, such as “people in North Yorkshire prefer brown shoes to black” or “the most popular time for shopping is a Friday lunchtime”, or to add “people who liked this also liked this” functionality.
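A basic version of that “people who liked this also liked this” functionality can be built from raw transaction baskets with simple pair counting, well before the data is fully cleansed. This is a toy sketch with made-up items, not any retailer’s actual recommender:

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how often each pair of items appears in the same basket."""
    pairs = Counter()
    for basket in baskets:
        # sorted() gives each pair a canonical (a, b) ordering
        for a, b in combinations(sorted(set(basket)), 2):
            pairs[(a, b)] += 1
    return pairs

def also_liked(item, pairs, k=3):
    """The k items most often bought alongside `item`."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [i for i, _ in scores.most_common(k)]
```

Even this crude co-occurrence count surfaces usable suggestions from uncleaned transaction logs – an example of the quick wins worth publicising early.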
The business also needs to be able to react to the results provided at all levels and all platforms, from management down to IT systems.
Managers need to know what questions they can ask of their DST systems, and tests are needed to measure what incremental gains are being made and which particular initiative succeeded in moving the needle.
Integrated with ERP
The DST needs to be fully integrated with enterprise resource planning (ERP) systems, so stock can be shifted and orders placed at short notice – for example, in reaction to wet weather or a change in social media sentiment. It also needs to be connected to e-commerce systems, so that “offers” and “wet weather specials” can appear online, and linked into the social media output and to marketing.
The business must also identify where data is held. Often, the biggest problem is not ingesting the data or the data quality, but knowing that the data exists. The organisation should identify any potential third-party sources of data, such as Met Office forecasts, social media sentiment, and so on.
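Once a third-party source such as a weather feed has been identified, combining it with internal data is often a simple keyed join. A minimal sketch, assuming both feeds have already been pulled down as date-keyed dictionaries (the real Met Office feed would need an API client; the figures here are invented):

```python
# Hypothetical daily unit sales and rainfall, keyed by ISO date.
sales = {"2016-03-01": 120, "2016-03-02": 95, "2016-03-03": 180}
rain_mm = {"2016-03-01": 0.0, "2016-03-02": 12.5, "2016-03-03": 8.0}

def sales_by_weather(sales, rain_mm, wet_threshold=1.0):
    """Average units sold on wet vs dry days, joining the two feeds on date."""
    wet, dry = [], []
    for date, units in sales.items():
        if date not in rain_mm:
            continue  # the join also reveals gaps in either feed
        (wet if rain_mm[date] >= wet_threshold else dry).append(units)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {"wet": avg(wet), "dry": avg(dry)}
```

The interesting work is rarely the join itself but knowing the two datasets exist and share a usable key – which is the point of the data-inventory exercise above.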
Read more about data science toolkits
Predictive analytics can give marketing campaigns a head start, but business stakeholders may need convincing.
There is huge demand for data scientists – but rather than build skills in-house, analyst Forrester suggests CIOs can outsource to specialist providers.
There are plenty of suppliers on the market that give themselves the DST label. Fewer suppliers offer true DST, but there is still a wide range of products available to suit any organisation, whatever its budget.
One of the disruptive players in the market is Pentaho. Founded in 2004 and bought by Hitachi Group in 2015, it has 1,500 paying customers worldwide, including Zalando, BT and Caterpillar.
Pentaho had a major reboot in 2013 and is now aggressively targeting the data analytics market with a mix of on-premise and cloud-based offerings aimed at a wide range of businesses.
Another company making a mark in the DST space is Swedish business Qlik, formed in 1993. The Qlik analytics platform is supported by a large app market and has 35,000 customers worldwide. Qlik was recently bought by a private equity firm for about $3bn.
Also vying for a piece of the DST market are established data players such as SAS, Informatica, Teradata and IBM, along with products such as Statistica, Oracle’s Hyperion and SAP’s Hana.
IBM offers multiple products, including its artificial intelligence (AI) system Watson, Cognos and SPSS Statistics, which are constantly evolving.
SPSS started life as a package for social sciences, education and government, but has now moved into health sciences and marketing applications. It is used for predictive modelling and data mining as well as big data analytics.
For an example of what the future of data science looks like, the first few minutes of the recent movie Jason Bourne give a perfect illustration. In the film, CIA operators fire off natural language questions and see real-time images and visualisations culled from multiple datasets, including passports, CCTV images and data posted on social media.
The new part of this scenario – and what will begin appearing in DST this year – is the use of artificial intelligence to reduce the skills gap that is holding back DST implementations.
By using AI, data scientists no longer need to spend weeks programming in Python, but can simply ask the right questions using natural language and get the answers they are seeking.
Although Jason Bourne is a Hollywood representation of the use of technology, the real CIA is actually one step ahead of its screen image.
The agency recently launched a new department called the Directorate of Digital Innovation, which aims to use DST for “anticipatory intelligence” to discover insights pointing to future events that may need CIA involvement.
Marcus Austin is an analyst at Quocirca.