Data and machine learning ecosystems change as organisations scale. What works for a large company differs from what works for a startup, as the experience of online travel site Booking.com demonstrates.
Speaking at the Digital Transformation Week conference in Amsterdam in September, Sanchit Juneja, director-product of the firm's data science and machine learning platform, presented an ideal data ecosystem for a large tech company and showed how this differs from the data ecosystem of a startup. He used a layered description of all the data processing activities that need to take place in any company that has to process a lot of data and apply machine learning tools to maintain a competitive advantage.
In a big tech organisation, there are various data sources, he explained. These can be separated by vertical product groups that consumers interact with – for example, flights, attractions or hotels. This layer of processing is called the data formative layer. At this level, a user performs an action on the data – and, based on that action, the data is created. A decision is then made as to how the data will be formatted for downstream processing.
The data flows from the formative layer into a DataOps layer, which is a very new concept in the industry. At this layer, DevSecOps principles, such as version control with Git, are applied to data pipelines. This layer also feeds information about how the data will be used downstream back to the formative layer, where the data is produced.
From the DataOps layer, the data flows into a data aggregation layer, where it can be processed as a transaction, or it can be used for analytical decision-making. In the first case, the data is treated by a set of processes called online transactional processing (OLTP); in the second case, it is treated by a second set of processes, called online analytical processing (OLAP).
For transactional processing, e-commerce platforms might be used. For analytical processing, big data platforms are used. In a typical startup, this distinction doesn’t exist – one platform does both the transactional processing and the analytics. Only larger organisations can afford to make the distinction between the two types of system.
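The difference between the two kinds of processing can be illustrated with a minimal sketch. It assumes a single in-memory SQLite database standing in for both systems, much as a startup might run both workloads on one platform; the `bookings` table and the function names are hypothetical and do not come from Juneja's presentation.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a hypothetical bookings table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bookings ("
        "id INTEGER PRIMARY KEY, vertical TEXT NOT NULL, amount REAL NOT NULL)"
    )
    return conn


def record_booking(conn, vertical, amount):
    """OLTP-style work: one small write, committed as a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT INTO bookings (vertical, amount) VALUES (?, ?)",
            (vertical, amount),
        )
    return cur.lastrowid


def revenue_by_vertical(conn):
    """OLAP-style work: a read-only aggregate across every stored row."""
    rows = conn.execute(
        "SELECT vertical, SUM(amount) FROM bookings "
        "GROUP BY vertical ORDER BY vertical"
    )
    return dict(rows.fetchall())


conn = open_store()
record_booking(conn, "hotels", 120.0)
record_booking(conn, "flights", 80.0)
record_booking(conn, "hotels", 30.0)
print(revenue_by_vertical(conn))
```

In a larger organisation, the write path and the aggregate query would typically run on separate systems, each tuned for its own workload.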
After the data is stored at the data aggregation layer, the data consumption part begins. If the data is being used for machine learning applications, part of this layer is called MLOps, which is a hot area with a lot of different tools being applied – Pachyderm, for example.
Some big organisations, such as Uber and Amazon, built their own MLOps layer – and what they built was so good that they are now selling it. Amazon calls its platform SageMaker; Uber calls its platform Michelangelo. Both are available as software as a service (SaaS) for smaller companies.
The data aggregation layer involves several sets of activities. One group of product managers is concerned with data protection, another works on how data is stored, and a further manager looks at how data is presented.
The next layer is the applicative layer, where there are two major kinds of application, the first of which is machine learning. These activities are often driven by machine learning managers, which is a new kind of job.
“At Booking.com, here is where we look at what users searched for on different pages,” said Juneja. “Let’s say you looked for hotels in Amsterdam. Next time you come around, I can give you a deal on hotels in Amsterdam.”
Many of the big breakthroughs are made at this layer if it is done right, and if there is a robust data ecosystem. Here, pharmaceutical companies are working on drug discovery using artificial intelligence (AI) and automotive companies are developing self-driving cars.
The second kind of application at the applicative layer is analytical and does not involve machine learning. These activities might be driven by data managers, but sometimes they are managed by data product managers, another new job title.
“Let’s say we have one billion orders in a given week,” said Juneja. “We may want to analyse how many people are booking in Europe, or how many people are booking in Southeast Asia. We can look at where we can apply more discounts, for example.”
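The kind of question Juneja describes might look like the following sketch. The order records and the `region` field are made up for illustration; a real analysis would run against a big data platform rather than a Python list.

```python
from collections import Counter


def bookings_by_region(orders):
    """Count orders per region; each order is a dict with a 'region' key."""
    return Counter(order["region"] for order in orders)


def region_shares(orders):
    """Return each region's share of all orders as a fraction of the total."""
    counts = bookings_by_region(orders)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: n / total for region, n in counts.items()}


# Hypothetical sample of one week's orders.
orders = [
    {"region": "Europe"},
    {"region": "Europe"},
    {"region": "Southeast Asia"},
    {"region": "Europe"},
]
print(bookings_by_region(orders))
print(region_shares(orders))
```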
Analytics is post-hoc analysis, meaning timing is not critical. By contrast, machine learning is applied ad hoc. Booking.com wants to get the user to behave in a certain way in near-real time – while they are interacting with the company.
Another layer is the ecosystem-observability layer, which every tech-aware company needs. This is where the ecosystem can be monitored and misalignments managed. A set of tools might be applied to analyse how well the data pipeline is being used. One such tool is Monte Carlo.
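As a rough illustration of what monitoring at this layer can involve, the sketch below runs two simple checks on a pipeline: whether the data is fresh and whether enough rows arrived. It is not the API of Monte Carlo or any other product; the function name, thresholds and issue labels are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def check_pipeline(last_loaded_at, row_count, now,
                   max_age=timedelta(hours=1), min_rows=1):
    """Return a list of issue labels for one pipeline run (empty if healthy)."""
    issues = []
    # Freshness: flag the pipeline if the last load is older than max_age.
    if now - last_loaded_at > max_age:
        issues.append("stale")
    # Volume: flag the pipeline if fewer rows arrived than expected.
    if row_count < min_rows:
        issues.append("low_volume")
    return issues


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(check_pipeline(now - timedelta(minutes=5), 100, now))
print(check_pipeline(now - timedelta(hours=3), 0, now))
```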
Juneja pointed out that the model he presented is an ideal scenario that describes what big tech companies aim to do to make the most of machine learning. The biggest impact is at the applicative layer, so this is where a company should put its priorities. What happens at the applicative layer also depends on how robust the preceding layers are.
Juneja’s presentation helped to expose the growing divide between established players and startups when it comes to harnessing the power of data. For startups and companies that are just beginning to scale, much of the ideal data ecosystem is out of reach. According to Juneja, this gap is being filled by off-the-shelf products and a thriving SaaS ecosystem.
Others might disagree. It takes more than just software to monetise data. It takes a team of experts – and very few small companies have that luxury.