The race to the cloud among enterprises has been putting pressure on DevOps teams for some time now. DataOps applies similar thinking to data: it is being used as a way to deliver new data models and test data more quickly, supporting the pace at which organisations are building out data-driven initiatives.
Just as DevOps is used to accelerate the building, testing and deployment of software applications, DataOps is being used to accelerate the speed with which data models are built, tested and deployed. In doing so, organisations can shorten the time it takes to derive value from the customer data they collect.
Thibaut Gourdel, technical product manager at Talend, says: “DataOps is a new approach, driven by the advent of machine learning and artificial intelligence. The growing complexity of data and the rise of needs for data governance and ownership are huge drivers in the emergence of DataOps. Data must be governed, stored in specific datacentres, and organisations should know who has access to data, which data and who owns it.”
More sophisticated analytics
DataOps concentrates on the creation and curation of a central data hub, repository and management zone designed to collect, collate and then distribute application data and data models. The concept hinges on the proposition that metadata-level application analytics can be propagated and democratised across an entire organisation’s IT stack. This then allows more sophisticated layers of analytics to be brought to bear.
As Tamr database guru Andy Palmer puts it: “DataOps acknowledges the interconnected nature of data engineering, data integration, data quality and data security/privacy. It helps an organisation rapidly deliver data that accelerates analytics and enables previously impossible analytics.”
DataOps is not a product. Rather, it is a methodology and an approach. As such, it has its theorists, its naysayers and its fully paid-up, card-carrying believers. Some argue that DataOps provides the means to deliver data and data models for continuous testing with version control.
George Miranda, DevOps advocate at PagerDuty, a provider of digital operations management, says: “The goal of DataOps is to accelerate time to value where a ‘throw it over the wall’ approach existed previously. For DataOps, that means setting up a data pipeline where you continuously feed data into one side and churn that into useful results.”
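The pipeline Miranda describes can be sketched in a few lines of Python. The stage names (`ingest`, `transform`, `publish`) and record fields here are purely illustrative assumptions, not taken from any particular tool:

```python
# A toy DataOps-style pipeline: records flow in one side and
# emerge as useful results on the other. Each stage is a
# generator, so new data can be fed through continuously.

def ingest(records):
    """Accept raw records as they arrive."""
    for record in records:
        yield record

def transform(records):
    """Clean each record: drop entries without a customer id."""
    for record in records:
        if record.get("customer_id") is not None:
            yield {**record, "spend": float(record["spend"])}

def publish(records):
    """Churn cleaned records into a result: total spend per customer."""
    totals = {}
    for record in records:
        key = record["customer_id"]
        totals[key] = totals.get(key, 0.0) + record["spend"]
    return totals

raw = [
    {"customer_id": "a", "spend": "10.0"},
    {"customer_id": None, "spend": "99.0"},  # rejected by the transform stage
    {"customer_id": "a", "spend": "5.0"},
    {"customer_id": "b", "spend": "7.5"},
]
results = publish(transform(ingest(raw)))
print(results)  # {'a': 15.0, 'b': 7.5}
```

Because the stages are composed rather than hard-wired, a new cleaning rule or output format slots in without dismantling the rest of the pipeline.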
Making it easier for people to work with data is a key requirement in DataOps. Nigel Kersten, vice president of ecosystem engineering at Puppet, says: “The DataOps movement focuses on the people in addition to processes and tools, as this is more critical than ever in a world of automated data collection and analysis at a massive scale.”
DataOps practitioners (DataOps engineers or DOEs) generally focus on building data governance frameworks. A good data governance framework – one that is fed and watered regularly with accurate de-duplicated data that stems from the entire IT stack – is able to help data models to evolve more rapidly. Engineers can then run reproducible tests using consistent test environments that ingest customer data in a way that complies with data and privacy regulations.
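A minimal sketch of the kind of reproducible check such a governance framework might run before data reaches a model. The de-duplication key and the privacy rule (no raw email addresses) are illustrative assumptions:

```python
# Reproducible data-quality checks: de-duplication plus a simple
# privacy rule, of the sort a governance framework might enforce
# on every test environment before data is ingested.

def deduplicate(rows, key="customer_id"):
    """Keep the first row seen for each value of `key`."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

def check_privacy(rows, banned_fields=("email",)):
    """Return True only if no row carries a banned field."""
    return all(field not in row for row in rows for field in banned_fields)

rows = [
    {"customer_id": 1, "region": "UK"},
    {"customer_id": 1, "region": "UK"},  # duplicate, removed
    {"customer_id": 2, "region": "FR"},
]
clean = deduplicate(rows)
assert len(clean) == 2
assert check_privacy(clean)
assert not check_privacy([{"customer_id": 3, "email": "x@example.com"}])
```

Running the same checks against every environment is what makes the tests reproducible: two data scientists ingesting the same source get the same cleaned dataset.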
The end result is a continuous and virtuous develop-test-deploy cycle for data models, says Justin Reock, chief architect at Rogue Wave, a Perforce Company. “At the core of all modern business, code is needed to transport, analyse and arrange domain data,” he says. “This need has given rise to entirely new software disciplines, such as enterprise federation, API-to-API [application programming interface] communication, big data and big data analytics, stream processing, machine learning and data science.
“As the complexity and scale of these applications expand, as is often the case in sophisticated environments, the need for convergence arises. We must be able to reconcile data security, integrity, accessibility and organisation into a single mode of thought – and that mode of thought is DataOps.”
It is important to remember that data has a lifecycle. The data model resulting from a diligent DataOps process will have an appreciation for the entire data lifecycle.
Some data is new, raw, unstructured and potentially quite peripheral; other data may be live, current and possibly mission-critical, while there will always be data that is effectively redundant or needs to be retired. Other types of data may simply be inaccessible due to policy access control or system incompatibility.
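The lifecycle categories above can be made concrete with a small tagging sketch. The stage names and the classification rules here are assumptions for illustration, not a standard:

```python
from enum import Enum

# Illustrative lifecycle stages drawn from the categories above.
class Stage(Enum):
    RAW = "new, unstructured, peripheral"
    LIVE = "current, possibly mission-critical"
    RETIRED = "redundant, to be archived or deleted"
    INACCESSIBLE = "blocked by policy or system incompatibility"

def classify(record):
    """Toy rule-based classifier for a record's lifecycle stage."""
    if record.get("access_blocked"):
        return Stage.INACCESSIBLE
    if record.get("redundant"):
        return Stage.RETIRED
    if record.get("structured"):
        return Stage.LIVE
    return Stage.RAW

assert classify({"structured": True}) is Stage.LIVE
assert classify({"access_blocked": True}) is Stage.INACCESSIBLE
assert classify({}) is Stage.RAW
```

Tagging data this way lets a DataOps process treat each stage differently, for example excluding retired or inaccessible data from model training.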
Mitesh Shah, senior technologist at MapR, says: “By providing a comprehensive open approach to data governance, organisations can operate a DataOps-first methodology where teams of data scientists, developers and other data-focused roles can train machine learning models and deploy them to production. DataOps development environments foster agile, cross-functional collaboration and fast time-to-value.”
DataOps helps to address some of the inefficiencies in data science. In an interview with Computer Weekly, Harvinder Atwal, head of data strategy and advanced analytics at MoneySuperMarket.com, explained the problem with data science investments.
Speaking at a data science popup event in London, Atwal described a common problem where data scientists have to request data access from IT, then need to negotiate with IT for the required compute resources, and have to wait for these resources to be provisioned. Further calls to IT will inevitably be required to install the set of tools required to build and test data models.
In a DataOps context, enabling the rapid creation and destruction of environments for the collection, modelling and curation of data requires automation and must acknowledge that just like developers, data scientists are not infrastructure admins, says Brad Parks, vice-president of business development at Morpheus.
Jitendra Thethi, assistant vice-president of technology at Altran, points out that data scientists and data managers can learn a lot from DevOps by moving to a model-driven approach for data governance, data ingestion and data analysis.
The right automation and orchestration platform can enable DataOps self-service, whereby data scientists can request a dataset, stand up the environment to utilise that dataset, then tear down that environment without ever talking to IT operations.
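The request/stand-up/tear-down cycle can be sketched as a Python context manager. The environment and dataset names are purely illustrative, and a real platform would provision actual compute rather than a dictionary entry:

```python
from contextlib import contextmanager

# Toy model of DataOps self-service: an environment is stood up
# on entry and torn down on exit, with no manual call to IT operations.
PROVISIONED = {}

@contextmanager
def data_environment(dataset):
    env_id = f"env-{dataset}"
    PROVISIONED[env_id] = {"dataset": dataset, "status": "up"}  # stand up
    try:
        yield PROVISIONED[env_id]
    finally:
        PROVISIONED[env_id]["status"] = "torn down"             # tear down

with data_environment("clickstream-sample") as env:
    assert env["status"] == "up"
    # ... train and evaluate a model against env["dataset"] here ...

assert PROVISIONED["env-clickstream-sample"]["status"] == "torn down"
```

The design point is that tear-down is guaranteed even if the work inside fails, so abandoned experiments do not leave orphaned environments behind.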
Thethi says this enables data scientists to manage data and data models using a version control system, enforced by an automated database system.
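The core idea behind version-controlling data can be sketched with content hashing, much as tools in this space do internally. This is a standard-library sketch, not the mechanism of any particular product:

```python
import hashlib
import json

# Sketch: version a dataset by hashing its canonical serialisation,
# so any change to the data yields a new, reproducible version id.
def dataset_version(rows):
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = dataset_version([{"customer_id": 1, "spend": 10.0}])
v2 = dataset_version([{"customer_id": 1, "spend": 12.5}])  # data changed
assert v1 != v2
assert v1 == dataset_version([{"customer_id": 1, "spend": 10.0}])  # reproducible
```

Storing such ids alongside model code in version control ties every trained model to the exact data it was trained on.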
Containerisation provides a neat way to encapsulate the operational environment a data scientist needs, together with all the relevant software libraries and datasets required to test the data model being developed.
Tim Mackey, senior technical evangelist at Synopsys, says: “Data scientists may create an experimental model which is deployed in containerised form. As they refine their model, deployment of the updated model can be quickly performed – potentially while leaving the previous model available for real-time comparison. As their model proves itself, they can quickly scale underlying resources seamlessly, confident that each node in the model is identical to its peers, both in function and performance.”
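A minimal sketch of such an encapsulated environment. The base image, file names and entry point are illustrative assumptions:

```dockerfile
# Illustrative container image bundling a model, its pinned
# libraries and a test dataset, so every deployed node is identical.
FROM python:3.7-slim
WORKDIR /app
COPY requirements.txt .            # pinned library versions
RUN pip install --no-cache-dir -r requirements.txt
COPY model/ ./model/               # the experimental model
COPY data/sample.csv ./data/       # dataset for reproducible tests
CMD ["python", "-m", "model.serve"]
```

Refining the model then becomes a rebuild and redeploy of the image, while the previous image can keep running for the side-by-side comparison Mackey describes.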
A number of so-called data science platforms are starting to emerge that support DataOps. Domino Data Lab is the one MoneySuperMarket.com has deployed, and Atwal says it offers a way to provide self-service for its data scientists to work.
Rogue Wave’s Reock believes DataOps, when combined with modern data analytics practices and emerging machine learning technologies, can help organisations to prepare for the coming surge in data-driven business models.
The growth in the use of data to improve decision-making, such as applying advanced analytics to internet of things (IoT) sensor streams, is likely to dwarf, by orders of magnitude, the already astronomical amount of data now being generated.
This is likely to lead to greater emphasis on the management of data models and test data, which means DataOps will have an increasingly important role.
Will Cappelli, CTO and global vice-president of product strategy at Moogsoft, says DevOps teams and data scientists should learn how to work together more effectively. “DevOps professionals are all too often impatient,” he says. “They don’t want to wait for the results of a rigorous analysis, whether it is carried out by humans or by algorithms. Data scientists can be overly fastidious – particularly those coming from maths, rather than computer science.
“The truth is, though, that DevOps needs the results of data science delivered rapidly but effectively, so both communities need to overcome some of their bad habits. Perhaps it is time for an agile take on data science itself.”