Data management has been the midwife of business value for IT for much of the past half century.
Twenty years ago, in the 7 November 1996 issue of Computer Weekly that commemorated 30 years of our publication, Nicholas Enticknap wrote: “The 1990s have seen an increasing emphasis on making IT provide competitive business advantage, and this has led to the rise of data mining and data warehousing applications.
“It has also led to an appreciation of the advantages of making your data and even applications available to others; customers, suppliers and intermediaries such as brokers.”
Twenty years on, that is still, in the overall context of IT, the specific vocation of data management, business intelligence and data analytics. Enticknap goes on to say this is what is “driving the second major revolution of the 1990s: The rise of internet-based computing.”
In the 3 July 1986 issue of Computer Weekly, a decade earlier, the same author was pursuing a similar theme, in a series of articles on what was then called the “fifth revolution” in computing, touching on artificial intelligence (AI): “We expect to see new applications that are designed to translate data into information, such as decision support and expert systems.”
Generations one to four, whatever the detail of the distinctions between them, all “conform[ed] to the same basic computer architecture as first proposed by [John] von Neumann and his colleagues in 1944”, wrote Enticknap, when a computer was “a super-powerful calculator when electronics was still in its infancy”.
A big aspect of the new paradigm, which also included user-friendly computers, was solving “the problem of capitalising fully on the large investment in data”.
Relational database model and language
The 1970s had seen, wrote Enticknap in 1996, the establishment of transaction processing and the mini-computer as a business tool. It also saw the launch of database management systems and distributed processing over numbers of mini-computers, as opposed to just being centralised in single mainframes.
By that time, the relational database model, breaking the dependency between data storage and applications, was well known. It had been established theoretically in a paper published in 1970 by Englishman Ted Codd, an Oxford-educated mathematician working at IBM, entitled A Relational Model of Data for Large Shared Data Banks.
Mike Ferguson, of analyst firm Intelligent Business Strategies, is still surprised it took IBM so long – some 11 years – to turn Codd’s invention of the relational model into a database product. Larry Ellison, with his Oracle database, leapt into the gap in 1978. Oracle is still the behemoth of enterprise data.
SQL was a language implementation of the relational model. Ferguson recalls Codd and Date’s disgruntlement at its deviations from the original conception. Nevertheless, with SQL, relational databases – such as Oracle’s, but also IBM’s DB2, Microsoft’s SQL Server, and Sybase’s database, now owned by SAP – came of age.
Indeed, the persistence of SQL in the world of databases has been remarkable. Despite the rise of the so-called big data technologies of the Hadoop stack, NoSQL databases and the Apache Spark framework in the past 10 years, SQL repeatedly comes back as the super language of data interrogation.
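That staying power rests on SQL being declarative: a query states what data is wanted, not how the engine should fetch it, which is the data independence Codd argued for. A minimal sketch using Python’s built-in sqlite3 module illustrates the idea (the table, column names and figures are invented for the example):

```python
import sqlite3

# The SQL below describes *what* rows are wanted, not *how* the engine
# stores or retrieves them - the storage layout can change without the
# query changing. All names and values here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Acme", 120.0), (2, "Globex", 75.5), (3, "Acme", 42.0)],
)

# Declarative aggregation: total spend per customer, largest first
rows = conn.execute(
    "SELECT customer, SUM(total) FROM orders "
    "GROUP BY customer ORDER BY SUM(total) DESC"
).fetchall()
print(rows)  # [('Acme', 162.0), ('Globex', 75.5)]
```

The same statement would run, essentially unchanged, against any of the relational engines named above, which is a large part of why SQL keeps reasserting itself.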
“Show me another API that will take on SQL,” says Ferguson. “There isn’t one.”
Data warehousing and business intelligence
In the 30th anniversary issue of Computer Weekly, there is a story about how data warehousing was failing to live up to the hype that surrounded it in 1996.
“Despite the hype, less than 10% of the UK’s Times Top 1,000 companies are implementing data warehouses,” recorded Computer Weekly. We read similar stories today about big data Hadoop-based “data lakes”.
Data warehousing represented an evolution of database technology for analytical purposes, positing the creation of a centralised repository for all an organisation’s business system data.
The idea was to take data (mainly) from transactional databases and load it into a data warehouse for analysis. This generated extract, transform and load (ETL) technologies to move the data, and then business intelligence (BI) software – which took the pain out of writing SQL queries to do reporting and analysis.
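The extract-transform-load pipeline described above can be sketched in a few lines. This is a toy illustration only, with invented source rows and an invented warehouse schema, using an in-memory SQLite database to stand in for the warehouse:

```python
import sqlite3

# Extract: pretend these rows came from a transactional source system
# (all field names and values are invented for the example)
source_rows = [
    {"order_id": 1, "amount_pence": 12000, "country": "gb"},
    {"order_id": 2, "amount_pence": 7550, "country": "GB"},
]

# Transform: normalise units (pence to pounds) and country codes
transformed = [
    (r["order_id"], r["amount_pence"] / 100.0, r["country"].upper())
    for r in source_rows
]

# Load: write the cleaned rows into the warehouse's reporting table
wh = sqlite3.connect(":memory:")
wh.execute("CREATE TABLE fact_orders (order_id INTEGER, amount REAL, country TEXT)")
wh.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", transformed)

# A BI tool would now query the warehouse, not the source systems
total = wh.execute("SELECT SUM(amount) FROM fact_orders").fetchone()[0]
print(total)  # 195.5
```

Commercial ETL suites did this at scale, with connectors, scheduling and lineage tracking, but the extract, transform, load shape is the same.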
This set of technologies is now routinely taken to task for being too slow and antiquated, and far too dependent on corporate IT. It is often contrasted today with a new wave of data discovery and visualisation software, from Qlik, Tableau and their ilk, which can bypass IT as a function.
However, Ferguson is keen to restate the radical step-change in productivity this triad of data warehousing, ETL and business intelligence software represented in the mid-to-late 1990s and early 2000s.
“Data warehousing had to happen and was absolutely aimed at the BI market. Up till then all we had were those green and white printed sheets, spewed out of transactional database systems, to report from,” he says.
Ferguson says Teradata, for which he worked at that time, was “very pioneering” with its massively parallel processing database that was optimised for reporting purposes.
Together with the ETL technologies that emerged (signally from what is now Informatica) and the BI tools – from Business Objects, Cognos, and Microstrategy, among others – data warehousing/business intelligence marked a “watershed in productivity”, says Ferguson.
“The 10% [of early implementing organisations] were being led by managers who saw the value of insight,” he says.
This is insight that was also being generated by the use of data analytics technologies from SAS and (now IBM’s) SPSS, which was less about BI reporting and more about statistical model building for forecasting.
In recent years, SAP has majored on the in-memory, columnar database platform Hana, which is said to bring analytical and transactional database models together.
But the history of business software, as seen through the prism of Computer Weekly, will be the subject of a companion article to this one.
Suffice to say here, on Ferguson’s account, the ETL suppliers were “under pressure to get data out of those business applications, where the data models were not well understood”, as well as from the relational database management systems in the 1990s and after.
The coming of the web
As if life were not complicated enough for database makers and database administrators, along came the World Wide Web, invented by another Englishman, Tim Berners-Lee, in 1989.
Computer Weekly was just about registering that in 1996, as companies started building websites in the mid-1990s.
The particular quirk with respect to online transaction processing [OLTP] databases was that they were never built to serve large numbers of concurrent users on the web, let alone users reaching the web from mobile phones, as has been the situation more recently, especially with the rise of the smartphone.
Julia Vowler, in the 28 March 1996 edition of Computer Weekly, was reporting on a war in cyberspace between relational database suppliers and object database companies, such as Informix, whose technology was, putatively, more suited to natively supporting text, audio, video, HTML and Java; and to connecting databases to web servers.
Informix customers were reported to include Morgan Stanley, Lehman Brothers and Nasa.
Who remembers object-oriented database management system companies today? The technology continues, and object-oriented programming languages, such as C#, Python, Perl and Ruby, continue to flourish.
But the companies that sought to displace Oracle and the other relational database firms have largely been absorbed by the rest of the industry – Informix was ingested by IBM in 2001.
The rise of big data
However, the hegemony of the relational model has more recently been contested by an upsurge of NoSQL [Not Only SQL] companies, often based on open source technologies. Not always – MarkLogic is a NoSQL technology that is not open source – but often. So we have Basho (with Riak), Couchbase, DataStax (with Cassandra) and MongoDB.
Ferguson sums this group up as offering very specific use cases, usually to do with e-commerce or other website operations.
Stephen Brobst, chief technology officer at Teradata, expressed the view to Computer Weekly in 2014 that the NoSQL suppliers would eventually go the way of the object database suppliers.
“In Silicon Valley, there is a religious war going on between the SQL and NoSQL bigots. Eventually, rationality will win through. Doing everything in SQL is not a good idea, and doing nothing in SQL is also not a good idea,” said Brobst.
“[On the NoSQL side] Mongo does a good job of adding ease of use for Java programmers. Cassandra is good for web logging. But what I believe will happen is a repeat of what happened with object databases in the 1990s.
“Back then, the cry was the ‘relational model is dead, it has had its 20-year reign’. But, essentially, relational database engineers stole all the good ideas, and brought in object capability, killing those pure object database players,” he said.
Whatever the accuracy of this prediction proves to be, the fundamental ground on which the newer database makers have arisen is the same as that from which the Hadoop family of technologies has risen – big data.
Big data is an oft-bandied-about term, but it can be said to include social media data, machine-generated data and other data types that do not fit neatly into the rows and columns of relational database technology.
Ever since the strategy firm McKinsey consecrated the term in its May 2011 report Big data: The next frontier for innovation, competition and productivity, C-level business leaders have had their IT departments by the throat. “Where is our big data? And how can we make money from it?” have been the big executive questions. Some may even have asked: “What is Hadoop?”
Hadoop pairs a distributed storage layer – the Hadoop Distributed File System (HDFS) – with an open source implementation of a parallel programming framework called MapReduce, originally developed at Google.
It simplifies data processing across huge data sets distributed across commodity hardware and was developed a decade ago at Yahoo by Doug Cutting and Mike Cafarella. Cutting is now a senior executive at Cloudera, one of a group of Hadoop distributor companies that also includes Hortonworks and MapR.
MapReduce itself is coming to be displaced (or supplemented) by Apache Spark, commercialised by Databricks. Spark is another parallel processing framework, but it is not confined to Hadoop, and can run over relational data stores as well as NoSQL databases. Nor is it restricted to batch processing, as MapReduce is.
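The MapReduce pattern that Hadoop popularised can be sketched in miniature without a cluster: map each chunk of input to intermediate key/value counts, then reduce by key. This toy word count in plain Python (invented input text) shows the shape; on a real cluster the chunks live in HDFS and the map and reduce tasks run in parallel across many machines:

```python
from collections import Counter
from functools import reduce

# Input split into chunks, as HDFS would split a large file into blocks
chunks = ["to be or not to be", "that is the question"]

def map_chunk(chunk):
    # Map phase: count each word within one chunk independently
    return Counter(chunk.split())

def reduce_counts(a, b):
    # Reduce phase: merge per-chunk counts by key
    # (adding Counters sums the counts for each word)
    return a + b

word_counts = reduce(reduce_counts, map(map_chunk, chunks))
print(word_counts["to"], word_counts["be"])  # 2 2
```

Because each map call touches only its own chunk, the work parallelises naturally, which is the core insight the framework industrialised.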
And then there are graph databases, which store, map and query relationships between entities, and can sit alongside relational databases, but do a fundamentally different job. Neo4j was a pioneer in this branch of the NoSQL movement.
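The kind of question a graph database answers natively is a relationship query, such as the shortest chain of connections between two entities. A toy sketch in plain Python (the graph and names are invented) shows the traversal; a system such as Neo4j expresses this in a declarative graph query language rather than hand-written code:

```python
from collections import deque

# Adjacency list: who is connected to whom (illustrative data)
edges = {
    "Alice": ["Bob", "Carol"],
    "Bob": ["Dave"],
    "Carol": ["Dave"],
    "Dave": ["Erin"],
}

def shortest_path(start, goal):
    # Breadth-first search: explores connections level by level,
    # so the first path reaching the goal is a shortest one
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of connections exists

print(shortest_path("Alice", "Erin"))  # ['Alice', 'Bob', 'Dave', 'Erin']
```

In a relational database the same question requires recursive self-joins that grow with the path length, which is why relationship-heavy workloads sit more comfortably in a graph store.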
The future of data management
Data management is today an exciting, fast-moving field. In Computer Weekly, over the past 50 years – especially over the past 20, with the rise of the web and big data – we have covered much of it.
We’ve also written on topics such as master data management (MDM), the management of shared data to secure – usually – a single version of the truth; and data governance, which underpins MDM and always comes back to haunt data professionals and database suppliers.
Ferguson’s view is that, more than ever, “corporate IT wants tools so that complexity comes down and CIOs don’t have to pay through the nose for expensive [data science and data engineering] skills. In a way, it is back to Codd. We need data independence. Why does it matter where the data is stored? Tools and applications should not have to know that.”
Whatever the technical shape of future data management architectures, the field can only continue to develop as one of IT’s main bearers of business value for corporate organisations well into the future. Those of us who write about it and, more importantly, do it will have plenty of work to do.