Executive interview: Google's big data journey

A programming concept developed in 1958 inspired the seminal Google whitepaper that introduced the world to MapReduce


A programming concept developed way back in 1958 was the inspiration behind the seminal Google white paper that, in 2004, introduced the world to MapReduce, an early big data initiative.

With MapReduce, Google tried to address a problem it had identified in the way it processed internet search data. In essence, MapReduce split large datasets so they could be processed in parallel on low-cost commodity hardware — an approach that later inspired the open-source Hadoop project.
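The 1958 concept in question is the pair of higher-order functions, map and reduce, from functional programming (Lisp). A minimal sketch of the paradigm in Python — a word count across documents, with illustrative function names rather than Google's actual implementation:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate values by key, as the framework
    # does between the map and reduce phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return key, sum(values)

documents = ["big data big ideas", "big data at scale"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
# counts["big"] == 3, counts["data"] == 2
```

Because each map call is independent and each reduce call touches only one key's values, both phases can be spread across many commodity machines — the core insight of the white paper.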

The search engine company has now extended its data processing strategy and recently introduced Cloud BigTable, a fully managed, scalable NoSQL database service.

Internet search, social media and the internet of things are some of the IT areas experiencing huge data growth.

Indeed, experts predict traditional relational databases will be unable to process the tsunami of data that a truly digital society will require.

In that MapReduce white paper over a decade ago, Jeffrey Dean and Sanjay Ghemawat from Google described the lack of a single, common infrastructure on which heterogeneous jobs could be scheduled and processed. Everything had to be hand-written for specific environments and architectures.

The internet search giant is now on version 3 of its big data vision since the publication of that white paper, says Cory O’Connor (pictured), Cloud BigTable product manager: “2002 to 2004 was the big bang of big data; this was when Google wrote its white papers on MapReduce.


“Google fundamentally rethought the practice of building bigger machines to solve these problems. We only build using commodity machines and we assume systems will fail.

“We have done several iterations of almost every piece of technology we showed in the white papers.”

The use of massively scalable low-cost commodity infrastructure is almost diametrically opposite to how the big four IT suppliers go about tackling big data. Yes, they do NoSQL and offer Hadoop in the cloud. But SAP, for example, wants customers to spend millions on S/4 Hana, Oracle pushes Exadata and its engineered appliance family, IBM sells the merits of the z13 mainframe, and Microsoft has SQL Server.

For example, the z13 mainframe can run real-time fraud analysis on financial transactions. Such time-series data is a natural fit with Cloud BigTable, says O’Connor. “It is interesting to see how you can approach a problem in a different way. Time will tell which way proves to be most effective.”
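Bigtable stores rows sorted lexically by a single row key, so time-series schemas such as financial transactions typically encode the entity plus a reversed timestamp in the key, keeping the most recent events contiguous and easy to scan. A minimal sketch of that key design in Python — the `txn` prefix, field layout and account ID are illustrative assumptions, not details from the article:

```python
# Upper bound on millisecond timestamps, used to reverse the ordering.
MAX_TS = 10**13

def row_key(account_id: str, timestamp_ms: int) -> str:
    # Reverse the timestamp so the newest events sort first per account.
    reversed_ts = MAX_TS - timestamp_ms
    # Prefix with the account so one account's transactions are contiguous
    # (and writes spread across accounts rather than hotspotting one range).
    return f"txn#{account_id}#{reversed_ts:013d}"

# Newer transactions produce lexically smaller keys for the same account.
older = row_key("acct42", 1_600_000_000_000)
newer = row_key("acct42", 1_600_000_100_000)
assert newer < older
```

A fraud-analysis query for one account then becomes a short prefix scan over `txn#acct42#`, reading the latest transactions first.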

Google’s differentiator, according to O’Connor, is: “We know how to manage very large datasets. We have a fully managed big data architecture.”

Storage costs

He adds: “Data is growing. The market will require 10 times the amount of computing and 10 times the amount of storage. At some point you cannot build bigger, you have to adopt the paradigm of commodity hardware and scaling horizontally. This was the premise behind NoSQL, which is able to scale out very effectively.”


Given the cost of storing greater and greater amounts of data and the way it is deployed, he says, “it looks like it won’t be economical to maintain storage using traditional procurement”.

The question for large enterprises is whether investing in something like 1PB of enterprise storage is as reliable as 1,000 1TB commodity disks.

But Google, which started 20 years ago, runs arguably one of the world’s biggest databases, and it is all based on commodity storage. The technologies it uses internally are now available as external cloud services: BigQuery for scalable analytics, Cloud Pub/Sub for streaming data pipelines, Cloud Dataflow for stream processing and now Cloud BigTable.

O’Connor says: “All scale, all are fully managed, and all are world class and version 3, from when the white papers were released.”

Lowering the technical barriers

With Cloud BigTable Google is attempting to lower the education barrier, by building in many of the services Hadoop users previously had to develop themselves. Going after existing users of Hadoop represents the low-hanging fruit.

What about enterprises running relational database applications? 

O’Connor says: “BigTable is a big step for people who have not experienced NoSQL.” He expects such organisations will need to re-architect their applications, and advises them to start building new projects using Cloud BigTable.
