When David Abraham became chief executive of Channel 4 three years ago, he had three things on his mind: data, data and data.
Abraham realised early on that if Channel 4 (C4) could harness the data generated by people viewing television online it could develop new services that could help to differentiate the channel from the competition.
“It was absolutely clear that David wanted to collect, crunch, analyse and drive [Channel 4] forward using data,” said Bob Harris, the broadcaster’s chief technology officer.
Added-value services could be as simple as allowing viewers to resume a programme in the right place, no matter what device they view it on, or offering viewers a menu of programmes that are likely to be of interest based on their previous viewing history.
The strategy has its origins in 2006, when Channel 4 launched its on-demand website, 4OD, making programmes available to people on smartphones, laptops and internet TVs.
A different approach to business intelligence
But if Channel 4 was to make Abraham’s vision of becoming a data-led broadcaster a reality, it needed to think about information technology in a different way, said Harris, speaking to IT leaders at a Computer Weekly 500 Club event.
Channel 4 was already up to speed with the latest IT systems for analysing business intelligence. Its portfolio of technology included Oracle databases running on Sun servers, coupled with SAP’s Business Objects suite and IBM’s SPSS statistical analysis software for analysing data.
Channel 4’s systems were powerful enough to allow the broadcaster to do near real-time analysis of advertising sales, but were not suited to the big data applications that Abraham had in mind.
Rethinking data volumes
“I sat down with our database guys, our recording guys, and our business intelligence teams, and I asked how we were going to cope with one, two, three orders of magnitude growth in our data volumes,” said Harris.
Channel 4’s R&D department had begun tracking the emerging big data technologies back in 2001. One in particular stood out – Hadoop Map Reduce, a software framework developed by the open source community, which allows developers to write programs to analyse and process massive amounts of data.
“It seemed like the trendiest technology designed to handle huge volumes of data,” said Harris.
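The Map Reduce model itself is simple to illustrate. The following is a minimal single-machine sketch in Python, not Hadoop code and not Channel 4's implementation: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step collapses each group to a result. The function names and the sample viewing records are hypothetical, chosen only for illustration. Hadoop applies the same pattern across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_view(record):
    """Map step: emit a (programme, 1) pair for one viewing record."""
    programme, _device = record
    yield programme, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_views(programme, counts):
    """Reduce step: collapse one group of values to a single total."""
    return programme, sum(counts)


def count_views(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = chain.from_iterable(map_view(r) for r in records)
    return dict(reduce_views(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample data: (programme, device) viewing records.
views = [("News", "laptop"), ("Drama", "smartphone"), ("News", "tv")]
print(count_views(views))
```

Because each map call looks at one record and each reduce call looks at one key, both steps can be split across as many machines as the data volume requires.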
Definitions of big data
Wikipedia: Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Bill Inmon, father of the Data Warehouse: Big data is that set of data which is an order of magnitude larger than that set of data which you can comfortably process today. (Adapted from a quote from Bill Inmon)
Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision-making.
Proof of concept trials
The IT team ran a number of proof of concept trials, working with specialist big data company Cloudera and others. But the task proved more difficult than Harris anticipated.
“We decided that building these things in-house from inexpensive computers was not as easy as people tell you,” he said.
Instead, Channel 4 turned to Amazon, the broadcaster's main cloud computing services provider, for an on-demand big data service.
Amazon’s Elastic Map Reduce
Harris decided to run Amazon’s Elastic Map Reduce (EMR) service in parallel with Channel 4’s existing business intelligence systems throughout 2012 and compare the results.
“We had started to process larger and larger amounts of big data on our business intelligence stack, but in reality EMR started to overtake that,” said Harris.
It became clear that EMR could process far greater volumes of data than traditional business intelligence software, and that it could process Channel 4’s data much more quickly.
“We were moving jobs taking multiple days down, in most cases, to single-digit hours, so there were huge gains in productivity,” he said.
Querying billions of rows of data from the desktop
Further gains came when Channel 4 rolled out a programming tool for Hadoop, called Hive, that allows users to interrogate vast amounts of data from a desktop PC.
Users can query billions of rows of data from their desktops and have the results back in minutes or hours, said Harris.
The beauty of processing data in these volumes is that poor-quality data can be thrown away without causing a serious problem, providing it is not more than 1% or 2% of the total, he said.
“Let's not get bogged down in the sand – this is a world of statistical analysis we are moving into, as opposed to a world of finite banking level transactions,” said Harris.
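The idea of discarding poor-quality rows within a tolerance can be sketched in a few lines of Python. This is an illustrative assumption about how such a rule might be coded, not a description of Channel 4's tools: the function name, the validity check and the sample rows are all hypothetical, and the 2% default simply mirrors the upper figure Harris quoted.

```python
def drop_bad_rows(rows, is_valid, max_bad_fraction=0.02):
    """Discard invalid rows, but refuse if too much of the data is bad.

    Returns the valid rows and the fraction that was thrown away.
    """
    rows = list(rows)
    good = [row for row in rows if is_valid(row)]
    bad_fraction = (len(rows) - len(good)) / len(rows) if rows else 0.0
    if bad_fraction > max_bad_fraction:
        raise ValueError(
            f"{bad_fraction:.1%} of rows are invalid, "
            f"above the {max_bad_fraction:.1%} tolerance"
        )
    return good, bad_fraction


# Hypothetical sample: 100 viewing records, one with a missing programme ID.
sample = [{"programme_id": i} for i in range(99)] + [{"programme_id": None}]
clean, discarded = drop_bad_rows(sample, lambda r: r["programme_id"] is not None)
print(len(clean), discarded)
```

Raising an error above the threshold, rather than silently dropping rows, keeps the statistical shortcut from hiding a genuinely broken data feed.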
Channel 4 has been able to use its traditional database and business intelligence suite to analyse the results.
Investment in traditional business intelligence not wasted
A typical big data trawl might work through 20 million rows of data, and generate 10 million rows of answers.
The best place to put those 10 million answers is a standard data warehouse, according to Harris.
Channel 4 has tens of millions of pounds invested in traditional business intelligence technology, and a skilled team of people in place.
And with most traditional IT manufacturers producing interfaces for big data, Harris sees no need to replace those systems any time soon.
“These things will be with us for some time to come, because most of us, myself included, have invested millions of pounds in these technologies,” he said.
Lessons learned from big data
For Harris, Channel 4’s experiences have shown that dedicated big data technologies such as Hadoop offer far more than traditional business intelligence technologies built on relational database management systems (RDBMS).
Certain jobs that would take a few hours on Map Reduce would take hundreds of days to complete on high-power Sun servers running RDBMS technologies, said Harris – and they would cost a lot more.
IT suppliers have developed proprietary technologies based around massively parallel computers, but the performance they offer per pound invested compares poorly with Hadoop, said Harris.
“I have spoken to a lot of people who have invested in those proprietary technologies, but none have come back and said it is great and it does everything we need,” he said.
Open source software versus commercial software
In practice, Harris suggested that if you are going to implement big data, it makes sense to choose open source software such as Hadoop, rather than commercial packaged software.
Open source technology is absolutely “at the bleeding edge” of technology, he said.
“I have not seen any proprietary big data technology that is really solving the problem. A lot of vendors are taking open source software and low-cost hardware, packaging it up in their own offering, and trying to sell it for a significant amount of money,” said Harris.
Hadoop sparks a religious debate among technical specialists
Harris is agnostic when it comes to which particular version of Hadoop is the best. There is an almost religious debate over the merits of Cloudera, compared with Hortonworks' version of Hadoop, he said.
But a more pertinent question is whether to choose to run Hadoop in-house or in the cloud.
And if you go for Amazon, do you choose a packaged version of Hadoop running on Amazon’s Elastic Map Reduce service, or do you hire servers and storage from Amazon and run your own choice of Hadoop software?
Whichever route you choose, big data technology is immature and difficult to implement, according to Harris.
Making a success of it means embracing the open source community and being prepared to spend hours with baseball-hat-wearing technical enthusiasts.
“You have to roll up your sleeves and get on with it,” he said. “That is very much what we did when we realised our systems were not going to cope. We basically sat a bunch of people down and said, 'just go and make this stuff work'.”
Channel 4 had to write its own big data tools. “I hate having to write my own tools,” Harris told the audience. “There are some really good tools out there, but really there isn’t anything that fits the bill. You really have to code cut.”
“We have moved from monthly, to weekly, to daily [batch processing], and we will get much closer to real time,” he said. “I can’t do it today, but I believe the price point of that sort of analysis will come down.”
It is early days, but Harris expects Channel 4 to continue developing big data on Amazon’s cloud service for the foreseeable future.
“We have had a few troughs of disillusionment, I can tell you, when we could not make this stuff work,” he said, drawing on the language of the IT analyst group Gartner.
But now, when it comes to big data, he said Channel 4 is heading towards Gartner’s “slope of enlightenment”.
“It's not that we are better than anyone else. We just started earlier,” he said.