This article can also be found in the Premium Editorial Download "Computer Weekly: Retail's data challenge."
With 50TB of machine-generated data produced daily and 100PB of data to process in total, eBay's data challenge is truly astronomical.
This deluge of data is helping eBay to emulate the know-how that customers used to get from a local shop owner; the only difference is, it is trying to achieve this across its global auction sites.
Speaking at the Gartner CRM Summit in London, David Stephenson, head of global business analytics at eBay, said the auction site's goal is to make shopping successful.
As a marketplace, eBay's primary business depends on shopping being successful from both the buyer's and the seller's perspective.
The company is using analytics to help it understand its customers better. Stephenson's ambition is to take the kind of personalisation possible in a small shop and apply it to the world of eBay. "In a small store, engaging the customer is key, helping them with search and recommendations, understanding their preferences and learning from existing customers," he said.
Web metrics data is the raw material Stephenson has at his disposal. The auction site generates a huge amount of web analytics, which Stephenson described as "the customer journey data". This tells him what people do on eBay and how they use the site.
"The web can offer the same experience [as a local shop], and provide customers with comparisons," said Stephenson. "We can learn customers' intentions." All this insight drives technology changes at eBay.
The challenge for eBay is that web analytics is like having a video camera mounted on the head of every customer going into a supermarket, said Stephenson. Recording everything every customer does generates 100 million hours of customer interaction [per month], creating an unmanageable amount of customer data. "There is no way to start if you want to process 100 million hours [of web analytics]," he said.
"We need to understand customers, learn from our customers and apply data science techniques to allow us to get more data and new types of data."
Managing the customer journey
The eBay site has 100 million customers who list items in 30,000 categories. In terms of transactions, the site processes thousands of dollars per second. And Stephenson described this transactional data as "just the tip of the iceberg".
He admitted eBay is starting to struggle to process all the customer journey data.
The big data challenge for eBay is that asking a simple business question such as "What were the top items that showed up in searches yesterday?" involves processing five billion page views. "So there is a huge problem just to ask a basic business question," said Stephenson.
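To illustrate why even this question is costly, the following minimal sketch answers it by streaming over page-view records and counting the items shown in search results. The record layout and the field names `page_type` and `item_ids` are hypothetical, chosen only for illustration; the article does not describe eBay's actual schema or query tooling.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_search_items(page_views: Iterable[dict], n: int = 3) -> List[Tuple[str, int]]:
    """Count how often each item appeared in search result pages.

    Every page view has to be inspected once, so the cost grows
    with the total number of page views, not with the size of the answer.
    """
    counts = Counter()
    for view in page_views:
        # Hypothetical field: only search pages are relevant to the question.
        if view.get("page_type") != "search":
            continue
        # Hypothetical field: the items listed on that search page.
        counts.update(view.get("item_ids", []))
    return counts.most_common(n)


# Tiny made-up sample standing in for one day of page views.
sample_views = [
    {"page_type": "search", "item_ids": ["camera", "lens", "tripod"]},
    {"page_type": "search", "item_ids": ["camera", "lens"]},
    {"page_type": "item", "item_ids": ["camera"]},
    {"page_type": "search", "item_ids": ["camera"]},
]

print(top_search_items(sample_views, n=2))
```

The logic is trivial on four records; the difficulty Stephenson describes comes from running the same single pass over five billion of them.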
But eBay needs to do more than ask simple questions. Stephenson said the site wanted to run sentiment analysis, network analysis and image analysis, none of which can be run in a traditional transactional database.
The company has split its data analytics across three platforms, the first of which is a traditional enterprise data warehouse from Teradata. This core transactional system must be extremely reliable, said Stephenson. "The system can't go down. Every day we process 50TB of data, accessed by 7,000 analysts with 700 [concurrent users]."
In 2002, eBay built a 13TB Teradata enterprise data warehouse, which effectively provides a massive parallel relational database. This has now grown to 14PB, with the system built on hundreds of thousands of nodes.
The enterprise data warehouse gives tremendous performance on standard structured queries, said Stephenson, but it is unable to meet eBay's needs for storage and processing flexibility. "These systems are fairly expensive, so when you are looking at adding 50TB of data every day, costs are prohibitive," he said.
In terms of customer journey data, eBay used to keep a sample of 1% and throw the rest away, said Stephenson.
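The article does not say how that 1% sample was drawn. One common approach, shown here purely as an assumption, is to hash a session identifier and keep the sessions that fall into a fixed bucket, so that a sampled customer journey is kept whole rather than as scattered page views. The function name `in_sample` and the session ID format are hypothetical.

```python
import hashlib


def in_sample(session_id: str, rate_percent: int = 1) -> bool:
    """Decide deterministically whether a session belongs to the sample.

    Hashing the session ID (an assumed identifier) means every page view
    from the same session gets the same decision.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to one of 100 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rate_percent


# Made-up session IDs for illustration.
sessions = [f"session-{i}" for i in range(100_000)]
kept = [s for s in sessions if in_sample(s)]
print(f"kept {len(kept)} of {len(sessions)} sessions")
```

Whatever the sampling method, the trade-off is the one Stephenson describes: the discarded 99% cannot be queried later.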
It may make sense to record what customers do, then throw away all the information that is not required, he said, but added: "For a lot of questions, we don't know ahead of time what we want to ask about the customer journey. About 85% of the analytics questions we ask are new or unknown. If you impose structure and throw out data, you cannot ask questions you don't know, but if you store everything, you will have 100 million hours of data [per month] and won't be able to analyse it all.
"There is a tension, either to impose structure on the huge [web analytics] data set by throwing away data, or keeping all the data collected but not being able to work on it [because it is unmanageable]."
To address this issue, eBay started its second data initiative. Seven years ago, the company began a project to store all its customer data. "For the customer journey data, we wanted to scale our big data solution 100-fold for the same price [as the enterprise data warehouse]," said Stephenson.
The auction site needed a product that could handle hundreds of petabytes of raw customer journey data, be maintained by a team of five people, and still be accessed easily by analysts.
The company worked with Teradata to develop a custom appliance built with several hundred user-defined functions. The system was built on commodity hardware, with proprietary software to process all the customer journey data and store it cheaply.
The end result is a custom data warehouse called Singularity.
The system eBay has developed can run ad-hoc queries in 32 seconds. Stephenson said that at the time, Hadoop would have taken 30 minutes to run such queries. "Hadoop may not be best [suited] for business-critical issues such as really understanding your customers," he added.
Along with the enterprise data warehouse and Singularity, eBay is also using Hadoop, which forms the third side of its data analytics triangle. The auction site has built two 20,000-node Hadoop clusters with 80PB of capacity, said Stephenson. These work alongside the Teradata data warehouse and Singularity custom data analytics appliance to give eBay the tools it needs to use data analysis to follow the customer journey.
True value of analytics
Stephenson said Singularity is proving its value in 'A/B testing' on the eBay site, which can be compared with trying different combinations of confectionery at a supermarket checkout to capture impulse buying. This allows eBay to test ideas on the site and assess what works, such as testing whether site visitors prefer bigger pictures in search results.
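As a rough sketch of how such a test might be assessed, the example below compares the click rates of two page variants with a standard two-proportion z-test. This is a textbook method, not a description of eBay's own analysis, and the visitor and click numbers are invented for illustration.

```python
import math
from typing import Tuple


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> Tuple[float, float]:
    """Return the z statistic and two-sided p-value for rate B versus rate A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants perform equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented numbers: variant A shows small pictures, variant B bigger pictures.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=560, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the difference between the variants is unlikely to be chance alone; a large one means the test has not shown a preference either way.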
The technology can also be used to power search hints, a concept Stephenson called "an economist in a box". It is possible for eBay to present search query tips based on questions that power users have already asked. "Just about every question that could be asked has already been asked by a power user," he said.
Such searches enable an eBay seller to determine whether it is best to set a low auction reserve price, whether free shipping matters, and any other possible questions related to selling an item successfully on eBay.