There is a lot of talk about big data these days, but many organisations still don't know how to use it. Swedish companies Klarna and Spotify are among those that make good use of the new possibilities.
Klarna, an e-commerce business founded in Sweden in 2005, provides customer payment services for online stores. The business idea is to simplify buying. One aspect of this vision is to eliminate risk for seller and buyer, which means it is essential to make good assessments of a buyer’s trustworthiness.
“To increase precision of risk assessment, we have invested in a big data infrastructure. We began a year and a half ago and went into production this spring,” says Erik Zeitler, who holds a PhD in database technology and is the technology lead in data infrastructure at Klarna.
The new infrastructure makes it possible to use several different sources at the same time when making a risk assessment; previously a single database was used.
“This of course means we can take into account more kinds of data when doing risk assessment, and it also means we can try out new risk models by maintaining several alternative transformations at the same time,” says Zeitler.
“We have implemented a continuous delivery pipeline on top of Hadoop that makes it easy to deploy new transformations in production in a traceable fashion. The single database used to be considered as the source of the truth, so to speak, and that's not the case anymore.”
Klarna won't reveal exactly which data is used, but it comes from three main sources: internal customer history data, data specific to the purchase, and external vendor data.
“Each risk assessment is based on a pretty limited amount of data – it's not many megabytes. But this data is derived from nightly batch transformations, and they are pretty big,” says Zeitler.
This means that Klarna prepares preliminary variables every night on all customers who previously have made purchases.
“And when a person wants to make a new purchase, we use those variables. There are a number of factors that can stop a transaction, and one of those is if you are buying unexpected products. For example, are you expected to suddenly buy a hundred USB sticks? If not, it might be a fraud.”
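The split between nightly preparation and purchase-time checks can be illustrated with a minimal sketch. Klarna has not disclosed its variables or rules, so everything here is an assumption made for illustration: the feature names, the `factor` threshold and the functions `nightly_batch` and `looks_unexpected` are hypothetical.

```python
from collections import defaultdict


def nightly_batch(purchase_history):
    """Precompute per-customer variables from past purchases.

    purchase_history is an iterable of (customer_id, quantity) pairs.
    The two variables kept here are hypothetical examples.
    """
    variables = defaultdict(lambda: {"orders": 0, "max_quantity": 0})
    for customer_id, quantity in purchase_history:
        v = variables[customer_id]
        v["orders"] += 1
        v["max_quantity"] = max(v["max_quantity"], quantity)
    return dict(variables)


def looks_unexpected(variables, customer_id, quantity, factor=10):
    """Flag a purchase far larger than anything the customer bought before.

    The factor of 10 is an arbitrary illustrative threshold.
    """
    v = variables.get(customer_id)
    if v is None:
        # No history: this sketch leaves the decision to other checks.
        return False
    return quantity > factor * v["max_quantity"]


# Variables are prepared once, offline...
history = [("alice", 1), ("alice", 2), ("bob", 5)]
variables = nightly_batch(history)

# ...and only looked up when a new purchase arrives.
print(looks_unexpected(variables, "alice", 100))  # True
print(looks_unexpected(variables, "alice", 3))    # False
```

The point of the design is that the expensive pass over all history happens in the batch, while the purchase-time check is a cheap lookup on a small amount of data.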
While it is easy to measure how many transactions Klarna fails to stop – it's just a matter of counting how many purchases don't get paid – it is harder to assess how many purchases are denied on false grounds.
“It's an ongoing project to try to estimate the loss we make from our false negatives. We continually tweak our models to try to minimise both the false positives and false negatives. And the more data we use, the better our chances are to make a good estimate,” says Zeitler.
If the customer does not pay, Klarna takes the hit. The e-stores are charged a fixed price combined with a fee for each transaction.
“The online stores that begin using Klarna usually increase their sales by about 30%. One of the reasons is that we eliminate friction for the customer by separating buying from paying. Customers buy first, and then decide how they want to pay,” he says.
To complete a purchase, Klarna only asks for information the customer can easily provide, according to Zeitler.
“We ask for things like email address, and post code or social security number, until we have enough information to identify the person, and to make the risk assessment. If we accept the purchase we send the customer an invoice. That means the customer does not have to part with their money before they get the product.”
The data used to identify the customer, and the data used to make the risk assessment, vary from country to country, mainly because of differences in public registers. This – in combination with differences in payment cultures – means it is a lot of work for Klarna to enter new markets, according to Zeitler.
Klarna recently stepped into the UK market, and its system is now offered by around 45,000 e-stores in 15 European countries. With the growing number of customers, the total amount of data handled also grows.
“But the amount of data is still not enormous. In our nightly batch transformation we go through somewhere between 10 terabytes and one petabyte of data. When you talk about big data you look at volume, variety and velocity – called the 'three Vs' – and our challenge is not the volume; the big challenge is to handle the variety of the data,” he says.
To meet the challenge posed by the large number of data sources and their complexity, it is not enough to deploy ordinary relational databases, in Klarna’s view. The company’s front-end systems are built on the NoSQL database Riak from the vendor Basho. The risk assessment is made in the next layer, which is actually a cluster of relational databases.
“Those are the two online systems. Back office – where the nightly batches are made – we have a system built on Apache Hadoop. And Hadoop is two things - it’s scalable storage in the form of Hadoop Distributed File System, HDFS, and scalable execution in the form of the programming model MapReduce,” says Zeitler.
One of the big advantages of this setup is the ability to use an SQL-like language called HiveQL on Hadoop, and then feed the result of the batch transformation into the front-end system every morning, according to Zeitler.
“Another pro is that the front-end SQL stays up to date with a steady stream of new transactions from the online system. That is, the information from the online system simmers down to Hadoop for offline analysing, but some of the information also goes directly into the cluster of relational databases. And then we write over the information in the online databases every morning with the outcome from Hadoop.”
This architecture, called Lambda, is a way to eliminate complexity and gives Klarna great flexibility, according to Zeitler.
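The flow Zeitler describes can be sketched in a few lines, under loud assumptions: the `ServingStore` class, the `batch_transform` function and the per-customer purchase count are hypothetical stand-ins, not Klarna's systems. New transactions update the online store directly and are also logged for offline processing; each morning the batch result replaces whatever the online store holds.

```python
class ServingStore:
    """Stand-in for the online databases (hypothetical)."""

    def __init__(self):
        self.purchase_counts = {}

    def apply_transaction(self, customer_id):
        # Incremental update from the steady stream of new transactions.
        self.purchase_counts[customer_id] = (
            self.purchase_counts.get(customer_id, 0) + 1
        )

    def overwrite(self, batch_view):
        # Morning load: the batch outcome replaces the online state.
        self.purchase_counts = dict(batch_view)


def batch_transform(event_log):
    """Recompute the view from the full log (stand-in for the nightly batch)."""
    view = {}
    for customer_id in event_log:
        view[customer_id] = view.get(customer_id, 0) + 1
    return view


event_log = []          # stand-in for the offline store
store = ServingStore()  # stand-in for the online store


def record_purchase(customer_id):
    event_log.append(customer_id)         # flows to the offline store
    store.apply_transaction(customer_id)  # also goes straight to the online store


for customer in ["alice", "alice", "bob"]:
    record_purchase(customer)

# Simulate an error that crept into the online state during the day.
store.purchase_counts["alice"] = 99

# The morning overwrite restores a state derived purely from the log.
store.overwrite(batch_transform(event_log))
print(store.purchase_counts)  # {'alice': 2, 'bob': 1}
```

Because the online state is rebuilt from the log every day, mistakes in the incremental path do not accumulate, which is one way such a design can reduce complexity.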
“It took a great deal of reading and discussing before we settled for this solution,” says Zeitler, who was the one who took the initiative for investing in a big data infrastructure at Klarna.
“I proposed it to the upper management two years ago, when I was relatively new at Klarna, and they listened to me. I don't know if this is something that is more common in Swedish organisations, which are usually pretty non-hierarchical, but I think it's pretty cool that you get listened to as a newcomer,” he says.
Spotify and big data technologies
Music streaming service Spotify, founded in 2006, is another Swedish-born company that has invested in a big data infrastructure: a 690-node Hadoop cluster in the back office, and on top of that clusters of the open source NoSQL database Apache Cassandra.
“We have 40 million monthly active users, and they generate a lot of data. We process that data to generate new data to hand back to the users. For example, we give song and playlist recommendations,” says Jimmy Mårdell, tech product owner at Spotify, and responsible for the data being delivered back to users.
The first big data infrastructure was built when the company started eight years ago, and much has happened since. In the beginning Spotify ran a small Hadoop cluster with 35 nodes, and data was imported into Hadoop using a store-and-forward approach. Today data is imported through a streaming system using the messaging system Kafka.
“All user activity generates a lot of logs and data, and then we have to ship that from all over the world to our Hadoop cluster – that is what we use Kafka for,” says Mårdell.
The main processing language used to be Python, but Spotify is now moving to Apache Crunch and Scalding instead. Spotify has also built its own workflow manager, called Luigi, which is used to coordinate all the analysis done on its data.
“We need Luigi to stay away from total chaos. We have open sourced Luigi, so other companies can use it as well; we like open source and use it a lot at Spotify. You get help from the community to develop the open source software – it's a win-win situation,” says Mårdell.
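What a workflow manager does at its core, running jobs only after the jobs they depend on have finished, can be sketched with the Python standard library. This is not Luigi's API: the `run_workflow` function and the job names are hypothetical, and a real workflow manager adds scheduling, retries and tracking of completed outputs on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(dependencies, jobs):
    """Run each job after all the jobs it depends on.

    dependencies maps a job name to the set of job names it needs first.
    jobs maps a job name to a zero-argument callable.
    Returns the names in the order they were run.
    """
    completed = []
    for name in TopologicalSorter(dependencies).static_order():
        jobs[name]()
        completed.append(name)
    return completed


log = []
jobs = {
    "import_logs": lambda: log.append("imported user activity logs"),
    "play_counts": lambda: log.append("aggregated play counts"),
    "recommendations": lambda: log.append("built recommendations"),
}
dependencies = {
    "import_logs": set(),
    "play_counts": {"import_logs"},
    "recommendations": {"play_counts"},
}

order = run_workflow(dependencies, jobs)
print(order)  # ['import_logs', 'play_counts', 'recommendations']
```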
Spotify has also recently started using Spark, which takes a different approach to big data processing from MapReduce. When it’s time to send the processed data back to the user, the data is first transported back from Hadoop to Cassandra, using tools Spotify has written in-house.
Then Cassandra is used to serve data back to the user. Since it’s important the data is geographically close to the user, Spotify has several Cassandra clusters dispersed around the world.
“We can’t have the Cassandra clusters in one single place, like the Hadoop cluster. If we had the data in Sweden, and an Australian user asked for it, the user experience wouldn't be that good – it would be too slow.”
Spotify has also chosen to run many separate instances of SQL and Cassandra databases to keep the system stable, according to Mårdell.
“For example, the data that delivers your playlists and the data that delivers the discovery page are separated in totally different database clusters. This means that if the playlist databases would get sick for some reason, everything else in Spotify would still work perfectly. Decoupling is the key to scalability,” says Mårdell.