eHarmony finds MongoDB the perfect match for data store

Online dating site eHarmony has adopted open source NoSQL database MongoDB for its data store to speed up delivery of matches between users.

Online dating site eHarmony found that open source NoSQL database MongoDB was the perfect match for its data store needs. 

The service had around one million registered members in 2001 but now has 44 million, and its machine-learning compatibility matching engine has gained in sophistication. Consequently, its PostgreSQL relational data store was no longer the best solution.

Thod Nguyen, chief technology officer at eHarmony, says: “Our compatibility matching model is becoming more and more complex. And, remember, it is bi-directional. It's a different model to, say, Netflix. You can like a movie but it doesn't have to like you back.”
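
The bi-directional point can be illustrated with a minimal Python sketch. eHarmony's matching model is proprietary, so everything here is an assumption made for illustration: the `Profile` fields, the trait-overlap score and the 0.5 threshold are hypothetical and are not the company's method.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """Hypothetical profile: traits a user has, and traits they look for."""
    name: str
    traits: set = field(default_factory=set)
    wants: set = field(default_factory=set)


def one_way_score(seeker: Profile, candidate: Profile) -> float:
    """Fraction of the seeker's wanted traits that the candidate has."""
    if not seeker.wants:
        return 0.0
    return len(seeker.wants & candidate.traits) / len(seeker.wants)


def is_mutual_match(a: Profile, b: Profile, threshold: float = 0.5) -> bool:
    """Bi-directional check: both directions must clear the threshold."""
    return (one_way_score(a, b) >= threshold
            and one_way_score(b, a) >= threshold)


alice = Profile("alice", traits={"hiker", "reader"}, wants={"cook", "reader"})
bob = Profile("bob", traits={"cook", "reader"}, wants={"runner", "painter"})

print(one_way_score(alice, bob))    # how well bob fits what alice wants
print(one_way_score(bob, alice))    # how well alice fits what bob wants
print(is_mutual_match(alice, bob))  # a pair counts only if both directions agree
```

In a one-way recommender of the kind Nguyen contrasts this with, the first score alone would be enough. Here the pair is rejected unless the second score also clears the threshold, which is why every candidate pair has to be evaluated in both directions.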

He claims that 5% of all US marriages since 2005 have started on the eHarmony website, which processes a billion matches each day. The machine-learning technology that has been processing user profiles for a decade is proprietary.

Using MongoDB for its data store means processing the entire user pool can take place within 12 hours, a task that previously took 15 days. 

“But matching is just one component of the website,” says Nguyen. “There are user engagement activities, too,” which have become richer with a new website, he says.

Nguyen joined the Santa Monica-based company 10 months ago, with a background that includes time at and digital marketing platform provider Zurock, and experience in putting NoSQL technologies into production.

He and his 60-strong team have been confronting a “dramatic increase in traffic”, together with the increasing complexity of the user-profile matching model.

“In this particular case, MongoDB is the best NoSQL solution for the problem we were trying to address, in terms of scalability and performance,” he says.

“The data store of the user pool was previously based on PostgreSQL: centralised and not distributed. It was hard to scale as the data expanded and as the number of attributes within the profiles increased.

“You have to deliver your matches near real-time. If you processed our entire user pool it took weeks to generate matches, especially those top-quality matches. So, in 2012 we started to rethink how we architected the system, with the data store as a key component of that.”

eHarmony evaluated HDFS [Hadoop Distributed File System], Oracle’s MySQL, the Voldemort data store, and Cassandra. 

“MongoDB was good at scalability and has great built-in sharding and replication, which makes it good at running complex queries,” says Nguyen.

“It also has a flexible and dynamic schema. With the SQL system if you wanted to add an attribute to a profile you needed to do a full data migration. With tens of terabytes of data in production that's very difficult. With the new system we just add more nodes to the cluster.

“It's the best optimal solution for this particular complex problem [the data store element of the architecture].”
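
The schema point can be sketched in a few lines of Python. This is an illustration only: plain dictionaries stand in for stored documents, no MongoDB driver is used, and the field names (`languages` and so on) are hypothetical rather than taken from eHarmony's profiles.

```python
# Hypothetical profile documents written before a new attribute existed.
profiles = [
    {"_id": 1, "name": "alice", "age": 34},
    {"_id": 2, "name": "bob", "age": 41},
]

# A new attribute appears only on newly written documents.
# The older documents are left exactly as they were: no migration step.
profiles.append(
    {"_id": 3, "name": "carol", "age": 29, "languages": ["en", "fr"]}
)


def speaks(profile: dict, language: str) -> bool:
    """Treat a missing 'languages' field as an empty list."""
    return language in profile.get("languages", [])


french_speakers = [p["name"] for p in profiles if speaks(p, "fr")]
print(french_speakers)
```

In a fixed relational schema, the equivalent change would normally mean altering the table definition before any row could carry the new attribute. With documents, the reading code simply has to tolerate the field being absent on older records.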

He advises others to follow the approach of starting from “the problem to be solved, not the technology as such”.

“Go through multiple different solutions, SQL and NoSQL,” he says. “Look at open source. Be open-minded about that. There is a lot of open source that is addressing similar problems, but you need to find the right one for you and your problem set.”

He describes himself as a “great proponent of open source”, but counsels that, “Community support is very important. There is a real difference between proof of concept and an enterprise production environment. Often you don't see problems in the test and development stage, you see them more in production. And for that you need a lot of professional support.

“MongoDB is good in that respect – there is good community support, but also professional support through 10gen.

“And it is also important to give back to the community. We've done that, with the Seeking query library given to GitHub.”
