The recent breach of 2.6TB of data from Panamanian law firm Mossack Fonseca has generated a slew of revelations about how the world’s rich and powerful channel their wealth into offshore companies in a bid to avoid tax, prompting high-profile political resignations and public demands for a clampdown.
But the stories would have been a lot less detailed, and taken far longer to surface, without the help of a very old technology that’s currently making a comeback – the graph database.
Graph databases pre-date the relational database (RDB) model that has dominated business IT for more than 40 years. Instead of storing and manipulating data in tabular rows and columns, graph databases are structured more like the scribbled “mind maps” used for freeform note-taking – bubbles of information joined by a tangle of labelled lines that reveal the connections and relationships between them.
In a graph database, information is stored in the form of nodes (items such as businesses or individuals), properties (information about, or relating to, nodes) and edges (the lines connecting nodes to one another or to properties, where much of the important information resides). Graph databases typically don’t require data to be in a rigidly structured format and are often faster and easier to scale than RDBs.
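To make the node, property and edge vocabulary concrete, here is a minimal in-memory sketch in Python. It is an illustration only, not how any particular graph database stores data; the class names, the relationship type SHAREHOLDER_OF and the sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                   # e.g. "Person" or "Company"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str                                  # node_id at one end of the line
    target: str                                  # node_id at the other end
    rel_type: str                                # the label written on the line
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """A toy property graph: nodes keyed by id, edges kept in a list."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        self.edges.append(edge)

    def neighbours(self, node_id):
        """Return (relationship type, neighbour id) pairs, ignoring direction."""
        result = []
        for edge in self.edges:
            if edge.source == node_id:
                result.append((edge.rel_type, edge.target))
            elif edge.target == node_id:
                result.append((edge.rel_type, edge.source))
        return result


# Made-up sample data: one person linked to one company.
graph = PropertyGraph()
graph.add_node(Node("p1", "Person", {"name": "Example Person"}))
graph.add_node(Node("c1", "Company", {"name": "Example Holdings Ltd"}))
graph.add_edge(Edge("p1", "c1", "SHAREHOLDER_OF", {"since": 2010}))

print(graph.neighbours("c1"))
```

Note that the edge carries its own type and properties, which is where, as described above, much of the important information resides.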
The graph model works particularly well for applications where relationships between items of data are the most important factor.
Matt Aslett, research director at 451 Research responsible for data platforms and analytics, says: “This means they’re very well-suited to applications like social networking, mapping, route planning and logistics, asset management, loyalty schemes, fraud detection, recommendation engines, master data management systems and more.”
They are also perfect for the task of uncovering hidden connections in a mountain of legal and financial data such as the Panama Papers.
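As a rough illustration of what uncovering a hidden connection can mean in practice, the sketch below runs a breadth-first search over a handful of links to find the shortest chain between two entities. The records are invented for the example, and a real graph database would do this with its own query engine rather than hand-written Python.

```python
from collections import deque

# Hypothetical, made-up links: each pair means "these two are connected".
links = [
    ("Politician A", "Lawyer B"),
    ("Lawyer B", "Shell Co 1"),
    ("Shell Co 1", "Shell Co 2"),
    ("Shell Co 2", "Bank Account X"),
    ("Politician A", "Charity C"),
]


def build_adjacency(pairs):
    """Turn a list of undirected links into a neighbour lookup."""
    adjacency = {}
    for a, b in pairs:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_connection(adjacency, start, goal):
    """Breadth-first search: return the shortest chain of entities, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == goal:
            return path
        for neighbour in sorted(adjacency[current]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


adjacency = build_adjacency(links)
print(shortest_connection(adjacency, "Politician A", "Bank Account X"))
```

In a relational database the same question would typically need one join per hop, which is why relationship-heavy queries are where the graph model tends to pay off.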
Mapping out connections
The graph database renaissance is perhaps not surprising given how well the model mirrors the linked, non-hierarchical structure of the web itself, and the growing popularity of networked and social applications. Indeed, it was social web giants like Twitter, Facebook and Google that prompted the comeback in the first place, since they needed more efficient ways to manage and understand the relationships among their vast networks of users. And as networked smart cities and the internet of things (IoT) take hold, we’re likely to see many more use cases for the technology.
Already, businesses beyond the high-tech giants are increasingly turning to graph databases. Lufthansa, for example, is using graph databases to store relationships between the content it offers on flights and the different devices people use to access that content. “To deliver, say, a movie or in-flight offer to a passenger’s personal device, the airline needs to understand the devices people are using – their screen sizes, performance and so on – then map that onto the content delivered to the individual user, as well as knowing details about passengers such as whether they’re frequent flyers or members of any of the company’s loyalty schemes,” says Aslett.
Choosing a graph database from an expanding pool of options
A whole bunch of graph databases, tools and frameworks have sprung up in the past few years, the bulk of them open source. The Wikipedia graph database entry has a non-exhaustive list of around 50 products, with useful feature comparisons.
Most of the business examples in this article use the market-leading Neo4j, which is by far the most mature and widely used graph database, having been around since the early 2000s.
Aslett also notes that Objectivity’s InfiniteGraph, another mature offering that’s particularly effective for crime detection, has significant traction among financial firms and law enforcement agencies.
But the options are expanding. Analyst James Governor of RedMonk thinks it makes sense to let your developers explore what’s out there, since products and feature sets are evolving so rapidly. “Things are changing all the time and there are an unforgiving number of options, including tools that let you overlay graph capabilities onto your existing relational database system. We think what’s interesting is choosing specific databases for specific tasks and finding ways to bring them together,” he says.
“Neo4j is clearly way out ahead in this market. It has built a solid reputation through its long-term focus on graph technology and I’ve spoken to plenty of developers who’ve had good experiences with it,” he adds.
“The TinkerPop stack from Apache is also getting a lot of attention at the moment, so enterprises might want to look at that too. And for really high-scale stuff, there’s Apache Giraph which runs on the Hadoop stack and is being used by the likes of Facebook,” says Governor.
There are plenty of other examples. Mobile operator Telenor uses the technology to understand its users – where they are, what devices they’re using and what they’re permitted to access. Many banks and financial institutions use it for fraud detection. Royal Bank of Scotland is using it in a change management tool called Dart that continually tracks the implication of changes on its core Agile Markets trading system. Online gambling provider Gamesys is using it to manage a referral system and Facebook integration for its customers. The list goes on.
Growth and change ahead
We’re not, however, likely to see graph databases taking over from relational databases any time soon. They still only represent a minuscule fraction of the total database market, although precise figures are hard to come by. 451’s Aslett estimates the market currently represents around $200m of the $286bn sector – which equates to a share of about 0.07%.
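The share figure follows directly from the two estimates quoted above. As a quick check in Python (the variable names are illustrative):

```python
# Aslett's estimates as quoted in the article, in US dollars.
graph_market_usd = 200e6      # $200m
total_market_usd = 286e9      # $286bn

share_pct = graph_market_usd / total_market_usd * 100
print(f"Graph database share: {share_pct:.2f}%")
```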
But he notes the whole NoSQL market (which includes key value and document stores as well as graph databases) is nonetheless growing at an impressive rate.
“We’re seeing a compound annual growth rate of 43%, compared with 11% for the market overall. The dominance of the relational model means things will naturally take a long time to change, but we’re at the beginning of a potentially significant shift,” Aslett concludes.
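To see what that gap in growth rates implies, the sketch below compounds both rates over the same period. The five-year horizon is an arbitrary assumption chosen for illustration, not a forecast from 451 Research, and it assumes both rates stay constant.

```python
def compound(start, annual_rate, years):
    """Value after growing `start` at a fixed annual rate for `years` years."""
    return start * (1 + annual_rate) ** years


years = 5  # assumption: illustrative horizon only

nosql_multiple = compound(1.0, 0.43, years)     # 43% CAGR quoted for NoSQL
overall_multiple = compound(1.0, 0.11, years)   # 11% CAGR quoted for the overall market

print(f"NoSQL market multiple after {years} years: {nosql_multiple:.1f}x")
print(f"Overall market multiple after {years} years: {overall_multiple:.1f}x")
```

Even at these rates, a segment starting from a tiny base remains a small part of the whole for many years, which is consistent with Aslett's point that change will take a long time.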
Picking apart the Panama Papers – how a consortium of investigative journalists used graph technology to plot the offshore connections of the rich and powerful
The International Consortium of Investigative Journalists (ICIJ) is a global network of almost 200 reporters who work with many of the world’s leading news organisations. When they were handed a cache of more than 11.5 million breached documents from Panamanian law firm Mossack Fonseca, which specialises in setting up offshore companies for its wealthy clients, it was clear that untangling the web of connections concealed within the 2.6TB of data was not going to be possible manually.
“We’ve been dealing with data connected to offshore dealing for the past four years, and during our investigation into the leaked HSBC Swiss Leaks in 2014/15, we implemented the Linkurious data visualisation tool, which uses the Neo4j graph database as its engine,” says Mar Cabra, head of the ICIJ’s data and research unit.
The tool’s implementation was only completed at the tail end of that investigation, so was used mainly for fact-checking, she says. However, the size of the Panama Papers breach eclipsed all previous data hauls. “We don’t have an army of data scientists here. There are only three developers, so it was vital to provide journalists with an intuitive tool they could use to explore the data without the need for technical experts,” says Cabra.
First, the team reverse-engineered Mossack Fonseca’s internal database of around 215,000 offshore companies from the piecemeal data they’d been given. They then used the Talend extract, transform and load (ETL) tool to import the data into Neo4j, from where it could be visualised in Linkurious. “My reporters found it very intuitive and easy to use. They were able to just click dots on the screen to reveal – instantly – how people and entities are connected,” says Cabra. “It has an advanced query language, Cypher, that more technically-savvy reporters can use, and you can also tap into an API [application programming interface] to visualise the data elsewhere. Fuzzy matching, where the system finds similar names, was also really useful. Another great feature for reporters is the ability to export interactive widgets, for example to let readers visually explore the connections around particular politicians.”
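The article does not say how the fuzzy matching Cabra mentions is implemented. As a generic sketch of the idea, the Python below uses the standard library's difflib to find names that are similar to a query after light normalisation. The names, the normalise helper and the 0.8 similarity cutoff are all assumptions made for the example.

```python
import difflib


def normalise(name):
    """Lower-case, drop simple punctuation and collapse whitespace."""
    cleaned = name.lower().replace(".", "").replace(",", "")
    return " ".join(cleaned.split())


def find_similar(query, names, cutoff=0.8):
    """Return the original names whose normalised form resembles the query."""
    lookup = {}
    for name in names:
        lookup.setdefault(normalise(name), []).append(name)
    matches = difflib.get_close_matches(
        normalise(query), list(lookup), n=5, cutoff=cutoff
    )
    return [original for match in matches for original in lookup[match]]


# Made-up names, purely for illustration.
known_names = ["Jon Smith", "John Smith Jr.", "Joan Smythe", "Acme Holdings Ltd"]

print(find_similar("John Smith", known_names))
```

Lowering the cutoff catches more spelling variants at the cost of more false positives, so any matches of this kind still need checking against the underlying documents.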
Meanwhile, the mass of unstructured data – comprising emails, legal documents and so on – was put into a document store where it could be searched by journalists with familiar Google-style text-search tools. This combination of a visual graph database and searchable document store was vital for piecing together the stories, since the database alone did not reveal all the names. “Mossack Fonseca didn’t record many beneficial owners in its database, only inside PDFs or scanned documents, so it was vital to be able to explore both,” says Cabra.
And from this month, anyone will be able to seek out more connections hidden in the graph data when the ICIJ combines it with the data from previous offshore leaks and makes it publicly available for exploration with a Linkurious front end.
“We’re going to crowdsource even more revelations,” says Cabra. “Our website will contain data on over 300,000 companies in tax havens and the people behind them. Reporters, the public, tax officials and prosecutors are all going to be able to explore those connections – and no doubt that will throw up even more surprising names.”