In a guest blog post, Neo Technology’s CEO Emil Eifrem discusses the technology shifts that made the world’s biggest data investigation possible.
Through The Panama Papers we’ve been learning more about the offshore tax haven activity of the global élite.
This isn’t the first time the ICIJ – the organisation behind this epic data scoop, a network of reporters committed to breaking stories of global public interest – has pulled off an investigation at this level; it’s actually the latest in a series of data journalism wins. Only last year, the ICIJ published the Swiss Leaks story, exposing the fraudulent activity of 100,000 HSBC private bank clients in Switzerland. But at 2.6 terabytes and 11.5 million separate documents, from a size perspective, the Panama Papers dwarfs every data leak that’s dominated headlines in the last decade. Did I say ‘decade’? I meant, ever.
What ties these two data-leak stories together is how the ICIJ team worked with the data: by using a graph database. Mar Cabra, the ICIJ’s Data and Research Unit Editor, has said that when the Swiss Leaks material crossed her desk she knew she needed a different kind of tool to analyse such a complex and interconnected dataset – one that could process a large volume of connections quickly and efficiently. She has also stated that she wanted an easy-to-use, intuitive solution that didn’t require major intervention from a data scientist or developer, as the data discovery and analysis process had to be accessible to investigative journalists around the globe – regardless of their technical background.
Graph databases recommend themselves for huge projects of this sort – finding patterns in vast amounts of unstructured, ‘flat’ PDF data – because they are very adept at managing highly connected data and helping users pose complex queries. That’s because instead of storing data in rows and tables the way a traditional relational business database does, graphs use a simple network representation: nodes for the entities, edges for the relationships between them, and properties attached to either to describe them.
Process huge datasets
This architecture makes them highly efficient at analysing interconnections between data, allowing journalists to ‘follow the money’ and spot a story, or a set of connections not previously visible, in ways they have never been able to before. A graph database is therefore a “revolutionary” discovery tool that’s “transformed our investigative journalism process”.
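To make the idea concrete, here is a minimal sketch in plain Python of what ‘nodes, properties and edges’ look like and how following connections works. It is an illustration only: it is not Neo4j’s API and not the ICIJ’s actual method, and every name, relationship type and record in it is invented for the example.

```python
from collections import deque

# Hypothetical, invented data. Each node has an id, a label and properties.
nodes = {
    "p1": {"label": "Person", "name": "Alice Example"},
    "c1": {"label": "Company", "name": "Shell Co A"},
    "c2": {"label": "Company", "name": "Shell Co B"},
    "a1": {"label": "Address", "city": "Exampleville"},
    "p2": {"label": "Person", "name": "Bob Example"},
}

# Each edge is (source id, relationship type, target id).
# The relationship types are made up for this sketch.
edges = [
    ("p1", "DIRECTOR_OF", "c1"),
    ("c1", "SHAREHOLDER_OF", "c2"),
    ("c2", "REGISTERED_AT", "a1"),
    ("p2", "REGISTERED_AT", "a1"),
]


def build_adjacency(edge_list):
    """Index edges by node so neighbours can be looked up directly."""
    adjacency = {}
    for source, rel, target in edge_list:
        # Store both directions: a connection is interesting either way.
        adjacency.setdefault(source, []).append((rel, target))
        adjacency.setdefault(target, []).append((rel, source))
    return adjacency


def shortest_connection(adjacency, start, goal):
    """Breadth-first search: return the shortest chain of node ids, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for _rel, neighbour in adjacency.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


adjacency = build_adjacency(edges)
path = shortest_connection(adjacency, "p1", "p2")

# Print a readable chain using each node's 'name' (or 'city' for addresses).
print(" -> ".join(nodes[n].get("name", nodes[n].get("city")) for n in path))
```

The point of the sketch is that the question ‘how are these two people connected?’ is answered by walking edges from one node to the next, not by joining tables. A real graph database applies the same principle with a query language, indexing and storage built for datasets far larger than this toy example.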
But graph database software had been quietly proving itself in many fields long before investigative journalism picked up on it. For over a decade, big web firms like Google and Facebook have built up a serious array of skills and tools that allow them to derive insight and value from massive amounts of data. Data is their core differentiator, and their business models depend on increasingly sophisticated ways of working with information – which is why they use graph databases.
Large web companies have vast resources in terms of time, money and PhDs to devote to this level of data processing and analysis, for sure. But outside those companies, this capability has been sorely lacking. Indeed, if the Panama Papers leak had happened ten years ago, no story would have been written, because no one outside them would have had the technology and skill-set to make sense of a dataset of this scale.
It’s only with tools like graph databases that investigation of vast and complex datasets like this can occur. The great news is that graph databases are of benefit to far more people than investigative journalists like those at the ICIJ. And while global organisations have been amassing these proprietary processing capabilities, there’s been a parallel movement towards an open technology stack for working with connected data of this magnitude.
The Panama Papers were important, but you haven’t seen anything yet in terms of solving data and relationship problems at huge scale.
The author is Emil Eifrem, co-founder and CEO of Neo Technology, the company behind the graph database Neo4j (http://neo4j.com/), which was one of the main tools in the Panama Papers project.