University College London (UCL) is consolidating research data from multiple disciplines onto one central system built on a petabyte of DataDirect Networks object storage and parallel file system arrays.
UCL has more than 5,000 research staff in subjects ranging from particle physics and biomedical sciences to humanities such as archaeology.
Historically, research data has resided wherever the researcher put it. That varied from thumb drives or CDs in desk drawers to racks of servers in the more IT-literate departments.
But now that is all set to change. Eighteen months ago Max Wilkinson was brought in as head of research data services with the task of providing researchers with a high-performance, resilient system for storing, sharing, re-using and preserving project data.
He said: “Our goal is to remove the burden of managing project data from individual researchers while making it more available over longer periods of time.”
Wilkinson’s team aims to build a central repository that will be used by up to 3,000 individuals from UCL’s total base of 5,000 researchers in the next 18 to 24 months. The intention is to build those systems on best practice in data management, with metadata catalogues that provide structured information about the research, allowing it to be shared, searched and analysed.
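In rough terms, a metadata catalogue of this kind is a set of structured records about datasets, indexed so they can be searched. A minimal sketch follows; the field names (project, discipline, keywords) are illustrative assumptions, not UCL's actual schema:

```python
# Minimal sketch of a research-data metadata catalogue.
# Field names are illustrative only, not UCL's actual schema.

records = [
    {"project": "P001", "discipline": "particle physics",
     "keywords": ["detector", "calibration"]},
    {"project": "P002", "discipline": "archaeology",
     "keywords": ["excavation", "ceramics"]},
]

def search(catalogue, term):
    """Return projects whose discipline or keywords mention the term."""
    term = term.lower()
    return [r["project"] for r in catalogue
            if term in r["discipline"].lower()
            or any(term in k.lower() for k in r["keywords"])]

print(search(records, "archaeology"))  # ['P002']
```

Structured records like these are what make research data discoverable by people other than the researcher who produced it.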
Building the setup from scratch
When it came to providing the storage infrastructure to support the plans, Wilkinson’s team started to look for systems that could grow to hundreds of petabytes without creating an excessive storage footprint or administrative overhead.
UCL sought tenders from 21 suppliers with systems that ranged from synchronous file sharing to high-performance, parallel file systems, via SAN, NAS, cloud and object storage systems.
Wilkinson said: “We discovered pretty quickly that an off-the-shelf supported product to do what we wanted didn’t exist, except in an unrealistically expensive sense. So, we had to put something together ourselves.”
So far, it has installed around half a petabyte of DDN storage, comprising 450TB in GRIDScaler parallel file system hardware linked to 50TB in two Web Object Scaler (WOS) object-based storage nodes.
The disks in the GRIDScaler are tiered, with 600GB 15,000rpm SAS drives for metadata and 3TB 7,200rpm SAS drives for research data. Meanwhile, the WOS platform houses 120 2TB drives.
The GRIDScaler device is effectively a clustered NAS device, although at this point UCL has only one node. Parallel file systems are built to scale to billions of files and many petabytes of capacity, and potentially allow access to a single file system from many devices.
Clustered NAS and parallel file system capabilities have gone from specialised use cases such as media/broadcast and HPC number-crunching to become mainstream NAS features in the past couple of years.
Object-based storage for ease of search
A key focus of the UCL project, however, is to assess the potential of object storage as a platform for its research data. Object-based storage does away with the need for traditional hierarchical file systems, which can become unwieldy at very large data volumes and file counts.
Instead, data is organised in a flat namespace, with each object addressed by its own unique identifier, in a similar way to DNS on the internet.
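The idea can be sketched in a few lines: every object is stored against a globally unique key, and retrieval is by that key rather than by walking a directory tree. This is a hedged illustration using Python's standard uuid module, not the actual WOS API:

```python
import uuid

# Minimal sketch of a flat object store: each object gets a unique
# identifier at write time, and lookup is by ID, not by path.
# Illustrative only; not DDN's WOS interface.

store = {}

def put(data: bytes) -> str:
    """Store data and return its unique object ID."""
    oid = str(uuid.uuid4())
    store[oid] = data
    return oid

def get(oid: str) -> bytes:
    """Retrieve an object by its ID; no directory traversal involved."""
    return store[oid]

oid = put(b"experiment results")
assert get(oid) == b"experiment results"
```

Because there is no hierarchy to maintain, the namespace can in principle grow to billions of objects without the bookkeeping a directory tree requires.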
For that reason object storage is touted as being suited to very large datasets, but it is early days and has yet to see widespread acceptance.
The current UCL trial comprises the storage of 10 research projects, mostly held on file system storage, with Dropbox-like cloud interfaces being tested.
The trial will help decide whether to proceed further with object storage. A key challenge is the relationship of data on object and file system storage, because data is stored in completely different ways on each.
Wilkinson said: “File systems are familiar and they perform well but don’t scale. Object storage scales but does not necessarily perform, and the question of how data on the file system and on object storage relate to each other is one we’re unpicking currently.
“We will only achieve a solution by experimenting. We have to decide how to move things between object storage and the file system, whether to make object storage look more like a file system or vice versa.
“If object storage lives up to its promise of scalability, that will outweigh all other issues.”
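One way to make object storage "look more like a file system", as Wilkinson puts it, is a gateway layer that maps familiar hierarchical paths onto flat object IDs. A hypothetical sketch, assuming a simple in-memory mapping rather than any product UCL is evaluating:

```python
import uuid

# Hypothetical path-to-object gateway: users see hierarchical
# file names, while data lives in a flat object store keyed by
# unique IDs. Illustrative only.

objects = {}    # flat store: object ID -> data
namespace = {}  # mapping layer: path -> object ID

def write(path: str, data: bytes) -> None:
    """Store data under a path by allocating an object ID for it."""
    oid = str(uuid.uuid4())
    objects[oid] = data
    namespace[path] = oid

def read(path: str) -> bytes:
    """Resolve a path to its object ID, then fetch the object."""
    return objects[namespace[path]]

write("/projects/p001/results.dat", b"raw readings")
assert read("/projects/p001/results.dat") == b"raw readings"
```

The design question UCL describes is exactly where such a mapping should live, and whether data should be moved between the two tiers or presented through one unified view.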
Currently the DDN storage is at one physical location but, by the end of the year, capacity will double with the addition of a second datacentre.