Inside Australia’s supercomputing journey

The country’s Commonwealth Scientific and Industrial Research Organisation has upgraded its high performance computing infrastructure to keep pace with global research

At Australia’s national research facility in Canberra, the top minds from across the sprawling country gather to solve some of the world’s most pressing issues.

From better understanding extreme weather conditions to helping the visually impaired navigate the streets, the work undertaken by Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO) has the potential to improve lives.

For years, CSIRO’s researchers had relied on Bragg, a supercomputer built by Xenon Systems, a Melbourne-based supplier of high performance computing (HPC) systems, to crunch large datasets.

While Bragg – named after the British Nobel Laureates Lawrence and Henry Bragg, who spent part of their careers living and working in Australia – was a leading HPC system when it first launched in 2009, it was becoming obsolete, given the rate of innovation in technology.

“To keep pace with global research, we need to provide Australia’s scientists and engineers with high-performance systems that give them efficiencies in their line of scientific inquiry,” said Angus Macoustra, deputy CIO and head of scientific computing at CSIRO. “The quicker they can analyse a dataset, model a system or simulate an experiment, the quicker they can draw a conclusion to their hypothesis.”

While some of the world’s most powerful supercomputers have traditionally been powered by central processing units (CPUs), there is growing adoption of graphics processing units (GPUs) that are better suited for parallel processing and running deep learning algorithms in areas such as weather forecasting.

Housed in a datacentre in Canberra, Bragg is essentially a 128-node GPU cluster, one that Macoustra admits was getting quite old. For one, its GPUs are now three generations behind those in its successor, Bracewell.

Doubling total aggregate performance

With a similar footprint to Bragg, Bracewell sports over 114 nodes in the form of Dell PowerEdge C4130 servers that are decked out with Nvidia GPUs and Intel Xeon CPUs. Macoustra said CSIRO has not had the chance to benchmark the system yet, but he reckons that “Bracewell has allowed us to double the total aggregate performance available from all our HPC systems”.

“More importantly than a benchmark result, we’re seeing real world application run times significantly improved – one of our manufacturing projects used to take five hours using all 128 nodes of Bragg, and can now do the same sort of analysis using just a quarter of Bracewell in two hours.”
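As a rough back-of-envelope reading of that comparison, the quoted figures can be converted into node-hours. The sketch below assumes Bracewell has exactly 114 nodes (the article says "over 114") and that "a quarter of Bracewell" means a quarter of those nodes. The function and variable names are illustrative only, and the result says nothing about how any individual application scales.

```python
def node_hours(nodes: float, hours: float) -> float:
    """Return the node-hours consumed by a job."""
    return nodes * hours


BRAGG_NODES = 128
BRACEWELL_NODES = 114  # assumption: the article says "over 114"

# Figures quoted for the manufacturing project.
bragg_cost = node_hours(BRAGG_NODES, 5)              # all of Bragg, five hours
bracewell_cost = node_hours(BRACEWELL_NODES / 4, 2)  # a quarter of Bracewell, two hours

wall_clock_speedup = 5 / 2
node_hour_ratio = bragg_cost / bracewell_cost

print(f"Bragg: {bragg_cost:.0f} node-hours")
print(f"Bracewell: {bracewell_cost:.0f} node-hours")
print(f"Wall-clock speedup: {wall_clock_speedup:.1f}x")
print(f"Node-hour reduction: {node_hour_ratio:.1f}x")
```

Under those assumptions, the job finishes 2.5 times sooner while using roughly 11 times fewer node-hours.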

Building any system the size of Bracewell has its challenges. Macoustra said the system’s many components can cause problems, so he put in place a structured and repeatable approach to setting up and configuring the supercomputer. “We’ve done some things in earlier projects such as invest in the Bright Computing cluster management software – this has allowed us to standardise our build image and be able to deploy our software stack at the click of a button.”

Dealing with power and heat is also a challenge. Macoustra and his team worked with CSIRO’s datacentre service provider in Canberra to address the extreme power and cooling requirements of Bracewell, while still operating in a highly energy-efficient and environmentally sustainable facility.

The other big challenge is achieving the “high performance” in HPC. “There are so many variables that can impact your end result from things such as your cable lengths through to obscure setup variables in your operating system,” said Macoustra. “Thankfully, I’ve got a really talented team that have been doing this for many years now, and they’re pretty good at working with our partners like Dell EMC to optimise the design, and setup, of these systems.”

Macoustra’s work at CSIRO comes at a time when interest in tapping cloud services to run HPC workloads is growing. Major cloud suppliers like Amazon and Microsoft have been offering services aimed at governments and research institutions with HPC requirements for some years now. Going by some estimates, the global cloud HPC market is expected to reach $10.8bn by 2020, up from $4.4bn in 2015.

Both cloud and on-premise HPC systems play a role at CSIRO, Macoustra said, adding that cloud-based HPC is well-suited for certain types of analysis and processing, such as the work of Denis Bauer and her team from the Australian eHealth Research Centre and their genome analysis software. In fact, CSIRO is reportedly looking to shore up the connectivity between its sites and Amazon’s datacentres.

For applications such as climate modelling conducted by CSIRO’s Climate Science Centre, Macoustra said on-premise HPC systems and those at Australia’s national peak HPC centres – the National Computational Infrastructure and the Pawsey Supercomputing Centre – are preferred.

“There are other reasons, but one of the primary drivers behind which service to use is data – when your research is looking at a dataset that is getting up into the petabytes, moving this sort of data volume around across your network, or in to a cloud provider, comes at significant cost in both time and dollar terms,” he said.
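To give a sense of the time cost Macoustra describes, the sketch below estimates how long one petabyte takes to move over a sustained network link. The link speeds (10 and 100 Gbit/s) and the decimal definition of a petabyte are assumptions chosen for illustration, not CSIRO figures. The estimate also ignores protocol overhead and contention, so real transfers would likely take longer.

```python
def transfer_days(dataset_bytes: float, link_gbit_per_s: float) -> float:
    """Estimate days needed to move a dataset over a fully utilised link."""
    bits = dataset_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9)
    return seconds / 86_400  # seconds in a day


PETABYTE = 10**15  # assumption: decimal petabyte

# Hypothetical sustained link speeds, in Gbit/s.
for speed in (10, 100):
    days = transfer_days(PETABYTE, speed)
    print(f"{speed} Gbit/s: {days:.1f} days per petabyte")
```

At these assumed rates, a single petabyte ties up a 10 Gbit/s link for more than nine days, which is the kind of delay that favours keeping large datasets next to the compute.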

Deep learning techniques

Meanwhile, CSIRO researchers have started to take advantage of Bracewell’s HPC prowess. Data61, CSIRO’s data innovation group, for example, is using deep learning techniques to recognise objects, people and even facial expressions from an image or live camera feed.

“This type of science has huge application across many sectors of the economy from things like analysing drone or satellite imagery as part of a decision support system for agriculture, through to medical devices that can aid the vision impaired to navigate their surroundings,” said Macoustra.

A CSIRO researcher is also using Bracewell for solar forecasting, where images of the sky and cloud formations are taken every 10 seconds to track and predict cloud movement. The images are then used to build a machine learning model that predicts changes in solar irradiance and power generation.

Macoustra said this is useful for off-grid remote area power systems, which need advance warning of shade events to spin up additional diesel generators, and for large grid-connected solar farms, which need to bid for capacity on the energy market five minutes ahead of time. “The use of Bracewell has allowed this researcher to complete his model processing in a day, whereas previously he was spending a week.”
