How supercomputing is transforming experimental science
NERSC lead data scientist Debbie Bard talks about how large-scale data analytics using supercomputers is making new types of science possible
Debbie Bard will be making, in a sense, a homecoming speech at the DataTech conference in Edinburgh on 14 March 2019.
Bard leads the Data Science Engagement Group at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory in the US. She is an alumna of the University of Edinburgh, where she did a PhD in physics.
DataTech is part of the two-week DataFest 19, organised by The Data Lab, a data innovation-focused agency supported by the Scottish government. DataTech is being held at the National Museum of Scotland in Edinburgh.
Bard’s talk is entitled “Supercomputing and the scientist: how HPC and large-scale data analytics are transforming experimental science”.
She argues that although computing has been an important scientific tool for many decades, the “increasing volume and complexity of scientific datasets is transforming the way we think about the use of computing for experimental science”.
NERSC is the computing centre for the US Department of Energy Office of Science. It runs some of the most powerful computers on the planet. Bard talks about how supercomputing at NERSC is used in experimental science to change how scientists in particle physics, cosmology, materials science and structural biology collect and analyse data.
Bard’s team supports more than 7,000 scientists and 700 projects with supercomputing needs at NERSC. She is a British citizen whose career spans research in particle physics, cosmology and computing on both sides of the pond. She worked at Imperial College London and SLAC National Accelerator Laboratory in the US before joining the data department at NERSC.
Making new experiments possible
Ahead of her talk at DataFest, she took some time out to talk to Computer Weekly.
“The transformational part is about how computing is enabling new kinds of hardware to function to make new experiments possible,” she says.
“If you have a very high-resolution detector, you need to be able to analyse the data coming off that detector, and for that you need HPC [high-performance computing] and large-scale data analytics. That all opens up new opportunities, which then open up new kinds of questions.
“That’s what I get really excited about – when you can use computing to open up new kinds of science that were impossible before, that could not even be thought about.
“For example, in electron microscopy, new types of detectors are producing insane amounts of data, through four-dimensional scanning – that is to say, also in time. That is where supercomputing comes in, to help design analysis algorithms.
“Another is ‘messy’ genomic analysis, where a geneticist has a sample of a microbiome – for example, a soil sample – containing thousands of different organisms of bacteria. Trying to do sequential DNA analysis of all those bacteria is insanely complex. It’s a huge, data-intensive problem. And it is important because if you know which soil is productive, you can grow crops more effectively without pesticides,” she says.
Motivated by the mission
Bard’s team of half a dozen data scientists at NERSC help the organisation’s scientists write code that will run well on their computing resources.
They have all, including Bard, “spent time working in areas that are computationally demanding”, but none are computer scientists. Instead, they come from such fields as bioinformatics, physical chemistry and materials science. Bard is a cosmologist by background.
Bard says it is “challenging to hire people” in the San Francisco Bay Area. There is competition from other labs, such as the Lawrence Livermore National Laboratory, and from research universities, such as Stanford and the University of California.
There are also the Silicon Valley companies, such as Google, Facebook and Apple, that can offer much bigger salaries. But people doing scientific computing are “motivated by the mission, working on scientific challenges using massive computers,” says Bard.
The data itself is also different in kind to that analysed by commercial organisations. “Reproducibility is important for scientific data – being able to trace the provenance of the data, what’s been done to it,” says Bard. “And all the metadata around an experiment, such as what time it was done and what the conditions were. That is a big problem.
“You can’t afford to lose any data, you can’t drop any packets when transferring between sites. Every byte is important. You have to think hard about your compression methodologies.
“We don’t have access to any of the easy data compression schemes that might be accessible in the commercial sector, so we need specialist networking to transfer scientific data,” she adds.
There is also the matter of “black box” machine learning algorithms, where a scientist would “really need to know why decisions are being made by algorithms”.
“It can be difficult to have interpretable machine learning algorithms, and that is an area where the research community is having to step up,” says Bard. “If you can’t understand why an algorithm is working, it is difficult for a scientist to accept the results. So, that is a barrier to machine learning being accepted by the scientific community.
“In a commercial application, you don’t really care why your algorithm is saying, ‘That’s a picture of a cat’ or ‘That’s a picture of a dog’, as long as it is doing it accurately. In a scientific application, you do care about accuracy, but also about why it is working, so you can trust it isn’t hiding an internal bias.”