In 2012, when most of the UK was gearing up for the London Olympics, strange words started to enter our vocabulary. Terms such as “Higgs boson” and “Large Hadron Collider” (LHC) were hitting the mainstream headlines, and CERN, the European particle physics research centre based in a north-west suburb of Geneva, Switzerland, was suddenly all the rage.
On 4 July that year, two experiments had led to the discovery of the Higgs boson, or “God particle” as it was dubbed – a discovery that saw theoretical physicists Peter Higgs and François Englert receive the Nobel Prize in Physics a year later.
It was CERN’s gold medal moment, but as with the athletes about to embark on their own voyages of discovery, much of the hard work behind such high-profile success goes unseen. It had taken decades of research and experimentation to reach this point, and it wasn’t all down to the physicists.
There are, in fact, two CERNs: the one we all hear about that is trying to solve the mysteries of the universe, and the other one, which is somewhat less glamorous.
There are, according to David Widegren, head of asset and maintenance management at CERN, 13,000 people working at the complex at any one time, including up to 10,000 visiting particle physicists. The rest look after the nuts and bolts of the daily running of the place. This is no small challenge, with 700 buildings, roads, car parks, an electricity grid, complex research equipment and, of course, the accelerator complex with its 100 million components.
In 2008, CERN had an accident when a faulty electrical connection between two magnets led to an explosion. CERN released a full explanation, and Widegren claims that if machine learning and an automated asset management system had been in place back then, the accident could potentially have been avoided. That’s the theory, anyway.
The point Widegren is making is that CERN has grown so big it needs automation. It already uses Infor’s enterprise asset management software, EAM, to help keep track of around two million assets, from buildings down to components of the collider. CERN is, in fact, one of Infor’s oldest customers: it has been using EAM for more than 20 years, and although the earlier iterations were, in Widegren’s words, “quite basic”, today EAM has to be powerful and scalable enough to cope with the increasing demands of CERN.
“The two million assets we manage through EAM generate 800GB of data every day,” says Widegren. “If we are to minimise unplanned downtime at CERN, and given that we get a billion Swiss francs a year to research physics, we need to behave like an enterprise and use this data to maximise the visibility of our systems and assets.”
CERN now has lifecycle tracing of all its assets, from manufacturing through to waste management – important given that some of the components become radioactive. Not everything has sensors, but Widegren talks about the site’s industrial internet of things (IIoT) network, the need to use sensors more on machines and components to improve management, and how automation will eventually help reduce downtime by enabling alerts to potential issues.
“The next phase is to use the data to drive automation and predictive technology,” he says.
CERN has been in discussions with Infor to trial its machine learning engine, Coleman AI, so Widegren and his team of 12 can use correlation and pattern recognition to better understand how the accelerators behave and to predict potential failures before they happen.
It’s the latest step in a CERN-wide initiative, started seven years ago, to modernise its IT and asset management. While Widegren has been focused on the rapid escalation in assets and services – he says they have tripled since 2011 – managing an IT function whose user base jumped from a few hundred to 2,000 has also been a challenge.
According to Tim Bell, compute and monitoring group leader at CERN, automation has been an ongoing development, and something CERN has been trying to increase across all its processes when it comes to providing IT services for the community.
In 2011, the IT team was using a self-built tool for IT configuration management which, by Bell’s own admission, was “very limited”. Something had to change, especially given the scaling of the whole community and the LHC preparing for its second, historic run.
The team adopted Puppet, an open source configuration management tool, with the specific aim of making the deployment of its IT infrastructure more manageable. The prospect of having to configure, manage and document thousands, rather than hundreds, of machines was a big enough driver to get Puppet in place as soon as possible.
“We wanted to remove the limitations of our old solution – mainly that there was not much expertise outside of CERN and we felt we could profit from a larger skills pool by using a more popular solution used elsewhere,” says Bell. “That was also making it difficult for us to hire engineers with the right experience. We also had to take care of documenting and evolving our configuration system, basically on our own.”
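The appeal of a tool like Puppet is that an engineer declares the desired state of a machine once, and the agent enforces it everywhere. A minimal sketch of that idea, using hypothetical package, file and service names rather than CERN’s actual configuration, might look like this:

```puppet
# Illustrative manifest: declare desired state once; the Puppet agent
# converges every managed node to it. Names here are assumptions, not
# CERN's real config.
class batch_worker {
  # Ensure the batch-system package is installed
  package { 'htcondor':
    ensure => installed,
  }

  # Manage its local configuration file; changes trigger a restart
  file { '/etc/condor/condor_config.local':
    ensure  => file,
    content => "CONDOR_HOST = manager.example.org\n",
    require => Package['htcondor'],
    notify  => Service['condor'],
  }

  # Keep the service running and enabled at boot
  service { 'condor':
    ensure => running,
    enable => true,
  }
}
```

Because the manifest is declarative, running it on one machine or on thousands is the same operation – which is precisely what matters when a user base jumps from hundreds to thousands of nodes.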
Reducing deployment time, while essential to the ongoing viability of IT systems at CERN, has also had a knock-on effect in terms of the IT team. Automation is already changing the way things are done and the roles of key staff.
“Currently, new services can be deployed in a matter of hours and, more importantly, resources dedicated to each service can also be enlarged or reduced dynamically. This is of great help when coping with service load-related problems, as well as to make transparent to users hardware interventions,” says Bell.
“Adding a new node to a given service is a matter of executing a command using the tools that have been developed at CERN, which integrate Puppet with our OpenStack-based service provisioning infrastructure. As a result of automation, we are reducing the size of the team of engineers that had access to our computer centre, as the number of calls they get has been reduced dramatically.”
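CERN’s in-house tooling that bridges Puppet and OpenStack isn’t public, but the general pattern Bell describes – a new node is classified into a service and configures itself – can be sketched with ordinary Puppet node definitions. The hostnames and class names below are hypothetical:

```puppet
# Illustrative site.pp: newly provisioned nodes matching a naming
# pattern are assigned a role and configure themselves on first run.
# CERN's real integration drives classification from OpenStack
# metadata rather than static node blocks.
node /^batch-\d+\.example\.org$/ {
  include role::batch_worker   # hypothetical role class
}

node default {
  include profile::base        # assumed baseline profile for all hosts
}
```

Once a pattern like this is in place, “adding a node” reduces to provisioning a VM with the right name or metadata – no manual login required, which is why support calls to the computer centre fell so sharply.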
Keeping the physicists happy is, of course, one of the priorities, but that can only really be achieved by making sure they aren’t hamstrung by the IT. Reducing the need for support calls was essential, and it is something Bell believes they have already achieved.
The team adopted a DevOps approach to enable the continuous introduction of service changes while minimising service disruption. It was a new way of working but, says Bell, it fitted the pattern of a large and constantly changing team. It gave the team structure, and support tickets for the system administration team fell from 60,000 in 2011 to a few hundred today.
So, have Puppet and increased automation led to job losses in the IT function, or simply to the redeployment of roles?
“Definitively, the rate of managed services per headcount has increased significantly, as has the total amount of physics-compute resources we run,” says Bell. “At the same time, the number of members in the IT department at CERN has stayed stable, as the number and size of services has increased with time. We have also been able to enhance IT functionality, such as improving service monitoring or working on new software developments for the physics community, in which we can employ more resources.”
While Bell contemplates his next challenge – provisioning services using public cloud resources (the team has already completed proofs of concept, provisioning batch worker nodes in external clouds with Puppet managing their configuration) – he says there are lessons all IT teams can learn from his experience.
“We believe that reusability and specialisation have been key for the success of our Puppet deployment. We’re making use of plenty of upstream Puppet modules, which has contributed significantly to reducing engineering time spent on writing configuration,” he says.
“As well, we have lots of domain-specific experts in the department who are in charge of providing and maintaining configuration for other service managers to build their services on top. For example, if you’re responsible for a content delivery service, you can focus on integrating components, delegating, for instance, the configuration of your back-end storage and the monitoring by simply including centrally maintained Puppet code that you can customise if needed using Hiera.”
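The pattern Bell describes – composing centrally maintained Puppet code and customising it through Hiera data – can be sketched as follows. The class names, Hiera keys and file paths are assumptions for illustration, not CERN’s actual codebase:

```puppet
# Hedged sketch: a service owner builds on classes maintained by
# domain experts, overriding only service-specific data via Hiera.
class role::content_delivery {
  include profile::storage_backend   # assumed: maintained by storage experts
  include profile::monitoring        # assumed: maintained by monitoring team

  # Service-specific tuning lives in Hiera data, e.g. in a
  # data/roles/content_delivery.yaml file:
  #   role::content_delivery::cache_size: 2048
  $cache_size = lookup('role::content_delivery::cache_size', Integer, 'first', 1024)

  file { '/etc/cdn/cache.conf':
    ensure  => file,
    content => "cache_size_mb = ${cache_size}\n",
  }
}
```

The division of labour is the point: the content delivery team never touches storage or monitoring code, and the experts who maintain those profiles can evolve them for every consumer at once.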
Both Bell and Widegren are front and centre of CERN’s infrastructure modernisation. It’s the hidden work, the hard yards needed to make the physics possible. What’s surprising is that it has taken CERN so long to get here. You would think the organisation that gave us the world wide web would always be playing with the latest toys. It did, after all, build some of the first touchscreens back in the early 1970s.
Infrastructure, though, is a little different. It is now big and can be unwieldy, which is why CERN has turned to powerful management tools such as Infor’s EAM and Puppet, and is looking to increase automation and its use of machine learning.
As Widegren says, “it’s a constant simplification of complex machines and systems”. While it might not be the answer to the mysteries of the universe, it’s making the physics possible. As the LHC goes into a period of hibernation for repairs and upgrades, it’s a maxim by which, for the next 18 months at least, everyone at CERN will be living.
Read more about the IT behind CERN
- What is CERN?
- Case study: CERN adopts OpenStack private cloud to solve big data challenges.
- How CERN, the particle physics laboratory, is using IT service management to support operations outside the IT function.