National Archives races to create electronic archive of EU law before Brexit

The National Archives faces challenges converting the EU's enormous library of laws into a publicly accessible UK archive ahead of Brexit. The Archives’ digital director, John Sheridan, explains how

Almost as soon as the EU referendum result was known, John L Sheridan could see that transforming the way the UK publishes legislation after Brexit was going to be a big job.

Sheridan is digital director at The National Archives, which bears the statutory responsibility for archiving and publishing UK law. That work has two prongs: publishing all current UK law on the government’s public site, and incorporating all EU law into its own historical archives for future generations to consult.

Both present technical challenges. Besides the sudden considerable expansion of the corpus of both current and archived law that the UK’s departure from Europe brings about, Sheridan will have to solve the question of how to import all the necessary data from Europe’s law archive, EUR-Lex, and convert it to the formats used by the UK.

The Archives has managed large conversion projects before, but Sheridan says: “Even for us, EUR-Lex has been a challenge because of the sheer size of the website and because it’s the first thing we’ve ever done that is inherently multilingual.”

Under the Withdrawal Act, at 11pm on 29 March 2019, all EU legislation will be repealed and will cease to have any legitimacy in the UK (see box below). The corpus of EU law will be imported into UK law, which the act formally requires the Archives to publish in its role as the Queen's Printer.

Over time, the two bodies of law will diverge as the European parliament modifies and extends the EU’s laws and the UK parliament amends and supersedes the legislation it is inheriting. So simply linking, web style, to texts on EUR-Lex is not an option.

Three difficult problems faced Sheridan and his team of three at the Archives, augmented by three or four more at the UK-based specialist archiving company MirrorWeb, which won the bid to operate the web archive starting in 2017.

First, there was the uncertainty. When they began work within months of the referendum, the shape of the eventual withdrawal legislation was unknown, so whatever the Archives built had to be flexible enough for various scenarios. 

Second was understanding somebody else’s data and working out how to convert complex data in a difficult domain into a format that can underpin good decisions.

The third problem was carrying out the actual conversion, because EUR-Lex codes its documents using one variant of legal XML (Formex, developed by the European Commission), and the Archives uses another (Crown Legislation Markup Language, or CLML).

Sheridan has previous experience in large-scale conversion projects: he has overseen the legislation database move from the early 1990s programming language LISP to web precursor SGML (Standard Generalised Markup Language), to HTML, XML, and finally CLML.

Still, it took nine months to establish that Formex and CLML are compatible enough for translation. Small differences in drafting styles matter when these documents are represented as data.

For example, European legislation uses more arbitrary sub-dividers, more complex annexes, and much more scattered annotation, footnotes and boxes. This print-oriented design poses a persistent challenge for markup languages, says Sheridan, “and that’s definitely true of European legislation” – even though the EU began designating the signed electronic version, rather than print, as the definitive version in 2014.

National Archives project aims to make legislation easier to understand

Even before the EU referendum, The National Archives had begun rethinking how best to present legislation, inspired by the Office of Parliamentary Counsel’s Good Law project.

In a 2014 usability study, carefully constructed comparison tests showed that even some of the participating lawyers struggled to construct mental models of the legislation they were reading, and that this problem greatly outweighed differences in drafting style.

The many cross-references to sections of other pieces of legislation that mount up in just a few paragraphs make UK legislation, which is usually amending previous legislation, difficult to read. The study’s account suggests that the drafters’ user-watching experience was much like that of software developers in the early 1990s, when usability entered software design – incredulity that users could be so clueless, followed by dawning awareness that users were not the problem.

The 2014 study has led to changes in drafting practice to make legislation more understandable and has prompted an ongoing radical rethink about its presentation, beginning with the Archives’ timelines, which were introduced in 2009 and show how legislation has changed over time.

It has become clear, says John Sheridan, digital director at The National Archives, that the UK’s EU exit requires these timelines to be much more prominent. “When we put this in front of users, they say they find it very reassuring,” he says.

In addition, the Archives is working with specialist Bunnyfoot to make substantial changes and improvements to the site’s front end and to carry out usability testing.

Even without these variations, legislation is “difficult content to work with”, says Sheridan, adding that the site “combines both native XML technologies for documents with RDF [linked data] for rich metadata about the documents, including all the information about the amendments”.

The site’s technical architect was Jeni Tennison, now CEO of the Open Data Institute.

In total, the Archives will bring across about 150,000 pieces of legislation, including all the legislation content in the Official Journal of the European Union, case law from the European Court of Justice – in English, French and German – plus all the underlying data, metadata and index pages – a total of tens of millions of documents.

To cover this effort, the Treasury allocated an extra £1.2m this year; last year’s work cost about £465,000.

The Withdrawal Act has made at least one aspect of the job easier – for the first time in British history, the law puts the internet at the centre. The requirement laid on the Archives is to “publish” the legislation, not to print it.

One significant change in the conversion is that where EUR-Lex assigns a unique identifier to each legal document – known as a Celex Number – CLML gives a unique identifier to each legislative article.

Sheridan’s group found that the usual method of gathering a website’s content – the Heritrix crawler the Internet Archive developed for its Wayback Machine – was not suitable for EUR-Lex, which is search-based, rather than browse-based.

It proved easier to compile and verify a giant list of the documents’ Celex numbers by identifying URL patterns to harvest, using tools provided by the EUR-Lex web services and the EU’s open data portal SPARQL access point.
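
The verification step described above can be illustrated with a minimal Python sketch. It is not the Archives’ tooling: the function name `check_celex_list` and the simplified pattern (one sector character, a four-digit year, a one- or two-letter document type and a four-digit number) are assumptions made for illustration, and real Celex numbers include variants, such as corrigenda and consolidated versions, that this pattern does not cover.

```python
import re
from typing import Iterable, List, NamedTuple

# Simplified pattern (an assumption for this sketch): sector, 4-digit year,
# 1-2 letter document type, 4-digit sequence number, e.g. 32016R0679.
CELEX_PATTERN = re.compile(
    r"(?P<sector>[0-9CE])(?P<year>\d{4})(?P<doc_type>[A-Z]{1,2})(?P<number>\d{4})"
)


class CelexCheck(NamedTuple):
    valid: List[str]       # unique, well-formed identifiers in first-seen order
    malformed: List[str]   # entries that did not match the simplified pattern
    duplicates: List[str]  # well-formed identifiers seen more than once


def check_celex_list(raw: Iterable[str]) -> CelexCheck:
    """Normalise, de-duplicate and sanity-check a harvested identifier list."""
    seen = set()
    valid, malformed, duplicates = [], [], []
    for entry in raw:
        candidate = entry.strip().upper()
        if CELEX_PATTERN.fullmatch(candidate) is None:
            malformed.append(entry)       # keep the original text for review
        elif candidate in seen:
            duplicates.append(candidate)
        else:
            seen.add(candidate)
            valid.append(candidate)
    return CelexCheck(valid, malformed, duplicates)


if __name__ == "__main__":
    report = check_celex_list(["32016R0679", " 32016r0679 ", "not-a-celex"])
    print(report)
```

Keeping malformed and duplicate entries in separate lists, rather than silently dropping them, lets a person review the gaps, which matches the mix of automatic and manual checks the project describes.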

New legal structures after Brexit

EU law, which includes treaties, directives, regulations and judicial decisions, derives its legitimacy in the UK from the European Communities Act 1972, which provided for the accession of the UK to the EEC, or Common Market, and incorporated European Community law into UK law, making it binding on all legislation passed in the UK.

The European Union (Withdrawal) Act 2018 ensures legal continuity after exiting the EU. Section 1 repeals the 1972 Act and creates the new “retained EU law” category of UK law. Section 2 saves the statutory instruments implemented in support of EU directives. Section 3 incorporates direct EU legislation into UK law. Section 8 grants the government the power to modify these laws via secondary legislation and creates a new parliamentary process, “sifting”, for scrutinising statutory instruments.

The National Archives has two jobs to do. First, it must archive the entire corpus of EU law up until EU Exit Day as part of its web archive, which includes all historical UK government web publishing. Second, it must transfer and publish the converted corpus being imported into UK law.

Both the web archive and the legislation site are hosted in the cloud, giving the Archives flexibility to handle the expansion without requiring new hardware. Similarly, Sheridan’s group has been able to adapt its existing software tools.

The group also worked with Washington, DC-based legal software specialist Juris Datum to write a sophisticated set of transformation routines to convert Formex into CLML.

By mid-August 2018, well over 99% of the 150,000 pieces of legislation had been successfully converted for publication. Sheridan expects to complete the first full capture of the EUR-Lex content by the end of August, including performing both automatic and manual checks and patching any gaps or issues they find.

After that, the project will perform daily incremental captures until Exit day, to ensure that a complete snapshot is ready.
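
The article does not say how the incremental captures are computed. One common approach, sketched here in Python purely as an assumption, is to keep a manifest that maps each document identifier to a hash of its content and compare yesterday’s manifest with today’s. The names `fingerprint` and `plan_incremental_capture` are hypothetical.

```python
import hashlib
from typing import Dict, List


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest used to detect content changes."""
    return hashlib.sha256(content).hexdigest()


def plan_incremental_capture(
    previous: Dict[str, str], current: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare two {identifier: fingerprint} manifests.

    Returns the identifiers that are new, that changed, and that
    disappeared since the previous capture, each sorted for stable output.
    """
    added = sorted(k for k in current if k not in previous)
    changed = sorted(
        k for k in current if k in previous and current[k] != previous[k]
    )
    removed = sorted(k for k in previous if k not in current)
    return {"added": added, "changed": changed, "removed": removed}


if __name__ == "__main__":
    yesterday = {"32016R0679": fingerprint(b"text as captured yesterday")}
    today = {"32016R0679": fingerprint(b"text as captured today")}
    print(plan_incremental_capture(yesterday, today))
```

Only the documents listed under "added" and "changed" would need to be fetched again, which keeps a daily run small compared with the first full capture.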

One of Sheridan's most important goals is that it should always be possible to look at any law and trace the process that brought it into being and understand what underpins its legitimacy (see box above).

Therefore, he says, each piece of converted CLML data will link back to the archived Formex data from the European law archive. Together with publicly available logs of the harvesting work, the link “will give a complete picture of how this data became part of the UK’s legislation database”, he says.

Provenance will be represented using a specification, PROV-O, developed by the World Wide Web Consortium, says Sheridan. “We see this project as an opportunity to show provenance being done really well on the web,” he adds. “The provenance really matters here, as this is law that we are publishing.”
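
As a rough illustration of what such a provenance link could look like, the Python sketch below writes a small PROV-O description in Turtle. The PROV-O terms used (`prov:Entity`, `prov:Activity`, `prov:used`, `prov:wasDerivedFrom`, `prov:wasGeneratedBy`, `prov:endedAtTime`) come from the W3C vocabulary. The function name `provenance_turtle` and the `example.org` URIs are hypothetical and do not reflect the Archives’ actual identifiers or data model.

```python
from datetime import datetime, timezone

PROV_PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> ."
)


def provenance_turtle(
    clml_uri: str, formex_uri: str, activity_uri: str, finished_at: datetime
) -> str:
    """Describe a converted document as derived from its archived source.

    `finished_at` must be timezone-aware; it is written out in UTC.
    """
    timestamp = finished_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "\n".join([
        PROV_PREFIXES,
        "",
        f"<{formex_uri}> a prov:Entity .",
        "",
        # The conversion run that read the archived Formex source.
        f"<{activity_uri}> a prov:Activity ;",
        f"    prov:used <{formex_uri}> ;",
        f'    prov:endedAtTime "{timestamp}"^^xsd:dateTime .',
        "",
        # The converted CLML document, linked back to source and run.
        f"<{clml_uri}> a prov:Entity ;",
        f"    prov:wasDerivedFrom <{formex_uri}> ;",
        f"    prov:wasGeneratedBy <{activity_uri}> .",
    ])


if __name__ == "__main__":
    print(provenance_turtle(
        "https://example.org/clml/32016R0679",            # hypothetical URI
        "https://example.org/archive/formex/32016R0679",  # hypothetical URI
        "https://example.org/conversion/run-42",          # hypothetical URI
        datetime(2018, 8, 15, 12, 0, tzinfo=timezone.utc),
    ))
```

Recording the conversion run as its own `prov:Activity` gives the harvesting logs something to attach to, so a reader can follow a converted document back to both its source file and the run that produced it.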
