Digitising archives at the Dutch Tribunals Archives Project (Triado) entailed moving from kilometres of archive boxes in a depot to a search engine with which you can access any document.
The Triado project started three years ago, with the aim of identifying which computer technology could best be applied to historical collections to digitise and unlock them.
“Of course, we are not the first to work on this, but we wanted to find out what we could do with the available technology, especially for the collection of the Central Archive of Special Justice (CABR) of the Nationaal Archief (National Archive),” said project leader Edwin Klijn.
This archive was created shortly after the war, and contains files of 300,000 people who were suspected of collaborating with the German occupiers in the Second World War.
“A perpetrator archive, as it is called in professional jargon,” he said.
Moreover, it is probably the most important archive about the Second World War in the Netherlands.
“We wanted to see if we could make 13 metres of the archive, which largely contains typed documents but also handwritten papers, machine-readable, and to find out what the margin of error would be and what we could then do with the data,” said Klijn.
To digitise the documents, the research group experimented with two well-known software packages: ABBYY FineReader and Tesseract. “ABBYY is actually just optical character recognition (OCR), but Tesseract already has more machine learning,” he said.
In addition to digitising, another goal was to be able to access the files and make them searchable at the document level.
“We tried to extract names of people, place names and dates from the machine-readable text using named entity recognition,” said Klijn.
“We then linked this data to other data sources, such as the database for victims of persecution (NDVS) that was created by Remembrance Centre Camp Westerbork and the Jewish Cultural Quarter.”
This database contains all the names of people who were persecuted during the war. “We hoped we would recognise names from the NDVS in our 13-metre digitised archive. This resulted in a few 100% matches for this small test set, because in the files of the perpetrators, many names of victims are mentioned as well, of course.”
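The article does not say how names were compared with the NDVS. As a rough illustration only, the sketch below links names extracted from OCR text to a reference list using approximate string matching from Python's standard library. The function names, the 0.85 threshold and all sample names are assumptions invented for this example, not details of the Triado project.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0 (1.0 means identical after normalising)."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def link_names(extracted, reference, threshold=0.85):
    """For each extracted name, keep the best reference match if it reaches the threshold."""
    links = []
    for name in extracted:
        best = max(reference, key=lambda ref: similarity(name, ref))
        score = similarity(name, best)
        if score >= threshold:
            links.append((name, best, round(score, 2)))
    return links


# Invented sample data, not taken from the CABR or the NDVS.
ocr_names = ["Jan de Vries", "Pieter Jansen", "Wilhelm Bakker"]
register = ["Jan de Vries", "Pieter Janssen", "Anna Visser"]

links = link_names(ocr_names, register)
for extracted, matched, score in links:
    print(f"{extracted} -> {matched} ({score})")
```

A score of 1.0 corresponds to the kind of 100% match Klijn describes. Lowering the threshold would catch more OCR misspellings but would also produce more false links, which matters a great deal in an archive this sensitive.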
The software’s large margin of error in recognising names was also striking, said Klijn. “There were many German terms, and sometimes cases were written with a capital letter in old Dutch.”
In addition to enrichment, experiments were also carried out to enable the computer to recognise certain types of documents. The CABR contains many standard, predictable documents, such as membership cards, states of intelligence and official reports, so the system must be trainable to recognise them.
“We had a score of 80% right and 20% wrong,” said Klijn. “Also, there is a lot of room for improvement in follow-up projects. I have to say that the learning curve was still rising at the end of the project.”
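A score like this is typically obtained by comparing the system's predictions with labels assigned by hand. The sketch below shows that calculation under stated assumptions: the document types and the five sample labels are invented, and the project's actual evaluation method is not described in the article.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Return the fraction of documents whose predicted type equals the hand-assigned type."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one labelled document is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def errors_by_type(predicted, actual):
    """Count misclassified documents per true document type."""
    return Counter(a for p, a in zip(predicted, actual) if p != a)


# Invented labels for five documents; not real project data.
actual = ["membership card", "official report", "official report",
          "membership card", "official report"]
predicted = ["membership card", "official report", "membership card",
             "membership card", "official report"]

print(f"accuracy: {accuracy(predicted, actual):.0%}")
print(dict(errors_by_type(predicted, actual)))
```

Breaking the errors down by document type shows where extra training examples would help most, which is one way a follow-up project could work on the remaining 20%.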
The advantage of this automatic classification is that specific types of documents can be retrieved from the archive, which is 4km long in total, at the push of a button.
“Finally, at the end of the project, we made a prototype, an internal website where we can search the 13-metre archive that we have digitised,” said Klijn. Public access to the archive's material is restricted, so the prototype could not be opened up to outside users. But it showed what is concretely possible with digitised historical text collections and archives.
“I’m itching to continue with new technologies and the rest of the archive,” he said. “Especially to reduce the margin of error more and more.” But there are also difficulties with the digitisation of the entire archive, such as privacy issues and ethical issues.
“It’s an incredibly sensitive archive, not only because of the names it contains, but also because of its character: a ‘perpetrator archive’. Apart from that, I suspect that another major challenge lies in technology and infrastructure. Digitising the entire archive would take about six years.
“That’s why it’s essential to take into account the progressive development of technology and why an infrastructure should be put in place that makes it possible to continuously innovate,” said Klijn. “Because everything we do tomorrow will be outdated the day after tomorrow.”
For example, Transkribus is now available, software that learns from a set of handwritten documents how to read the rest.
“Unfortunately, in 2016, when we started the project, the software was not yet fully developed, so we didn’t work with it,” he said. “We opted for off-the-shelf products and measured the margin of error. In the case of the typed material, it turned out to be 15%. So there is still room for improvement.
“And if we were to start working with Transkribus now, that margin of error would probably also decrease considerably.”
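The article does not specify how the margin of error was measured. One common way to score OCR output is the character error rate: the number of single-character edits needed to turn the OCR text into a hand-checked transcription, divided by the length of that transcription. The sketch below assumes that metric; the sample strings are invented.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """Edit distance between OCR output and the reference, relative to the reference length."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ocr_text, reference) / len(reference)


# Invented example: the OCR engine misreads "m" as "rn" twice.
reference = "Amsterdam, 12 mei 1946"
ocr_output = "Amsterdarn, 12 rnei 1946"

cer = character_error_rate(ocr_output, reference)
print(f"character error rate: {cer:.1%}")
```

Measured this way over a sample of pages, the figure gives a baseline against which newer software could be compared on the same material.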
Rewriting war books
It’s important that archives of this kind are made available, for example, for humanities research. Family members, but also historians, will be able to use this data to ask new questions of the archive or to test old ones.
It’s an enormous amount of new data from an original source that can be searched for every word in the text.
“I suspect that when this data becomes available, many books about the war can be rewritten,” said Klijn. “It also means researchers, scientists and historians have to adapt their toolbox. There is not much data available yet, but as soon as these kinds of collections become available digitally, you can do research in a new way.”
Triado shows that new technology has much to offer for digitising and unlocking archives. This is revolutionary for the Dutch archive sector, because it means it’s possible to search down to the document level. At the moment, there are few collections where this is possible.
“If we can obtain sufficient funding to fully digitise the CABR archive, this will be a gigantic project,” he said. “I think we’re talking about one of the largest digitisation projects in the archive sector to date.
“We should not underestimate that,” said Klijn. “But once again, I cannot wait. It’s not just my hands that are itchy, but the hands of the people in my project team as well.”