Dutch group works on digitising collaborator archives created in aftermath of World War 2

Experiments have been carried out to ascertain the best technology to open up the analogue archives of the Central Archive of Special Justice in the Netherlands

Kim Loohuis

Published: 01 Oct 2019 16:45

Triado, a Dutch project to digitise tribunal archives, set out to move from kilometres of archive boxes in a depot to a search engine through which users can find any document.

The Triado project started three years ago, with the aim of identifying which computer technology could best be applied to historical collections to digitise and unlock them.

“Of course, we are not the first to work on this, but we wanted to find out what we could do with the available technology, especially for the collection of the Central Archive of Special Justice [CABR] of the Nationaal Archief [National Archive],” said project leader Edwin Klijn.

This archive was created shortly after the Second World War, and contains files of 300,000 people who were suspected of collaborating with the German occupiers in the war – “a perpetrator archive, as it is called in professional jargon”.

Moreover, it is probably the most important archive about the Second World War in the Netherlands.

“We wanted to see if we could make 13 metres of the archive – which largely contains typed documents, but also handwritten papers – machine-readable, what the margin of error would be and what we could then do with the data,” said Klijn.

Software packages

To digitise the documents, the research group experimented with two well-known software packages: ABBYY FineReader and Tesseract. “ABBYY is actually just optical character recognition [OCR], but Tesseract already has more machine learning,” he said.

In addition to digitising, another goal was to be able to access the files and make them searchable at the document level.

“We tried to extract names of people, place names and dates from the machine-readable text using named entity recognition,” said Klijn.

“We then linked this data to other data sources, such as the database for victims of persecution (NDVS) that was created by Remembrance Centre Camp Westerbork and the Jewish Cultural Quarter.”
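The article does not describe how extracted names were matched against the other data sources. As a minimal sketch of one possible approach, the Python below compares a name taken from OCR text with a list of candidate names using approximate string matching from the standard library. The function names, the 0.85 threshold and the sample names are all hypothetical and chosen only for illustration; they are not taken from Triado or from any real record.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def best_match(name: str, candidates: list[str], threshold: float = 0.85):
    """Return (candidate, score) for the closest candidate, or None.

    The score is difflib's similarity ratio between 0 and 1. The
    threshold is an assumed cut-off, not a value from the project.
    """
    best, best_score = None, 0.0
    for candidate in candidates:
        score = SequenceMatcher(None, normalise(name), normalise(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best is not None and best_score >= threshold:
        return best, best_score
    return None


# Made-up names; "Vrles" imitates an OCR misreading of "Vries".
register = ["Jan de Vries", "Piet Jansen"]
print(best_match("Jan de Vrles", register))
```

A tolerant comparison of this kind matters because OCR output contains misread characters, so exact matching would miss many links. In practice, a match on a name alone would need further checks, such as dates or places, before two records are treated as the same person.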

Auto classification

In addition to enrichment, experiments were also carried out to enable the computer to recognise certain types of documents. The CABR contains many standard, predictable documents, such as membership cards, states of intelligence and official reports, so a system can be trained to recognise them.

“We had a score of 80% right and 20% wrong,” said Klijn. “Also, there is a lot of room for improvement in follow-up projects. I have to say that the learning curve was still rising at the end of the project.”

The advantage of this auto-classification is that specific types of documents can be retrieved at the push of a button from an archive that is 4km long in total.

Major challenges

“At the end of the project, we made a prototype, an internal website where we can search the 13-metre archive that we have digitised,” said Klijn.

Public access to the archive’s material is restricted, so the prototype could not be made accessible from outside. But it showed what is possible in practice with digitised historical text collections and archives.

“I’m itching to continue with new technologies and the rest of the archive, especially to reduce the margin of error more and more,” said Klijn. But digitising the entire archive also poses difficulties, such as privacy and ethical issues.

“It’s an incredibly sensitive archive, not only because of the names it contains, but also because of its character: a ‘perpetrator archive’. Apart from that, I suspect another major challenge lies in technology and infrastructure. Digitising the entire archive would take about six years.

“That’s why it’s essential to take into account the progressive development of technology and why an infrastructure should be put in place that makes it possible to continuously innovate,” said Klijn. “Because everything we do tomorrow will be outdated the day after tomorrow.”

For example, Transkribus is now also available. This software is trained on a sample of handwritten documents and then learns to read the rest.

“Unfortunately, in 2016, when we started the project, the software was not yet fully developed, so we didn’t work with it,” he said. “We opted for off-the-shelf products and measured the margin of error. In the case of the typed material, it turned out to be 15%. So there is still room for improvement.

“And if we were to start working with Transkribus now, that margin of error would probably also decrease considerably.”
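The article does not say how the margin of error was measured. One common measure for OCR output is the character error rate: the number of character insertions, deletions and substitutions needed to turn the OCR text into a hand-checked reference transcription, divided by the length of that reference. The Python below is a minimal sketch under that assumption, not the Triado team's method; the function names and sample strings are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, ocr_output) / len(reference)


# Hypothetical sample: the OCR engine misreads the final "l" as "1".
rate = character_error_rate("proces-verbaal", "proces-verbaa1")
print(f"{rate:.1%}")
```

Measured this way, a 15% error rate would mean roughly one character in seven needs correcting, which is enough to hide many names and dates from a full-text search.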

Rewriting war books

It’s important that archives of this kind are made available, for example, for humanities research. Family members, as well as historians, will be able to use this data to put new questions to the archive or to test old ones.

It’s an enormous amount of new data from an original source, in which every word of the text can be searched.

“I suspect that when this data becomes available, many books about the war can be rewritten,” said Klijn. “It also means researchers, scientists and historians have to adapt their toolbox. There is not much data available yet, but as soon as these kinds of collections become available digitally – you can do research in a new way.”

Triado shows that new technology has much to offer in digitising and unlocking archives. This is revolutionary in the Dutch archive sector, because it means it’s possible to search down to the document level. At the moment, there are few collections where this is possible.

“If we can obtain sufficient funding to fully digitise the CABR archive, this will be a gigantic project,” he said. “I think we’re talking about one of the largest digitisation projects in the archive sector to date.

“We should not underestimate that,” said Klijn. “But once again, I cannot wait. It’s not just my hands that are itchy, but the hands of the people in my project team as well.”
