Cracking open PDF extraction for the future

There’s an interesting free app on Apple’s App Store at the moment called Scanner Pro by Readdle.

Scanner Pro claims to turn your iPhone or iPad into a portable scanner, using the device’s camera to capture images that are then saved as JPEGs or PDFs.

It works quite well, although it did crash the iPad it was first tested on. It then mailed the test user (me) a blank email before finally mailing a successful PDF, plus an attached scan of my socks, which were conveniently placed behind the iPad and so in line with its camera.

OK, so this is not a scanner review. It is a prelude to mentioning what’s happening in the USA, where sources suggest that work is focused on how government agencies will be able to capture and extract PDF data into the national big data archives.

NOTE: PDFs are still popular as a means of compressing information from larger files into a consistent, read-only viewing format (possibly encrypted) that cannot be altered without leaving digital/electronic footprints.

PDFs are tough to crack

For governmental analysis purposes (and for commercial interests too), though, it is tough to perform enterprise-level ETL (Extract-Transform-Load) functions on PDFs.
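To make the ETL point concrete, here is a minimal sketch in Python using only the standard library. It assumes the text has already been pulled out of a PDF by some extractor, which is the hard part and is not shown. The sample text, the made-up figures, the column layout and the function names (extract_rows, transform, load) are all hypothetical, chosen purely for illustration.

```python
import re
import sqlite3

# Hypothetical text as it might come out of a PDF text extractor.
# The figures are invented. Columns are separated only by runs of spaces,
# because the PDF itself carries no real notion of table cells.
SAMPLE_TEXT = """\
Agency            Year    Spending
Parks             2012    1,250
Public Works      2012    18,400
"""


def extract_rows(text):
    """Extract: split each non-empty line on runs of two or more spaces."""
    return [
        re.split(r"\s{2,}", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]


def transform(rows):
    """Transform: drop the header row and convert the numeric columns."""
    _header, *body = rows
    return [
        (agency, int(year), int(spending.replace(",", "")))
        for agency, year, spending in body
    ]


def load(records, conn):
    """Load: write the cleaned records into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spending "
        "(agency TEXT, year INTEGER, amount INTEGER)"
    )
    conn.executemany("INSERT INTO spending VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract_rows(SAMPLE_TEXT)), conn)
total = conn.execute("SELECT SUM(amount) FROM spending").fetchone()[0]
print(total)
```

The fragile step is the column split: it guesses at cell boundaries from whitespace, and a real document with wrapped cells or single-space gaps would quickly break it. That guesswork is a large part of why dedicated table-extraction tools exist.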

So a new initiative to build open-source PDF conversion tools is underway courtesy of the Sunlight Foundation (sounds like a scary religious cult, but it’s not), whose PDF Liberation Hackathon is being staged stateside this January.

Dedicated to improving open source tools for PDF extraction, this hackathon will aim to bring coders together to add features, extensions and plugins to existing PDF extraction frameworks, making them more flexible, useful and sustainable.

“Sunlight’s PDF Liberation Hackathon will tackle real-world PDF data extraction problems. In doing so, we will build upon existing open-source PDF extraction solutions such as Tabula and Ashima’s PDF Table Extractor,” said Marc Joffe, founder of Public Sector Credit Solutions (PSCS), which applies open data and analytics to rating government bonds.