Google is to include
scanned documents in its search results for the first time.
"In the past, scanned documents were rarely included in search
results as we could not be sure of their content. Today, that
changes. We are now able to perform Optical Character Recognition
(OCR) on any scanned documents that we find stored in Adobe's PDF
format."
OCR technology lets Google convert a picture of a document into
the words it contains.
Google has indexed PDFs that contain digital text for some time,
but scanned documents, which are stored as images, are much harder
for a computer to read.
Scanning is the reverse of printing. Printing turns digital
words into text on paper, whilst scanning makes a digital picture
of the physical paper (and text) so you can store and view it on a
computer.
The scanned picture of the text, however, is not quite the same
as the original digital words, said Google. "Often you can see
tell-tale signs: the ring of a coffee cup, ink smudges, or even
fold creases in the pages.
"To people reading these documents, the distinction between
words and pictures of words makes little difference, but for a
computer the picture is almost unintelligible."
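
To make the distinction concrete, here is a toy sketch in Python. It is not Google's method, and real OCR systems are far more sophisticated. It assumes a made-up three-letter alphabet drawn on a 3x5 pixel grid, and every name in it (GLYPHS, scan, smudge, recognise) is hypothetical. A "scanned" page is only rows of pixels, and the letters have to be recovered by comparing each block of pixels with known letter shapes.

```python
# Toy illustration only: a hypothetical 3x5 pixel font for three letters.
GLYPHS = {
    "C": ("###", "#..", "#..", "#..", "###"),
    "O": ("###", "#.#", "#.#", "#.#", "###"),
    "R": ("##.", "#.#", "##.", "#.#", "#.#"),
}
WIDTH, HEIGHT = 3, 5


def scan(text):
    """Render text as rows of pixels, the way a scanner stores a page."""
    return ["".join(GLYPHS[ch][row] for ch in text) for row in range(HEIGHT)]


def smudge(picture, row, col):
    """Return a copy of the picture with one pixel flipped, like an ink smudge."""
    line = picture[row]
    flipped = "." if line[col] == "#" else "#"
    new_line = line[:col] + flipped + line[col + 1:]
    return picture[:row] + [new_line] + picture[row + 1:]


def distance(cell, glyph):
    """Count the pixels in which two equally sized bitmaps differ."""
    return sum(a != b for r1, r2 in zip(cell, glyph) for a, b in zip(r1, r2))


def recognise(picture):
    """Cut the picture into letter-sized cells and pick the closest known letter."""
    letters = []
    for start in range(0, len(picture[0]), WIDTH):
        cell = [line[start:start + WIDTH] for line in picture]
        letters.append(min(GLYPHS, key=lambda ch: distance(cell, GLYPHS[ch])))
    return "".join(letters)


page = smudge(scan("OCR"), 2, 1)  # put a smudge inside the "O"
print(page == scan("OCR"))        # compares the smudged pixels with a clean scan
print(recognise(page))            # recovers the letters from the pixels
```

The sketch matches each cell to the nearest known shape rather than demanding an exact match, which is why a single smudged pixel does not stop the letter from being identified.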