PDF Documents, OCR, and Conspiracy Theories
Let’s get all tech-nerdish for a minute, because I’ve now seen an inaccurate statement about the latest inane birth certificate conspiracy theory reported several times, most recently at TPMDC, in “With Drudge Report’s Help, Birthers Latch Onto Phony Forgery Theory”:
“In fact, the effect was not a sign of foul play at all, but a common attribute of PDF files containing text as an image. On many PDFs, a feature called OCR (optical character recognition) recognizes the letters in the image and separates them into their own layer. This explains why you’re able to highlight and copy raw text from some PDF files even though it’s actually not a word processing document.”
As I pointed out yesterday, the OCR setting in Adobe Acrobat is actually irrelevant to this issue; OCR (Optical Character Recognition) has nothing to do with the “layers” you see if you open a PDF file with Adobe Illustrator. Even if you scan a document with OCR turned off (which is the case with the birth certificate PDF released by the White House), these “layers” are still created.
In Portable Document Format, they’re not actually “layers” at all. They’re a result of the method Adobe Acrobat uses to compress and optimize scanned images.
When a PDF document is created from a scanner (even with OCR turned off), areas that contain text are recognized, isolated, and compressed differently than background patterns, lines, and other elements, because different compression algorithms work best for these different types of graphics. When the resulting PDF file is opened with Adobe Illustrator, these elements are interpreted as “layers,” but in terms of the PDF file they’re really not like Illustrator layers at all. The reason for breaking down the image in this fashion is to yield the smallest, most efficient PDF file.
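To make the idea concrete, here is a toy sketch in Python of that kind of split. It is not Acrobat's actual algorithm, which isn't spelled out here; the function name `split_text_and_background`, the `threshold` and `fill` parameters, and the tiny sample "page" are all made up for illustration. Dark pixels go into a 1-bit mask, and everything else stays in a background image with the text punched out.

```python
def split_text_and_background(pixels, threshold=64, fill=255):
    """Split a grayscale page into a 1-bit text mask and a background image.

    pixels: list of rows, each a list of ints 0..255 (0 = black).
    Anything darker than `threshold` is treated as text. That is a crude
    stand-in (an assumption) for whatever segmentation a real scanner
    pipeline performs.
    """
    # 1 where the pixel looks like text, 0 everywhere else.
    mask = [[1 if p < threshold else 0 for p in row] for row in pixels]
    # Same page with the text pixels painted over with `fill`.
    background = [[fill if p < threshold else p for p in row] for row in pixels]
    return mask, background


# Hypothetical 3x4 scan: light patterned paper with a few dark "ink" pixels.
page = [
    [200, 210, 200, 210],
    [200,  10,  12, 210],
    [205,  11, 200, 210],
]

mask, background = split_text_and_background(page)
for row in mask:
    print(row)
```

Once the page is split like this, each piece can be handed to whichever compressor suits it, and the two pieces are stored as separate images. Note that no character recognition happens anywhere in this step: the mask is still just pixels.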
And that’s why, in the White House’s PDF file, the “text” elements are separated (imperfectly) from the background pattern, but remain un-searchable images, not text.
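If you want to see this for yourself without Illustrator, a rough Python sketch like the one below tallies the image objects in a PDF by the compression filter each one declares. It is a heuristic that scans the raw bytes rather than a real PDF parser, the function name `image_filters` is my own, and the sample bytes are a fabricated stand-in for a scanned file, not data from the White House PDF.

```python
import re
from collections import Counter


def image_filters(pdf_bytes):
    """Count image XObjects in raw PDF bytes, grouped by declared /Filter.

    Heuristic only: splits on the 'endobj' keyword and inspects the
    dictionary that precedes each 'stream' keyword. A real parser would
    be needed for anything beyond a quick look.
    """
    counts = Counter()
    for chunk in pdf_bytes.split(b"endobj"):
        # Keep only the object dictionary, not the binary stream data.
        header = chunk.split(b"stream", 1)[0]
        if not re.search(rb"/Subtype\s*/Image", header):
            continue
        # /Filter is either a single name or an array of names.
        m = re.search(rb"/Filter\s*(\[[^\]]*\]|/[A-Za-z0-9]+)", header)
        names = re.findall(rb"/([A-Za-z0-9]+)", m.group(1)) if m else []
        counts[b"+".join(names).decode("ascii") or "none"] += 1
    return counts


# Fabricated example: one JPEG-compressed image plus two 1-bit masks.
sample = (
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /XObject /Subtype /Image /Filter /DCTDecode >>\n"
    b"stream\n...\nendstream\nendobj\n"
    b"3 0 obj\n<< /Type /XObject /Subtype /Image /ImageMask true "
    b"/Filter /JBIG2Decode >>\nstream\n...\nendstream\nendobj\n"
    b"4 0 obj\n<< /Type /XObject /Subtype /Image /ImageMask true "
    b"/Filter [/JBIG2Decode] >>\nstream\n...\nendstream\nendobj\n"
)

counts = image_filters(sample)
print(dict(counts))
```

To try it on a real file, pass in the file's bytes, for example `image_filters(open("scan.pdf", "rb").read())`, where `scan.pdf` is a placeholder name. A scan that was split up for compression will typically report several image objects with more than one filter type, and none of them is text.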
The key point: the layers will still exist, even in documents that don’t use the OCR feature or don’t contain a black President’s birth certificate.