PDF Documents, OCR, and Conspiracy Theories

You don’t need OCR to get those dreaded ‘layers’
Technology

Let’s get all tech-nerdish for a minute, because I’ve now seen an inaccurate statement reported several times about the latest inane birth certificate conspiracy theory, most recently at TPMDC, in “With Drudge Report’s Help, Birthers Latch Onto Phony Forgery Theory”:

In fact, the effect was not a sign of foul play at all, but a common attribute of PDF files containing text as an image. On many PDFs, a feature called OCR (optical character recognition) recognizes the letters in the image and separates them into their own layer. This explains why you’re able to highlight and copy raw text from some PDF files even though it’s actually not a word processing document.

As I pointed out yesterday, the OCR setting in Adobe Acrobat is actually irrelevant to this issue; OCR (Optical Character Recognition) has nothing to do with the “layers” you see if you open a PDF file with Adobe Illustrator. Even if you scan a document with OCR turned off (which is the case with the birth certificate PDF released by the White House), these “layers” are still created.

In Portable Document Format, they’re not actually “layers” at all. They’re a result of the method Adobe Acrobat uses to compress and optimize scanned images.

When a PDF document is created from a scanner (even with OCR turned off), areas that contain text are recognized, isolated, and compressed differently than background patterns, lines, and other elements, because different compression algorithms work best for these different types of graphics. When the resulting PDF file is opened with Adobe Illustrator, these elements are interpreted as “layers,” but in terms of the PDF file they’re really not like Illustrator layers at all. The reason for breaking down the image in this fashion is to yield the smallest, most efficient PDF file.
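For the code-minded, here is a minimal Python sketch of the general idea described above: split a scanned page into a 1-bit "text" mask and a text-free background, then compress each piece separately. This is a toy illustration, not Adobe Acrobat's actual algorithm. The function names (`split_page`, `pack_bits`, `downsample`, `encode_page`, `make_demo_page`), the fixed threshold, and the use of `zlib` as a stand-in for both compressors are all assumptions made for the example.

```python
import zlib


def make_demo_page(width=16, height=16):
    """Build a tiny grayscale 'scan': a light pattern plus one dark stroke."""
    page = [[200 + ((x + y) % 3) * 10 for x in range(width)] for y in range(height)]
    for x in range(4, 12):
        page[8][x] = 20  # dark horizontal stroke standing in for text
    return page


def split_page(pixels, threshold=96):
    """Split a page into a 0/1 text mask and a background with the text filled in.

    The fixed threshold is a hypothetical simplification; real segmentation
    is far more sophisticated.
    """
    mask = [[1 if p < threshold else 0 for p in row] for row in pixels]
    background = []
    for row, mask_row in zip(pixels, mask):
        light = [p for p, m in zip(row, mask_row) if not m]
        fill = sum(light) // len(light) if light else 255
        # Paint over text pixels with the row's average background shade.
        background.append([fill if m else p for p, m in zip(row, mask_row)])
    return mask, background


def pack_bits(mask):
    """Pack the 0/1 mask into bytes, 8 pixels per byte, padding each row."""
    out = bytearray()
    for row in mask:
        for i in range(0, len(row), 8):
            chunk = row[i:i + 8]
            byte = 0
            for bit in chunk:
                byte = (byte << 1) | bit
            out.append(byte << (8 - len(chunk)))
    return bytes(out)


def downsample(image, factor=2):
    """Keep every `factor`-th pixel; the background tolerates lower resolution."""
    return [row[::factor] for row in image[::factor]]


def encode_page(pixels):
    """Return separately compressed pieces, as stand-ins for PDF image objects."""
    mask, background = split_page(pixels)
    small = downsample(background)
    return {
        # zlib stands in for a bilevel (1-bit) image codec here.
        "text_mask": zlib.compress(pack_bits(mask), 9),
        # zlib also stands in for a continuous-tone image codec here.
        "background": zlib.compress(bytes(p for row in small for p in row), 9),
    }


if __name__ == "__main__":
    for name, data in encode_page(make_demo_page()).items():
        print(name, len(data), "bytes")
```

Note that the output is two separate image objects, and neither contains any searchable text. A program that displays each embedded object on its own would show them as separate pieces, with no OCR involved at any point.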

And that’s why, in the White House’s PDF file, the “text” elements are separated (imperfectly) from the background pattern, but remain un-searchable images, not text.

The key point: the layers will still exist, even in documents that don’t use the OCR feature or don’t contain a black President’s birth certificate.
