Literature Is Not Data: Against Digital Humanities
BIG DATA IS COMING for your books. It’s already come for everything else. All human endeavor has by now generated its own monadic mass of data, and through these vast accumulations of ciphers the robots now endlessly scour for significance much the way cockroaches scour for nutrition in the enormous bat dung piles hiding in Bornean caves. The recent Automate This, a smart book with a stupid title, offers a fascinatingly general look at the new algorithmic culture: 60 percent of trades on the stock market today take place with virtually no human oversight. Artificial intelligence has already changed health care and pop music, baseball, electoral politics, and several aspects of the law. And now, as an afterthought to an afterthought, the algorithms have arrived at literature, like an army which, having conquered Italy, turns its attention to San Marino.
The story of how literature became data in the first place is a story of several, related intellectual failures.
In 2002, on a Friday, Larry Page began to end the book as we know it. Using the 20 percent of his time that Google then allotted to its engineers for personal projects, Page and Vice-President Marissa Mayer developed a machine for turning books into data. The original was a crude plywood affair with simple clamps, a metronome, a scanner, and a blade for cutting the books into sheets. The process took 40 minutes. The first refinement Page developed was a means of digitizing books without cutting off their spines — a gesture of tender-hearted sentimentality towards print. The great disbinding was to be metaphorical rather than literal. A team of Page-supervised engineers developed an infrared camera that took into account the curvature of pages around the spine. They resurrected a long dormant piece of Optical Character Recognition software from Hewlett-Packard and released it to the open-source community for improvements. They then crowd-sourced textual correction at a minimal cost through a brilliant program called reCAPTCHA, which employs an anti-bot service to get users to read and type in words the Optical Character Recognition software can’t recognize. (A miracle of cleverness: everyone who has entered a security identification has also, without knowing it, aided the perfection of the world’s texts.) Soon after, the world’s five largest libraries signed on as partners. And, more or less just like that, literature became data.