By Edward Nawotka
If you thought metadata was complicated, meet BookLamp.org, a new book discovery search engine that tracks 32,160 distinct data points per book. “We do this by taking the full text provided by a publisher in a digital format and running it through our computer,” explains CEO Aaron Stanton. “Our program breaks a book up into 100 scenes and measures the ‘DNA’ of each scene, looking for 132 different thematic ingredients, and another 2,000 variables.” A reader can go to the BookLamp site, launched in beta last week, and do a keyword search for titles whose characteristics are similar to those of a title they plug into the site. Pundits have dubbed it the “Pandora for Books,” though Stanton prefers the term “Book Genome Project.”
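BookLamp's actual pipeline is proprietary, but the process Stanton describes can be sketched in a few lines. The sketch below splits a text into 100 roughly equal "scenes" and scores each against a couple of hypothetical ingredient lexicons (the ingredient names and keyword lists are invented for illustration; the real system tracks 132 ingredients):

```python
# Illustrative sketch only, not BookLamp's code: split a book into
# 100 roughly equal "scenes" and score each scene's thematic
# ingredients by simple keyword frequency.

def split_into_scenes(text, n_scenes=100):
    """Split text into up to n_scenes roughly equal word chunks."""
    words = text.split()
    size = max(1, len(words) // n_scenes)
    chunks = [" ".join(words[i:i + size]) for i in range(0, len(words), size)]
    return chunks[:n_scenes]

# Hypothetical ingredient lexicons, invented for this example.
INGREDIENTS = {
    "religion": {"church", "priest", "faith", "god"},
    "investigation": {"police", "detective", "murder", "clue"},
}

def score_scene(scene):
    """Return each ingredient's share of the scene's words."""
    tokens = [t.strip(".,!?;:").lower() for t in scene.split()]
    total = len(tokens) or 1
    return {name: sum(t in lexicon for t in tokens) / total
            for name, lexicon in INGREDIENTS.items()}
```

Scoring every scene this way yields a per-scene profile that can be averaged into a book-level "DNA" vector.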
“Say you’re looking for a novel like The Da Vinci Code. We have found that it contains 18.6% Religion and Religious Institutions, 9.4% Police & Murder Investigation, 8.2% Art and Art Galleries, and 6.7% Secret Societies & Communities, among other elements — we’ll pull out a book with similar elements, provided it is in our database,” says Stanton.
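One plausible way to do that matching (an assumption on my part, not a description of BookLamp's actual algorithm) is to treat each book's thematic percentages as a vector and rank other books by cosine similarity. The Da Vinci Code figures below are the ones Stanton quotes; the other two books and their numbers are invented for illustration:

```python
# Illustrative sketch: rank books by cosine similarity of theme vectors.
import math

THEMES = ["religion", "investigation", "art", "secret_societies"]

books = {
    # Percentages quoted in the article; other entries are invented.
    "The Da Vinci Code": {"religion": 18.6, "investigation": 9.4,
                          "art": 8.2, "secret_societies": 6.7},
    "Thriller A": {"religion": 15.0, "investigation": 12.0,
                   "art": 5.0, "secret_societies": 8.0},
    "Romance B": {"religion": 1.0, "investigation": 0.5,
                  "art": 2.0, "secret_societies": 0.0},
}

def cosine(a, b):
    """Cosine similarity between two theme dictionaries."""
    va = [a.get(t, 0.0) for t in THEMES]
    vb = [b.get(t, 0.0) for t in THEMES]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(title):
    """Return the title of the most similar other book in the database."""
    target = books[title]
    others = [(t, cosine(target, v)) for t, v in books.items() if t != title]
    return max(others, key=lambda pair: pair[1])[0]
```

With these made-up numbers, `most_similar("The Da Vinci Code")` returns `"Thriller A"`, the book whose thematic mix is closest in proportion.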
Stanton began the BookLamp project in 2003 while a student in Boise, Idaho, when he and his roommates scanned in a copy of Richard Bachman’s Thinner — something that then took a full six hours to do — before realizing what he wanted to do was likely beyond the scope of a college student. In 2007, he thought it would be perfect for Google and managed to land a meeting (see “CanGoogleHearMe.com,” which became a viral meme in its day). Stanton then took the project to Dr. Matthew Jockers, professor of computational linguistics at Stanford University, who helped develop the protocols for BookLamp’s “contextual stylistic analysis.”
Today, BookLamp has some 20,000 texts in its database — primarily from Random House and Kensington publishers — and has amassed nearly 650 million “data points” in all. “We expect it to be in the billions in a few months,” says Stanton.
But can a computer really accurately assess the content of a book? Stanton thinks so. “Our original models are based on focus groups,” he says. “We would give them a high-density scene and a low-density scene, for example, and ask them to assess them, which gave us a basis for training the models. Then we looked at books that might exceed the models and tweaked the formulas. In this way, our algorithms are trained like a human being.”
BookLamp quantifies such elements as density, pacing, description, dialogue and motion, in addition to numerous nuanced micro-categories, such as “pistols/rifles/weapons” or “explicit depictions of intimacy” or “office environments.”
“In many ways, we’re using thematic and other ‘ingredients’ as an alternative to traditional metadata,” says Stanton, who envisions the project serving readers, writers and publishers equally.
The first iteration of BookLamp — what you currently see online — is squarely aimed at readers. Writers and publishers, on the other hand, will soon be offered the ability to upload their manuscripts to BookLamp and have their books assessed along the same criteria. These works will go into a “living database of manuscripts” — which can be used by publishers who want to seek out manuscripts with certain characteristics. “For example,” says Stanton, “say vampires are hot one year, so you turn down all the books about space aliens, but then the trend shifts to space aliens — you can search our database for manuscripts that match these emerging trends and stay ahead of the curve. For authors, a rejected book is never a rejected book, since it can always be found.”
At the moment, BookLamp’s biggest obstacle might just be the publishers and writers themselves, who may very well be reluctant to see their books converted into data points. The limited database of just 20,000 titles, Stanton acknowledges, “is by far the biggest criticism of the site.” His goal is to hit 100,000 titles by the end of the year.
Curiosity seekers can sign up and explore BookLamp now at www.booklamp.org.