By Edward Nawotka

How BookLamp ingests a book
If you thought metadata was complicated, meet BookLamp.org, a new book discovery search engine that tracks 32,160 distinct data points per book. “We do this by taking the full text provided by a publisher in a digital format and running it through our computer,” explains CEO Aaron Stanton. “Our program breaks a book up into 100 scenes and measures the ‘DNA’ of each scene, looking for 132 different thematic ingredients, and another 2,000 variables.” A reader can go to the BookLamp site, which launched in beta last week, and run a keyword search for titles similar to one they plug into the site. Pundits have dubbed it the “Pandora for Books,” though Stanton prefers the term “Book Genome Project.”
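To make the scene-by-scene idea concrete, here is a minimal Python sketch. It is not BookLamp's method, which the article does not detail: it assumes scenes can be approximated by equal-sized word chunks and that a theme can be measured as the share of a scene's words found in a small word list. The names `THEME_WORDS`, `split_into_scenes`, `scene_profile` and `book_profile` are hypothetical, as are the two tiny word lists.

```python
import re

# Hypothetical theme word lists, for illustration only. The article does not
# say how BookLamp defines its 132 thematic "ingredients".
THEME_WORDS = {
    "religion": {"church", "priest", "altar"},
    "police": {"detective", "police", "murder"},
}


def split_into_scenes(text, n_scenes=100):
    """Split text into at most n_scenes word chunks of roughly equal size.

    Equal-sized chunks are an assumption standing in for real scene detection.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return []
    size = max(1, -(-len(words) // n_scenes))  # ceiling division
    return [words[i:i + size] for i in range(0, len(words), size)]


def scene_profile(scene_words):
    """Return, per theme, the share of the scene's words found in its word list."""
    total = len(scene_words)
    return {
        theme: sum(1 for word in scene_words if word in vocab) / total
        for theme, vocab in THEME_WORDS.items()
    }


def book_profile(text, n_scenes=100):
    """Return one theme profile per scene."""
    return [scene_profile(scene) for scene in split_into_scenes(text, n_scenes)]
```

Calling `book_profile(full_text)` would yield up to 100 small dictionaries, one per chunk. A real system would presumably need far richer features than word counts.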
“Say you’re looking for a novel like The Da Vinci Code. We have found that it contains 18.6% Religion and Religious Institutions, 9.4% Police & Murder Investigation, 8.2% Art and Art Galleries, and 6.7% Secret Societies & Communities, among other elements. We’ll pull out a book with similar elements, provided it is in our database,” says Stanton.
Stanton began the BookLamp project in 2003 while a student in Boise, Idaho, when he and his roommates scanned in a copy of Richard Bachman’s Thinner, something that then took a full six hours to do, before realizing that what he wanted to do was likely beyond the scope of a college student. In 2007, he thought it would be perfect for Google and managed to land a meeting (see “CanGoogleHearMe.com,” which became a viral meme in its day). Stanton then took the project to Dr. Matthew Jockers, professor of computational linguistics at Stanford University, who helped develop the protocols for BookLamp’s “contextual stylistic analysis.”
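One common way to find "a book with similar elements" is to treat each book's thematic percentages as a vector and rank other books by cosine similarity. The Python sketch below illustrates that general technique and is not a description of BookLamp's actual matching. The Da Vinci Code figures are the ones Stanton quotes; the two catalog titles and all of their numbers are invented for illustration, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math

# Thematic shares quoted in the article for The Da Vinci Code.
DA_VINCI = {
    "religion": 18.6,
    "police_investigation": 9.4,
    "art": 8.2,
    "secret_societies": 6.7,
}

# Invented catalog entries, for illustration only.
CATALOG = {
    "Hypothetical Thriller A": {
        "religion": 15.0,
        "police_investigation": 12.0,
        "art": 5.0,
        "secret_societies": 8.0,
    },
    "Hypothetical Romance B": {"romance": 30.0, "office_environments": 10.0},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse theme profiles (dicts of shares)."""
    dot = sum(a.get(theme, 0.0) * b.get(theme, 0.0) for theme in set(a) | set(b))
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, catalog, top_n=3):
    """Return up to top_n (title, score) pairs, best match first."""
    scored = [(title, cosine_similarity(query, profile)) for title, profile in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

With this made-up catalog, `most_similar(DA_VINCI, CATALOG)` ranks the thriller first, and the romance scores zero because it shares no themes with the query. As Stanton's caveat suggests, the ranking can only ever be as good as the catalog behind it.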
Today, BookLamp has some 20,000 texts in its database — primarily from Random House and Kensington publishers — and has amassed nearly 650 million “data points” in all. “We expect it to be in the billions in a few months,” says Stanton.

BookLamp's assessment of Stieg Larsson
But can a computer really accurately assess the content of a book? Stanton thinks so. “Our original models are based on focus groups,” he says. “We would give them a highly dense scene and a low density scene, for example, and ask them to assess them, which gave us a basis for training the models. Then we looked at books that might exceed the models and tweaked the formulas. In this way, our algorithms are trained like a human being.”
BookLamp quantifies such elements as density, pacing, description, dialogue and motion, in addition to numerous nuanced micro-categories, such as “pistols/rifles/weapons” or “explicit depictions of intimacy” or “office environments.”
“In many ways, [we are] using thematic and other ‘ingredients’ as an alternative to traditional metadata,” says Stanton, who envisions the project serving readers, writers and publishers equally.
The first iteration of BookLamp — what you currently see online — is squarely aimed at readers. Writers and publishers, on the other hand, will soon be offered the ability to upload their manuscripts to BookLamp and have their books assessed along the same criteria. These works will go into a “living database of manuscripts” — which can be used by publishers who want to seek out manuscripts with certain characteristics. “For example,” says Stanton, “say vampires are hot one year, so you turn down all the books about space aliens, but then the trend shifts to space aliens — you can search our database for manuscripts that match these emerging trends and stay ahead of the curve. For authors, a rejected book is never a rejected book, since it can always be found.”
At the moment, BookLamp’s biggest obstacle might just be the publishers and writers themselves, who may very well be reluctant to see their books converted into data points. Stanton acknowledges that the limited database of just 20,000 titles “is by far the biggest criticism of the site.” His goal is to hit 100,000 titles by the end of the year.
Curiosity seekers can sign up and explore BookLamp now at www.booklamp.org.
4 Comments
Amazing! I just posted about BookLamp on my blog and here you have an article on it! An excellent article, btw, and I’m fully with you on this.
I would trust a computer to give me a range of choices – then of course, I would reserve the right for myself to pick and choose. Trouble is: the choice is likely to be as good (or as bad) as the BookLamp database, which, at this point, with only 20,000 books, is way too narrow to allow for identification of appropriate options every time.
The other point that bothers me is why BookLamp is focused on getting books from publishers only. With the ongoing digital revolution, the stigma attached to self-publication has been lifted (see John Locke, Amanda Hocking, J.A. Konrath et al). Surely there are all sorts of Indies and self-pubbed authors they could turn to, making sure of course that they get the kind of books they need to fill the “holes” in their system…
Another great story from Publishing Perspectives.
Product discovery certainly remains one of the largest challenges, and BookLamp’s attempt to develop machine-based categorization is another step toward solving it. A small wrinkle worth mentioning is that Pandora employs human music analysts to rate pieces across the music genome dimensions, and that human-based approach has outperformed machine learning techniques in music. Netflix is on a similar path in the film realm, having created numerous attributes to classify movies, but it relies largely on its user base to actually classify the films.
This is very interesting technically. Sure, for some time we’ve had computer-based lexical analysis which can describe a particular author’s style, but I find this new avenue somewhat worrying.
Here at Just the Right Book, we love that the literary world is giving this some thought. We think about it every day. A lot. Our answer, which seems to work for a lot of readers, is to begin with a curated list of fiction (no, there is no DaVinci Code). We match the books to the reader according to their answers to a ten-question quiz. Each reader gets three suggested books that are just right for them based on their answers to the quiz. The beautiful thing is that you’ll get different recommendations if you’re in a new mood tomorrow and change your answers a little. We use a human algorithm that codes books based on metrics developed by a group of booksellers who have worked with readers for decades. It works, and we’re constantly refining it. Try it: www.JusttheRightBook.com/quiz.