It’s Okay, You Use OCR…But Tell Ebook Buyers

In Digital by Edward Nawotka

As error-prone as digital conversion services are, could the book world benefit by offering consumers more sales information about how an ebook is produced?

By Benjamin Denckla

You deserve to know, before purchase, whether an ebook was made using OCR (optical character recognition).

One might argue that this is unnecessary because you can simply return the ebook if you discover it to be unsatisfactory, due to OCR or other reasons.

I find the option of return to be insufficient. First of all, the window of time in which the ebook can be returned may have expired. For example, for Kindle, this window is only 7 days. Second, even if you are within the return window, this is a frustrating process. Consider the following scenario. You buy the ebook, find it to be full of OCR errors, return it, order the paper book, wait for it to be delivered, and only then can you resume reading.

Another problem with the non-solution of return is, what if the ebook came from a subscription service like Amazon Prime or Kindle Unlimited? In that case, there is no notion of return. Your only option, as far as I know, is to move yourself, mentally, one step closer to canceling your subscription, since you’ve just discovered that the subscription you paid for is worth a little less than you thought. That’s frustrating, and if you do end up being tipped over the edge into canceling, that’s frustrating, too: you’ll probably end up feeling that, even though you had to cancel, you were throwing the baby out with the bathwater.

Before subscribing, you probably took a look at what titles were available as part of the subscription. Don’t you wish you could have also gotten a sense of the quality of conversion of those titles, too?

I admit that the use of OCR is only a weak indicator of quality. Some books are carefully corrected after OCR, and some conversions that don’t use OCR still manage to be botched in other creative ways.

But, at least you know that if OCR was not used, certain types of errors are pretty much impossible. And, though the use of OCR is only a weak indicator of quality, it has the advantage of being an objective indicator of quality.

As for ebooks that do use OCR, perhaps they could offer other indicators of quality, such as what OCR engine was used, or how many person-hours were spent proofreading the OCR output. (Many customers would be shocked but thankful to learn, before purchase, that this latter number is often close to zero.)

If at this point you’re thinking that it is far-fetched to imagine consumers who are interested in such details, let me provide a few analogies from existing practices in the publishing of books, ebooks and music.

In books, a colophon often presents relatively arcane information about the production process, such as the percentage of post-consumer recycled content in the paper, whether the paper is acid-free, and the fonts used within. As a side note, this colophon information is usually repeated in ebooks, which I find to be humorous but inappropriate. An ebook should not attempt to be a record of the contents of the paper book. It is an independent edition of the same work that the paper book represents. Thus, at most, the paper book’s colophon should appear qualified by phrases such as “The 2011 paper edition of this book was printed on…”

In ebooks, various pre-purchase information is already available. For example, it is typically known, pre-purchase, whether an ebook’s text is reflowable or, instead, appears on fixed-layout pages. And, for Kindle, it is known, pre-purchase, whether various features are enabled, such as Text-to-Speech, X-Ray, Word Wise, and Lending.

In music, readers of a certain age may remember that compact discs were often annotated with three-letter codes like “AAD,” “ADD,” and “DDD.” These are called SPARS codes. They described whether analog (A) or digital (D) equipment was used in the three major phases of production: recording, mixing, and mastering. The merits of SPARS codes were hotly debated, and the underlying issues of analog vs. digital are still hotly debated in the music world.

In the book world, though, I think something like a SPARS code is needed, and should be far less contentious. Text, unlike music, was “born digital.” Text was digital before we knew what digital was. It consists, by its nature, of a stream of symbols drawn from a finite set. Thus any process that treats text as an image, such as scanning a paper book and running OCR on it, is, objectively, running the risk of distorting the book’s content. As such, consumers should be made aware when such processes were used.

To put it succinctly, if cryptically: an ebook made with OCR is like a DAD CD in a world where everyone agreed that analog was worse than digital.

Let’s move away from dubious analogies back to the matter at hand. Though my original suggestion was just to provide an “OCR bit,” i.e. a simple indication of whether OCR was used, I’ve already expanded that suggestion to include an option for publishers using OCR to “defend” or distinguish themselves by also saying what OCR engine was used (OmniPage, ABBYY, etc.) and how many person-hours were invested in proofreading the OCR output. In addition (or instead), the conversion house could be named explicitly, the theory being that reputations could be built up, allowing consumers to make judgments based on these reputations rather than technical details of production.
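
To make the proposal concrete, here is a minimal sketch of what such a disclosure record might look like. It is only an illustration: the class name, field names, and label wording are all hypothetical and do not correspond to any existing store or metadata standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProductionDisclosure:
    """Hypothetical pre-purchase production record for one ebook."""
    ocr_used: bool                              # the "OCR bit"
    ocr_engine: Optional[str] = None            # e.g. "ABBYY"; only shown if OCR was used
    proofreading_hours: Optional[float] = None  # person-hours spent proofreading OCR output
    conversion_house: Optional[str] = None      # who did the conversion, if named

    def summary(self) -> str:
        """Render a one-line label that a store page could display."""
        if not self.ocr_used:
            parts = ["Made without OCR"]
        else:
            parts = ["Made with OCR"]
            if self.ocr_engine is not None:
                parts.append(f"engine: {self.ocr_engine}")
            if self.proofreading_hours is not None:
                parts.append(f"proofreading: {self.proofreading_hours:g} person-hours")
        if self.conversion_house is not None:
            parts.append(f"converted by: {self.conversion_house}")
        return "; ".join(parts)


# Example: an OCR'ed ebook whose output was never proofread.
print(ProductionDisclosure(ocr_used=True, ocr_engine="ABBYY", proofreading_hours=0).summary())
```

Only the first field is required, which mirrors the suggestion above: the OCR bit is the minimum, and everything else is an optional way for a publisher to distinguish itself.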

A process closely related to OCR that should also be disclosed to the consumer is Adobe ClearScan or similar. The use of a ClearScan-like process represents an interesting compromise between a scanned and an OCR’ed representation of a book. It decomposes the book into tiny scans of what it thinks (often correctly) are letters. It then stores the book as a collection of these tiny scans, packaged as a custom font.

Though technically impressive, and a good compromise in certain cases, it is particularly important that a ClearScan-like process be disclosed to the consumer because its errors, though they may be widespread, are often not immediately visible to the consumer. Its errors are often not visible until a feature is used that needs the underlying text, such as the following Kindle features:

  • copy & paste
  • search
  • display of highlighted sections
  • dictionary lookup
  • Wikipedia lookup

So, unlike OCR, this is not only a process but also a different format for the ebook. In a way this makes the argument for disclosing it stronger than for a pure process like OCR, since in some sense it really does constitute a different product that the customer is getting, more analogous to disclosing whether a book is reflowable vs. fixed-layout.

Having presented some arguments for disclosing the use of OCR and other related production information, we now turn to the question: who should provide this information?

Fans of crowdsourcing might say that existing review systems already provide this information, indirectly. Any information that you shouldn’t have to discover yourself, post-purchase, should be discoverable, pre-purchase, by carefully reading reviews. In other words, if a book has quality problems due to OCR or anything else, won’t that show up in its reviews, e.g. on Amazon?

My answer is no. Or, it may show up in the reviews, but the customer should not have to scour the reviews for such information.

I think publishers should provide this information, and distributors like Amazon should demand it, or at least give publishers incentives to provide it.

Publishers who are doing (or contracting) good work should have nothing to fear. Far from it, they should want to be given a standardized way to advertise that their work distinguishes them from the standard low quality of ebooks. (I know I seem to be proposing a kind of “reverse Lake Wobegon effect” for ebooks, where the average quality is below average, but if one’s reference point is the quality of paper books, things begin to make sense.)

Distributors should benefit from giving their customers the information they need to make informed decisions that result in less post-purchase dissatisfaction.

And of course I think customers would benefit from having more information about what they are considering purchasing.

Benjamin Denckla is an independent software engineer specializing in ebook creation without OCR. His particular focus has been ebooks containing Biblical Hebrew. Past phases of his software career have included realtime digital audio processing, build and test systems, and LAMP web sites.

About the Author

Edward Nawotka

A widely published critic and essayist, Edward Nawotka serves as a speaker, educator and consultant for institutions and businesses involved in the global publishing and content industries. He was also editor-in-chief of Publishing Perspectives from the launch of the publication in 2009 until January 2016.