The State of Jewish Digital Publishing (Part 1)


In the first of a two-part article, Ben Denckla outlines the challenges of creating reflowable digital texts that incorporate the difficult elements present in a wide variety of Jewish books.

By Benjamin Denckla

Jewish publishing has a glorious history. If we give the words “Jewish” and “publishing” broad meanings, the greatest success in Jewish publishing came very early: the canonization and redaction of the Hebrew Scriptures (Tanakh). If we count this as publishing, it may well be the all-time greatest success in any type of publishing, Jewish or otherwise.

It is less of a stretch to point to something like the Bomberg Talmud as an early and glorious example of Jewish publishing. But Bomberg’s Christianity brings up the question: what is Jewish publishing? Is it publishing by Jews? For Jews? Of Jewish authors? On Jewish themes? For our purposes, we’ll just say that it can be any of the above.

Despite this glorious history, Jewish digital publishing is, as of yet, nothing to be particularly proud of. It has done no worse and no better than digital publishing in general. And digital publishing in general is, as of yet, nothing to be proud of.

Before I proceed to review the state of Jewish digital publishing, a definition and a caveat are in order. First, a definition: by digital publishing I mean “stays digital” publishing. I mean things like ebooks, apps, and web sites, not paper books that just happen to be produced using digital tools. Within “stays digital” publishing, I’ll further narrow my focus to reflowable publishing, that is, I’ll largely ignore publishing for media like fixed-layout EPUB and Adobe PDF. Such media certainly have their place, but their mimicry of paper makes them less challenging, and therefore less interesting to discuss. I tend to think of such media as only “half digital.” This is not meant as an insult, but rather as a thought-provoking (and perhaps oxymoronic) neologism.

Second, a caveat: though above I give a broad definition to the Jewish in Jewish publishing, my knowledge is somewhat narrow. I mainly know about publishing for North American, English-speaking audiences who would identify themselves as non-Orthodox Jews. The largest denominations within this loose group are Reform and Conservative Judaism. Notably, this excludes publishing for Israeli audiences. It also excludes publishing for audiences who would identify themselves as part of that large mishmash we often reify by calling it Orthodox Judaism. I don’t mean to be divisive in making these distinctions. Rather, I make these distinctions in order to be respectful of these other Jewish publishing sub-cultures. Admission of ignorance is an important form of respect.

The Hebrew alphabet

Jewish digital publishing has some particular challenges. The most obvious challenge is that many Jewish books use the Hebrew alphabet. This includes not only books that use the Hebrew alphabet (alef-bet) throughout, but also the many books that sprinkle Hebrew-alphabet words inside a dominantly Latin-alphabet context like English.

In a world biased toward Latin-alphabet cultures, any excursion from the Latin alphabet has its challenges. Indeed, in a world biased toward English-speaking cultures, any excursion from ASCII, such as the use of diacritics, has its challenges!

Not all excursions are the same, though. While not as exotic as Asian or Indic scripts, the Hebrew alphabet does not fit the Latin mold as well as, for example, Cyrillic does.

Most publishers avoid the challenges of the Hebrew alphabet in digital media. The main ways they avoid it are as follows.

  • Don’t release digital editions of books with Hebrew in them.
  • Represent Hebrew as images rather than text. “Half digital”?
  • Use technology like Adobe ClearScan to represent Hebrew (and possibly Latin-alphabet text, too) in a custom, non-Unicode font derived from scanned images of the book. “Half digital”?

Over the course of the next few sections I’ll describe the challenges that drive publishers to these extreme (and extremely unsatisfying) measures.

Before moving on I should note that historically, Bible software has provided the only reliable platform for the display of Hebrew. This is changing, and most of my discussion below is oriented towards publishing in generic media like EPUB-based ebooks.

The Alphabet: Direction

Perhaps the most obvious challenge with Hebrew is its right-to-left direction. This is not so bad in books that use the Hebrew alphabet throughout. The real challenge comes in books that sprinkle Hebrew-alphabet words inside a dominantly left-to-right context like English.

Incredibly, HTML has only recently added a bulletproof solution to this problem: the bdi element. Support for it is not yet widespread. Only slowly is the World Wide Web living up to its name; it is in transition from its roots as the Latin Wide Web. The Roman Empire dies slowly.

To make things worse, there are several other HTML technologies that might seem to solve the problem, but do not: the bdo element, and the use of dir="ltr". Actually, the semantics of dir="ltr" have recently been changed in a way that solves the problem, but these changes are even more recent than the addition of bdi. Not only is support for these new semantics not yet widespread, I’m not sure it even exists anywhere.

The EPUB world is the ignored stepchild of the HTML world. So, for example, as slow as HTML changes are to propagate out to the world of browsers, they are even slower to propagate out to the world of ereader software. And information about ereader standards compliance is hard to come by. You can easily find information about Internet Explorer standards (non-) compliance going back several versions. And, there is increasingly good information about the wild and woolly world of mobile browsers. But, for the most part, EPUB creators have to do their own experiments or follow the time-honored engineering practice of crossing one’s fingers and hoping for the best.

So, getting back to directional isolation of Hebrew, the only bulletproof, practical solution for EPUB is to ignore what HTML may offer in the future and drop down to the Unicode level, using the left-to-right mark (U+200E) after every Hebrew span. At least HTML defines a character entity reference for it: &lrm;.
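As a rough illustration of the left-to-right-mark approach, here is a minimal Python sketch. The function name, the regular expression, and the decision to treat space-separated Hebrew words as one span are my assumptions for the example, not part of any standard tool.

```python
import re

LRM = "\u200E"  # Unicode LEFT-TO-RIGHT MARK

# Assumption: a "Hebrew span" is one or more runs of characters from the
# Hebrew block (U+0590-U+05FF), joined by single spaces. Real books may
# need more separators (maqaf is already in the block; commas are not).
HEBREW_SPAN = re.compile(r"[\u0590-\u05FF]+(?: [\u0590-\u05FF]+)*")


def add_lrm_after_hebrew(text: str, as_entity: bool = False) -> str:
    """Append a left-to-right mark after every Hebrew span in text.

    With as_entity=True the HTML entity &lrm; is emitted instead of the
    raw (invisible) character, which is easier to see in markup.
    """
    mark = "&lrm;" if as_entity else LRM
    return HEBREW_SPAN.sub(lambda m: m.group(0) + mark, text)


# Example: two Hebrew words followed by "!" in an English sentence.
sample = "say \u05E9\u05DC\u05D5\u05DD \u05E2\u05DC\u05D9\u05DB\u05DD!"
marked = add_lrm_after_hebrew(sample, as_entity=True)
```

Note that this sketch is not idempotent: running it twice on the same text adds a second mark after each span.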

The Alphabet: Wrapping

A particular challenge when creating ebooks is not only directionally isolating Hebrew text, but doing so at the right granularity. Usually, this means making the isolated span as big as possible. In particular, the multiple words of a phrase using the Hebrew alphabet should be isolated in a single span. If they are not, the phrase will not wrap correctly. This is an insidious problem, for two reasons. One is that, like all line-wrapping problems, it cannot be seen unless the text happens to wrap just so. The other is that even when done correctly, wrapped RTL text in an LTR context looks pretty weird, so it is a little hard to tell when it is done wrong.

The easiest way I’ve found to think about it is this: though Hebrew is RTL, it is still top-to-bottom! So, you can’t just naively chop a Hebrew phrase in half to wrap it, leaving, as you would in English, the left half of the phrase on the top line and wrapping the right half of the phrase to the next line. That would leave the end of the phrase on the top line and the beginning on the next line!

Another way to think about it is this: think of how an English phrase should wrap if it were inside a Hebrew context. You wouldn’t leave, as you would in Hebrew, the right half of the phrase on the top line and wrap the left half of the phrase to the next line!

I’m not sure how well OCR software handles this problem, but I can say that it is one of those rare cases where even working from digital sources does not solve the problem. In a paper book, it hardly matters what means are used to achieve an end, and ebook conversion is rarely considered. So, a Hebrew phrase may be implemented, inappropriately, as more than one span. Even worse, if the span wraps, the correct order of spans (for that particular wrap) may be hard-coded, such that when it is unwrapped, the order is wrong!

Sometimes the correct granularity of Hebrew spans can be ambiguous. For example if, in a paper book, two Hebrew words appear separated by a slash, is that slash “in Hebrew” or “in English”? This is an important question, since it determines whether one single span should be used, or two spans, one for each word, should be used. This, in turn, determines wrapping behavior, as well as more obscure behaviors that expose the underlying ordering, such as text selection and text-to-speech rendering.
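To make the granularity question concrete, the following Python sketch finds Hebrew spans under a configurable policy. The function name and the joiners parameter are hypothetical; the point is only that one editorial decision (is the slash “in Hebrew”?) changes how many spans come out.

```python
import re

HEBREW_WORD = r"[\u0590-\u05FF]+"  # one run of Hebrew-block characters


def hebrew_spans(text: str, joiners: str = " ") -> list[str]:
    """Return the Hebrew spans in text, as large as the policy allows.

    joiners lists the characters that are considered to be "in Hebrew"
    when they sit between two Hebrew words, so they do not end a span.
    """
    if joiners:
        sep = "[" + re.escape(joiners) + "]"
        pattern = re.compile(f"{HEBREW_WORD}(?:{sep}{HEBREW_WORD})*")
    else:
        pattern = re.compile(HEBREW_WORD)
    return [m.group(0) for m in pattern.finditer(text)]


# Two Hebrew words separated by a slash, inside English text.
text = "see \u05D0\u05D1/\u05D2\u05D3 here"
slash_in_english = hebrew_spans(text)               # two spans
slash_in_hebrew = hebrew_spans(text, joiners=" /")  # one span
```

Whichever policy is chosen, it should be applied consistently, since it affects wrapping, text selection, and text-to-speech alike.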

The Levels of the Alphabet

In addition to its “funky” direction, the Hebrew alphabet has a complex system of diacritics. We can distinguish three levels of support for Hebrew:

  • the alphabet proper
  • the vowel points
  • the cantillation marks

(We refer to level 2 as “vowel points,” as is commonly done, despite the fact that not all of these symbols represent vowels and not all of them look like points.)

Different books will need different levels of support. Usually vowel points are only needed for Biblical Hebrew. So vowel point support is usually not needed for books using the Hebrew alphabet only for languages such as Modern Hebrew, Yiddish, Aramaic, Ladino, or Arabic. (Yes, Arabic. Though it may seem strange from the perspective of today’s geopolitics, Arabic, or, if you like, its Judeo-Arabic variants, were important Jewish languages in the past.)

Though vowel points may occasionally appear outside of Biblical contexts, cantillation marks never do. Even in Biblical contexts, they may not be used, since they are usually only relevant to the chanting of the text.

Like most classifications, this three-level one breaks down a little when subjected to scrutiny. In particular, there is a single mark, the meteg, whose use can complicate matters of digital publishing. In the three-level classification system, it would belong with the vowel points, though it is not a vowel. But, since it can cause so much trouble, and since many pointed texts do not use it, an argument could be made that a separate level should be introduced for it, as follows.

  • the alphabet proper
  • the vowel points
  • meteg
  • the cantillation marks
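As a sketch of how a publisher might find out which level a given text needs, the Python function below classifies characters by Unicode code point. The assignment of code points to levels is my reading of the Unicode Hebrew block and is an assumption; for instance, I place dagesh, rafe, and the shin and sin dots with the vowel points.

```python
def required_level(text: str) -> int:
    """Return the highest level of Hebrew support that text needs.

    0 = no Hebrew, 1 = alphabet proper, 2 = vowel points,
    3 = meteg, 4 = cantillation marks.
    """
    level = 0
    for ch in text:
        cp = ord(ch)
        if 0x0591 <= cp <= 0x05AF:      # accents (cantillation marks)
            level = max(level, 4)
        elif cp == 0x05BD:              # meteg
            level = max(level, 3)
        elif (0x05B0 <= cp <= 0x05BC    # sheva .. dagesh
              or cp in (0x05BF, 0x05C1, 0x05C2, 0x05C7)):
            level = max(level, 2)       # points (rafe, shin/sin dot, qamats qatan)
        elif 0x05D0 <= cp <= 0x05EA:    # alef .. tav
            level = max(level, 1)
    return level
```

A check like this could run over a manuscript before choosing fonts or target ereaders.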

What is the current browser and ereader support at these various levels? At present, there is widespread (although not universal) support for the Hebrew alphabet proper. Vowel point support is usually present if the alphabet is, but the quality of vowel point alignment varies. I cannot comment on support for meteg and cantillation marks since I do not have much experience with them.

Since, unfortunately, so much digital publishing is done by OCR, it is important to note that no commercially available OCR software can recognize anything beyond the Hebrew alphabet proper.

What is the current Unicode support at these various levels? The only thing that I’ve found missing from Unicode is a relatively obscure vowel point called hataf qamats qatan. Also, the whole way in which Hebrew diacritics are set up in Unicode is somewhat flawed, in that normalization can re-order diacritics whose order was significant. The following circumstances greatly mitigate the magnitude of this problem.

  • a lot of text is not sensitive to normalization (e.g. text with neither meteg nor cantillation)
  • not that much software normalizes
  • CGJ (Combining Grapheme Joiner) can be used to prevent re-ordering due to normalization

Personally, I’ve only had one problem related to Unicode’s flawed normalization of Hebrew, and that problem was tiny: it was annoying to have the W3C validator complain to me that my text was not in Normalization Form C (NFC), the recommended form for HTML text.
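The re-ordering and the CGJ workaround can be seen with Python’s standard unicodedata module. This is a small sketch; the choice of alef, patah, and meteg as the example characters is mine.

```python
import unicodedata

ALEF = "\u05D0"   # HEBREW LETTER ALEF
PATAH = "\u05B7"  # HEBREW POINT PATAH, canonical combining class 17
METEG = "\u05BD"  # HEBREW POINT METEG, canonical combining class 22
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER, combining class 0


def survives_nfc(s: str) -> bool:
    """True if normalizing s to NFC leaves it unchanged."""
    return unicodedata.normalize("NFC", s) == s


# Meteg typed before patah: NFC sorts the two marks by combining class,
# so the meteg ends up after the patah and the original order is lost.
meteg_first = ALEF + METEG + PATAH
reordered = unicodedata.normalize("NFC", meteg_first)

# A CGJ between the marks blocks the sort, so the order is preserved.
protected = ALEF + METEG + CGJ + PATAH
```

Here survives_nfc(meteg_first) is False while survives_nfc(protected) is True, at the cost of an extra invisible character in the text.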

Romanizing the Alphabet

So enough about the alphabet. Even transliterated (romanized) text can cause challenges for Jewish digital publishing, particularly when converting from OCR. There are two issues here. First, romanized words often appear in italics, and text in italics, no matter what the contents, poses certain challenges for OCR. These challenges lie both in recognizing the letterforms correctly and in recognizing the fact that they are in italics. So, not only do I seem to see a higher error rate in words in italics, I also see words whose letters were recognized perfectly but whose italic formatting was lost.

The second challenge with romanized words is that they do not spell check, at least not using generic dictionaries. So, when sprinkled into a dominantly-English text, romanized Yiddish words, for example, pose more of a challenge than, let’s say, French words. There are spell-checking systems where more than one dictionary can be enabled at once, e.g. English and French, but I know of none that allow English and romanized Yiddish. There is no technical reason why you could not have one, except that romanization is a nightmarishly diverse world of non-standards.

The higher error rate I seem to see in words in italics may come at least in part from the inability to partner OCR with spell checking rather than from inherent problems with OCR in the narrow sense (letterform recognition).

Another challenge is that the more technical styles of romanization use somewhat exotic diacritics, e.g. marking a consonant with a dot below it or marking a vowel with a line above it (a macron). These can cause problems in OCR, and some ereaders may not support them, or may not support them well. The letters alef and ayin are romanized in an unfortunate variety of ways, all of which may cause some problems for OCR. Though some ereaders may not support them, or may not support them well, I strongly recommend the adoption of Unicode’s half rings for alef and ayin (right and left respectively). This eliminates the ambiguities in the current rather spicy but confusing stew of right and left single quotes, undirected (ASCII) apostrophes, and even superscripts of the letter c!
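For reference, the two half rings are U+02BE and U+02BF. The Python sketch below names them and flags the ambiguous stand-ins in a romanized word. The helper name and the set of characters treated as ambiguous are my assumptions, and a superscript c produced by formatting (rather than by the single character U+1D9C) would not be caught.

```python
ALEF_SIGN = "\u02BE"  # MODIFIER LETTER RIGHT HALF RING, for alef
AYIN_SIGN = "\u02BF"  # MODIFIER LETTER LEFT HALF RING, for ayin

# Characters commonly pressed into service for alef or ayin.
AMBIGUOUS = {
    "'",       # undirected ASCII apostrophe
    "\u2018",  # left single quotation mark
    "\u2019",  # right single quotation mark
    "\u1D9C",  # modifier letter small c (a "superscript c")
}


def ambiguous_marks(word: str) -> list[tuple[int, str]]:
    """Return (index, character) for each ambiguous mark in word."""
    return [(i, ch) for i, ch in enumerate(word) if ch in AMBIGUOUS]
```

A check like this can only flag candidates. Deciding whether a given quote stands for alef, ayin, or an ordinary apostrophe still takes a human reader or a word list.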

Commentary and columns

Consider the following passage from Umberto Eco’s The Name of the Rose, aptly used as an epigraph to Matti Friedman’s The Aleppo Codex.

Sino ad allora avevo pensato che ogni libro parlasse delle cose, umane o divine, che stanno fuori dai libri. Ora mi avvedevo che non di rado i libri parlano di libri, ovvero è come si parlassero fra loro.
(I had thought each book spoke of the things, human or divine, that lie outside books. Now I realized that not infrequently books speak of books: it is as if they spoke among themselves.)

Jewish books, particularly on religious themes, often feature two (or more!) parallel texts. This parallelism can be literal (columns), or figurative, in commentary that may be below the text it comments on, or even more distant, in endnotes. A book of commentary may not even include the text commented upon.

These different types of parallelism pose different challenges for digital publication. In general, digital media are more like scrolls than codexes. A little bit unlike ancient Jewish scrolls, digital scrolling tends to be up/down, not side-to-side. But they share something in common: neither the digital publisher nor the sofer (scribe) can know how much the user can see at once.

This is not a problem for the modern sofer, since the primary text he produces does not even have the “commentary” of vowels, much less true commentary or translation. (Though I use the pronoun “he” above, women have started to produce sifrei Torah, a welcome development, in my opinion.)

Not knowing how much the user can see is a big problem for the digital publisher. Jewish publishing has, over hundreds of years, developed a culture of translation and commentary that has adapted itself to the specific capabilities (and limitations) of the codex. For our present discussion, the capability of interest is the ability to assume that the user can easily flit his or her focus over an entire page, or even over an entire spread of two pages.

In digital media, pages exist to varying extents. But even in an ereader, perhaps the most paged of digital media, the page only exists, as it were, in the eye (or device) of the beholder. Many factors unknown to the publisher come together to form a page. These factors include, for example, the font size, the margin size, the line spacing, and the whims of the line-breaking algorithm in play. The only factor that the publisher can respond to is the size of the device or window. (This response would be by CSS media query.)

At this point let me reiterate that, as mentioned at the beginning of this article, I am ignoring media like fixed-layout EPUB and Adobe PDF. As an aside, note that some PDF viewers have an obscure feature that provides a reflowable view of the text, and the PDF format allows for tags to assist in the proper reflowing of text. Though intriguing, reflowable views and reflow tags are rarely exploited features. Also, I am ignoring cases in which the publisher makes the (usually inadvisable) choice to (attempt to) control those factors that are usually left as “free variables” like font size, margin size, line spacing, etc.

So, Jewish digital publishers can’t design for the page, much less for the spread. Thankfully, columns aren’t really a page-based concept. But, columns do require a medium whose width can accommodate them. Columns may not fit, or may not fit comfortably, on many smaller devices such as small tablets, E-Ink devices, and phones. Of course fit depends on factors including number of columns, font, margins, and orientation (landscape vs. portrait). Of these factors, usually only the number of columns is (or should be) under the publisher’s control. Various techniques of responsive design can provide columns in those media that can accommodate them, and serial presentation in narrower media. This is a good compromise.
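One way such a compromise might look is sketched below as a small Python helper that a build script could use to emit the stylesheet. The class name parallel, the 40em threshold, and the use of flexbox are all assumptions for illustration, and ereader support for any of this would need testing.

```python
def parallel_css(min_width_em: int = 40) -> str:
    """Return CSS that shows parallel texts side by side on wide screens
    and one after the other (serially) on narrow ones.

    Expects markup like: <div class="parallel"><div>..</div><div>..</div></div>
    """
    return (
        "/* Default: serial presentation, one text after the other. */\n"
        ".parallel > div { margin-bottom: 1em; }\n"
        "/* Wide enough: lay the texts out as equal-width columns. */\n"
        f"@media (min-width: {min_width_em}em) {{\n"
        "  .parallel { display: flex; }\n"
        "  .parallel > div { flex: 1; margin: 0 1em; }\n"
        "}\n"
    )


stylesheet = parallel_css()
```

Expressing the threshold in em rather than pixels lets the breakpoint scale with the reader’s chosen font size, which is one of the factors the publisher does not control.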

Though columns provide literal parallelism, other mechanisms of paper publishing such as footnotes provide a kind of parallelism, too. Of course footnotes don’t port directly to most digital media. So they are usually represented by a link. EPUB provides a way to specify that an HTML link is a note. (Since the distinction between footnotes and endnotes is gone, it is just a note.) Some ereaders present these notes as pop-ups rather than jumps to a new location. This can give a less disruptive reading experience than the jump behavior of a generic HTML link.
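In EPUB 3 this is done with the epub:type attribute: noteref on the link and footnote on the note itself. The Python helpers below are a hedged sketch of generating that markup; the function names are hypothetical, and the containing XHTML file is assumed to declare the epub namespace (xmlns:epub="http://www.idpf.org/2007/ops").

```python
from html import escape


def note_link(note_id: str, label: str) -> str:
    """Markup for the in-text reference to a note."""
    return f'<a epub:type="noteref" href="#{escape(note_id)}">{escape(label)}</a>'


def note_body(note_id: str, text: str) -> str:
    """Markup for the note itself, which an ereader may show as a pop-up."""
    return (
        f'<aside epub:type="footnote" id="{escape(note_id)}">'
        f"<p>{escape(text)}</p></aside>"
    )


link = note_link("n1", "1")
body = note_body("n1", "See the commentary on this verse.")
```

Ereaders that do not recognize these attributes should still treat the link as an ordinary jump to the aside element.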

Bible software provides the most sophisticated medium for parallel texts. It has the distinction of not only being able to synchronize the presentation of texts within a product but also between products. So, the user can do things like create his or her own three-column view containing two Bible translations and a commentary.

Unfortunately, these advanced capabilities are limited to synchronizing canonical texts. So, for example, they cannot be used to synchronize the display of any old book and its footnotes. Nor can these capabilities be used to synchronize books outside of the canon. The canon is defined by the software, which, understandably, has a Christian bias reflecting its customer base. I don’t mean to diminish the accomplishments of Bible software: the synchronization they do provide is an impressive achievement of under-appreciated difficulty. For some insight into the difficulty of even synchronizing references within the most canonical of works, namely, the Tanakh, see this Logos blog post.

Benjamin Denckla is an independent software engineer specializing in ebook creation without OCR. His particular focus has been ebooks containing Biblical Hebrew. Past phases of his software career have included realtime digital audio processing, build and test systems, and LAMP web sites. A version of this article originally appeared on Ben’s GitHub site.


About the Author

Guest Contributor

Guest contributors to Publishing Perspectives have diverse backgrounds in publishing, media and technology. They live across the globe and bring unique, first-hand experience to their writing.