How Big Data Helps Us Deal with More and More Books


Alejandro Piscitelli argues that machine-driven book analytics is a practical tool to help society deal with the exponential increase in book publications.

By Alejandro Piscitelli



An astonishing number of books are published every day. We are not referring to their digital clones or to other such curiosities: by the word “books” we mean the more than one million new titles released every year, in all the languages spoken across the world.

“Ars longa, vita brevis.” Art is long, life is short.

Art outlasts life by a generation or two, yet at the current pace of publication the number of books is growing out of control, and we cannot help but wonder what we will do with them all.

According to Franco Moretti, an Italian literary critic, former Marxist, and key proponent of the Digital Humanities, the solution is simple: don’t read them. Moretti is absolutely serious about this. The distinguished scholar is not only the founder and director of the Stanford Literary Lab, but also the author of several pamphlets that seek to demonstrate that computers can recognize literary genres and that network theory can be applied to the analysis of plots.

The Literary Lab’s mission, as a matter of fact, is to address literary problems using the scientific method: testing hypotheses, developing computer models, and performing quantitative analysis. This is essentially what the Digital Humanities do, though ten years after the field’s emergence many scholars continue to question the approach (see Daniel Allington, “The Managerial Humanities; or, Why the Digital Humanities Don’t Exist”).

Franco Moretti (together with Lev Manovich) is undoubtedly the most audacious exponent of this much-debated discipline. He strongly believes that it is not only possible but also desirable to understand literature without studying literary texts. The point is to gather massive amounts of data in order to detect hidden patterns that reading book by book cannot reveal. We need “distant reading,” Moretti argues, because its opposite, close reading, cannot uncover the true scope and nature of literature.

Literary critics focus only on the most significant books, those generally considered part of the canon: 200 books a century at most. Moretti asks whether this figure can be representative, given that as many as 60,000 other novels were published in 19th-century England alone, not even counting works from other epochs or places. Since no feasible amount of reading can fix that, what is called for is a change not in scale but in strategy. To understand literature, Moretti argues, we must stop reading books. Let robots read them for us.
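What “robot reading” might look like in its simplest form can be sketched in a few lines of code. The toy classifier below is purely illustrative: the snippets, genres, and helper names are invented, and this is not the Literary Lab’s actual method. It compares a text’s word frequencies against a per-genre average, a crude statistical signal with no grasp of meaning.

```python
from collections import Counter
import math

def word_freqs(text):
    """Lowercase bag-of-words frequencies for a snippet of text."""
    words = text.lower().split()
    return {w: c / len(words) for w, c in Counter(words).items()}

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[w] * b.get(w, 0.0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented miniature "training set": genre -> sample snippets.
CORPUS = {
    "gothic": [
        "the ruined abbey loomed dark and the ghost wailed in the crypt",
        "a dark curse haunted the crumbling castle and its pale ghost",
    ],
    "detective": [
        "the inspector examined the clue and questioned the suspect",
        "a clue at the scene led the detective to the guilty suspect",
    ],
}

def genre_centroids(corpus):
    """Average word-frequency vector per genre."""
    centroids = {}
    for genre, texts in corpus.items():
        merged = Counter()
        for t in texts:
            merged.update(word_freqs(t))
        centroids[genre] = {w: v / len(texts) for w, v in merged.items()}
    return centroids

def classify(text, centroids):
    """Assign the genre whose centroid is most similar to the text."""
    return max(centroids, key=lambda g: cosine(word_freqs(text), centroids[g]))

centroids = genre_centroids(CORPUS)
print(classify("the ghost drifted through the dark crypt of the abbey", centroids))
# prints "gothic"
```

The point of the sketch is only that surface statistics, without any understanding, can already separate texts; scaling the same idea from a handful of snippets to tens of thousands of novels is what distant reading proposes.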

Stanford Literary Lab

An image from Stanford Literary Lab’s study looking at the relations of characters in Shakespeare’s Hamlet.

The first experiment carried out by Moretti’s team consisted of feeding 30 novels, identified by genre, into two computer programs, which were then asked to recognize the genre of six additional works. Both programs succeeded, one using grammatical and semantic signals, the other using word frequency. Another famous Stanford Literary Lab pamphlet seeks to detect hidden patterns in plots (primarily in Hamlet) by transforming them into networks. To do so, Moretti turns characters into nodes (“vertices” in network theory) and their verbal exchanges into connections (“edges”). His networks make visible specific “regions” within the plot and enable experimentation.
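The network idea is easy to make concrete. In the sketch below, a hand-picked, heavily simplified subset of speaking exchanges from Hamlet (an illustrative edge list, not the Lab’s actual data) becomes an undirected graph, and a plain degree count already exposes the densest “region” of the plot around the protagonist.

```python
from collections import defaultdict

# Characters become nodes ("vertices"); a simplified, hand-picked verbal
# exchange between two characters becomes an undirected edge.
EXCHANGES = [
    ("Hamlet", "Horatio"), ("Hamlet", "Gertrude"), ("Hamlet", "Claudius"),
    ("Hamlet", "Ophelia"), ("Hamlet", "Polonius"), ("Hamlet", "Ghost"),
    ("Claudius", "Gertrude"), ("Claudius", "Polonius"), ("Claudius", "Laertes"),
    ("Polonius", "Ophelia"), ("Polonius", "Laertes"), ("Horatio", "Marcellus"),
]

def build_graph(edges):
    """Adjacency map: character -> set of interlocutors."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def degrees(graph):
    """Degree (number of distinct interlocutors) per character."""
    return {node: len(neighbours) for node, neighbours in graph.items()}

graph = build_graph(EXCHANGES)
deg = degrees(graph)
print(max(deg, key=deg.get))  # the most-connected character
```

Even on this toy edge list the count singles out Hamlet, which is the kind of structural fact Moretti’s networks make visible without anyone reading a line of the play.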

Needless to say, many critics argue that despite the enormous effort made by the people at the Literary Lab, the conclusions of these pamphlets are generally predictable or irrelevant. On this point, Kathryn Schulz’s “What Is Distant Reading?” is worth reading.

Franco Moretti

Reading Moretti, it’s impossible not to notice him vying for scientific status. He appears now as literature’s Linnaeus (taxonomizing a vast new trove of data), as Vesalius (exposing its essential skeleton), as Galileo (revealing and reordering the universe of books), as well as Darwin (seeking “a law of literary evolution”). The trouble is that Moretti is not studying a science. Literature is an artificial universe and the written word, unlike the natural world, cannot be counted on to obey a set of laws.

Moretti, enthusiastic over the prospect of “a unified theory of plot and style,” argues that literature is “a collective system that should be grasped as such.” But this is a theology of sorts — if not the claim that literature is a system, at least the conviction that we can find meaning only in its totality. Moreover, as theologies go, Moretti’s is neither new nor unheard of. The idea that truth can best be revealed through quantitative models dates back to the development of statistics.

The most epistemologically interesting path for us might be the combination of in-depth close reading with distant, macroscopic reading. Rather than treating the two approaches as incompatible, we should mix and combine them, as Francisco Varela taught us. Given our millennia-long dedication to exegesis, perhaps we should dedicate more time to the relatively new science of Big Data. It may well yield a stimulating method of analysis, presenting us with new elements to discover, analyze, and adapt. After all, as Stephen Wolfram says: “People Are More Predictable than Particles.”

Alejandro Piscitelli is a philosopher and Professor of Data Processing, Informatics and Telematics at the University of Buenos Aires. A version of this article originally appeared in the literary journal Trama y Texturas and was translated from the Spanish by Valentina Morotti.

About the Author

Guest Contributor

Guest contributors to Publishing Perspectives have diverse backgrounds in publishing, media and technology. They live across the globe and bring unique, first-hand experience to their writing.