Data Conversion Laboratory: Making the Long Tail

In Digital by Edward Nawotka


Data Conversion Laboratory’s Mark Gross argues that digitizing even the most esoteric material and making it searchable can create a market for it.

“Everything is getting more international, especially over the last four or five years,” says Mark Gross, CEO of Data Conversion Laboratory (DCL), based in Fresh Meadows, Queens. “Now you can deliver information all over the world. An electronic book is delivered immediately. It is very cheap to deliver an electronic product. That has changed the market over the last few years.”

Mark Gross, CEO, Data Conversion Laboratory

DCL, which has clients in some 35 countries, is focused largely on digital conversion services for STM, education and journal publishers, as well as a growing clientele of university presses. Among its largest clients are PubMed Central, which is part of the NIH, and Elsevier and Brill, both based in the Netherlands, as well as companies that require technical publications, such as Caterpillar, which publishes in more than 100 languages.

The company is also working with Bowker to assist authors in direct conversion of titles to ebook formats.

Unlike numerous “conversion companies,” DCL’s focus over the years has been automation, which derived from the firm’s origins in the computational world and the electronic world. “Our first work was in automated publishing and we approached this topic of how to make information digital, how do you automate as much as possible,” says Gross. At present, some 95-99% of DCL’s conversions are automated, with some hitting 100%.

Converting backlist is a top priority for many journal publishers, because it raises discoverability. “It opens up tremendous markets,” says Gross. “There are many organizations that have content lying dormant, and it’s not just scientific. We just completed work for the American Alpine Club, which has the largest collection of mountaineering maps in the world. They specialize in the Himalayas, and before you would have to go to Colorado to see them. They have put it all online and made it available. And they are now thinking of geotagging. They also have the largest collection of mountaineering accident reports, going back to the 1800s. Information buried all over the place.”

Gross notes that one of DCL’s customers, the Optical Society of America, has converted all of its articles dating back to 1917. “At my first meeting with them, I didn’t know whether they were serious: why would physicists want to look at articles from 1917? But the research in many fields does go back to the ’20s, ’30s and ’40s. You need access to that information in order to do additional research. In the first 18 months, they started finding the marketplace was very intense.”

Elsevier: Indexing Nearly 3 Million Articles

So how can a company like Elsevier, which already has much of its material online and digitized, benefit from additional work by DCL? “They talk about going back to the publishers of the 1500s, to 1550, and what they are doing now, and we’re helping them deal with the references and indexes in articles going back to the 1990s,” says Gross. “And while all the stuff is available online already, the references are not broken out as finely as modern scientific standards now require. When people do standards in XML and you are dealing with the scientific world, you can break down the bibliographic references to a microlevel that makes them easier to identify. Material more than 10 or 15 years old is not as findable in its current state.”

To get a sense of the scale of DCL’s current work for Elsevier, consider that it covers 2.5 million articles published prior to 1996. Say each one has 25 references: this results in a total of more than 60 million references that are suddenly searchable. And that is just ten years.

“With Elsevier, we knew it had to be an automated approach that was as close to 100% as we could go. We have done tens of millions of references. That might be the differentiator between us and other conversion companies,” says Gross.

Work Load and Efficiencies

The majority of DCL’s technology and customer-facing operations are in New York, where the company employs 65 people who handle all the tech, R&D and project management. The labor-intensive work, proofing and checking, is mostly done in India and China, and there are fairly large groups spread across the globe who work in specialized areas, such as math, composition, transcription and reference checking.

Automation reduces the cost of conversion to one-tenth of the cost of doing the work by hand.

All this “is more affordable than you might think,” says Gross. Why? Again, automation. “It makes things economically viable that were not viable before. It would have been a tough business case if it were being done in a traditional way. The high level of accuracy now is what makes it viable. If it wasn’t accurate it wouldn’t be worthwhile. When you deal with large repositories of information, it might be 300 people working on a project like that. How do you pass along the information, rules and instructions so everyone understands it the same way? It’s just not possible to be as accurate at 9 a.m. as at 5 p.m. Electronically there are hundreds of rules that we are following.”

Citing the firm’s work with the U.S. Patent Office, Gross notes, “For three years now we have been processing patent documents that come in from the lawyers (not the ones that have been published). For years, these would come in and be imaged, but now they are searchable. The patent office is years behind. They wanted a system for taking images coming in and turning them into XML. We got involved, my technical people persevered with a fully automated process, and we do over a million pages a month for them, taking an image and putting it into full XML.”

“Automation is fractional compared to the cost of doing it by hand,” he says. “Just one-tenth the cost.”

University Presses and Non-profits

In previous generations, many university presses might publish a book that would attract, at most, 100 customers. But digitization and globalization open up a world of thousands or tens of thousands of potentially interested readers.

“For many STM, journal, and academic publishers, we are going back and helping them retag their content and put more granularity into it. The importance of that is that people will be able to find and download content more easily. Authors can be read more widely and their rankings go up, and it has implications for tenure, for example. People in the scientific world are tracking how closely they are being read. It’s another example of people having the content and giving it more granularity. Again, you can now find stuff you wouldn’t have been able to find before.”

What this ultimately means for nonprofits is that they are fulfilling their mission of disseminating information, while for-profit bodies have new products they didn’t have before. Subscribing to a journal or becoming a member of an organization suddenly becomes more valuable.

Gross returns to his example of The Optical Society. “We took all the images from the articles and separated them out with captions — there are over 350,000 images in their databank and that is a whole separate product people are interested in. That database became a product. They can do historical sequence. With very little effort they can do a collection of retrospective articles on a physicist.”

While some in the publishing industry have argued that the long tail is a myth, “what we have found is quite the opposite,” says Gross. “My experience is that once you make that information available on a large scale, even the most esoteric material, and multiply it by billions of people, then suddenly you have a market. That is the real long tail.”

About the Author

Edward Nawotka

A widely published critic and essayist, Edward Nawotka serves as a speaker, educator and consultant for institutions and businesses involved in the global publishing and content industries. He was also editor-in-chief of Publishing Perspectives from the launch of the publication in 2009 until January 2016.