Data Mining for STM Content Offers Opportunities, Obstacles

By Roy S. Kaufman and Emily Sheahan

While demand for data mining of scholarly content is mounting, lack of standardization of search technologies, interfaces and licensing terms hinders its use.

Publishers of scientific, technical, and medical (STM) content generally handle mining requests from third parties liberally. However, they have concerns if the mining results can replace, or compete with, the original content, or if the mining burdens their systems. Few publishers have publicly available mining policies, and most handle mining requests case by case.

According to one study, mining of open-access journals is generally allowed as part of standard terms and conditions. For content published in a more restrictive way, however, nearly all publishers require information about the intent and purpose of the mining request [Smit and van der Graaf, 2011]. In addition, for many publishing organizations, as well as the associations on whose behalf journals and books are published, administering complicated mining rights is labor intensive and expensive. As a consequence, the supply side of STM data mining currently lacks broad-based product offerings.

On the demand side, the audience is broad, and the needs vary significantly. Content mining holds tremendous potential for unlocking scientific discoveries and a great deal more. Scientists who conduct research within pharmaceutical companies might be interested in studying all reported side effects of a particular substance. Others might be doing research on the possible correlation between certain genes and a particular disease.

But the potential for data mining goes much further. In the financial world, data mining techniques are pervasive and are already deployed in numerous applications, ranging from black box trading to macro-economic analyses. Enriching these applications with additional information from STM data mining could be of great value to hedge fund managers and other finance professionals. In legal affairs, the mining of content can help in addressing challenges in IP litigation such as e-discovery. Engineers can enhance risk factor analysis and quality assurance by mining information associated with a particular substance or process. Publishers may benefit from a way to mine their own content and that of other rightsholders to open up new markets for this information. The list of STM data mining applications is endless.

In discussions of the mining of scientific articles and books, there is a clear intersection between data and text mining and access. Content mining benefits from the availability of a body of content that is sufficiently large and relevant to a particular topic. In many cases a significant volume of that content is available only by subscription. To be as effective as possible, semantic data and text searches require access to all content, whether available through open access, through repositories, or through subscriptions.

When looking at the mining issue, publishers ask some very valid questions. How can rightsholders that allow mining protect their most valued asset, their content, from theft or manipulation? Can rightsholders allow searches of their databases without allowing their entire database of works to be copied? How does a rightsholder determine when to charge for queries, and how much? Can rightsholders allow searches of their databases without creating problems for other users of the database?

Clearly, text and data mining is one of the next areas in which both publishers and content users are eager to find a solution, one that enables scientific discovery while realizing and protecting the value of rightsholder content. Such a voluntary solution may require the participation of an intermediary, an experienced collective management organization that can design policies and processes that effectively serve the needs of pharmaceutical companies, STM publishers, researchers and others.

Roy S. Kaufman is Managing Director of New Ventures, Copyright Clearance Center, and Emily Sheahan is Executive Director, Copyright Clearance Center.

Reference: Smit, Eefke, and Maurits van der Graaf. Journal Article Mining: A Research Study into Practices, Policies, Plans…and Promises. Publishing Research Consortium (PRC), June 2011. 153 pp.

About the Author

Guest Contributor

Guest contributors to Publishing Perspectives have diverse backgrounds in publishing, media and technology. They live across the globe and bring unique, first-hand experience to their writing.