As we’ve been telling Copyright Clearance Center’s (CCC) Christopher Kenneally, CCC’s “Data Dilemma” video may be the first trailer ever created for a panel discussion. We’ll add it here for you, as Mark Piesing gets started with a look at how the session went near the end of the week at Olympia London. – Porter Anderson
By Mark Piesing | @MarkPiesing
Interstellar explorers. Star maps. Industrial Revolution. And evolution’s pitched battles of survival.
There’s something about Big Data that brings out Big Metaphors, and the seminar “The Data Dilemma” at the London Book Fair on Thursday (April 14) was no exception.
“Big data” is a term for data sets created by the modern digital world that are so large or complex that traditional, more human-scale ways of analyzing data are inadequate and it’s up to raw computing power and machine intelligence to see the patterns within.
Publishing has until now been behind the curve in unlocking the secrets of Big Data to choose what to publish, help readers discover new things to read, and so increase or even predict new trends. Instead, publishers have generally preferred to rely on old-fashioned human intuition to make their decisions.
However, as the three panelists argued in this London Book Fair Insights Seminar program, this has to change.
“Data is the new oil, but information is the new petroleum, as by itself data has no business value until you give it content,” said Haralambos “Babis” Marmanis, Vice President and CTO of Copyright Clearance Center.
“So the ‘data dilemma’ is what to do with the data – how we can extract meaning and value from it?
“There are three things that have made this possible now, such as increasing computing power, programmable infrastructure such as the cloud which allows you to create on-demand computing infrastructure, and machine learning like Natural Language Processing [NLP].
“Publishing is like an expedition in orbit around an alien planet working out what the resources are and then how to exploit it.”
Discoverability and Natural Language Processing
“Book discovery is one of the biggest problems facing publishing,” said Jim Bryant, CEO and co-founder of Trajectory, a global digital distribution and book discovery network that applies algorithmic book recommendation to the problem.
“We have analyzed the full text of a large number of books in English, German and Chinese using NLP down to the parts of speech of individual sentences,” Bryant said. “We normalize to remove frequently occurring words to find the author’s unique writing style. Using techniques like sentiment analysis which give a score to positive words like ‘outstanding’ [+5] and negative words like ‘catastrophe’ [-5], we try to create a personality for each book that can be expressed visually in a cloud of key words.
“This cloud is particularly useful for book discovery when a book is translated into other languages. It allows us to find the closest matching books to recommend to readers rather than by analyzing behavior, which most of retail uses.”
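As a rough illustration of the steps Bryant describes, the sketch below strips common words, scores sentiment from a word list, and compares two keyword profiles. It is not Trajectory's method: the stop-word list, the two-entry lexicon (which borrows only the +5 and -5 examples quoted above), the sample sentences, and the use of cosine similarity as the matching measure are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and sentiment lexicon, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "was", "it", "in", "to", "but", "at", "one"}
SENTIMENT = {"outstanding": 5, "catastrophe": -5}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def keyword_profile(text):
    """Count the words that remain after frequently occurring words are removed."""
    return Counter(w for w in tokenize(text) if w not in STOP_WORDS)


def sentiment_score(text):
    """Sum the lexicon score of every token; unknown words score 0."""
    return sum(SENTIMENT.get(w, 0) for w in tokenize(text))


def cosine_similarity(a, b):
    """Compare two keyword profiles; 1.0 means identical word proportions."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


book_a = "An outstanding voyage, an outstanding crew, and one catastrophe at sea."
book_b = "The crew feared a catastrophe at sea."
book_c = "Quarterly tax guidance for accountants."

profile_a = keyword_profile(book_a)
profile_b = keyword_profile(book_b)
profile_c = keyword_profile(book_c)

print(profile_a.most_common(3))
print(sentiment_score(book_a))
print(cosine_similarity(profile_a, profile_b))
print(cosine_similarity(profile_a, profile_c))
```

In this toy setup the two seafaring samples share several keywords and so score well above zero, while the tax sample shares none. A production system would presumably work on full texts, far larger lexicons, and language-specific processing.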
Relevance, Not Data, When It Comes to Dilemma
“I don’t talk about a data dilemma, I talk about a relevance dilemma,” said Sybil Wong Ph.D., of Sparrho, a search and recommendation tool that professionals can use to find scientific information in their fields.
“There are 2.5 million scientific research articles published every year, which breaks down to 208,300 articles every month. Every researcher reads 22 on average a month, which is 0.0001% [by these figures, 22 of 208,300 is closer to 0.01%] of those published – and this has peaked. So it is rather like looking at a star field and trying to think about which stars you should explore.
“The challenge for the user is how to find recommendations in a timely manner. For us the challenge is to understand what they’re using our tool for and why they’re not using Google search.
“What we found is that they use us when they are not sure what they are looking for but they are looking for inspiration. The other use is to stay ahead of the competition – which is easier to do. We employ a reductive approach to what is relevant by eliminating papers that have not been published in the last 30 days, and don’t mention key words of interest to the author or their journals of interest, so that rather than finding a needle in a giant haystack, we are splitting haystacks into smaller and smaller haystacks as we narrow the search for known unknowns.”
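The reductive filtering Wong describes can be sketched in a few lines. This is a guess at the logic, not Sparrho's implementation. It assumes a paper is kept only if it appeared within the last 30 days and either mentions a keyword of interest or sits in a journal of interest. The `Paper` fields, journal names, and sample records are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Paper:
    title: str
    journal: str
    published: date
    abstract: str


def is_relevant(paper, today, keywords, journals, max_age_days=30):
    """Keep a paper only if it is recent AND matches a keyword or journal."""
    age = today - paper.published
    recent = timedelta(0) <= age <= timedelta(days=max_age_days)
    text = f"{paper.title} {paper.abstract}".lower()
    mentions_keyword = any(k.lower() in text for k in keywords)
    in_journal = paper.journal in journals
    return recent and (mentions_keyword or in_journal)


def narrow(papers, today, keywords, journals):
    """Split the haystack: return only the papers that pass the filter."""
    return [p for p in papers if is_relevant(p, today, keywords, journals)]


# Hypothetical sample data.
today = date(2016, 4, 14)
papers = [
    Paper("CRISPR screening in yeast", "Journal A", date(2016, 4, 1), "A genome editing study."),
    Paper("CRISPR off-target effects", "Journal A", date(2015, 11, 2), "An older survey."),
    Paper("Soil moisture sensors", "Journal B", date(2016, 4, 10), "Field measurements."),
    Paper("Protein folding notes", "Journal C", date(2016, 3, 20), "Short report."),
]

kept = narrow(papers, today, keywords=["crispr"], journals={"Journal C"})
print([p.title for p in kept])
```

Each pass of such a filter shrinks the candidate set, which is the "smaller and smaller haystacks" idea. Matching on exact keywords also shows why alternative wordings for the same topic slip through.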
What machine curation struggles to do is to recognize articles that the user feels are relevant, recognize people with similar motivations or recognize alternative words to key words of interest.
“If I was building a platform now,” said Wong, “I would want to make sure I was collecting the right data now that I might want to use in a year’s time. In our previous version of Sparrho, we had personalized feeds and you could click what was relevant or irrelevant and that’s all. In our new version, we have a pin-board where you can save and share what to read later – and even add comments – even if we don’t have the processing power yet to analyze it.
“The challenge, though, is how to monetize this as, since it is an academic tool for research, we don’t want to turn it into an ad platform.”
The message of the session was summed up by Marmanis. “Our capabilities have now reached a tipping point,” he said.
“Technology is no longer the bottleneck. The bottleneck is how we use it and the limits of our imagination. Machines are here to make us more powerful and not replace us.
“This is the fourth industrial revolution. Clearly the world is changing and the publishing world has to change with it. It is not the strongest that survive, it is the species that is most able to adapt that survives.
“This is a strategic inflection point where those who see it and embrace the technology are still close to those who carry on with business as usual. But they will quickly move apart.
“Do you want to be a dinosaur or a dynamo?”