A Data Scientist Explains Some Approaches to “Big Data”

When it comes to data analysis, ask yourself, “Is it one book selling a million copies or a million books selling one copy?” says Hilary Mason.

By Edward Nawotka, Editor-in-Chief

Hilary Mason

Hilary Mason, CEO and Founder of Fast Forward Labs, is a data scientist. She honed her skills for four years as the chief data scientist at Bit.ly and has since launched her own consultancy and co-authored the book Data Driven (O’Reilly). While addressing the crowd at Digital Book World last week, she offered her observations about how publishers can use, exploit and even play with “big data.”

The following is an edited transcript of her remarks, sometimes paraphrased for clarity:

It’s 2015, so let’s look at how we communicate. Reddit is the armpit of the internet, a free place with little moderation. Some horrible things have bloomed there, but also some really beautiful things.

Some things bubble up and get a lot of attention. There was a question on Reddit, “If someone from the 1950s were alive today, what would be the most difficult thing to explain to them about the way we live?”

The answer: a cell phone connected to the internet, something that contains all the world’s information, but which we use to look at photos of cats and get into arguments with strangers.

Today, everyone has their own documentation of a public experience, which goes out to their own private web. It has provided a tremendous stream of data.

When we consider “big data” we need to ask a few questions. How big is it? Is it too big to fit in Excel? Is that “big”? Engineers define big data, basically, as data that is too big to fit on one computer. Of course, that changes every year.

The innovation of big data is not about the size, but about our retooling of the data to fit our psychology.

Being a data scientist involves three key areas: math and statistics; programming (knowing how to manipulate data and get it out); and, the hardest part, asking good questions. Ultimately, data analysis is about analyzing human behavior.

One of my favorite quotes comes from William Gibson, who said: “The future is already here, it’s just not evenly distributed.”

Here is a simple example from my time at Bit.ly, where we saw some interesting differences. When we looked at the discussion around pizza [here she displayed a “word cloud”], we saw that in New York the key word associated with it was “slice,” in California it was “artichoke,” and in Rome it was simply “pizza.” From this you can see cultural differences emerging around something as simple as pizza.

Another thing we realized is that what people share is different from what they read. People share things they feel strongly about, such as politics and opinion, but they are clicking on celebrity news and sports, just like everyone else.

We also learned a few more surprising facts:

There are far more photos of dogs shared than cats.

We tried experimenting with sharing links with emotional labels, such as anger or happiness.

Social influence bias exists on the web. We experimented with posting stories that started with a single “thumbs up” and others that started with a “thumbs down.” If the article was good, the thumbs down was voted back up, but posts that started with a positive response were voted up and up and up…

Another experiment was able to locate the epicenter of an earthquake using data from fitness bands that tracked when people woke up.

But it is one thing to analyze data and another to put it into action. One project looked at the optimal placement for ambulances waiting for calls in New York City. It found that most ambulances sat at places that were close to a 24-hour bathroom and coffee. So the project helped move them to places that had similar amenities, but were also closer to the optimal point in their area to make a response. The result has been dramatically lower response times in New York City.

So, how do you do data science in a competitive resource environment? First define the problem you want to solve.

Ask: what are my error metrics for having successfully answered that question?

If we answer this question correctly, what is the first thing that we will do with it?

If everyone in the world uses this, how does it change human behavior?

Finally — what is the most evil thing that can be done with this?

We like to think startups are where all the action is, but that is not true: the largest repositories of information are locked in large companies.

In publishing, fan communities are one example.

And remember that it is important to critically analyze your data: averages alone are not useful, and you must understand the underlying context. Is it one book selling a million copies or a million books selling one copy?
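To make the point about averages concrete, here is a minimal sketch in Python. The two catalogs are hypothetical, invented for illustration rather than taken from Mason's talk: each has a million titles and a million copies sold in total, so the average per title is identical, while the median and the share held by the top title tell very different stories.

```python
from statistics import median

def summarize(sales):
    """Summarize a non-empty list of per-title unit sales."""
    total = sum(sales)
    return {
        "titles": len(sales),
        "total": total,
        "mean": total / len(sales),
        "median": median(sales),
        # Fraction of all copies sold by the single best-selling title.
        "top_share": max(sales) / total if total else 0.0,
    }

# Hypothetical catalogs: same number of titles, same total copies sold.
n_titles = 1_000_000
one_hit = [n_titles] + [0] * (n_titles - 1)  # one book sells a million copies
long_tail = [1] * n_titles                   # a million books sell one copy each

hit_stats = summarize(one_hit)
tail_stats = summarize(long_tail)

for name, stats in (("one hit", hit_stats), ("long tail", tail_stats)):
    print(name, stats)
```

Both catalogs report the same mean, so the average cannot tell them apart. The median and the top title's share of sales are what reveal whether the business rests on one blockbuster or on a broad backlist.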

About the Author

Edward Nawotka

A widely published critic and essayist, Edward Nawotka serves as a speaker, educator and consultant for institutions and businesses involved in the global publishing and content industries. He was also editor-in-chief of Publishing Perspectives from the launch of the publication in 2009 until January 2016.