The Origins of ‘Big Data’: An Etymological Detective Story

Steve Lohr
February 1, 2013 9:10 am

Words and phrases are fundamental building blocks of language and culture, much as genes and cells are to the biology of life. And words are how we express ideas, so tracing their origin, development and spread is not merely an academic pursuit but a window into a society’s intellectual evolution.

Digital technology is changing both how words and ideas are created and proliferate, and how they are studied. Just last month, for example, the Library of Congress said its archive of public Twitter messages had reached 170 billion tweets and was growing by about 500 million tweets a day.

The Library of Congress archive, resulting from a deal struck with Twitter in 2010, is not yet open to researchers. But the plan is that it soon will be. In a white paper, the Library said that social media promises to be a rich resource that provides “a fuller picture of today’s cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes.”

The new digital forms of communication — Web sites, blog posts, tweets — are often very different from the traditional sources for the study of words, like books, news articles and academic journals.

“It’s almost like oral language instead of edited text,” said Fred R. Shapiro, editor of the “Yale Book of Quotations” and an associate librarian at the Yale Law School. “It’s the way of the future.”

The unruly digital data of the Web is a big ingredient in what is now being called “Big Data.” And as it turns out, the term Big Data seems to be most accurately traced not to references in news or journal archives, but to digital artifacts now posted on technical Web sites, appropriately enough.

To our modest tale of word sleuthing: Last August, I wrote a Sunday column about 2012 being the breakout year for Big Data as an idea, in the marketplace, and as a term.

At the time, I did some reporting on the roots of the term, and I asked Mr. Shapiro of Yale to dig into it. He scoured databases and came up with several references, including in press releases for product announcements and one intriguing use of the term by a now-famous author (more on that later).

But Mr. Shapiro couldn’t find anything as crisp and definitive as he had done for me years earlier when I asked him to try to find the first reference to the word “software” as a computing term. It was in 1958, in an article in “The American Mathematical Monthly,” written by John Tukey, a Princeton mathematician.

So, without a conclusive answer, I didn’t write about the origins of the term Big Data in that Sunday column. But afterward, I heard from people who had ideas on the subject.

Francis X. Diebold, an economist at the University of Pennsylvania, got in touch and even wrote a paper with the mildly tongue-in-cheek title “I Coined the Term ‘Big Data.’ ” I had not thought of economics as the breeding ground for the term, but it is not unreasonable. Some of the statistical and algorithmic methods now in the Big Data tool kit trace their heritage to economic modeling and Wall Street.

Mr. Diebold staked a claim based on his paper, “Big Data Dynamic Factor Models for Macroeconomic Measurement and Forecasting,” presented in 2000 and published in 2003. The economic-modeling paper was the first academic reference to Big Data found, according to research by Marco Pospiech, a Ph.D. candidate at the Technical University of Freiberg in Germany.

By then, I had heard from Douglas Laney, a veteran data analyst at Gartner. He said the father of the term Big Data might well be John Mashey, who was the chief scientist at Silicon Graphics in the 1990s.

I replied to Mr. Diebold that I thought from what I had seen he probably had plenty of competition. And I passed along the e-mail correspondence I had received. Mr. Diebold said thanks much, and added that he had a University of Pennsylvania research librarian looking into it as well.

The term Big Data is so generic that the hunt for its origin was not just an effort to find an early reference to those two words being used together. Instead, the goal was the early use of the term that suggests its present connotation — that is, not just a lot of data, but different types of data handled in new ways.

The credit, it seemed to me, should go to someone who was aware of the computing context. That is why, in my view, a very intriguing reference, discovered by the Yale researcher Mr. Shapiro, does not qualify.

In 1989, Erik Larson, later the author of bestsellers including “The Devil in the White City” and “In The Garden of Beasts,” wrote a piece for Harper’s Magazine, which was reprinted in The Washington Post. The article begins with the author wondering how all that junk mail arrives in his mailbox and moves on to the direct-marketing industry. The article includes these two sentences: “The keepers of big data say they do it for the consumer’s benefit. But data have a way of being used for purposes other than originally intended.”

Prescient indeed. But not, I don’t think, a use of the term that suggests an inkling of the technology we call Big Data today.

Since I first looked at how he used the term, I liked Mr. Mashey as the originator of Big Data. In the 1990s, Silicon Graphics was the giant of computer graphics, used for special effects in Hollywood and for video surveillance by spy agencies. It was a hot company in the Valley that dealt with new kinds of data, and lots of it.

There are no academic papers to support the attribution to Mr. Mashey. Instead, he gave hundreds of talks to small groups in the middle and late 1990s to explain the concept and, of course, pitch Silicon Graphics products. The case for Mr. Mashey is on the Web sites of technical and professional organizations, like Usenix. There, some of his presentation slides from those talks are posted, including “Big Data and the Next Wave of Infrastress” in 1998.

For me, looking for the origins of Big Data has been a matter of personal curiosity, something to get back to someday and write up on a weekend.

When I called Mr. Mashey recently, he said that Big Data is such a simple term, it’s not much of a claim to fame. His role, if any, he said, was to popularize the term within a portion of the high-tech community in the 1990s. “I was using one label for a range of issues, and I wanted the simplest, shortest phrase to convey that the boundaries of computing keep advancing,” said Mr. Mashey, a consultant to tech companies and a trustee of the Computer History Museum in Mountain View, Calif.

At the University of Pennsylvania, Mr. Diebold kept looking into the subject as well. His follow-up inquiries, he said, proved to be “a journey of increasing humility.” He has written two papers since the first one.

His most recent paper concludes: “The term Big Data, which spans computer science and statistics/econometrics, probably originated in the lunch-table conversations at Silicon Graphics in the mid-1990s, in which John Mashey figured prominently.”

Tracing the origins of Big Data points to the evolution in the field of etymology, according to Mr. Shapiro. The Yale researcher began his word-hunting nearly 35 years ago, as a student at the Harvard Law School, poring through the library stacks. He was an early user of databases of legal documents, news articles and other documents, in computerized archives.

The Web, Mr. Shapiro said, opens up new linguistic terrain. “What you’re seeing is a marriage of structured databases and novel, less structured materials,” he said. “It can be a powerful tool to see far more.”


