Google book tool tracks cultural change with words

December 16, 2010

Dan Charles

(Flickr/Ben Gallagher)
Researchers have built a database of more than 500 billion words, culled from a collection of 5 million books.

Perhaps the biggest collection of words ever assembled has just gone online: 500 billion of them, from 5 million books published over the past four centuries.

The words make up a searchable database that researchers at Harvard say is a new and powerful tool to study cultural change.

The words are a product of Google's book-scanning project. The company has converted approximately 15 million books so far into electronic documents. That's about 15 percent of all books ever published. It includes books published in English, Spanish, French, German, Chinese, Russian and Hebrew.

Many of these books are covered by copyright, and publishers aren't letting people read them online. But the new database gets around that problem: It's just a collection of words and phrases, stripped of all context except the year in which they appeared.

Yet Erez Lieberman Aiden, a mathematician and bioengineer at Harvard and co-creator of this new database, says it opens the door to a whole new style of literary scholarship.

"Instead of saying, 'What insight can I glean if I have one short text in front of me?' -- it's, 'What insight can I glean if I have 500 billion words in front of me; if I have such a large collection of texts that you could never read it in a thousand lifetimes?' "

A 'Fantastically Addictive' Tool

You can, for instance, type in a word or a short phrase, and the database produces a graph -- a curve that traces how often authors used those words in each year since 1800.
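For readers curious about what such a curve involves, here is a minimal sketch in Python. It assumes the data arrive as (phrase, year, count) records, plus a total word count for each year. The function name, the record layout and the toy numbers are all invented for illustration; they are not the researchers' actual code or data.

```python
from collections import defaultdict


def yearly_frequency(records, phrase, totals):
    """Return {year: share of that year's words made up by `phrase`}.

    records: iterable of (phrase, year, count) tuples (hypothetical layout)
    totals:  dict mapping year -> total words counted for that year
    """
    counts = defaultdict(int)
    for text, year, count in records:
        if text == phrase:
            counts[year] += count
    # Dividing by the yearly total keeps years with more books
    # from dominating the curve.
    return {year: counts[year] / totals[year] for year in sorted(counts)}


# Invented toy numbers, for illustration only.
toy_records = [
    ("god", 1800, 30),
    ("god", 1800, 10),
    ("god", 1900, 10),
    ("nature", 1900, 5),
]
toy_totals = {1800: 1000, 1900: 2000}

curve = yearly_frequency(toy_records, "god", toy_totals)
for year, share in curve.items():
    print(year, share)
```

Plotting those yearly shares against the year would give the kind of rising or falling line the tool draws.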

"And you realize that it's fantastically addictive," says Jean-Baptiste Michel, a mathematician and biologist at Harvard who created the new database together with Aiden. "You can just spend hours and hours typing in the names of people you know, places you like, or just random stuff. And so you end up discovering quite a lot of things that way."

The researchers discovered, for instance, that the trajectory of fame -- the curve that shows how often a very famous person is mentioned in books -- has changed over the centuries. Today, fame is more fleeting.

"You become famous earlier in life; so fame knocks on your door earlier than before. And then you rise to fame even faster than before. The flip side of this is that you become forgotten also somewhat faster than before," says Michel.

Specific years -- 1973, for instance -- also seem to fade from the literary record more quickly nowadays. And God got a lot of print in the early 19th century, but not today.

Windows Into Evolving Cultures

Aiden and Michel argue that these graphs are windows into evolving cultures. All those words represent a chunk of our cultural DNA; not a genome, they say, but a "culturome." They've named the website where anybody can search their database culturomics.org. The database has just been unveiled in the journal Science.

Aiden is, however, quick to point out that the collection is limited.

"Books are just one form of cultural exchange," he says. "It's a biased form of cultural exchange. Only certain types of people write books, and only certain types of people manage to get their books published."

But at least books have survived, and it's possible to catalog the words in them, unlike casual conversations or lovers' quarrels.

Some scholars may be horrified by this approach to literature, but Stanford historian Caroline Winterer is not. She says such new tools give historians more comprehensive information about the words that people used in the past to describe their world.

"Before, you had to sit there and, well, you actually had to read the whole text, God forbid! And you'd find two or three examples, and nobody could really check up on it. For better or for worse, it does give you a more accurate sense of some things in the humanities."

But some things require knowledge of a word's context. Take the decline of the word "God," Winterer says. Over the past century or two, some writers started describing the wonders of the natural world as divine. Their books don't always use the word God, "but they are talking about nature, or the environment, or Yosemite, or Yellowstone; these are all codes for God."

Copyright 2010 National Public Radio. To see more, visit http://www.npr.org/.