A good friend linked me to a TED Talk entitled “What we learned from five million books,” and as is often the case with TED Talks, it details an ambitious project. Google have for some time been in the process of digitizing any book they can get their hands on, creating an electronic record of human (written) knowledge and history. What happens if you get a computer to process every word of every book, and analyse it?
One awesome result of this project, and of the ingenuity of the folks at Google when it comes to this kind of cultural analysis or ‘culturomics’, is the Google Ngram Viewer, which lets you plug in your own queries and creates a graph representing word or phrase use in five million books across two centuries.
The speakers in the above video are Jean-Baptiste Michel, who according to the website “looks at how we can use large volumes of data to better understand our world”, and Erez Lieberman Aiden, whose interests are diverse (“spanning genomics, linguistics, mathematics”). Both are excellent and very entertaining speakers, and they present their research in a very accessible way. This is typical of TED Talks – if you’ve never seen them, head on over to ted.com and have fun! This particular project owes Google for taking a somewhat simplified version of what these guys were doing and allowing anyone to access and use it on the web. It’s crowd-sourcing queries for free, and it lets you use ngrams to examine history.
I found it surprisingly difficult to come up with a query at first, perhaps humbled by some of the great examples in the video. Eventually I plugged in “race,ethnicity” and got this result:
That’s race in blue and ethnicity in red, in case you can’t see it easily. The graph was more or less as you’d predict – the word ethnicity comes in recently as an alternative to ‘race’, but at a time when race was being talked about more and more. Interestingly there’s a big increase in the use of both terms in the last decade or so, and race is still much, much more common. Of course, we must remember that the program can’t register ambiguities, so pairing the word ‘race’ with ‘ethnicity’ doesn’t stop it picking out examples of running or horse races! As a more specific, recently created word, ethnicity naturally occurs less often.
Here’s one suggested by a commenter on the TED website:
Both terms have ambiguities, but it’s an interesting reflection on our increasing obsession with romantic love (in the English-speaking world). I’ve come to think that romantic love is a natural companion to contemporary capitalist culture, based as it is on individual choice as the prime motivator.
Another that a commenter posted, which lacks ambiguity and produces a very interesting graph:
When using the Ngram Viewer, it’s important to remember what the program is doing. It’s scanning millions of books, and plotting how often your word or phrase is used in books published at different times. It’s case sensitive and won’t distinguish between homonyms. More importantly, it doesn’t just magically tell you what was common in society at the time – searching for ‘rape’ doesn’t tell you how much rape was committed, but only how much it was written about. With that caveat though, it’s an excellent and fascinating tool for getting a sense of what has been important in society throughout history. Go on and give it a go.
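If you’re curious what that counting might look like, here’s a minimal sketch in Python. To be clear, this is not Google’s code: the toy corpus, the function names and the choice to divide by the total number of ngrams in each year are all my own assumptions, made up purely to illustrate the idea. It does show the two caveats above, though: matching is case sensitive, and a horse race counts just as much as any other kind of ‘race’.

```python
from collections import Counter

# Hypothetical toy corpus of (publication year, text) pairs.
# Invented for illustration; this is not real Ngram Viewer data.
TOY_CORPUS = [
    (1900, "the race was won by the fastest horse"),
    (1950, "Race and class shaped the city"),
    (2000, "race and ethnicity are discussed together and ethnicity is newer"),
]


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(corpus, phrase):
    """Map each year to (matches of phrase) / (all ngrams of the same length).

    Matching is exact and case sensitive, so "Race" and "race" are
    counted separately, and homonyms are not told apart.
    Normalising by the yearly ngram total is an assumption of this sketch.
    """
    n = len(phrase.split())
    matches, totals = Counter(), Counter()
    for year, text in corpus:
        grams = ngrams(text.split(), n)
        totals[year] += len(grams)
        matches[year] += grams.count(phrase)
    return {
        year: matches[year] / totals[year]
        for year in sorted(totals)
        if totals[year]  # skip years too short to contain any ngram
    }


if __name__ == "__main__":
    print(relative_frequency(TOY_CORPUS, "race"))
    # {1900: 0.125, 1950: 0.0, 2000: 0.1}
```

Notice that 1950 comes out as zero for ‘race’ because the only occurrence is capitalised, and that the 1900 hit is about a horse race rather than anything to do with ethnicity.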