all bits considered data to information to knowledge

18 Mar 2012

Linguistic Darwinism, compliments of Big Data

With the shift from the printed to the digitized word, languages can finally be analyzed in ways that were not possible before.

Over 5,000,000 books have been scanned, digitized and plugged into the Internet maelstrom, and unstructured data analysis techniques have evolved to the point where they can yield insights that some might have intuitively anticipated but could never quite prove. It took Big Data and Google's Culturomics project to make the breakthrough, and the results are in: it's a linguistic jungle out there, and the "survival of the fittest" principle governs the life and death of words.

A team of authors examines these principles in the article "Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death", published in the journal Scientific Reports, and Christopher Shea of the WSJ popularizes the results.

It turns out that the English language has over a million words and continues to grow at a rate of ~8,500 words per year (for comparison, the 2002 Webster's Third New International Dictionary has 348,000). The sporting career of a new word lasts about 30 to 50 years, after which the word either disappears into the quicksand of the archives or enters the permanent lexicon. The process was undoubtedly sped up by the advent of the Internet, and the proliferation of spellcheckers could have made it more rigorous.

The pattern is virtually identical across the three analyzed languages (English, Spanish and Hebrew).
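
To make the idea of a word's "birth" and "death" a bit more concrete, here is a minimal Python sketch of how a lifespan could be read off yearly usage counts. It is an illustration only, not the method used in the paper: the function name `word_lifespan`, the `min_count` threshold and the toy numbers are all assumptions made up for this example.

```python
def word_lifespan(yearly_counts, min_count=1):
    """Return (birth_year, last_year, span_in_years) for one word.

    yearly_counts maps a year to the number of times the word appears
    in books from that year. A year counts as "in use" when the count
    reaches min_count (a hypothetical threshold, not from the paper).
    Returns None if the word never reaches the threshold.
    """
    active_years = sorted(
        year for year, count in yearly_counts.items() if count >= min_count
    )
    if not active_years:
        return None
    birth, last = active_years[0], active_years[-1]
    return birth, last, last - birth + 1


# Made-up counts for a hypothetical word, for illustration only.
toy_counts = {1900: 0, 1910: 3, 1920: 12, 1930: 7, 1940: 1, 1950: 0}
print(word_lifespan(toy_counts))  # (1910, 1940, 31)
```

A real analysis would work with relative frequencies (counts divided by the total number of words published that year) rather than raw counts, since the number of digitized books differs a lot from year to year.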