Free Speech Doesn't Exist: Explaining Zipf's Law

What if I told you that free speech doesn't exist? What if I told you that every word you've ever uttered wasn't chosen by free will, but instead adheres to a set mathematical formula? And what if I told you that almost all records of both written and spoken words in the universe adhere to this formula as well? Well, if that catches your interest, then let me introduce you to Zipf's Law.

A couple of weeks ago, I blogged about Benford's Law. To give a very brief summary, it's basically a similar thing but instead says that the first digits of numbers in the universe adhere to a fixed pattern. As I said in that blog, I've been falling down a ton of YouTube rabbit holes about cool shit like this, and it all fascinates me. I'm going to start blogging some of these things for weekend reading, and I saw some people on Twitter and in the comments recommend Zipf's Law, so that's what's next on our docket. Let's enter the classroom.

Zipf's Law is named after George Zipf, a linguist at Harvard who popularized it. Basically, Zipf's law says that how often words get used follows a set formula. The frequency of any word is inversely proportional to its frequency rank (so it scales like 1/rank). To put that in simpler terms, let's say that "the" is the most used word in a book. The second most used word will appear 1/2 as often as "the" does. The third most used word? 1/3 as often as "the" does. The fourth most used word? 1/4 as often as "the" does. The 43rd most used word? 1/43 as often as "the" does. I think you get the point by now.
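If you'd rather see that as code, here's a minimal sketch of the 1/rank rule in Python. The function name zipf_expected_count and the 1,000 count for "the" are made up for illustration, and this is the idealized version of the law, not a claim about any real book.

```python
def zipf_expected_count(top_count: float, rank: int) -> float:
    """Expected count of the word at `rank` under idealized Zipf's law."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    return top_count / rank


# Made-up number: pretend "the" shows up 1,000 times in a book.
top_count = 1000
for rank in (1, 2, 3, 4, 43):
    print(rank, round(zipf_expected_count(top_count, rank), 1))
```

Real text only follows this roughly, so treat the output as a ballpark and not a promise.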

This holds for both written and spoken word and shows up in pretty much every language that's been tested, from English to ancient languages that we can't even translate. Tests have been done on Shakespeare's plays, Wikipedia entries, and, well, basically everything else. And the law is pretty much always close.
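If you want to run this kind of test on your own text, here's a rough sketch. It counts words, ranks them, and puts the actual count next to what the 1/rank rule predicts. The function name rank_frequency, the crude regex tokenizer, and the toy sample sentence are all my own assumptions, and I'm not claiming this is how any of the published studies did it.

```python
import re
from collections import Counter


def rank_frequency(text: str, top_n: int = 10):
    """Return (rank, word, actual_count, zipf_prediction) for the top_n words."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return []
    top_count = counts[0][1]
    # The Zipf prediction for each rank is the top word's count divided by rank.
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(counts, start=1)
    ]


# Toy sample, far too short to prove anything. Swap in a whole book.
sample = "the cat and the dog and the bird saw the cat"
for rank, word, count, predicted in rank_frequency(sample, top_n=3):
    print(f"{rank:>2} {word:<6} actual={count} zipf={predicted:.1f}")
```

To try it for real, read a full book into a string (for example with open("book.txt").read(), where book.txt is whatever file you have) and pass that in instead of the sample.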

Here's how scarily accurate it is. If you tally the frequency of all words used in Wikipedia and Project Gutenberg (a public-domain library of tens of thousands of books), "the" is the most frequently used word at about 181 million uses. The 5,555th most frequent word is "sauce." So Zipf's law says that it should come up roughly 32,600 times (181 million divided by 5,555). And the actual number? 29,594. That's within about 10 percent, for a word more than five thousand spots down the list. Insane.
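For anyone who wants to check that math, here it is as a few lines of Python. The numbers come straight from the example above; the variable names are mine.

```python
the_count = 181_000_000  # "about 181 million" uses of "the"
sauce_rank = 5_555       # "sauce" is the 5,555th most frequent word
sauce_actual = 29_594    # how often "sauce" actually shows up

# Zipf's prediction: the top word's count divided by the rank.
predicted = the_count / sauce_rank
relative_error = (sauce_actual - predicted) / predicted

print(f"predicted: {predicted:,.0f}")
print(f"actual:    {sauce_actual:,}")
print(f"off by:    {relative_error:.1%}")
```

Since the 181 million figure is itself rounded, the prediction is only a ballpark, but the real count still lands within about 10 percent of it.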

Are you hooked yet? Here's a video of a guy explaining all this in a way that's probably way easier to understand. 

Here are some more graphs used in that video proving the validity of Zipf's Law. 

Language isn't the only thing that Zipf's law can be applied to. It also shows up in city populations, website traffic, earthquake magnitudes, last names, cookbook ingredients, the number of phone calls people receive, the frequency of opening chess moves, and more. But there's something about it applying to language that makes it even more freaky. Human language is incredibly creative and free. How can a mathematical formula explain it?

The short answer is we don't know. While Zipf's Law has proven to be accurate in a lot of cases, we're not sure exactly why or what it means. One way to look at it is basically that there are a few words we use A LOT. And there are many words we barely use at all. Here's a list of the 100 most used words in the English language from the video above.

Just those 100 words make up roughly half of everything we say. On the other side of things, studies of classic books show that about half of the different words used in them only appear once. So basically half the time we're using the same words above over and over. And about half of our vocabulary is made up of words that show up once and never repeat. So does any of this matter? Well, some think it offers meaningful insight into human language. Some think it's just a random trend that doesn't really mean anything. Personally, I don't care what it means. I just think it's really fucking cool.