Opinion – Marcelo Viana: Zipf’s mysterious law


Around 1935, the American linguist George Zipf observed that when words are listed in descending order of how often they are used, across a variety of contexts, the frequency of the first word on the list is approximately twice that of the second, three times that of the third, and so on.

For example, the three most used words in English are the article “the”, the preposition “of” and the conjunction “and”, with “the” appearing 1.92 times as often as “of” and 2.42 times as often as “and”.
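
Anyone can try this kind of count on a text of their own. The snippet below is a minimal sketch in Python, not the method Zipf used: the function name `rank_frequency_ratios` and the toy sentence are made up for illustration, and the way words are split is a simplifying assumption. It counts the words, ranks them, and compares each ratio with the ideal value 1, 2, 3 and so on.

```python
from collections import Counter
import re


def rank_frequency_ratios(text, top=3):
    """Return (word, count, top_count / count) for the `top` most frequent words."""
    # Simplifying assumption: a word is any run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [(word, count, top_count / count) for word, count in ranked]


# Toy sentence invented for this example; a real test needs a large body of text.
sample = "the cat and the dog and the bird of the hill"

for rank, (word, count, ratio) in enumerate(rank_frequency_ratios(sample), start=1):
    # Under Zipf's law the ratio for the word of rank r is approximately r.
    print(f"{rank}. {word!r}: count={count}, ratio to top={ratio:.2f}, Zipf ideal={rank}")
```

On a text this short the ratios mean little; the pattern only begins to show on long texts, and even then only approximately.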

In fact, this peculiar behavior had already been pointed out earlier by the French stenographer Jean-Baptiste Estoup (1868–1950) and by the German physicist Felix Auerbach. Nor is it exclusive to the English language: it holds for all known languages, including artificial ones such as Esperanto.

Furthermore, it is not restricted to the domain of linguistics: the same type of distribution occurs in lists of data from different sources. One of the most studied situations, already pointed out by Auerbach in 1913, concerns the size of cities.

For example, when we list Brazilian cities in descending order of their populations, we observe that the largest (São Paulo) is 1.92 times larger than the second (Rio de Janeiro) and 2.42 times larger than the third (Brasília).

The first attempt to explain this phenomenon mathematically was due to Zipf himself, and it is a curious one. He assumed that both speaker and listener want to put as little effort into communication as possible, and he used statistical arguments to conclude that this would lead to the type of frequency distribution prescribed by the law. But it is not clear how this idea could be extended to instances of Zipf’s law outside linguistics.

Other possible scientific explanations have been proposed over the years, but the validity of Zipf’s law remains a mystery. In part, this is because, unlike most mathematical statements, this law is only approximately correct: word frequencies in a language, city populations, and other similar data behave in complex ways, which Zipf’s law reflects only roughly.

