Coffee break: Zipf’s law and the patterns that underlie our lives


A friend alerts me to this PhysOrg article about Zipf’s law, according to which,

… the same patterns emerge in a wide variety of situations. The linguist George Kingsley Zipf first proposed the law in 1949, when he noticed that the distribution of words in a newspaper, book, or other literary article always followed the same pattern.

Zipf counted how many times each word appeared, and found that the probability of the occurrence of words starts high and tapers off. Specifically, the most frequent word occurs about twice as often as the second most frequent word, which occurs about twice as often as the fourth most frequent word, and so on. Mathematically, this means that the frequency of any word is inversely proportional to its rank. When the Zipf curve is plotted on a log-log scale, it appears as a straight line with a slope of -1.
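For readers who want to try this on a text of their own, here is a minimal Python sketch of the counting and the log-log fit described above. The function names `rank_frequency` and `loglog_slope` are my own, hypothetical choices, the word-splitting rule is a crude assumption, and the plain least-squares fit is only one of several ways to estimate the slope; it is not the method of Zipf or of the PhysOrg article.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent."""
    # Assumption: a "word" is a run of letters or apostrophes, case-insensitive.
    words = re.findall(r"[a-z']+", text.lower())
    return [count for _, count in Counter(words).most_common()]


def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Idealized Zipf counts: frequency exactly proportional to 1 / rank.
ideal_counts = [1200 // rank for rank in (1, 2, 3, 4, 5, 6)]
ideal_slope = loglog_slope(ideal_counts)  # very close to -1
```

On a real text, `loglog_slope(rank_frequency(text))` will only approximate -1, and short texts can stray a long way from it.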

Some enterprising researchers tested Zipf’s law on the growth of Linux:

“Linux Debian gave us the opportunity to verify the ‘proportional mechanism,’ thanks to an important dataset and a huge investigation potential,” Maillart said. “All changes (evolution) in open source software are freely available and therefore can be tracked in detail. However, model verification has brought one answer and many resulting questions we intend to give an answer to. We think particularly of mechanisms of success/failure of projects in relation with their management.

“Remember that we still do not clearly understand the reasons of the success of the open source, since it’s free and based on altruist contributions by programmers,” he said. “Additionally, one can bet that further research in this direction (open source and proportional growth) may raise useful questions for other systems (cities, economy, etc.) that would bring new insights to explain their evolution.”

Re the appearance of words in a document, one factor may be that, to discuss a given subject, some words are essential and others useful but not essential. A third group are optional and thus may or may not appear. Here, for example, is a Wordle of this post:

Incidentally, a friend writes to say,

There you have it: altruism raising its ugly head once more. An “evolutionary” software project, probably run by people who wouldn’t question evolutionary assumptions in any other context, relying upon the altruism of its contributors.

Actually, it’s not that they don’t understand the reasons, it’s that they can’t accept the evidence. The evidence is that altruism is normal enough among humans not to need an explanation as an aberration – but not universal and therefore not governed by a law. Call it part of the design, if you like.

Here’s the abstract info:

Phys. Rev. Lett. 73, 3169 – 3172 (1994)
Linguistic Features of Noncoding DNA Sequences
R. N. Mantegna (1), S. V. Buldyrev (1), A. L. Goldberger (2), S. Havlin (1), C. K. Peng (2), M. Simons (2), and H. E. Stanley (1)
(1) Center for Polymer Studies and Department of Physics, Boston University, Boston, Massachusetts 02215
(2) Cardiovascular Division, Harvard Medical School, Beth Israel Hospital, Boston, Massachusetts 02215
We extend the Zipf approach to analyzing linguistic texts to the statistical study of DNA base pair sequences and find that the noncoding regions are more similar to natural languages than the coding regions. We also adapt the Shannon approach to quantifying the “redundancy” of a linguistic text in terms of a measurable entropy function, and demonstrate that noncoding regions in eukaryotes display a smaller entropy and larger redundancy than coding regions, supporting the possibility that noncoding regions of DNA may carry biological information.
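To make the entropy and redundancy language in that abstract a little more concrete, here is a rough Python sketch. It treats overlapping n-letter stretches of a sequence as "words", computes their Shannon entropy, and defines redundancy as 1 - H / H_max. That definition, the function names `ngram_entropy` and `redundancy`, and the toy sequences are all my assumptions for illustration; the paper's exact procedure may differ.

```python
import math
from collections import Counter


def ngram_entropy(sequence, n):
    """Shannon entropy, in bits, of the overlapping n-letter 'words' in sequence."""
    grams = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def redundancy(sequence, n, alphabet_size=4):
    """1 - H / H_max, with H_max = n * log2(alphabet_size) (assumed definition)."""
    h_max = n * math.log2(alphabet_size)
    return 1.0 - ngram_entropy(sequence, n) / h_max


# Toy sequences, not real DNA data.
uniform = "ACGT" * 50      # every base equally common
repetitive = "A" * 200     # a single base repeated
```

With single letters (`n=1`), the evenly mixed toy sequence has no redundancy and the all-A sequence is completely redundant. At `n=2` the "ACGT" repeat shows redundancy near one half, because only four of the sixteen possible pairs ever occur.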

There was further discussion in:

Comment: Richard F. Voss, Comment on “Linguistic Features of Noncoding DNA Sequences”, Phys. Rev. Lett. 76, 1978 (1996)

Comment: Sebastian Bonhoeffer, Andreas V. Herz, Maarten C. Boerlijst, Sean Nee, Martin A. Nowak, and Robert M. May, No Signs of Hidden Language in Noncoding DNA, Phys. Rev. Lett. 76, 1977 (1996)

Comment: N. E. Israeloff, M. Kagalenko, and K. Chan, Can Zipf Distinguish Language From Noise in Noncoding DNA?, Phys. Rev. Lett. 76, 1976 (1996)

Reply: R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, and H. E. Stanley, Mantegna et al. Reply:, Phys. Rev. Lett. 76, 1979 (1996)
