Entropy of the English language

How much information is conveyed by our language?

Entropy of a sequence

The uncertainty in bits of a set of possible outcomes with probabilities $\{p_1, p_2, p_3, \ldots\}$ is $$H=-\sum_i p_i \log_2 p_i$$

This is also the amount of information contained in a report of which outcome actually occurred.

General result: for any set of $n$ symbols, each with probability $p_i$, the maximum uncertainty occurs when all the probabilities are equal: $p_i = 1/n$.
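
As a quick numerical check, here is a minimal Python sketch (the function name entropy_bits is just for illustration): the uniform distribution over 26 symbols gives the largest uncertainty, and any skewed distribution over the same symbols gives less.

```python
import math

def entropy_bits(probs):
    """Shannon uncertainty H = -sum_i p_i log2 p_i, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over 26 symbols: the maximum possible uncertainty.
print(entropy_bits([1/26] * 26))        # log2(26) ~ 4.70 bits

# Any non-uniform distribution over the same 26 symbols has lower entropy.
skewed = [0.5] + [0.5 / 25] * 25
print(entropy_bits(skewed))             # ~3.32 bits < 4.70 bits
```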

Probability of a sequence of symbols

Using $$H=-\sum_i p_i \log_2 p_i,$$ the uncertainty associated with a single uppercase letter of the English alphabet (assuming every letter is equally likely) is $$-\sum_{i=1}^{26} \frac{1}{26}\log_2(1/26) = \log_2 26 \approx 4.7 \text{ bits}.$$

What is the uncertainty of a sequence of $N$ random letters?

The probability of ATYEE is $(1/26)^5$. What is the uncertainty of a sequence of five letters? A report that this particular sequence occurred carries $-\log_2\left[(1/26)^5\right] = 5\log_2 26 \approx 23.5$ bits, i.e. five times the single-letter uncertainty.

So $H_{\text{seq }N}$, the uncertainty of a sequence of $N$ symbols, is apparently $N\,H_{\text{sym}}$.
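
A quick numerical check of this additivity in Python (the variable names are illustrative):

```python
import math

H_sym = math.log2(26)        # ~4.70 bits of uncertainty per random letter
N = 5
p_seq = (1 / 26) ** N        # probability of one particular 5-letter string, e.g. ATYEE
H_seq = -math.log2(p_seq)    # information in a report of that exact sequence
print(H_seq, N * H_sym)      # both ~23.5 bits
```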

Most efficient encoding of a message

Perhaps you've heard the one about the group of [insert name of group here] who see each other all the time, so they start numbering their jokes...

Consider a message composed of 4 letters, A, T, C, G, where the probabilities of each occurring are $p_A=0.5$, $p_T=0.25$, $p_C=0.125$, and $p_G=0.125$. The entropy per character is then $$-\left[0.5\log_2 0.5 + 0.25\log_2 0.25 + 2\cdot 0.125\log_2 0.125\right] = 1.75 = 7/4 \text{ bits}.$$

An 8-character message might be:

AAAATTCG

And it should have an uncertainty of $(7/4)\times 8 = 14$ bits.
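
A small Python check of that arithmetic (the probability dictionary is just the distribution given above):

```python
import math

probs = {"A": 0.5, "T": 0.25, "C": 0.125, "G": 0.125}
H = -sum(p * math.log2(p) for p in probs.values())
print(H)        # 1.75 bits per character
print(8 * H)    # 14.0 bits for an 8-character message
```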

First attempt: A=00, T=01, C=10, G=11:

AAAATTCG = 0000000001011011 (16 binary digits)

Using the Fano encoding scheme A=0, T=10, C=110, G=111, we can write this as

AAAATTCG=00001010110111 (14 bits!)

Define some new symbols, a=00, t=01, c=10, g=11, which will be written out as binary and then expanded using the Fano scheme...

00001010110111 = aaccgtg

Voilà: a more efficient encoding scheme!
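
A minimal Python sketch of this variable-length (prefix) code, just to confirm the 14-bit length, the unambiguous decoding, and the regrouping into aaccgtg; the dictionaries and function names are illustrative:

```python
# Prefix code from the text: no codeword is a prefix of another,
# so the bit stream can be decoded without separators.
CODE = {"A": "0", "T": "10", "C": "110", "G": "111"}

def encode(msg):
    return "".join(CODE[ch] for ch in msg)

def decode(bits):
    inverse = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:          # a complete codeword has been read
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

bits = encode("AAAATTCG")
print(bits, len(bits))              # 00001010110111, 14 bits
print(decode(bits))                 # AAAATTCG

# Grouping the 14 bits in pairs (00=a, 01=t, 10=c, 11=g) gives the
# 7-symbol string "aaccgtg" from the text.
pairs = {"00": "a", "01": "t", "10": "c", "11": "g"}
print("".join(pairs[bits[i:i + 2]] for i in range(0, len(bits), 2)))
```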

We say the original encoding scheme had some redundancy in it.

You may have seen some variation of this proposal to reform the spelling of the English language, which illustrates how you might get the same message across with fewer letters and shorter words.

Under what circumstances is redundancy a good thing?

Redundancy / Entropy of English

Our initial estimate of the entropy of English per character, counting 26 letters plus the space, was $$H=\log_2 27 \approx 4.76$$

Youcanusuallyfigureoutthemeaningofastringoflettersinenglishsoweknowthespaceaddstotheredundancyofwrittenenglish. So, we can already reduce the entropy a little, to $$H=\log_2 26 = 4.70$$

Not all letters are equally probable. So, using actual probabilities $p_E=0.105$, $p_T=0.072$, $p_O=0.065$, ..., $p_{J,Q,Z}=0.001$, we get $$H=4.03$$

Shannon had the idea to calculate the probabilities of 2-letter sequences, then 3-letter sequences, then 4-letter sequences, and to use the limit of this series as an approximation to the uncertainty of written English. Writing $F_n=H_n/n$, where $H_n$ is the entropy of an $n$-letter block, he found $$F_0=4.76,\ F_1=4.03,\ F_2=3.32,\ F_3=3.1,\ F_4=2.14.$$
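
A sketch of this block-entropy estimate in Python, assuming a large sample of English text is available; the filename corpus.txt, the function name, and the choice to keep only the 26 letters plus the space are placeholder assumptions here, not Shannon's actual procedure:

```python
import math
import re
from collections import Counter

def block_entropy_per_letter(text, n):
    """F_n = H_n / n, where H_n is the entropy of n-character blocks."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    H_n = -sum((c / total) * math.log2(c / total) for c in grams.values())
    return H_n / n

# Placeholder corpus: any large sample of English text will do.
raw = open("corpus.txt").read().upper()
text = re.sub(r"[^A-Z ]", "", raw)      # keep 26 letters plus the space

for n in (1, 2, 3):
    print(n, block_entropy_per_letter(text, n))   # decreases as n grows
```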

Some of this is based on: Entropy and Redundancy in English.