An introduction to information theory and entropy

Tom Carter

http://cogs.csustan.edu/~tom/SFI-CSSS
Complex Systems Summer School

June, 2002

Our general topics:

Measuring complexity
Some probability background
Basics of information theory
Some entropy theory
The Gibbs inequality
A simple physical example (gases)
Shannon's communication theory
Application to Biology (analyzing genomes)
Application to Physics (lasers)
Some other measures
Some additional material
Examples using Bayes' Theorem
Analog channels
References

The quotes

Science, wisdom, and counting
Being different - or random
Surprise, information, and miracles
Information (and hope)
H (or S) for Entropy
Thermodynamics
Language, and putting things together
Tools

Science, wisdom, and counting

``Science is organized knowledge. Wisdom is organized life.''
- Immanuel Kant
``My own suspicion is that the universe is not only stranger than we suppose, but stranger than we can suppose.''
- John Haldane
``Not everything that can be counted counts, and not everything that counts can be counted.''
- Albert Einstein (1879-1955)
``The laws of probability, so true in general, so fallacious in particular.''
- Edward Gibbon

Measuring complexity

Workers in the field of complexity face a classic problem: how can we tell that the system we are looking at is actually a complex system? (i.e., should we even be studying this system? :-)
Of course, in practice, we will study the systems that interest us, for whatever reasons, so the problem identified above tends not to be a real problem. On the other hand, having chosen a system to study, we might well ask ``How complex is this system?''
In this more general context, we probably want at least to be able to compare two systems, and be able to say that system A is more complex than system B. Eventually, we probably would like to have some sort of numerical rating scale.
Various approaches to this task have been proposed, among them:
1. Human observation and (subjective) rating
2. Number of parts or distinct elements (what counts as a distinct part?)
3. Dimension (measured how?)
4. Number of parameters controlling the system
5. Minimal description (in which language?)
6. Information content (how do we define/measure information?)
7. Minimal generator/constructor (what machines/methods can we use?)
8. Minimum energy/time to construct (how would evolution count?)
Most (if not all) of these measures will actually be measures associated with a model of a phenomenon. Two observers (of the same phenomenon?) may develop or use very different models, and thus disagree in their assessments of the complexity. For example, in a very simple case, counting the number of parts is likely to depend on the scale at which the phenomenon is viewed (counting atoms is different from counting molecules, cells, organs, etc.).
We shouldn't expect to be able to come up with a single universal measure of complexity. The best we are likely to have is a measuring system usable by a particular observer, in a particular context, for a particular purpose.
My first focus will be on measures related to how surprising or unexpected an observation or event is. This approach has been described as information theory.

Being different - or random

``The man who follows the crowd will usually get no further than the crowd. The man who walks alone is likely to find himself in places no one has ever been before. Creativity in living is not without its attendant difficulties, for peculiarity breeds contempt. And the unfortunate thing about being ahead of your time is that when people finally realize you were right, they'll say it was obvious all along. You have two choices in life: You can dissolve into the mainstream, or you can be distinct. To be distinct is to be different. To be different, you must strive to be what no one else but you can be. ''
-Alan Ashley-Pitt
``Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.''
- John von Neumann (1903-1957)

Some probability background

At various times in what follows, I may float between two notions of the probability of an event happening. The two general notions are:
1. A frequentist version of probability:
  In this version, we assume we have a set of possible events, each of which we assume occurs some number of times. Thus, if there are N distinct possible events (x₁, x₂, …, x_N), no two of which can occur simultaneously, and the events occur with frequencies (n₁, n₂, …, n_N), we say that the probability of event x_i is given by
  
  \[ P(x_i) = \frac{n_i}{\sum_{j=1}^{N} n_j} \]
  
  This definition has the nice property that
  
  \[ \sum_{i=1}^{N} P(x_i) = 1 \]
2. An observer relative version of probability:
  In this version, we take a statement of probability to be an assertion about the belief that a specific observer has of the occurrence of a specific event.
  Note that in this version of probability, it is possible that two different observers may assign different probabilities to the same event.
  Furthermore, the probability of an event, for me, is likely to change as I learn more about the event, or the context of the event.
3. In some (possibly many) cases, we may be able to find a reasonable correspondence between these two views of probability. In particular, we may sometimes be able to understand the observer relative version of the probability of an event to be an approximation to the frequentist version, and to view new knowledge as providing us a better estimate of the relative frequencies.

I won't go through much, but here are some probability basics, where a and b are events:
P(not a) = 1 - P(a).
P(a or b) = P(a) + P(b) - P(a and b).
We will often denote P(a and b) by P(a, b). If P(a, b) = 0, we say a and b are mutually exclusive.
Conditional probability:

P(a |b) is the probability of a, given that we know b. The joint probability of both a and b is given by:

P(a, b) = P(a |b) P(b).

Since P(a, b) = P(b, a), we have Bayes' Theorem:

P(a |b)P(b) = P(b |a) P(a),

or

\[ P(a|b) = \frac{P(b|a)\, P(a)}{P(b)}. \]

If two events a and b are such that

P(a |b) = P(a),

we say that the events a and b are independent. Note that from Bayes' Theorem, we will also have that

P(b |a) = P(b),

and furthermore,

P(a, b) = P(a |b)P(b) = P(a)P(b).

This last equation is often taken as the definition of independence.
We have in essence begun here the development of a mathematized methodology for drawing inferences about the world from uncertain knowledge. We could say, for example, that observing a flipped coin showing heads gives us information about the world. We will develop a formal mathematical definition of the information content of an event which occurs with a certain probability.
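
As a small illustration of Bayes' Theorem, here is a minimal Python sketch. The function name `posterior` and the numbers (a 1% prior that a coin is two-headed) are assumptions made up for this example. It expands P(b) as P(b|a)P(a) + P(b|not a)P(not a), which follows from the rules above.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(a|b) from P(a), P(b|a), and P(b|not a) using Bayes' Theorem."""
    # Total probability of b: P(b) = P(b|a)P(a) + P(b|not a)P(not a)
    p_b = likelihood_b_given_a * prior_a + likelihood_b_given_not_a * (1.0 - prior_a)
    return likelihood_b_given_a * prior_a / p_b

# Hypothetical numbers: a = "the coin is two-headed", b = "a flip shows heads".
p = posterior(prior_a=0.01, likelihood_b_given_a=1.0, likelihood_b_given_not_a=0.5)
print(round(p, 4))
```

Seeing one head raises the (assumed) 1% prior only slightly, which matches the observer relative view that probabilities change as we learn more.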

Surprise, information, and miracles

``The opposite of a correct statement is a false statement. The opposite of a profound truth may well be another profound truth.''
- Niels Bohr (1885-1962)
``I heard someone tried the monkeys-on-typewriters bit trying for the plays of W. Shakespeare, but all they got was the collected works of Francis Bacon.''
- Bill Hirst
``There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.''
- Albert Einstein (1879-1955)

Basics of information theory

We would like to develop a usable measure of the information we get from observing the occurrence of an event having probability p . Our first reduction will be to ignore any particular features of the event, and only observe whether or not it happened. In essence this means that we can think of the event as the observance of a symbol whose probability of occurring is p.
We will thus be defining the information in terms of the probability p.
We will want our information measure I(p) to have several properties:
1. Information is a non-negative quantity: I(p) ≥ 0.
2. If an event has probability 1, we get no information from the occurrence of the event: I(1) = 0.
3. If two independent events occur (whose joint probability is the product of their individual probabilities), then the information we get from observing the events is the sum of the two informations: I(p₁*p₂) = I(p₁) + I(p₂).
4. We will want our information measure to be a continuous (and, in fact, monotonic) function of the probability (slight changes in probability should result in slight changes in information).
We can therefore derive the following:
1. I(p²) = I(p*p) = I(p) + I(p) = 2*I(p)
2. Thus, further, I(pⁿ) = n*I(p)
  (by induction ...)
3. I(p) = I((p^(1/m))^m) = m*I(p^(1/m)), so I(p^(1/m)) = (1/m)*I(p), and thus in general
  
  \[ I(p^{n/m}) = \frac{n}{m}\, I(p) \]
4. And thus, by continuity, we get, for 0 < p ≤ 1, and a > 0 a real number:
  
  \[ I(p^a) = a\, I(p) \]
From this, we can derive the nice property:

I(p) = -log_b(p) = log_b(1/p)

for some base b.
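
One way to fill in this step is sketched here. It assumes that I is not identically zero, so there is some p₀ with 0 < p₀ < 1 and I(p₀) > 0. The symbols p₀ and c are introduced only for this sketch. Any p with 0 < p < 1 can be written as a power of p₀, so:

\[
\begin{aligned}
p &= p_0^{a}, \qquad a = \frac{\ln p}{\ln p_0} > 0, \\
I(p) &= I(p_0^{a}) = a\, I(p_0) = -c \ln p, \qquad c = \frac{I(p_0)}{-\ln p_0} > 0, \\
I(p) &= -\log_b(p) \quad \text{with } b = e^{1/c} > 1 .
\end{aligned}
\]

The case p = 1 is covered by I(1) = 0 = -log_b(1).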

Summarizing: from the four properties,
1. I(p) ≥ 0
2. I(p₁*p₂) = I(p₁) + I(p₂)
3. I(p) is monotonic and continuous in p
4. I(1) = 0
we can derive that

I(p) = log_b(1/p) = - log_b(p),

for some positive constant b. The base b determines the units we are using.
We can change the units by changing the base, using the formulas, for b₁, b₂, x > 0,

x = b₁^log_b₁(x)

and therefore

log_b₂(x) = log_b₂(b₁^log_b₁(x)) = (log_b₂(b₁))(log_b₁(x)).

Thus, using different bases for the logarithm results in information measures which are just constant multiples of each other, corresponding with measurements in different units:
1. log₂ units are bits (from 'binary')
2. log₃ units are trits (from 'trinary')
3. log_e units are nats (from 'natural logarithm') (We'll use ln(x) for log_e(x))
4. log₁₀ units are Hartleys, after an early worker in the field.
Unless we want to emphasize the units, we need not bother to specify the base for the logarithm, and will write log(p). Typically, we will think in terms of log₂(p).
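
The change of units can be checked numerically. The following is a minimal Python sketch. The function name `information` and the sample probability 1/8 are choices made for this illustration only.

```python
import math

def information(p, base=2.0):
    """Information I(p) = log_b(1/p) of an event with probability p, 0 < p <= 1."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0 < p <= 1")
    return math.log(1.0 / p, base)

p = 0.125
bits = information(p, 2)
nats = information(p, math.e)
hartleys = information(p, 10)

# Changing units is multiplication by a constant: log_e(x) = log_e(2) * log_2(x).
print(bits, nats, hartleys)
print(abs(nats - math.log(2) * bits) < 1e-9)
```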

For example, flipping a fair coin once will give us events h and t each with probability 1/2, and thus a single flip of a coin gives us -log₂(1/2) = 1 bit of information (whether it comes up h or t).
Flipping a fair coin n times (or, equivalently, flipping n fair coins) gives us -log₂((1/2)ⁿ) = log₂(2ⁿ) = n*log₂(2) = n bits of information.
We could enumerate a sequence of 25 flips as, for example:

hthhtththhhthttththhhthtt

or, using 1 for h and 0 for t, the 25 bits

1011001011101000101110100.

We thus get the nice fact that n flips of a fair coin gives us n bits of information, and takes n binary digits to specify. That these two are the same reassures us that we have done a good job in our definition of our information measure ...
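
The coin-flip example can be reproduced with a few lines of Python. This is only a sketch, and the variable names are assumptions.

```python
import math

flips = "hthhtththhhthttththhhthtt"  # the 25 flips listed above

# Encode h as 1 and t as 0.
bits = "".join("1" if f == "h" else "0" for f in flips)

# Each particular sequence of n fair flips has probability (1/2)**n,
# so observing it gives -log2((1/2)**n) = n bits.
n = len(flips)
info_bits = -math.log2(0.5 ** n)

print(bits)
print(n, info_bits)
```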

Information (and hope)

``In Cyberspace, the First Amendment is a local ordinance.''
- John Perry Barlow
``Groundless hope, like unconditional love, is the only kind worth having.''
- John Perry Barlow
``The most interesting facts are those which can be used several times, those which have a chance of recurring. ... Which, then, are the facts that have a chance of recurring? In the first place, simple facts.''
H. Poincare, 1908

Some entropy theory

Suppose now that we have n symbols {a₁, a₂, …, a_n}, and some source is providing us with a stream of these symbols. Suppose further that the source emits the symbols with probabilities {p₁, p₂, …, p_n}, respectively. For now, we also assume that the symbols are emitted independently (successive symbols do not depend in any way on past symbols).
What is the average amount of information we get from each symbol we see in the stream?

What we really want here is a weighted average. If we observe the symbol a_i, we will be getting log(1/p_i) information from that particular observation. In a long run (say N) of observations, we will see (approximately) N*p_i occurrences of symbol a_i (in the frequentist sense, that's what it means to say that the probability of seeing a_i is p_i). Thus, in the N (independent) observations, we will get total information I of

\[ I = \sum_{i=1}^{n} (N p_i) \log(1/p_i). \]

But then, the average information we get per symbol observed will be

\[ I/N = \frac{1}{N} \sum_{i=1}^{n} (N p_i) \log(1/p_i) = \sum_{i=1}^{n} p_i \log(1/p_i). \]

Note that lim_{x→0} x*log(1/x) = 0, so we can, for our purposes, define p_i*log(1/p_i) to be 0 when p_i = 0.

This brings us to a fundamental definition. This definition is essentially due to Shannon in 1948, in the seminal papers in the field of information theory.
As we have observed, we have defined information strictly in terms of the probabilities of events. Therefore, let us suppose that we have a set of probabilities (a probability distribution) P = {p₁, p₂, …, p_n}. We define the entropy of the distribution P by:

\[ H(P) = \sum_{i=1}^{n} p_i \log(1/p_i). \]

I'll mention here the obvious generalization, if we have a continuous rather than discrete probability distribution P(x):

\[ H(P) = \int P(x) \log(1/P(x))\, dx. \]

Another worthwhile way to think about this is in terms of expected value. Given a discrete probability distribution P = {p₁, p₂, …, p_n}, with p_i ≥ 0 and Σ_{i=1}^{n} p_i = 1, or a continuous distribution P(x) with P(x) ≥ 0 and ∫P(x)dx = 1, we can define the expected value of an associated discrete set F = {f₁, f₂, …, f_n} or function F(x) by:

\[ \langle F \rangle = \sum_{i=1}^{n} f_i\, p_i \]

or

\[ \langle F(x) \rangle = \int F(x)\, P(x)\, dx. \]

With these definitions, we have that:

\[ H(P) = \langle I(p) \rangle. \]

In other words, the entropy of a probability distribution is just the expected value of the information of the distribution.
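
The definition translates directly into code. Here is a minimal Python sketch. The function name `entropy` and the sample distributions are assumptions for illustration, and terms with p_i = 0 are skipped, following the convention noted above.

```python
import math

def entropy(probs, base=2.0):
    """H(P) = sum_i p_i * log(1/p_i), with the convention 0 * log(1/0) = 0."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must be non-negative and sum to 1")
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin
print(entropy([0.9, 0.1]))   # biased coin: less than 1 bit on average
print(entropy([1.0, 0.0]))   # certain outcome: no information
```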

Several questions probably come to mind at this point:

What properties does the function H(P) have? For example, does it have a maximum, and if so where?
Is entropy a reasonable name for this? In particular, the name entropy is already in use in thermodynamics. How are these uses of the term related to each other?
What can we do with this new tool?
Let me start with an easy one. Why use the letter H for entropy? What follows is a slight variation of a footnote, p. 105, in the book Spikes by Rieke, et al. :-)

H (or S) for Entropy

``The enthalpy is [often] written U. V is the volume, and Z is the partition function. P and Q are the position and momentum of a particle. R is the gas constant, and of course T is temperature. W is the number of ways of configuring our system (the number of states), and we have to keep X and Y in case we need more variables. Going back to the first half of the alphabet, A, F, and G are all different kinds of free energies (the last named for Gibbs). B is a virial coefficient or a magnetic field. I will be used as a symbol for information; J and L are angular momenta. K is Kelvin, which is the proper unit of T. M is magnetization, and N is a number, possibly Avogadro's, and O is too easily confused with 0. This leaves S . . .'' and H. In Spikes they also eliminate H (e.g., as the Hamiltonian). I, on the other hand, along with Shannon and others, prefer to honor Hartley. Thus, H for entropy . . .

The Gibbs inequality

First, note that the function ln(x) has derivative 1/x. From this, we find that the tangent to ln(x) at x = 1 is the line y = x - 1. Further, since ln(x) is concave down, we have, for x > 0, that

\[ \ln(x) \le x - 1, \]

with equality only when x = 1.
Now, given two probability distributions, P = {p₁, p₂, …, p_n} and Q = {q₁, q₂, …, q_n}, where p_i, q_i ≥ 0 and Σ_i p_i = Σ_i q_i = 1, we have

\[ \sum_{i=1}^{n} p_i \ln\left(\frac{q_i}{p_i}\right) \le \sum_{i=1}^{n} p_i \left(\frac{q_i}{p_i} - 1\right) = \sum_{i=1}^{n} (q_i - p_i) = \sum_{i=1}^{n} q_i - \sum_{i=1}^{n} p_i = 1 - 1 = 0, \]

with equality only when p_i = q_i for all i. It is easy to see that the inequality actually holds for any base, not just e.
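
A numerical spot check of the Gibbs inequality is sketched below in Python. The helper names `gibbs_sum` and `random_distribution`, the seed, and the choice of 5 outcomes are all assumptions for this illustration. A spot check of this kind is of course not a proof.

```python
import math
import random

def gibbs_sum(p, q):
    """Return sum_i p_i * ln(q_i / p_i), skipping terms with p_i = 0."""
    return sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)

def random_distribution(n, rng):
    """Hypothetical helper: n positive weights normalized to sum to 1."""
    w = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]

rng = random.Random(0)
worst = max(
    gibbs_sum(random_distribution(5, rng), random_distribution(5, rng))
    for _ in range(1000)
)
print(worst <= 0.0)        # the sum never exceeds 0

p = random_distribution(5, rng)
print(gibbs_sum(p, p))     # equality case P = Q
```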

We can use the Gibbs inequality to find the probability distribution which maximizes the entropy function. Suppose P = {p₁, p₂, …, p_n} is a probability distribution. We have

\[
\begin{aligned}
H(P) - \log(n) &= \sum_{i=1}^{n} p_i \log(1/p_i) - \log(n) \\
&= \sum_{i=1}^{n} p_i \log(1/p_i) - \log(n) \sum_{i=1}^{n} p_i \\
&= \sum_{i=1}^{n} p_i \log(1/p_i) - \sum_{i=1}^{n} p_i \log(n) \\
&= \sum_{i=1}^{n} p_i \left( \log(1/p_i) - \log(n) \right) \\
&= \sum_{i=1}^{n} p_i \left( \log(1/p_i) + \log(1/n) \right) \\
&= \sum_{i=1}^{n} p_i \log\left( \frac{1/n}{p_i} \right) \\
&\le 0,
\end{aligned}
\]

with equality only when p_i = 1/n for all i. The last step is the application of the Gibbs inequality.

What this means is that

\[ 0 \le H(P) \le \log(n). \]

We have H(P) = 0 when exactly one of the p_i's is one and all the rest are zero. We have H(P) = log(n) only when all of the events have the same probability 1/n.
That is, the maximum of the entropy function is the log() of the number of possible events, and occurs when all the events are equally likely.
An example illustrating this result: How much information can a student get from a single grade? First, the maximum information occurs if all grades have equal probability (e.g., in a pass/fail class, on average half should pass if we want to maximize the information given by the grade).
The maximum information the student gets from a grade will be:
Pass/Fail : 1 bit.
A, B, C, D, F : 2.3 bits.
A, A-, B+, . . ., D-, F : 3.6 bits.
Thus, using +/- grading gives the students about 1.3 more bits of information per grade than without +/-, and about 2.6 bits per grade more than pass/fail.
If a source provides us with a sequence chosen from 4 symbols (say A, C, G, T), then the maximum average information per symbol is 2 bits. If the source provides blocks of 3 of these symbols, then the maximum average information is 6 bits per block (or, to use different units, 4.159 nats per block).
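
The numbers in these examples can be reproduced with a short Python sketch. The dictionary keys are labels made up here, and the count of 12 grades for the +/- scheme is an assumption about what the ellipsis covers (A, A-, B+, B, B-, C+, C, C-, D+, D, D-, F).

```python
import math

# Maximum information per grade is log2(number of equally likely grades).
schemes = {
    "pass/fail": 2,
    "A-F": 5,             # A, B, C, D, F
    "A-F with +/-": 12,   # assumed: A, A-, B+, ..., D-, F
}
max_bits = {name: math.log2(n) for name, n in schemes.items()}
for name, b in max_bits.items():
    print(f"{name}: {b:.1f} bits")

# Blocks of 3 symbols from a 4-letter alphabet: 3 * log2(4) bits,
# converted to nats by multiplying by ln(2).
block_bits = 3 * math.log2(4)
block_nats = block_bits * math.log(2)
print(block_bits, round(block_nats, 3))
```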

We ought to note several things.

First, these definitions of information and entropy may not match with some other uses of the terms.
For example, if we know that a source will, with equal probability, transmit either the complete text of Hamlet or the complete text of Macbeth (and nothing else), then receiving the complete text of Hamlet provides us with precisely 1 bit of information.
Suppose a book contains ASCII characters. If the book is to provide us with information at the maximum rate, then each ASCII character will occur with equal probability - it will be a random sequence of characters.

Second, it is important to recognize that our definitions of information and entropy depend only on the probability distribution. In general, it won't make sense for us to talk about the information or the entropy of a source without specifying the probability distribution.
Beyond that, it can certainly happen that two different observers of the same data stream have different models of the source, and thus associate different probability distributions to the source. The two observers will then assign different values to the information and entropy associated with the source.
This observation (almost :-) accords with our intuition: two people listening to the same lecture can get very different information from the lecture. For example, without appropriate background, one person might not understand anything at all, and therefore have as probability model a completely random source, and therefore get much more information than the listener who understands quite a bit, and can therefore anticipate much of what goes on, and therefore assigns non-equal probabilities to successive words . . .

Thermodynamics

``A theory is the more impressive the greater the simplicity of its premises is, the more different kinds of things it relates, and the more extended its area of applicability. Therefore the deep impression which classical thermodynamics made upon me. It is the only physical theory of universal content which I am convinced that, within the framework of the applicability of its basic concepts, it will never be overthrown (for the special attention of those who are skeptics on principle).''
- A. Einstein, 1946
``Thermodynamics would hardly exist as a profitable discipline if it were not that the natural limit to the size of so many types of instruments which we now make in the laboratory falls in the region in which the measurements are still smooth.''
- P. W. Bridgman, 1941

A simple physical example (gases)

Let us work briefly with a simple model for an idealized gas. Let us assume that the gas is made up of N point particles, and that at some time t₀ all the particles are contained within a (cubical) volume V. Assume that through some mechanism, we can determine the location of each particle sufficiently well as to be able to locate it within a box with sides 1/100 of the sides of the containing volume V. There are 10⁶ of these small boxes within V.
We can now develop a (frequentist) probability model for this system. For each of the 10⁶ small boxes, we can assign a probability p_i of finding a gas particle in the small box by counting the number of particles n_i in the box, and dividing by N. That is, p_i = n_i/N. From this probability distribution, we can calculate an entropy:

\[ H(P) = \sum_{i=1}^{10^6} p_i \log(1/p_i) = \sum_{i=1}^{10^6} \frac{n_i}{N} \log(N/n_i) \]

If the particles are evenly distributed among the 10⁶ boxes, then we will have that each n_i = N/10⁶, and in this case the entropy will be:

\[ H(\text{evenly}) = \sum_{i=1}^{10^6} \frac{N/10^6}{N} \log\left(\frac{N}{N/10^6}\right) = \sum_{i=1}^{10^6} \frac{1}{10^6} \log(10^6) = \log(10^6). \]
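
The box-counting entropy can be sketched in Python as follows. To keep the example quick, it uses 1000 boxes as a stand-in for the 10⁶ boxes, and 7 particles per box. Both numbers, and the function name `occupancy_entropy`, are assumptions for illustration.

```python
import math

def occupancy_entropy(counts, base=2.0):
    """Entropy of the distribution p_i = n_i / N given box counts n_i."""
    total = sum(counts)
    return sum((n / total) * math.log(total / n, base) for n in counts if n > 0)

boxes = 1000     # stand-in for the 10**6 boxes in the text
per_box = 7      # hypothetical number of particles per box

even = [per_box] * boxes                          # evenly distributed
one_box = [per_box * boxes] + [0] * (boxes - 1)   # all particles in one box

print(occupancy_entropy(even), math.log2(boxes))
print(occupancy_entropy(one_box))
```

The evenly distributed case gives log(number of boxes), whatever the number of particles per box.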

There are several ways to think about this example.

First, notice that the calculated entropy of the system depends in a strong way on the relative scale of measurement. For example, if the particles are evenly distributed, and we increase our accuracy of measurement by a factor of 10 (i.e., if each small box is 1/1000 of the side of V), then the calculated maximum entropy will be log(10⁹) instead of log(10⁶).
For physical systems, we know that quantum limits (e.g., Heisenberg uncertainty relations) will give us a bound on the accuracy of our measurements, and thus a more or less natural scale for doing entropy calculations. On the other hand, for macroscopic systems, we are likely to find that we can only make relative rather than absolute entropy calculations.
Second, we have simplified our model of the gas particles to the extent that they have only one property, their position. If we want to talk about the state of a particle, all we can do is specify the small box the particle is in at time t₀. There are thus Q = 10⁶ possible states for a particle, and the maximum entropy for the system is log(Q). This may look familiar for equilibrium statistical mechanics ...

Third, suppose we generalize our model slightly, and allow the particles to move about within V. A configuration of the system is then simply a list of 10⁶ numbers b_i with 0 ≤ b_i ≤ N (i.e., a list of the numbers of particles in each of the boxes). Suppose that the motions of the particles are such that for each particle, there is an equal probability that it will move into any given new small box during one (macroscopic) time step. How likely is it that at some later time we will find the system in a ``high'' entropy configuration? How likely is it that if we start the system in a ``low'' entropy configuration, it will stay in a ``low'' entropy configuration for an appreciable length of time? If the system is not currently in a ``maximum'' entropy configuration, how likely is it that the entropy will increase in succeeding time steps (rather than stay the same or decrease)?

Let's do a few computations using combinations:

\[ \binom{n}{m} = \frac{n!}{m!\,(n - m)!} \]

and Stirling's approximation:

\[ n! \approx \sqrt{2\pi}\; n^n e^{-n} \sqrt{n}. \]

Let us start here:

There are 10⁶ configurations with all the particles sitting in exactly one small box, and the entropy of each of those configurations is:

\[ H(\text{all in one}) = \sum_{i=1}^{10^6} p_i \log(1/p_i) = 0, \]

since exactly one p_i is 1 and the rest are 0. These are obviously minimum entropy configurations.

Now consider pairs of small boxes. The number of configurations with all the particles evenly distributed between two boxes is:

\[ \binom{10^6}{2} = \frac{(10^6)!}{2!\,(10^6 - 2)!} = \frac{10^6 (10^6 - 1)}{2} \approx 5 \times 10^{11}, \]

which is a (comparatively :-) large number. The entropy of each of these configurations is:

H(two boxes) = 1/2 * log(2) + 1/2 * log(2) = log(2).

We thus know that there are at least 5 * 10¹¹ + 10⁶ configurations. If we start the system in a configuration with entropy 0, then the probability that at some later time it will be in a configuration with entropy ≥ log(2) will be

\[ \ge \frac{5 \times 10^{11}}{5 \times 10^{11} + 10^6} = 1 - \frac{10^6}{5 \times 10^{11} + 10^6} > 1 - 10^{-5}. \]

As an example at the other end, consider the number of configurations with the particles distributed almost equally, except that half the boxes are short by one particle, and the rest have an extra. The number of such configurations is:

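\[
\begin{aligned}
\binom{10^6}{10^6/2} &= \frac{(10^6)!}{(10^6/2)!\,(10^6 - 10^6/2)!} = \frac{(10^6)!}{\left((10^6/2)!\right)^2} \\
&\approx \frac{\sqrt{2\pi}\,(10^6)^{10^6} e^{-10^6} \sqrt{10^6}}{\left( \sqrt{2\pi}\,(10^6/2)^{10^6/2} e^{-(10^6/2)} \sqrt{10^6/2} \right)^2} \\
&= \frac{\sqrt{2\pi}\,(10^6)^{10^6} e^{-10^6} \sqrt{10^6}}{2\pi\,(10^6/2)^{10^6} e^{-10^6}\, 10^6/2} \\
&= \frac{2^{10^6 + 1}}{\sqrt{2\pi \cdot 10^6}} \\
&\approx 2^{10^6} = (2^{10})^{10^5} \approx 10^{3 \times 10^5}.
\end{aligned}
\]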

Each of these configurations has entropy essentially equal to log(10⁶).

From this, we can conclude that if we start the system in a configuration with entropy 0 (i.e., all particles in one box), the probability that later it will be in a higher entropy configuration will be > (1 - 10^(-3*10⁵)).
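
The two counts used in this argument can be checked numerically. The Python sketch below uses exact integer arithmetic for the pairs of boxes and logarithms of factorials (via `math.lgamma`) for the half-and-half count, since that number is far too large to print. The helper name `log10_binomial` is an assumption.

```python
import math

Q = 10**6  # number of small boxes

# Number of ways to choose 2 boxes out of Q (exact integer arithmetic).
pairs = math.comb(Q, 2)

def log10_binomial(n, k):
    """log10 of C(n, k), using lgamma(n + 1) = ln(n!) to avoid huge integers."""
    ln_c = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return ln_c / math.log(10)

print(pairs)                       # close to 5 * 10**11
print(log10_binomial(Q, Q // 2))   # close to 3 * 10**5
```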

Similar arguments (with similar results in terms of probabilities) can be made for starting in any configuration with entropy appreciably less than log(10⁶) (the maximum). In other words, it is overwhelmingly probable that as time passes, macroscopically, the system will increase in entropy until it reaches the maximum.

In many respects, these general arguments can be thought of as a ``proof'' (or at least an explanation) of a version of the second law of thermodynamics: Given any macroscopic system which is free to change configurations, and given any configuration with entropy less than the maximum, there will be overwhelmingly many more accessible configurations with higher entropy than lower entropy, and thus, with probability indistinguishable from 1, the system will (in macroscopic time steps) successively change to configurations with higher entropy until it reaches the maximum.
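
A toy simulation makes the same point on a small scale. This Python sketch assumes 1000 particles, 10 boxes, and the simple dynamics described above, in which each particle jumps to a uniformly random box at every time step. The function names, sizes, and seed are all assumptions for illustration.

```python
import math
import random
from collections import Counter

def entropy_bits(counts, total):
    """Entropy (in bits) of p_i = n_i / total over the occupied boxes."""
    return sum((n / total) * math.log2(total / n) for n in counts if n > 0)

def simulate(num_particles=1000, num_boxes=10, steps=5, seed=0):
    """Start with every particle in box 0; at each step every particle
    jumps to a uniformly random box. Returns the entropy after each step."""
    rng = random.Random(seed)
    positions = [0] * num_particles
    history = [entropy_bits(Counter(positions).values(), num_particles)]
    for _ in range(steps):
        positions = [rng.randrange(num_boxes) for _ in positions]
        history.append(entropy_bits(Counter(positions).values(), num_particles))
    return history

history = simulate()
print([round(h, 3) for h in history], "max:", round(math.log2(10), 3))
```

Starting from entropy 0, the system sits close to the maximum log(10) after a single step and stays there.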

Language, and putting things together

``An essential distinction between language and experience is that language separates out from the living matrix little bundles and freezes them; in doing this it produces something totally unlike experience, but nevertheless useful.''
- P. W. Bridgman, 1936
``One is led to a new notion of unbroken wholeness which denies the classical analyzability of the world into separately and independently existing parts. The inseparable quantum interconnectedness of the whole universe is the fundamental reality.''
- David Bohm

Shannon's communication theory

In his classic 1948 papers, Claude Shannon laid the foundations for contemporary information, coding, and communication theory. He developed a general model for communication systems, and a set of theoretical tools for analyzing such systems.
His basic model consists of three parts: a sender (or source), a channel, and a receiver (or sink). His general model also includes encoding and decoding elements, and noise within the channel.

Shannon's communication model

In Shannon's discrete model, it is assumed that the source provides a stream of symbols selected from a finite alphabet A = {a₁, a₂, …, a_n}, which are then encoded. The code is sent through the channel (and possibly disturbed by noise). At the other end of the channel, the receiver will decode, and derive information from the sequence of symbols.
Let me mention at this point that sending information from now to then is equivalent to sending information from here to there, and thus Shannon's theory applies equally as well to information storage questions as to information transmission questions.

Given a source of symbols and a channel with noise (in particular, a probability model for these elements), we can talk about the capacity of the channel. The general model Shannon worked with involved two sets of symbols, the input symbols and the output symbols. Let us say the two sets of symbols are A = {a₁, a₂, …, a_n} and B = {b₁, b₂, …, b_m}. Note that we do not necessarily assume the same number of symbols in the two sets. Given the noise in the channel, when symbol b_j comes out of the channel, we cannot be sure which a_i was put in. The channel is characterized by the set of probabilities {P(b_j|a_i)}; together with the source probabilities P(a_i), these determine the probabilities P(a_i|b_j).
We can then consider various related information and entropy measures. First, we can consider the information we get from observing a symbol b_j. Given a probability model of the source, we have an a priori estimate P(a_i) that symbol a_i will be sent next. Upon observing b_j, we can revise our estimate to P(a_i|b_j). The change in our information (the mutual information) will be given by:

\[ I(a_i; b_j) = \log\left(\frac{1}{P(a_i)}\right) - \log\left(\frac{1}{P(a_i|b_j)}\right) = \log\left(\frac{P(a_i|b_j)}{P(a_i)}\right) \]

We have the properties:

\[
\begin{aligned}
I(a_i; b_j) &= I(b_j; a_i) \\
I(a_i; b_j) &= \log(P(a_i|b_j)) + I(a_i) \\
I(a_i; b_j) &\le I(a_i)
\end{aligned}
\]

If a_i and b_j are independent (i.e., if P(a_i, b_j) = P(a_i) * P(b_j)), then

I(a_i; b_j) = 0.
What we actually want is to average the mutual information over all the symbols:

\[
\begin{aligned}
I(A; b_j) &= \sum_i P(a_i|b_j)\, I(a_i; b_j) = \sum_i P(a_i|b_j) \log\left(\frac{P(a_i|b_j)}{P(a_i)}\right) \\
I(a_i; B) &= \sum_j P(b_j|a_i) \log\left(\frac{P(b_j|a_i)}{P(b_j)}\right),
\end{aligned}
\]

and from these,

\[ I(A; B) = \sum_i P(a_i)\, I(a_i; B) = \sum_i \sum_j P(a_i, b_j) \log\left(\frac{P(a_i, b_j)}{P(a_i) P(b_j)}\right) = I(B; A). \]

We have the properties:

\[ I(A; B) \ge 0, \]

and

I(A; B) = 0

if and only if A and B are independent.

We then have the definitions and properties:

\[
\begin{aligned}
H(A) &= \sum_{i=1}^{n} P(a_i) \log(1/P(a_i)) \\
H(B) &= \sum_{j=1}^{m} P(b_j) \log(1/P(b_j)) \\
H(A|B) &= \sum_{i=1}^{n} \sum_{j=1}^{m} P(a_i, b_j) \log(1/P(a_i|b_j)) \\
H(A, B) &= \sum_{i=1}^{n} \sum_{j=1}^{m} P(a_i, b_j) \log(1/P(a_i, b_j)) \\
H(A, B) &= H(A) + H(B|A) = H(B) + H(A|B),
\end{aligned}
\]

and furthermore:

\[
\begin{aligned}
I(A; B) &= H(A) + H(B) - H(A, B) \\
&= H(A) - H(A|B) \\
&= H(B) - H(B|A)
\end{aligned}
\]
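
These identities can be checked on a small example. The Python sketch below uses a made-up 2 x 2 joint distribution P(a_i, b_j), so the numbers are assumptions. It computes H(A|B) in the standard form, in which log(1/P(a_i|b_j)) is weighted by the joint probability P(a_i, b_j), and it computes I(A; B) from the double sum over the joint distribution.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities (zeros are skipped)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical joint distribution P(a_i, b_j): rows are a_i, columns are b_j.
joint = [
    [0.40, 0.10],
    [0.05, 0.45],
]
p_a = [sum(row) for row in joint]          # marginal P(a_i)
p_b = [sum(col) for col in zip(*joint)]    # marginal P(b_j)

h_a, h_b = H(p_a), H(p_b)
h_ab = H(p for row in joint for p in row)

# H(A|B) = sum_ij P(a_i, b_j) * log(1 / P(a_i|b_j)), with P(a_i|b_j) = P(a_i, b_j) / P(b_j)
h_a_given_b = sum(
    p * math.log2(p_b[j] / p)
    for row in joint
    for j, p in enumerate(row)
    if p > 0
)

# I(A;B) = sum_ij P(a_i, b_j) * log(P(a_i, b_j) / (P(a_i) P(b_j)))
mutual = sum(
    p * math.log2(p / (p_a[i] * p_b[j]))
    for i, row in enumerate(joint)
    for j, p in enumerate(row)
    if p > 0
)

print(h_a, h_b, h_ab, h_a_given_b, mutual)
```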

If we are given a channel, we could ask what is the maximum amount of information that can be transmitted through the channel. We could also ask what mix of the symbols {a_i} we should use to achieve the maximum. In particular, using the definitions above, we can define the Channel Capacity C to be:

\[ C = \max_{P(a)} I(A; B). \]
We have the nice property that if we are using the channel at its capacity, then for each of the a_i,

I(a_i;B) = C,

and thus, we can maximize channel use by maximizing the use for each symbol independently.
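
As an illustration of the definition of C, here is a Python sketch that estimates the capacity of a binary symmetric channel by a simple grid search over the input distribution. The binary symmetric channel, the crossover probability 0.1, the grid search, and the function names are all assumptions introduced for this example. They are not taken from the discussion above. For this particular channel the capacity is known to be 1 - H(eps) bits, reached with equally likely inputs, which gives a check on the search.

```python
import math

def mutual_information(p_a, channel):
    """I(A;B) in bits for input distribution p_a and channel[i][j] = P(b_j | a_i)."""
    n, m = len(p_a), len(channel[0])
    p_b = [sum(p_a[i] * channel[i][j] for i in range(n)) for j in range(m)]
    total = 0.0
    for i in range(n):
        for j in range(m):
            joint = p_a[i] * channel[i][j]
            if joint > 0:
                total += joint * math.log2(channel[i][j] / p_b[j])
    return total

def capacity_binary_input(channel, grid=1000):
    """Grid search over P(a_1) = k/grid; returns (best I(A;B), best P(a_1))."""
    return max(
        (mutual_information([k / grid, 1 - k / grid], channel), k / grid)
        for k in range(grid + 1)
    )

eps = 0.1  # hypothetical crossover probability
bsc = [[1 - eps, eps], [eps, 1 - eps]]
c, best_p = capacity_binary_input(bsc)
h_eps = eps * math.log2(1 / eps) + (1 - eps) * math.log2(1 / (1 - eps))
print(c, best_p, 1 - h_eps)
```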
We also have Shannon's main theorem:
For any channel, there exist ways of encoding input symbols such that we can simultaneously utilize the channel as closely as we wish to the capacity, and at the same time have an error rate as close to zero as we wish.
This is actually quite a remarkable theorem. We might naively guess that in order to minimize the error rate, we would have to use more of the channel capacity for error detection/correction, and less for actual transmission of information. Shannon showed that it is possible to keep error rates low and still use the channel for information transmission at (or near) its capacity.
Unfortunately, Shannon's proof has a couple of downsides. The first is that the proof is non-constructive. It doesn't tell us how to construct the coding system to optimize channel use, but only tells us that such a code exists. The second is that in order to use the capacity with a low error rate, we may have to encode very large blocks of data. This means that if we are attempting to use the channel in real-time, there may be time lags while we are filling buffers. There is thus still much work possible in the search for efficient coding schemes.
Among the things we can do is look at natural coding systems (such as, for example, the DNA coding system, or neural systems) and see how they use the capacity of their channel. It is not unreasonable to assume that evolution will have done a pretty good job of optimizing channel use ...

Tools

``It is a recurring experience of scientific progress that what was yesterday an object of study, of interest in its own right, becomes today something to be taken for granted, something understood and reliable, something known and familiar - a tool for further research and discovery.''
-J. R. Oppenheimer, 1953
``Nature uses only the longest threads to weave her patterns, so that each small piece of her fabric reveals the organization of the entire tapestry.''
- Richard Feynman

Application to Biology (analyzing genomes) Top

Let us apply some of these ideas to the (general) problem of analyzing genomes.
We can start with an example such as the comparatively small genome of Escherichia coli, strain K-12, substrain MG1655, version M52. This example has the convenient features:
1. It has been completely sequenced.
2. The sequence is available for downloading (http://www.genetics.wisc.edu/).
3. Annotated versions are available for further work.
4. It is large enough to be interesting (somewhat over 4 mega-bases, or 4 million nucleotides), but not so huge as to be completely unwieldy.
5. The labels on the printouts tend to make other people using the printer a little nervous :-)
6. Here's the beginning of the file:
```
>gb|U00096|U00096 Escherichia coli
   K-12 MG1655 complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCT
CTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAA
TTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAG
AGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGG
TGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACC
AAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
            
```

In this exploratory project, my goal has been to apply the information and entropy ideas outlined above to genome analysis. Some of the results I have so far are tantalizing. For a while, I'll just walk you through some preliminary work. While I am not an expert in genomes/DNA, I am hoping that some of what I am doing can bring fresh eyes to the problems of analyzing genome sequences, without too many preconceptions. It is at least conceivable that my naiveté will be an advantage ...
My first step was to generate for myself a ``random genome'' of comparable size to compare things with. In this case, I simply used the Unix `random' function to generate a file containing a random sequence of about 4 million A, C, G, T. In the actual genome, these letters stand for the nucleotides adenine, cytosine, guanine, and thymine.
Other people working in this area have taken some other approaches to this process, such as randomly shuffling an actual genome (thus maintaining the relative proportions of A, C, G, and T). Part of the justification for this methodology is that actual (identified) coding sections of DNA tend to have a ratio of C+G to A+T different from one. I didn't worry about this issue (for various reasons).
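The two ways of producing a comparison sequence described above can be sketched as follows. This is only an illustration, not the code actually used for these notes: the function names and the use of Python's `random` module (rather than the Unix `random' function) are assumptions.

```python
import random

NUCLEOTIDES = "ACGT"

def random_genome(length, seed=None):
    """Uniformly random sequence of A, C, G, T (each letter equally likely)."""
    rng = random.Random(seed)
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))

def shuffled_genome(genome, seed=None):
    """Random permutation of an existing sequence.

    This keeps the relative proportions of A, C, G and T exactly as they
    are in the original, but destroys any ordering information.
    """
    rng = random.Random(seed)
    letters = list(genome)
    rng.shuffle(letters)
    return "".join(letters)

sample = random_genome(60, seed=1)
print(sample)
print(shuffled_genome(sample, seed=2))
```

For a sequence of about 4 million letters the same functions apply; only the `length` argument changes.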
My next step was to start developing a (variety of) probability model(s) for the genome. The general idea that I am working on is to build some automated tools to locate ``interesting'' sections of a genome. Thinking of DNA as a coding system, we can hope that ``important'' stretches of DNA will have entropy different from other stretches. Of course, as noted above, the entropy measure depends in an essential way on the probability model attributed to the source. We will want to try to build a model that catches important aspects of what we find interesting or significant. We will want to use our knowledge of the systems in which DNA is embedded to guide the development of our models. On the other hand, we probably don't want to constrain the model too much. Remember that information and entropy are measures of unexpectedness. If we constrain our model too much, we won't leave any room for the unexpected!
We know, for example, that simple repetitions have low entropy. But if the code being used is redundant (sometimes called degenerate), with multiple encodings for the same symbol (as is the case for DNA codons), what looks to one observer to be a random stream may be recognized by another observer (who knows the code) to be a simple repetition.
The first element of my probability model(s) involves the observation that coding sequences for peptides and proteins are encoded via codons, that is, by sequences of blocks of triples of nucleotides. Thus, for example, the codon AGC on mRNA (messenger RNA) codes for the amino acid serine (or, if we happen to be reading in the reverse direction, as CGA, it would code for arginine). On DNA, AGC is transcribed to the complementary UCG on the mRNA (or GCU if read in the reverse direction), and thus could code for serine or alanine.

Amino acids specified by each codon sequence on mRNA.
A = adenine G = guanine C = cytosine T = thymine U = uracil
Table from http://www.accessexcellence.org
Key for the above table:
Ala: Alanine
Arg: Arginine
Asn: Asparagine
Asp: Aspartic acid
Cys: Cysteine
Gln: Glutamine
Glu: Glutamic acid
Gly: Glycine
His: Histidine
Ile: Isoleucine
Leu: Leucine
Lys: Lysine
Met: Methionine
Phe: Phenylalanine
Pro: Proline
Ser: Serine
Thr: Threonine
Trp: Tryptophane
Tyr: Tyrosine
Val: Valine

For our first model, we will consider each three-nucleotide codon to be a distinct symbol. We can then take a chunk of genome and estimate the probability of occurrence of each codon by simply counting and dividing by the total number of codons counted. At this level, we are assuming we have no knowledge of where codons start, and so in this model, we assume that ``readout'' could begin at any nucleotide. We thus use each three adjacent nucleotides.
For example, given the DNA chunk:
```
AGCTTTTCATTCTGACTGCAACGGGCAATATGTC
```
we would count:
```
AAT  1  AAC  1  ACG  1  ACT  1  AGC  1
ATA  1  ATG  1  ATT  1  CAA  2  CAT  1 
CGG  1  CTG  2  CTT  1  GAC  1  GCA  2
GCT  1  GGC  1  GGG  1  GTC  1  TAT  1     
TCA  1  TCT  1  TGA  1  TGC  1  TGT  1
TTC  2  TTT  2
```
We can then estimate the entropy of the chunk as:

$$\sum_i p_i \log_2(1/p_i) = 4.7 \text{ bits}.$$

The maximum possible entropy for this chunk would be:

log₂(32) = 5 bits.
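The counting and entropy estimate for the example chunk can be reproduced with a few lines of Python. This is a minimal sketch; the function names are hypothetical and not taken from the original project.

```python
from collections import Counter
from math import log2

def overlapping_codons(seq):
    """All triples of adjacent nucleotides, starting at every position."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def codon_entropy(seq):
    """Entropy estimate (in bits) from the relative frequencies of the triples."""
    counts = Counter(overlapping_codons(seq))
    total = sum(counts.values())
    # p_i = c / total, so log2(1/p_i) = log2(total / c)
    return sum((c / total) * log2(total / c) for c in counts.values())

chunk = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTC"
triples = overlapping_codons(chunk)
print(len(triples))                    # 32 triples in the 34-letter chunk
print(Counter(triples).most_common(5)) # the five triples that occur twice
print(codon_entropy(chunk))            # 4.6875, i.e. about 4.7 bits
```

The chunk yields 32 triples, of which 22 occur once and 5 occur twice, so the estimate is 22 * (1/32) * 5 + 5 * (2/32) * 4 = 150/32 = 4.6875 bits.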
We want to find ``interesting'' sections (and features) of a genome. As a starting place, we can slide a ``window'' over the genome, and estimate the entropy within the window. The plot below shows the entropy estimates for the E. coli genome, within a window of size 6561 ( = 3⁸). The window is slid in steps of size 81 ( = 3⁴). This results in 57,194 values, one for each placement of the window. For comparison, the values for a ``random'' genome are also shown.

Entropy of E. coli and random
window 6561, slide-step 81

At this level, we can make the simple observation that the actual genome values are quite different from the comparative random string. The values for E. coli range from about 5.8 to about 5.96, while the random values are clustered quite closely above 5.99 (the maximum possible is log₂(64) = 6).
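The sliding-window estimate described above might be sketched like this. The window and step defaults are the ones quoted in the text; everything else (names, the straightforward recount in every window) is an assumption made for illustration.

```python
from collections import Counter
from math import log2

def codon_entropy(seq):
    """Entropy (bits) of the overlapping-triple frequencies in seq."""
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())

def windowed_entropy(genome, window=6561, step=81):
    """One entropy estimate per placement of the window along the genome."""
    last_start = len(genome) - window
    return [codon_entropy(genome[start:start + window])
            for start in range(0, last_start + 1, step)]

# Small demonstration on a toy sequence (a real genome would be read from a file).
toy = "ACGT" * 50 + "AAAAAAAAAA" * 20
values = windowed_entropy(toy, window=27, step=9)
print(len(values), min(values), max(values))
```

Recounting every window from scratch is simple but slow for a multi-megabase genome; updating the counts incrementally as the window slides would be a natural refinement.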
From here, there are various directions we could go. With a given window size and step size (e.g., 6561:81, as in the given plot), we can look at interesting features of the entropy estimates. For example, we could look at regions with high entropy, or low entropy. We could look at regions where there are abrupt changes in entropy, or regions where entropy stays relatively stable.
We could change the window size, and/or step size. We could work to develop adaptive algorithms which zoom in on interesting regions, where ``interesting'' is determined by criteria such as the ones listed above.
We could take known coding regions of genomes, and develop entropy ``fingerprints'' which we could then try to match.
There are various ``data massage'' techniques we could use. For example, we could take the Fourier transform of the entropy estimates, and explore that. Below is an example of such a Fourier transform. Notice that it has some interesting ``periodic'' features which might be worth exploring. It is also interesting to note that the Fourier transform of the entropy of a ``random'' genome has the shape of approximately 1/f = 1/f¹ (as expected ...), whereas the E. coli data are closer to 1/f^1.5.
The discrete Fourier transform of a sequence (a_j), j = 0, ..., q-1, is the sequence (A_k), k = 0, ..., q-1, where

$$A_k = \frac{1}{\sqrt{q}} \sum_{j=0}^{q-1} a_j e^{2\pi i jk/q}.$$

One way to think about this is that (A_k) = F((a_j)) where the linear transformation F is given by:

$$[F]_{j,k} = \frac{1}{\sqrt{q}} e^{2\pi i jk/q}.$$

Note that the inverse of F is its conjugate transpose F^† - that is,

$$[F^{-1}]_{k,j} = \frac{1}{\sqrt{q}} e^{-2\pi i jk/q}.$$

The plots that follow are log-log plots of the norms $|A_k| = (A_k \overline{A_k})^{1/2}$ (power spectra).
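A direct, unoptimized transcription of this definition into Python is sketched below, using the 1/√q normalization and the sign convention of the formulas above. The function names are hypothetical; for real data one would normally use a fast Fourier transform library instead of this O(q²) loop.

```python
import cmath
from math import pi, sqrt

def dft(a, sign=+1):
    """Unitary discrete Fourier transform: A_k = q**-0.5 * sum_j a_j e^{sign*2*pi*i*j*k/q}."""
    q = len(a)
    return [
        sum(a[j] * cmath.exp(sign * 2 * pi * 1j * j * k / q) for j in range(q)) / sqrt(q)
        for k in range(q)
    ]

def inverse_dft(A):
    """Inverse transform: the conjugate transpose simply flips the sign of the exponent."""
    return dft(A, sign=-1)

def norms(A):
    """|A_k| = (A_k * conj(A_k)) ** 0.5 for each k."""
    return [abs(x) for x in A]

signal = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0]
spectrum = dft(signal)
recovered = inverse_dft(spectrum)
print(norms(spectrum))
print([round(x.real, 10) for x in recovered])  # recovers the original signal up to rounding
```

Because the transform is unitary, the sum of |a_j|² equals the sum of |A_k|², which is a convenient sanity check on any implementation.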

Fourier transform of E. coli
window 6561, slide-step 81

Fourier transform of random
window 6561, slide-step 81

Application to Physics (lasers) Top

Suppose we have a system for which we can measure certain macroscopic characteristics. Suppose further that the system is made up of many microscopic elements, and that the system is free to vary among various states. Given the discussion above, let us assume that with probability essentially equal to 1, the system will be observed in states with maximum entropy.
We will then sometimes be able to gain understanding of the system by applying a maximum information entropy principle, and, using Lagrange multipliers, derive formulas for aspects of the system.
Suppose we have a set of macroscopic measurable characteristics f_k, k = 1, 2, …, M (which we can think of as constraints on the system), which we think of as related to microscopic characteristics via:

$$\sum_i p_i f_i^{(k)} = f_k.$$

We want to maximize the entropy, Σ_i p_i log(1/p_i), subject to these constraints. Using Lagrange multipliers λ_k (one for each constraint), we have the general solution:

$$p_i = \exp\left(-\lambda - \sum_k \lambda_k f_i^{(k)}\right).$$

If we define Z, called the partition function, by

$$Z(\lambda_1, \ldots, \lambda_M) = \sum_i \exp\left(-\sum_k \lambda_k f_i^{(k)}\right),$$

then we have e^λ = Z, or λ = ln(Z).
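To see the method at work on something small, here is a sketch for a single constraint (M = 1). The example, a six-sided die whose average roll is constrained to 4.5, is an illustration chosen here and is not from the notes; the multiplier is found by bisection, which is also just one convenient choice.

```python
from math import exp, log

def maxent_distribution(values, target_mean, lo=-50.0, hi=50.0):
    """Maximum-entropy p_i on `values` subject to sum_i p_i * values[i] = target_mean.

    The solution has the form p_i = exp(-lam1 * f_i) / Z, and the mean is a
    decreasing function of lam1, so lam1 can be located by bisection.
    """
    values = list(values)

    def mean_for(lam1):
        weights = [exp(-lam1 * v) for v in values]
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)

    for _ in range(200):
        mid = (lo + hi) / 2
        if mean_for(mid) > target_mean:
            lo = mid      # mean too large: increase the multiplier
        else:
            hi = mid
    lam1 = (lo + hi) / 2
    weights = [exp(-lam1 * v) for v in values]
    z = sum(weights)                      # partition function Z(lam1)
    return lam1, z, [w / z for w in weights]

lam1, z, probs = maxent_distribution(range(1, 7), 4.5)
print(lam1, log(z))   # lam1 and lambda = ln(Z)
print(probs)          # probabilities increase toward the face 6
```

With the constraint set to the unbiased mean 3.5, the same routine returns λ₁ ≈ 0 and the uniform distribution, as one would hope.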
Let us apply this method to a specific example - a single mode laser. For a laser, we will be interested in the intensity of the light emitted, and the coherence property of the light will be observed in the second moment of the intensity. The electric field strength of such a laser will have the form

$$E(x, t) = E(t) \sin(kx),$$

and E(t) can be decomposed in the form

$$E(t) = B e^{-i\omega t} + B^* e^{i\omega t}.$$

If we measure the intensity of the light over time intervals long compared to the oscillation period, but small compared to fluctuations of B(t), the output will be proportional to BB^* and to the loss rate, 2k, of the laser:

$$I = 2k B B^*.$$

The intensity squared will be

$$I^2 = 4k^2 B^2 B^{*2}.$$
If we assume that B and B^* are continuous random variables associated with a stationary process, then the information entropy of the system will be:

$$H = \int p(B, B^*) \log\left(\frac{1}{p(B, B^*)}\right) d^2B.$$

The two constraints on the system will be the averages of the intensity and the square of the intensity:

$$f_1 = \langle 2k B B^* \rangle,$$

$$f_2 = \langle 4k^2 B^2 B^{*2} \rangle.$$

Then, of course, we will let

$$f^{(1)}_{B,B^*} = 2k B B^*,$$

$$f^{(2)}_{B,B^*} = 4k^2 B^2 B^{*2}.$$

We can now use the method outlined above, finding the maximum entropy general solution derived via Lagrange multipliers for this system.
Applying the general solution, we get:

$$p(B, B^*) = \exp\left[-\lambda - \lambda_1 2k B B^* - \lambda_2 4k^2 (B B^*)^2\right],$$

or, in other notation:

$$p(B, B^*) = N \exp(-a|B|^2 - b|B|^4).$$

This function in laser physics is typically derived by solving the Fokker-Planck equation belonging to the Langevin equation for the system.
For quick reference, the typical generic Langevin equation looks like:

$$\dot{q} = K(q) + F(t)$$

where q is a state vector, and the fluctuating forces F_j(t) are typically assumed to have

$$\langle F_j(t) \rangle = 0,$$

$$\langle F_j(t) F_{j'}(t') \rangle = Q_j \delta_{jj'} \delta(t - t').$$
The associated generic Fokker-Planck equation for the distribution function f(q, t) then looks like:

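$$\frac{\partial f}{\partial t} = -\sum_j \frac{\partial}{\partial q_j}(K_j f) + \frac{1}{2} \sum_{jk} Q_{jk} \frac{\partial^2}{\partial q_j \partial q_k} f.$$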

The first term is called the drift term, and the second the diffusion term. This can typically be solved only for special cases ...
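One of the special cases that can be solved is a single variable with linear drift, K(q) = -γq. The stationary solution of the Fokker-Planck equation is then a Gaussian with variance Q/(2γ). The sketch below checks this numerically by integrating the Langevin equation with a simple Euler step; the choice of drift, the parameter values, and the integration scheme are all assumptions made for illustration.

```python
import random
from math import sqrt

def simulate_langevin(gamma=1.0, Q=1.0, dt=0.01, steps=200_000, seed=0):
    """Euler integration of dq/dt = -gamma*q + F(t), with <F(t)F(t')> = Q*delta(t-t')."""
    rng = random.Random(seed)
    noise_scale = sqrt(Q * dt)   # a delta-correlated force integrates to variance Q*dt per step
    q = 0.0
    samples = []
    for _ in range(steps):
        q += -gamma * q * dt + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(q)
    return samples

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

samples = simulate_langevin()
print(variance(samples), "compared with Q/(2*gamma) =", 1.0 / 2.0)
```

The sample variance should land near 0.5 for these parameters, up to statistical fluctuation and a small bias from the finite time step.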
For much more discussion of these topics, I can recommend the book Information and Self-organization, A Macroscopic Approach to Complex Systems by Hermann Haken, Springer-Verlag Berlin, New York, 1988.

Some other measures Top

There have been various approaches to expanding on the idea of entropy as a measure of complexity. One useful generalization of entropy was developed by the Hungarian mathematician A. Rényi. His method involves looking at the moments of order q of a probability distribution {p_i}:

$$S_q = \frac{1}{1-q} \log \sum_i p_i^q$$

If we take the limit as q → 1, we get:

$$S_1 = \sum_i p_i \log(1/p_i),$$

the entropy we have previously defined. We can then think of S_q as a generalized entropy for any real number q.
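A small numerical sketch of the generalized entropy follows. It is written with the prefactor 1/(1 - q), the sign convention under which the q → 1 limit equals Σ p_i log(1/p_i); the base-2 logarithm and the function name are choices made here.

```python
from math import log2

def renyi_entropy(probs, q):
    """Generalized (Renyi) entropy of order q, in bits."""
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        # the q -> 1 limit is the ordinary entropy
        return sum(p * log2(1 / p) for p in probs)
    return log2(sum(p ** q for p in probs)) / (1 - q)

dist = [0.5, 0.25, 0.25]
for q in (0, 0.5, 0.999999, 1, 1.000001, 2):
    print(q, renyi_entropy(dist, q))
```

For this distribution the values near q = 1 approach 1.5 bits from both sides, and for a uniform distribution every order gives the same value, log₂ of the number of outcomes.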
Expanding on these generalized entropies, we can then define a generalized dimension associated with a data set. If we imagine the data set to be distributed among bins of diameter r, we can let p_i be the probability that a data item falls in the i'th bin (estimated by counting the data elements in the bin, and dividing by the total number of items). We can then, for each q, define a dimension:

$$D_q = \lim_{r \to 0} \frac{1}{q-1} \frac{\log \sum_i p_i^q}{\log(r)}.$$
Why do we call this a generalized dimension?
Consider D₀. First, we will adopt the (analyst's?) convention that p_i⁰ = 0 when p_i = 0. Also, let N_r be the number of non-empty bins (i.e., the number of bins of diameter r it takes to cover the data set).
Then we have:

$$D_0 = \lim_{r \to 0} \frac{\log \sum_i p_i^0}{\log(1/r)} = \lim_{r \to 0} \frac{\log(N_r)}{\log(1/r)}$$
Thus, D₀ is the Hausdorff dimension D, which is frequently in the literature called the fractal dimension of the set.
Three examples:
1. Consider the unit interval [0,1]. Let r_k = 1/2^k. Then N_{r_k} = 2^k, and
  
  $$D_0 = \lim_{k \to \infty} \frac{\log(2^k)}{\log(2^k)} = 1.$$
2. Consider the unit square [0,1] × [0,1]. Again, let r_k = 1/2^k. Then N_{r_k} = 2^{2k}, and
  
  $$D_0 = \lim_{k \to \infty} \frac{\log(2^{2k})}{\log(2^k)} = 2.$$
3. Consider the Cantor set:
  
  The construction of the Cantor set is suggested by the diagram. The Cantor set is what remains from the interval after we have removed middle thirds countably many times. It is an uncountable set, with measure (``length'') 0. For this set we will let r_k = 1/3^k. Then N_{r_k} = 2^k, and
  
  $$D_0 = \lim_{k \to \infty} \frac{\log(2^k)}{\log(3^k)} = \frac{\log(2)}{\log(3)} \approx 0.631.$$
  
  The Cantor set is a traditional example of a fractal. It is self similar, and has D₀ ≈ 0.631, which is strictly greater than its topological dimension ( = 0). It is an important example since many nonlinear dynamical systems have trajectories which are locally the product of a Cantor set with a manifold (i.e., Poincaré sections are generalized Cantor sets).
  An interesting example of this phenomenon occurs with the logistic equation:
  
  $$x_{i+1} = k x_i (1 - x_i)$$
  
  with k > 4. In this case (of which you rarely see pictures ...), most starting points run off rapidly to -∞, but there is a strange repellor(!) which is a Cantor set. It is a repellor since arbitrarily close to any point on the trajectory are points which run off to -∞. One thing this means is that any finite precision simulation will not capture the repellor ...
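The box-counting estimate in the third example above can be checked numerically. The sketch below represents the left endpoints of the level-n Cantor intervals exactly, as integers in units of 3⁻ⁿ, and counts how many boxes of size 3⁻ᵏ they occupy. The representation and the function names are assumptions made here.

```python
from itertools import product
from math import log

def cantor_points(depth):
    """Left endpoints of the 2**depth intervals at construction level `depth`.

    Each endpoint is an integer in units of 3**-depth whose base-3 digits
    are all 0 or 2 (the middle third, digit 1, is always removed).
    """
    return [
        sum(d * 3 ** (depth - 1 - i) for i, d in enumerate(digits))
        for digits in product((0, 2), repeat=depth)
    ]

def box_count(points, depth, k):
    """Number of boxes of size 3**-k (k <= depth) containing at least one point."""
    box_size = 3 ** (depth - k)
    return len({p // box_size for p in points})

def box_dimension(points, depth, k):
    """log(N_r) / log(1/r) evaluated at r = 3**-k."""
    return log(box_count(points, depth, k)) / log(3 ** k)

points = cantor_points(10)
for k in (2, 4, 6, 8):
    print(k, box_count(points, 10, k), box_dimension(points, 10, k))
print(log(2) / log(3))
```

Because the set is exactly self-similar, the ratio is the same at every scale k, and no limit needs to be taken.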
We can make several observations about D_q:
1. If q₁ ≤ q₂, then D_q₁ ≥ D_q₂.
2. If the set is strictly self-similar with equal probabilities p_i = 1/N, then we do not need to take the limit as r® 0, and
  
  $$D_q = \frac{1}{q-1} \frac{\log(N (1/N)^q)}{\log(r)} = \frac{\log(N)}{\log(1/r)} = D_0$$
  
  for all q. This is the case, for example, for the Cantor set.
3. D₁ is usually called the information dimension:
  
  $$D_1 = \lim_{r \to 0} \frac{\sum_i p_i \log(1/p_i)}{\log(1/r)}$$
  
  The numerator is just the entropy of the probability distribution.
4. D₂ is usually called the correlation dimension:
  
  $$D_2 = \lim_{r \to 0} \frac{\log \sum_i p_i^2}{\log(r)}$$
  
  This dimension is related to the probability of finding two elements of the set within a distance r of each other.
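To get a feel for how D_q depends on q, here is a sketch that evaluates the formula at a single finite bin size, for a toy data set chosen for illustration (it is not from the notes): a ``binomial'' measure on [0,1], built by repeatedly splitting each dyadic interval and giving a fraction w of its probability to the left half and 1 - w to the right half.

```python
from math import log

def binomial_measure(weight, depth):
    """Probabilities of the 2**depth dyadic bins after `depth` splitting steps."""
    probs = [1.0]
    for _ in range(depth):
        probs = [p * w for p in probs for w in (weight, 1.0 - weight)]
    return probs

def generalized_dimension(probs, r, q):
    """D_q evaluated at a single bin diameter r (no limit is taken)."""
    probs = [p for p in probs if p > 0]
    if q == 1:
        # information dimension: entropy divided by log(1/r)
        return sum(p * log(1 / p) for p in probs) / log(1 / r)
    return log(sum(p ** q for p in probs)) / ((q - 1) * log(r))

depth = 10
r = 2.0 ** -depth
for weight in (0.5, 0.3):
    probs = binomial_measure(weight, depth)
    print(weight, [round(generalized_dimension(probs, r, q), 4) for q in (0, 1, 2)])
```

With equal weights every order gives dimension 1, as in the strictly self-similar case above; with unequal weights the printed values of D₀, D₁ and D₂ differ, D₀ being the largest.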

Some additional material Top

What follows are some additional examples, and expanded discussion of some topics ...

Examples using Bayes' Theorem Top

A quick example:
Suppose that you are asked by a friend to help them understand the results of a genetic screening test they have taken. They have been told that they have tested positive, and that the test is 99% accurate. What is the probability that they actually have the anomaly?
You do some research, and find out that the test screens for a genetic anomaly that is believed to occur in one person out of 100,000 on average. The lab that does the tests guarantees that the test is 99% accurate. You push the question, and find that the lab says that one percent of the time, the test falsely reports the absence of the anomaly when it is there, and one percent of the time the test falsely reports the presence of the anomaly when it is not there.
The test has come back positive for your friend. How worried should they be? Given this much information, what can you calculate as the probability they actually have the anomaly?
In general, there are four possible situations for an individual being tested:
1. Test positive (Tp), and have the anomaly (Ha).
2. Test negative (Tn), and don't have the anomaly (Na).
3. Test positive (Tp), and don't have the anomaly (Na).
4. Test negative (Tn), and have the anomaly (Ha).
We would like to calculate for our friend the probability they actually have the anomaly (Ha), given that they have tested positive (Tp):

P(Ha |Tp).

We can do this using Bayes' Theorem.
We can calculate:

$$P(Ha|Tp) = \frac{P(Tp|Ha) P(Ha)}{P(Tp)}.$$

We need to figure out the three items on the right side of the equation. We can do this by using the information given. Suppose the screening test was done on 10,000,000 people. Out of these 10⁷ people, we expect there to be 10⁷/10⁵ = 100 people with the anomaly, and 9,999,900 people without the anomaly. According to the lab, we would expect the test results to be:
- Test positive (Tp), and have the anomaly (Ha):
  
  0.99 * 100 = 99 people.
- Test negative (Tn), and don't have the anomaly (Na):
  
  0.99 * 9,999,900 = 9,899,901 people.
- Test positive (Tp), and don't have the anomaly (Na):
  
  0.01 * 9,999,900 = 99,999 people.
- Test negative (Tn), and have the anomaly (Ha):
  
  0.01 * 100 = 1 person.
Now let's put the pieces together:

$$P(Ha) = \frac{1}{100,000} = 10^{-5}$$

$$P(Tp) = \frac{99 + 99,999}{10^7} = \frac{100,098}{10^7} = 0.0100098$$

$$P(Tp|Ha) = 0.99$$

Thus, our calculated probability that our friend actually has the anomaly is:

$$P(Ha|Tp) = \frac{P(Tp|Ha) P(Ha)}{P(Tp)} = \frac{0.99 \times 10^{-5}}{0.0100098} = \frac{9.9 \times 10^{-6}}{1.00098 \times 10^{-2}} = 9.890307 \times 10^{-4} < 10^{-3}$$

In other words, our friend, who has tested positive, with a test that is 99% correct, has less than one chance in 1000 of actually having the anomaly!
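The same calculation can be written as a short function. The parameter names (prevalence, sensitivity, specificity) are standard terms introduced here for the three quantities P(Ha), P(Tp|Ha) and P(Tn|Na); they are not used elsewhere in these notes.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(Ha | Tp) by Bayes' Theorem.

    prevalence  = P(Ha)
    sensitivity = P(Tp | Ha)
    specificity = P(Tn | Na), so 1 - specificity is the false positive rate
    """
    true_positive = sensitivity * prevalence
    false_positive = (1.0 - specificity) * (1.0 - prevalence)
    return true_positive / (true_positive + false_positive)

# The screening example: 1 in 100,000 prevalence, 99% accurate both ways.
print(positive_predictive_value(1e-5, 0.99, 0.99))  # about 9.89e-4, i.e. 99/100,098
```

The denominator is just P(Tp), split into true positives and false positives, exactly as in the head count above.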
There are a variety of questions we could ask now, such as, ``For this anomaly, how accurate would the test have to be for there to be a greater than 50% probability that someone who tests positive actually has the anomaly?''
For this, we need fewer false positives than true positives. Thus, in the example, we would need fewer than 100 false positives out of the 9,999,900 people who do not have the anomaly. In other words, the proportion of those without the anomaly for whom the test would have to be correct would need to be greater than:

$$\frac{9,999,800}{9,999,900} = 99.999\%$$
Another question we could ask is, ``How prevalent would an anomaly have to be in order for a 99% accurate test (1% false positive and 1% false negative) to give a greater than 50% probability of actually having the anomaly when testing positive?''
Again, we need fewer false positives than true positives. We would therefore need the actual occurrence to be greater than 1 in 100 (each false positive would be matched by at least one true positive, on average).
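Both threshold questions come from the same inequality, true positives > false positives, solved for a different unknown. A sketch, with hypothetical function names and the same terminology as a medical statistician might use (sensitivity for P(Tp|Ha), specificity for P(Tn|Na)):

```python
def min_specificity_for_even_odds(prevalence, sensitivity):
    """Specificity above which P(Ha | Tp) exceeds 50%.

    From sensitivity * prevalence > (1 - specificity) * (1 - prevalence).
    """
    return 1.0 - sensitivity * prevalence / (1.0 - prevalence)

def min_prevalence_for_even_odds(sensitivity, specificity):
    """Prevalence above which P(Ha | Tp) exceeds 50%."""
    false_positive_rate = 1.0 - specificity
    return false_positive_rate / (sensitivity + false_positive_rate)

print(min_specificity_for_even_odds(1e-5, 0.99))   # just over 0.99999
print(min_prevalence_for_even_odds(0.99, 0.99))    # about 0.01, i.e. 1 in 100
```

The first value agrees with the 99.999% figure above to the precision quoted there; the small difference arises because this version compares against 99 expected true positives rather than 100.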
Note that the current population of the US is about 280,000,000 and the current population of the world is about 6,200,000,000. Thus, we could expect an anomaly that affects 1 person in 100,000 to affect about 2,800 people in the US, and about 62,000 people worldwide, and one affecting one person in 100 would affect 2,800,000 people in the US, and 62,000,000 people worldwide ...
Another example: suppose the test were not so accurate? Suppose the test were 80% accurate (20% false positive and 20% false negative). Suppose that we are testing for a condition expected to affect 1 person in 100. What would be the probability that a person testing positive actually has the condition? We can do the same sort of calculations.
Let's use 1000 people this time. Out of this sample, we would expect 10 to have the condition.
- Test positive (Tp), and have the condition (Ha):
  
  0.80 * 10 = 8 people.
- Test negative (Tn), and don't have the condition (Na):
  
  0.80 * 990 = 792 people.
- Test positive (Tp), and don't have the condition (Na):
  
  0.20 * 990 = 198 people.
- Test negative (Tn), and have the condition (Ha):
  
  0.20 * 10 = 2 people.
Now let's put the pieces together:

$$P(Ha) = \frac{1}{100} = 10^{-2}$$

$$P(Tp) = \frac{8 + 198}{10^3} = \frac{206}{10^3} = 0.206$$

$$P(Tp|Ha) = 0.80$$

Thus, our calculated probability that a person who tests positive actually has the condition is:

$$P(Ha|Tp) = \frac{P(Tp|Ha) P(Ha)}{P(Tp)} = \frac{0.80 \times 10^{-2}}{0.206} = \frac{8 \times 10^{-3}}{2.06 \times 10^{-1}} = 3.883495 \times 10^{-2} < 0.04$$

In other words, one who has tested positive, with a test that is 80% correct, has less than one chance in 25 of actually having this condition. (Imagine for a moment, for example, that this is a drug test being used on employees of some corporation ...)
We could ask the same kinds of questions we asked before:
1. How accurate would the test have to be to get a better than 50% chance of actually having the condition when testing positive?
  (99%)
2. For an 80% accurate test, how frequent would the condition have to be to get a better than 50% chance?
  (1 in 5)
Some questions:
1. Are these examples realistic? If not, why not?
2. What sorts of things could we do to improve our results?
3. Would it help to repeat the test? For example, if the probability of a false positive is 1 in 100, would that mean that the probability of two false positives on the same person would be 1 in 10,000 ([1/ 100] * [1/ 100])? If not, why not?
4. In the case of a medical condition such as a genetic anomaly, it is likely that the test would not be applied randomly, but would only be ordered if there were other symptoms suggesting the anomaly. How would this affect the results?
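As one way to start exploring question 3, the sketch below applies Bayes' Theorem repeatedly, using each posterior as the prior for the next test. It rests on a strong assumption that is exactly what the question asks you to examine: that repeated tests on the same person are independent. If a false positive has a systematic cause, they are not, and these numbers would be too optimistic. Function names are hypothetical.

```python
def update_on_positive(prior, sensitivity, false_positive_rate):
    """Posterior probability of having the anomaly after one more positive result."""
    numerator = sensitivity * prior
    return numerator / (numerator + false_positive_rate * (1.0 - prior))

def posterior_after_positives(prior, n_tests, sensitivity=0.99, false_positive_rate=0.01):
    """Posterior after each of n_tests positive results, ASSUMING independent tests."""
    history = []
    p = prior
    for _ in range(n_tests):
        p = update_on_positive(p, sensitivity, false_positive_rate)
        history.append(p)
    return history

for n, p in enumerate(posterior_after_positives(1e-5, 3), start=1):
    print(n, p)
```

Under the independence assumption, one positive result leaves the probability below 1 in 1000, a second raises it to roughly 9%, and a third to roughly 90%.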
Another example:
Suppose that Tom, having had too much time on his hands while an undergraduate Philosophy major, through much practice at prestidigitation, got to the point where if he flipped a coin, his flips would have the probabilities:

P(h) = 0.7, P(t) = 0.3.

Now suppose further that you are brought into a room with 10 people in it, including Tom, and on a table is a coin showing heads. You are told further that one of the 10 people was chosen at random, that the chosen person flipped the coin and put it on the table, and that research shows that the overall average for the 10 people each flipping coins many times is:

P(h) = 0.52, P(t) = 0.48.

What is the probability that it was Tom who flipped the coin?
By Bayes' Theorem, we can calculate:

$$P(Tom|h) = \frac{P(h|Tom) P(Tom)}{P(h)} = \frac{0.7 \times 0.1}{0.52} = 0.1346.$$

Note that this estimate revises our a priori estimate of the probability of Tom being the flipper up from 0.10.
This process (revising estimated probability) of course depends in a critical way on having a priori estimates in the first place ...
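The coin example in a few lines of Python, as a sketch. The variable names are made up here; the last two lines are an extra consistency check, not part of the original example, showing what the stated numbers imply about the other nine people.

```python
def bayes_posterior(prior, likelihood, evidence):
    """P(H | E) = P(E | H) * P(H) / P(E)."""
    return likelihood * prior / evidence

p_tom = 0.1           # one of 10 people chosen at random
p_heads_tom = 0.7     # Tom's practiced flip
p_heads = 0.52        # overall average for the group

p_tom_given_heads = bayes_posterior(p_tom, p_heads_tom, p_heads)
print(p_tom_given_heads)   # about 0.1346

# Consistency check: 0.52 = 0.1 * 0.7 + 0.9 * x gives the others' average P(h).
p_heads_others = (p_heads - p_tom * p_heads_tom) / (1.0 - p_tom)
print(p_heads_others)      # about 0.5, i.e. the other nine flip fairly on average
```

Had the coin shown tails instead, the same function with likelihood 0.3 and evidence 0.48 would revise the estimate downward from 0.10.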

Analog channels Top

The part of Shannon's work we have looked at so far deals with discrete (or digital) signalling systems. There are related ideas for continuous (or analog) systems. What follows gives a brief hint of some of the ideas, without much detail.
Suppose we have a signalling system using band-limited signals (i.e., the frequencies of the transmissions are restricted to lie within some specified range). Let us call the bandwidth W. Let us further assume we are transmitting signals of duration T. In order to reconstruct a given signal, we will need 2WT samples of the signal. Thus, if we are sending continuous signals, each signal can be represented by 2WT numbers x_i, taken at equal intervals. We can associate with each signal an energy, given by:

$$E = \frac{1}{2W} \sum_{i=1}^{2WT} x_i^2.$$

The distance of the signal (from the origin) will be

$$r = \left(\sum x_i^2\right)^{1/2} = (2WE)^{1/2}$$

We can define the signal power to be the average energy:

$$S = \frac{E}{T}.$$

Then the radius of the sphere of transmitted signals will be:

$$r = (2WST)^{1/2}.$$

Each signal will be disturbed by the noise in the channel. If we measure the power of the noise N added by the channel, the disturbed signal will lie in a sphere around the original signal of radius (2WNT)^1/2. Thus the original sphere must be enlarged to a larger radius to enclose the disturbed signals. The new radius will be:

$$r = (2WT(S + N))^{1/2}.$$

In order to use the channel effectively and minimize error (misreading of signals), we will want to put the signals in the sphere, and separate them as much as possible (and have the distance between the signals at least twice what the noise contributes ...). We thus want to divide the sphere up into sub-spheres of radius = (2WNT)^1/2. From this, we can get an upper bound on the number M of possible messages that we can reliably distinguish. We can use the formula for the volume of an n-dimensional sphere:

$$V(r, n) = \frac{\pi^{n/2} r^n}{\Gamma(n/2 + 1)}.$$

We have the bound (with n = 2WT):

$$M \le \frac{\pi^{WT} (2WT(S + N))^{WT}}{\Gamma(WT + 1)} \cdot \frac{\Gamma(WT + 1)}{\pi^{WT} (2WTN)^{WT}} = \left(1 + \frac{S}{N}\right)^{WT}$$

The information sent is the log of the number of messages sent (assuming they are equally likely), and hence:

$$I = \log(M) = WT \log\left(1 + \frac{S}{N}\right),$$

and the rate at which information is sent will be:

$$W \log\left(1 + \frac{S}{N}\right).$$

We thus have the usual signal/noise formula for channel capacity ...
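As a quick numerical illustration, the formula can be evaluated directly. Taking the logarithm base 2 (so that the result is in bits per second) and the example figures below are assumptions chosen here; they do not come from the notes.

```python
from math import log2

def channel_capacity(bandwidth_hz, signal_power, noise_power):
    """W * log2(1 + S/N): capacity in bits per second for bandwidth W in Hz."""
    if bandwidth_hz <= 0 or signal_power < 0 or noise_power <= 0:
        raise ValueError("bandwidth and noise power must be positive, signal power non-negative")
    return bandwidth_hz * log2(1.0 + signal_power / noise_power)

# Hypothetical example: a 3000 Hz band with a signal-to-noise power ratio of 1000 (30 dB).
print(channel_capacity(3000.0, 1000.0, 1.0))

# Doubling the bandwidth doubles the capacity; doubling the signal power helps far less.
print(channel_capacity(6000.0, 1000.0, 1.0))
print(channel_capacity(3000.0, 2000.0, 1.0))
```

Only the ratio S/N matters, so the two powers can be given in any common unit.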
An amusing little side light: ``Random'' band-limited natural phenomena typically display a power spectrum that obeys a power law of the general form 1/f^a. On the other hand, from what we have seen, if we want to use a channel optimally, we should have essentially equal power at all frequencies in the band. This means that a possible way to engage in SETI (the search for extra-terrestrial intelligence) will be to look for bands in which there is white noise! White noise is likely to be the signature of (intelligent) optimal use of a channel ...


References

[1]: Brillouin, L., Science and information theory, Academic Press, New York, 1956.
[2]: Brooks, Daniel R., and Wiley, E. O., Evolution as Entropy, Toward a Unified Theory of Biology, Second Edition, University of Chicago Press, Chicago, 1988.
[3]: Campbell, Jeremy, Grammatical Man, Information, Entropy, Language, and Life, Simon and Schuster, New York, 1982.
[4]: Cover, T. M., and Thomas J. A., Elements of Information Theory, John Wiley and Sons, New York, 1991.
[5]: DeLillo, Don, White Noise, Viking/Penguin, New York, 1984.
[6]: Feller, W., An Introduction to Probability Theory and Its Applications, Wiley, New York, 1957.
[7]: Feynman, Richard, Feynman lectures on computation, Addison-Wesley, Reading, 1996.
[8]: Gatlin, L. L., Information Theory and the Living System, Columbia University Press, New York, 1972.
[9]: Haken, Hermann, Information and Self-Organization, a Macroscopic Approach to Complex Systems, Springer-Verlag, Berlin/New York, 1988.
[10]: Hamming, R. W., Error detecting and error correcting codes, Bell Syst. Tech. J. 29 147, 1950.
[11]: Hamming, R. W., Coding and information theory, 2nd ed, Prentice-Hall, Englewood Cliffs, 1986.
[12]: Hill, R., A first course in coding theory, Clarendon Press, Oxford, 1986.
[13]: Hodges, A., Alan Turing: the enigma, Vintage, London, 1983.
[14]: Hofstadter, Douglas R., Metamagical Themas: Questing for the Essence of Mind and Pattern, Basic Books, New York, 1985.
[15]: Jones, D. S., Elementary information theory, Clarendon Press, Oxford, 1979.
[16]: Knuth, Eldon L., Introduction to Statistical Thermodynamics, McGraw-Hill, New York, 1966.
[17]: Landauer, R., Information is physical, Phys. Today, May 1991 23-29.
[18]: Landauer, R., The physical nature of information, Phys. Lett. A, 217 188, 1996.
[19]: van Lint, J. H., Coding Theory, Springer-Verlag, New York/Berlin, 1982.
[20]: Lipton, R. J., Using DNA to solve NP-complete problems, Science, 268 542-545, Apr. 28, 1995.
[21]: MacWilliams, F. J., and Sloane, N. J. A., The theory of error correcting codes, Elsevier Science, Amsterdam, 1977.
[22]: Martin, N. F. G., and England, J. W., Mathematical Theory of Entropy, Addison-Wesley, Reading, 1981.
[23]: Maxwell, J. C., Theory of heat, Longmans, Green and Co, London, 1871.
[24]: von Neumann, John, Probabilistic logic and the synthesis of reliable organisms from unreliable components, in Automata Studies (Shannon, McCarthy, eds.), 1956.
[25]: Papadimitriou, C. H., Computational Complexity, Addison-Wesley, Reading, 1994.
[26]: Pierce, John R., An Introduction to Information Theory - Symbols, Signals and Noise, (second revised edition), Dover Publications, New York, 1980.
[27]: Roman, Steven, Introduction to Coding and Information Theory, Springer-Verlag, Berlin/New York, 1997.
[28]: Sampson, Jeffrey R., Adaptive Information Processing, an Introductory Survey, Springer-Verlag, Berlin/New York, 1976.
[29]: Schroeder, Manfred, Fractals, Chaos, Power Laws, Minutes from an Infinite Paradise, W. H. Freeman, New York, 1991.
[30]: Shannon, C. E., A mathematical theory of communication, Bell Syst. Tech. J. 27 379; also p. 623, 1948.
[31]: Slepian, D., ed., Key papers in the development of information theory, IEEE Press, New York, 1974.
[32]: Turing, A. M., On computable numbers, with an application to the Entscheidungsproblem, Proc. Lond. Math. Soc. Ser. 2 42, 230 ; see also Proc. Lond. Math. Soc. Ser. 2 43, 544, 1936.
[33]: Zurek, W. H., Thermodynamic cost of computation, algorithmic complexity and the information metric, Nature 341 119-124, 1989.

File translated from T_EX by T_TH, version 2.25.
On 30 May 2002, 00:41.
