\section{Classical information theory}  \lab{s:cit}

This and the next section will summarise the classical theory of information and computing. This is textbook material (Minsky 1967, Hamming 1986) but is included here since it forms a background to quantum information and computing, and the article is aimed at physicists to whom the ideas may be new.

\subsection{Measures of information}  \lab{s:mi}

The most basic problem in classical information theory is to obtain a measure of information, that is, of amount of information. Suppose I tell you the value of a number $X$. How much information have you gained? That will depend on what you already knew about $X$. For example, if you already knew $X$ was equal to $2$, you would learn nothing, no information, from my revelation. On the other hand, if previously your only knowledge was that $X$ was given by the throw of a die, then to learn its value is to gain information. We have met here a basic paradoxical property, which is that {\em information} is often a measure of {\em ignorance}: the information content (or `self-information') of $X$ is defined to be the information you would gain if you learned the value of $X$.

If $X$ is a random variable which has value $x$ with probability $p(x)$, then the information content of $X$ is defined to be
\beq
S( \{p(x) \} ) = - \sum_x p(x) \log_2 p(x).    \lab{S}
\eeq
Note that the logarithm is taken to base 2, and that $S$ is never negative since probabilities are bounded by $p(x) \le 1$. $S$ is a function of the {\em probability distribution} of values of $X$. It is important to remember this, since in what follows we will adopt the standard practice of using the notation $S(X)$ for $S( \{ p(x) \})$. It is understood that $S(X)$ does not mean a function of $X$, but rather the information content of the variable $X$. The quantity $S(X)$ is also referred to as an entropy, for obvious reasons.
If we already know that $X=2$, then $p(2)=1$ and there are no other terms in the sum, leading to $S=0$, so $X$ has no information content. If, on the other hand, $X$ is given by the throw of a die, then $p(x)=1/6$ for $x \in \{1,2,3,4,5,6\}$ so $S = -\log_2(1/6) \simeq 2.58$. If $X$ can take $N$ different values, then the information content (or entropy) of $X$ is maximised when the probability distribution $p$ is flat, with every $p(x) = 1/N$ (for example a fair die yields $S \simeq 2.58$, but a loaded die with $p(6)=1/2, p(1 \cdots 5)=1/10$ yields $S \simeq 2.16$). This is consistent with the requirement that the information (what we would gain if we learned $X$) is maximum when our prior knowledge of $X$ is minimum.

Thus the maximum information which could in principle be stored by a variable which can take on $N$ different values is $\log_2(N)$. The logarithms are taken to base 2 rather than some other base by convention. The choice dictates the unit of information: $S(X) = 1$ when $X$ can take two values with equal probability. A two-valued or binary variable thus can contain one unit of information. This unit is called a {\em bit}. The two values of a bit are typically written as the binary digits 0 and 1.

In the case of a binary variable, we can define $p$ to be the probability that $X=1$; then the probability that $X=0$ is $1-p$ and the information can be written as a function of $p$ alone:
\beq
H(p) = -p \log_2 p - (1-p) \log_2 (1-p)   \lab{H}
\eeq
This function, which satisfies $0 \le H(p) \le 1$, is called the {\em entropy function}.

In what follows, the subscript 2 will be dropped on logarithms: it is assumed that all logarithms are to base 2 unless otherwise indicated.

The probability that $Y=y$ given that $X=x$ is written $p(y | x)$.
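
As a check on the figures quoted above for the two dice, the entropy of the loaded die can be worked out directly from the definition of $S$. The only input is the distribution already stated, $p(6)=1/2$ and $p(x)=1/10$ for $x=1,\ldots,5$; the logarithms are written explicitly to base 2:
\[
S = -\frac{1}{2}\log_2\frac{1}{2} \; - \; 5 \times \frac{1}{10}\log_2\frac{1}{10}
  = \frac{1}{2} + \frac{1}{2}\log_2 10 \simeq 0.50 + 1.66 = 2.16 .
\]
This is indeed smaller than the fair-die value $\log_2 6 \simeq 2.58$, as expected for a distribution that is not flat.
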
The {\em conditional entropy} $S(Y | X)$ is defined by
\begin{eqnarray}
S(Y | X) &=& -\sum_x p(x) \sum_y p(y | x) \log p(y | x) \lab{SYgX} \\
	 &=& -\sum_x \sum_y p(x,y) \log p(y | x)
\end{eqnarray}
where the second line is deduced using $p(x,y) = p(x) p(y | x)$ (this is the probability that $X=x$ {\em and} $Y=y$). By inspection of the definition, we see that $S(Y | X)$ is a measure of how much information on average would remain in $Y$ if we were to learn $X$. Note that $S(Y | X) \le S(Y)$ always and $S(Y | X) \ne S(X | Y)$ usually.

The conditional entropy is important mainly as a stepping-stone to the next quantity, the {\em mutual information}, defined by
\begin{eqnarray}
I(X: Y) &=& \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x) p(y)} \\
	&=& S(X) - S(X | Y)                    \lab{I}
\end{eqnarray}
From the definition, $I(X:Y)$ is a measure of how much $X$ and $Y$ contain information about each other\footnote{Many authors write $I(X;Y)$ rather than $I(X:Y)$. I prefer the latter since the symmetry of the colon reflects the fact that $I(X:Y) = I(Y:X)$.}. If $X$ and $Y$ are independent then $p(x,y) = p(x) p(y)$ so $I(X:Y) = 0$. The relationships between the basic measures of information are indicated in fig. 3. The reader may like to prove as an exercise that $S(X,Y)$, the information content of $X$ and $Y$ (the information we would gain if, initially knowing neither, we learned the value of both $X$ and $Y$) satisfies $S(X,Y) = S(X) + S(Y) - I(X:Y)$.

Information can disappear, but it cannot spring spontaneously from nowhere. This important fact finds mathematical expression in the {\em data processing inequality}:
\beq
\mbox{if} \;\; X \rightarrow Y \rightarrow Z\;\;\; \mbox{then}\;\;\; I(X:Z) \le I(X:Y).                         \lab{data}
\eeq
The symbol $X \rightarrow Y \rightarrow Z$ means that $X, Y$ and $Z$ form a process (a Markov chain) in which $Z$ depends on $Y$ but not directly on $X$: $p(x,y,z) = p(x) p(y | x) p (z | y)$.
The content of the data processing inequality is that the `data processor' $Y$ can pass on to $Z$ no more information about $X$ than it received.

\subsection{Data compression}   \lab{s:dc}

Having pulled the definition of information content, equation \eq{S}, out of a hat, our aim is now to prove that this is a good measure of information. It is not obvious at first sight even how to think about such a task. One of the main contributions of classical information theory is to provide useful ways to think about information. We will describe a simple situation in order to illustrate the methods. Let us suppose one person, traditionally called Alice, knows the value of $X$, and she wishes to communicate it to Bob. We restrict ourselves to the simple case that $X$ has only two possible values: either `yes' or `no'. We say that Alice is a `source' with an `alphabet' of two symbols. Alice communicates by sending binary digits (noughts and ones) to Bob. We will measure the information content of $X$ by counting how many bits Alice must send, {\em on average}, to allow Bob to learn $X$. Obviously, she could just send 0 for `no' and 1 for `yes', giving a `bit rate' of one bit per $X$ value communicated. However, what if $X$ were an essentially random variable, except that it is more likely to be `no' than `yes'? (Think of the output of decisions from a grant funding body, for example.) In this case, Alice can communicate more efficiently by adopting the following procedure.

Let $p$ be the probability that $X=1$ and $1-p$ be the probability that $X=0$. Alice waits until $n$ values of $X$ are available to be sent, where $n$ will be large. The mean number of ones in such a sequence of $n$ values is $np$, and it is likely that the number of ones in any given sequence is close to this mean. Suppose $np$ is an integer; then the probability of obtaining any given sequence containing $np$ ones is
\beq
p^{np} (1-p)^{n-np} = 2^{-n H(p)}.
\eeqThe reader should satisfy him or herself that the two sides of this equationare indeed equal: the right hand side hints at how the argument can be generalised. Such a sequence is called a {\em typical sequence}. To be specific, we define the set of typical sequences to be all sequences such that   \beq2^{-n( H(p) + \epsilon)} \le p(\mbox{sequence}) \le2^{-n( H(p) - \epsilon)}  \eeqNow, it can be shown that the probability that Alice's $n$ valuesactually form a typical sequence is greater than $1-\epsilon$,for sufficiently large $n$, no matter how small $\epsilon$ is. This implies that Aliceneed not communicate $n$ bits to Bob in order for him to learn $n$decisions. She need only tell Bob {\em which typical sequence} shehas. They must agree together beforehand how the typical sequencesare to be labelled: for example, they may agree to number them in orderof increasing binary value. Alice just sends the label,not the sequence itself. To deduce how well this works, it canbe shown that the typical sequences all have equal probability, and thereare $2^{n H(p)}$ of them. To communicate one of $2^{n H(p)}$ possibilities, clealy Alice must send $n H(p)$ bits. Also, Alicecannot do better than this (i.e. send fewer bits) since thetypical sequences are equiprobable: there is nothing to begained by further manipulating the information. Therefore,the information content of each value of $X$ in the originalsequence must be $H(p)$, which proves \eq{S}.The mathematical details skipped over in the above argumentall stem from the law of large numbers, which states that,given arbitrarily small $\epsilon$, $\delta$  \beqP\left( \left| m - n p \right| < n\epsilon \right) > 1 - \delta  \eeqfor sufficiently large $n$, where $m$ is the number of ones obtained in a sequence of $n$ values. For large enough $n$, the number of ones $m$ will differ from the mean $np$ by an amount arbitrarily small compared to $n$. 
For example, in our case the noughts and ones will be distributed according to the binomial distribution
\begin{eqnarray}
P(n,m) &=& C(n,m) p^m (1-p)^{n-m}        \lab{binom}   \\
  &\simeq& \frac{1}{\sigma \sqrt{2 \pi}} e^{-(m-np)^2 / 2 \sigma^2}
\end{eqnarray}
where the Gaussian form is obtained in the limit $n, np \rightarrow \infty$, with the standard deviation $\sigma = \sqrt{np(1-p)}$, and $C(n,m) = n!/[m!(n-m)!]$.

The above argument has already yielded a significant practical result associated with \eq{S}. This is that to communicate $n$ values of $X$, we need only send $n S(X) \le n$ bits down a communication channel. This idea is referred to as {\em data compression}, and the result is known as {\em Shannon's noiseless coding theorem}.

The typical sequences idea has given a means to calculate information content, but it is not the best way to compress information in practice, because Alice must wait for a large number of decisions to accumulate before she communicates anything to Bob. A better method is for Alice to accumulate a few decisions, say 4, and communicate this as a single `message' as best she can. Huffman derived an optimal method whereby Alice sends short strings to communicate the most likely messages, and longer ones to communicate the least likely messages; see table 1 for an example. The translation process is referred to as `encoding' and `decoding' (fig. 4); this terminology does not imply any wish to keep information secret.

For the case $p=1/4$ Shannon's noiseless coding theorem tells us that the best possible data compression technique would communicate each message of four $X$ values by sending on average $4 H(1/4) \simeq 3.245$ bits. The Huffman code in table 1 gives on average 3.273 bits per message. This is quite close to the minimum, showing that practical methods like Huffman's are powerful.

Data compression is a concept of great practical importance.
It is used in telecommunications, for example to compress the information required to convey television pictures, and data storage in computers. From the point of view of an engineer designing a communication channel, data compression can appear miraculous. Suppose we have set up a telephone link to a mountainous area, but the communication rate is not high enough to send, say, the pixels of a live video image. The old-style engineering option would be to replace the telephone link with a faster one, but information theory suggests instead the possibility of using the same link, but adding data processing at either end (data compression and decompression). It comes as a great surprise that the usefulness of a cable can thus be improved by tinkering with the information instead of the cable.

\subsection{The binary symmetric channel}  \lab{s:bin}

So far we have considered the case of communication down a perfect, i.e. noise-free channel. We have gained two main results of practical value: a measure of the best possible data compression (Shannon's noiseless coding theorem), and a practical method to compress data (Huffman coding). We now turn to the important question of communication in the presence of noise. As in the last section, we will analyse the simplest case in order to illustrate principles which are in fact more general.

Suppose we have a binary channel, i.e. one which allows Alice to send noughts and ones to Bob. The noise-free channel conveys $0 \rightarrow 0$ and $1 \rightarrow 1$, but a noisy channel might sometimes cause 0 to become 1 and vice versa. There is an infinite variety of different types of noise. For example, the erroneous `bit flip' $0 \rightarrow 1$ might be just as likely as $1 \rightarrow 0$, or the channel might have a tendency to `relax' towards 0, in which case $1 \rightarrow 0$ happens but $0 \rightarrow 1$ does not. Also, such errors might occur independently from bit to bit, or occur in bursts.
A very important type of noise is one which affects different bits independently, and causes both $0 \rightarrow 1$ and $1 \rightarrow 0$ errors. This is important because it captures the essential features of many processes encountered in realistic situations. If the two errors $0 \rightarrow 1$ and $1 \rightarrow 0$ are equally likely, then the noisy channel is called a `binary symmetric channel'.

The binary symmetric channel has a single parameter, $p$, which is the error probability per bit sent.

Suppose the message sent into the channel by Alice is $X$, and the noisy message which Bob receives is $Y$. Bob is then faced with the task of deducing $X$ as best he can from $Y$. If $X$ consists of a single bit, then Bob will make use of the conditional probabilities
\begin{eqnarray*}
p(x=0 | y=0) = p(x=1 | y=1) = 1-p \\
p(x=0 | y=1) = p(x=1 | y=0) = p
\end{eqnarray*}
giving $S(X | Y) = H(p)$ using equations (\ref{SYgX}) and (\ref{H}). Therefore, from the definition \eq{I} of mutual information, we have
\beq
I(X:Y) = S(X) - H(p)   \lab{Ip}
\eeq
Clearly, the presence of noise in the channel limits the information about Alice's $X$ contained in Bob's received $Y$. Also, because of the data processing inequality, equation (\ref{data}), Bob cannot increase his information about $X$ by manipulating $Y$. However, \eq{Ip} shows that Alice and Bob can communicate better if $S(X)$ is large. The general insight is that the information communicated depends both on the source and the properties of the channel. It would be useful to have a measure of the channel alone, to tell us how well it conveys information.
This quantity is called the {\em capacity} of the channel and it is defined to be the maximum possible mutual information $I(X:Y)$ between the input and output of the channel, maximised over all possible sources:
\beq
\mbox{Channel capacity}\;\; C \equiv \max_{\{p(x)\}} I(X:Y)
\lab{C}
\eeq
Channel capacity is measured in units of `bits out per symbol in' and for binary channels must lie between zero and one.

It is all very well to have a definition, but \eq{C} does not allow us to compare channels very easily, since we have to perform the maximisation over input strategies, which is non-trivial. To establish the capacity $C(p)$ of the binary symmetric channel is a basic problem in information theory, but fortunately this case is quite simple. From equations (\ref{Ip}) and (\ref{C}) one may see that the answer is
\beq
C(p) = 1 - H(p),
\eeq
obtained when $S(X) = 1$ (i.e. $P(x=0) = P(x=1) = 1/2$).

\subsection{Error-correcting codes}   \lab{s:ecc}

So far we have investigated how much information gets through a noisy channel, and how much is lost. Alice cannot convey to Bob more information than $C(p)$ per symbol communicated. However, suppose Bob is busy defusing a bomb and Alice is shouting from a distance which wire to cut: she will not say ``the blue wire'' just once, and hope that Bob heard correctly. She will repeat the message many times, and Bob will wait until he is sure to have got it right. Thus error-free communication can be achieved even over a noisy channel. In this example one obtains the benefit of reduced error rate at the sacrifice of reduced information rate. The next stage of our information theoretic programme is to identify more powerful techniques to circumvent noise (Hamming 1986, Hill 1986, Jones 1979, MacWilliams and Sloane 1977).

We will need the following concepts. The set $\{0, 1\}$ is considered as a field (the Galois field GF(2)) where the operations $+,-,\times,\div$ are carried out modulo 2 (thus, $1+1=0$).
An $n$-bit binary wordis a vector of $n$ components, for example 011 is the vector $(0,1,1)$.A set of such vectors forms a vector space under addition,since for example $011 + 101$ means $(0,1,1)+(1,0,1) = (0+1,1+0,1+1)= (1,1,0) = 110$ by the standard rules of vector addition. This is equivalent to the exclusive-or operation carried out bitwise between the two binary words. The effect of noise on a word $u$ can be expressed $u \rightarrowu' = u + e$, where the error vector $e$ indicates which bits in $u$were flipped by the noise. For example, $u = 1001101 \rightarrowu' = 1101110$ can be expressed $u' = u + 0100011$. An error correcting code $C$ is a set of words such that  \bequ + e \! \ne \! v + f  \;\; \forall u,v \in C\; (u\ne v),\;\; \forall e,f \in E \lab{code}  \eeqwhere $E$ is the set of errors correctable by $C$, which includes thecase of no error, $e=0$. To use such a code, Alice and Bob agreeon which codeword $u$ corresponds to which message, and Aliceonly ever sends codewords down the channel. Since the channelis noisy, Bob receives not $u$ but $u + e$. However, Bob can deduce$u$ unambiguously from $u+e$ since by condition(\ref{code}), no other codeword $v$ sent by Alicecould have caused Bob to receive $u+e$.An example error-correcting code is shown in the right-hand column of table 1. This is a $[7,4,3]$ Hamming code, named after its discoverer. The notation $[n,k,d]$ means that the codewords are $n$ bits long, there are $2^k$ of them, and they all differ from each other in at least $d$ places. Because of the latter feature, the condition (\ref{code}) is satisfied for any error which affects at most one bit. In other words the set $E$ of correctable errors is $\{0000000$,$1000000$,$0100000$,$0010000$, $0001000$,$0000100$,$0000010$,$0000001\}$. Note that $E$ can have at most $2^{n-k}$ members. 
The ratio $k/n$ is called the {\em rate} of the code, since each block of $n$ transmitted bits conveys $k$ bits of information, thus $k/n$ bits per bit.

The parameter $d$ is called the `minimum distance' of the code, and is important when encoding for noise which affects successive bits independently, as in the binary symmetric channel. This is because a code of minimum distance $d$ can correct all errors affecting fewer than $d/2$ bits of the transmitted codeword, and for independent noise this is the {\em most likely} set of errors. In fact, the probability that an $n$-bit word receives $m$ errors is given by the binomial distribution \eq{binom}, so if the code can correct more than the mean number of errors $np$, the correction is highly likely to succeed.

The central result of classical information theory is that powerful error correcting codes exist:
\begin{quote}
Shannon's theorem: If the rate $k/n < C(p)$ and $n$ is sufficiently large, there exists a binary code allowing transmission with an arbitrarily small error probability.
\end{quote}
The error probability here is the probability that an uncorrectable error occurs, causing Bob to misinterpret the received word. Shannon's theorem is highly surprising, since it implies that it is not necessary to engineer very low-noise communication channels, an expensive and difficult task. Instead, we can compensate noise by error correction coding and decoding, that is, by information processing! The meaning of Shannon's theorem is illustrated by fig. 5.

The main problem of coding theory is to identify codes with large rate $k/n$ and large distance $d$. These two conditions are mutually incompatible, so a compromise is needed. The problem is notoriously difficult and has no general solution. To make connection with quantum error correction, we will need to mention one important concept, that of the {\em parity check matrix}. An error correcting code is called linear if it is closed under addition, i.e.
$u + v\in C \; \forall u,v \in C$. Such a code is completely specified byits parity check matrix $H$, which is a set of $(n-k)$ linearly independent $n$-bit wordssatisfying $H \cdot u = 0 \; \forall u \in C$. The important propertyis encapsulated by the following equation:  \beqH \cdot (u + e) = (H \cdot u) + (H \cdot e) = H \cdot e.  \lab{syn}  \eeqThis states that if Bob evaluates $H \cdot u'$ for his noisy received word $u' = u+e$, he will obtain the same answer $H \cdot e$, no matter what word $u$ Alice sent him! If this evaluation were done automatically, Bob could learn $H \cdot e$, called the {\em error syndrome}, without learning $u$. If Bob can deduce the error $e$ from $H \cdot e$, which one can show is possible for all correctable errors, then he can correct the message (by subtracting $e$ from it) without ever learning what it was! In quantum error correction, this is the origin of the reason one can correct a quantum state withoutdisturbing it. \section{Classical theory of computation}We now turn to the theory of computation. This is mostly concerned withthe questions ``what is computable?'' and ``what resources are necessary?''The fundamental resources required for computing are a means to storeand to manipulate symbols. The important questions are such things ashow complicated must the symbols be, how many will we need, howcomplicated must the manipulations be, and how many of them will we need?The general insight is that computation is deemed {\em hard} or inefficientif the amount of resources required rises exponentially with a measureof the size of the problem to be addressed. The size of the problemis given by the amount of {\em information} required to specify theproblem. Applying this idea at the most basic level, we find thata computer must be able to manipulate binary symbols, not justunary symbols\footnote{Unary notation has a single symbol, 1.
The positive integers are written 1,11,111,1111,\ldots}, otherwise the number of memory locations needed would grow exponentially with the amount of information to be manipulated. On the other hand, it is not necessary to work in decimal notation (10 symbols) or any other notation with an `alphabet' of more than two symbols. This greatly simplifies computer design and analysis.

To manipulate $n$ binary symbols, it is not necessary to manipulate them all at once, since it can be shown that any transformation can be brought about by manipulating the binary symbols one at a time or in pairs. A binary `logic gate' takes two bits $x,y$ as inputs, and calculates a function $f(x,y)$. Since $f$ can be 0 or 1, and there are four possible inputs, there are 16 possible functions $f$. This set of 16 different logic gates is called a `universal set', since by combining such gates in series, any transformation of $n$ bits can be carried out. Furthermore, the action of some of the 16 gates can be reproduced by combining others, so we do not need all 16, and in fact only one, the {\sc nand} gate, is necessary ({\sc nand} is {\sc not and}, for which the output is 0 if and only if both inputs are 1).

By concatenating logic gates, we can manipulate $n$-bit symbols (see fig. 6). This general approach is called the network model of computation, and is useful for our purposes because it suggests the model of quantum computation which is currently most feasible experimentally. In this model, the essential components of a computer are a set of bits, many copies of the universal logic gate, and connecting wires.

\subsection{Universal computer; Turing machine}  \lab{s:UTM}

The word `universal' has a further significance in relation to computers. Turing showed that it is possible to construct a {\em universal} computer, which can simulate the action of any other, in the following sense. Let us write $T(x)$ for the output of a Turing machine $T$ (fig.
7) acting on input tape $x$. Now, a Turing machine can be completely specified by writing down how it responds to 0 and 1 on the input tape, for every possible internal configuration of the machine (of which there are a finite number). This specification can itself be written as a binary number $d[T]$. Turing showed that there exists a machine $U$, called a universal Turing machine, with the properties
\beq
U(d[T],x) = T(x)
\eeq
and the number of steps taken by $U$ to simulate each step of $T$ is only a polynomial (not exponential) function of the length of $d[T]$. In other words, if we provide $U$ with an input tape containing both a description of $T$ and the input $x$, then $U$ will compute the same function as $T$ would have done, for {\em any} machine $T$, without an exponential slow-down. To complete the argument, it can be shown that other models of computation, such as the network model, are {\em computationally equivalent} to the Turing model: they permit the same functions to be computed, with the same computational efficiency (see next section).

Thus the concept of the universal machine establishes that a certain finite degree of complexity of construction is sufficient to allow very general information processing. This is the fundamental result of computer science. Indeed, the power of the Turing machine and its cousins is so great that Church (1936) and Turing (1936) framed the ``Church-Turing thesis,'' to the effect that
{\em Every function `which would naturally be regarded as computable' can be computed by the universal Turing machine}.
This thesis is unproven, but has survived many attempts to find a counterexample, making it a very powerful result. To it we owe the versatility of the modern general-purpose computer, since `computable functions' include tasks such as word processing, process control, and so on. The quantum computer, to be described in section \ref{s:uqc}, will throw new light on this central thesis.
\subsection{Computational complexity}   \lab{s:cc}

Once we have established the idea of a universal computer, computational tasks can be classified in terms of their difficulty in the following manner. A given algorithm is deemed to address not just one instance of a problem, such as ``find the square of 237,'' but one class of problem, such as ``given $x$, find its square.'' The amount of information given to the computer in order to specify the problem is $L = \log x$, i.e. the number of bits needed to store the value of $x$. The {\em computational complexity} of the problem is determined by the number of steps $s$ a Turing machine must make in order to complete any algorithmic method to solve the problem. In the network model, the complexity is determined by the number of logic gates required. If an algorithm exists with $s$ given by any polynomial function of $L$ (e.g. $s \propto L^3 + L$) then the problem is deemed tractable and is placed in the complexity class ``{\sc p}''. If $s$ rises exponentially with $L$ (e.g. $s \propto 2^L = x$) then the problem is hard and is in another complexity class. It is often easier to verify a solution, that is, to test whether or not it is correct, than to find one. The class ``{\sc np}'' is the set of problems for which solutions can be verified in polynomial time. Obviously {\sc p} $\subseteq$ {\sc np}, and one would guess that there are problems in {\sc np} which are not in {\sc p} (i.e. {\sc np} $\ne$ {\sc p}), though surprisingly the latter has never been proved, since it is very hard to rule out the possible existence of as yet undiscovered algorithms. However, the important point is that the membership of these classes does not depend on the model of computation, i.e. the physical realisation of the computer, since the Turing machine can simulate any other computer with only a polynomial, rather than exponential slow-down.

An important example of an intractable problem is that of factorisation: given a composite (i.e.
non-prime) number $x$, the task is to find one of its factors. If $x$ is even, or a multiple of any small number, then it is easy to find a factor. The interesting case is when the prime factors of $x$ are all themselves large. In this case there is no known simple method. The best known method, the {\em number field sieve} (Menezes {\em et al.} 1997) requires a number of computational steps of order $s \sim \exp( 2 L^{1/3} (\ln L)^{2/3} )$ where $L = \ln x$ (note the natural logarithms here). By devoting a substantial machine network to this task, one can today factor a number of 130 decimal digits (Crandall 1997), i.e. $L \simeq 300$, giving $s \sim 10^{18}$. This is time-consuming but possible (for example 42 days at $10^{12}$ operations per second). However, if we double $L$, $s$ increases to $\sim 10^{25}$, so now the problem is intractable: it would take a million years with current technology, or would require computers running a million times faster than current ones. The lesson is an important one: a computationally `hard' problem is one which in practice is not merely difficult but impossible to solve.

The factorisation problem has acquired great practical importance because it is at the heart of widely used cryptographic systems such as that of Rivest, Shamir and Adleman (1979) (see Hellman 1979). For, given a message $M$ (in the form of a long binary number), it is easy to calculate an encrypted version $E = M^s \;{\rm mod}\;c$ where $s$ and $c$ are well-chosen large integers which can be made public. To decrypt the message, the receiver calculates $E^t \;{\rm mod}\; c$ which is equal to $M$ for a value of $t$ which can be quickly deduced from $s$ and the factors of $c$ (Schroeder 1984). In practice $c=pq$ is chosen to be the product of two large primes $p,q$ known only to the user who published $c$, so only that user can read the messages---unless someone manages to factorise $c$.
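
A toy example may make the procedure concrete. The numbers are chosen here purely for illustration and are far too small to be secure; the relation $st = 1 \;{\rm mod}\; (p-1)(q-1)$ used to find $t$ is the standard one for this scheme, but it is quoted as an assumption rather than derived in the text. Take $p=3$ and $q=11$, so that $c=33$ and $(p-1)(q-1)=20$. Choosing $s=3$ gives $t=7$, since $3 \times 7 = 21 = 1 \;{\rm mod}\; 20$. For the message $M=4$,
\begin{eqnarray*}
E &=& 4^3 \;{\rm mod}\; 33 = 64 \;{\rm mod}\; 33 = 31, \\
E^t \;{\rm mod}\; c &=& 31^7 \;{\rm mod}\; 33 = (-2)^7 \;{\rm mod}\; 33 = -128 \;{\rm mod}\; 33 = 4 = M,
\end{eqnarray*}
where the second line uses $31 = -2 \;{\rm mod}\; 33$. Anyone may encrypt using the public pair $c=33$, $s=3$, but finding $t$ requires $(p-1)(q-1)$ and hence the factors of $c$, which is trivial here and infeasible when $p$ and $q$ are large.
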
It is a very useful feature that no secret keys need be distributed in such a system: the `key' $c,s$ allowing encryption is public knowledge.

\subsection{Uncomputable functions}  \label{s:halting}

There is an even stronger way in which a task may be impossible for a computer. In the quest to solve some problem, we could `live with' a slow algorithm, but what if one does not exist at all? Such problems are termed {\em uncomputable}. The most important example is the ``halting problem'', a rather beautiful result.

A feature of computers familiar to programmers is that they may sometimes be thrown into a never-ending loop. Consider, for example, the instruction ``while $x > 2$, divide $x$ by 1'' for $x$ initially greater than 2. We can see that this algorithm will never halt, without actually running it. More interesting from a mathematical point of view is an algorithm such as ``while $x$ is equal to the sum of two primes, add 2 to $x$, otherwise print $x$ and halt'', beginning at $x=8$. The algorithm is certainly feasible since all pairs of primes less than $x$ can be found and added systematically. Will such an algorithm ever halt? If so, then a counterexample to the Goldbach conjecture exists. Using such techniques, a vast section of mathematical and physical theory could be reduced to the question ``would such and such an algorithm halt if we were to run it?'' If we could find a general way to establish whether or not algorithms will halt, we would have an extremely powerful mathematical tool. In a certain sense, it would solve all of mathematics!

Let us suppose that it is possible to find a general algorithm which will work out whether any Turing machine will halt on any input. Such an algorithm solves the problem ``given $x$ and $d[T]$, would Turing machine $T$ halt if it were fed $x$ as input?''. Here $d[T]$ is the description of $T$.
If such an algorithm exists, then it is possible to make a Turing machine $T_H$ which halts if and only if $T( d[T] )$ does not halt, where $d[T]$ is the description of $T$. Here $T_H$ takes as input $d[T]$, which is sufficient to tell $T_H$ about both the Turing machine $T$ and the input to $T$. Hence we have
\beq
T_H( d[T] ) \;\;\mbox{halts} \leftrightarrow T( d[T] ) \;\;\mbox{does not halt}
\eeq
So far everything is ok. However, what if we feed $T_H$ the description of itself, $d[T_H]$? Then
\beq
T_H\left( d[T_H] \right) \;\;\mbox{halts} \leftrightarrow T_H\left( d[T_H] \right) \;\; \mbox{does not halt}
\eeq
which is a contradiction. By this argument Turing showed that there is no automatic means to establish whether Turing machines will halt in general: the ``halting problem'' is uncomputable. This implies that mathematics, and information processing in general, is a rich body of different ideas which cannot all be summarised in one grand algorithm. This liberating observation is closely related to G\"odel's theorem.