\section{Classical information theory} \lab{s:cit}

This and the next section will summarise the classical theory of information and computing. This is textbook material (Minsky 1967, Hamming 1986) but is included here since it forms a background to quantum information and computing, and the article is aimed at physicists to whom the ideas may be new.

\subsection{Measures of information} \lab{s:mi}

The most basic problem in classical information theory is to obtain a measure of information, that is, of amount of information. Suppose I tell you the value of a number $X$. How much information have you gained? That will depend on what you already knew about $X$. For example, if you already knew $X$ was equal to $2$, you would learn nothing, no information, from my revelation. On the other hand, if previously your only knowledge was that $X$ was given by the throw of a die, then to learn its value is to gain information. We have met here a basic paradoxical property, which is that {\em information} is often a measure of {\em ignorance}: the information content (or `self-information') of $X$ is defined to be the information you would gain if you learned the value of $X$.

If $X$ is a random variable which has value $x$ with probability $p(x)$, then the information content of $X$ is defined to be
\beq S( \{p(x) \} ) = - \sum_x p(x) \log_2 p(x). \lab{S} \eeq
Note that the logarithm is taken to base 2, and that $S$ is never negative, since probabilities are bounded by $0 \le p(x) \le 1$. $S$ is a function of the {\em probability distribution} of values of $X$. It is important to remember this, since in what follows we will adopt the standard practice of using the notation $S(X)$ for $S( \{ p(x) \})$. It is understood that $S(X)$ does not mean a function of $X$, but rather the information content of the variable $X$. The quantity $S(X)$ is also referred to as an entropy, by analogy with the entropy of statistical mechanics, which has the same mathematical form.
If we already know that $X=2$, then $p(2)=1$ and there are no other terms in the sum, leading to $S=0$, so $X$ has no information content. If, on the other hand, $X$ is given by the throw of a die, then $p(x)=1/6$ for $x \in \{1,2,3,4,5,6\}$ so $S = -\log_2(1/6) \simeq 2.58$. If $X$ can take $N$ different values, then the information content (or entropy) of $X$ is maximised when the probability distribution $p$ is flat, with every $p(x) = 1/N$ (for example a fair die yields $S \simeq 2.58$, but a loaded die with $p(6)=1/2$, $p(1) = \cdots = p(5)=1/10$ yields $S \simeq 2.16$). This is consistent with the requirement that the information (what we would gain if we learned $X$) is maximum when our prior knowledge of $X$ is minimum. Thus the maximum information which could in principle be stored by a variable which can take on $N$ different values is $\log_2(N)$.

The logarithms are taken to base 2 rather than some other base by convention. The choice dictates the unit of information: $S(X) = 1$ when $X$ can take two values with equal probability. A two-valued or binary variable thus can contain one unit of information. This unit is called a {\em bit}. The two values of a bit are typically written as the binary digits 0 and 1.

In the case of a binary variable, we can define $p$ to be the probability that $X=1$; then the probability that $X=0$ is $1-p$ and the information can be written as a function of $p$ alone:
\beq H(p) = -p \log_2 p - (1-p) \log_2 (1-p) \lab{H} \eeq
This function is called the {\em entropy function}; it satisfies $0 \le H(p) \le 1$. In what follows, the subscript 2 will be dropped on logarithms: it is assumed that all logarithms are to base 2 unless otherwise indicated.

The probability that $Y=y$ given that $X=x$ is written $p(y | x)$.
The {\em conditional entropy} $S(Y | X)$ is defined by \begin{eqnarray} S(Y | X) &=& -\sum_x p(x) \sum_y p(y | x) \log p(y | x) \lab{SYgX} \\ &=& -\sum_x \sum_y p(x,y) \log p(y | x) \end{eqnarray} where the second line is deduced using $p(x,y) = p(x) p(y | x)$ (this is the probability that $X=x$ {\em and} $Y=y$). By inspection of the definition, we see that $S(Y | X)$ is a measure of how much information on average would remain in $Y$ if we were to learn $X$. Note that $S(Y | X) \le S(Y)$ always and $S(Y | X) \ne S(X | Y)$ usually. The conditional entropy is important mainly as a stepping-stone to the next quantity, the {\em mutual information}, defined by \begin{eqnarray} I(X: Y) &=& \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x) p(y)} \\ &=& S(X) - S(X | Y) \lab{I} \end{eqnarray} From the definition, $I(X:Y)$ is a measure of how much $X$ and $Y$ contain information about each other\footnote{Many authors write $I(X;Y)$ rather than $I(X:Y)$. I prefer the latter since the symmetry of the colon reflects the fact that $I(X:Y) = I(Y:X)$.}. If $X$ and $Y$ are independent then $p(x,y) = p(x) p(y)$ so $I(X:Y) = 0$. The relationships between the basic measures of information are indicated in fig. 3. The reader may like to prove as an exercise that $S(X,Y)$, the information content of $X$ and $Y$ (the information we would gain if, initially knowing neither, we learned the value of both $X$ and $Y$) satisfies $S(X,Y) = S(X) + S(Y) - I(X:Y).$ Information can disappear, but it cannot spring spontaneously from nowhere. This important fact finds mathematical expression in the {\em data processing inequality}: \beq \mbox{if} \;\; X \rightarrow Y \rightarrow Z\;\;\; \mbox{then} \;\;\; I(X:Z) \le I(X:Y). \lab{data} \eeq The symbol $X \rightarrow Y \rightarrow Z$ means that $X, Y$ and $Z$ form a process (a Markov chain) in which $Z$ depends on $Y$ but not directly on $X$: $p(x,y,z) = p(x) p(y | x) p (z | y)$. 
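As a concrete check on the definitions of conditional entropy and mutual information given above, consider a small joint distribution, chosen here purely for illustration (it is an assumption of this example and is not used elsewhere): $p(x,y)$ takes the values $p(0,0) = 1/2$, $p(0,1) = 0$, $p(1,0) = p(1,1) = 1/4$. The marginal distributions are $p(x=0) = p(x=1) = 1/2$ and $p(y=0) = 3/4$, $p(y=1) = 1/4$, so $S(X) = 1$ and $S(Y) = H(1/4) \simeq 0.811$, where $H$ is the entropy function of \eq{H} and terms with zero probability contribute nothing to the sums. Applying \eq{SYgX}, and the same formula with the roles of $X$ and $Y$ exchanged, gives
\begin{eqnarray*}
S(Y | X) &=& \frac{1}{2} H(0) + \frac{1}{2} H(1/2) = 0.5, \\
S(X | Y) &=& \frac{3}{4} H(1/3) + \frac{1}{4} H(1) \simeq 0.689, \\
I(X:Y) &=& S(X) - S(X | Y) \simeq 0.311 .
\end{eqnarray*}
This illustrates both $S(Y | X) \ne S(X | Y)$ and $S(X | Y) \le S(X)$. The symmetric form $S(Y) - S(Y | X) \simeq 0.811 - 0.5$ gives the same mutual information, and direct evaluation of the joint entropy, $S(X,Y) = \frac{1}{2} \log 2 + 2 \times \frac{1}{4} \log 4 = 1.5$, agrees with $S(X) + S(Y) - I(X:Y) \simeq 1 + 0.811 - 0.311$.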
The content of the data processing inequality is that the `data processor' $Y$ can pass on to $Z$ no more information about $X$ than it received. \subsection{Data compression} \lab{s:dc} Having pulled the definition of information content, equation \eq{S}, out of a hat, our aim is now to prove that this is a good measure of information. It is not obvious at first sight even how to think about such a task. One of the main contributions of classical information theory is to provide useful ways to think about information. We will describe a simple situation in order to illustrate the methods. Let us suppose one person, traditionally called Alice, knows the value of $X$, and she wishes to communicate it to Bob. We restrict ourselves to the simple case that $X$ has only two possible values: either `yes' or `no'. We say that Alice is a `source' with an `alphabet' of two symbols. Alice communicates by sending binary digits (noughts and ones) to Bob. We will measure the information content of $X$ by counting how many bits Alice must send, {\em on average}, to allow Bob to learn $X$. Obviously, she could just send 0 for `no' and 1 for `yes', giving a `bit rate' of one bit per $X$ value communicated. However, what if $X$ were an essentially random variable, except that it is more likely to be `no' than `yes'? (think of the output of decisions from a grant funding body, for example). In this case, Alice can communicate more efficiently by adopting the following procedure. Let $p$ be the probability that $X=1$ and $1-p$ be the probability that $X=0$. Alice waits until $n$ values of $X$ are available to be sent, where $n$ will be large. The mean number of ones in such a sequence of $n$ values is $np$, and it is likely that the number of ones in any given sequence is close to this mean. Suppose $np$ is an integer, then the probability of obtaining any given sequence containing $np$ ones is \beq p^{np} (1-p)^{n-np} = 2^{-n H(p)}. 
\eeq
The reader should satisfy him or herself that the two sides of this equation are indeed equal: the right hand side hints at how the argument can be generalised. Such a sequence is called a {\em typical sequence}. To be specific, we define the set of typical sequences to be all sequences such that
\beq 2^{-n( H(p) + \epsilon)} \le p(\mbox{sequence}) \le 2^{-n( H(p) - \epsilon)} \eeq
Now, it can be shown that the probability that Alice's $n$ values actually form a typical sequence is greater than $1-\epsilon$, for sufficiently large $n$, no matter how small $\epsilon$ is. This implies that Alice need not communicate $n$ bits to Bob in order for him to learn $n$ decisions. She need only tell Bob {\em which typical sequence} she has. They must agree together beforehand how the typical sequences are to be labelled: for example, they may agree to number them in order of increasing binary value. Alice just sends the label, not the sequence itself. To deduce how well this works, it can be shown that the typical sequences all have very nearly equal probability, and there are about $2^{n H(p)}$ of them. To communicate one of $2^{n H(p)}$ possibilities, clearly Alice must send $n H(p)$ bits. Also, Alice cannot do better than this (i.e. send fewer bits) since the typical sequences are equiprobable: there is nothing to be gained by further manipulating the information. Therefore, the information content of each value of $X$ in the original sequence must be $H(p)$, which proves \eq{S}.

The mathematical details skipped over in the above argument all stem from the law of large numbers, which states that, given arbitrarily small $\epsilon, \delta > 0$,
\beq P\left( \left| m - n p \right| < n\epsilon \right) > 1 - \delta \eeq
for sufficiently large $n$, where $m$ is the number of ones obtained in a sequence of $n$ values. For large enough $n$, the number of ones $m$ will differ from the mean $np$ by an amount arbitrarily small compared to $n$.
For example, in our case the noughts and ones will be distributed according to the binomial distribution
\begin{eqnarray} P(n,m) &=& C(n,m) p^m (1-p)^{n-m} \lab{binom} \\ &\simeq& \frac{1}{\sigma \sqrt{2 \pi}} e^{-(m-np)^2 / 2 \sigma^2} \end{eqnarray}
where the Gaussian form is obtained in the limit $n, np \rightarrow \infty$, with the standard deviation $\sigma = \sqrt{np(1-p)}$, and $C(n,m) = n!/[m!(n-m)!]$.

The above argument has already yielded a significant practical result associated with \eq{S}. This is that to communicate $n$ values of $X$, we need only send $n S(X) \le n$ bits down a communication channel. This idea is referred to as {\em data compression}, and is also called {\em Shannon's noiseless coding theorem}.

The typical sequences idea has given a means to calculate information content, but it is not the best way to compress information in practice, because Alice must wait for a large number of decisions to accumulate before she communicates anything to Bob. A better method is for Alice to accumulate a few decisions, say 4, and communicate this as a single `message' as best she can. Huffman derived an optimal method whereby Alice sends short strings to communicate the most likely messages, and longer ones to communicate the least likely messages; see table 1 for an example. The translation process is referred to as `encoding' and `decoding' (fig. 4); this terminology does not imply any wish to keep information secret. For the case $p=1/4$ Shannon's noiseless coding theorem tells us that the best possible data compression technique would communicate each message of four $X$ values by sending on average $4 H(1/4) \simeq 3.245$ bits. The Huffman code in table 1 gives on average 3.273 bits per message. This is quite close to the minimum, showing that practical methods like Huffman's are powerful.

Data compression is a concept of great practical importance.
It is used in telecommunications, for example to compress the information required to convey television pictures, and data storage in computers. From the point of view of an engineer designing a communication channel, data compression can appear miraculous. Suppose we have set up a telephone link to a mountainous area, but the communication rate is not high enough to send, say, the pixels of a live video image. The old-style engineering option would be to replace the telephone link with a faster one, but information theory suggests instead the possibility of using the same link, but adding data processing at either end (data compression and decompression). It comes as a great surprise that the usefulness of a cable can thus be improved by tinkering with the information instead of the cable. \subsection{The binary symmetric channel} \lab{s:bin} So far we have considered the case of communication down a perfect, i.e. noise-free channel. We have gained two main results of practical value: a measure of the best possible data compression (Shannon's noiseless coding theorem), and a practical method to compress data (Huffman coding). We now turn to the important question of communication in the presence of noise. As in the last section, we will analyse the simplest case in order to illustrate principles which are in fact more general. Suppose we have a binary channel, i.e. one which allows Alice to send noughts and ones to Bob. The noise-free channel conveys $0 \rightarrow 0$ and $1 \rightarrow 1$, but a noisy channel might sometimes cause 0 to become 1 and vice versa. There is an infinite variety of different types of noise. For example, the erroneous `bit flip' $0 \rightarrow 1$ might be just as likely as $1 \rightarrow 0$, or the channel might have a tendency to `relax' towards 0, in which case $1 \rightarrow 0$ happens but $0 \rightarrow 1$ does not. Also, such errors might occur independently from bit to bit, or occur in bursts. 
A very important type of noise is one which affects different bits independently, and causes both $0 \rightarrow 1$ and $1 \rightarrow 0$ errors. This is important because it captures the essential features of many processes encountered in realistic situations. If the two errors $0 \rightarrow 1$ and $1 \rightarrow 0$ are equally likely, then the noisy channel is called a `binary symmetric channel'. The binary symmetric channel has a single parameter, $p$, which is the error probability per bit sent. Suppose the message sent into the channel by Alice is $X$, and the noisy message which Bob receives is $Y$. Bob is then faced with the task of deducing $X$ as best he can from $Y$. If $X$ consists of a single bit, then Bob will make use of the conditional probabilities \begin{eqnarray*} p(x=0 | y=0) = p(x=1 | y=1) = 1-p \\ p(x=0 | y=1) = p(x=1 | y=0) = p \end{eqnarray*} giving $S(X | Y) = H(p)$ using equations (\ref{SYgX}) and (\ref{H}). Therefore, from the definition \eq{I} of mutual information, we have \beq I(X:Y) = S(X) - H(p) \lab{Ip} \eeq Clearly, the presence of noise in the channel limits the information about Alice's $X$ contained in Bob's received $Y$. Also, because of the data processing inequality, equation (\ref{data}), Bob cannot increase his information about $X$ by manipulating $Y$. However, \eq{Ip} shows that Alice and Bob can communicate better if $S(X)$ is large. The general insight is that the information communicated depends both on the source and the properties of the channel. It would be useful to have a measure of the channel alone, to tell us how well it conveys information. 
This quantity is called the {\em capacity} of the channel and it is defined to be the maximum possible mutual information $I(X:Y)$ between the input and output of the channel, maximised over all possible sources:
\beq \mbox{Channel capacity}\;\; C \equiv \max_{\{p(x)\}} I(X:Y) \lab{C} \eeq
Channel capacity is measured in units of `bits out per symbol in' and for binary channels must lie between zero and one. It is all very well to have a definition, but \eq{C} does not allow us to compare channels very easily, since we have to perform the maximisation over input strategies, which is non-trivial. To establish the capacity $C(p)$ of the binary symmetric channel is a basic problem in information theory, but fortunately this case is quite simple. From equations (\ref{Ip}) and (\ref{C}) one may see that the answer is
\beq C(p) = 1 - H(p), \eeq
obtained when $S(X) = 1$ (i.e. $P(x=0) = P(x=1) = 1/2$).

\subsection{Error-correcting codes} \lab{s:ecc}

So far we have investigated how much information gets through a noisy channel, and how much is lost. Alice cannot convey to Bob more information than $C(p)$ per symbol communicated. However, suppose Bob is busy defusing a bomb and Alice is shouting from a distance which wire to cut: she will not say ``the blue wire'' just once, and hope that Bob heard correctly. She will repeat the message many times, and Bob will wait until he is sure to have got it right. Thus error-free communication can be achieved even over a noisy channel. In this example one obtains the benefit of reduced error rate at the sacrifice of reduced information rate. The next stage of our information theoretic programme is to identify more powerful techniques to circumvent noise (Hamming 1986, Hill 1986, Jones 1979, MacWilliams and Sloane 1977).

We will need the following concepts. The set $\{0, 1\}$ is considered as a field (the Galois field GF(2)) where the operations $+,-,\times,\div$ are carried out modulo 2 (thus, $1+1=0$).
An $n$-bit binary word is a vector of $n$ components, for example 011 is the vector $(0,1,1)$. A set of such vectors forms a vector space under addition, since for example $011 + 101$ means $(0,1,1)+(1,0,1) = (0+1,1+0,1+1) = (1,1,0) = 110$ by the standard rules of vector addition. This is equivalent to the exclusive-or operation carried out bitwise between the two binary words. The effect of noise on a word $u$ can be expressed $u \rightarrow u' = u + e$, where the error vector $e$ indicates which bits in $u$ were flipped by the noise. For example, $u = 1001101 \rightarrow u' = 1101110$ can be expressed $u' = u + 0100011$. An error correcting code $C$ is a set of words such that \beq u + e \! \ne \! v + f \;\; \forall u,v \in C\; (u\ne v), \;\; \forall e,f \in E \lab{code} \eeq where $E$ is the set of errors correctable by $C$, which includes the case of no error, $e=0$. To use such a code, Alice and Bob agree on which codeword $u$ corresponds to which message, and Alice only ever sends codewords down the channel. Since the channel is noisy, Bob receives not $u$ but $u + e$. However, Bob can deduce $u$ unambiguously from $u+e$ since by condition (\ref{code}), no other codeword $v$ sent by Alice could have caused Bob to receive $u+e$. An example error-correcting code is shown in the right-hand column of table 1. This is a $[7,4,3]$ Hamming code, named after its discoverer. The notation $[n,k,d]$ means that the codewords are $n$ bits long, there are $2^k$ of them, and they all differ from each other in at least $d$ places. Because of the latter feature, the condition (\ref{code}) is satisfied for any error which affects at most one bit. In other words the set $E$ of correctable errors is $\{0000000$,$1000000$,$0100000$,$0010000$, $0001000$,$0000100$,$0000010$, $0000001\}$. Note that $E$ can have at most $2^{n-k}$ members. 
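The two counting statements just made can be checked by a short argument, sketched here for completeness. Suppose two distinct codewords $u,v$ and two single-bit (or zero) errors $e,f \in E$ gave the same received word, $u+e = v+f$. Adding $v+e$ to both sides, and using the fact that $x + x = 0$ in arithmetic modulo 2, gives
\begin{eqnarray*}
u + v = e + f.
\end{eqnarray*}
The left-hand side has at least $d=3$ non-zero bits, since distinct codewords differ in at least $d$ places, whereas the right-hand side is the sum of two words having at most one non-zero bit each, and so has at most two. The coincidence is therefore impossible and condition (\ref{code}) holds. For the bound on the size of $E$, note that condition (\ref{code}) makes the $2^k$ sets $\{u + e : e \in E\}$, one for each codeword $u$, mutually disjoint. Each contains as many distinct words as $E$ has members, written $|E|$, while only $2^n$ words of $n$ bits exist, so
\begin{eqnarray*}
2^k \, |E| \le 2^n \;\;\; \Rightarrow \;\;\; |E| \le 2^{n-k}.
\end{eqnarray*}
For the $[7,4,3]$ code, $2^{n-k} = 2^3 = 8$, which is exactly the number of errors listed above (no error plus the seven single-bit errors), so this code saturates the bound.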
The ratio $k/n$ is called the {\em rate} of the code, since each block of $n$ transmitted bits conveys $k$ bits of information, thus $k/n$ bits per bit.

The parameter $d$ is called the `minimum distance' of the code, and is important when encoding for noise which affects successive bits independently, as in the binary symmetric channel. This is because a code of minimum distance $d$ can correct all errors affecting fewer than $d/2$ bits of the transmitted codeword, and for independent noise this is the {\em most likely} set of errors. In fact, the probability that an $n$-bit word receives $m$ errors is given by the binomial distribution \eq{binom}, so if the code can correct more than the mean number of errors $np$, the correction is highly likely to succeed.

The central result of classical information theory is that powerful error correcting codes exist:
\begin{quote} Shannon's theorem: If the rate $k/n < C(p)$ and $n$ is sufficiently large, there exists a binary code allowing transmission with an arbitrarily small error probability. \end{quote}
The error probability here is the probability that an uncorrectable error occurs, causing Bob to misinterpret the received word. Shannon's theorem is highly surprising, since it implies that it is not necessary to engineer very low-noise communication channels, an expensive and difficult task. Instead, we can compensate for noise by error correction coding and decoding, that is, by information processing! The meaning of Shannon's theorem is illustrated by fig. 5.

The main problem of coding theory is to identify codes with large rate $k/n$ and large distance $d$. These two conditions are mutually incompatible, so a compromise is needed. The problem is notoriously difficult and has no general solution.

To make connection with quantum error correction, we will need to mention one important concept, that of the {\em parity check matrix}. An error correcting code is called linear if it is closed under addition, i.e.
$u + v \in C \; \forall u,v \in C$. Such a code is completely specified by its parity check matrix $H$, which is a set of $(n-k)$ linearly independent $n$-bit words satisfying $H \cdot u = 0 \; \forall u \in C$. The important property is encapsulated by the following equation:
\beq H \cdot (u + e) = (H \cdot u) + (H \cdot e) = H \cdot e. \lab{syn} \eeq
This states that if Bob evaluates $H \cdot u'$ for his noisy received word $u' = u+e$, he will obtain the same answer $H \cdot e$, no matter what word $u$ Alice sent him! If this evaluation were done automatically, Bob could learn $H \cdot e$, called the {\em error syndrome}, without learning $u$. If Bob can deduce the error $e$ from $H \cdot e$, which one can show is possible for all correctable errors, then he can correct the message (by subtracting $e$ from it) without ever learning what it was! In quantum error correction, this property is the reason one can correct a quantum state without disturbing it.

\section{Classical theory of computation}

We now turn to the theory of computation. This is mostly concerned with the questions ``what is computable?'' and ``what resources are necessary?''

The fundamental resources required for computing are a means to store and to manipulate symbols. The important questions include: how complicated must the symbols be, how many will we need, how complicated must the manipulations be, and how many of them will we need?

The general insight is that computation is deemed {\em hard} or inefficient if the amount of resources required rises exponentially with a measure of the size of the problem to be addressed. The size of the problem is given by the amount of {\em information} required to specify the problem. Applying this idea at the most basic level, we find that a computer must be able to manipulate binary symbols, not just unary symbols\footnote{Unary notation has a single symbol, 1.
The positive integers are written 1,11,111,1111,\ldots}, otherwise the number of memory locations needed would grow exponentially with the amount of information to be manipulated. On the other hand, it is not necessary to work in decimal notation (10 symbols) or any other notation with an `alphabet' of more than two symbols. This greatly simplifies computer design and analysis.

To manipulate $n$ binary symbols, it is not necessary to manipulate them all at once, since it can be shown that any transformation can be brought about by manipulating the binary symbols one at a time or in pairs. A binary `logic gate' takes two bits $x,y$ as inputs, and calculates a function $f(x,y)$. Since $f$ can be 0 or 1, and there are four possible inputs, there are 16 possible functions $f$. This set of 16 different logic gates is called a `universal set', since by combining such gates in series, any transformation of $n$ bits can be carried out. Furthermore, the action of some of the 16 gates can be reproduced by combining others, so we do not need all 16, and in fact only one, the {\sc nand} gate, is necessary ({\sc nand} is {\sc not and}, for which the output is 0 if and only if both inputs are 1).

By concatenating logic gates, we can manipulate $n$-bit symbols (see fig. 6). This general approach is called the network model of computation, and is useful for our purposes because it suggests the model of quantum computation which is currently most feasible experimentally. In this model, the essential components of a computer are a set of bits, many copies of the universal logic gate, and connecting wires.

\subsection{Universal computer; Turing machine} \lab{s:UTM}

The word `universal' has a further significance in relation to computers. Turing showed that it is possible to construct a {\em universal} computer, which can simulate the action of any other, in the following sense. Let us write $T(x)$ for the output of a Turing machine $T$ (fig. 7) acting on input tape $x$.
Now, a Turing machine can be completely specified by writing down how it responds to 0 and 1 on the input tape, for every possible internal configuration of the machine (of which there are a finite number). This specification can itself be written as a binary number $d[T]$. Turing showed that there exists a machine $U$, called a universal Turing machine, with the properties
\beq U(d[T],x) = T(x) \eeq
and the number of steps taken by $U$ to simulate each step of $T$ is only a polynomial (not exponential) function of the length of $d[T]$. In other words, if we provide $U$ with an input tape containing both a description of $T$ and the input $x$, then $U$ will compute the same function as $T$ would have done, for {\em any} machine $T$, without an exponential slow-down.

To complete the argument, it can be shown that other models of computation, such as the network model, are {\em computationally equivalent} to the Turing model: they permit the same functions to be computed, with the same computational efficiency (see next section). Thus the concept of the universal machine establishes that a certain finite degree of complexity of construction is sufficient to allow very general information processing. This is the fundamental result of computer science. Indeed, the power of the Turing machine and its cousins is so great that Church (1936) and Turing (1936) framed the ``Church-Turing thesis,'' to the effect that

{\em Every function `which would naturally be regarded as computable' can be computed by the universal Turing machine}.

This thesis is unproven, but has survived many attempts to find a counterexample, making it a very powerful result. To it we owe the versatility of the modern general-purpose computer, since `computable functions' include tasks such as word processing, process control, and so on. The quantum computer, to be described in section \ref{s:uqc}, will throw new light on this central thesis.
\subsection{Computational complexity} \lab{s:cc}

Once we have established the idea of a universal computer, computational tasks can be classified in terms of their difficulty in the following manner. A given algorithm is deemed to address not just one instance of a problem, such as ``find the square of 237,'' but one class of problem, such as ``given $x$, find its square.'' The amount of information given to the computer in order to specify the problem is $L = \log x$, i.e. the number of bits needed to store the value of $x$. The {\em computational complexity} of the problem is determined by the number of steps $s$ a Turing machine must make in order to complete any algorithmic method to solve the problem. In the network model, the complexity is determined by the number of logic gates required. If an algorithm exists with $s$ given by any polynomial function of $L$ (e.g. $s \propto L^3 + L$) then the problem is deemed tractable and is placed in the complexity class ``{\sc p}''. If $s$ rises exponentially with $L$ (e.g. $s \propto 2^L = x$) then the problem is hard and is in another complexity class. It is often easier to verify a solution, that is, to test whether or not it is correct, than to find one. The class ``{\sc np}'' is the set of problems for which solutions can be verified in polynomial time. Obviously {\sc p} $\subseteq$ {\sc np}, and one would guess that there are problems in {\sc np} which are not in {\sc p} (i.e. {\sc np} $\ne$ {\sc p}), though surprisingly the latter has never been proved, since it is very hard to rule out the possible existence of as yet undiscovered algorithms. However, the important point is that the membership of these classes does not depend on the model of computation, i.e. the physical realisation of the computer, since the Turing machine can simulate any other computer with only a polynomial, rather than exponential, slow-down.

An important example of an apparently intractable problem is that of factorisation: given a composite (i.e.
non-prime) number $x$, the task is to find one of its factors. If $x$ is even, or a multiple of any small number, then it is easy to find a factor. The interesting case is when the prime factors of $x$ are all themselves large. In this case there is no known simple method. The best known method, the {\em number field sieve} (Menezes {\em et al.} 1997) requires a number of computational steps of order $s \sim \exp( 2 L^{1/3} (\ln L)^{2/3} )$ where $L = \ln x$ (natural logarithms are used in this estimate). By devoting a substantial machine network to this task, one can today factor a number of 130 decimal digits (Crandall 1997), i.e. $L \simeq 300$, giving $s \sim 10^{18}$. This is time-consuming but possible (for example 42 days at $10^{12}$ operations per second). However, if we double $L$, $s$ increases to $\sim 10^{25}$, so now the problem is intractable: it would take a million years with current technology, or would require computers running a million times faster than current ones. The lesson is an important one: a computationally `hard' problem is one which in practice is not merely difficult but impossible to solve.

The factorisation problem has acquired great practical importance because it is at the heart of widely used cryptographic systems such as that of Rivest, Shamir and Adleman (1979) (see Hellman 1979). Given a message $M$ (in the form of a long binary number), it is easy to calculate an encrypted version $E = M^s \;{\rm mod}\;c$ where $s$ and $c$ are well-chosen large integers which can be made public. To decrypt the message, the receiver calculates $E^t \;{\rm mod}\; c$ which is equal to $M$ for a value of $t$ which can be quickly deduced from $s$ and the factors of $c$ (Schroeder 1984). In practice $c=pq$ is chosen to be the product of two large primes $p,q$ known only to the user who published $c$, so only that user can read the messages---unless someone manages to factorise $c$.
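A toy example may help to fix ideas. The numbers are chosen here purely for illustration and are far too small to offer any security. Take $p=61$ and $q=53$, so that $c = pq = 3233$, and choose the public exponent $s = 17$. The decryption exponent is the $t$ satisfying $st = 1 \;{\rm mod}\; (p-1)(q-1)$; this is a standard property of the cryptosystem which is assumed here rather than derived. In this case $(p-1)(q-1) = 3120$ and $t = 2753$, since $17 \times 2753 = 46801 = 15 \times 3120 + 1$. For the message $M = 65$ one finds
\begin{eqnarray*}
E &=& 65^{17} \;{\rm mod}\; 3233 = 2790, \\
E^t \;{\rm mod}\; c &=& 2790^{2753} \;{\rm mod}\; 3233 = 65 = M.
\end{eqnarray*}
Finding $t$ required $(p-1)(q-1)$ and hence the factors of $c$. An eavesdropper who knows only $c$ and $s$ has, as far as is known, no general way to obtain $t$ other than to factorise $c$, which is trivial for $c = 3233$ but not when $p$ and $q$ are large.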
It is a very useful feature that no secret keys need be distributed in such a system: the `key' $c,s$ allowing encryption is public knowledge. \subsection{Uncomputable functions} \label{s:halting} There is an even stronger way in which a task may be impossible for a computer. In the quest to solve some problem, we could `live with' a slow algorithm, but what if one does not exist at all? Such problems are termed {\em uncomputable}. The most important example is the ``halting problem'', a rather beautiful result. A feature of computers familiar to programmers is that they may sometimes be thrown into a never-ending loop. Consider, for example, the instruction ``while $x > 2$, divide $x$ by 1'' for $x$ initially greater than 2. We can see that this algorithm will never halt, without actually running it. More interesting from a mathematical point of view is an algorithm such as ``while $x$ is equal to the sum of two primes, add 2 to $x$, otherwise print $x$ and halt'', beginning at $x=8$. The algorithm is certainly feasible since all pairs of primes less than $x$ can be found and added systematically. Will such an algorithm ever halt? If so, then a counterexample to the Goldbach conjecture exists. Using such techniques, a vast section of mathematical and physical theory could be reduced to the question ``would such and such an algorithm halt if we were to run it?'' If we could find a general way to establish whether or not algorithms will halt, we would have an extremely powerful mathematical tool. In a certain sense, it would solve all of mathematics! Let us suppose that it is possible to find a general algorithm which will work out whether any Turing machine will halt on any input. Such an algorithm solves the problem ``given $x$ and $d[T]$, would Turing machine $T$ halt if it were fed $x$ as input?''. Here $d[T]$ is the description of $T$. 
If such an algorithm exists, then it is possible to make a Turing machine $T_H$ which halts if and only if $T( d[T] )$ does not halt, where $d[T]$ is the description of $T$. Here $T_H$ takes as input $d[T]$, which is sufficient to tell $T_H$ about both the Turing machine $T$ and the input to $T$. Hence we have \beq T_H( d[T] ) \;\;\mbox{halts} \leftrightarrow T( d[T] ) \;\; \mbox{does not halt} \eeq So far everything is ok. However, what if we feed $T_H$ the description of itself, $d[T_H]$? Then \beq T_H\left( d[T_H] \right) \;\;\mbox{halts} \leftrightarrow T_H\left( d[T_H] \right) \;\; \mbox{does not halt} \eeq which is a contradiction. By this argument Turing showed that there is no automatic means to establish whether Turing machines will halt in general: the ``halting problem'' is uncomputable. This implies that mathematics, and information processing in general, is a rich body of different ideas which cannot all be summarised in one grand algorithm. This liberating observation is closely related to G\"odel's theorem.