The Cooley–Tukey algorithm, named after J. W. Cooley and John Tukey, is the most common fast Fourier transform (FFT) algorithm. It re-expresses the discrete Fourier transform (DFT) of an arbitrary composite size <math>N = N_1N_2</math> in terms of N<sub>1</sub> smaller DFTs of sizes N<sub>2</sub>, recursively, to reduce the computation time to O(N log N) for highly composite N (smooth numbers). Because of the algorithm's importance, specific variants and implementation styles have become known by their own names, as described below.
Because the Cooley–Tukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT. For example, Rader's or Bluestein's algorithm can be used to handle large prime factors that cannot be decomposed by Cooley–Tukey, or the prime-factor algorithm can be exploited for greater efficiency in separating out relatively prime factors.
The algorithm, along with its recursive application, was invented by Carl Friedrich Gauss. Cooley and Tukey independently rediscovered and popularized it 160 years later.
History
This algorithm, including its recursive application, was invented around 1805 by Carl Friedrich Gauss, who used it to interpolate the trajectories of the asteroids Pallas and Juno, but his work was not widely recognized (being published only posthumously and in Neo-Latin). Gauss did not analyze the asymptotic computational time, however. Various limited forms were also rediscovered several times throughout the 19th and early 20th centuries.
Tukey reportedly came up with the idea during a meeting of President Kennedy's Science Advisory Committee discussing ways to detect nuclear-weapon tests in the Soviet Union by employing seismometers located outside the country. These sensors would generate seismological time series. However, analysis of this data would require fast algorithms for computing DFTs due to the number of sensors and length of time. This task was critical for the ratification of the proposed nuclear test ban so that any violations could be detected without need to visit Soviet facilities. Another participant at that meeting, Richard Garwin of IBM, recognized the potential of the method and put Tukey in touch with Cooley. However, Garwin made sure that Cooley did not know the original purpose. Instead, Cooley was told that this was needed to determine periodicities of the spin orientations in a 3-D crystal of helium-3. Cooley and Tukey subsequently published their joint paper, and wide adoption quickly followed due to the simultaneous development of Analog-to-digital converters capable of sampling at rates up to 300 kHz.
The fact that Gauss had described the same algorithm (albeit without analyzing its asymptotic cost) was not realized until several years after Cooley and Tukey's 1965 paper.
The radix-2 DIT case
A radix-2 decimation-in-time (DIT) FFT is the simplest and most common form of the Cooley–Tukey algorithm, although highly optimized Cooley–Tukey implementations typically use other forms of the algorithm as described below. Radix-2 DIT divides a DFT of size N into two interleaved DFTs (hence the name "radix-2") of size N/2 with each recursive stage.
The discrete Fourier transform (DFT) is defined by the formula:
:<math> X_k = \sum_{n=0}^{N-1} x_n e^{-\frac{2\pi i}{N} nk},</math>
where <math>n</math> is an integer ranging from 0 to <math>N-1</math>.
Radix-2 DIT first computes the DFTs of the even-indexed inputs
<math>(x_{2m}=x_0, x_2, \ldots, x_{N-2})</math>
and of the odd-indexed inputs <math>(x_{2m+1}=x_1, x_3, \ldots, x_{N-1})</math>, and then combines those two results to produce the DFT of the whole sequence. This idea can then be performed recursively to reduce the overall runtime to O(N log N). This simplified form assumes that N is a power of two; since the number of sample points N can usually be chosen freely by the application (e.g. by changing the sample rate or window, zero-padding, etcetera), this is often not an important restriction.
The radix-2 DIT algorithm rearranges the DFT of the function <math>x_n</math> into two parts: a sum over the even-numbered indices <math>n={2m}</math> and a sum over the odd-numbered indices <math>n={2m+1}</math>:
:<math>
X_k = \sum_{m=0}^{N/2-1} x_{2m}e^{-\frac{2\pi i}{N} (2m)k} + \sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N} (2m+1)k}
</math>
One can factor a common multiplier <math>e^{-\frac{2\pi i}{N}k}</math> out of the second sum, as shown in the equation below. It is then clear that the two sums are the DFT of the even-indexed part <math>x_{2m}</math> and the DFT of odd-indexed part <math>x_{2m+1}</math> of the function <math>x_n</math>. Denote the DFT of the Even-indexed inputs <math>x_{2m}</math> by <math>E_k</math> and the DFT of the Odd-indexed inputs <math>x_{2m + 1}</math> by <math>O_k</math> and we obtain:
:<math>
X_k= \underbrace{\sum \limits_{m=0}^{N/2-1} x_{2m} e^{-\frac{2\pi i}{N/2} mk_{\text{DFT of even-indexed part of } x_n} {} + e^{-\frac{2\pi i}{N}k}
\underbrace{\sum \limits_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} mk_{\text{DFT of odd-indexed part of } x_n} = E_k + e^{-\frac{2\pi i}{N}k} O_k\qquad\text{ for }k=0,\dots,\frac N2-1.
</math>
Note that the equalities hold for <math>k=0,\dots,N-1</math>,
but the crux is that <math>E_k</math> and <math>O_k</math> are calculated in this way for <math>k=0,\dots,\frac N2-1</math> only.
Thanks to the periodicity of the complex exponential, <math>X_{k+\frac{N}{2</math> is also obtained from <math>E_k</math> and <math>O_k</math>:
:<math>
\begin{align} X_{k + \frac{N}{2 & = \sum \limits_{m=0}^{N/2-1} x_{2m} e^{-\frac{2\pi i}{N/2} m(k + \frac{N}{2})} + e^{-\frac{2\pi i}{N}(k + \frac{N}{2})}
\sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} m(k + \frac{N}{2} )} \\
& = \sum_{m=0}^{N/2-1} x_{2m} e^{-\frac{2\pi i}{N/2} mk} e^{-2\pi m i} + e^{-\frac{2\pi i}{N}k}e^{-\pi i}
\sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} mk} e^{-2\pi m i} \\
& = \sum_{m=0}^{N/2-1} x_{2m} e^{-\frac{2\pi i}{N/2} mk} - e^{-\frac{2\pi i}{N}k}
\sum_{m=0}^{N/2-1} x_{2m+1} e^{-\frac{2\pi i}{N/2} mk} \\
& = E_k - e^{-\frac{2\pi i}{N}k} O_k
\end{align}
</math>
We can rewrite <math> X_k </math> and <math> X_{k+\frac{N}{2 </math> as:
:<math>
\begin{align}
X_k & = E_k + e^{-\frac{2\pi i}{N}{k O_k \\
X_{k+\frac{N}{2 & = E_k - e^{-\frac{2\pi i}{N}{k O_k
\end{align}
</math>
This result, expressing the DFT of length N recursively in terms of two DFTs of size N/2, is the core of the radix-2 DIT fast Fourier transform. The algorithm gains its speed by re-using the results of intermediate computations to compute multiple DFT outputs. Note that final outputs are obtained by a +/− combination of <math>E_k</math> and <math>O_k \exp(-2\pi i k/N)</math>, which is simply a size-2 DFT (sometimes called a butterfly in this context); when this is generalized to larger radices below, the size-2 DFT is replaced by a larger DFT (which itself can be evaluated with an FFT).
thumb|300px|right|Data flow diagram for N=8: a decimation-in-time radix-2 FFT breaks a length-N DFT into two length-N/2 DFTs followed by a combining stage consisting of N/2 size-2 DFTs called "butterfly" operations (so called because of the shape of the data-flow diagrams).
This process is an example of the general technique of divide and conquer algorithms; in many conventional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in breadth-first fashion.
The above re-expression of a size-N DFT as two size-N/2 DFTs is sometimes called the Danielson–Lanczos lemma, since the identity was noted by those two authors in 1942 (influenced by Runge's 1903 work (In many textbook implementations the depth-first recursion is eliminated in favor of a nonrecursive breadth-first approach, although depth-first recursion has been argued to have better memory locality.
- Perform N<sub>1</sub> DFTs of size N<sub>2</sub>.
- Multiply by complex roots of unity (often called the twiddle factors).
- Perform N<sub>2</sub> DFTs of size N<sub>1</sub>.
Typically, either N<sub>1</sub> or N<sub>2</sub> is a small factor (not necessarily prime), called the radix (which can differ between stages of the recursion). If N<sub>1</sub> is the radix, it is called a decimation in time (DIT) algorithm, whereas if N<sub>2</sub> is the radix, it is decimation in frequency (DIF, also called the Sande–Tukey algorithm). The version presented above was a radix-2 DIT algorithm; in the final expression, the phase multiplying the odd transform is the twiddle factor, and the +/- combination (butterfly) of the even and odd transforms is a size-2 DFT. (The radix's small DFT is sometimes known as a butterfly, so-called because of the shape of the dataflow diagram for the radix-2 case.)
Variations
There are many other variations on the Cooley–Tukey algorithm.
- Mixed-radix implementations handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the O(N<sup>2</sup>) algorithm for the prime base cases of the recursion <nowiki>(</nowiki>it is also possible to employ an N log N algorithm for the prime base cases, such as Rader's or Bluestein's algorithm<nowiki>)</nowiki>.
- Split radix merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, in order to achieve what was long the lowest known arithmetic operation count for power-of-two sizes, (On present-day computers, performance is determined more by cache and CPU pipeline considerations than by strict operation counts; well-optimized FFT implementations often employ larger radices and/or hard-coded base-case transforms of significant size.) e.g. for cache optimization or out-of-core operation, and was later shown to be an optimal cache-oblivious algorithm.
The general Cooley–Tukey factorization rewrites the indices k and n as <math>k = N_2 k_1 + k_2</math> and <math>n = N_1 n_2 + n_1</math>, respectively, where the indices k<sub>a</sub> and n<sub>a</sub> run from 0..N<sub>a</sub>-1 (for a of 1 or 2). That is, it re-indexes the input (n) and output (k) as N<sub>1</sub> by N<sub>2</sub> two-dimensional arrays in column-major and row-major order, respectively; the difference between these indexings is a transposition, as mentioned above. When this re-indexing is substituted into the DFT formula for nk, the <math>N_1 n_2 N_2 k_1</math> cross term vanishes (its exponential is unity), and the remaining terms give
:<math>X_{N_2 k_1 + k_2} =
\sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1}
x_{N_1 n_2 + n_1}
e^{-\frac{2\pi i}{N_1 N_2} \cdot (N_1 n_2 + n_1) \cdot (N_2 k_1 + k_2) }</math>
::<math>=
\sum_{n_1=0}^{N_1-1}
\left[ e^{-\frac{2\pi i}{N_1N_2} n_1 k_2 } \right]
\left( \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1}
e^{-\frac{2\pi i}{N_2} n_2 k_2 } \right)
e^{-\frac{2\pi i}{N_1} n_1 k_1 }
</math>
::<math>=
\sum_{n_1=0}^{N_1-1}
\left( \sum_{n_2=0}^{N_2-1} x_{N_1 n_2 + n_1}
e^{-\frac{2\pi i}{N_2} n_2 k_2 } \right)
e^{-\frac{2\pi i}{N_1N_2} n_1(N_2k_1+k_2) }
</math>.
where each inner sum is a DFT of size N<sub>2</sub>, each outer sum is a DFT of size N<sub>1</sub>, and the <nowiki>[...]</nowiki> bracketed term is the twiddle factor.
An arbitrary radix r (as well as mixed radices) can be employed, as was shown by both Cooley and Tukey This analysis was erroneous, however: the radix-butterfly is also a DFT and can be performed via an FFT algorithm in O(r log r) operations, hence the radix r actually cancels in the complexity O(r log(r) N/r log<sub>r</sub>N), and the optimal r is determined by more complicated considerations. In practice, quite large r (32 or 64) are important in order to effectively exploit e.g. the large number of processor registers on modern processors,
algorithm iterative-fft is
input: Array a of n complex values where n is a power of 2.
output: Array A the DFT of a.
bit-reverse-copy(a, A)
n ← a.length
for s = 1 to log(n) do
m ← 2<sup>s</sup>
ω<sub>m</sub> ← exp(−2πi/m)
for k = 0 to n-1 by m do
ω ← 1
for j = 0 to m/2 – 1 do
t ← ω A[k + j + m/2]
u ← A[k + j]
A[k + j] ← u + t
A[k + j + m/2] ← u – t
ω ← ω ω<sub>m</sub>
return A
The bit-reverse-copy procedure can be implemented as follows.
algorithm bit-reverse-copy(a,A) is
input: Array a of n complex values where n is a power of 2.
output: Array A of size n.
n ← a.length
for k = 0 to n – 1 do
A[rev(k)] := a[k]
Alternatively, some applications (such as the computation of convolution products) work equally well on bit-reversed data, so one can perform forward transforms, processing, and then inverse transforms all without bit reversal to produce final results in the natural order.
Many FFT users, however, prefer natural-order outputs, and a separate, explicit bit-reversal stage can have a non-negligible impact on the computation time, Also, while the permutation is a bit reversal in the radix-2 case, it is more generally an arbitrary (mixed-base) digit reversal for the mixed-radix case, and the permutation algorithms become more complicated to implement. Moreover, it is desirable on many hardware architectures to re-order intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements. To these ends, a number of alternative implementation schemes have been devised for the Cooley–Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages.
The problem is greatly simplified if it is out-of-place: the output array is distinct from the input array or, equivalently, an equal-size auxiliary array is available. The algorithm Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the Pease algorithm, which also reorders out-of-place with each stage, but this method requires separate bit/digit reversal and O(N log N) storage. One can also directly apply the Cooley–Tukey factorization definition with explicit (depth-first) recursion and small radices, which produces natural-order out-of-place output with no separate permutation step (as in the pseudocode above) and can be argued to have cache-oblivious locality benefits on systems with hierarchical memory.
A typical strategy for in-place algorithms without auxiliary storage and without separate digit-reversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data.
