Now that highly sophisticated analytical Apps such as Audacity are available as a free download, anybody can take an audio track and immediately obtain a Fourier Transform of it, without having to know a thing about it. And that is fine, so long as all you want is to take a quick and dirty look at the spectral content of the music. But if you want to be able to take it beyond a superficial glance, it would help to understand some of the fundamentals about what a Fourier Transform shows you, and what it doesn’t.
The first thing you need to know is the difference between a Fourier Transform and a DTFT or Discrete Time Fourier Transform. A Fourier Transform is a mathematical operation performed upon a function. That function may well be, for example, a musical waveform. However, when we want to get that waveform into a computer to actually do the analysis, all we can import is an audio file. An audio file is not the waveform itself, but merely a representation of the waveform. Therefore we do not have the waveform itself upon which to perform the transform, only the numbers which represent it. A DTFT is the digital equivalent of a Fourier Transform, performed upon a digital representation of the waveform.
The next thing you need to know is that a Fourier Transform requires the complete waveform, which means we need to know what it looked like since time began, and what it is going to look like in the infinite future. Real waveforms - particularly digitized representations of real waveforms - are not like that. They have a defined start and a defined stop. The DTFT gets around that by assuming that the waveform between the start and stop points repeats ad infinitum into both the the past and the future. For the most part, this is a limitation that we can comfortably get around.
The above notwithstanding, for the remainder of this post I shall use the term Fourier Transform, even though everything I am discussing is in fact a DTFT. It just makes for easier reading.
The Fourier Transform (by which I mean DTFT, remember?) of a waveform represented by N samples, is another waveform of N data points. These data points represent frequencies. The lowest frequency is DC, or zero Hertz (0Hz), and the highest frequency is the sample rate of the waveform Fs. The remaining N-2 data points represent all of the frequencies between 0Hz and Fs, spaced linearly. So the more samples there are in the waveform that goes into the Fourier Transform, the more detailed will be the spectrum of frequencies in the output.
Those of you who have your antennae switched on will have noticed already that the output of the Fourier Transform includes all of the frequencies from zero to Fs, whereas, as we know, digital sampling can only record frequencies which are less than half of the sampling frequency (Fs/2). This is because all of the frequencies above Fs/2 are in fact the aliases of the frequencies below Fs/2, and if you look closely you will see that the spectrum above Fs/2 exactly mirrors that below Fs/2. For this reason, and because the alias frequencies have no analytical value whatsoever, it is normal practice for analytical software to display only those frequencies below Fs/2.
The Fourier Transform (DTFT - last reminder) is a very complicated mathematical operation, and we are indeed fortunate to live in an age where a desktop computer can perform them with ease. Nonetheless, it doesn’t hurt to perform our Fourier Transform in the most efficient manner possible. By far the most effective way to slash processing time is to restrict the Fourier Transform to snippets of music waveform where the number of samples is a power of two. The processing time tends to increase exponentially with the number of samples, as does the memory requirements to store intermediate results. This usually limits Fourier Transforms in analytical software to something like 65,536 samples (2 raised to the power 16). Many Apps limit you to less - for example Audicity limits you to 16,384.
If the music is sampled at 44.1kHz, then 16,384 samples amounts to less than half a second of music. With high-resolution music it is correspondingly less. The fewer the samples and the higher the sample rate, the shorter the duration of the sampled window. Therefore if you are making observations regarding the spectral content of a music track, you probably need to be careful to look at multiple Fourier Transforms throughout the track because the spectral content typically evolves dramatically during the course of a complete track.
Let’s consider a snippet of a music waveform comprising N samples, spaced at regular intervals of t = (1/Fs), where Fs is the sample rate. The total duration of the snippet is therefore given by T = (N-1)t = (N-1)/Fs. The Fourier Transform will comprise a spectrum of N frequencies from zero to Fs. These are spaced linearly at intervals of Fs/(N-1), which is equal to 1/T regardless of the sample rate. That’s an important result. The longer the duration of the snippet, the more accurately we can analyze its frequency content.
What does this mean if our music contains frequencies which are between those linearly spaced individual frequencies? This question goes to the root of digital audio. It means than a set of N samples, sampled at a rate of Fs, cannot tell us anything about the frequency content to an accuracy of better than Fs/(N-1) for the simple reason that it cannot contain any such information in the first place. The original waveform may contain frequencies known to a very precise degree of accuracy, but once you extract a subset of N samples, without the rest of the waveform to support them those N samples do not contain this additional information. This is perhaps a more palatable conclusion if you take it to its logical extreme. If I choose a snippet of only two samples from the original waveform and put those through a Fourier Transform, I end up with N=2 frequencies, DC and Fs, each of which is the alias of the other! It is self-evident that two consecutive samples can tell me absolutely nothing about the original waveform from which they were extracted.
It is important to realize that Fourier analysis tells you not so much how much information you can extract from the data, rather it tells you how much is actually in there in the first place.
Earlier on I mentioned that the DTFT requires us to make an assumption that the snippet of music that goes into the analysis repeats ad infinitum into both the past and the future. This is not in and of itself a problem, but there is a practical problem associated with it. Where the the repeating waveforms meet each other there is going to be some sort of discontinuity. It is possible that there will be a smooth transition, but in the general case you are going to get an abrupt mismatch. This mismatch will generate a whole bunch of spurious frequencies, and you are not going to be able to tell which ones originate in the waveform, and which are associated with the discontinuity. The effect these spurious frequencies have in practice is that they serve to broaden the existing spectral lines, and spread them out further into the spectrum. In the limit, they will set the noise floor below which the Fourier Transform cannot extract spectral information, or more accurately, below which the waveform is, as a consequence, not capable of storing information. Because these stitching effects are more or less random, there is no way to model what the characteristics of their associated spectral spuriae will be.
Since they originate at the place where the finite waveform periodically stitches together with itself, it should be possible to ‘mask’ the effect of the discontinuity by ‘windowing’ the waveform with a window function which falls to zero at the interface points. Note that this kind of treatment cannot make the problem go away, since it is inherent in the data to begin with. But every window function will have the effect of adding its own particular spuriae to the Fourier Transform. So it may be of some use to choose a window function whose spectral spuriae are of a known characteristic, which at least gives you the possibility to choose a type whose characteristics are less deleterious to a particular property that you are interested in, but perhaps more so to one you are less interested in.
For example, you may be more interested in knowing the absolute magnitude of the signal content at a certain frequency as accurately as possible, and less interested in the magnitude of the noise floor through which it rises. Or it may be the other way round. Or you may be interested to know whether a feature in the vicinity of certain frequency is just one broad spectral line or in fact comprises two frequencies close together. You may wish to know the line width of a certain frequency. All of these things can be addressed by choosing a particular window function whose properties are best suited to your requirements.
Common window functions include Rectangular, Hamming, Hanning, Bartlett, Blackman, Blackman-Harris, Gaussian, Flat-Top, Tukey, and many others. There are no real hard and fast rules, but a good choice of window for accurate resolution of frequency would be Rectangular; for lowest spectral leakage Blackman or Blackman-Harris; and for amplitude accuracy Flat-Top. Unfortunately, not all signal processing Apps offer the same selection of windowing functions, and many of those windowing functions come with their own sets of parameters and selectable types. So if you are using a Fourier Transform to make or illustrate a particular point or observation, it may be worth the effort to set about selecting an appropriate window function.
So there you are. A brief introduction to Fourier Transforms for the casual user. I hope you found it to be of some value.