Wednesday 29 July 2015

Things that lurk below the Bit Depth

In digital audio, Bit Depth governs the Dynamic Range and the Signal-to-Noise Ratio (SNR), but the relationship between them often leads to confusion.  I thought it was worth a quick discussion to see if I can shed some light on that.  But then I found that my ramblings went a little further than I originally intended.  So read on…

First of all, it is clear that the bit depth sets limits on the magnitude of the signal that can be digitally encoded.  With a 16-bit PCM system, the magnitude of the signal must be encoded as one of 65,536 levels.  You can think of them as going from 0 to 65,535, but in practice they are used from -32,768 to +32,767, which gives us a convenient way to store both the negative and positive excursions of a typical audio signal.  If the magnitude of the signal is such that its peaks exceed that range then we have a problem, because we don’t have levels available to record those values.  This sets an upper limit on the magnitude of a signal we can record.
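For the code-minded, here is a minimal sketch in Python (using NumPy; the scaling convention is just one common choice, and there are others) of how a sample value in the range ±1.0 gets mapped onto those levels:

```python
import numpy as np

def encode_16bit(x):
    """Map a value in [-1.0, +1.0] onto one of 65,536 integer levels."""
    # Scale to the int16 range, round to the nearest available level,
    # and clip anything that would exceed the encodable range.
    return np.int16(np.clip(np.round(x * 32767), -32768, 32767))

print(encode_16bit(0.5))      # 16384 -- comfortably encoded
print(encode_16bit(1.0))      # 32767 -- positive full scale
print(encode_16bit(0.00001))  # 0     -- too small to register
```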

On the other hand, if we progressively attenuate the signal, making it smaller and smaller, eventually we will get to the point where its peaks barely register at ±1.  If we attenuate it even further, then it will fail to register at all and it will be encoded as silence.  This sets the lower limit on the magnitude of the signal we can record.  Yes, there are some ifs and buts associated with both of these scenarios, but for the purpose of this post they don’t make a lot of difference.

The ratio between the upper and lower limits of the magnitudes that we can record is the Dynamic Range of the recording system.  The mathematics of this works out to be quite simple.  Each bit of the bit depth provides almost exactly 6dB of Dynamic Range (the exact figure is 20log10(2), or about 6.02dB).  So, if we are using a 16-bit system our Dynamic Range will be ~96dB (= 16x6).  And if we increase it to 24-bits the Dynamic Range increases to ~144dB (= 24x6).  For those of you who want the exact formula, it is 1.76 + 6.02D (where D is the bit depth).
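Here is that arithmetic as a quick sanity check (the 1.76dB term, for the record, comes from the standard textbook analysis of quantization noise for a full-scale sine wave):

```python
import numpy as np

for D in (16, 24):
    # Ratio of the largest to the smallest encodable magnitude.
    dynamic_range = 20 * np.log10(2 ** D)   # ~6.02dB per bit
    # Textbook SNR for a full-scale sine over quantization noise.
    snr = 6.02 * D + 1.76
    print(f"{D}-bit: Dynamic Range ~{dynamic_range:.1f}dB, SNR ~{snr:.1f}dB")
```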

So far, so good.  But where does the SNR come into it?  The answer, and the reason why it is the cause of so much confusion, is that both the signal and the noise are frequency dependent.  Each may be spread over a range of frequencies, and those ranges may or may not overlap.  Sometimes you don’t actually know very much about the frequency distribution of either.  Therefore, in order to be able to analyze and measure the ratio of one to the other, you often need to be able to look at the frequency distributions of both.

The way to do that is to take the audio data and use your computer to put it through a Fourier Transform.  This breaks the audio data down into individual frequencies, and for each frequency it tells you how much of that particular frequency is present in the audio data.  If you plot all these data points on a graph you get the audio data’s Frequency Spectrum.  In digital audio, we use a variant of the Fourier Transform called a DFT (Discrete Fourier Transform), which takes as its input a specific part of the audio data comprising a number of consecutive samples.  With a DFT the number of audio samples ends up being the same as the number of frequencies in the resulting Frequency Spectrum (strictly speaking, for real-valued audio data only half of those frequencies are unique, but that detail doesn’t change the argument), so if we use a lot of audio data we can obtain very good resolution in the frequency spectrum.  However, if we use too many samples it can make the calculation itself excessively laborious, so most audio DFTs are derived from somewhere between 1,000 and 65,000 samples.
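As a sketch of what that looks like in practice - assuming NumPy and a 44.1kHz sample rate, both my choices for illustration:

```python
import numpy as np

fs = 44100                 # sample rate, assumed for illustration
N = 8192                   # number of consecutive samples fed to the DFT
t = np.arange(N) / fs

x = 0.5 * np.sin(2 * np.pi * 1000 * t)    # a 1kHz tone at half of full scale

window = np.hanning(N)                    # a 'window function', of which more below
# For real-valued audio, rfft returns the N/2 + 1 unique frequency bins.
spectrum = np.fft.rfft(x * window)
freqs = np.fft.rfftfreq(N, 1 / fs)        # the frequency of each bin
magnitude_db = 20 * np.log10(np.abs(spectrum) + 1e-30)

peak = np.argmax(magnitude_db)
print(f"largest component: {freqs[peak]:.0f}Hz")   # ~1000Hz
```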

In principle, we can synthesize an audio data file containing nothing but a 1kHz pure tone, with no noise whatsoever.  If we looked at the DFT of that data file we would see a signal at the 1kHz frequency point, and absolutely nothing anywhere else.  This makes sense, because we have some signal, and no noise at all.  I can also synthesize a noise file by filling each audio sample with random numbers.  If the numbers are truly random, we get White Noise.  I can encode my white noise at full power (where the maximum positive and negative going encoded values are ±32,767), or I can attenuate it by 96dB so that the maximum positive and negative going encoded values are ±1.  If I attenuate it by much more than that I only get silence.
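Here is that synthesis as a hedged sketch (the random seed and scaling conventions are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8192

# White noise attenuated to -96dB: its peaks just barely reach +/-1
# on the 16-bit integer scale.
noise = rng.uniform(-1.0, 1.0, N) * 10 ** (-96 / 20)
encoded = np.round(noise * 32767).astype(np.int16)
print(np.unique(encoded))            # [-1  0  1] -- barely registering

# Another 12dB of attenuation and nothing registers at all.
encoded = np.round(noise * 0.25 * 32767).astype(np.int16)
print(np.unique(encoded))            # [0] -- digital silence
```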

Suppose I look at a DFT of my synthesized audio data file containing white noise at -96dB.  Suppose my DFT uses 8,192 samples, and consequently I end up with a Frequency Spectrum with 8,192 frequencies.  What do we expect to see?  Most people would expect the noise value at each frequency to be -96dB, but they would be wrong.  The value is much lower than that.  [Furthermore, there is a lot of “noise” in the frequency spectrum itself, although for the purposes of this post we are going to ignore that aspect of it.]  Basically, the noise is more or less equally shared out among the 8,192 frequencies, so the noise at each frequency is approximately 1/8192 of the total noise, or about 38dB down.  The important result here is that the level of the noise floor in the DFT plot is a long way below the supposed -96dB noise floor, and how far below depends on the number of frequencies in the DFT.  And there is more.  DFTs use a thing called a ‘window function’ for reasons I have described in a previous post, and the choice of window function significantly impacts the level at which the noise floor sits in the DFT plot.
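You can verify this with a few lines of code (a sketch only - no window function at all this time, which is to say a rectangular window, so the exact floor level will differ with other choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8192

noise = rng.uniform(-1.0, 1.0, N) * 10 ** (-96 / 20)

# Normalize so that a full-scale sine would read 0dB at its own bin.
spectrum = np.fft.rfft(noise) / (N / 2)
floor_db = 20 * np.log10(np.median(np.abs(spectrum)))
print(f"median per-bin noise level: {floor_db:.0f}dB")
# Prints a figure in the region of -135dB: far below -96dB.
```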

If I make a synthesized audio file containing a combination of a 1kHz pure tone and white noise at -96dB, and look at that in a DFT, what would we see?  The answer is that the noise behaves exactly as I have previously described, with the level of the noise floor on the plot varying according to both the number of frequencies in the DFT and the choice of window function.  The 1kHz pure tone is not affected, though.  Because it is a 1kHz pure tone, its energy only appears at the one frequency in the DFT corresponding to 1kHz, and it really doesn’t matter that much how many frequencies there are in the DFT.  [The choice of window function does impact both of those things, but for the purposes of this post I want to ignore that.]
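Extending the previous sketch, with the tone deliberately placed exactly on a DFT bin (my simplification, to sidestep the windowing details we are ignoring):

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 44100

for N in (1024, 8192, 65536):
    t = np.arange(N) / fs
    k = round(1000 * N / fs)
    f = k * fs / N                  # the bin frequency nearest 1kHz
    x = (0.5 * np.sin(2 * np.pi * f * t)
         + rng.uniform(-1.0, 1.0, N) * 10 ** (-96 / 20))
    db = 20 * np.log10(np.abs(np.fft.rfft(x) / (N / 2)) + 1e-30)
    print(f"N={N:6d}: tone bin {db[k]:6.1f}dB, "
          f"median floor {np.median(db):6.1f}dB")
# The tone reads ~-6dB at every N; the noise floor drops as N grows.
```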

The Signal-to-Noise Ratio (SNR) is exactly what it says.  It is the ratio of the signal to the noise.  If those values are expressed on a dB scale, then it is the difference between the two dB values.  So if the signal is at -6dB and the noise is at -81dB, then the SNR will be 75dB, which is the difference between the two.  But since we have seen that the actual measured value of the noise level varies depending on how we do the DFT, yet the measured value of the signal pretty much does not, an SNR value derived from a DFT is not very useful when it comes to quoting numbers for comparison purposes.  It is only useful when comparing two measurements made using the same DFT algorithm, set up with the exact same number of samples and the same window function.
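One way to get a stable, DFT-independent number is to integrate the noise power across all the frequencies instead of reading per-bin levels - sketched below, with my usual assumed conventions.  Note that it prints a shade under 77dB rather than exactly 75dB, because the peak level and the RMS level of a signal are not the same thing: a small taste of the methodology issues discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 44100

for N in (1024, 8192, 65536):
    t = np.arange(N) / fs
    k = round(1000 * N / fs)
    f = k * fs / N                  # tone placed exactly on a DFT bin
    x = (10 ** (-6 / 20) * np.sin(2 * np.pi * f * t)
         + rng.uniform(-1.0, 1.0, N) * 10 ** (-81 / 20))
    power = np.abs(np.fft.rfft(x)) ** 2
    # Signal power in the tone's bin, noise power in all the others.
    snr = 10 * np.log10(power[k] / (power.sum() - power[k]))
    print(f"N={N:6d}: SNR ~{snr:.1f}dB")
```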

Sometimes the SNR has to be measured purely in analog space.  For example, you might measure the overall background hiss on a blank analog tape before recording anything on it.  When you then make your recording, one measure of the SNR will be the ratio between the level of the recording and the level of the tape hiss.  Or you can measure the output of a microphone in a silent anechoic chamber before using the same microphone to make a recording.  One measure of the SNR of the microphone would be the ratio between the two recorded levels.  I use the term “one measure” of the SNR intentionally - because any time you measure SNR, whatever the tools and methods you use to make the measurement, the result is only of relevance if the methodology is well thought out and fully disclosed along with the results.  In reality, the nuances and variables are such that you could write whole books about how to specify and measure SNR.

Clearly, if the noise component of the SNR is a broadband signal, such as the hiss from a blank tape or the signal from a microphone in a silent room, then my ability to accurately represent that noise in a PCM format is going to be limited by the bit depth and therefore by its Dynamic Range.  But if I use a DFT to examine the spectral content of the noise signal then, as I have just described, the noise is going to be spread over all of the frequencies, and the component of the noise at each frequency will be proportionately lower.  What does it mean, then, if the noise content at a given frequency is below the minimum level represented by the Dynamic Range?  For example, in a 16-bit system, where the Dynamic Range is about 96dB, what does it mean if the noise level at any given frequency is measured using a DFT to be a long way below that - for example at -120dB?  Clearly, that noise is being encoded, so we must conclude that a 16-bit system can encode noise at levels several tens of dB below what we thought was the minimum signal level that could be encoded.  The question then arises: if we can encode noise at those levels, can we also encode signals?

The answer is yes we can, but at this point my challenge is to explain how this is possible in words of one proverbial syllable.  My approach is to propose a thought experiment.  Let’s take the white noise signal we talked about previously - the one at a level of -96dB which is just about encodable in a 16-bit PCM format.  We took our DFT of this signal and found that the noise component at each frequency was way lower than -96dB - let’s say that it was 30dB down, at -126dB.  Therefore the frequency content of the noise signal at one specific frequency - say, 1kHz - was at a level of -126dB.  Let us therefore apply a hypothetical filter to the input signal such that we strip out every frequency except 1kHz.  So now, we have taken our white noise signal at -96dB and filtered it to become a 1kHz signal at -126dB.  Our DFT previously showed that we had managed to encode that signal, in the sense that our DFT registered its presence and measured its intensity.  But, with the rest of the white noise filtered out, our input signal now comprises nothing but a single frequency at a level 30dB below the minimum level that can be represented by a 16-bit PCM system, and the result is pure, unadulterated, digital silence.
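That punch line is easy to verify numerically (a sketch, with the hypothetical filter idealized by simply synthesizing its output directly):

```python
import numpy as np

fs, N = 44100, 8192
t = np.arange(N) / fs

# The filter's output: a 1kHz tone at -126dB, i.e. the noise component
# our DFT measured at that frequency, now all on its own.
tone = 10 ** (-126 / 20) * np.sin(2 * np.pi * 1000 * t)
encoded = np.round(tone * 32767).astype(np.int16)
print(np.unique(encoded))     # [0] -- pure, unadulterated digital silence
```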

What happened there?  When the 1kHz component was part of the noise, we could detect its presence in the encoded signal, but when the rest of the noise was stripped out leaving only the 1kHz component behind, that 1kHz component vanished also.  It is clear that the presence of the totality of the noise is critical in permitting each of its component frequencies to be encoded.  There is something about the presence of noise that enables information to be encoded in a PCM system at levels far below those determined solely by the bit depth.

Exactly what it is, though, is beyond the scope of this post.  I’m sorry, because I know you were salivating to hear the answer!  But from this point on it boils down to ugly mathematics.  However, this result forms the basis of a principle that can be used to accomplish a number of party tricks in the digital audio domain.  These tricks include dithering, noise shaping, and sigma-delta modulation.  With dithering, we add a very small amount of noise in order to eliminate a larger amount of quantization-error-induced distortion.  With noise shaping, we can reduce the noise level at frequencies where we are sensitive to noise, at the expense of increasing it at frequencies where it is less audible.  And with sigma-delta modulation we can obtain state-of-the-art audio performance from a bit depth as small as 1 bit, at the expense of a dramatic increase in the sample rate.
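To give a flavour of the first of those tricks, here is a minimal dithering sketch (TPDF dither; the tone level, random seed, and scaling are all assumptions of mine).  A 1kHz tone 9dB below the 16-bit floor quantizes to pure silence on its own, but survives quantization once the dither noise is added:

```python
import numpy as np

rng = np.random.default_rng(0)
fs, N = 44100, 8192
t = np.arange(N) / fs
k = round(1000 * N / fs)
f = k * fs / N                       # tone placed exactly on a DFT bin

tone = 10 ** (-105 / 20) * np.sin(2 * np.pi * f * t)   # -105dB: sub-LSB

# Without dither: nothing but zeros.
print(np.unique(np.round(tone * 32767).astype(np.int16)))      # [0]

# TPDF dither: the sum of two uniform random values, 2 LSB peak-to-peak,
# added before the quantizer rounds to the nearest level.
dither = rng.uniform(-0.5, 0.5, N) + rng.uniform(-0.5, 0.5, N)
quantized = np.round(tone * 32767 + dither).astype(np.int16)

spectrum = np.fft.rfft(quantized / 32767) / (N / 2)
db = 20 * np.log10(np.abs(spectrum) + 1e-30)
print(f"1kHz bin: {db[k]:.0f}dB, median floor: {np.median(db):.0f}dB")
# The -105dB tone stands well proud of the dither noise floor.
```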

With DSD, for example, an entire audio stream with state-of-the-art performance can be made to lurk below the 1-bit Bit Depth.