Monday, 11 November 2013

What, exactly, is DSD? - I. Opinion Polls

Strictly speaking, DSD (Direct Stream Digital) is a term coined by Sony and Philips, and refers to a very specific audio protocol: a 1-bit Sigma-Delta Modulated (SDM) data stream encoded at a sample rate of 2.8224MHz, which is 64 times the 44.1kHz CD sample rate. However, the term has since been widened by the audio community at large, to the point where it is applied generically to an ever-widening family of Sigma-Delta Modulated audio data streams. We read the terms Double-DSD, Quadruple-DSD, and “DSD-Wide” applied to various SDM-based audio formats, so that DSD has become a catch-all term, somewhat like PCM. There are many flavours of it, and some are claimed to be better than others.

So, time to take a closer look at DSD in its broadest sense, and hopefully bring some order to the confusion.

Strangely enough, the best place to start is via a detour into the topic of dither, which I discussed a couple of weeks back. You will recall how I showed that a 16-bit audio signal, with a maximum dynamic range of 96dB, can, when appropriately dithered, be shown using Fourier Analysis to have a noise floor as low as -120dB. I dismissed that as a digital party trick, which in that context it is. But this time it is worth elaborating on it.
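In fact, the party trick is easy to reproduce for yourself. Here is a minimal Python sketch (the FFT length and dither details are my own illustrative choices): it quantizes dithered digital silence to 16 bits, and the per-bin FFT noise floor comes out far below -96dB, simply because the fixed amount of noise power is shared out among thousands of analysis bins. The longer the FFT, the lower the apparent floor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1 << 12                    # 4096-point FFT (the floor drops as n grows)
lsb = 1.0 / 32768              # one 16-bit LSB on a +/-1.0 full scale

# TPDF-dithered 16-bit quantization of pure digital silence
dither = (rng.random(n) - rng.random(n)) * lsb
q = np.round(dither / lsb) * lsb

spec = np.abs(np.fft.rfft(q)) / (n / 2)        # per-bin amplitude
print(f"median per-bin noise floor: {20 * np.log10(np.median(spec)):.1f} dBFS")
```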

The question is, can I actually encode a waveform that has an amplitude below -96dB using 16-bit data? Yes I can, but only if I take advantage of a process called “oversampling”. Oversampling works a bit like an opinion poll. If I ask your opinion on whether Joe or Fred will win the next election, your response may be right or wrong, but it has limited value as a predictor of the outcome. However, if I ask 10,000 people, their collective opinions may prove to be a much more reliable measure. In asking 10,000 people I have “oversampled” the problem. The more people I poll, the more accurate the prediction should be. Additionally, instead of just predicting that Joe will win (sorry, Fred), I start to be able to predict how many points he will win by, even though my pollster never asked that question in the first place!
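To put a number on that intuition, here is a little Python simulation (the 52% support figure and the poll sizes are arbitrary assumptions): it runs each poll many times over and measures the spread of the results. The spread shrinks roughly as one over the square root of the number of respondents, the same statistical effect that makes oversampling pay off.

```python
import random

TRUE_SUPPORT = 0.52    # assume 52% of the electorate actually favours Joe
TRIALS = 2000          # repeat each poll many times to measure its spread

for n in (10, 100, 1_000, 10_000):
    estimates = []
    for _ in range(TRIALS):
        votes = sum(random.random() < TRUE_SUPPORT for _ in range(n))
        estimates.append(votes / n)
    mean = sum(estimates) / TRIALS
    spread = (sum((e - mean) ** 2 for e in estimates) / TRIALS) ** 0.5
    print(f"poll of {n:>6} people: Joe at {mean:.3f} +/- {spread:.3f}")
```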

In digital audio, you will recall that I showed how an audio signal needs to be sampled at a frequency at least twice the highest frequency in the audio signal. I can, of course, sample it at any frequency higher than that. Sampling at a higher frequency than is strictly necessary is called “oversampling”. There is a corollary to this: all frequencies in the audio signal that are lower than the highest frequency are inherently being oversampled. The lowest frequencies are being oversampled the most, and the highest frequencies the least. Oversampling gives me “information space” I can use to encode a “sub-dynamic” (my term) signal. Here’s how…

At this point I wrote and then deleted three very dense and dry paragraphs which described, and illustrated with examples, the mathematics of how oversampling works. But I had to simplify them too much to make them readable, in which form they were too easy to misinterpret, so they had to go. Instead, I will somewhat bluntly present the end result: the higher the oversampling rate, the deeper we can go below the theoretical PCM limit. More precisely, each time we double the sample rate, we can encode an additional 3dB (half a bit) of dynamic range. But there’s no free lunch to be had. Simple information theory says we can’t encode anything below the level of the Least Significant Bit (LSB), and yet that’s what we appear to have done. The extra “information” must be encoded elsewhere in the data, and it is: in this case as high levels of harmonic distortion. The harmonic distortion is the mathematical price we pay for encoding our “sub-dynamic” signal. This is a specific example of a more general mathematical consequence, which says that if we use the magic of oversampling to encode signals below the level of the LSB, other signals - think of them as aliases if you like - are going to appear at higher frequencies, and there is nothing we can do about that.
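To make that price visible, here is a short numpy sketch (the 16x rate, the ~1kHz tone and its 0.7-LSB amplitude are all arbitrary illustrative choices, not anything from the deleted paragraphs). It quantizes a tone whose peak amplitude is below one LSB to 16 bits, without dither. The spectrum shows the tone has indeed been encoded, but it is accompanied by strong odd harmonics only around 10dB below it:

```python
import numpy as np

fs = 44_100 * 16          # 16x the CD sample rate (an illustrative choice)
n = 1 << 18               # FFT length
lsb = 1.0 / 32768         # one 16-bit LSB on a +/-1.0 full scale

f = round(1000 * n / fs) * fs / n           # snap ~1kHz onto an FFT bin
t = np.arange(n) / fs
x = 0.7 * lsb * np.sin(2 * np.pi * f * t)   # peak ~0.7 LSB, about -93dBFS

q = np.round(x / lsb) * lsb                 # undithered 16-bit quantization

win = np.hanning(n)
spec = np.abs(np.fft.rfft(q * win)) / (win.sum() / 2)   # amplitude-calibrated
freqs = np.fft.rfftfreq(n, 1 / fs)
for k in (1, 3, 5, 7):
    i = np.argmin(np.abs(freqs - k * f))
    print(f"harmonic {k} ({k * f:7.0f} Hz): {20 * np.log10(spec[i] + 1e-12):6.1f} dBFS")
```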

Let’s go back again to dither, and consider a technique - Noise Shaping - that I mentioned in a previous post. Noise shaping relates to the fact that when we quantize a signal in a digital representation, the resultant quantization error looks like a noise signal added to the waveform. What is the spectrum of this noise signal? It turns out that we have a significant level of control over what it looks like. At lower frequencies we can squeeze that noise down to levels way below the value of the LSB, lower even than can be achieved by oversampling alone, but at the expense of huge amounts of additional noise popping up at higher frequencies. That high-frequency noise comprises the “aliases” of the sub-dynamic low-frequency information that our Noise Shaping has encoded - even if that low-frequency information is silence(!). This is what we mean by Noise Shaping - we “shape” the quantization noise so that it is lower at low frequencies and higher at high frequencies. For CD audio, those high frequencies must all lie within the audio frequency range, and as a consequence you have to be very careful in deciding where and when (and even whether) to use it, and what “shape” you want to employ. Remember - no free lunch.
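To illustrate the mechanism, here is a minimal first-order noise shaper in Python (real shapers are higher-order and far more carefully designed; the dither and the band edges are my own choices). Each sample’s quantization error is remembered and subtracted from the next input sample, which pushes the error power out of the low frequencies and piles it up at high frequencies - even when the input is pure digital silence:

```python
import numpy as np

rng = np.random.default_rng(0)
lsb = 1.0 / 32768                     # one 16-bit LSB on a +/-1.0 full scale

def tpdf():
    """Triangular-PDF dither, +/-1 LSB peak."""
    return (rng.random() - rng.random()) * lsb

def quantize_flat(x):
    """Ordinary dithered 16-bit quantization: flat error spectrum."""
    return np.array([round((s + tpdf()) / lsb) * lsb for s in x])

def quantize_shaped(x):
    """First-order noise shaping: remember each sample's quantization
    error and subtract it from the next input sample."""
    out, err = np.empty_like(x), 0.0
    for i, s in enumerate(x):
        w = s - err                   # subtract the previous error
        out[i] = round((w + tpdf()) / lsb) * lsb
        err = out[i] - w              # remember the new error
    return out

fs, n = 44_100, 1 << 16
silence = np.zeros(n)
freqs = np.fft.rfftfreq(n, 1 / fs)
for name, fn in (("flat  ", quantize_flat), ("shaped", quantize_shaped)):
    e = np.abs(np.fft.rfft(fn(silence))) ** 2       # error power spectrum
    lo = 10 * np.log10(e[(freqs > 0) & (freqs < 2_000)].mean())
    hi = 10 * np.log10(e[freqs > 20_000].mean())
    print(f"{name}: error power <2kHz {lo:6.1f} dB, >20kHz {hi:6.1f} dB (relative)")
```

Note that at a 44.1kHz sample rate even the “high” frequencies the shaper dumps its noise into are still inside, or right at the edge of, the audible band - which is exactly the problem the paragraph above warns about.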

But if we increase the sample rate, we also increase the amount of high-frequency space above the limit of audibility. Perhaps we can use it as a place to park all that “shaped” high-frequency noise? Tomorrow, we’ll find out.