Being strictly accurate, DSD (Direct Stream Digital) is a term coined
by Sony and Philips, and refers to a very specific audio protocol. It
is a 1-bit Sigma-Delta Modulated data stream encoded at a sample rate of
2.8224MHz. However, the term has now been arbitrarily widened by the
audio community at large, to the point where we find it used
generically to describe an ever-widening family of Sigma-Delta Modulated audio
data streams. We read the terms Double-DSD, Quadruple-DSD, and
“DSD-Wide” applied to various SDM-based audio formats, so that DSD has
become a catch-all term somewhat like PCM. There are many flavours of
it, and some are claimed to be better than others.
So it’s time to take a closer look at DSD in its broadest sense, and hopefully bring some order to the confusion.
Strangely enough, the best place to start is via a detour into the
topic of dither, which I discussed a couple of weeks back. You will
recall how I showed that a 16-bit audio signal, with a maximum dynamic
range of 96dB, can, when appropriately dithered, be shown by Fourier
Analysis to have a noise floor as low as -120dB. I
dismissed that as a digital party trick, which in that context it is.
But this time it is apropos to elaborate on that.
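For the technically inclined, here is a minimal Python sketch of that party trick. It is purely my own illustration rather than anything from the earlier post, and the exact per-bin figure it reports depends on the FFT length and window you choose; the point is simply that the floor lands far below the nominal -96dB limit of 16-bit PCM.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, n = 44_100, 65_536
lsb = 1.0 / 32768.0            # one 16-bit step, with full scale at +/-1.0

# A 1kHz tone at -60dBFS, TPDF-dithered and rounded to 16-bit values.
t = np.arange(n) / fs
tone = 10 ** (-60 / 20) * np.sin(2 * np.pi * 1000 * t)
dither = (rng.random(n) - rng.random(n)) * lsb
quantized = np.round((tone + dither) / lsb) * lsb

# Windowed FFT; the median bin is a fair estimate of the noise floor.
window = np.hanning(n)
spectrum = np.abs(np.fft.rfft(quantized * window)) / (window.sum() / 2)
print(f"per-bin noise floor: {20 * np.log10(np.median(spectrum)):.0f} dBFS")
```

The tone itself shows up at -60dBFS, while the bins around it sit far below -96dB - which is the whole trick.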
The question
is, can I actually encode a waveform that has an amplitude below -96dB
using 16-bit data? Yes I can, but only if I take advantage of a process
called “oversampling”. Oversampling works a bit like an opinion poll.
If I ask your opinion on whether Joe or Fred will win the next
election, your response may be right or may be wrong, but it has limited
value as a predictor of outcome. However, if I ask 10,000 people,
their collective opinions may prove to be a more reliable measure. What
I have done in asking 10,000 people is to “oversample” the problem.
The more people I poll, the more accurate the outcome should be.
Additionally, instead of just predicting that Joe will win (sorry,
Fred), I start to be able to predict exactly how many points he will win
by, even though my pollster never even asked that question in the first
place!
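To make the polling analogy concrete, here is a tiny Python sketch (again, my own illustration). Each respondent can only give a 1-bit answer - Joe or Fred - yet averaging enough of those answers recovers the underlying 52/48 split to a fraction of a percent, information that no single answer could possibly carry.

```python
import numpy as np

rng = np.random.default_rng(1)
true_share = 0.52                                 # the "real" fraction backing Joe

for n_polled in (1, 100, 10_000, 1_000_000):
    answers = rng.random(n_polled) < true_share   # each answer is a single bit
    print(f"{n_polled:>9,} polled -> estimated share: {answers.mean():.4f}")
```

The estimate tightens roughly as the square root of the number of answers - 3dB per doubling, in audio terms - and that is exactly the statistical machinery that oversampling exploits.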
In digital audio, you will recall that I showed how an
audio signal needs to be sampled at a frequency which is at least twice
the highest frequency in the audio signal. I can, of course, sample it
at any frequency higher than that. Sampling at a higher frequency than
is strictly necessary is called “oversampling”. There is a corollary to
this. All frequencies in the audio signal that are lower than the
highest frequency are therefore inherently being oversampled. The
lowest frequencies are being oversampled the most, and the highest
frequencies the least. Oversampling gives me “information space” I can
use to encode a “sub-dynamic” (my term) signal. Here’s how…
At this point I wrote and then deleted three very dense and dry
paragraphs which described, and illustrated with examples, the
mathematics of how oversampling works. But I had to simplify them too
much to make them readable, and in that form they were too easy to misinterpret,
so they had to go. Instead, I will somewhat bluntly present the end
result: The higher the oversampling rate, the deeper we can go below
the theoretical PCM limit. More precisely, each time we double the
sample rate, we can encode an additional 3dB of dynamic range. But
there’s no free lunch to be had. Simple information theory says we
can’t encode something below the level of the Least Significant Bit
(LSB), and yet that’s what we appear to have done. The extra
“information” must be encoded elsewhere in the data, and it is. In this
case it is encoded as high levels of harmonic distortion. The harmonic
distortion is the mathematical price we pay for encoding our
“sub-dynamic” signal. This is a specific example of a more general
mathematical consequence, which says that if we use the magic of
oversampling to encode signals below the level of the LSB, other signals
- think of them as aliases if you like - are going to appear at higher
frequencies, and there is nothing we can do about that.
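To put numbers on that, here is a short Python sketch of the textbook relationship: spreading the quantization noise over a wider bandwidth and then keeping only the audio band buys you 10*log10 of the oversampling ratio in decibels - about 3dB for every doubling of the sample rate. The 96dB starting point is the nominal 16-bit figure used above.

```python
import math

BASE_RATE = 44_100        # ordinary CD-rate sampling, i.e. no oversampling
PCM_FLOOR_DB = 96.0       # nominal dynamic range of 16-bit PCM

for doublings in range(7):
    osr = 2 ** doublings                     # oversampling ratio
    gain_db = 10 * math.log10(osr)           # noise pushed outside the audio band
    print(f"fs = {BASE_RATE * osr:>9,} Hz (x{osr:>2}) -> "
          f"~{PCM_FLOOR_DB + gain_db:.1f} dB usable in the audio band")
```

Six doublings of the sample rate buy only about 18dB - three extra bits’ worth - which is why oversampling on its own is not the whole story.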
Let’s
go back again to dither, and consider a technique - Noise Shaping - that
I mentioned in a previous post. Noise shaping relates to the fact that
when we quantize a signal in a digital representation, the resultant
quantization error looks like a noise signal added to the waveform.
What is the spectrum of this noise signal? It turns out that we have a
significant level of control over what it can look like. At lower
frequencies we can squeeze that noise down to levels way below the value
of the LSB, lower even than can be achieved by oversampling alone, but
at the expense of huge amounts of additional noise popping up at higher
frequencies. That high-frequency noise comprises the “aliases” of the
sub-dynamic “low-frequency” information that our Noise Shaping has
encoded - even if that low-frequency information is silence(!). This is
what we mean by Noise Shaping - we “shape” the quantization noise so
that it is lower at low frequencies and higher at high frequencies. For
CD audio, those high frequencies must all be within the audio frequency
range, and as a consequence, you have to be very careful in deciding
where and when (and even whether) to use it, and what “shape” you want
to employ. Remember - no free lunch.
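For the technically inclined, here is a minimal first-order noise shaper in Python - very much a sketch of the principle rather than anything you would master a CD with, since real-world shapers use much higher-order, psychoacoustically weighted filters. The error made on each sample is fed back into the next one, which pushes the requantization noise up in frequency:

```python
import numpy as np

rng = np.random.default_rng(2)
LSB = 1.0 / 32768.0                     # one 16-bit step, full scale = +/-1.0

def shape_to_16bit(signal):
    """First-order error-feedback requantizer with TPDF dither: the
    previous sample's error is folded back in before rounding, so the
    error spectrum rises with frequency instead of being flat."""
    dither = (rng.random(len(signal)) - rng.random(len(signal))) * LSB
    out = np.empty_like(signal)
    err = 0.0
    for n, x in enumerate(signal):
        wanted = x - err                              # fold back the last error
        out[n] = np.round((wanted + dither[n]) / LSB) * LSB
        err = out[n] - wanted                         # error to feed forward
    return out

# A 1kHz tone at 44.1kHz, requantized to 16 bits with noise shaping.
fs = 44_100
t = np.arange(fs) / fs
tone = 0.5 * np.sin(2 * np.pi * 1000 * t)
shaped = shape_to_16bit(tone)

# Compare the requantization error at low and high frequencies.
err_spec = np.abs(np.fft.rfft(shaped - tone)) / len(tone)
freqs = np.fft.rfftfreq(len(tone), 1 / fs)
lo = 20 * np.log10(err_spec[(freqs > 100) & (freqs < 2_000)].mean())
hi = 20 * np.log10(err_spec[freqs > 15_000].mean())
print(f"mean error below 2kHz: {lo:.0f} dB   above 15kHz: {hi:.0f} dB")
```

Even this crude first-order loop leaves the low-frequency error bins a good 20dB or so quieter than the high-frequency ones, at the cost of extra noise at the top of the band - exactly the trade described above.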
But if we increase the
sample rate, we also increase the amount of high-frequency space above the limit of
audibility. Perhaps we can use it as a place to park all that “shaped”
high-frequency noise? Tomorrow, we’ll find out.