Thursday, 30 July 2015

Got a Question?

Got a question you want answered on this blog?  Or a topic you would like to see discussed?  Or maybe you have some feedback you would like to give us?  Just send me an e-mail using blog@bitperfectsound.com.

Here are the rules of engagement:

  1. Depending on how many e-mails I receive, I probably won’t reply to your e-mail.
  2. It is entirely up to me whether I address your suggestion on the blog.
  3. It may be a long time before I get round to doing it.
  4. I may be inspired by your suggestion, but end up addressing a different point entirely.

I have kept a copy of this post among the 'stickies' on the right of the page.

Wednesday, 29 July 2015

Things that lurk below the Bit Depth

In digital audio Bit Depth governs the Dynamic Range and Signal-to-Noise Ratio (SNR), but the relationships often lead to confusion.  I thought it was worth a quick discussion to see if I can maybe shed some light on that.  But then I found that my ramblings went a little further than I originally intended.  So read on…

First of all, it is clear that the bit depth sets some sort of lower limit on the magnitude of the signal that can be digitally encoded.  With a 16-bit PCM system, the magnitude of the signal must be encoded as one of 65,536 levels.  You can think of them as going from 0 to 65,535 but in practice they are used from -32,768 to +32,767 which gives us a convenient way to store both the negative and positive excursions of a typical audio signal.  If the magnitude of the signal is such that its peaks exceed ±32,767 then we have a problem because we don’t have levels available to record those values.  This sets an upper limit on the magnitude of a signal we can record.

On the other hand, if we progressively attenuate the signal, making it smaller and smaller, eventually we will get to the point where its peaks barely register at ±1.  If we attenuate it even further, then it will fail to register at all and it will be encoded as silence.  This sets the lower limit on the magnitude of the signal we can record.  Yes, there are some ifs and buts associated with both of these scenarios, but for the purpose of this post they don’t make a lot of difference.

The ratio between the upper and lower limits of the magnitudes that we can record is the Dynamic Range of the recording system.  The mathematics of this works out to be quite simple.  Each bit of the bit depth provides almost exactly 6dB of Dynamic Range.  So, if we are using a 16-bit system our Dynamic Range will be ~96dB (= 16x6).  And if we increase it to 24 bits the Dynamic Range increases to ~144dB (= 24x6).  For those of you who want the exact formula, it is 1.76 + 6.02D (where D is the bit depth).
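As a quick check on that arithmetic, here is a minimal Python sketch.  The function names are my own and purely illustrative.  The first function is the plain ratio of the largest to the smallest encodable magnitude; the second is the commonly quoted textbook figure for a full-scale sine wave measured against quantization noise.

```python
import math

def dynamic_range_db(bits):
    """Ratio of the largest to the smallest encodable magnitude, in dB."""
    return 20 * math.log10(2 ** bits)

def sine_sqnr_db(bits):
    """Textbook figure for a full-scale sine against quantization noise."""
    return 6.02 * bits + 1.76

for bits in (16, 24):
    print(bits, round(dynamic_range_db(bits), 2), round(sine_sqnr_db(bits), 2))
```

Both come out within a couple of dB of the "6dB per bit" rule of thumb, which is why the rule is good enough for most discussions.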

So far, so good.  But where does the SNR come into it?  The answer, and the reason why it is the cause of so much confusion, is that both the signal and the noise are frequency dependent.  Both may be spread over several frequencies, which may be similar or different frequencies.  Sometimes you don’t actually know too much about the frequency distributions of either.  Therefore, in order to be able to analyze and measure the ratios of one to the other, you often need to be able to look at the frequency distributions of both.

The way to do that is to take the audio data and use your computer to put it through a Fourier Transform.  This breaks the audio data down into individual frequencies, and for each frequency it tells you how much of that particular frequency is present in the audio data.  If you plot all these data points on a graph you get the audio data’s Frequency Spectrum.  In digital audio, we use a variant of the Fourier Transform called a DFT, which takes as its input a specific part of the audio data comprising a number of consecutive samples.  With a DFT the number of audio samples ends up being the same as the number of frequencies in the resulting Frequency Spectrum, so if we use a lot of audio data we can obtain very good resolution in the frequency spectrum.  However, if we use too many samples it can make the calculation itself excessively laborious, so most audio DFTs are derived from between 1,000 and 65,000 samples.
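To make that concrete, here is a small Python sketch of a DFT written the slow, obvious way.  It is only an illustration under my own assumptions: the 8kHz sample rate, the 64-sample length and the names are all made up for the example, and real software would use a fast FFT routine instead.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) DFT, normalised by the number of samples."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)) / n
        for k in range(n)
    ]

SAMPLE_RATE = 8000   # assumed for the example
N = 64               # number of samples, and so number of frequencies
TONE_HZ = 1000       # lands exactly on bin 8, since each bin is 8000/64 = 125Hz wide

tone = [math.sin(2 * math.pi * TONE_HZ * i / SAMPLE_RATE) for i in range(N)]
spectrum = [abs(c) for c in dft(tone)]

# For real-valued audio the upper half of the spectrum mirrors the lower half,
# so only the first N/2 bins are searched for the peak.
peak_bin = max(range(N // 2), key=lambda k: spectrum[k])
print(len(spectrum), "frequencies, peak at", peak_bin * SAMPLE_RATE / N, "Hz")
```

Sixty-four samples in, sixty-four frequencies out, and all of the tone's energy sits in the bin corresponding to 1kHz (and its mirror image).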

In principle, we can synthesize an audio data file containing nothing but a 1kHz pure tone, with no noise whatsoever.  If we looked at the DFT of that data file we would see a signal at the 1kHz frequency point, and absolutely nothing everywhere else.  This makes sense, because we have some signal, and no noise at all.  I can also synthesize a noise file by filling each audio sample with random numbers.  If the numbers are truly random, we get White Noise.  I can encode my white noise at full power (where the maximum positive and negative going encoded values are ±32,767), or I can attenuate it by 96dB so that the maximum positive and negative going encoded values are ±1.  If I attenuate it by more than that I only get silence.

Suppose I look at a DFT of my synthesized music data file containing white noise at -96dB.  Suppose my DFT uses 8,192 samples, and consequently I end up with a Frequency Spectrum with 8,192 frequencies.  What do we expect to see?  Most people would expect to see the noise value at each frequency to be -96dB, but they would be wrong.  The value is much lower than that.  [Furthermore, there is a lot of “noise” in the frequency spectrum itself, although for the purposes of this post we are going to ignore that aspect of it.]  Basically, the noise is more or less equally shared out among the 8,192 frequencies, so the noise at each frequency is approximately 1/8192 of the total noise, or about 39dB down.  The important result here is that the level of the noise floor in the DFT plot is a long way below the supposed -96dB noise floor, and how far below depends on the number of frequencies in the DFT.  And there is more.  DFTs use a thing called a ‘window function’ for reasons I have described in a previous post, and the choice of window function significantly impacts the level where the noise floor sits in the DFT plot.
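Here is a rough Python sketch of that sharing-out.  It rests on assumptions of my own: the noise is uniformly distributed random numbers, no window function is applied, and the small recursive FFT is just a convenient stand-in for whatever DFT routine you would normally use.

```python
import cmath
import math
import random

def fft(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

N = 8192
rng = random.Random(1)                      # fixed seed so the run is repeatable
noise = [rng.uniform(-1.0, 1.0) for _ in range(N)]

total_power = sum(v * v for v in noise) / N            # power of the whole signal
bin_powers = [abs(c / N) ** 2 for c in fft(noise)]     # power found at each frequency
mean_bin_power = sum(bin_powers) / N

drop_db = 10 * math.log10(total_power / mean_bin_power)
print("average bin sits", round(drop_db, 2), "dB below the total noise level")
```

The drop is 10·log10(N), so doubling the number of samples pushes the plotted noise floor down by another 3dB without the noise itself having changed at all.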

If I make a synthesized music file containing a combination of a 1kHz pure tone and white noise at -96dB, and look at that in a DFT, what would we see?  The answer is that the noise behaves exactly as I have previously described, with the level of the noise floor on the plot varying according to both the number of frequencies in the DFT and the choice of window function.  The 1kHz pure tone is not affected, though.  Because it is a 1kHz pure tone, its energy only appears at the one frequency in the DFT corresponding to 1kHz, and it really doesn’t matter that much how many frequencies there are in the DFT.  [The choice of window function does impact both of those things, but for the purposes of this post I want to ignore that.]

The Signal-to-Noise Ratio (SNR) is exactly what it says.  It is the ratio of the signal to the noise.  If those values are expressed in a dB scale, then it is the difference between the two dB values.  So if the signal is at -6dB and the noise is at -81dB, then the SNR will be 75dB, which is the difference between the two.  But since we have seen that the actual measured value of the noise level varies depending on how we do the DFT, yet the measured value of the signal pretty much does not, then an SNR value derived from a DFT is not very useful when it comes to quoting numbers for comparison purposes.  It is only useful when comparing two measurements made using the same DFT algorithms, set up with the exact same number of samples and the same window function.
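The following Python sketch illustrates the point with the numbers above: a 1kHz tone at -6dB plus broadband noise at -81dB, examined with two different DFT sizes.  The 32kHz sample rate, the Gaussian noise, the absence of a window function and all the names are assumptions made for the example.

```python
import cmath
import math
import random

def fft(x):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

TONE_AMP = 10 ** (-6 / 20)     # tone at -6dB
NOISE_RMS = 10 ** (-81 / 20)   # broadband noise at -81dB

def measure(n, sample_rate=32000, tone_hz=1000, seed=7):
    """Return (tone level, mean noise-floor level) in dB as read off an n-point DFT."""
    rng = random.Random(seed)
    x = [TONE_AMP * math.sin(2 * math.pi * tone_hz * i / sample_rate)
         + rng.gauss(0.0, NOISE_RMS) for i in range(n)]
    mags = [2 * abs(c) / n for c in fft(x)[: n // 2]]
    k = round(tone_hz * n / sample_rate)          # bin holding the tone
    tone_db = 20 * math.log10(mags[k])
    floor_power = sum(m * m for i, m in enumerate(mags) if i not in (0, k)) / (n // 2 - 2)
    return tone_db, 10 * math.log10(floor_power)

results = {n: measure(n) for n in (1024, 4096)}
for n, (tone_db, floor_db) in results.items():
    print(n, "samples: tone", round(tone_db, 1), "dB, floor", round(floor_db, 1),
          "dB, apparent SNR", round(tone_db - floor_db, 1), "dB")
```

The tone reads the same in both cases, but the floor drops by roughly 6dB when the DFT is four times longer, so the "SNR" read off the plot changes even though the signal and the noise are identical.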

Sometimes the SNR has to be measured purely in analog space.  For example, you might measure the overall background hiss on a blank analog tape before recording anything on it.  When you then make your recording, one measure of the SNR will be the ratio between the level of the recording and the level of the tape hiss.  Or you can measure the output of a microphone in a silent anechoic chamber before using the same microphone to make a recording.  One measure of the SNR of the microphone would be the ratio between the two recorded levels.  I use the term “one measure” of the SNR intentionally - because any time you measure SNR, whatever the tools and methods you use to make the measurement, the result is only of relevance if the methodology is well thought out and fully disclosed along with the results.  In reality, the nuances and variables are such that you can write whole books about how to specify and measure SNR.

Clearly, if the noise component of the SNR is a broadband signal, such as the hiss from a blank tape or the signal from a microphone in a silent room, then my ability to accurately represent that noise in a PCM format is going to be limited by the bit depth and therefore by its Dynamic Range.  But if I use a DFT to examine the spectral content of the noise signal, then, as I have just described, the noise is going to be spread over all of the frequencies and the component of the noise at each frequency will be proportionately lower.  What does it mean, then, if the noise content at a given frequency is below the minimum level represented by the Dynamic Range?  For example, in a 16-bit system, where the Dynamic Range is about 96dB, what does it mean if the noise level at any given frequency is measured using a DFT to be a long way below that - for example at -120dB?  Clearly, that noise is being encoded, so we must conclude that a 16-bit system can encode noise at levels several tens of dB below what we thought was the minimum signal level that could be encoded.  The question then arises, if we can encode noise at those levels, can we also encode signals?

The answer is yes we can, but at this point my challenge is to explain how this is possible in words of one proverbial syllable.  My approach is to propose a thought experiment.  Let’s take the white noise signal we talked about previously - the one at a level of -96dB which is just about encodable in a 16-bit PCM format.  We took our DFT of this signal and found that the noise component at each frequency was way lower than -96dB - let’s say that it was 30dB down at -126dB.  Therefore the frequency content of the noise signal at one specific frequency - say, 1kHz - was at a level of -126dB.  Let us therefore apply a hypothetical filter to the input signal such that we strip out every frequency except 1kHz.  So now, we have taken our white noise signal at -96dB and filtered it to become a 1kHz signal at -126dB.  Our DFT previously showed that we had managed to encode that signal, in the sense that our DFT registered its presence and measured its intensity.  But, with the rest of the white noise filtered out, our input signal now comprises nothing but a single frequency at a level 30dB **below** the minimum level that can be represented by a 16-bit PCM system, and the result is pure, unadulterated, digital silence.

What happened there?  When the 1kHz component was part of the noise, we could detect its presence in the encoded signal, but when the rest of the noise was stripped out leaving only the 1kHz component behind, that 1kHz component vanished also.  It is clear that the presence of the totality of the noise is critical in permitting each of its component frequencies to be encoded.  There is something about the presence of noise that enables information to be encoded in a PCM system at levels far below those determined solely by the bit depth.

Exactly what it is, though, is beyond the scope of this post.  I’m sorry, because I know you were salivating to hear the answer!  But from this point on it boils down to ugly mathematics.  However, this result forms the basis of a principle that can be used to accomplish a number of party tricks in the digital audio domain.  These tricks include dithering, noise shaping, and sigma-delta modulation.  With dithering, we can add a very small amount of noise in order for a larger amount of quantization-error-induced distortion to be eliminated.  With noise shaping, we can reduce the noise level at frequencies where we are sensitive to noise, at the expense of increasing the noise at frequencies where it is less audible.  And with sigma-delta modulation we can obtain state-of-the-art audio performance from a bit depth of as little as 1-bit, at the expense of a dramatic increase in the sample rate.

With DSD, for example, an entire audio stream with state-of-the-art performance can be made to lurk below the 1-bit Bit Depth.
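For anyone who wants to try the thought experiment numerically, here is a minimal Python sketch.  It makes assumptions that go beyond the text: the noise is triangular dither spanning ±1 level (one common choice, not necessarily the white noise described above), the sample rate is 44.1kHz, and the tone's level is read with a single-frequency DFT over four seconds of data.  All names are illustrative.

```python
import math
import random

SAMPLE_RATE = 44100
TONE_HZ = 1000
N = SAMPLE_RATE * 4                 # four seconds, a whole number of cycles
TONE_AMP = 10 ** (-30 / 20)         # 30dB below one quantization level (about 0.032)

def quantize(v):
    """Round to the nearest whole level, as a PCM encoder would."""
    return math.floor(v + 0.5)

def tone_level(samples):
    """Amplitude of the 1kHz component, from a single-frequency DFT."""
    w = 2 * math.pi * TONE_HZ / SAMPLE_RATE
    re = sum(s * math.cos(w * i) for i, s in enumerate(samples))
    im = sum(s * math.sin(w * i) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / len(samples)

rng = random.Random(2015)           # fixed seed so the run is repeatable
tone = [TONE_AMP * math.sin(2 * math.pi * TONE_HZ * i / SAMPLE_RATE) for i in range(N)]

# On its own the tone never reaches half a level, so every sample rounds to zero.
alone = [quantize(v) for v in tone]

# With noise added before rounding, the samples toggle between -1, 0 and +1,
# and the tone survives in the statistics of that toggling.
with_noise = [quantize(v + rng.random() - rng.random()) for v in tone]

print("tone alone:     ", tone_level(alone))
print("tone with noise:", tone_level(with_noise), "(true amplitude", TONE_AMP, ")")
```

The first figure is exactly zero, digital silence.  The second comes out close to the true amplitude of the tone, even though that tone sits 30dB below the smallest level the encoder can represent.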

Monday, 20 July 2015

How Do DACs Work?

All digital audio, whether PCM or DSD, stores the analog audio signal as a stream of numbers, each one representing an instantaneous snapshot of its continuously evolving value.  With either format, the digital bit pattern is its best representation of the analog signal value at each instant in time.  With PCM the bit pattern typically comprises 16-bit (or 24-bit) numbers, each representing the value of the analog signal to a precision of one part in 65,536 (or one part in 16,777,216).  With DSD the precision is 1 bit, which means that it encodes the instantaneous analog voltage as either maximum positive or maximum negative with nothing in between (and you may well wonder how that manages to represent anything, which is a different discussion entirely, but nevertheless it does).  In either case, though, the primary task of the DAC is to generate those output voltages in response to the incoming bitstream.  Let’s take a look at how that is done.

For the purposes of this post I am going to focus exclusively on the core mechanisms involved in transforming a bit stream into an analog signal.  Aside from these core mechanisms there are further mission-critical issues such as clock timing and noise, but these are not the subject of this post.  At some point I will write another post on clocks, timing, and jitter.

The most conceptually simple way of converting digital to analog is to use something called an R-2R ladder.  This is a simple sequence of resistors of alternating values ‘R’ and ‘2R’, wired together in a ‘ladder’-like configuration.  There’s nothing more to it than that.  Each ‘2R’ resistor has exactly twice the resistance value of each ‘R’ resistor, and all the ‘R’s and all the ‘2R’s are absolutely identical.  Beyond that, the actual value of the resistances is not crucial.  Each R-2R pair, if turned “on” by its corresponding PCM bit, contributes the exact voltage to the output which is encoded by that bit.  It is very simple to understand, and in principle is trivial to construct, but in practice it suffers from a very serious drawback.  You see, the resistors have to be accurate to a phenomenal degree.  For 16-bit PCM that means an accuracy of one part in 65 thousand, and for 24-bit PCM one part in 16 million.  If you want to make your own R-2R ladder-DAC you need to be able to go out and buy those resistors.

As best as I can tell, the most accurate resistors available out there on a regular commercial basis are accurate to ±0.005% which is equivalent to one part in 20,000.  Heaven knows what they cost.  And that’s not the end of the story.  The resistance value is very sensitive to temperature, which means you have to mount them in a carefully temperature-controlled environment.  And even if you do that, the act of passing the smallest current through it will heat it sufficiently to change its resistance value.  [Note:  In fact this tends to be what limits the accuracy of available resistors - the act of measuring the resistance actually perturbs the resistance by more than the accuracy to which you’re trying to measure it!  Imagine what that means when you try to deploy the resistor in an actual circuit…]  The resistor’s inherent inductance (even straight wires have inductance) also affects the DAC ladder when such phenomenal levels of precision enter the equation.  And we’re still not done yet: unfortunately the resistance values drift with time, so your precision-assembled, thermally cushioned and inductance-balanced R-2R network may leave the factory operating to spec, but may well be out of spec by the time it has broken in at the customer’s system.  These are the problems that a putative R-2R ladder DAC designer must be willing and able to face up to.  Which is why there are so few of them on the market.
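To put a number on how unforgiving this is, here is a simplified Python model of an R-2R ladder.  It is a sketch under my own assumptions: an idealised voltage-mode ladder with perfect switches and no load on the output, in which only one resistor (the ‘2R’ belonging to the most significant bit) is allowed to be slightly off.  The function and parameter names are hypothetical.

```python
def r2r_output(code, bits, vref=1.0, leg_error=None):
    """Unloaded output voltage of an idealised voltage-mode R-2R ladder.

    leg_error maps a bit index (0 = least significant) to the fractional
    error of that bit's 2R resistor, e.g. {15: 0.00005} for +0.005%.
    """
    leg_error = leg_error or {}
    r = 1.0                                  # the actual value of R does not matter
    v_th, r_th = 0.0, 2 * r                  # terminating 2R resistor to ground
    for bit in range(bits):
        v_leg = vref if (code >> bit) & 1 else 0.0
        r_leg = 2 * r * (1 + leg_error.get(bit, 0.0))
        # Merge this bit's leg with the equivalent source built up so far.
        g = 1 / r_th + 1 / r_leg
        v_th = (v_th / r_th + v_leg / r_leg) / g
        r_th = 1 / g
        if bit < bits - 1:
            r_th += r                        # series R up to the next rung
    return v_th

BITS = 16
LSB = 1.0 / 2 ** BITS

def mid_scale_step(leg_error=None):
    """Size, in LSBs, of the step from code 0x7FFF to 0x8000 (ideally exactly 1)."""
    hi = r2r_output(0x8000, BITS, leg_error=leg_error)
    lo = r2r_output(0x7FFF, BITS, leg_error=leg_error)
    return (hi - lo) / LSB

ideal_step = mid_scale_step()
real_step = mid_scale_step({BITS - 1: 0.00005})   # MSB resistor 0.005% high
print("ideal step:", round(ideal_step, 3), "LSB")
print("with a 0.005% error on the MSB resistor:", round(real_step, 3), "LSB")
```

In this model, a single resistor at the ±0.005% limit is enough to turn what should be a one-level step up into a step down at the middle of the 16-bit range, and that is before temperature, inductance or drift are considered.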

Manufacturers of some R-2R ladder-DACs use the term ‘NOS’ (Non-Over-Sampling) to describe their architecture.  I don’t much like that terminology because it is a rather vague piece of jargon and can in principle be used to mean other things, but the blame lies at the feet of many modern DAC chipset manufacturers (and the DAC product manufacturers who use them) who describe their architectures as "Over-Sampling", hence the use of the term NOS as a distinction.

Before moving on, we’ll take an equally close look at how DSD gets converted to analog.  In principle, the incoming bit stream can be fed into its own 1-bit R-2R ladder, which, being 1-bit, is no longer a ladder and comprises only the first resistor R, whose precision no longer really matters.  And that’s all there is to it.  Easy, in comparison to PCM.  Something which has not gone unnoticed … and which we’ll come back to again later.

Aside from what I have just described, for both PCM and DSD three major things are left for the designer to deal with.  First is to make sure the output reference voltages are stable and with as little noise as possible.  Second is to ensure that the switching of the analog voltages in response to the incoming digital bit stream is done in a consistent manner and with sufficient timing accuracy.  Third is to remove any unwanted noise that might be present in the analog signal that has just been created.  These are the implementation areas in which a designer generally has the most freedom and opportunity to bring his own skills to bear.

The third of these is the most interesting in the sense that it differs dramatically between 1-bit (DSD) and multi-bit (PCM) converters.  Although in both cases the noise that needs to be removed lives at inaudible ultrasonic frequencies, with PCM there is not much of it at all, whereas with DSD there is so much of it that the noise power massively overwhelms the signal power.  With PCM, there are even some DACs which dispense with analog filtering entirely, working on the basis that the noise is both inaudible, and at too low a level to be able to upset the downstream electronics.  With DSD, though, removing this noise is a necessary and significant requirement.

Regarding the analog filters, most designers are agreed that although different audio stream formats can be optimized such that each format has its own ideal analog filter, if a DAC is designed to support multiple stream formats it is impractical to provide multiple analog filters and switch them in and out of circuit according to the format currently being played.  Therefore most DACs will have a single analog output filter which is used for every incoming stream format.

The developers of the original SACD players noted that the type of analog filter that was required to perform this task was more or less the same as the anti-imaging (reconstruction) filters used in the output of the CD format, which they were trying to improve upon.  They recognized that those filters degraded the sound.  So instead, in the earliest players, they decided to upconvert the DSD from what we today call DSD64 to what we would now call DSD128.  With DSD128 the ultrasonic filter was found to be less of a problem and was considered not to affect the sound in the same way.  Bear in mind, though, that in doing the upconversion from DSD64 to DSD128 you still have to filter out the DSD64’s ultrasonic noise.  However, this can be done in the digital domain, and (long story short) digital filters almost always sound better than their analog counterparts.

As it happens, similar techniques had already been in use with PCM DACs for over a decade.  Because R-2R ladder DACs were so hard to work with, it was much easier to convert the incoming PCM to a DSD-like format and perform the physical D-to-A conversion step in a 1-bit format.  Although the conversion of PCM to DSD via an SDM is technically very complex and elaborate, it can be done entirely in the digital realm which means that it can also be done remarkably inexpensively.

When I say "DSD-like" what do I mean?  DSD, strictly speaking, is a trademark developed by Sony and Philips (and currently owned by Sonic Studio, LLC).  It stands for Direct Stream Digital and refers specifically to a 1-bit format at a sample rate of 2.8224MHz.  But the term is now being widely used to refer to a broad class of formats which encode the audio signal using the output of a Sigma-Delta Modulator (SDM).  An SDM can be configured to operate at any sample rate you like and with any bit depth you like.  For example, the output of an SDM could even be a conventional PCM bitstream and such an SDM can actually pass a PCM bitstream through unchanged.  A key limitation of an SDM is that it can be unstable when configured with a 1-bit output stream.  However, this instability can be practically eliminated by using a multi-bit output.  For this reason, most modern PCM DACs will upconvert (or ‘Over-Sample’) the incoming PCM before passing it through an SDM with an output bit depth of between 3 and 5 bits.  This means that the physical D-to-A conversion is done with a 3- to 5-stage resistor ladder, which can be easily implemented.
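To give a feel for what an SDM does, here is a deliberately simple Python sketch of a first-order, 1-bit modulator.  It is a textbook toy rather than what any real DAC chip uses (those are higher-order and far more elaborate), and the moving average standing in for the output filter is my own crude choice.  The names are illustrative.

```python
import math

def sigma_delta_1bit(samples):
    """First-order sigma-delta modulator: input within ±1, output only +1 or -1."""
    integrator = 0.0
    previous = 0.0
    out = []
    for x in samples:
        integrator += x - previous           # accumulate the error so far
        previous = 1.0 if integrator >= 0 else -1.0
        out.append(previous)
    return out

def moving_average(values, width):
    """Crude low-pass filter: running mean over `width` samples."""
    total = sum(values[:width])
    out = [total / width]
    for i in range(width, len(values)):
        total += values[i] - values[i - width]
        out.append(total / width)
    return out

SAMPLE_RATE = 2822400        # the DSD rate, 64 x 44.1kHz
N = 28224                    # 10ms of signal, ten cycles of 1kHz
signal = [0.5 * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(N)]

bitstream = sigma_delta_1bit(signal)
recovered = moving_average(bitstream, 64)
reference = moving_average(signal, 64)
worst_error = max(abs(a - b) for a, b in zip(recovered, reference))
print("largest difference after filtering:", round(worst_error, 3))
```

Each output sample is only ever +1 or -1, yet after even this crude filtering the stream follows the original sine wave to within a few percent of full scale.  Better modulators and better filters shrink that error enormously, which is the whole game.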

These SDM-based DACs are so effective that today there are hardly any R-2R ladder DACs in production, and those that are (such as the Light Harmonic Da Vinci) can be eye-wateringly expensive.  The intermediate conversion of an incoming signal to a DSD-like format means that, in principle, any digital format (including DSD) can be readily supported, as evidenced by the plethora of DSD-compatible DACs on the market today.  Because these internal conversions are performed entirely in the digital domain, manufacturers typically produce complete chip sets capable of performing all of the conversion functionality on-chip, driving the costs down considerably when compared to an R-2R ladder approach.  The majority of DACs on the market today utilize chip sets from one of five major suppliers (ESS, Wolfson, Burr-Brown (TI), AKM, and Philips), although there are others as well.

Interestingly, all of this is behind the recent emergence of DSD as a niche in-demand consumer format.  In a previous post I showed that almost all ADCs in use today use an SDM-based structure to create a ‘DSD-like’ intermediate format which is then digitally converted to PCM.  Today I showed the corollary in DAC architectures where incoming PCM is digitally converted to a ‘DSD-like’ intermediate format which is then converted to analog.  The idea behind DSD is that you get to ‘cut out the middlemen’ - in this case the digital conversions to and from the ‘DSD-like’ intermediate formats.  Back when SACD was invented the only way to handle and distribute music data which required 3-5GB of storage space was using optical disks.  Today, not only do we have hard disks that can hold the contents of hundreds upon hundreds of SACDs, but we have an internet infrastructure in place that allows people to download such files as a matter of convenience.  So if we liked the sound of SACD, but wanted to implement it in the more modern world of computer-based audio, the technological wherewithal now exists to support a file-based paradigm similar to what we have become used to with PCM.  This is what underpins the current interest in DSD.

To be sure, the weak link of the above argument is that DSD is not the same as ‘DSD-like’, and in practice you still have to convert digitally between ‘DSD-like’ and DSD in both the ADC and the DAC.  But a weak link is not the same thing as a fatal flaw, and DSD as a consumer format remains highly regarded in many discerning quarters.