Thursday, 30 July 2015

Got a Question?

Got a question you want answered on this blog?  Or a topic you would like to see discussed?  Or maybe you have some feedback you would like to give us?  Just send me an e-mail using

Here are the rules of engagement:

  1. Depending on how many e-mails I receive, I probably won’t reply to your e-mail.
  2. It is entirely up to me whether I address your suggestion on the blog.
  3. It may be a long time before I get round to doing it.
  4. I may be inspired by your suggestion, but end up addressing a different point entirely.
I have kept a copy of this post among the 'stickies' on the right of the page.

Wednesday, 29 July 2015

Things that lurk below the Bit Depth

In digital audio Bit Depth governs the Dynamic Range and Signal-to-Noise Ratio (SNR), but the relationships often lead to confusion.  I thought it was worth a quick discussion to see if I can maybe shed some light on that.  But then I found that my ramblings went a little further than I originally intended.  So read on…

First of all, it is clear that the bit depth sets some sort of lower limit on the magnitude of the signal that can be digitally encoded.  With a 16-bit PCM system, the magnitude of the signal must be encoded as one of 65,536 levels.  You can think of them as going from 0 to 65,535 but in practice they are used from -32,768 to +32,767 which gives us a convenient way to store both the negative and positive excursions of a typical audio signal.  If the magnitude of the signal is such that its peaks exceed ±32,767 then we have a problem because we don’t have levels available to record those values.  This sets an upper limit on the magnitude of a signal we can record.

On the other hand, if we progressively attenuate the signal, making it smaller and smaller, eventually we will get to the point where its peaks barely register at ±1.  If we attenuate it even further, then it will fail to register at all and it will be encoded as silence.  This sets the lower limit on the magnitude of the signal we can record.  Yes, there are some ifs and buts associated with both of these scenarios, but for the purpose of this post they don’t make a lot of difference.
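For anyone who likes to see this sort of thing in code, here is a minimal sketch of the two limits.  It assumes a sample scaled so that 1.0 is full scale; the function name `encode_16bit` is made up for the illustration and is not taken from any particular audio library.

```python
FULL_SCALE = 32767  # largest positive 16-bit code

def encode_16bit(x):
    """Map a sample (1.0 = full scale) to a signed 16-bit integer code."""
    code = round(x * FULL_SCALE)
    # Anything beyond the available levels is clipped to the nearest one.
    return max(-32768, min(32767, code))

print(encode_16bit(0.5))           # a mid-level sample gets a mid-level code
print(encode_16bit(1.5))           # too big: clipped at the top level
print(encode_16bit(0.4 / 32767))   # peaks under half a step: encoded as 0
```

The clipping at the top and the rounding to zero at the bottom are the upper and lower limits described above, with the same ifs and buts glossed over.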

The ratio between the upper and lower limits of the magnitudes that we can record is the Dynamic Range of the recording system.  The mathematics of this works out to be quite simple.  Each bit of the bit depth provides almost exactly 6dB of Dynamic Range.  So, if we are using a 16-bit system our Dynamic Range will be ~96dB (= 16x6).  And if we increase it to 24-bits the Dynamic Range increases to ~144dB (= 24x6).  For those of you who want the exact formula, it is 1.76 + 6.02D (where D is the bit depth).
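The arithmetic is easy to check.  The sketch below computes the ratio between the largest and smallest levels directly, alongside the commonly quoted textbook expression 6.02D + 1.76 for a full-scale sine wave measured against quantization noise.  The function names are my own, chosen for the illustration.

```python
import math

def dynamic_range_db(bits):
    """Ratio of the largest to the smallest encodable magnitude, in dB."""
    return 20 * math.log10(2 ** bits)

def sine_snr_db(bits):
    """Textbook SNR of a full-scale sine over quantization noise."""
    return 6.02 * bits + 1.76

for bits in (16, 24):
    print(bits, round(dynamic_range_db(bits), 1), round(sine_snr_db(bits), 1))
```

Either way you get a little over 96dB for 16 bits and a little over 144dB for 24 bits, which is why the 6dB-per-bit rule of thumb is good enough for most purposes.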

So far, so good.  But where does the SNR come into it?  The answer, and the reason why it is the cause of so much confusion, is that both the signal and the noise are frequency dependent.  Both may be spread over several frequencies, which may be similar or different frequencies.  Sometimes you don’t actually know too much about the frequency distributions of either.  Therefore, in order to be able to analyze and measure the ratios of one to the other, you often need to be able to look at the frequency distributions of both.

The way to do that is to take the audio data and use your computer to put it through a Fourier Transform.  This breaks the audio data down into individual frequencies, and for each frequency it tells you how much of that particular frequency is present in the audio data.  If you plot all these data points on a graph you get the audio data’s Frequency Spectrum.  In digital audio, we use a variant of the Fourier Transform called a DFT, which takes as its input a specific part of the audio data comprising a number of consecutive samples.  With a DFT the number of audio samples ends up being the same as the number of frequencies in the resulting Frequency Spectrum, so if we use a lot of audio data we can obtain very good resolution in the frequency spectrum.  However, if we use too many samples it can make the calculation itself excessively laborious, so most audio DFTs are usually derived from between 1,000 and 65,000 samples.
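To make that concrete, here is a deliberately naive DFT written out longhand.  It is only a sketch: real software uses a fast algorithm (the FFT), and the 44.1kHz sample rate, the 64-sample length and the test tone are all assumptions picked to keep the example small.

```python
import cmath
import math

def dft(samples):
    """Naive DFT: returns one complex value per input sample."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        for k in range(n)
    ]

SAMPLE_RATE = 44100   # assumed CD sample rate
N = 64                # deliberately tiny so the naive method stays quick
TONE_BIN = 4          # tone placed exactly on one of the DFT's frequencies

tone = [math.sin(2 * math.pi * TONE_BIN * i / N) for i in range(N)]
spectrum = dft(tone)
magnitudes = [abs(v) for v in spectrum]

# For real-valued audio the upper half of the spectrum mirrors the lower half,
# so we only search the lower half for the peak.
peak = max(range(N // 2), key=lambda k: magnitudes[k])

print(len(tone), len(spectrum))   # same count in and out
print(SAMPLE_RATE / N)            # spacing between adjacent frequencies, in Hz
print(peak)                       # the frequency index where the tone lands
```

The spacing between frequencies is the sample rate divided by the number of samples, which is why more samples buys finer resolution, and the doubly nested sum is why more samples also buys a lot more arithmetic.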

In principle, we can synthesize an audio data file containing nothing but a 1kHz pure tone, with no noise whatsoever.  If we looked at the DFT of that data file we would see a signal at the 1kHz frequency point, and absolutely nothing everywhere else.  This makes sense, because we have some signal, and no noise at all.  I can also synthesize a noise file by filling each audio sample with random numbers.  If the numbers are truly random, we get White Noise.  I can encode my white noise at full power (where the maximum positive and negative going encoded values are ±32,767), or I can attenuate it by 96dB so that the maximum positive and negative going encoded values are ±1.  If I attenuate it by more than that I only get silence.

Suppose I look at a DFT of my synthesized music data file containing white noise at -96dB.  Suppose my DFT uses 8,192 samples, and consequently I end up with a Frequency Spectrum with 8,192 frequencies.  What do we expect to see?  Most people would expect to see the noise value at each frequency to be -96dB, but they would be wrong.  The value is much lower than that.  [Furthermore, there is a lot of “noise” in the frequency spectrum itself, although for the purposes of this post we are going to ignore that aspect of it.]  Basically, the noise is more or less equally shared out among the 8,192 frequencies, so the noise at each frequency is approximately 1/8192 of the total noise, or about 39dB down.  The important result here is that the level of the noise floor in the DFT plot is a long way below the supposed -96dB noise floor, and how far below depends on the number of frequencies in the DFT.  And there is more.  DFTs use a thing called a ‘window function’ for reasons I have described in a previous post, and the choice of window function significantly impacts the level where the noise floor sits in the DFT plot.
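The size of that drop is simple to estimate if we assume the noise power really is shared equally and ignore the window function and any scaling conventions of the DFT software.  The helper name below is invented for the illustration.

```python
import math

def per_bin_drop_db(num_bins):
    """How far below the total an equal 1/num_bins share of noise power sits."""
    return 10 * math.log10(num_bins)

for n in (1024, 8192, 65536):
    print(n, round(per_bin_drop_db(n), 1))
```

For 8,192 frequencies this comes out at a little over 39dB, in the same neighbourhood as the figure quoted above, and every doubling of the DFT size pushes the plotted noise floor down by another 3dB.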

If I make a synthesized music file containing a combination of a 1kHz pure tone and white noise at -96dB, and look at that in a DFT, what would we see?  The answer is that the noise behaves exactly as I have previously described, with the level of the noise floor on the plot varying according to both the number of frequencies in the DFT and the choice of window function.  The 1kHz pure tone is not affected, though.  Because it is a 1kHz pure tone, its energy only appears at the one frequency in the DFT corresponding to 1kHz, and it really doesn’t matter that much how many frequencies there are in the DFT.  [The choice of window function does impact both of those things, but for the purposes of this post I want to ignore that.]
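Here is a small experiment along those lines, offered as a sketch rather than a rigorous measurement.  It assumes floating-point samples, no window function, a tone that lands exactly on one DFT frequency, and uniformly distributed random noise with peaks 96dB below full scale.  The DFT sizes are kept tiny because the DFT is computed the slow way.

```python
import cmath
import math
import random

def bin_levels_db(samples):
    """Level of each DFT bin in dB, scaled so a full-scale sine reads about -6dB."""
    n = len(samples)
    levels = []
    for k in range(n // 2):
        x = sum(s * cmath.exp(-2j * math.pi * k * i / n) for i, s in enumerate(samples))
        levels.append(20 * math.log10(abs(x) / n + 1e-30))
    return levels

def tone_and_floor(n, rng):
    """Return (tone level, average noise-floor level) in dB for an n-point DFT."""
    tone_bin = n // 16                      # same tone frequency for every n
    noise_amp = 10 ** (-96 / 20)            # noise peaks 96dB below full scale
    samples = [
        math.sin(2 * math.pi * tone_bin * i / n) + rng.uniform(-noise_amp, noise_amp)
        for i in range(n)
    ]
    levels = bin_levels_db(samples)
    # Average the power (not the dB values) of every bin except the tone's.
    others = [10 ** (lv / 10) for k, lv in enumerate(levels) if k != tone_bin]
    floor = 10 * math.log10(sum(others) / len(others))
    return levels[tone_bin], floor

rng = random.Random(1)
for n in (64, 512):
    tone_db, floor_db = tone_and_floor(n, rng)
    print(n, round(tone_db, 1), round(floor_db, 1))
```

The tone reads essentially the same level at both sizes, while the noise floor sits noticeably lower in the larger DFT.  The apparent gap between signal and noise therefore depends on how the DFT was set up.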

The Signal-to-Noise Ratio (SNR) is exactly what it says.  It is the ratio of the signal to the noise.  If those values are expressed in a dB scale, then it is the difference between the two dB values.  So if the signal is at -6dB and the noise is at -81dB, then the SNR will be 75dB, which is the difference between the two.  But since we have seen that the actual measured value of the noise level varies depending on how we do the DFT, yet the measured value of the signal pretty much does not, then an SNR value derived from a DFT is not very useful when it comes to quoting numbers for comparison purposes.  It is only useful when comparing two measurements made using the same DFT algorithms, set up with the exact same number of samples and the same window function.

Sometimes the SNR has to be measured purely in analog space.  For example, you might measure the overall background hiss on a blank analog tape before recording anything on it.  When you then make your recording, one measure of the SNR will be the ratio between the level of the recording and the level of the tape hiss.  Or you can measure the output of a microphone in a silent anechoic chamber before using the same microphone to make a recording.  One measure of the SNR of the microphone would be the ratio between the two recorded levels.  I use the term “one measure” of the SNR intentionally - because any time you measure SNR, whatever the tools and methods you use to make the measurement, the result is only of relevance if the methodology is well thought out and fully disclosed along with the results.  In reality, the nuances and variables are such that you can write whole books about how to specify and measure SNR.

Clearly, if the noise component of the SNR is a broadband signal, such as the hiss from a blank tape or the signal from a microphone in a silent room, then my ability to accurately represent that noise in a PCM format is going to be limited by the bit depth and therefore by its Dynamic Range.  But if I use a DFT to examine the spectral content of the noise signal, then, as I have just described, the noise is going to be spread over all of the frequencies and the component of the noise at each frequency will be proportionately lower.  What does it mean, then, if the noise content at a given frequency is below the minimum level represented by the Dynamic Range?  For example, in a 16-bit system, where the Dynamic Range is about 96dB, what does it mean if the noise level at any given frequency is measured using a DFT to be a long way below that - for example at -120dB?  Clearly, that noise is being encoded, so we must conclude that a 16-bit system can encode noise at levels several tens of dB below what we thought was the minimum signal level that could be encoded.  The question then arises, if we can encode noise at those levels, can we also encode signals?

The answer is yes we can, but at this point my challenge is to explain how this is possible in words of one proverbial syllable.  My approach is to propose a thought experiment.  Let’s take the white noise signal we talked about previously - the one at a level of -96dB which is just about encodable in a 16-bit PCM format.  We took our DFT of this signal and found that the noise component at each frequency was way lower than -96dB - let’s say that it was 30dB down at -126dB.  Therefore the frequency content of the noise signal at one specific frequency - say, 1kHz - was at a level of -126dB.  Let us therefore apply a hypothetical filter to the input signal such that we strip out every frequency except 1kHz.  So now, we have taken our white noise signal at -96dB and filtered it to become a 1kHz signal at -126dB.  Our DFT previously showed that we had managed to encode that signal, in the sense that our DFT registered its presence and measured its intensity.  But, with the rest of the white noise filtered out, our input signal now comprises nothing but a single frequency at a level 30dB **below** the minimum level that can be represented by a 16-bit PCM system, and the result is pure, unadulterated, digital silence.

What happened there?  When the 1kHz component was part of the noise, we could detect its presence in the encoded signal, but when the rest of the noise was stripped out leaving only the 1kHz component behind, that 1kHz component vanished also.  It is clear that the presence of the totality of the noise is critical in permitting each of its component frequencies to be encoded.  There is something about the presence of noise that enables information to be encoded in a PCM system at levels far below those determined solely by the bit depth.

Exactly what it is, though, is beyond the scope of this post.  I’m sorry, because I know you were salivating to hear the answer!  But from this point on it boils down to ugly mathematics.  However, this result forms the basis of a principle that can be used to accomplish a number of party tricks in the digital audio domain.  These tricks include dithering, noise shaping, and sigma-delta modulation.  With dithering, we can add a very small amount of noise in order for a larger amount of quantization-error-induced distortion to be eliminated.  With noise shaping, we can reduce the noise level at frequencies where we are sensitive to noise, at the expense of increasing the noise at frequencies where it is less audible.  And with sigma-delta modulation we can obtain state-of-the-art audio performance from a bit depth of as little as 1-bit, at the expense of a dramatic increase in the sample rate.

With DSD, for example, an entire audio stream with state-of-the-art performance can be made to lurk below the 1-bit Bit Depth.
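As a footnote to the thought experiment above, the effect is easy to reproduce numerically.  The sketch below is a loose illustration and not the exact scenario described: the tone sits at 0.3 of one quantization step (only about 10dB under the threshold, not 30dB), and the noise is a deliberately added triangular dither rather than filtered white noise.  All the names and numbers are assumptions made for the example.

```python
import math
import random

N = 4096
TONE_BIN = 128     # tone sits exactly on one DFT frequency
AMPLITUDE = 0.3    # in units of one quantization step, so too small to register

def quantize(x):
    """Round to the nearest integer level."""
    return math.floor(x + 0.5)

def tone_amplitude(samples):
    """Estimate the amplitude at TONE_BIN by correlating against that one frequency."""
    re = sum(s * math.cos(2 * math.pi * TONE_BIN * i / N) for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * TONE_BIN * i / N) for i, s in enumerate(samples))
    return 2 * math.hypot(re, im) / N

rng = random.Random(0)
tone = [AMPLITUDE * math.sin(2 * math.pi * TONE_BIN * i / N) for i in range(N)]

# Without noise, every sample rounds to the same level: digital silence.
bare = [quantize(x) for x in tone]

# Triangular dither: the sum of two uniform random values, each +/- half a step.
dithered = [quantize(x + rng.uniform(-0.5, 0.5) + rng.uniform(-0.5, 0.5)) for x in tone]

print(set(bare))                            # only one level is ever used
print(round(tone_amplitude(bare), 3))       # nothing to find at the tone frequency
print(round(tone_amplitude(dithered), 3))   # close to the original 0.3
```

On its own the tone vanishes completely.  With the noise present, the encoded samples are noisy, but the tone can be recovered from them at very nearly its original level.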

Monday, 20 July 2015

How Do DACs Work?

All digital audio, whether PCM or DSD, stores the analog audio signal as a stream of numbers, each one representing an instantaneous snapshot of its continuously evolving value.  With either format, the digital bit pattern is its best representation of the analog signal value at each instant in time.  With PCM the bit pattern typically comprises 16-bit (or 24-bit) numbers, each representing the value of the analog signal to a precision of one part in 65,536 (or one part in 16,777,216).  With DSD the precision is 1 bit, which means that it encodes the instantaneous analog voltage as either maximum positive or maximum negative with nothing in between (and you may well wonder how that manages to represent anything, which is a different discussion entirely, but nevertheless it does).  In either case, though, the primary task of the DAC is to generate those output voltages in response to the incoming bitstream.  Let’s take a look at how that is done.

For the purposes of this post I am going to focus exclusively on the core mechanisms involved in transforming a bit stream into an analog signal.  Aside from these core mechanisms there are further mission-critical issues such as clock timing and noise, but these are not the subject of this post.  At some point I will write another post on clocks, timing, and jitter.

The most conceptually simple way of converting digital to analog, is to use something called an R-2R ladder.  This is a simple sequence of resistors of alternating values ‘R’ and ‘2R’, wired together in a ‘ladder’-like configuration.  There’s nothing more to it than that.  Each ‘2R’ resistor has exactly twice the resistance value as each ‘R’ resistor, and all the ‘R’s and all the ‘2R’s are absolutely identical.  Beyond that, the actual value of the resistances is not crucial.  Each R-2R pair, if turned “on” by its corresponding PCM bit, contributes the exact voltage to the output which is encoded by that bit.  It is very simple to understand, and in principle is trivial to construct, but in practice it suffers from a very serious drawback.  You see, the resistors have to be accurate to a phenomenal degree.  For 16-bit PCM that means an accuracy of one part in 65 thousand, and for 24-bit PCM one part in 16 million.  If you want to make your own R-2R ladder-DAC you need to be able to go out and buy those resistors.

As best as I can tell, the most accurate resistors available out there on a regular commercial basis are accurate to ±0.005% which is equivalent to one part in 20,000.  Heaven knows what they cost.  And that’s not the end of the story.  The resistance value is very sensitive to temperature, which means you have to mount them in a carefully temperature-controlled environment.  And even if you do that, the act of passing the smallest current through it will heat it sufficiently to change its resistance value.  [Note:  In fact this tends to be what limits the accuracy of available resistors - the act of measuring the resistance actually perturbs the resistance by more than the accuracy to which you’re trying to measure it!  Imagine what that means when you try to deploy the resistor in an actual circuit…]  The resistor’s inherent inductance (even straight wires have inductance) also affects the DAC ladder when such phenomenal levels of precision enter the equation.  And we’re still not done yet: unfortunately the resistance values drift with time, so your precision-assembled, thermally cushioned and inductance-balanced R-2R network may leave the factory operating to spec, but may well be out of spec by the time it has broken in at the customer’s system.  These are the problems that a putative R-2R ladder DAC designer must be willing and able to face up to.  Which is why there are so few of them on the market.
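To put a number on the resistor problem, here is a rough sketch.  It does not model a real R-2R network node by node.  It simply assumes that each bit contributes an independent binary-weighted share of the output, and that only the most significant bit is off by the ±0.005% tolerance mentioned above.  The function name and that simplification are mine.

```python
def dac_output(code, bits, msb_error=0.0):
    """Output as a fraction of full scale for an unsigned input code.

    Simplification: each bit contributes an independent binary weight,
    and only the most significant bit's weight is mis-scaled by msb_error.
    """
    total = 0.0
    for position in range(bits):            # position 0 is the MSB
        if (code >> (bits - 1 - position)) & 1:
            weight = 0.5 ** (position + 1)
            if position == 0:
                weight *= 1 + msb_error
            total += weight
    return total

BITS = 16
TOLERANCE = 0.00005                         # +/-0.005%, i.e. one part in 20,000
lsb = 0.5 ** BITS                           # size of the smallest step
code = 1 << (BITS - 1)                      # only the MSB switched on

error = dac_output(code, BITS, TOLERANCE) - dac_output(code, BITS)
print(error / lsb)                          # error expressed in LSB steps
```

Under these assumptions the error from that one resistor already comes to somewhat more than a whole step at 16 bits, so the bottom bit is no longer trustworthy.  At 24 bits the same tolerance would swamp the bottom eight bits or so.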

Manufacturers of some R-2R ladder-DACs use the term ‘NOS’ (Non-Over-Sampling) to describe their architecture.  I don’t much like that terminology because it is a rather vague piece of jargon and can in principle be used to mean other things, but the blame lies at the feet of many modern DAC chipset manufacturers (and the DAC product manufacturers who use them) who describe their architectures as "Over-Sampling", hence the use of the term NOS as a distinction.

Before moving on, we’ll take an equally close look at how DSD gets converted to analog.  In principle, the incoming bit stream can be fed into its own 1-bit R-2R ladder, which, being 1-bit, is no longer a ladder and comprises only the first resistor R, whose precision no longer really matters.  And that’s all there is to it.  Easy, in comparison to PCM.  Something which has not gone unnoticed … and which we’ll come back to again later.

Aside from what I have just described, for both PCM and DSD three major things are left for the designer to deal with.  First is to make sure the output reference voltages are stable and with as little noise as possible.  Second is to ensure that the switching of the analog voltages in response to the incoming digital bit stream is done in a consistent manner and with sufficient timing accuracy.  Third is to remove any unwanted noise that might be present in the analog signal that has just been created.  These are the implementation areas in which a designer generally has the most freedom and opportunity to bring his own skills to bear.

The third of these is the most interesting in the sense that it differs dramatically between 1-bit (DSD) and multi-bit (PCM) converters.  Although in both cases the noise that needs to be removed lives at inaudible ultrasonic frequencies, with PCM there is not much of it at all, whereas with DSD there is so much of it that the noise power massively overwhelms the signal power.  With PCM, there are even some DACs which dispense with analog filtering entirely, working on the basis that the noise is both inaudible, and at too low a level to be able to upset the downstream electronics.  With DSD, though, removing this noise is a necessary and significant requirement.

Regarding the analog filters, most designers are agreed that although different audio stream formats can be optimized such that each format has its own ideal analog filter, if a DAC is designed to support multiple stream formats it is impractical to provide multiple analog filters and switch them in and out of circuit according to the format currently being played.  Therefore most DACs will have a single analog output filter which is used for every incoming stream format.

The developers of the original SACD players noted that the type of analog filter that was required to perform this task was more or less the same as the anti-aliasing filters used in the output of the CD format, which they were trying to improve upon.  They recognized that those filters degraded the sound.  So instead, in the earliest players, they decided to upconvert the DSD from what we today call DSD64 to what we would now call DSD128.  With DSD128 the ultrasonic filter was found to be less of a problem and was considered not to affect the sound in the same way.  Bear in mind, though, that in doing the upconversion from DSD64 to DSD128 you still have to filter out the DSD64’s ultrasonic noise.  However, this can be done in the digital domain, and (long story short) digital filters almost always sound better than their analog counterparts.

As it happens, similar techniques had already been in use with PCM DACs for over a decade.  Because R-2R ladder DACs were so hard to work with, it was much easier to convert the incoming PCM to a DSD-like format and perform the physical D-to-A conversion step in a 1-bit format.  Although the conversion of PCM to DSD via an SDM is technically very complex and elaborate, it can be done entirely in the digital realm which means that it can also be done remarkably inexpensively.

When I say "DSD-like" what do I mean?  DSD, strictly speaking, is a trademark developed by Sony and Philips (and currently owned by Sonic Studio, LLC).  It stands for Direct Stream Digital and refers specifically to a 1-bit format at a sample rate of 2.8224MHz.  But the term is now being widely used to refer to a broad class of formats which encode the audio signal using the output of a Sigma-Delta Modulator (SDM).  An SDM can be configured to operate at any sample rate you like and with any bit depth you like.  For example, the output of an SDM could even be a conventional PCM bitstream and such an SDM can actually pass a PCM bitstream through unchanged.  A key limitation of an SDM is that it can be unstable when configured with a 1-bit output stream.  However, this instability can be practically eliminated by using a multi-bit output.  For this reason, most modern PCM DACs will upconvert (or ‘Over-Sample’) the incoming PCM before passing it through an SDM with an output bit depth of between 3 and 5 bits.  This means that the physical D-to-A conversion is done with a 3- to 5-stage resistor ladder, which can be easily implemented.

These SDM-based DACs are so effective that today there are hardly any R-2R ladder DACs in production, and those that are (such as the Light Harmonic Da Vinci) can be eye-wateringly expensive.  The intermediate conversion of an incoming signal to a DSD-like format means that, in principle, any digital format (including DSD) can be readily supported, as evidenced by the plethora of DSD-compatible DACs on the market today.  Because these internal conversions are performed entirely in the digital domain, manufacturers typically produce complete chip sets capable of performing all of the conversion functionality on-chip, driving the costs down considerably when compared to an R-2R ladder approach.  The majority of DACs on the market today utilize chip sets from one of five major suppliers (ESS, Wolfson, Burr-Brown (TI), AKM, and Philips), although there are others as well.

Interestingly, all of this is behind the recent emergence of DSD as a niche in-demand consumer format.  In a previous post I showed that almost all ADCs in use today use an SDM-based structure to create a ‘DSD-like’ intermediate format which is then digitally converted to PCM.  Today I showed the corollary in DAC architectures where incoming PCM is digitally converted to a ‘DSD-like’ intermediate format which is then converted to analog.  The idea behind DSD is that you get to ‘cut out the middlemen’ - in this case the digital conversions to and from the ‘DSD-like’ intermediate formats.  Back when SACD was invented the only way to handle and distribute music data which required 3-5GB of storage space was using optical disks.  Today, not only do we have hard disks that can hold the contents of hundreds upon hundreds of SACDs, but we have an internet infrastructure in place that allows people to download such files as a matter of convenience.  So if we liked the sound of SACD, but wanted to implement it in the more modern world of computer-based audio, the technological wherewithal now exists to support a file-based paradigm similar to what we have become used to with PCM.  This is what underpins the current interest in DSD.

To be sure, the weak link of the above argument is that DSD is not the same as ‘DSD-like’, and in practice you still have to convert digitally between ‘DSD-like’ and DSD in both the ADC and the DAC.  But a weak link is not the same thing as a fatal flaw, and DSD as a consumer format remains highly regarded in many discerning quarters.

Thursday, 25 June 2015

On DSD vs PCM … again

Mark Waldrep (aka ‘Dr. AIX’) has put a couple of DSD posts on his RealHD-Audio web site this month.  Mark writes quite knowledgeably on audiophile matters, but is prone to a ‘you-can’t-argue-with-the-facts’ attitude predicated on an overly simplistic subset of what actually comprises ‘the facts’.   In particular, Mark insists that 24-bit, 96kHz PCM is better than DSD, and one of the posts I am referring to discusses his abject bewilderment that 530 people (and counting) on the ‘Computer Audiophile’ blog would go to the trouble of participating in a thread which actively debates this assertion.  He writes as though it were a self-evident ‘night-follows-day’ kind of an issue, almost a point of theology.

Let’s look at some of those facts.  First of all, properly-dithered 24-bit PCM has a theoretical background noise signal within a dB or so of 144dB, whereas DSD64 rarely approaches within even 20dB of that.  No argument from me there.  Also, he points out that DSD64’s noise shaping process produces a massive amount of ultrasonic noise, which starts to appear just above the audio band and continues at a very, very high level all the way out to over 1MHz, which, he argues, all but subsumes the audio signal unless it is filtered out.  We’ll grant him some hyperbolic license, and agree that, technically, what he says is correct.

Another ‘fact’ is, though, that much to Waldrep’s chagrin, there is a substantial body of opinion out there that would prefer to listen to DSD over 24/96.  Why should this be, given that the above technical arguments (and others that you could also add into the mix with which I might also tend to agree) evidently set forth ‘the facts’?  Yes, why indeed… and the answer is simple to state, but complex in scope.  The main reason is that the pro-PCM arguments conveniently ignore the most critical aspect that differentiates the sound quality, which is the business of getting the audio signal into the PCM format in the first place.  Let’s take a look at that.

If we are to encode an audio signal in PCM format, the most obvious way to approach the problem is using a sample-and-hold circuit.  This circuit looks at the incoming waveform, grabs hold of it at one specific instant, and ‘holds’ that value for the remainder of the sampling period.  By ‘holding’ the signal, what we are doing is zeroing in on the value that we actually want to measure long enough to actually measure it.

Next we have to assign a digital value to this sampled voltage, and there are a couple of distinct ways to do this.  One technique involves comparing the sampled signal level to the instantaneous value of a sawtooth waveform generated by a precision clock.  As soon as the comparator detects that the instantaneous value of the sawtooth waveform has exceeded the value of the sampled waveform, by looking at the number of clock cycles that have passed we can calculate a digital value for the sampled waveform.  Another technique is a ‘flash ADC’ where a number of simultaneous comparisons are made to precise DC values, each being a unique digital level.  Obviously, for a 16-bit ADC this would mean 65,535 comparator circuits!  That’s doable, but rather expensive.  Think of it as the ADC equivalent of an R-2R ladder DAC.  Yet another method is a hybrid of the two, where a sequence of comparators successively home in on the final result through a series of successive approximations whose logic I won’t attempt to unravel here.  Each of these methods is limited by the accuracy of both the timer clock and the reference voltage levels.

Ultimately, in mixed-signal electronics (circuits with both analog and digital functions), it ends up being far easier to achieve a clock of arbitrary precision than a reference voltage of arbitrary precision.  Way more so, in fact.  For this reason, sample-and-hold ADC architectures have fallen from favour in the world of high-end audio.  Instead, a technique called Sigma-Delta Modulation is used.  You will recognize this term - it is the architecture that is used to create the 1-bit bitstream used in DSD.  The SDM-ADC has for all practical purposes totally eliminated the sample-and-hold architectures in audio applications.

In an SDM-ADC, the trade-off between clock precision and reference voltage precision is resolved entirely in favour of the clock, which can be made as accurate as we want.  In effect, we increase the sample rate to something many, many times higher than what is actually required, and accept a significantly reduced measurement accuracy.  The inaccuracy of the instantaneous measurements is taken care of by a combination of averaging due to massive over-sampling and local feedback within the SDM.  That will have to do in terms of an explanation, because an SDM is a conceptually complex beast, particularly in its analog form.  In any case, the output of the SDM is a digital bitstream which can be 1-bit, but in reality is often 3-5 bits deep.  The PCM output data is obtained on-chip by a digital conversion process similar to that which happens within DSD Master.
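For the curious, the averaging-plus-feedback idea can be shown with a toy model.  The sketch below is a first-order, all-digital, 1-bit modulator, which is far simpler than anything in a real ADC chip (those use higher-order loops, and the front end is analog).  The function name is invented for the illustration.

```python
def sigma_delta_1bit(samples):
    """First-order sigma-delta modulator: input in -1..+1, output +1/-1 per sample."""
    integrator = 0.0
    previous = 0.0
    bits = []
    for x in samples:
        integrator += x - previous          # accumulate the error fed back from the output
        previous = 1.0 if integrator >= 0 else -1.0
        bits.append(previous)
    return bits

level = 0.25                                # a steady input, a quarter of full scale
stream = sigma_delta_1bit([level] * 10000)
print(stream[:16])                          # each output is only ever +1 or -1
print(sum(stream) / len(stream))            # yet the long-run average tracks the input
```

Each individual output is a hopelessly crude measurement, but the feedback loop keeps the running error bounded, so the average over many samples settles very close to the input value.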

As you know, if you are going to encode an analog signal in a PCM format, the price you have to pay is to strictly band-limit the signal to less than one half of the sample rate prior to encoding it.  This involves putting the signal through a ‘brick wall’ filter which removes all of the signal above a certain frequency while leaving everything below that frequency unchecked.  In a sample-and-hold ADC this is performed using an all-analog filter located within the input stage of the ADC.  In the SDM-ADC it is performed in the digital domain during the conversion from the 1-bit (or 3-5 bit) bitstream to the PCM output.

Brick wall filters are nasty things.  Let’s look at a loudspeaker crossover filter as an example of a simple low-pass analog filter that generally can’t be avoided in our audio chain.  The simplest filter is a single-stage filter with a cut-off slope of 6dB per octave (6dB/8ve).  Steeper filters are considered to be progressively more intrusive due to phase disturbances which they introduce, although in practical designs steeper filters are often necessary to get around still greater issues elsewhere.  Now compare that to a brick-wall ‘anti-aliasing’ filter.  For 16/44.1 audio, this needs to pass all frequencies up to 20kHz, yet attenuate all frequencies above 22.05kHz by at least 96dB.  That means a slope of at least 300dB/8ve is required.
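A back-of-envelope check on that slope is shown below.  It treats the transition from 20kHz to 22.05kHz as if it were a straight line on a dB-versus-octaves plot, which is a simplification of how real filters behave, and the function name is my own.

```python
import math

def required_slope_db_per_octave(pass_edge_hz, stop_edge_hz, attenuation_db):
    """Average slope needed to fall by attenuation_db between the two edges."""
    octaves = math.log2(stop_edge_hz / pass_edge_hz)
    return attenuation_db / octaves

slope = required_slope_db_per_octave(20000, 22050, 96)
print(round(slope))        # dB per octave
print(round(slope / 6))    # how many 6dB/8ve stages that would amount to
```

On this simple reckoning the requirement is comfortably beyond the 300dB/8ve mark, because the gap between 20kHz and 22.05kHz is only about a seventh of an octave.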

If we confine ourselves purely to digital anti-aliasing filters used in an SDM-ADC, a slope of 300dB/8ve inevitably requires an ‘elliptic’ filter.  Whole books have been devoted to elliptic filters, so I shall confine myself to saying that these filters have rather ugly phase responses.  In principle they also have a degree of pass-band ripple, but I am willing to stipulate to an argument that such ripple is practically inaudible.  The phase argument is another matter, though.  Although conventional wisdom has it that phase distortion is inaudible, there is an increasing body of anecdotal evidence that suggests the opposite is the case.  One of the core pillars of Meridian’s recent MQA initiative is based on the assumed superiority of “minimum phase” filter architectures, for example.

By increasing the sample rate of PCM we can actually reduce the aggression required of our anti-aliasing filters.  I have written a previous post on this subject, but the bottom line is that only at sample rates above the 8Fs family (352.8/384kHz) can anti-aliasing filters be implemented with sufficiently low phase distortion.  And Dr. AIX pooh-poohs even 24/352.8 (aka ‘DXD’) as a credible format for high-end audio.  Here at BitPerfect we are persuaded by the notion that the sound of digital audio is actually the sound of the anti-aliasing filters that are necessary for its existence, and that the characteristic that predominantly governs this is their phase response.

PCM requires an anti-aliasing filter, whereas DSD does not (actually, strictly speaking it does, but it is such a gentle filter that you could not with any kind of a straight face describe it as a ‘brick-wall’ filter).  DSD has no inherent phase distortion resulting from a required filter.  Instead, it has ultrasonic noise, and this is where Dr. AIX’s argument encounters difficulties.  The simple solution is to filter it out.  However, if we read his post, he seems to think that no such filtering is used … I quote: "It’s supposed to be out of the audio band but there is no ‘audio band’ for your playback equipment".  Seriously?  All it calls for is a filter similar to PCM’s ‘anti-aliasing’ filter, except not nearly as rigorous in its requirements.

Let me tell you how DSD Master approaches this in our DSD-to-PCM conversions.  We know that, for 24/176.4 PCM conversions for example, we need only concern ourselves in a strict sense with that portion of the ultrasonic noise above 88.2kHz.  It needs to be filtered out by at least 144dB or we will get aliasing.  However, the steepness of the filter and its phase response are governed by the filter’s cut-off frequency.  For the filters we use, the phase response remains pretty much linear up to about 80% of this frequency.  Therefore we have some design freedom to push this frequency out as far as we want, and we choose to place it at a high enough frequency that the phase response remains quasi-linear across the entire audio band.  Of course, the further we push it out, the more of the ultrasonic noise is allowed to remain in the encoded PCM data.
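To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The function names are my own, invented for illustration, and the 80% figure is simply the rule of thumb quoted above rather than a property of any particular filter design.

```python
import math

def nyquist_hz(sample_rate_hz):
    """Highest frequency a PCM stream at this sample rate can represent."""
    return sample_rate_hz / 2.0

def bit_depth_range_db(bits):
    """Ratio of full scale to one quantization step, in dB: 20*log10(2**bits)."""
    return 20.0 * math.log10(2.0 ** bits)

def min_cutoff_for_linear_band(band_edge_hz, linear_fraction=0.8):
    """Lowest cut-off that keeps band_edge_hz inside the quasi-linear phase
    region, assuming phase stays linear up to linear_fraction of the cut-off
    (an assumption taken from the rule of thumb in the text)."""
    return band_edge_hz / linear_fraction

print(nyquist_hz(176_400))                  # noise above this frequency would alias
print(round(bit_depth_range_db(24), 1))     # attenuation needed to reach the 24-bit floor
print(round(min_cutoff_for_linear_band(20_000)))  # cut-off needed for a 20kHz audio band
```

For 24/176.4 this gives a Nyquist frequency of 88.2kHz and a figure of roughly 144dB, which is where the numbers in the paragraph above come from.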

As an aside, you might well ask: If the ultrasonic noise is inaudible, then why do we have to filter it out in the first place?  And that would indeed be a good question.  According to auditory measurements, it is simple to determine that humans can’t hear anything above 20kHz - or even less as we age.  However, more elaborate investigations indicate that we do respond subconsciously to ultrasonic stimuli that we cannot otherwise demonstrate that we hear.  So it remains an interesting open question whether the presence of heavy ultrasonic content would actually have an impact on our perception of the sound.  On the other hand, a lot of audio equipment is not designed to handle a heavy ultrasonic signal content.  We know of one high-end TEAC DAC that could not lock onto a signal that contained even a modest -60dB of ultrasonic content (that problem, once identified, was quickly fixed with a firmware update).  Such are probably as good reasons as any to want to filter it out.

So what do we do with the DSD content above 20kHz?  In developing DSD Master we take the view that the content of this frequency range contains both the high-frequency content of the original signal (if any), plus the added high frequency noise created by the SDM’s noise-shaping process.  We try to maintain any high frequency content within the signal flat up to 30kHz, and then begin our roll-off above that.  Consequently, our DSD conversions at high sample rates (88.2kHz and above) do contain a significant ultrasonic peak in the 35-40kHz range.  However, that peak is limited to about -80dB, which is way too low to either be audible(!) or to cause instability in anyone’s electronics.  Meanwhile, the phase response is quasi-linear up to the point at which the ultrasonic noise rises above the signal level.

In designing DSD Master, we make those design compromises on the basis that the purpose of these conversions is to be used for final listening purposes.  But if a similar functionality is being designed for the internal conversion stage of a PCM SDM-ADC then we know that a residual ultrasonic noise peak in the output data is not going to be acceptable.  In our view, this means that design choices will be made which do not necessarily coincide with the best possible sound quality.

As a final point, all the above observations are specific to ‘regular’ DSD (aka ‘DSD64’).  The problem with ultrasonic noise pretty much goes away with DSD128 and above, something I have also written about in detail in a previous post.

So, from the foregoing, purely from a logical point of view, it seems somewhat contradictory for Dr. AIX to suggest that 24/96 PCM is inherently better than DSD, since DSD comes directly out of an SDM in its native form, whereas PCM is derived through digital manipulation of an SDM output with, among other things, a ‘brick-wall’ filter with a less-than-optimal configuration.  I’ll also point out that his argument suggests that DSD (i.e. the output of an SDM) will not deliver the full bit depth that he offers up as a key distinguishing feature of 24/96.  Of course, those arguments apply only to ‘purist’ recordings which seek to capture the microphone output as naturally as possible.  In that way the discussion is not coloured by any post-processing of the signal, which in any case is not possible in the native DSD domain.

Monday, 22 June 2015

Day One - Intellectual Property

I have mentioned before that I subscribe to B&W’s Society of Sound, and have done so for the last five years.  It costs me $60 for a 12-month subscription for which I get to download 24 high-resolution albums, two per month.  I think it’s a great deal.  Each month, I get one album from London Symphony Orchestra’s LSO Live label, and one from Peter Gabriel’s RealWorld label.  For me, the classical downloads are the major pull, but occasionally the RealWorld offering turns out to be the bigger gem.  Such was the case this month, when the offering was Day One’s new album Intellectual Property.

Day One is a band I had never heard of.  Over the course of 15 years, this would appear to be only the English duo's third album, but they have managed to make their mark with contributions to a number of TV and movie soundtracks.  Check out the link below.

How would I describe the latest album “Intellectual Property”?  On one hand there are a number of apparent influences which include David Bowie, Peter Gabriel, Lou Reed, Ian Dury, and the Red Hot Chili Peppers.  On the other hand I detect stylistic tips of the hat to Motown, Country, and New Age, all of which underlie an overall vibe of something you might call “stoner hip-hop”.  Maybe it is all those 70’s and 80’s influences that appeal to me.  What’s that you said?…. the stoner element?

Anyway, call it what you will, it is a superb album.  The songwriting is sharp and observant without trying to be too deep.  Each and every track has a clear hook, and the recording is clean and full, with a sensitive hand on the production levers, although the sound overall does fall short of the highest audiophile standards.  At any rate, it is quite simply a first-rate album.  And, at the moment, I think the only way to get hold of it is via B&W’s Society of Sound.  So, if nothing else, it is a good excuse for you to check it out.  There is even a Free Trial option, so what’s holding you up?  :)

Dynamic Compression

Most computer users will be familiar with data compression.  This was a godsend at the dawn of the internet age when internet connections were achieved via dial-up modems with bandwidths restricted to clockwork speeds.  The first ever document I received over the internet was a 1MB WordPerfect file, and the transmission took about three hours using a modem that set me back $280.  Even so, this was still a great thing, given that the file wouldn’t fit on a 5.25” floppy disk which could otherwise have been posted to me.  I didn’t have PKZip at that time, but eventually a colleague introduced me to it.  For many years thereafter nobody would ever consider sending an e-mail attachment without first “zipping” it.  Zipping a word processor file could reduce the file size by enormous factors of 5X or more.  Data compression was, and still is, a great thing.  In audio, formats like FLAC and Apple Lossless use data compression to reduce the size of an audio file without compromising its audio content.  By contrast, formats like MP3 and AAC go a step further and irretrievably delete some of the audio content to make the file smaller yet.  But dynamic compression is a different beast entirely. 

When an audio signal continues to increase in volume, at some point you will run into a limitation.  For example, beyond a certain (catastrophically loud) volume, air itself loses the ability to faithfully transmit a sound.  If you drive your loudspeaker with too many Watts, the drive units will self destruct.  If you feed your amplifier’s inputs with too large of a signal, its outputs will clip.  If you try to record too loud of a signal onto an analog tape, the tape will distort.  And if you try to encode too large of a signal in a digital format … well, you can’t get there from here, and you just have to encode something else instead - typically digital hard clipping.

Therefore, whether in today’s digital age or in the analog age of yore, anybody who is tasked with capturing and recording an analog signal has to be concerned with level matching.  If you turn the signal level up too high, you will encounter one of the previously mentioned problems (hopefully one of the last two).  If you turn it down too low, the sound will eventually descend into the noise and be lost.  However, analog tape had a built-in antidote.  It turns out that if you overload an analog tape, the overload is managed ‘gracefully’, which means that you could record at a level higher than the linear maximum and it wouldn’t sound too bad.

In fact, not only does it not sound too bad, but if you play back the resultant recording over a low-fidelity system like a radio or a boom box, it can actually sound better than a recording that properly preserves the full dynamic range.  This is because the dynamic range within a high quality recording is greater than the ability of the low-fi system to reproduce it, and the result can be a sound that appears to be quiet and lifeless.  By allowing the analog tape to saturate, the dynamic range of the recorded signal is effectively reduced (or ‘compressed’), and better matched to that of a low-fi system.  In fact, in all but the very finest systems, a little bit of dynamic compression is found by most people to be slightly preferable to none at all.  Which is a problem for those of us fortunate enough to enjoy the finest systems, whose revealing nature tends to deliver the opposite result.

With analog tape, managing dynamic compression through tape saturation is a finely balanced skill.  It is not something that you can easily bend to your design.  It’s sometimes considered to be more of an art than a science.  On the other hand, in the digital domain, dynamic compression can be tailored umpteen different ways according to your whim, and you can dial in just the right amount if you believe your recording needs it.  Most digital dynamic compression algorithms are seriously simple, being nothing more than a non-linear transfer function based on Quadratic, Cubic, Sinusoidal, Exponential, Hyperbolic tangent, or Reciprocal functions (to name but a few).  Ideally, the transfer function would remain linear up to a point, above which the non-linearity would progressively kick in, and the better regarded algorithms (such as the Cubic) do behave like that.  But most serious listeners agree that digital dynamic compression never sounds as good as ‘natural’ dynamic compression from magnetic tape.  Maybe this is one of the reasons analog still has its strong adherents.
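As an illustration of a transfer function that stays linear up to a point and then bends progressively, here is a minimal Python sketch. The particular cubic knee, the `threshold` parameter and the function names are assumptions chosen for illustration, not the curve used by any specific product. A plain hyperbolic tangent is included for comparison; note that it starts bending immediately rather than above a threshold.

```python
import math

def cubic_soft_clip(x, threshold=0.5):
    """Linear for |x| <= threshold, then a cubic knee that flattens out at +/-1.0.

    Illustrative only. Requires 0 <= threshold < 1.
    """
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    magnitude = abs(x)
    if magnitude <= threshold:
        return x                      # untouched: the linear region
    sign = 1.0 if x >= 0 else -1.0
    headroom = 1.0 - threshold
    # Knee width chosen so the slope is exactly 1 where the knee begins,
    # which avoids a kink at the threshold.
    knee_width = 1.5 * headroom
    u = min((magnitude - threshold) / knee_width, 1.0)
    # 1.5u - 0.5u^3 rises from 0 to 1 and arrives with zero slope at u = 1.
    return sign * (threshold + headroom * (1.5 * u - 0.5 * u ** 3))

def tanh_soft_clip(x):
    """Hyperbolic tangent curve: no linear region, approaches +/-1.0 asymptotically."""
    return math.tanh(x)

for level in (0.25, 0.5, 0.75, 1.0, 1.5):
    print(level, round(cubic_soft_clip(level), 4), round(tanh_soft_clip(level), 4))
```

Inputs below the threshold pass through unchanged, and everything above it is squeezed into the remaining headroom, which is the behaviour described above.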

The thing about digital dynamic compression is that, once it kicks in, its effect on the sound is rather drastic.  Harmonic distortion components at levels as high as -20dB are common.  Moreover, the technique can create substantial harmonic distortion components above the Nyquist frequency, which get mirrored down into the audio band where they appear as inharmonic frequencies which are subjectively a lot more discomforting than harmonic frequencies.  It also creates huge intermodulation distortion artifacts, also highly undesirable.
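The folding is easy to work out by hand, but a few lines of Python make it concrete. This is only the frequency bookkeeping, assuming the non-linearity is applied to an already-sampled signal with no oversampling; the function names are hypothetical.

```python
def alias_frequency(f_hz, sample_rate_hz):
    """Frequency at which a component at f_hz shows up in a sampled signal."""
    f = f_hz % sample_rate_hz
    # Anything above Nyquist is mirrored back down into the band.
    return sample_rate_hz - f if f > sample_rate_hz / 2 else f

def harmonic_aliases(fundamental_hz, sample_rate_hz, max_harmonic=7):
    """(harmonic number, apparent frequency) for harmonics 2..max_harmonic."""
    return [(n, alias_frequency(n * fundamental_hz, sample_rate_hz))
            for n in range(2, max_harmonic + 1)]

# A 5kHz tone distorted in a 44.1kHz system:
print(harmonic_aliases(5000, 44100))
# [(2, 10000), (3, 15000), (4, 20000), (5, 19100), (6, 14100), (7, 9100)]
```

The 5th, 6th and 7th harmonics land at 19.1kHz, 14.1kHz and 9.1kHz, none of which is a multiple of the 5kHz fundamental. Those are the inharmonic components referred to above.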

There are papers out there which do a very thorough job of analyzing what various dynamic compression systems, both real and theoretical, could do if they were implemented, and the conclusions they come to are pretty consistent.  Digital dynamic compression fundamentally sucks, and there’s not much you can do about it.  But having said that, if you have some understanding of how compression works, are willing to limit the amount of applied compression judiciously, and have sufficient computing power available, you can bring to bear a whole grab-bag of tricks to try to minimize the damage.  Such techniques include side-chain processing (where several analyses of the signal happen in parallel as inputs to the core compression tool), look-ahead (analysis of the future input signal, obviously not for real-time applications), advanced filtering (seeks to reduce unwanted distortions by filtering them out), and active attack/release control (governs the extent to which the sudden onset of compression is audible).  Sophisticated pro-audio tools can bring all these techniques - and more - to the party.
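To give a flavour of what attack/release control means in practice, here is a minimal sketch of a generic feed-forward compressor in Python. It is a textbook-style design under my own assumptions (a one-pole envelope follower, a hard knee, and illustrative default parameters), not the algorithm of any particular pro-audio tool, and it includes none of the look-ahead or filtering refinements.

```python
import math

def compress(samples, sample_rate_hz, threshold_db=-20.0, ratio=4.0,
             attack_ms=5.0, release_ms=100.0):
    """Reduce gain when the smoothed signal level exceeds threshold_db.

    All parameter names and defaults are illustrative assumptions.
    """
    # One-pole smoothing coefficients: closer to 1.0 means slower response.
    attack = math.exp(-1.0 / (sample_rate_hz * attack_ms / 1000.0))
    release = math.exp(-1.0 / (sample_rate_hz * release_ms / 1000.0))
    envelope = 0.0
    output = []
    for x in samples:
        level = abs(x)
        # Track rising levels with the attack time, falling levels with release.
        coeff = attack if level > envelope else release
        envelope = coeff * envelope + (1.0 - coeff) * level
        envelope_db = 20.0 * math.log10(max(envelope, 1e-9))
        over_db = max(envelope_db - threshold_db, 0.0)
        # Above threshold, every `ratio` dB of input yields 1 dB of output.
        gain_db = -over_db * (1.0 - 1.0 / ratio)
        output.append(x * 10.0 ** (gain_db / 20.0))
    return output

loud = compress([1.0] * 44100, 44100)
print(loud[0], round(loud[-1], 3))
```

The first sample passes at full level because the envelope has not yet caught up, and the gain then settles over the attack time. That lag is exactly what the attack setting trades off: too fast and the onset of compression distorts the waveform, too slow and transients sail through uncompressed.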

Dynamic compression as a serious issue of sound quality came to a head (or descended to its depths, depending on your viewpoint) during the early 2000’s with the so-called “loudness wars”.  The music industry was coming to terms with the notion that a lot of popular music was being listened to in MP3 format on portable players of limited fidelity.  While with their left hands they were trying their best to prevent the proliferation of music in the MP3 format, with their right hands they were recognizing that if music was going to be listened to on portable systems with restricted dynamic range it might sound better if the recordings themselves had a similarly restricted dynamic range.  It is a well known psychoacoustic effect that, when comparing two similar recordings, people overwhelmingly tend to perceive the louder one to be better, and dynamic compression is a way to increase the perceived loudness of a recording.  The labels therefore started falling over themselves to release recordings with more and more “loudness”, or put another way, with more and more dynamic compression.

Take U2’s “How to Dismantle an Atomic Bomb”, released in 2004.  This album is a downright disgrace.  It sounds absolutely appalling.  I bought it when it came out and haven’t listened seriously to it since.  And if there is any doubt as to why that might be, just take a look at the attached screenshot image.  These are waveform envelopes obtained using Adobe Audition.  The top track is “Vertigo” from this album.  The bottom track is “With or Without You” from their 1987 release The Joshua Tree.  Both are ripped from the standard commercial CD releases.  The difference is laughable.  You can clearly see how the one on the top has been driven deeply into dynamic compression.

To attempt to quantify this effect, the “Loudness War” website endorses a free tool called the Tischmeyer Technology (TT) Loudness Meter.  This measures Vertigo as DR5 which it classifies as “Bad” (DR0 - DR7), and With or Without You as DR12 which is in the “Transition” range (DR8 - DR13), but getting close to Good (which starts at DR14).  All else being equal, the higher the number the better the sound, but the numerical result is quite dependent on the program material.  Next time you play an album, see if it is listed on that site and check its rating.  If it isn’t listed, it is a simple job to download the free TT Loudness Meter tool, measure the album yourself, and upload the data.
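I won't attempt to reproduce the TT meter's algorithm here, but the general idea of comparing peaks with average level can be illustrated with plain crest factor (the peak-to-RMS ratio in dB). To be clear, this is a simplified stand-in of my own choosing and its numbers are not DR values.

```python
import math

def crest_factor_db(samples):
    """Peak-to-RMS ratio in dB. A rough proxy only, not the TT DR figure."""
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        raise ValueError("cannot measure the crest factor of silence")
    return 20.0 * math.log10(peak / rms)

# One cycle of a sine wave, and the same wave hard-clipped at half amplitude.
sine = [math.sin(2.0 * math.pi * k / 1000) for k in range(1000)]
clipped = [max(-0.5, min(0.5, s)) for s in sine]

print(round(crest_factor_db(sine), 2))
print(round(crest_factor_db(clipped), 2))
```

The clipped waveform measures lower than the clean one because its peaks have been pushed down towards its average level, which is the same effect the waveform envelopes show for heavily compressed tracks.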

And it isn’t just the music business that faces this issue.  Incredibly, I also encounter it in the ultra-low-fi world of the TV sound track.  Just when you thought plain old dynamic compression was bad enough, the more aggressive “loudness shaping” algorithms also heavily modulate the volume of the sound track, winding it up during “quiet” passages when there is no dialog, or even between breaths during the dialog itself.  This has the effect of raising the background noise to the same loudness level as the dialog itself - and you can plainly hear it winding up and down - making watching the TV show a most unpleasant experience.  For me, for example, it ruined the last season of “House”.  I can’t begin to imagine how bad a TV set would have to be for such measures to be remotely beneficial.

As a final observation, for the purists who like to work in DSD, there are a couple of important considerations to bear in mind.  The first is that, in native DSD mode, you simply cannot do any sort of signal processing whatsoever - not even something as trivial as volume control (fade-in/fade-out for example), let alone dynamic compression.  You have to convert to PCM to do that and then convert back to DSD, which most DSD purists find unacceptable.  The other interesting thing is in the Sigma-Delta Modulators which convert analog (or PCM digital) to DSD format, which warrants a discussion all of its own.

As you increase the signal level in these modulators the result is far from deterministic.  Overloading the modulator can make it go unstable in an unpredictable manner.  For that reason, the SACD standard requires the analog signal level encoded in DSD to be 6dB below the theoretical maximum that the format can support.  But interesting things happen if you over-drive the modulator.  Most contain special circuits or algorithms which detect the onset of instability and apply corrective measures.  This means that the modulators can normally accept inputs that exceed the supposed -6dB limit, with a penalty limited to a slight increase in distortion.  Keep pushing it further, though, and the modulator self-resets, resulting in an audible click.

In a sense, if you are a recording engineer, DSD is a bit like analog tape on steroids.  If your signal exceeds the -6dB limit then to a large degree you are going to be able to get away with it, unlike the situation with PCM digital, where the signal will either clip, or the dynamic compressor will cut in.  With DSD you get the ‘graceful’ overload of analog tape, but without the associated dynamic compression.  The result is probably the best of all worlds.  Interestingly, our DSD Master tool gives us an accurate view into whether or not the recording/mastering engineer has “pushed” the recording beyond the -6dB guideline, and you would be seriously surprised at the extent to which such behaviour appears to be the norm.

Friday, 19 June 2015

Dark Mode Icons

Is anybody out there experienced in designing menu bar icons for OS X Yosemite's 'Dark Mode'?  We are having a spot of trouble and need some sage advice.  E-mail me.