Wednesday, 26 August 2015

Phase Value

We already know that a digital waveform can be transformed, using a Fourier Transform, into a different representation where each data point represents a particular frequency, and the magnitude of the transform at that data point represents the amount of that frequency that is present in the original signal.

This is interesting, because we humans are able to perceive both of these aspects of a sound’s frequency content.  If the frequency itself changes - increases or decreases - we perceive the pitch to go up or down.  And if the magnitude changes - increases or decreases - we perceive the volume to get louder or quieter.  Between them, these two things would appear to totally define how we perceive (or, if you prefer, “hear”) audio signals.  Interestingly enough, a physical analysis of how the human hearing system actually works suggests that it is those separate individual frequencies, rather than the waveform itself in its full complexity, that our ears respond to.

If we take all the frequencies in the Fourier Transform and create a sine wave from each one, whose magnitude is the magnitude of the Fourier Transform, and add them all together, the sum total of all these sine waves will be the exact original waveform.  But there are a couple of wrinkles to bear in mind.  The first is that this is only strictly true if the original waveform used to create the Fourier Transform was of infinite duration, producing a Fourier Transform with an infinite number of frequencies.  For the purposes of this post we can safely ignore that limitation.  The second is that we need to know the relative phase of each frequency component.

I wrote in a previous post how we can decompose a square wave into its constituent frequency components and use those to reconstruct the square wave.  However, if we change the phase of these individual frequency components - which describes how the individual sine waves “line up” against each other - then we end up changing the shape of the original square wave.  Indeed, the change can be rather dramatic.  In other words, changing the phases of a waveform’s component frequencies can significantly alter the waveform’s shape without changing any of its component frequencies or their magnitudes.  To a first approximation, changes in the phase response of an audio system are considered not to be audible.  However, at the bleeding edge where audiophiles live that is not so clear.
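To make the square wave point concrete, here is a minimal sketch in Python.  It assumes the textbook Fourier series for a square wave (odd harmonics with amplitude 4/(πk)), truncated at the 19th harmonic; the function names and the choice of a 90 degree shift are mine, purely for illustration.  Both waveforms contain exactly the same frequencies at exactly the same magnitudes, and only the phases differ.

```python
import math

def harmonic_sum(theta, phase_offset, max_harmonic=19):
    """Sum the odd harmonics of a square wave, each shifted by phase_offset radians."""
    return sum(
        (4 / (math.pi * k)) * math.sin(k * theta + phase_offset)
        for k in range(1, max_harmonic + 1, 2)
    )

def peak(wave):
    return max(abs(v) for v in wave)

def rms(wave):
    return math.sqrt(sum(v * v for v in wave) / len(wave))

N = 1000
thetas = [2 * math.pi * n / N for n in range(N)]  # one full period

square_like = [harmonic_sum(t, 0.0) for t in thetas]            # original phases
phase_shifted = [harmonic_sum(t, math.pi / 2) for t in thetas]  # every harmonic moved by 90 degrees

print("peak, original phases:", round(peak(square_like), 3))
print("peak, shifted phases: ", round(peak(phase_shifted), 3))
print("rms,  original phases:", round(rms(square_like), 6))
print("rms,  shifted phases: ", round(rms(phase_shifted), 6))
```

The two RMS values agree, because the energy in each harmonic is untouched, but the peak value of the shifted version is more than double that of the square-like version.  The shape has changed substantially even though a magnitude-only spectrum analysis would call the two signals identical.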

The Fourier Transform I mentioned in fact encodes both the magnitude and the phase information because the transformation actually produces complex numbers (numbers having two components which we term Real and Imaginary).  We can massage these two components to yield both the phase and the magnitude: the magnitude is the square root of the sum of their squares, and the phase is the angle whose tangent is the Imaginary part divided by the Real part.  This is one example of how the phase and frequency responses of an audio system are tightly intertwined.
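As a rough illustration, the sketch below computes a naive Discrete Fourier Transform in plain Python, extracts the magnitude and phase of every bin from the complex result, and then rebuilds the waveform by summing one cosine per bin.  The test signal (a single cosine sitting exactly on bin 3 with a phase of 0.5 radians) and all of the names are my own choices for the example, and a real application would use an FFT library rather than this slow direct sum.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) Discrete Fourier Transform of a real or complex sequence."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

N = 32
# Hypothetical test signal: one cosine on bin 3, amplitude 1, phase 0.5 rad.
signal = [math.cos(2 * math.pi * 3 * n / N + 0.5) for n in range(N)]

spectrum = dft(signal)
magnitudes = [abs(X) for X in spectrum]        # sqrt(Real^2 + Imaginary^2)
phases = [cmath.phase(X) for X in spectrum]    # atan2(Imaginary, Real)

# Rebuild the waveform as a sum of cosines, one per bin, each with its own
# magnitude and phase (the 1/N factor is the usual inverse-DFT scaling).
rebuilt = [
    sum(magnitudes[k] * math.cos(2 * math.pi * k * n / N + phases[k]) for k in range(N)) / N
    for n in range(N)
]

print("bin 3 magnitude:", round(magnitudes[3], 6))
print("bin 3 phase:    ", round(phases[3], 6))
print("max rebuild error:", max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

Throw away the phases (set them all to zero, say) and the rebuilt waveform no longer matches the original, which is the point made above about needing the relative phase of each component.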

We are used to demanding that anything which affects an audio system has a frequency response that meets our objectives.  This applies equally in the analog domain (whether we apply it to circuits such as amplifiers or components such as transistors) as in the digital domain (where we can apply it to simple filters or elaborate MP3 encoders).  We are familiar with the common requirement for flat frequency response across the audio bandwidth because we know that we can “hear” these frequencies clearly.  But all of those systems, analog and digital, also have an associated phase response.

Some types of phase response are quite trivial.  For example, if the phase response is linear, which means that the phase is proportional to frequency, this means simply that the signal has been delayed by a fixed amount of time.  More generally if we look at the phase response plot (phase vs frequency), the slope of the line at any frequency tells us how much that frequency is delayed by.  Clearly, if the slope is constant, all frequencies will be delayed by the same amount, and the effect will be a fixed delay applied to the entire signal.  However, if the slope varies with frequency, it means that different delays apply to different frequencies and the result will be a degree of waveform distortion as discussed regarding the square wave.
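The link between phase slope and delay can be checked numerically.  The sketch below (plain Python, with names and sizes I have picked for illustration) takes an impulse delayed by 3 samples, computes its transform, and recovers the delay at each frequency from the phase difference between neighbouring bins.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) Discrete Fourier Transform."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

N = 64
DELAY = 3  # samples

# A unit impulse that arrives DELAY samples late: a pure delay and nothing else.
impulse_delayed = [1.0 if n == DELAY else 0.0 for n in range(N)]
spectrum = dft(impulse_delayed)

# Frequency spacing between adjacent bins, in radians per sample.
bin_spacing = 2 * math.pi / N

# Delay at each frequency = minus the slope of the phase plot.  Taking the
# phase of the ratio of neighbouring bins avoids phase-wrapping problems.
delays = [
    -cmath.phase(spectrum[k + 1] / spectrum[k]) / bin_spacing
    for k in range(N // 2)
]

print([round(d, 6) for d in delays[:5]])
```

Every entry comes out as 3 samples: the phase plot is a straight line, its slope is the same everywhere, and so every frequency is delayed by the same amount.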

So, we have clear ideas about errors in the magnitude of the frequency response.  We classify these as dips, humps, roll-offs, etc, in the frequency response, and we have expectations as to how we expect these defects to sound, plus a reasonably well-cultivated language with which to describe those sounds.  But we are still trying to develop an equivalent understanding of phase responses.

One development I don’t like is to focus on the impulse response, and to ascribe features of the impulse response to corresponding qualities in the output waveform.  So, for example, pre-ringing in the impulse response is imagined to give rise to “pre-ringing” in the output waveform, which is presumed to be a BAD THING.  This loses sight of a simple truth.  If you mathematically analyze a pure perfect square wave and remove all of its components above a certain frequency, what you get is pre-ringing before each step, and post-ringing after it.  We’re not talking about a filter here, we’re talking about what the waveform inherently looks like if its high frequency components were absent, which they need to be if we are going to encode it digitally.
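Here is a small sketch of that simple truth, again assuming the textbook Fourier series for a square wave and keeping only the harmonics up to the 39th (an arbitrary cut-off chosen for the example).  No filter is applied anywhere.  We simply evaluate the band-limited waveform just before and just after the step at zero.

```python
import math

def band_limited_square(theta, max_harmonic):
    """Square wave (step from -1 to +1 at theta = 0) with harmonics above max_harmonic absent."""
    return sum(
        (4 / (math.pi * k)) * math.sin(k * theta)
        for k in range(1, max_harmonic + 1, 2)
    )

MAX_HARMONIC = 39

# Sample points in the quarter period leading up to the step (theta < 0) ...
thetas_before = [-math.pi * n / 2000 for n in range(1, 1000)]
before = [band_limited_square(t, MAX_HARMONIC) for t in thetas_before]
# ... and the mirror-image points in the quarter period following it.
after = [band_limited_square(-t, MAX_HARMONIC) for t in thetas_before]

undershoot = min(before)  # ringing before the step dips below -1
overshoot = max(after)    # ringing after the step rises above +1

print("lowest value before the step:", round(undershoot, 4))
print("highest value after the step:", round(overshoot, 4))
```

The ripple before the step is the exact mirror image of the ripple after it, overshooting the nominal level by roughly 18% on each side.  That is just what the waveform looks like once its high-frequency components are gone.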

You might argue that a perfect phase response would be a zero-phase response, where there is no phase error whatsoever at each and every frequency.  Such characteristics cannot be achieved at all in the analog domain, but in the digital domain there are various ways of accomplishing it.  However, it can be shown mathematically that all zero-phase filters must have a symmetrical impulse response.  In other words, whatever post-ring your filter has, it will have the exact same pre-ring before the impulse.  This, by the way, is another way of describing what happened to the pure perfect square wave.
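The relationship between symmetry and zero phase is easy to check.  In the sketch below the filter coefficients are hypothetical (a simple binomial smoothing kernel that I picked because its response never goes negative), indexed so that the centre tap sits at time zero.  A symmetric set of taps gives a frequency response with no imaginary part at any frequency, which is to say no phase shift.  A one-sided set of taps does not.

```python
import cmath
import math

def frequency_response(taps, omega):
    """Response at omega (radians/sample) of an FIR filter given as {time index: coefficient}."""
    return sum(h * cmath.exp(-1j * omega * n) for n, h in taps.items())

# Symmetric about n = 0: the filter uses as many "future" samples as past ones,
# which is why it is only practical in the digital domain on buffered data.
symmetric_taps = {-2: 1 / 16, -1: 4 / 16, 0: 6 / 16, 1: 4 / 16, 2: 1 / 16}

# A one-sided (causal, asymmetric) filter for comparison.
one_sided_taps = {0: 0.5, 1: 0.3, 2: 0.2}

omegas = [math.pi * i / 100 for i in range(101)]  # 0 .. Nyquist

symmetric_response = [frequency_response(symmetric_taps, w) for w in omegas]
one_sided_response = [frequency_response(one_sided_taps, w) for w in omegas]

max_imag_symmetric = max(abs(H.imag) for H in symmetric_response)
max_imag_one_sided = max(abs(H.imag) for H in one_sided_response)

print("largest imaginary part, symmetric taps:", max_imag_symmetric)
print("largest imaginary part, one-sided taps:", round(max_imag_one_sided, 4))
```

The symmetric taps are also, by construction, an impulse response with identical pre-ring and post-ring.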

Another impulse response characteristic that gets a lot of favourable press is the Minimum Phase filter.  This is a misleading title because, although it does mathematically minimize the net phase error, it lacks a theoretical basis upon which to suppose a monotonic relationship exists between the accumulated net phase error and an observed deterioration in the sound quality.  For example, linear phase filters exhibiting no waveform distortion can in principle have significantly different fixed delays, with corresponding significant differences in their net phase error, yet with no difference whatsoever in the fidelity of their output signals.  On the other hand, Minimum Phase filters do concentrate the filter’s “energy” as much as possible into the “early” part of its impulse response, which can mean that it is more mathematically “efficient”, which may make for either a better-designed filter, or a more accurate implementation of the filter’s design (sorry for the “air quotes”, but this is a topic that could take up a whole post of its own).

One thing I must be clear on is that this discussion is purely a technical one.  I discuss the technical properties of phase and impulse responses, but I don’t hold up a hand and claim that one thing is better than the other.  Someone may state an opinion that such-and-such a filter sounds better than so-and-so’s filter because it has a “better” impulse response.  I might agree or disagree with the opinion regarding which filter sounds best, but I will argue against attributing the finding to certain properties of the impulse response without a good model to account for why the properties advocated should be beneficial.  As regards the impulse responses no such “good” model yet exists (that I know of).

Where I do stand from a philosophical standpoint is that I like zero-phase responses and linear phase responses because these contribute no waveform distortion at the output.  For that reason, we are, here at BitPerfect, developing a zero-phase DSP engine that, if successful, we will be able to apply quite broadly.  We will try it out first in our DSD Master DSD-to-PCM conversion engine, where I am convinced that it will provide PCM conversions that are, finally, indistinguishable from the DSD originals.  If listening tests prove us out, we will release it.  From there it will migrate to SRC, where I believe it will deliver an SRC solution superior to the industry-leading Izotope product (which is too expensive for us to use cost-effectively).  Finally, it will appear in our new design for a seriously good graphical equalizer package that is in early-stage development, with possible application to room-correction technology.

Thursday, 13 August 2015

Audio Files for Audiophiles

A few years back I purchased a Windows App called dBpoweramp.  It met my needs for a while.  Upon installation, I learned that the App supports a huge number of different music file formats.  Today, that list reads:  AIFF, ALAC, CDA, FLAC, MP3, WAV, AC3, AAC, SND, DFF, DSF, MID, APE, MPP, OGG, OPUS, WVC, W64, WMA, OFR, RA, SHN, SPX, TTA, plus a number of variants.  Who knew there were so many audio formats?  I for one have never heard of most of these.  Counting through them, I have only ever used eight of ’em, and of the rest I have only ever come across three.  Well, good for dBpoweramp!  I can sleep comfortably knowing that if I ever want to convert a TTA file to OFR I probably have just the tool for the job.

Music file formats arise to fill a need, and each and every one of those file formats I mentioned represents a need which went unmet at the time the format was devised.  Actually, I even invented an audio file format of my own, way back in 1979.  In my lab at work I had a Commodore Pet computer which was attached to an X-Y graphic printer.  I used the Pet to control a laser test apparatus and had the printer output the results graphically.  As the printer’s two stepper motors (one for each axis) drove the pen holder across the paper, the tone of each motor would sound a certain note.  By having the printer draw out a certain pattern I could get it to play “God Save the Queen”.  Not very imaginative, I agree, but it was quite a party trick in its day.  I then wrote a program that would allow you to compose a tune which you could then play on the printer.  Finally, I devised a simple format with which to store those instructions in a file which the Commodore Pet saved on its audio-cassette tape drive.  I could conceivably claim to have developed one of the world’s first audio file formats!  Looking back, the Zeitgeist was quite delicious - a computer audio file stored in digital form on an analog audio cassette tape.

But back to the myriad file formats supported by dBpoweramp.  Each one has a purpose, and I suppose not all of those involve the distribution of music for commercial or recreational purposes.  For what it’s worth, the developers of iTunes could have arranged for it to support all of these weird and wonderful file formats too, but they didn’t.  In some cases there are good technical reasons why they would elect not to support a particular file type.  In others it is a matter of choice.  Some of those formats are Audio-Video formats, and iTunes is, after all, a multi-media platform.  But for the purposes of this post I am going to constrain the discussion to audio-only playback.

Not just the developers of iTunes, but every developer who writes an audio playback App has to decide for themselves which of those (and, perhaps others too) file formats their App is going to support.  I am going to break these formats down into four camps - Uncompressed, Lossless Compressed, Lossy Compressed, and DSD.  Let’s look at each one, and discuss how they handle the audio data.

The simplest audio file formats contain uncompressed audio data.  The actual audio data itself is written straight into the file.  It is not manipulated or massaged in any way.  The advantage of doing it this way is that the audio data can be both written and read with the minimum of fuss.  The two most commonly used examples of this type of file format are AIFF (released by Apple in 1988) and WAV (released by Microsoft in 1991).  iTunes will happily load either file type.

Back in those days the file size of an AIFF or WAV file was utterly prohibitive.  A five-minute track ripped from a CD would require a file size of 53MB which represented something like three times the capacity of a good-sized hard disk drive at that time.  Clearly, if computers were going to be able to handle digital audio something needed to be done to reduce the file size.  To address this problem, during the early 1990s the Fraunhofer Institute in Germany developed what we now call the MP3 file format.  What this does is, effectively, to figure out which parts of an audio signal are the least audible and throw them away.  By throwing away more and more of the audio signal the file size can be reduced rather dramatically.  This approach is referred to as Lossy Compression, because it compresses the file size but loses data (and therefore sound quality) along the way.
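For anyone who wants to check the 53MB figure, the arithmetic is short.  The sketch below assumes standard CD audio parameters (44.1kHz sample rate, two channels, 16 bits per sample) and ignores the small file header.

```python
SAMPLE_RATE_HZ = 44_100   # CD sample rate
CHANNELS = 2              # stereo
BYTES_PER_SAMPLE = 2      # 16 bits
DURATION_S = 5 * 60       # a five-minute track

audio_bytes = SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE * DURATION_S
print(audio_bytes, "bytes, or about", round(audio_bytes / 1_000_000), "MB")
```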

The first MP3 codec was released in 1995.  In 1997 Apple introduced their own version of MP3 called AAC.  Structurally, AAC is very similar to MP3 but has some significant differences aimed at improving the subjective audio quality.  However, each format requires a separate codec to be able to read it.

By the end of the decade the combination of the MP3 codec and the ready availability of hard discs with capacities exceeding 100MB had ushered in the age of computer audio.  As always, there was a fringe element who still preferred the improved sound quality of uncompressed WAV and AIFF files, but who were still troubled by the enormous file sizes.  Programs like PKZip proved that ordinary computer files could be compressed to a smaller file size and subsequently regenerated in their exact original form.  However, PKZip did not do a very good job of reducing the file size of audio files.  A dedicated lossless compressor was needed, one specifically optimized around the characteristics of audio data.  In 2001 the first FLAC format specification was released.  The FLAC codec could produce compressed files that are approximately 50% of the size of the original WAV or AIFF file.  In 2004 Apple introduced their own lossless compression format ALAC (or Apple Lossless).

In 1999, Sony and Philips tried and failed to launch the SACD format as a successor to the ubiquitous CD.  SACD uses a radically different form of audio encoding called DSD.  Ultimately, the SACD launch flopped, although the format has never actually gone away, and the DSD format acquired its own band of loyal followers.  The developers of SACD each developed a file format that could handle DSD data - the DFF format developed by Philips, and the DSF format developed by Sony.  By 2011, DSD enthusiasts had demonstrated the ability to manage DFF and DSF files on their computers, and to transmit DSD data to a DAC, and the first DSD-compatible DACs trickled onto the market.  Consumer-level DSD recording equipment is also now available, and produces output files in either DSF or DFF format - bizarrely, they rarely offer a choice of formats.

Today, although other file formats do persist, the computer audio market has more or less settled down to four format types, with two competing format offerings for each type.  AIFF (Apple) and WAV (everybody else) for uncompressed audio; ALAC (Apple) and FLAC (everybody else) for lossless compression; and AAC (Apple) and MP3 (everybody else) for lossy compression.  DSF and DFF continue to duke it out in the DSD world.  Note that, except for DSD which Apple does not support in any form, the formats have shaken down into pairs of Apple and everybody else.  Why is this?

Frankly, there is absolutely no reason why any software player should not be able to support all of these file formats.  The process of reading (or writing) any of them is quite straightforward.  Yet, Apple originally refused to support WAV and MP3 formats in its iTunes software and iPod players, instead requiring users to use its own AIFF and AAC formats.  In fact, to this day Apple products continue to refuse to support FLAC files, instead requiring its customers to use ALAC.  From a functionality viewpoint none of this really matters.  ALAC and FLAC can be seamlessly transformed from one to the other and back again using high quality free software (as can AIFF and WAV, AAC and MP3).  But this is not what customers want.  So why is it that Apple takes this unhelpful stance?

The reason is simple.  From a business perspective, Apple’s entire iTunes ecosystem exists not to provide you with a platform on which to manage and play your music, but as a platform to sell you the music that you listen to.  Apple’s business model is for you to buy your music from them rather than from anybody else.  Therefore when you buy music from the iTunes Store it comes in AAC format only and not in MP3 or FLAC.  But if you buy your music virtually anywhere else, it only comes in the MP3 and FLAC formats.  Virtually nobody outside of Apple is interested in selling AAC or ALAC files.

But why bother in the first place?  Apple isn’t actually selling any ALAC files on its iTunes Store, so you have to wonder what their thinking is.  Do they consider that they are motivating me to buy AAC files from Apple instead of FLAC files from someone else?  Really?  Hey, maybe they’re right.  Maybe that’s exactly what we do.  It has also been suggested that Apple is scared of becoming the target of a patent troll if they start offering FLAC support, but that seems to be an even more feeble explanation.  Google have been supporting FLAC in Android for some time now, and have not attracted any trolls’ attention that we know of.  In any case, nobody is sure what patents FLAC might possibly be infringing, given that it is all open-source.  But given the size of Big Apple, they would certainly make for a tasty target.

Interestingly, with the overwhelming consumer embrace of MP3, Apple realized very early on that if they were going to continue refusing to support MP3 they could risk losing out on the whole mobile music opportunity to one of the competing platforms such as Rio, Zune, Nomad/Zen and others.  Deciding to support MP3 was a key tactical business decision that took the air out of their competitors’ sails and ultimately paved the way for the total dominance of iPod and iTunes.  Today, despite the overwhelming consumer embrace of FLAC, there is no such pressure on Apple to encourage them towards supporting FLAC.

At one time there was an App called Fluke which allowed users to import FLAC files into iTunes.  Unfortunately, that loophole relied on a 32-bit OS kernel, and as a result Fluke no longer works with OS X 10.7 (Lion) and up.  Just to be clear, there are absolutely no technical reasons whatsoever that prevent Apple from supporting FLAC files.  It would be a trivial move for them to make, if they wanted to.  Their refusal to support FLAC is entirely a tactical decision on their part.

The situation with DSD is significantly different.  OS X and iOS are both fundamentally incapable of supporting DSD.  It would require significant changes to the way their audio subsystems work in order for that to happen, and, being honest, I see some fundamental issues that they would face if they ever considered doing that.  Consequently, I don’t see DSD being supported by Apple in any form for the foreseeable future.  The way the audio industry has got around that is with the DoP data transmission format.  This dresses up native DSD data so that it looks like PCM, which OS X can then be fooled into sending to your DAC, but it means that any Mac Apps which support DSD would have to be extremely careful how they went about it.  BitPerfect, for example, can do that, and iTunes can’t.  This is different from the situation with FLAC files.  Whereas iTunes would have no problems reading a FLAC file if Apple chose to let it, it would have absolutely no idea what to make of a DSD file.  You might as well ask it to load an Excel spreadsheet.
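To give a flavour of how DoP dresses DSD up as PCM, here is a minimal single-channel sketch.  It is based on my understanding of the published DoP convention, in which each 24-bit PCM word carries 16 DSD bits in its lower two bytes and a marker byte in its top byte, with the marker alternating between 0x05 and 0xFA from one word to the next.  Treat the marker values, the bit ordering, and the function name as assumptions for illustration, not as a reference implementation.

```python
# Assumed DoP marker bytes, alternating word by word.
DOP_MARKERS = (0x05, 0xFA)

def pack_dop(dsd_bytes):
    """Pack one channel of DSD data (8 one-bit samples per byte) into 24-bit DoP words.

    Each output word is an int laid out as: marker byte | first DSD byte | second DSD byte.
    """
    if len(dsd_bytes) % 2 != 0:
        raise ValueError("DoP carries DSD data 16 bits at a time")
    words = []
    for i in range(0, len(dsd_bytes), 2):
        marker = DOP_MARKERS[(i // 2) % 2]
        words.append((marker << 16) | (dsd_bytes[i] << 8) | dsd_bytes[i + 1])
    return words

dop_words = pack_dop(bytes([0xAA, 0x55, 0xFF, 0x00]))
print([f"{w:06X}" for w in dop_words])
```

To the operating system these are just 24-bit PCM samples.  A DoP-aware DAC looks for the alternating marker pattern in the top byte and, when it finds it, treats the remaining 16 bits as DSD.  That is why the player has to be so careful: any volume scaling, mixing, or sample rate conversion along the way would corrupt both the markers and the payload.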

In order for BitPerfect to manage DSD playback, we have created what we call the Hybrid-DSD file format.  Hybrid-DSD files are ALAC files that iTunes recognizes, and can import and play normally.  However they also contain the native DSD audio data as a sort of “trojan horse” payload.  If iTunes plays a Hybrid-DSD file it plays the ordinary ALAC content.  But if BitPerfect plays the file it plays the DSD content.  We really like that system.  Other software players have instead adopted the idea of a “proxy” file.  This is a similar thing, but instead of containing ordinary ALAC music plus the DSD payload, they contain no music and include information that enables the playback software to locate the original DSF or DFF file.  Some may like the proxy file format, indeed some may prefer it, but we don’t, and this isn’t the place to discuss that.

It has often been suggested that BitPerfect could adopt a mechanism similar to either the Hybrid-DSD file or the proxy file to import FLAC files into iTunes.  And yes, we could do that.  But frankly, we believe the proper solution to that problem is to simply transcode the FLAC files to ALAC using a free App such as XLD.  It is simple and effective, and the ALAC files can just as easily be transcoded back into FLAC form if needed.

The final topic I want to cover in this post is Digital Rights Management (DRM).  This is a method by which the audio content in the file is encrypted in such a way as to prevent someone who does not “own” an audio file from playing it.  In other words, it is an anti-piracy technique.  Files containing DRM are pretty much indistinguishable from files that do not contain it, and most audio file formats support the inclusion of DRM (I am given to understand that FLAC does not, but I am not 100% sure).  For example, Apple included DRM in almost all of the music downloads sold on iTunes between 2004 and 2009.   

DRM is something that tends to get forced on the distributors (i.e. the iTunes Store) by content providers (i.e. the record labels), and is a major inconvenience for absolutely everybody involved in the playback chain.  Between 2004 and 2009 Apple had grown to hold sufficient clout that they could dictate to the content providers their intention to discontinue supporting DRM.  Today, DRM is a non-factor, although the new Apple Music service, plus TIDAL, and other streaming-based services which offer off-line storage are now re-introducing it.  The advance and retreat of DRM is an interesting barometer of who has the upper hand at any time in the music business between the distributors and the content providers.

Wednesday, 12 August 2015

Sigma-Delta Modulators - Part II.

Yesterday, we saw how an SDM can be used to faithfully reconstruct an incoming signal, even if the output is constrained to an apparently hopelessly reduced bit depth.  We do this by ensuring that the Signal Transfer Function (STF) and Noise Transfer Function (NTF) have appropriate characteristics.  This, of course, is a lot harder to achieve than you might have concluded from the expansive tone of yesterday’s post, which we concluded with the open question of how to design an appropriate loop filter.

Addressing those issues remains at the bleeding edge of today’s digital audio technology.  The best approach to understanding the design of an SDM remains the “Linear Model” I alluded to yesterday, where we treat the quantization error introduced at the quantizer stage as a noise source.  This model ought to be as accurate as its limiting assumption, which is that the quantization error is well represented by a noise source.  Unfortunately, the results don’t appear to bear that out.  According to this model, relatively simple SDMs should exhibit stunningly good performance, where in reality they do not.  In fact they fall very substantially short of the mark.  Clearly, the noise source is not as good a substitute for the quantization error as we thought.  Furthermore the reasons why are not clear, and we don’t have a better candidate available.

In the absence of a good guiding model, SDM designers stick to an empirical methodology based on the well-known “suck-it-and-see” approach.  The most successful approach is based on increasing the “order” of the modulator.  The simple SDM I described yesterday has a single Sigma stage, and is called a “first order” SDM.  If we simply add a second Sigma stage we get a “second order” SDM.  We can add as many Sigma stages as we like, and however many we add, that’s the “order” of the SDM.  The higher the “order” of the SDM, the better its performance ought to be.  I make that sound so much easier than it actually is, particularly when it comes to the task of fine-tuning the SDM’s noise-shaping (or the NTF if you like) performance.
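For reference, a first-order SDM can be sketched in a few lines.  This is the generic textbook structure (one integrator acting as the Sigma stage, a 1-bit quantizer, and feedback of the previous output), written by way of illustration and not necessarily identical in detail to the one described yesterday.  The function name and test level are mine.

```python
def first_order_sdm(samples):
    """1-bit first-order sigma-delta modulator; inputs are expected to lie within -1..+1."""
    integrator = 0.0
    previous_output = 0.0
    bits = []
    for s in samples:
        # "Delta": the difference between the input and the last output.
        # "Sigma": the running sum of those differences.
        integrator += s - previous_output
        # 1-bit quantizer: only two output levels are available.
        previous_output = 1.0 if integrator >= 0.0 else -1.0
        bits.append(previous_output)
    return bits

dc_level = 0.3
bits = first_order_sdm([dc_level] * 10_000)
average = sum(bits) / len(bits)

print("first 16 output bits:", [int(b) for b in bits[:16]])
print("average of the output:", round(average, 4))
```

Every output sample is either +1 or -1, yet the average of the stream settles on the input level.  A higher-order modulator chains further integrators ahead of the quantizer, and that is where the stability and tuning headaches begin.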

In practice, real-world SDM designs run into problems.  Lots of them.  The first of these is overload.  If the signal fed into the quantizer overloads the quantizer then the SDM will go unstable.  Overload is a problem for any PCM representation too - if the signal level is too high, then the PCM format, due to its fixed bit depth, will not have an available level with which to represent the signal, and something has to give (typically, a simple PCM encoder will allow the signal to hard clip).  In an SDM, because a Sigma modulator is in fact a very simple IIR filter, the result of such an overload will reverberate within the output of the SDM for a very considerable time.

The second problem is that high-order digital filters can themselves be rather unstable, not so much because of any inherent instability, but generally because of CPU truncation errors in the processing and execution of the filter.  Proper filter design tools can identify and optimize for these errors, but can never make them go away entirely.  Unstable filters can cause all sorts of problems in SDMs, from the addition of noise and distortion to total malfunction.

The third problem is that SDMs are found to have any number of unexpected error or fault loops in which they can find themselves trapped, which are not yet adequately explained or predicted by any theoretical treatment.  These include phenomena known as “limit cycles”, “birdies”, “idle tones” and others.  They can be astonishingly difficult to detect, or even to describe, let alone to design around.
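The simplest possible illustration of a modulator locking into a repeating pattern uses a generic textbook first-order SDM (my own sketch, with hypothetical names) fed with a constant input.  Real limit cycles and idle tones in high-order modulators are far more subtle than this, so take it only as a demonstration that the output of an SDM can be periodic even when the input contains no periodic content at all.

```python
def first_order_sdm(samples):
    """1-bit first-order sigma-delta modulator (generic textbook structure)."""
    integrator = 0.0
    previous_output = 0.0
    bits = []
    for s in samples:
        integrator += s - previous_output
        previous_output = 1.0 if integrator >= 0.0 else -1.0
        bits.append(previous_output)
    return bits

# A constant (DC) input of 0.5: nothing in the input repeats or oscillates.
bits = first_order_sdm([0.5] * 64)

print([int(b) for b in bits[:16]])

# The output repeats exactly every 4 samples, i.e. it contains a tone at
# one quarter of the sample rate that was never present in the input.
repeats_every_four = all(bits[n] == bits[n + 4] for n in range(len(bits) - 4))
print("output repeats every 4 samples:", repeats_every_four)
```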

Real-world high performance SDMs for DSD applications are typically between 5th and 10th order.  Below 5th order the performance is inadequate, and above 10th order they are rarely sufficiently stable.  The professional audio product Weiss Saracon, for example, contains a choice of loop filters in its SDM, having orders 6, 8, and 10.  Each loop filter produces a DSD output file with subtly different sonic characteristics, differences which many well-tuned audiophile ears can reliably detect.  And, as with religion, the fact that there are several of them from which to choose doesn’t guarantee that one of them is correct!

Interestingly enough, one of those limitations can be readily made to go away.  The problem of overloads can be entirely eliminated by using a multi-bit quantizer.  This approach is used in almost all commercial ADCs which use an analog SDM in the input stage, configured to provide a 3-bit to 5-bit intermediate result.  This intermediate result is then converted in the digital domain to the desired output format, whether PCM or DSD.  Likewise, almost all commercial DACs employ a digital SDM in the input stage, configured to provide a 3-bit to 5-bit intermediate result which is then converted to analog using a 3-bit to 5-bit R-2R ladder.  SDMs are therefore deeply involved at both ends of the PCM audio chain, though they mostly don’t use the 1-bit bit depth of DSD (or, for that matter, its 2.8MHz sample rate).  When you listen to PCM, you cannot escape the fact that you are listening to SDMs.

The key takeaway from the study of SDMs is that while their performance can indeed be extremely good, the current state-of-the-art does not permit us to quantify that performance on an a priori basis to a high degree of accuracy.  Instead, SDMs must be evaluated phenomenologically.  In other words we must carefully measure their characteristics - linearity, distortion, noise, dynamic range, phase response, etc.  In this regard, SDMs are very much like analog electronic devices such as amplifiers.  We can bring a lot of design intelligence to bear, but at the end of the day those designs cannot tell us all we need to know about their performance, and the skill of the designer (not to mention the keen ear of the person making the final voicing decisions) becomes the critical differentiating factor.

At this point I promised to conclude by touching on some of the differences between DSD and PCM formats.  Much has been written about this, and it can tend to confuse and obfuscate.  Frankly, I'm not so sure this will help much.  On one hand, with a PCM data stream, the specific purpose of every single bit in the context of the encoded signal is clear and unambiguous.  Each bit is a known part of a digital word, and each word stipulates the exact magnitude of the encoded signal at a known instant in time.  The format responds to random access, by which I mean that if we want to know the exact magnitude of the encoded signal at some stipulated moment in time, we can go right in there and grab it.  Of course, when we say “exact” we understand that to be limited by the bit depth of the PCM word.

The situation with SDM bitstreams is slightly different, and I will illustrate this with the extreme example of a DSD 1-bit bitstream.  On one level, we can see the DSD bitstream as being exactly identical to what I have just described.  Each bit is a known part of a digital word, except that in this case the single bit comprises the entire word!  This word then represents the exact magnitude of the encoded signal at a known instant in time - but this time to a resolution of only 1-bit.  That is because the DSD bitstream has encoded not only the signal, but also the heavy dose of shaped noise that we have been describing in noxious detail.  That noise gets in the way of our ability to interpret an individual word in the light of the original encoded signal.  By examining one word in isolation we cannot determine how much of it is signal and how much is noise.

If we want to extract the original signal from the DSD bitstream, we must pass the entire bitstream through a filter which will eliminate the noise.  And because we have already stipulated that the SDM is capable of encoding the original signal with a very high degree of fidelity, it stands to reason that we will require a bit depth much greater than 1-bit to store the result of doing so.  In effect, by passing the DSD bitstream through a low-pass filter, we end up converting it to PCM.  This is how DSD-to-PCM conversion is done.  You simply pass it through a low-pass filter.  The quality of the resultant PCM representation can be very close to a perfect copy of the original signal component in the DSD file.  It will be limited only by the accuracy of the low-pass filter used.
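As a toy demonstration of the principle, the sketch below encodes a slow sine wave with a generic first-order 1-bit modulator and then recovers it by averaging the bitstream in blocks of 64.  A block average is about the crudest low-pass filter there is, and a first-order modulator is nowhere near DSD quality, so the numbers mean nothing in audio terms.  The modulator, the filter, and all of the names are assumptions of mine for the sake of the example.

```python
import math

def first_order_sdm(samples):
    """1-bit first-order sigma-delta modulator (generic textbook structure)."""
    integrator = 0.0
    previous_output = 0.0
    bits = []
    for s in samples:
        integrator += s - previous_output
        previous_output = 1.0 if integrator >= 0.0 else -1.0
        bits.append(previous_output)
    return bits

def boxcar_decimate(values, window):
    """Crude low-pass filter plus decimation: average each non-overlapping block."""
    return [
        sum(values[i:i + window]) / window
        for i in range(0, len(values) - window + 1, window)
    ]

N = 8192
WINDOW = 64

# Slow sine wave, amplitude 0.5, two full cycles over the run.
source = [0.5 * math.sin(2 * math.pi * n / 4096) for n in range(N)]
bitstream = first_order_sdm(source)          # every value is +1 or -1

pcm = boxcar_decimate(bitstream, WINDOW)     # multi-valued, lower sample rate
reference = boxcar_decimate(source, WINDOW)  # same filter applied to the original

worst_error = max(abs(p - r) for p, r in zip(pcm, reference))
print("PCM samples produced:", len(pcm))
print("worst-case error vs. filtered original:", round(worst_error, 4))
```

The output of the filter is no longer a stream of single bits.  Each value is a multiple of 1/32, which needs several bits to store, and a better filter would need many more.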

When we started developing our product DSD Master, we realized very quickly that the choice of filter was the most critically important factor in getting the best possible DSD-to-PCM conversions.  A better choice of filter gave rise to a better-sounding conversion.  FYI, we continue to work on better and improved filters for our DSD Master product, and for our next release we will be introducing a new class of filter that we believe will make virtually perfect PCM conversions!

Unlike SDMs, digital filters are very well understood.  There is virtually no significant aspect of a digital filter’s performance which has not been successfully analyzed to the Nth degree.  The filter’s amplitude and phase responses are fundamentally known.  We can stipulate with certainty the extent to which computer rounding errors are going to impact the filter’s real-world performance, and take measures to get around that if necessary.  In other words, if we know what is in the filter’s input signal, then we know exactly, and I mean EXACTLY, what is going to be in the filter’s output signal.  SDMs, as we have seen above, are not like that.

What does that mean for the DSD-vs-PCM argument?

I really don’t know the answer to that!  On one hand, I am convinced that before too long I will be able to make conversions from DSD to PCM which are virtually perfect, at least to the extent that any PCM representation can be perfect.  On the other hand, I am equally convinced that conversions from PCM to DSD are less perfect, and that SDM technology still has some major advances to be made.  Here at BitPerfect we are working on Look-Ahead SDMs (which, being pedantic, are not strictly speaking SDMs at all) which have the potential to take some small steps forward.  The problem is, they require phenomenal computing power, so our LA-SDM remains very much a lab curiosity.  My feeling is that when each is performed in accordance with the current state-of-the-art, PCM-to-DSD conversion lags DSD-to-PCM conversion in the ultimate quality stakes.

So why - and I’ve said this before - do I still have a lingering preference for DSD over PCM?  I have come to the following conclusion.

DSD is primarily listened to by audio enthusiasts.  The market for DSD comprises people who like music, but still want to hear it well recorded.  It is still a small market, and it is served almost exclusively by specialist providers who are happy to put in the time, expense, and inconvenience required to generate quality product for that market.  People like Cookie Marenco at Blue Coast Records, Jared Sacks at Channel Classics, Morten Lindberg at 2L, Todd Garfinkel at MA Recordings, Gus Skinas at Super Audio Centre and many others, focus on delivering to consumers truly exceptional recordings of uncompromised quality.  DSD, for those people, drives three things, aside from the fact that some of them have their own firmly-established preference for DSD.

First, because of the issues described at length above, tools do not exist to do even the simplest of studio work in the DSD domain.  Even panning and fading require conversion to an intermediate PCM format.  Forget added reverb, pitch correction, and any number of studio tricks of the Pro-Tools ilk.  Recording to DSD forces recordists to strip everything down to its basics, and capture the music in the simplest and most natural manner possible.  That alone usually results in significant increases in the sort of qualities that appeal to audiophiles.

Second, when remastering old recordings for re-release on SACD, or even for digital download as DSD files, mastering engineers will typically pay a lot more attention to details than would normally be the case for a CD release.  Gone will be the demands for compression (or loudness).  The mastering engineer will get the opportunity to dust off that old preamp he always wanted to use, or those old tube amplifiers that he only brings out when the twenty-something suits from the label are not prowling around.  Try Dire Straits’ classic “Brothers In Arms”, which sounds a million times better when specially remastered for SACD (I love the Japanese SHM-SACD remastering) than it ever did on any CD, even though the master tape was famously recorded in 16-bit PCM and mixed down to DAT.  Go figure.

Third, unless you are using one of the few remaining ancient Sonoma DSD recording desks, if you are recording to DSD you will be using some of the latest and highest-spec studio equipment.  That’s where the DSD options are all positioned.  You will be using top-of-the-line mics, mic preamps, ADCs, cables, etc.  As with most things in life, you tend to get what you pay for, and if you are using the best equipment your chances of laying down the best recording can only improve.

So I like DSD, I continue to look out for it, and it continues to sound dramatically better than the vast majority of PCM audio that comes my way.  Is that due to some fundamental advantages of the DSD format, or is it that PCM offers a million new and exciting ways to shoot a recording in the foot?  I’ll leave that for others to decide.

Tuesday, 11 August 2015

Sigma-Delta Modulators - Part I.

I have mentioned SDMs many times in the past.  These are, in effect, complex filter structures that are used to produce DSD and other bitstreams.  I know I talk about DSD a lot, and I also know that digital audio is way more about PCM than it ever is - or ever will be - about DSD.  But, as I have already written, SDMs are in fact core to both ADCs and DACs and therefore also, I think, to resolving (or maybe just understanding) the debate concerning the relationship of DSD to PCM.  So I thought I would devote a post to an attempt to explain what SDMs are, how they work, and what their limitations are.  This will be doubly taxing, because I am far from being any sort of expert, and this is a deeply technical subject.  Finally, I will attempt to place the results of my ramblings in the context of the PCM-vs-DSD debate, with perhaps a surprising result.

The words Sigma and Delta refer to two Greek letters, Σ and Δ, which are used by convention in mathematics to denote addition (Σ) and subtraction (Δ).  Negative feedback, where the output signal is subtracted from the input signal, is a form of Delta modulation.  Similarly, in an unstable amplifier, where the output signal is inadvertently added to the input signal causing it to increase uncontrollably, this is a form of Sigma modulation.  Sigma Delta Modulators work by combining those functions into a single complex structure.  I use the term ‘structure’ intentionally, because SDMs can be implemented both in the analog domain (where they would be referred to as circuits) and in the digital domain (where they would be referred to as algorithms).  In this context, analog and digital refer only to the inputs of the SDM.  An SDM’s output is always digital.  For the remainder of this post I will refer only to digital SDMs, mainly because it is easier to describe.  But you should read it all as being equally applicable to the analog case.

At the core of an SDM lies the basic concept of a negative feedback loop.  This is where you take the output of the SDM and subtract it from its input.  We’ll call that the Delta stage.  If the output of the SDM is identical to its input, then the output of this Delta stage will always be zero.  Between the Delta Stage and the SDM output is a Sigma stage.  A Sigma stage works by maintaining an accumulated value to which it adds every input value it receives.  This accumulated value then becomes its output and therefore also the output of the SDM itself.  Therefore, so long as the output of the SDM remains identical to its input, the output of the Delta stage will always be zero, and consequently will continue to add zero to the accumulated output of the Sigma stage which will therefore also remain unchanged.  This is what we call the “steady-state case”.

But music is not steady-state.  It is always changing.  Let’s look at what happens when the input to the SDM increases slightly.  This results in a small difference between the input and the output of the SDM.  This difference appears at the output of the SDM’s Delta stage, and, consequently, at the input of its Sigma stage.  This causes the output of the Sigma stage to increase slightly.  The output of the Sigma stage is also the output of the SDM, and so the SDM’s output also increases slightly.  Now, the output of the SDM is once more identical to its input.  The same argument can be followed for a small decrease in the input to the SDM.  The SDM as described here is basically a structure whose output follows its input.  Which makes it a singularly useless construct.
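The loop described so far is small enough to write out in a few lines.  This is only a sketch of the idea; the function and variable names are invented for the example.

```python
def delta_sigma_follower(inputs):
    """Delta stage feeding a Sigma stage, with no quantizer in the loop."""
    accumulated = 0.0            # Sigma stage state, which is also the output
    outputs = []
    for x in inputs:
        difference = x - accumulated   # Delta stage: input minus output
        accumulated += difference      # Sigma stage: add it to the running total
        outputs.append(accumulated)
    return outputs

print(delta_sigma_follower([0.0, 0.25, 0.5, 0.25]))   # [0.0, 0.25, 0.5, 0.25]
```

The output is simply a copy of the input, which is exactly why the structure is of no use on its own.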

So now we will modify the SDM described above in order to make it useful.  What we will do is to place a Quantizer between the output of the Sigma stage and the output of the SDM, so that the output of the SDM is now the quantized output of the Sigma stage.  This apparently minor change will have dramatic implications - for a start, this is what gives it its digital-only output.  To illustrate this, we will take it to its logical extreme.  Although we can choose to quantize the output to any bit depth we like, we will elect to quantize it to 1-bit, which means the output can only take on one of two values.  We’ll call these +1 and -1 although we will represent them digitally using the binary digits 1 and 0.  One result of this is that now the input and output values of the SDM will always be different, and the output of the Delta stage will never be zero.  The SDM is still trying to do the same job, which is to try to make the output signal as close as possible to the input signal.  However, since the output signal is now constrained to taking on only the values +1 or -1 it would appear that the SDM is going to flounder.
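To see that a 1-bit output is not quite as hopeless as it sounds, here is a minimal sketch of the loop with a quantizer added.  It assumes the simplest possible arrangement, in which the value fed back to the Delta stage is the quantized output from the previous step, and the starting output of +1 is an arbitrary choice.  All names are hypothetical.

```python
def sdm_1bit(inputs):
    """First-order loop with a 1-bit quantizer; returns a list of +1/-1 values."""
    accumulated = 0.0     # Sigma stage state
    previous_out = 1.0    # assumed starting output (arbitrary)
    bits = []
    for x in inputs:
        accumulated += x - previous_out                    # Delta, then Sigma
        previous_out = 1.0 if accumulated >= 0 else -1.0   # Quantizer
        bits.append(previous_out)
    return bits

bits = sdm_1bit([0.5] * 4000)
print(bits[:8])                 # [-1.0, 1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0]
print(sum(bits) / len(bits))    # 0.5
```

No single output value equals the input of 0.5, yet the proportion of +1s to -1s is arranged so that the average does.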

At this point, mathematics takes over, and it no longer becomes practical to reduce what I am going to describe to simple illustrative concepts.  I hope you will bear with me.

In order to understand what the SDM is actually doing, we need to make some sort of model.  In other words we’ll need a set of equations which describe the SDM’s behaviour.  By solving those equations we can then gain an understanding of what the SDM is and is not capable of doing.  There is a problem, though.  The quantizer introduces a non-linear element.  If we know what the input value to the quantizer is, we can determine precisely what the output value will be.  However, the opposite is not true.  If we know the output of the quantizer, we cannot deduce what the input value was that resulted in that output value.  The way we treat problems such as this is to consider the quantizer instead as a noise source.  We consider that we are instead adding noise (i.e. random values) to the output of the Sigma stage, such that the output values of the SDM end up being either +1 or -1.

The next thing we do is to observe that one thing we have said about how the SDM works is not entirely correct.  We said that at the input to the Delta stage we take the SDM’s input and subtract from it the SDM’s output.  In fact what we subtract is the SDM’s output at the previous time step.  This is very important, because it means that we can use this one-step delay to express the SDM’s behaviour in terms of a digital transfer function, using theories developed to understand how filters work.  I have mentioned such matters before in my previous posts on “Pole Dancing”.  Transfer functions allow you to calculate the structure’s frequency response, and when we apply this approach to the SDM we come up with two equations which we call the Signal Transfer Function (STF) and the Noise Transfer Function (NTF).  These are two very useful properties.
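For the simplest loop of this kind, and treating the quantizer as an added noise term as suggested above, the algebra can be sketched as follows.  The symbols a[n] (the Sigma stage’s accumulated value), x[n] (input), y[n] (output) and e[n] (quantizer noise) are labels chosen just for this example, and the result applies only to this most basic first-order structure.

```latex
\begin{aligned}
a[n] &= a[n-1] + x[n] - y[n-1] && \text{(Delta stage, then Sigma stage)}\\
y[n] &= a[n] + e[n]            && \text{(quantizer modelled as added noise)}\\
\Rightarrow\quad Y(z) &= \underbrace{1}_{\mathrm{STF}(z)}\,X(z)
      \;+\; \underbrace{\left(1 - z^{-1}\right)}_{\mathrm{NTF}(z)}\,E(z)\\
\left|\mathrm{NTF}\!\left(e^{j\omega}\right)\right| &= 2\,\left|\sin(\omega/2)\right|
\end{aligned}
```

In this case the signal passes through untouched, while the noise is multiplied by a factor that is zero at DC and grows steadily with frequency.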

The STF tells us how much of the signal applied to the input of the SDM makes it through and appears in the output, whereas the NTF tells us how much of the quantization noise generated by the quantizer makes it to the SDM’s output.  Both of these properties are strongly inter-related, and are strongly frequency dependent.  Generally, we would like to see STF~1 at low frequencies.  By contrast, we would like to have NTF~0 at the low frequencies but transition to NTF~1 at the high frequencies.  What exactly does all that gobbledygook mean?

The important thing is that at low frequencies we want the combination of STF~1 and NTF~0.  This means that at these low frequencies the output of the SDM contains all of the signal and none of the quantization noise.  However, at high frequencies we would like the opposite to be true, so that the output of the SDM contains none of the signal and all of the quantization noise.  If we can arrange it such that those so-called “low frequencies” actually comprise the audio frequency band, then our SDM can be capable of encoding that music signal with surprising precision even though the format has a bit depth of only 1-bit.  Analysis of the STF and NTF enables us to figure out how high the sample rate must be in order for the full 20kHz+ of the audio frequency bandwidth to fit into the “low frequency” part of the STF/NTF spectrum where sufficiently good performance can be obtained.  The answer, not surprisingly, is what drives DSD to use a sample rate of 2.8MHz.
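To put some rough numbers on this, the sketch below evaluates the NTF of the textbook first-order loop, NTF(z) = 1 - 1/z, whose magnitude is 2|sin(πf/fs)|.  That formula is an assumption about the simplest possible SDM and not a description of any real DSD encoder, and the function names are made up for the example.

```python
import math

def ntf_magnitude(freq_hz, sample_rate_hz):
    """|NTF| of the textbook first-order loop, NTF(z) = 1 - 1/z (an assumption)."""
    return 2.0 * abs(math.sin(math.pi * freq_hz / sample_rate_hz))

def to_db(ratio):
    """Convert an amplitude ratio to decibels."""
    return 20.0 * math.log10(ratio)

DSD_RATE = 2_822_400   # 64 x 44.1kHz

for freq in (1_000, 20_000, DSD_RATE / 2):
    print(freq, round(to_db(ntf_magnitude(freq, DSD_RATE)), 1))
```

At the DSD sample rate this works out to roughly -53dB at 1kHz and -27dB at 20kHz, rising to about +6dB at half the sample rate.  The noise is indeed pushed upwards in frequency, although with this simple loop the suppression inside the audio band is fairly modest.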

A simpler way for the performance potential of this SDM to be viewed is to consider only the quantization noise.  This is nothing more than the difference between what the ideal (not quantized) output signal would look like and what the actual (quantized) output signal actually looks like.  If those differences could be stripped off, then what we would end up with is the ideal output signal in all its glory.  What the NTF of the SDM has done is to arrange for all of those differences to be concentrated into a certain band of high frequencies which are quite separate from the audio frequency band containing the ideal output.  By the simple expedient of applying a suitable low-pass filter, we can filter them out completely, and thereby faithfully reconstruct the ideal output signal.
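A small experiment makes the point.  The sketch below encodes a slow sine wave with a toy first-order 1-bit loop and then applies a moving average as a stand-in for the low-pass filter.  Both the loop and the filter are deliberately crude assumptions chosen for readability, and the names are hypothetical.  To keep the comparison fair, the same moving average is applied to the original input.

```python
import math

def sdm_1bit(inputs):
    """Toy first-order 1-bit loop; feeds back the previous quantized output."""
    accumulated, previous_out, bits = 0.0, 1.0, []
    for x in inputs:
        accumulated += x - previous_out
        previous_out = 1.0 if accumulated >= 0 else -1.0
        bits.append(previous_out)
    return bits

def moving_average(values, length):
    """Crude low-pass filter: mean of each run of `length` consecutive values."""
    return [sum(values[i:i + length]) / length
            for i in range(len(values) - length + 1)]

# A slow sine wave, far below the sample rate, at half of full scale.
signal = [0.5 * math.sin(2 * math.pi * n / 2048) for n in range(8192)]
bits = sdm_1bit(signal)

LENGTH = 64
recovered = moving_average(bits, LENGTH)
reference = moving_average(signal, LENGTH)   # same filter applied to the input
worst = max(abs(r - s) for r, s in zip(recovered, reference))
print(worst)   # well under a tenth of the sine's 0.5 peak
```

Even with a filter this poor, the stream of +1s and -1s turns back into something that tracks the sine wave closely.  A longer or better filter narrows the gap further.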

Unfortunately, the simplistic SDM I have just described is not quite up to the task I set for it.  The NTF is not good enough to meet our requirements.  In reality, there is a final step in the design of the SDM where we need to be able to fine tune the STF and NTF to acquire the characteristics needed to make a high-performance SDM.  What we do is to replace the Sigma modulator with a filter, which is generally termed the Loop Filter.  The transfer function of the loop filter then determines the actual STF and NTF of the final SDM.  Designing the SDM then becomes the task of designing the loop filter.  This is a big challenge.

In Part II I will discuss some of the limitations and challenges of SDM design, and conclude by attempting to place my observations in the context of the ongoing PCM-vs-DSD debate.

Thursday, 30 July 2015

Got a Question?

Got a question you want answered on this blog?  Or a topic you would like to see discussed?  Or maybe you have some feedback you would like to give us?  Just send me an e-mail using blog@bitperfectsound.com.

Here are the rules of engagement:

  1. Depending on how many e-mails I receive, I probably won’t reply to your e-mail.
  2. It is entirely up to me whether I address your suggestion on the blog.
  3. It may be a long time before I get round to doing it.
  4. I may be inspired by your suggestion, but end up addressing a different point entirely.
I have kept a copy of this post among the 'stickies' on the right of the page.

Wednesday, 29 July 2015

Things that lurk below the Bit Depth

In digital audio Bit Depth governs the Dynamic Range and Signal-to-Noise Ratio (SNR), but the relationships often lead to confusion.  I thought it was worth a quick discussion to see if I can maybe shed some light on that.  But then I found that my ramblings went a little further than I originally intended.  So read on…

First of all, it is clear that the bit depth sets some sort of lower limit on the magnitude of the signal that can be digitally encoded.  With a 16-bit PCM system, the magnitude of the signal must be encoded as one of 65,536 levels.  You can think of them as going from 0 to 65,535 but in practice they are used from -32,768 to +32,767 which gives us a convenient way to store both the negative and positive excursions of a typical audio signal.  If the magnitude of the signal is such that its peaks exceed ±32,767 then we have a problem because we don’t have levels available to record those values.  This sets an upper limit on the magnitude of a signal we can record.

On the other hand, if we progressively attenuate the signal, making it smaller and smaller, eventually we will get to the point where its peaks barely register at ±1.  If we attenuate it even further, then it will fail to register at all and it will be encoded as silence.  This sets the lower limit on the magnitude of the signal we can record.  Yes, there are some ifs and buts associated with both of these scenarios, but for the purpose of this post they don’t make a lot of difference.

The ratio between the upper and lower limits of the magnitudes that we can record is the Dynamic Range of the recording system.  The mathematics of this works out to be quite simple.  Each bit of the bit depth provides almost exactly 6dB of Dynamic Range.  So, if we are using a 16-bit system our Dynamic Range will be ~96dB (= 16x6).  And if we increase it to 24-bits the Dynamic Range increases to ~144dB (= 24x6).  For those of you who want the exact formula, it is 1.76 + 6.02D (where D is the bit depth).
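For anyone who wants to check the arithmetic, here is a short sketch.  The per-bit figure is 20·log10(2), about 6.02dB, and the extra 1.76dB is 10·log10(1.5), which appears in the standard textbook result for a full-scale sine wave measured against quantization noise.  The function names are invented for the example.

```python
import math

def level_ratio_db(bit_depth):
    """Ratio of 2**bits levels to a single level, expressed in dB."""
    return 20 * math.log10(2 ** bit_depth)

def sine_sqnr_db(bit_depth):
    """Textbook full-scale-sine to quantization-noise ratio: 6.02*D + 1.76."""
    return level_ratio_db(bit_depth) + 10 * math.log10(1.5)

for bits in (16, 24):
    print(bits, round(level_ratio_db(bits), 2), round(sine_sqnr_db(bits), 2))
# 16 96.33 98.09
# 24 144.49 146.26
```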

So far, so good.  But where does the SNR come into it?  The answer, and the reason why it is the cause of so much confusion, is that both the signal and the noise are frequency dependent.  Both may be spread over several frequencies, which may be similar or different frequencies.  Sometimes you don’t actually know too much about the frequency distributions of either.  Therefore, in order to be able to analyze and measure the ratios of one to the other, you often need to be able to look at the frequency distributions of both.

The way to do that is to take the audio data and use your computer to put it through a Fourier Transform.  This breaks the audio data down into individual frequencies, and for each frequency it tells you how much of that particular frequency is present in the audio data.  If you plot all these data points on a graph you get the audio data’s Frequency Spectrum.  In digital audio, we use a variant of the Fourier Transform called a DFT, which takes as its input a specific part of the audio data comprising a number of consecutive samples.  With a DFT the number of audio samples ends up being the same as the number of frequencies in the resulting Frequency Spectrum, so if we use a lot of audio data we can obtain very good resolution in the frequency spectrum.  However, if we use too many samples it can make the calculation itself excessively laborious, so most audio DFTs are usually derived from between 1,000 and 65,000 samples.

In principle, we can synthesize an audio data file containing nothing but a 1kHz pure tone, with no noise whatsoever.  If we looked at the DFT of that data file we would see a signal at the 1kHz frequency point, and absolutely nothing everywhere else.  This makes sense, because we have some signal, and no noise at all.  I can also synthesize a noise file by filling each audio sample with random numbers.  If the numbers are truly random, we get White Noise.  I can encode my white noise at full power (where the maximum positive and negative going encoded values are ±32,767), or I can attenuate it by 96dB so that the maximum positive and negative going encoded values are ±1.  If I attenuate it by more than that I only get silence.

Suppose I look at a DFT of my synthesized music data file containing white noise at -96dB.  Suppose my DFT uses 8,192 samples, and consequently I end up with a Frequency Spectrum with 8,192 frequencies.  What do we expect to see?  Most people would expect to see the noise value at each frequency to be -96dB, but they would be wrong.  The value is much lower than that.  [Furthermore, there is a lot of “noise” in the frequency spectrum itself, although for the purposes of this post we are going to ignore that aspect of it.]  Basically, the noise is more or less equally shared out among the 8,192 frequencies, so the noise at each frequency is approximately 1/8192 of the total noise, or about 39dB down.  The important result here is that the level of the noise floor in the DFT plot is a long way below the supposed -96dB noise floor, and how far below depends on the number of frequencies in the DFT.  And there is more.  DFTs use a thing called a ‘window function’ for reasons I have described in a previous post, and the choice of window function significantly impacts the level where the noise floor sits in the DFT plot.
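The sharing-out of the noise can be checked with a small experiment.  The sketch below uses a deliberately tiny DFT of 256 samples so that a plain, slow, textbook DFT finishes quickly, and it applies no window function at all.  Both of those are simplifying assumptions, and the helper names are invented for the example.

```python
import cmath
import math
import random

def dft(x):
    """Plain textbook DFT (slow, but fine for a few hundred samples)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def mean_bin_level_db(x):
    """Average per-frequency power relative to the total signal power, in dB."""
    n = len(x)
    total_power = sum(v * v for v in x) / n
    bin_powers = [abs(c) ** 2 / n ** 2 for c in dft(x)]
    return 10 * math.log10((sum(bin_powers) / n) / total_power)

random.seed(1)
noise = [random.uniform(-1, 1) for _ in range(256)]
print(round(mean_bin_level_db(noise), 2))   # -24.08, i.e. 1/256 of the total
print(round(-10 * math.log10(8192), 2))     # -39.13, the same sum for 8,192 frequencies
```

Individual frequencies wander above and below that average, which is the “noise in the spectrum itself” mentioned above, but the average share is fixed by the number of frequencies.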

If I make a synthesized music file containing a combination of a 1kHz pure tone and white noise at -96dB, and look at that in a DFT, what would we see?  The answer is that the noise behaves exactly as I have previously described, with the level of the noise floor on the plot varying according to both the number of frequencies in the DFT and the choice of window function.  The 1kHz pure tone is not affected, though.  Because it is a 1kHz pure tone, its energy only appears at the one frequency in the DFT corresponding to 1kHz, and it really doesn’t matter that much how many frequencies there are in the DFT.  [The choice of window function does impact both of those things, but for the purposes of this post I want to ignore that.]

The Signal-to-Noise Ratio (SNR) is exactly what it says.  It is the ratio of the signal to the noise.  If those values are expressed in a dB scale, then it is the difference between the two dB values.  So if the signal is at -6dB and the noise is at -81dB, then the SNR will be 75dB, which is the difference between the two.  But since we have seen that the actual measured value of the noise level varies depending on how we do the DFT, yet the measured value of the signal pretty much does not, then an SNR value derived from a DFT is not very useful when it comes to quoting numbers for comparison purposes.  It is only useful when comparing two measurements made using the same DFT algorithms, set up with the exact same number of samples and the same window function.

Sometimes the SNR has to be measured purely in analog space.  For example, you might measure the overall background hiss on a blank analog tape before recording anything on it.  When you then make your recording, one measure of the SNR will be the ratio between the level of the recording and the level of the tape hiss.  Or you can measure the output of a microphone in a silent anechoic chamber before using the same microphone to make a recording.  One measure of the SNR of the microphone would be the ratio between the two recorded levels.  I use the term “one measure” of the SNR intentionally - because any time you measure SNR, whatever the tools and methods you use to make the measurement, the result is only of relevance if the methodology is well thought out and fully disclosed along with the results.  In reality, the nuances and variables are such that you can write whole books about how to specify and measure SNR.

Clearly, if the noise component of the SNR is a broadband signal, such as the hiss from a blank tape or the signal from a microphone in a silent room, then my ability to accurately represent that noise in a PCM format is going to be limited by the bit depth and therefore by its Dynamic Range.  But if I use a DFT to examine the spectral content of the noise signal, then, as I have just described, the noise is going to be spread over all of the frequencies and the component of the noise at each frequency will be proportionately lower.  What does it mean, then, if the noise content at a given frequency is below the minimum level represented by the Dynamic Range?  For example, in a 16-bit system, where the Dynamic Range is about 96dB, what does it mean if the noise level at any given frequency is measured using a DFT to be a long way below that - for example at -120dB?  Clearly, that noise is being encoded, so we must conclude that a 16-bit system can encode noise at levels several tens of dB below what we thought was the minimum signal level that could be encoded.  The question then arises, if we can encode noise at those levels, can we also encode signals?

The answer is yes we can, but at this point my challenge is to explain how this is possible in words of one proverbial syllable.  My approach is to propose a thought experiment.  Let’s take the white noise signal we talked about previously - the one at a level of -96dB which is just about encodable in a 16-bit PCM format.  We took our DFT of this signal and found that the noise component at each frequency was way lower than -96dB - let’s say that it was 30dB down at -126dB.  Therefore the frequency content of the noise signal at one specific frequency - say, 1kHz - was at a level of -126dB.  Let us therefore apply a hypothetical filter to the input signal such that we strip out every frequency except 1kHz.  So now, we have taken our white noise signal at -96dB and filtered it to become a 1kHz signal at -126dB.  Our DFT previously showed that we had managed to encode that signal, in the sense that our DFT registered its presence and measured its intensity.  But, with the rest of the white noise filtered out, our input signal now comprises nothing but a single frequency at a level 30dB **below** the minimum level that can be represented by a 16-bit PCM system, and the result is pure, unadulterated, digital silence.

What happened there?  When the 1kHz component was part of the noise, we could detect its presence in the encoded signal, but when the rest of the noise was stripped out leaving only the 1kHz component behind, that 1kHz component vanished also.  It is clear that the presence of the totality of the noise is critical in permitting each of its component frequencies to be encoded.  There is something about the presence of noise that enables information to be encoded in a PCM system at levels far below those determined solely by the bit depth.
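A scaled-down version of this thought experiment can be run directly.  In the sketch below the levels are illustrative and differ from those above: a 1kHz tone with a peak of a quarter of one quantization step is encoded once on its own and once with random noise of half a step either way added to it.  The sample rate, the noise level and all of the names are assumptions made for the example.

```python
import math
import random

SAMPLE_RATE = 48_000
TONE_HZ = 1_000
N = SAMPLE_RATE            # one second of audio

def quantize(value):
    """Round to the nearest whole level (one level = one quantization step)."""
    return math.floor(value + 0.5)

def tone_level(samples):
    """Amplitude of the 1kHz component, found by correlating with a sine."""
    return 2 / len(samples) * sum(
        s * math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE)
        for n, s in enumerate(samples))

tone = [0.25 * math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(N)]

random.seed(0)
alone = [quantize(t) for t in tone]
with_noise = [quantize(t + random.uniform(-0.5, 0.5)) for t in tone]

print(any(alone))                # False: every sample is encoded as zero
print(tone_level(with_noise))    # close to 0.25, the amplitude of the buried tone
```

On its own the tone vanishes completely, yet with the noise present it can be measured in the encoded data at very nearly its true level.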

Exactly what it is, though, is beyond the scope of this post.  I’m sorry, because I know you were salivating to hear the answer!  But from this point on it boils down to ugly mathematics.  However, this result forms the basis of a principle that can be used to accomplish a number of party tricks in the digital audio domain.  These tricks include dithering, noise shaping, and sigma-delta modulation.  With dithering, we can add a very small amount of noise in order for a larger amount of quantization-error-induced distortion to be eliminated.  With noise shaping, we can reduce the noise level at frequencies where we are sensitive to noise, at the expense of increasing the noise at frequencies where it is less audible.  And with sigma-delta modulation we can obtain state-of-the-art audio performance from a bit depth of as little as 1-bit, at the expense of a dramatic increase in the sample rate.

With DSD, for example, an entire audio stream with state-of-the-art performance can be made to lurk below the 1-bit Bit Depth.

Monday, 20 July 2015

How Do DACs Work?

All digital audio, whether PCM or DSD, stores the analog audio signal as a stream of numbers, each one representing an instantaneous snapshot of its continuously evolving value.  With either format, the digital bit pattern is its best representation of the analog signal value at each instant in time.  With PCM the bit pattern typically comprises 16-bit (or 24-bit) numbers, each representing the value of the analog signal to a precision of one part in 65,536 (or one part in 16,777,216).  With DSD the precision is 1 bit, which means that it encodes the instantaneous analog voltage as either maximum positive or maximum negative with nothing in between (and you may well wonder how that manages to represent anything, which is a different discussion entirely, but nevertheless it does).  In either case, though, the primary task of the DAC is to generate those output voltages in response to the incoming bitstream.  Let’s take a look at how that is done.

For the purposes of this post I am going to focus exclusively on the core mechanisms involved in transforming a bit stream into an analog signal.  Aside from these core mechanisms there are further mission-critical issues such as clock timing and noise, but these are not the subject of this post.  At some point I will write another post on clocks, timing, and jitter.

The most conceptually simple way of converting digital to analog, is to use something called an R-2R ladder.  This is a simple sequence of resistors of alternating values ‘R’ and ‘2R’, wired together in a ‘ladder’-like configuration.  There’s nothing more to it than that.  Each ‘2R’ resistor has exactly twice the resistance value as each ‘R’ resistor, and all the ‘R’s and all the ‘2R’s are absolutely identical.  Beyond that, the actual value of the resistances is not crucial.  Each R-2R pair, if turned “on” by its corresponding PCM bit, contributes the exact voltage to the output which is encoded by that bit.  It is very simple to understand, and in principle is trivial to construct, but in practice it suffers from a very serious drawback.  You see, the resistors have to be accurate to a phenomenal degree.  For 16-bit PCM that means an accuracy of one part in 65 thousand, and for 24-bit PCM one part in 16 million.  If you want to make your own R-2R ladder-DAC you need to be able to go out and buy those resistors.
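The behaviour of a perfect ladder is easy to sketch in code: the top bit contributes half of the reference voltage, and each bit below it contributes half as much again.  The sketch assumes ideal resistors, an unsigned input code and a reference of 1 volt, and the function name is invented for the example.

```python
def ideal_ladder_output(code, bits, v_ref=1.0):
    """Output of an ideal ladder: each bit adds half as much as the one above it."""
    voltage = 0.0
    for position in range(bits):                  # position 0 is the top bit
        bit_is_on = (code >> (bits - 1 - position)) & 1
        if bit_is_on:
            voltage += v_ref / 2 ** (position + 1)
    return voltage

print(ideal_ladder_output(0b1000000000000000, 16))   # 0.5 (top bit alone)
print(ideal_ladder_output(0b0000000000000001, 16))   # 1.52587890625e-05 (one part in 65,536)
```

The gap between those two numbers is the problem in a nutshell.  The top bit’s contribution has to be right to within a small fraction of the bottom bit’s contribution for all 16 bits to be meaningful.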

As best as I can tell, the most accurate resistors available out there on a regular commercial basis are accurate to ±0.005% which is equivalent to one part in 20,000.  Heaven knows what they cost.  And that’s not the end of the story.  The resistance value is very sensitive to temperature, which means you have to mount them in a carefully temperature-controlled environment.  And even if you do that, the act of passing the smallest current through it will heat it sufficiently to change its resistance value.  [Note:  In fact this tends to be what limits the accuracy of available resistors - the act of measuring the resistance actually perturbs the resistance by more than the accuracy to which you’re trying to measure it!  Imagine what that means when you try to deploy the resistor in an actual circuit…]  The resistor’s inherent inductance (even straight wires have inductance) also affects the DAC ladder when such phenomenal levels of precision enter the equation.  And we’re still not done yet: unfortunately the resistance values drift with time, so your precision assembled, thermally cushioned and inductance-balanced R-2R network may leave the factory operating to spec, but may well be out of spec by the time it has broken in at the customer’s system.  These are the problems that a putative R-2R ladder DAC designer must be willing and able to face up to.  Which is why there are so few of them on the market.
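One rough way to put that tolerance in perspective is to ask how many bits it corresponds to.  The sketch below treats a fractional tolerance as if it were the size of the smallest step, which is a simplification and not a proper ladder error analysis, and the function name is invented for the example.

```python
import math

def equivalent_bits(tolerance_fraction):
    """Number of bits whose smallest step matches a given fractional tolerance."""
    return math.log2(1 / tolerance_fraction)

print(round(equivalent_bits(0.00005), 1))          # 0.005% -> 14.3
print(round(equivalent_bits(1 / 65_536), 1))       # 16.0
print(round(equivalent_bits(1 / 16_777_216), 1))   # 24.0
```

On that crude measure, one part in 20,000 is worth a little over 14 bits, which is short of the 16 bits needed for CD audio and nowhere near 24.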

Manufacturers of some R-2R ladder-DACs use the term ‘NOS’ (Non-Over-Sampling) to describe their architecture.  I don’t much like that terminology because it is a rather vague piece of jargon and can in principle be used to mean other things, but the blame lies at the feet of many modern DAC chipset manufacturers (and the DAC product manufacturers who use them) who describe their architectures as "Over-Sampling", hence the use of the term NOS as a distinction.

Before moving on, we’ll take an equally close look at how DSD gets converted to analog.  In principle, the incoming bit stream can be fed into its own 1-bit R-2R ladder, which, being 1-bit, is no longer a ladder and comprises only the first resistor R, whose precision no longer really matters.  And that’s all there is to it.  Easy, in comparison to PCM.  Something which has not gone unnoticed … and which we’ll come back to again later.

Aside from what I have just described, for both PCM and DSD three major things are left for the designer to deal with.  First is to make sure the output reference voltages are stable and with as little noise as possible.  Second is to ensure that the switching of the analog voltages in response to the incoming digital bit stream is done in a consistent manner and with sufficient timing accuracy.  Third is to remove any unwanted noise that might be present in the analog signal that has just been created.  These are the implementation areas in which a designer generally has the most freedom and opportunity to bring his own skills to bear.

The third of these is the most interesting in the sense that it differs dramatically between 1-bit (DSD) and multi-bit (PCM) converters.  Although in both cases the noise that needs to be removed lives at inaudible ultrasonic frequencies, with PCM there is not much of it at all, whereas with DSD there is so much of it that the noise power massively overwhelms the signal power.  With PCM, there are even some DACs which dispense with analog filtering entirely, working on the basis that the noise is both inaudible, and at too low a level to be able to upset the downstream electronics.  With DSD, though, removing this noise is a necessary and significant requirement.

Regarding the analog filters, most designers are agreed that although different audio stream formats can be optimized such that each format has its own ideal analog filter, if a DAC is designed to support multiple stream formats it is impractical to provide multiple analog filters and switch them in and out of circuit according to the format currently being played.  Therefore most DACs will have a single analog output filter which is used for every incoming stream format.

The developers of the original SACD players noted that the type of analog filter that was required to perform this task was more or less the same as the reconstruction (anti-imaging) filters used at the output of CD players, which they were trying to improve upon.  They recognized that those filters degraded the sound.  So instead, in the earliest players, they decided to upconvert the DSD from what we today call DSD64 to what we would now call DSD128.  With DSD128 the ultrasonic filter was found to be less of a problem and was considered not to affect the sound in the same way.  Bear in mind, though, that in doing the upconversion from DSD64 to DSD128 you still have to filter out the DSD64’s ultrasonic noise.  However, this can be done in the digital domain, and (long story short) digital filters almost always sound better than their analog counterparts.

As it happens, similar techniques had already been in use with PCM DACs for over a decade.  Because R-2R ladder DACs were so hard to work with, it was much easier to convert the incoming PCM to a DSD-like format and perform the physical D-to-A conversion step in a 1-bit format.  Although the conversion of PCM to DSD via an SDM is technically very complex and elaborate, it can be done entirely in the digital realm which means that it can also be done remarkably inexpensively.

When I say "DSD-like" what do I mean?  DSD, strictly speaking, is a trademark developed by Sony and Philips (and currently owned by Sonic Studio, LLC).  It stands for Direct Stream Digital and refers specifically to a 1-bit format at a sample rate of 2.8224MHz.  But the term is now being widely used to refer to a broad class of formats which encode the audio signal using the output of a Sigma-Delta Modulator (SDM).  An SDM can be configured to operate at any sample rate you like and with any bit depth you like.  For example, the output of an SDM could even be a conventional PCM bitstream and such an SDM can actually pass a PCM bitstream through unchanged.  A key limitation of an SDM is that it can be unstable when configured with a 1-bit output stream.  However, this instability can be practically eliminated by using a multi-bit output.  For this reason, most modern PCM DACs will upconvert (or ‘Over-Sample’) the incoming PCM before passing it through an SDM with an output bit depth of between 3 and 5 bits.  This means that the physical D-to-A conversion is done with a 3- to 5-stage resistor ladder, which can be easily implemented.
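The idea of an SDM with a configurable output bit depth can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: it uses a first-order error-feedback noise shaper, which behaves like a first-order SDM, whereas commercial DAC chips use higher-order designs whose details are not given here.  The function name `sdm` and the 16-bit test signal are hypothetical.  A first-order loop is always stable, so the 1-bit instability mentioned above does not show up in this sketch.

```python
import math

PCM_BITS = 16   # word length of the incoming PCM codes (assumed)


def sdm(pcm_codes, out_bits):
    """First-order error-feedback modulator with an `out_bits`-bit output.

    Input and output are integers on the 16-bit PCM scale; the output is
    restricted to 2**out_bits evenly spaced levels.
    """
    step = 1 << (PCM_BITS - out_bits)        # spacing between output levels
    lowest = -(1 << (PCM_BITS - 1))          # most negative output level
    highest = (1 << (PCM_BITS - 1)) - step   # most positive output level
    error = 0
    out = []
    for x in pcm_codes:
        v = x - error                        # subtract the previous quantization error
        y = ((v + step // 2) // step) * step # round to the nearest output level
        y = max(lowest, min(highest, y))     # keep within the available levels
        error = y - v                        # remember the error for the next sample
        out.append(y)
    return out


# A half-scale sine wave as 16-bit PCM codes (assumed test signal).
pcm = [round(16000 * math.sin(2 * math.pi * n / 400)) for n in range(4000)]

passthrough = sdm(pcm, 16)   # output depth equals input depth
four_bit = sdm(pcm, 4)       # 16 output levels, as in a 4-bit converter stage

print("16-bit output identical to input:", passthrough == pcm)
print("distinct levels in 4-bit output:", len(set(four_bit)))
print("accumulated drift of 4-bit output:", abs(sum(four_bit) - sum(pcm)))
```

With a 16-bit output the quantizer never has to round anything, so the PCM passes through unchanged.  With a 4-bit output each sample is coarse, but the feedback keeps the running total of the output within half a quantization step of the running total of the input, which is what pushes the error up to high frequencies where it can be filtered out.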

These SDM-based DACs are so effective that today there are hardly any R-2R ladder DACs in production, and those that are, such as the Light Harmonic Da Vinci, can be eye-wateringly expensive.  The intermediate conversion of an incoming signal to a DSD-like format means that, in principle, any digital format (including DSD) can be readily supported, as evidenced by the plethora of DSD-compatible DACs on the market today.  Because these internal conversions are performed entirely in the digital domain, manufacturers typically produce complete chip sets capable of performing all of the conversion functionality on-chip, driving the costs down considerably when compared to an R-2R ladder approach.  The majority of DACs on the market today utilize chip sets from one of five major suppliers, namely ESS, Wolfson, Burr-Brown (TI), AKM, and Philips, although there are others as well.

Interestingly, all of this is behind the recent emergence of DSD as a niche in-demand consumer format.  In a previous post I showed that almost all ADCs in use today use an SDM-based structure to create a ‘DSD-like’ intermediate format which is then digitally converted to PCM.  Today I showed the corollary in DAC architectures where incoming PCM is digitally converted to a ‘DSD-like’ intermediate format which is then converted to analog.  The idea behind DSD is that you get to ‘cut out the middlemen’ - in this case the digital conversions to and from the ‘DSD-like’ intermediate formats.  Back when SACD was invented the only way to handle and distribute music data which required 3-5GB of storage space was using optical disks.  Today, not only do we have hard disks that can hold the contents of hundreds upon hundreds of SACDs, but we have an internet infrastructure in place that allows people to download such files as a matter of convenience.  So if we liked the sound of SACD, but wanted to implement it in the more modern world of computer-based audio, the technological wherewithal now exists to support a file-based paradigm similar to what we have become used to with PCM.  This is what underpins the current interest in DSD.
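The storage figure is easy to check with some back-of-envelope arithmetic.  The sketch below assumes an uncompressed two-channel DSD64 stream and a 74-minute program; both are assumptions chosen for illustration, and an actual SACD may also carry a multichannel program or use lossless compression, neither of which is modelled here.

```python
DSD64_RATE_HZ = 64 * 44_100    # 2,822,400 one-bit samples per second
CD_RATE_HZ = 44_100            # CD sample rate
CD_BITS = 16                   # CD word length
CHANNELS = 2                   # stereo (assumed)
DURATION_S = 74 * 60           # 74-minute program (assumed)

# Bytes needed for the whole program in each format.
dsd_bytes = DSD64_RATE_HZ * 1 * CHANNELS * DURATION_S // 8
cd_bytes = CD_RATE_HZ * CD_BITS * CHANNELS * DURATION_S // 8

print(f"DSD64 stereo: {dsd_bytes / 1e9:.2f} GB")
print(f"CD stereo:    {cd_bytes / 1e9:.2f} GB")
print(f"ratio:        {dsd_bytes / cd_bytes:.0f}x")
```

Under these assumptions the stereo DSD64 program needs about 3.1 GB, exactly four times the size of the same program at CD resolution, which is consistent with the lower end of the 3-5GB range quoted above.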

To be sure, the weak link of the above argument is that DSD is not the same as ‘DSD-like’, and in practice you still have to convert digitally between ‘DSD-like’ and DSD in both the ADC and the DAC.  But a weak link is not the same thing as a fatal flaw, and DSD as a consumer format remains highly regarded in many discerning quarters.