I want to address a critical phenomenon for which there isn't an adequate explanation, and provide a rationale for it in terms of another phenomenon for which there isn't an adequate explanation. Pointless, perhaps, but it is the sort of thing that tends to keep me up at night. Maybe some of you too!
Most of you, being BitPerfect Users, will already know that BitPerfect achieves "Bit Perfect" playback (when configured to do so), but that so can iTunes (although configuring it can be a real pain in the a$$). Yet, I am sure you will agree, they manage to sound different. Other "Bit Perfect" software players also manage to sound different. Moreover, BitPerfect has various settings within its "Bit Perfect" repertoire - such as Integer Mode - which can make a significant difference by themselves. What is the basis for this unexpected phenomenon?
First of all, we must address the "Flat Earth" crowd who will insist that there cannot possibly BE any difference, and that if you say you can hear one, you must be imagining it. You can spot them a mile away. They will invoke the dreaded "double-blind test" at the drop of a hat, even though few of them actually understand the purpose and rationale behind a double-blind test, and fewer still have ever organized or participated in one. I tried to set up a series of publicly-accessible double-blind tests at SSI 2012 with the assistance of a national laboratory's audio science group. They couldn't have shown less interest if I had proposed to infect them with anthrax. Audio professionals generally won't touch a double-blind test with a ten foot pole. Anyway, as far as the Flat Earth crowd are concerned, this post, and those that follow, are all about discussing something that doesn't exist. Unfortunately, I cannot make the Flat Earthers vanish simply by taking the position that they don't exist!
For the rest of you - BitPerfect Users, plus anyone else who might end up reading this - the effect is real enough, and a suitable explanation would definitely be in order. That is, if we had one for you.
If it is not the data itself (because the data is "Bit Perfect"), then we must look elsewhere. But before we do, some of you will ask "How do we know that the data really is Bit Perfect?", which is a perfectly reasonable question. It is not one I am going to dwell on here, though, except to say that it has been thoroughly shaken down. Using USB it is actually quite easy to verify (in the sense that it is not technically challenging), although using S/PDIF requires an investment in very specific test equipment. The bottom line, though, is that this has been done and nobody holds any lingering concerns over it. I won't address it further.
As Sherlock Holmes might observe, once we accept that the data is indeed "Bit Perfect", the only thing that is left is a phenomenon most of us have heard of, but few of us understand - jitter. Jitter was first introduced to audiophiles in the very early 1990s as an explanation for why so many people professed a dislike for the CD sound. Digital audio comprises a bunch of numbers that represent the amplitude of a musical waveform, measured ("sampled" is the term we use) many thousands of times per second. Some simple mathematical theorems tell us how often we need to sample the waveform, and how accurate those sample measurements need to be, in order to achieve specific objectives. Those theorems led the developers of the CD to select a sample rate of 44,100 times per second, and a measurement precision of 16 bits. We can play back the recorded sound by using those numbers - one every 1/44,100th of a second - to regenerate the musical waveform. This is where jitter comes in. Jitter reflects a critical core fact - "The Right Number At The Wrong Time Is The Wrong Number".
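For those who prefer to see actual numbers, here is a minimal sketch in Python (purely my own illustration - it has nothing to do with BitPerfect's internals or any real DAC) of what CD-format sampling amounts to: an amplitude reading taken every 1/44,100th of a second, rounded to one of 65,536 possible 16-bit values.

```python
import numpy as np

FS = 44100        # CD sample rate, readings per second
BITS = 16         # CD precision

def sample_and_quantize(signal, duration=0.001):
    """Sample a continuous-time signal (a function of time in seconds,
    returning values between -1 and +1) at the CD rate, and round each
    reading to the nearest 16-bit integer."""
    t = np.arange(0.0, duration, 1.0 / FS)    # one instant every ~23 microseconds
    full_scale = 2 ** (BITS - 1) - 1          # 32,767 for 16-bit
    return t, np.round(signal(t) * full_scale).astype(np.int16)

# Example: one millisecond of a full-amplitude 1 kHz tone
t, samples = sample_and_quantize(lambda t: np.sin(2 * np.pi * 1000 * t))
print(samples[:5])    # the first few of the 44,100-per-second readings
```

Each of those integers is a "right number"; everything that follows is about whether it gets taken, or replayed, at the right time.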
Jitter affects both recording and playback, and only those two stages. Unfortunately, once it has been embedded into the recording you can't do anything about it, so we tend to think of it only in terms of playback. But I am going to describe it in terms of recording, because it is easier to grasp that way.
Imagine a theoretically perfect digital audio recorder recording in the CD format. It is measuring the musical waveform 44,100 times a second. That's one datapoint every 23 microseconds (23 millionths of a second). At each instant in time it has to measure the magnitude of the waveform, and store the result as a 16-bit number. Then it waits another 23 microseconds and does it again. And again, and again, and again. Naturally, the musical waveform is constantly changing. Now imagine that the recorder, by mistake, takes one of its readings a smidgeon too early or too late. It will measure the waveform at the wrong time. The result will not be the same as it would have been if it had been measured at the right time, even though the measurement itself was taken accurately. We have measured the right number at the wrong time, and as a result it is the wrong number. When it comes time to play back, all the DAC knows is that the readings were taken 44,100 times a second. It has no way of knowing whether any individual reading was taken a smidgeon too early or too late. A perfect DAC would therefore replay the wrong number at the right time, and as a result it will create a "wrong" waveform. These timing errors - these smidgeons of time - are what we describe as "Jitter". Playback jitter is an identical problem. If the replay timing in an imperfect real-world DAC is off by a smidgeon, then the "right" incoming number will be replayed at the "wrong" time, and the result will likewise be a wrong waveform.
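If you want to see the "right number at the wrong time" effect for yourself, the following rough simulation (the tone, the amount of jitter and the random timing model are all arbitrary choices of mine, purely for illustration) samples the same full-amplitude high-frequency tone twice - once at the exact nominal instants, and once with each instant nudged by a random smidgeon - and reports how far apart the recorded numbers end up.

```python
import numpy as np

FS = 44100            # CD sample rate
F_TONE = 20000        # a high-frequency, full-amplitude tone (close to the worst case)
JITTER_RMS = 1e-9     # 1 ns RMS of random timing error - an arbitrary illustrative figure
N = FS                # one second of samples

rng = np.random.default_rng(0)
nominal_times = np.arange(N) / FS
jittered_times = nominal_times + rng.normal(0.0, JITTER_RMS, N)

signal = lambda t: np.sin(2 * np.pi * F_TONE * t)

ideal = np.round(signal(nominal_times) * 32767)       # right numbers at the right times
jittered = np.round(signal(jittered_times) * 32767)   # accurate readings, taken at the wrong times

err = jittered - ideal    # the recorded data now differs, sample by sample
print("worst error: %d LSBs, RMS error: %.2f LSBs" % (int(abs(err).max()), err.std()))
```

Shrink JITTER_RMS down towards a few hundred picoseconds and the discrepancies fall to around a single least-significant bit, which is where the back-of-envelope limit in the next paragraph comes from.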
Just how much jitter is too much? Let's examine a 16-bit, 44.1kHz recording. Such a recording will be bandwidth limited theoretically to 22.05kHz (practically, to a lower value). We need to know how quickly the musical waveform could be changing between successive measurements. The most rapid changes generally occur when the signal comprises the highest possible frequency, modulated at the highest possible amplitude. Under these circumstances, the waveform can change from maximum to minimum between adjacent samples. A "right" number becomes a "wrong" number when the error exceeds the precision with which we can record it. A number represented by a 16-bit integer can take on one of 65,536 possible values. So a 16-bit number which changes from maximum to minimum between adjacent samples sweeps through all 65,536 of those values - 65,535 steps - between samples. Therefore, in this admittedly worst-case scenario, we will record the "wrong" number if our "smidgeon of time" exceeds 1/65,535 of the time between samples, which you will recall was 23 millionths of a second. That puts the value of our smidgeon at 346 millionths of a millionth of a second. In engineering-speak that is 346ps (346 picoseconds). That's a very, very short time indeed. In 346ps, light travels about 4 inches. And a speeding bullet will traverse about 1/300 of the diameter of a human hair.
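That arithmetic is short enough to check in a couple of lines. This is simply the worst-case, maximum-to-minimum-between-samples reasoning from the paragraph above written out as a formula - not a statement about any particular piece of hardware.

```python
def worst_case_jitter(bits, sample_rate):
    """Timing error (in seconds) at which a full-scale swing between adjacent
    samples moves the reading by one quantization step."""
    steps_between_samples = 2 ** bits - 1    # 65,535 steps for 16-bit
    sample_period = 1.0 / sample_rate        # ~23 microseconds at 44.1 kHz
    return sample_period / steps_between_samples

print(worst_case_jitter(16, 44100))          # ~3.46e-10 seconds, i.e. ~346 picoseconds
```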
I have just described jitter in terms of recording, but the exact same conditions apply during playback, and the calculations are exactly the same. If you want to guarantee that jitter will not affect CD playback, it has to be reduced to less than 346ps. However, in the real world, there are things we can take into account to alleviate that requirement. For example, real-world signals do not typically encode components at the highest frequencies at the highest levels, and there are various sensible theories as to how to better define our worst-case scenario. I won't go into any of them. There are also published results of real-world tests which purport to show that for CD playback, jitter levels below 10ns (ten nanoseconds; a nanosecond is a thousand picoseconds) are inaudible. But these tests are 20 years old now, and many audiophiles take issue with them. Additionally, there are arguments that higher-resolution formats, such as 24-bit 96kHz, have correspondingly tighter jitter requirements. Let's just say that it is generally taken to be desirable to get jitter down below 1ns.
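For what it is worth, feeding a higher-resolution format into the same worst-case formula shows why those arguments arise. Like the 346ps figure itself, this is a theoretical bound, not a claim about audibility.

```python
# The same worst-case reasoning, applied to a 24-bit / 96 kHz recording
sample_period = 1.0 / 96000          # ~10.4 microseconds between samples
steps = 2 ** 24 - 1                  # roughly 16.8 million quantization steps
print(sample_period / steps)         # ~6.2e-13 seconds, i.e. ~0.6 picoseconds
```

That is several hundred times tighter than the CD-format figure.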
If you require the electronics inside your DAC to deliver timing precision somewhere between 10ns and 346ps, this implies that those electronics must have a bandwidth of somewhere between 100MHz and 3GHz - roughly the reciprocal of the timing precision. That is RF (Radio Frequency) territory, and we will come back to it again later. Any electronics engineer will tell you that electrical circuits stop behaving sensibly, logically and rationally once you start playing around in the RF. The higher the bandwidth, the more painful the headaches. Electronics designers who work in the RF are in general a breed apart from those who work in the AF (the Audio Frequency band).
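The bandwidth figures quoted above come from treating the required analogue bandwidth as roughly the reciprocal of the timing precision - a crude rule of thumb, not a rigorous piece of circuit analysis.

```python
# Crude reciprocal rule of thumb relating timing precision to required bandwidth
for jitter_seconds in (10e-9, 1e-9, 346e-12):
    approx_bandwidth_hz = 1.0 / jitter_seconds
    print("%5.0f ps of timing precision -> roughly %.1f GHz of bandwidth"
          % (jitter_seconds * 1e12, approx_bandwidth_hz / 1e9))
```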
The bottom line here is that digital playback is a lot more complicated than just getting the exact right bits to the DAC. They have to be played back with a timing precision which invokes unholy design constraints.
Tomorrow I will talk about the audible and measurable effects of jitter.