Bell helped teach the deaf to speak aloud, and had a passionate interest in the reproduction and transmission of spoken words. Yet he ushered in a long era in which POTS (Plain Old Telephone Service) provided a scratchy, low-fidelity, cold rendition of how we sound. Mobile phones didn’t do much better. Using early encoding techniques designed for slow mobile processors, cell phones were often far worse than POTS in carrying the nuance of our speech.
While today’s public switched telephone network (PSTN) is digital at its core, the last bit (known as the final mile) between phone exchanges and homes or businesses is analog, just as it was in early phone networks. We speak into a modern phone that almost certainly no longer uses the compression properties of carbon granules to directly create the electrical signal that goes over the wire, but nonetheless uses a digital facsimile of the same. (Businesses may use digital exchanges, but the outcome is fed into the same digital meatgrinder as analog voice connections.)
The analog system uses filters to capture a range of sound from about 300 hertz (Hz) to 3300 Hz. The lower number, measured in cycles per second, represents deeper sounds (a slower cycling) and the higher, high-pitched ones. Most of the primary sound and amplitude of human speech is at the lower end of that spectrum, whether the voice is male or female. (Wherever analog voice terminates in the PSTN at a digital gateway, it’s converted into a standard form that’s the equivalent of about 12 to 13 bits per sample at 8,000 samples per second. Modern cell phones capture approximately the same frequencies and digital sampling rates. Sprint may have trumpeted the “pin drop” in ads in the mid-1980s, noting the lack of noise in its fiber-connected network, but it didn’t improve the frequency range.)
You have to look to the harmonics of a voice to understand why the cutoffs at the lower and upper ends make it difficult to understand what people say over a phone, and why callers don’t sound really present to you. Harmonics are an artifact of vibrations; almost anything that oscillates has harmonics. Take a piece of string, stretch it, and thrum it, and you might even see the fundamental frequency, the main or base oscillation on which most of the energy is present. But the overall vibration carries with it multiples of that fundamental one. We hear a single sound composed of all the overlaid harmonics at once, although we can train our ear to pick among them. (Encyclopedia Britannica provides a nice explanation.)
With speech, the fundamental frequency can be centered below 300 Hz, while overtones can reach over 10 kHz. Harmonics from normal speech are quieter (and physiologically sound quieter to us) the higher they go. Trained singers can control some of their overtones, while harmony singers can produce marvelous new sounds at higher pitches from the intersection of harmonics. Polyphonic singers, like Tuvan throat singers, can modulate fundamental tones and harmonics simultaneously. (You can find a marvelously clear explanation with illustrations of frequency limits in voice communications from a 2006 white paper of a firm that was at the time promoting wider dynamic range VoIP in its products.)
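To make the effect of the phone network's cutoffs concrete, here is a minimal sketch in Python. It assumes a hypothetical speaker with a 120 Hz fundamental, an ideal harmonic series (exact integer multiples), and a perfectly sharp 300 Hz to 3300 Hz filter. Real voices and real filters are messier, and the function names are made up for illustration.

```python
def harmonic_series(fundamental_hz, limit_hz):
    """Integer multiples of the fundamental, up to limit_hz (idealized)."""
    count = limit_hz // fundamental_hz
    return [fundamental_hz * n for n in range(1, count + 1)]


def split_by_passband(freqs_hz, low_hz=300, high_hz=3300):
    """Separate frequencies an ideal band-pass filter keeps from those it drops."""
    kept = [f for f in freqs_hz if low_hz <= f <= high_hz]
    lost = [f for f in freqs_hz if f < low_hz or f > high_hz]
    return kept, lost


# Assumption: a 120 Hz fundamental with overtones followed up to 10 kHz.
series = harmonic_series(120, 10_000)
kept, lost = split_by_passband(series)

print(f"harmonics considered: {len(series)}")
print(f"kept by phone band:   {len(kept)} ({kept[0]} Hz to {kept[-1]} Hz)")
print(f"lost below the band:  {[f for f in lost if f < 300]}")
print(f"lost above the band:  {len([f for f in lost if f > 3300])}")
```

Under these assumptions the fundamental itself (120 Hz) and its first overtone (240 Hz) never make it onto the wire, and neither does anything above 3240 Hz.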
The frequencies captured also define the dynamic range: not just which frequencies, but the difference in expressiveness by tone. In photography, dynamic range is the gradation of all the grays captured from lightest to darkest. The greater the dynamic range, the more real (or even hyperreal, with high-dynamic-range imaging) that pictures appear. Further, the gap between each step in capturing dynamic range (from one tone to the next adjacent one) defines how smooth the audio sounds. In a photo, it’s the difference between images with gray banding and ones that appear to have a continuous tone. Beyond dynamic range lies the difference between louds and softs. Phone calls compress amplitude, missing the softest sounds and turning everything largely into a muddle in the middle.
Photo: Paul Van Damme
This is why, when you listen to broadcast FM radio, even with any scratchiness that eats into the signal, you feel like you are physically co-located with the sound. FM radio doesn’t have a sample frequency as such, because it’s continuous and analog, but it has a dynamic range of 30 Hz to 15 kHz, which covers most spoken, sung, and musical tones.
But here’s the thing. If the PSTN is all digital in its core, why can’t we just stick digital filters on both ends that let us capture a greater range of audible frequencies with greater accuracy and greater clarity? The PSTN allots 64 Kbps in its circuit-switched (dedicated capacity) approach to each voice call, but modern compression is much better. GSM cell networks use a standard that can stream at about 5 Kbps to 12 Kbps.
An AAC file at nearly the same quality as an uncompressed audio CD recording can encode roughly 20 Hz to 22 kHz (to get the highs and lows of music), with 16-bit stereo samples (to provide nice differentiation in that range) taken at a rate of 44.1 kHz (for clarity), all in about 128 Kbps. But that’s for music. Spoken voice can be compressed even further, down to 48 to 96 Kbps, while maintaining excellent quality.
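The arithmetic behind these numbers is easy to check. The sketch below computes uncompressed bit rates from sample rate, sample size, and channel count. Two assumptions are mine: "Kbps" is treated as 1,000 bits per second, and the 8 bits per sample for the PSTN is inferred by dividing 64 Kbps by 8,000 samples per second. The function name is hypothetical.

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed audio: samples/sec x bits/sample x channels."""
    return sample_rate_hz * bits_per_sample * channels


# PSTN voice channel: 8,000 samples per second, 8 bits each (inferred), mono.
pstn_bps = pcm_bitrate_bps(8_000, 8)

# Audio CD: 44.1 kHz, 16-bit samples, stereo.
cd_bps = pcm_bitrate_bps(44_100, 16, channels=2)

# AAC at the rate quoted in the text.
aac_bps = 128_000

# How much smaller the AAC stream is than the raw CD stream.
cd_to_aac_ratio = cd_bps / aac_bps

print(f"PSTN channel:     {pstn_bps:,} bps")
print(f"Uncompressed CD:  {cd_bps:,} bps")
print(f"AAC vs. raw CD:   about {cd_to_aac_ratio:.0f}:1 smaller")
```

In other words, a 128 Kbps AAC stream is only twice the size of a single PSTN voice channel, yet it carries something close to CD-quality stereo.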
Given that a DSL line using the same two wires that carry analog voice can handle 24 Mbps and even more these days, what gives with voice? Possibly one day, we’ll see the end of analog phones and analog lines when nearly everyone has Internet-based VoIP or a mobile phone, and the remaining holdouts (the stubborn, the elderly, and the poor, typically) are forced to attach adapters. (That’s how the U.S. managed the digital television switchover.)
But for now, the PSTN is the PSTN and the Internet is the Internet, and the two kinds of switching networks don’t meet except at gateways. VoIP-to-VoIP over the Internet provides a workaround. Even the earliest successful VoIP calls I can remember making between two computers sounded better to me than any traditional voice call. The problem was always latency (the time it takes for data to transit from one end to the other) and jitter (variation in how steadily the necessary packets arrive, and in what order). Latency is down, jitter reduced, and quality has improved dramatically since the late 1990s, as better compression techniques, more processing power, and the greater availability of bandwidth allow a richer representation of voice.
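For a rough feel of how latency and jitter differ, here is a small Python sketch. The jitter calculation borrows the running estimator described in RFC 3550 (the RTP specification), which is one common way to quantify jitter and not necessarily what any particular VoIP program uses. The timestamps are invented, and I use milliseconds where RTP uses timestamp units.

```python
def mean_latency_ms(send_ms, recv_ms):
    """Average one-way transit time across all packets."""
    transits = [r - s for s, r in zip(send_ms, recv_ms)]
    return sum(transits) / len(transits)


def interarrival_jitter_ms(send_ms, recv_ms):
    """Running jitter estimate in the style of RFC 3550:
    J = J + (|D| - J) / 16, where D is the change in transit time
    between consecutive packets."""
    jitter = 0.0
    for i in range(1, len(send_ms)):
        prev_transit = recv_ms[i - 1] - send_ms[i - 1]
        this_transit = recv_ms[i] - send_ms[i]
        delta = abs(this_transit - prev_transit)
        jitter += (delta - jitter) / 16
    return jitter


# Hypothetical call: one packet sent every 20 ms.
sent = [0, 20, 40, 60, 80]
steady = [50, 70, 90, 110, 130]   # every packet takes exactly 50 ms
bursty = [50, 85, 92, 130, 131]   # transit time wanders from packet to packet

print("steady:", mean_latency_ms(sent, steady), interarrival_jitter_ms(sent, steady))
print("bursty:", mean_latency_ms(sent, bursty), interarrival_jitter_ms(sent, bursty))
```

The steady stream has 50 ms of latency and no jitter at all. The bursty one has similar average latency but nonzero jitter, and that unevenness is what forces a receiver to buffer audio or drop it.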
Skype wasn’t the first system to allow end-to-end VoIP calls by a long shot, although it is surely the most popular at present. It has stepped through a few codecs (the algorithms that convert uncompressed digital representations of media into more compact ones and back again) since its 2003 introduction, and developed its own, SILK, in 2009. SILK captures 70 Hz to 12 kHz at sample rates that vary from 8 to 24 kHz and result in throughput of 6 to 40 Kbps. It varies depending on conditions, with the best results with the highest consistent available throughput.
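The link between sample rate and the highest frequency a codec can carry is the Nyquist limit: a digital system can represent frequencies only up to half its sample rate. A quick sketch in Python, using the sample rates quoted in this article (the labels and function name are mine):

```python
def nyquist_limit_hz(sample_rate_hz):
    """Highest frequency representable at a given sample rate."""
    return sample_rate_hz / 2


# Sample rates mentioned in the article, in samples per second.
rates = {
    "PSTN gateway": 8_000,
    "SILK, lowest rate": 8_000,
    "SILK, highest rate": 24_000,
    "CD / AAC music": 44_100,
}

for label, rate in rates.items():
    print(f"{label:20s} {rate:>6} samples/s -> up to {nyquist_limit_hz(rate):,.0f} Hz")
```

That is why SILK's 12 kHz ceiling lines up with its 24 kHz top sample rate, and why the phone network's 8,000 samples per second could never carry much beyond the 3300 Hz it already delivers.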
I’ve done a fair amount of radio guesting in the last several years, and I remember the lovely feeling, the first time, of putting on a set of headphones in the studio, talking into a nice mic, and hearing myself and the host sound as rich through my ears as when I listen to actual broadcasts and podcasts. When I started using Skype routinely around the same time, I had the same reaction: this has the warmth, fullness, and clarity of radio broadcasts. (In a bit of irony, I am often interviewed by radio shows from home via Skype. The program records both ends of the call on its side, and I use Audio Hijack Pro to record my end using a Blue Yeti mic. I send them my audio file, but they have theirs in case of a problem with my recording.)
Make a Skype call using earbuds or with a USB headset, close your eyes, and you find yourself transported next to the party you’re calling. The sense of presence comes through. When I set up interviews for articles, I try to get the other party on Skype. A phone call, and too often a cell call, is scratchy and flat. You can’t get to know someone in a short time with that flat of a call, as you sound dead and distant to the other party. Skype and other VoIP programs with good codecs bring you as close as you can come without being there.
My friends Lex Friedman (a Macworld magazine editor) and Marco Tabini (an open-source development advocate) recently released an iOS game called Let’s Sing. When Lex told me about the game, I thought it a terrific idea, but couldn’t articulate why, even after he let me help test it. The game is a bit like Draw Something but for singing, humming, or whistling a tune without using the lyrics to get a partner to guess the title.
After playing a number of rounds, I realized what Lex and Marco had hit upon, and why I’d soured on Draw Something (besides some game mechanic issues). Drawing can require time, deliberation, and skill, even for silly purposes, and I’m not great at drawing on an iPhone. Watching a drawing unfold in sped-up time can be tedious. There is a human connection there, watching someone’s finger or stylus at work. But it never felt like a real bond.
What my friends hit upon is voice. They record at high-enough fidelity that every round for me is a beautiful connection with friends and family. I discovered Lex’s wife, Lauren, has a lovely voice, and I already knew my pal Ren can belt a tune. That connection makes the game work: I like to hear the voices of people I know and love.
Photo: Brett Claxton
We’ve seen a rebirth via the Internet in the full expressive representation of the sounds we emit, and, I believe, made greater connections with each other as a result. Skype and other Internet telephony programs provide free computer-to-computer connections, and the free part certainly drove usage for a long time. (Skype is now a double-digit percentage of all international calls.)
But I’d argue that what drives me and others to Skype isn’t just cost. I have effectively free long-distance calling for my purposes with my mobile phone, and services have long existed to let you dial around international long distance for cheap per-minute rates. Rather, I go to Skype to hear the way people sound, and have real conversations.
Bell gave up his work creating discrete multi-tone communications, leaving that for John Cioffi to make use of 100 years later (and win a Bell award), in order to crush the human voice. He didn’t intend that, but it happened nonetheless. It’s a bit of neat closure to see that Bell’s initial interest, applied to data communications, has brought back the clarity of voice.