An Introduction to Digital Audio
A guide to the theory of digital audio, explaining the process of analogue to digital conversion and how sound is represented and stored in digital form.
All sound is analogue in nature, as all sound that reaches our ears is in an acoustical form regardless of its source. Digital audio is a format for digital technology to understand sounds, and to even create them, and it is a tool to allow technology to connect with the audible environment.
This paper presents the scientific principles of digital audio in a matter-of-fact way, at an introductory level for the reader. Subjective discussions, on the whole, can be found in further advice documents whose subjects relate to the concepts presented here. This paper is not designed to give direct advice on digital audio, but rather to help form an understanding of the techniques used in digitising sound from analogue form. Having a basic grasp of the principles outlined here will greatly benefit any further understanding of the applications of digital audio in real world scenarios.
It should be noted that some of the processes mentioned in this document are very complicated and at times can be technologically baffling. Every effort has therefore been made to simplify these processes greatly to aid the general understanding of the concepts as a whole.
What is digital audio?
Digital data is a method of storing values in binary form. These values can be the plots of a graph, the numbers of a sequence and, in more complex situations, the values of colour embedded in a picture or values of the properties of sound.
If you can imagine that every aspect of a piece of information can be represented by a numerical value then the concept of digital data becomes easier to understand. For example, a sound wave is representative of changes of amplitude in time (see The Physical Principles of Sound). These changes of amplitude are represented as changes of voltage in analogue systems. If these voltage (amplitude) values are recorded at regularly spaced intervals in time, it is then possible to create an approximation of the sound wave in a digital form.
This information is stored in a digital file format, and the information therein can be used to reproduce the original acoustical sounds at a later time.
Binary form is the base 2 counting system that is standard for storing digital data. Binary numbers are expressed in bits, where a 'bit' is an abbreviation of the term 'binary digit'.
As opposed to decimal numbers (known as base 10), which use the ten digits 0 - 9, binary (or base 2) uses only two digits, 0 and 1.
It is highly efficient in computer systems, as only two numerals are needed in ascertaining an electronic system's primary concern, which is that of voltage being ON or OFF.
The following table shows how binary's two digits are used to express decimal values.
|Binary Form||Decimal Form|
The table is indicative of binary expressions. For example it can be seen that a decimal value of 6 is equal to the binary value 110.
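The relationship between the two forms can also be sketched in a few lines of Python. This is a minimal illustration rather than part of the original guide; the function names `to_binary` and `to_decimal` are hypothetical, and only non-negative integers are handled.

```python
def to_binary(value: int) -> str:
    """Return the binary form of a non-negative decimal integer."""
    if value < 0:
        raise ValueError("this sketch handles non-negative integers only")
    return format(value, "b")


def to_decimal(bits: str) -> int:
    """Return the decimal value of a string of binary digits."""
    return int(bits, 2)


# Print a small binary/decimal listing for the decimal values 0 to 7.
for n in range(8):
    print(f"{to_binary(n):>3} | {n}")
```

Calling `to_binary(6)` gives the string `110`, matching the example in the text, and `to_decimal("110")` converts it back to 6.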
Negative integers can also be expressed in binary form, commonly by using an extra leftmost bit (the sign bit) to indicate whether the number is positive ('0') or negative ('1').
All digital information is stored in binary, and in digital audio files, this refers to the values of the recorded changes in electrical voltage.
Analogue to Digital Conversion
The Audio Signal Path
In order to fully understand the concepts of digital audio, it is worth examining the processes undertaken when digitising analogue audio for playback. The diagram below illustrates the signal flow of sound from an analogue input through a digital system to an acoustic output.
1. An analogue or acoustical input source, such as a tape machine or a microphone, supplies an electrical signal (voltage); a microphone does this by converting air pressure variations.
2. An analogue-to-digital converter (ADC) converts the signal into digital data by repeatedly measuring the changing voltage of the signal.
3. The numerical data is passed to a digital system and analysed, stored or manipulated.
4. The digital system creates a stream of output values, either from the stored values, manipulated values of the input or as a continuous real time stream of the input source.
5. A digital-to-analogue converter (DAC) converts the output of the digital system to variations in electrical voltage.
6. An acoustic output, such as a loudspeaker device, converts voltage changes to air pressure variations (audible sound).
Sampling is the method employed in converting analogue information to digital data. A good analogy to consider is how photography or images can mimic the perception of motion. For example, a flip book contains static drawings, each of which holds information at a location relative to the others in a sequence. When the viewer flicks through, this information creates a continuous animation rather than a set of static drawings. Digital audio sampling works in a similar way: it uses information taken at specific times to recreate an oscillating waveform.
An acoustic waveform is a continuous line of information. When viewed extremely close up there are no breaks in the waveform even when distortion or noise is present. This continuous line is represented numerically by a digital system at specified points in time. Consequently, the quality of the digital reading increases with the number of readings taken of the continuous analogue signal.
The digitisation of acoustic sound (see Diagram 2a) uses sampling (or discrete time sampling) to store data from an analogue waveform, shown in Diagram 2b. Discrete time sampling reads the continuously variable voltage of an analogue waveform, which represents its amplitude level, to create a digital representation of the waveform. Once a reading is taken, this value is stored and held until the next sample. This is known as sample-and-hold and is common in most digital audio systems. The effect can be seen in Diagram 2c.
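The stepped shape produced by sample-and-hold can be imitated in software. The sketch below is an illustration only: the function name `sample_and_hold`, the `hold_length` parameter and the example readings are all assumptions, and a real converter performs this step in hardware.

```python
def sample_and_hold(samples, hold_length):
    """Repeat each sample value hold_length times to form a staircase."""
    held = []
    for value in samples:
        # The reading is held unchanged until the next sample arrives.
        held.extend([value] * hold_length)
    return held


# Hypothetical readings taken from half a cycle of a wave.
readings = [0.0, 0.7, 1.0, 0.7, 0.0]
staircase = sample_and_hold(readings, 3)
print(staircase)
```

Each reading appears three times in a row in `staircase`, which is the flat-topped staircase that a sample-and-hold stage produces between sample points.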
The reading of these values occurs at a fixed rate set by a sample clock within the ADC. This rate is known as the sampling frequency, e.g. one reading every 0.25ms is expressed as 4000Hz (f = 1/t). This is also known as pulse code modulation (PCM) sampling.
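As a rough sketch of the relationship f = 1/t, the following Python converts a sampling interval into a sampling frequency and then takes readings of a sine wave at that rate. The 500Hz test tone and the function names are assumptions made for illustration.

```python
import math


def sampling_frequency(interval_seconds: float) -> float:
    """Return the sampling frequency in Hz for a given sampling interval (f = 1/t)."""
    return 1.0 / interval_seconds


def sample_sine(signal_hz: float, sample_rate_hz: float, count: int) -> list:
    """Read the amplitude of a sine wave at regularly spaced sample points."""
    return [
        math.sin(2 * math.pi * signal_hz * n / sample_rate_hz)
        for n in range(count)
    ]


rate = sampling_frequency(0.25e-3)     # one reading every 0.25 ms
samples = sample_sine(500.0, rate, 8)  # 500 Hz is a hypothetical test tone
print(round(rate), [round(s, 3) for s in samples])
```

At roughly 4000 readings per second, a 500Hz tone is read eight times per cycle, so the eight values in `samples` cover one full cycle of the wave.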
Diagram 2a shows an acoustic sine wave with sample points at regular intervals in time along the x-axis. The sample values shown in Diagram 2b create the digital representation of the waveform shown in Diagram 2c. A low-pass filter is used upon output to smooth the stepped edges of the new digital wave, which represent added high frequencies, with the resultant wave shown in Diagram 2d.
This discrete time sampling process defines Analogue to Digital Conversion and is the fundamental tool that embodies digital audio.
As can be seen in Diagram 2, the sampling frequency needs to be fast enough that the reproduced audio is visually and aurally continuous, and not a series of snippets of the original. Therefore the higher the sampling rate, the more accurate the digital reproduction can be. This is further achieved by adhering to the sampling theorem, as described in the box below.
Audio sampling is based on the Nyquist-Shannon sampling theorem, which concludes that for accurate signal reconstruction the sampling frequency should be at least twice the bandwidth of the source signal. Therefore, if a source signal has a bandwidth of 0 - 1kHz, then the sampling rate needs to be at least 2kHz.
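The rule stated by the theorem is simple enough to express directly. The following is a minimal sketch with hypothetical function names, treating the bandwidth as the highest frequency present in a signal that starts at 0Hz.

```python
def minimum_sampling_rate(bandwidth_hz: float) -> float:
    """Return the lowest sampling rate that satisfies the sampling theorem."""
    return 2 * bandwidth_hz


def satisfies_sampling_theorem(sample_rate_hz: float, bandwidth_hz: float) -> bool:
    """Check whether a sampling rate is at least twice the signal bandwidth."""
    return sample_rate_hz >= minimum_sampling_rate(bandwidth_hz)


print(minimum_sampling_rate(1_000))               # the 0 - 1kHz example
print(satisfies_sampling_theorem(1_500, 1_000))   # too slow for a 1kHz bandwidth
```

For the 0 - 1kHz example in the text, `minimum_sampling_rate(1_000)` returns 2000, and a rate of 1500Hz fails the check.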
It can be seen from the previous section that the quality of reproducing an analogue source signal is increased with a higher sampling rate. However, the question of which sample rate is most effective to preserve optimum quality and accuracy is dependent on a number of factors. With constant advances in technology, the storage of large amounts of data generated from high sample rate recordings is becoming less of a concern. On the other hand, when digitising audio from low quality analogue media, a high sample rate may not be necessary. If storage space is of no concern, and optimum quality is required, then it is advisable to digitise with the highest sample rate available. In practice, however, the choice is often a compromise.
It is considered that the maximum frequency audible to the human ear is approximately 20kHz. Therefore in audio systems the minimum sampling rate needs to be approximately 40kHz. For practical electronic design reasons and manufacturer standards, the most commonly used baseline sample rate is 44.1kHz. However, electronic systems in practice behave differently from the ideal theory, and there is a strong argument that optimum audio fidelity is often not being achieved. Studies have shown that methods of modern post-production can often utilise frequencies higher than the human audible range to benefit the audible content. As a result higher sample rates are being used more within pro-audio systems. 48kHz, 96kHz and 192kHz are becoming more commonly used as the CD is starting to be replaced by media capable of supporting high sample rates.
Even though we may not hear frequencies above the threshold of hearing, they can still modulate frequencies within the audible range, and thus affect the sound. High sample rates also keep the flaws of anti-aliasing filters (mentioned later), which arise from the imperfections of electronic systems in practice, from affecting high frequencies within the audible range.
It is considered that a sample rate of 44.1kHz is adequate for most work at a consumer level, but a higher rate of at least 48kHz, and 96kHz if attainable, would be appropriate for more accurate digitisation, especially when archiving material.
One of the main reasons the sampling theorem should be adhered to is to avoid the effects of aliasing. When the sample rate is less than twice the maximum frequency of the source material, a new alias frequency is created from the sample points (see Diagram 3 below) and is added to the signal. This effect is known as aliasing.
The problem this creates is that the lower alias frequency is in the audible range, and multiple frequencies above this threshold create multiple audible aliases.
Provided the input frequency is higher than half of the sampling frequency (and lower than the sampling frequency itself), the new alias frequency can be derived from:
Alias = sampling frequency - input frequency
For example, an input frequency of 26kHz sampled at a rate of 40kHz produces an alias frequency of 14kHz (40kHz - 26kHz).
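The alias calculation can be checked numerically. The sketch below applies the formula above and then compares the sample values of the input tone with those of its alias. The function names are hypothetical, and the formula is only applied when the input lies between half the sampling frequency and the sampling frequency, which is an assumption of this sketch.

```python
import math


def alias_frequency(input_hz: float, sample_rate_hz: float) -> float:
    """Alias = sampling frequency - input frequency.

    Only valid when the input lies between half the sampling
    frequency and the sampling frequency itself.
    """
    if not sample_rate_hz / 2 < input_hz < sample_rate_hz:
        raise ValueError("input must lie between fs/2 and fs for this formula")
    return sample_rate_hz - input_hz


def sample_cosine(signal_hz: float, sample_rate_hz: float, count: int) -> list:
    """Read the amplitude of a cosine wave at regularly spaced sample points."""
    return [
        math.cos(2 * math.pi * signal_hz * n / sample_rate_hz)
        for n in range(count)
    ]


alias = alias_frequency(26_000, 40_000)
original = sample_cosine(26_000, 40_000, 16)
aliased = sample_cosine(alias, 40_000, 16)

# The two sets of readings match, so the sampler cannot tell the tones apart.
indistinguishable = all(
    math.isclose(a, b, abs_tol=1e-9) for a, b in zip(original, aliased)
)
print(alias, indistinguishable)
```

Because the readings of the 26kHz tone and the 14kHz tone coincide at every sample point, the digital system has no way of knowing which of the two was actually present, and the lower one is what is reproduced.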
Diagram 3a shows the sample points on an input wave before analogue-to-digital conversion. As the input frequency is greater than half of the sampling frequency, the original input signal cannot be captured and a lower alias frequency is produced instead, as shown in Diagram 3b.
To reduce the effect of aliasing a low-pass filter is used at the input stage of an ADC that removes any frequencies higher than half the sample rate. This is referred to as an anti-aliasing filter, the effects of which are shown in Diagram 2.
Although a low-pass filter will theoretically remove aliasing, in practice problems occur due to the properties of electronic components that potentially degrade the quality of the signal once anti-aliased.
Quantisation errors occur during sampling because each reading must be stored as one of a limited number of values, unlike an analogue signal, which is continuously variable. Each sample is rounded to the nearest available value, which in turn deviates from the original source signal. As a result this produces audible noise, known as quantisation noise.
In digital systems, if the input signal is silent, then the result is digital silence: a sample of value 0. However, with a sinusoidal wave input the rounding of values creates an audible effect, the granulation effect, especially as the waves decay to zero. This noise is similar to the quiet tape hiss common on analogue magnetic tape.
The more accurate the sample reading compared with the input waveform, the smaller the quantisation error. The bit depth (see below) is the range of accuracy the sample can be read at.
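The rounding involved in quantisation can be sketched as follows. This is a simplified illustration, assuming samples normalised to the range -1.0 to 1.0 and evenly spaced levels; the function names are hypothetical and real converters differ in detail.

```python
def quantise(sample: float, bit_depth: int) -> float:
    """Round a sample in the range -1.0 to 1.0 to the nearest available level."""
    levels = 2 ** bit_depth
    step = 2.0 / levels
    index = round(sample / step)
    # Keep the index inside the range a signed integer of this width can hold,
    # so the very top of the range is limited to one step below 1.0.
    index = max(-(levels // 2), min(levels // 2 - 1, index))
    return index * step


def quantisation_error(sample: float, bit_depth: int) -> float:
    """Return the difference between the stored value and the true value."""
    return quantise(sample, bit_depth) - sample


for bits in (3, 8, 16):
    print(bits, quantise(0.3, bits), quantisation_error(0.3, bits))
```

With only 3 bits the value 0.3 is stored as 0.25, whereas at 16 bits the stored value differs from 0.3 by a far smaller amount, which is why a greater bit depth reduces quantisation noise.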
A common method of reducing quantisation noise is a technique called dither. Dither adds very low-level noise, which is a random signal, to the wanted audio signal prior to analogue-to-digital conversion. This noise forces the quantisation process to jump between adjacent levels at random. The audible effect is known as soft-landing, as the grittiness is converted into a soft pool of low-level noise which sounds far more natural to the human ear.
Dither is not generally considered necessary at higher bit depths, such as 20-bit and above, as the human ear cannot hear quantisation errors at this quality. However when converting to 16-bit and lower, dither becomes necessary to maintain high audible quality.
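The effect of dither on a very quiet signal can be demonstrated with a short simulation. This is a sketch under stated assumptions: the noise here is uniformly distributed over half a step either way, chosen for simplicity, and the step size, seed and function names are hypothetical.

```python
import random


def quantise_to_step(sample: float, step: float) -> float:
    """Round a sample to the nearest multiple of the quantisation step."""
    return round(sample / step) * step


def quantise_with_dither(sample: float, step: float, rng: random.Random) -> float:
    """Add random noise of up to half a step either way before rounding."""
    noise = rng.uniform(-0.5 * step, 0.5 * step)
    return quantise_to_step(sample + noise, step)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
step = 1.0              # hypothetical quantisation step
quiet = 0.25 * step     # a signal smaller than half a step

undithered = [quantise_to_step(quiet, step) for _ in range(20_000)]
dithered = [quantise_with_dither(quiet, step, rng) for _ in range(20_000)]

mean_undithered = sum(undithered) / len(undithered)
mean_dithered = sum(dithered) / len(dithered)
print(mean_undithered, mean_dithered)
```

Without dither the quiet signal is always rounded to zero and is lost entirely. With dither the individual readings jump at random between adjacent levels, but their average settles close to the true value, so the signal survives as low-level noise rather than disappearing.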
Digital distortion (Clipping)
When sampling analogue audio, care needs to be taken with the input gain level of the analogue source going into the digital system. Gain is the amplitude level of the signal that is fed into the ADC.
Imagine forcing a pie into a cylindrical pipe whose diameter is much smaller than the width of the pie. In order to force the pie into the pipe, the edges of the pie would break off, flattening each side. If you were to examine the pie as it came out of the other side of the pipe, it would be fair to say that it would no longer resemble the original pie.
Similarly, if you were to play an analogue audio file with a volume of, let's say, 11 into a digital system which has an input range of 1 - 10, every time the volume of the analogue file reached 11 the digital system would run out of available range to capture the required information. The result is that the signal gets clipped at the maximum level and is heard as distortion. This effect can be seen in Diagram 4, where the input signal's amplitude is increased half way along the time line.
Clipping occurs during the second half of the waveform, where the peaks and troughs of the wave are harshly flattened.
Digital clipping is not easy to undo. When a file is clipped the wave shape is altered and would need to be repaired back to the original form of the analogue file. If this data is lost through clipping then it can be very complicated to render it back to its original form. As shown in Diagram 4, an analogue input signal becomes clipped when sampled. The resultant waveform is shown with the distorted elements visible in the peaks and troughs of the wave.
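Clipping amounts to limiting every reading to the range the system can hold. The sketch below is illustrative only: the function names and the example signal are assumptions, and the default limits of -1.0 and 1.0 stand in for the input range of a real converter.

```python
def clip(sample: float, lowest: float, highest: float) -> float:
    """Limit a single reading to the range the system can represent."""
    return max(lowest, min(highest, sample))


def clip_signal(samples, lowest=-1.0, highest=1.0):
    """Apply clipping to every reading in a signal."""
    return [clip(s, lowest, highest) for s in samples]


def count_clipped(samples, lowest=-1.0, highest=1.0):
    """Count the readings that fall outside the available range."""
    return sum(1 for s in samples if s < lowest or s > highest)


# A hypothetical signal whose gain has been set too high.
loud = [0.2, 0.9, 1.4, 0.6, -1.3, -0.5]
print(clip_signal(loud))
print(count_clipped(loud))
```

The readings of 1.4 and -1.3 are stored as 1.0 and -1.0, and nothing in the stored data records how far beyond the limit they originally were, which is why clipping is so difficult to repair. Counting out-of-range readings before digitisation is one simple way to check whether the gain needs lowering.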
To avoid digital clipping, gain levels should always be set before digitisation so that the maximum output level of the analogue source is well within the limits of the input range of the ADC. Further information on setting gain levels in digital systems can be found in the advice document Audio Digitisation Workflow.
The bit-depth (or sample-width) determines the accuracy of the amplitude measurements per sample, i.e. the range of values available per sample.
Consequently, a higher bit-depth allows subtler changes in loudness to be captured and perceived. This is related to the dynamic range, which is a measurement of the ratio of the loudest available undistorted sound to the quietest available. 16-bit sampling is considered the standard baseline for consumer audio; however, 24-bit sampling offers a wider dynamic range and is becoming more commonly used in production and digitisation. As with the sample rate, a higher bit depth demands greater computational speed from the ADC. The maximum available will vary depending on the equipment in use.
Some systems use lower bit-depths where quality is less of an issue and there are smaller computational demands on the system in use. For example, audio in telephony uses 8-bit conversion, as there is no need for high audio fidelity and it keeps down the cost of the components used to process the audio.
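The number of amplitude levels available at a given bit depth, and the theoretical dynamic range that follows from it, can be sketched as below. The function names are hypothetical, and the decibel figure is the idealised ratio between the full range and a single step, not a measurement of any real equipment.

```python
import math


def amplitude_levels(bit_depth: int) -> int:
    """Return the number of distinct values available per sample."""
    return 2 ** bit_depth


def dynamic_range_db(bit_depth: int) -> float:
    """Return the idealised dynamic range in decibels for a bit depth."""
    return 20 * math.log10(amplitude_levels(bit_depth))


for bits in (8, 16, 24):
    print(bits, amplitude_levels(bits), round(dynamic_range_db(bits), 1))
```

Each additional bit doubles the number of levels and adds roughly 6dB to the idealised dynamic range, giving about 96dB at 16-bit and about 144dB at 24-bit.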
The accuracy of 2-bit quantisation (Diagram 5a), with four levels of amplitude resolution, compared to 5-bit quantisation (Diagram 5b), with thirty-two levels of amplitude resolution, shown against the input sine wave.
CD quality is a 44.1kHz sampling rate at 16-bit. This was the standard laid out in the 1980s and was cost effective and adequate for the time. For spoken word and most musical content 16-bit sampling is considered to be of adequate quality. However, as the standards and quality of digital equipment increase, higher quality through a greater dynamic range can be achieved with 24-bit processing.
Bit rate, which is commonly confused with bit depth, refers to the transfer of data, such as downloading audio files via the internet. It is often required to make an audio file size smaller (known as compression) for downloading or streaming due to the large size of uncompressed audio files and the time it takes to stream or download on slower, and sometimes even on fast connections.
The bit rate is the expression of the amount of information that is stored per second. It is dependent upon the type of codec used to create the file and which, if any, data compression scheme is applied.
A simple equation can be used to calculate the bit rate of a file when the following values are known.
Bit rate = (bit depth) x (sampling rate) x (number of channels)
For example, a 24-bit file sampled at 44.1kHz in stereo (2 channels) has a bit rate of 24 x 44100 x 2 = 2116800 bits/sec, or approximately 2117 kbits/sec.
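The equation above can be written as a short function for checking such figures. The function name `bit_rate` is hypothetical, and the calculation applies to uncompressed audio only.

```python
def bit_rate(bit_depth: int, sample_rate_hz: int, channels: int) -> int:
    """Bit rate = (bit depth) x (sampling rate) x (number of channels)."""
    return bit_depth * sample_rate_hz * channels


stereo_24_bit = bit_rate(24, 44_100, 2)
cd_quality = bit_rate(16, 44_100, 2)
print(stereo_24_bit, "bits/sec =", stereo_24_bit / 1000, "kbits/sec")
print(cd_quality, "bits/sec =", cd_quality / 1000, "kbits/sec")
```

The 24-bit stereo example works out at 2116800 bits/sec, while CD quality (16-bit, 44.1kHz, stereo) works out at 1411200 bits/sec.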
In practical terms most sound editing software has a function to convert a file to a different bit rate. The options are often presented as common standardised bit rates that differ in quality. For example, the mp3 codec compresses data at 5 common bit rates, which range from poor to good in quality and correspond to different file sizes.
Storage of digital files
Digital audio files can be highly taxing on storage space, especially when created with high sample rates and bit depth. It is advisable to always take into account the size of files you may be working with to optimise your available storage capacity. A larger file size will also impact on the processing demands of a digital system especially when conducting analysis and modification.
For example, the size of an uncompressed audio CD, with a 44.1kHz sample rate, 16 bits (2 bytes per sample) and 2 channels (stereo), lasting 45 minutes can be expressed as:
44100 x 2 x 2 x 45 x 60 = 476280000 bytes
Alternatively, the size of uncompressed audio stored on a digital multi-track recorder, with a 96kHz sample rate, 24 bits (3 bytes per sample) and 48 channels, lasting 45 minutes can be expressed as:
96000 x 3 x 48 x 45 x 60 = 37324800000 bytes
≈ 34.76GB
The storage implications of compressed audio file types are discussed in the advice document Audio File Types and Compression.
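The storage calculations above can be reproduced with a short function. This is a minimal sketch with hypothetical names; the conversion to gigabytes divides by 1024 three times, which is an assumption that matches the 34.76GB figure quoted above.

```python
def uncompressed_size_bytes(sample_rate_hz: int, bytes_per_sample: int,
                            channels: int, minutes: int) -> int:
    """Return the size in bytes of an uncompressed recording."""
    return sample_rate_hz * bytes_per_sample * channels * minutes * 60


def to_gigabytes(size_bytes: int) -> float:
    """Convert bytes to gigabytes, counting 1024 x 1024 x 1024 bytes per GB."""
    return size_bytes / 1024 ** 3


cd = uncompressed_size_bytes(44_100, 2, 2, 45)
multitrack = uncompressed_size_bytes(96_000, 3, 48, 45)
print(cd, "bytes")
print(multitrack, "bytes, about", round(to_gigabytes(multitrack), 2), "GB")
```

Changing any one argument shows how quickly the storage requirement grows: the sample rate, the bit depth, the channel count and the duration each multiply the total.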
Issues of quality and fidelity have long been topics of wide discussion with regard to digital audio. There are many arguments that technology still lags behind the theory of perfect digitisation, and many argue that digital sound has much harsher characteristics than analogue sound to the human ear. However, with the recent advances in modern technology and the need for long-term digital storage, digital audio processing has been at the forefront of creative and practical development, and has even given rise to its own art forms.
It is now considered that with equipment of a very high quality which is configured correctly, digital conversion of analogue audio is so transparent that all information from an analogue file can be preserved.
The most important component in the conversion process, in terms of the resultant quality of replication, is the ADC. It is important to note that even though manufacturers claim to have similarly specified equipment available, some ADCs greatly outperform others when it comes to the internal processes of the converters which capture the information. A general rule of thumb is that the more expensive the converter, the better the quality you will receive.
One thing that digital audio cannot replace, however, is the subjectivity of the human ear, the most critical tool when working with sound. So when questions arise with regard to fidelity and quality, aside from the standards of practice laid out, the most reliable tool is the trained ears of the listener.
Bandwidth
The range of available frequencies for a specified signal or system. For example, a source file containing frequencies from 100Hz to 15kHz has a 14.9kHz bandwidth between the two frequencies.
Decibel (dB)
The decibel is a logarithmic unit that expresses the ratio between two levels of sound intensity, and so relates sound intensity to the loudness perceived by the listener.
Pulse Code Modulation (PCM)
In terms of digital audio files, a linear PCM scheme stores a binary integer value for each sample.
Sample-and-hold
A technique used in electronics when interfacing with real world inputs. A sample is taken, most commonly with the voltage reading being stored within a capacitor. A switch disconnects the capacitor from the input, and thus the sampled value is held. At the next period of the sampling frequency the switch reconnects the capacitor, allowing the input value to be read again, and then disconnects it once more, repeating the process.
Sampling theorem
Often referred to as the Nyquist or the Nyquist-Shannon theorem. It originates from work and research done in 1928 and 1933 in telecommunications and information systems.