Sound Synthesis Part 1: The Basics

This is the first part of a long series of tutorials covering sound synthesis.

I know that at the time of writing this post there is already a multitude of tutorials on the net,
so you might ask yourself: why the hell another one?
Well, as I said, there’s tons of information out there – the problem is it’s scattered all over the net. You’ll find yourself spending a huge amount of precious time researching.
Site A tells you something important, but to understand the details you need to find site B, which also gives you information you won’t understand until you find site C.
You get the point. 😉

Prerequisites:
The theory should be understandable to anyone. For the programming parts you will need some basic programming knowledge or at least know how to copy & paste the code into your IDE of choice. Though Flash ain’t as popular as it used to be, I decided to use the ActionScript 3.0 programming language. Initially I considered using the OpenFL software framework, which uses Haxe and is pretty popular because it can publish to different platforms (Flash, Android, Windows, HTML5, ...) with a single codebase.
Unfortunately, after some testing I realized that the audio stuff doesn’t really work cross-platform. It only works flawlessly when targeting Flash or Windows, so there was no real benefit. In the future I might add Haxe code.
My recommended IDE is FlashDevelop using the Adobe Flex SDK.

Let’s dive in I’d say!

I’m sure all of you have seen or at least heard of a tuning fork before. Simply put, it’s a piece of metal which will start to vibrate as soon as you strike it against something solid and ultimately produce a sound. Its vibration creates alternating regions of low and high pressure by affecting nearby air molecules. As a matter of fact, sound needs a medium to travel through. Air is great, water works as well. If there is no medium, like in space for example, there is no sound at all.
Anyway, as soon as the pressure changes caused by the vibrating fork reach your ear, it converts them into nerve impulses which are interpreted as sound by your brain.

A tuning fork

Of course the frequency at which the fork vibrates, and thus the tone you hear, isn’t coincidental. Most tuning forks are pitched to sound the note A @ 440 Hz.
What does that mean? Hz, an abbreviation for hertz, is the unit of frequency, and a frequency of 440 Hz tells us that the prongs of the tuning fork are vibrating back and forth 440 times a second.
If you picked up the sound using a microphone and sent it to an oscilloscope, you would see the following:

sineWave

Oscilloscope portraying a sine wave

What you can see above is actually one cycle of a sine wave. It sweeps between -x and +x periodically, and this happens 440 times a second to produce the note A.
If you have never heard a tuning fork, though, you might have trouble imagining what a sine wave sounds like.
Here’s a sample:

By the way, what do you think your speakers/headphones are doing right now? Yes, the movable part, the diaphragm, is moving back and forth 440 times a second just like the tuning fork … moving air molecules … reaching your ear, and your brain turns it into something you possibly know: the note A. Furthermore, if you could wave your arm back and forth fast enough, that’s 440 times a second, besides causing a lot of wind you could hear the note A too.

Well, though the sound above is synthesized, your browser is just playing back an audio file which consists of a long sequence of numbers. This sequence describes the shape of the sine wave and of course isn’t coincidental either. Let’s take a closer look. The horizontal axis corresponds to time, the vertical one to amplitude.
detailedSineWave

Consider the following numbers:
A=0 ; B=-0.5 ; C=-1 ; D=-0.5 ; E=0 ; F=0.5 ; G=1 ; H=0.5 ; I=0
Those are samples we’re taking from our sine wave at specific points in time, each matching the distance from the x-axis. If we use those values to draw a graph, it roughly looks like our original sine wave.

detailedSineWave2

If you compare the original sine wave in red, going from B to C, with the green line, you can see it’s different. It’s lacking accuracy.
By the way, don’t be fooled. There isn’t really a green line going from one point to another. In our digital representation of the sine wave anything in between is lost and could only be interpolated. I might write about interpolation in a later part of this series.

Anyway, we need values in between B and C! We need more numbers! How many numbers do we need? The answer is: a lot! The sine wave itself is a continuous signal, also called an analog signal, which means there’s an infinite number of points. Luckily we don’t need that many. This is where something called the samplerate comes into play. It simply states how many samples, or rather at which intervals, we’re taking samples from our original analog signal. According to the Nyquist–Shannon sampling theorem, the samplerate needs to be at least twice the highest analog frequency component. This information alone isn’t helpful. You need to know that humans are able to hear frequencies up to ~20000 Hz or 20 kHz – thus the common samplerate of 44100 samples per second was born. I know you wonder why it ain’t 40000 if our ears are limited to 20000 Hz. That’s another story beyond the scope of this tutorial.
Let’s take a look at our previous picture once more and calculate the samplerate we’ve used. If you remember, a single cycle of a sine wave repeats 440 times a second to form the note A, and we’re taking 9 samples (A to I) per cycle. 9 × 440 = 3960 samples a second, huh? That’s far from 44100, but does a higher samplerate really mean better accuracy?
Let’s double the samplerate! Now we’re sampling at 7920 samples a second.

detailedSineWave3

Still not perfect, but if you compare it to our previous image it looks a lot better, and we proved that more samples indeed mean better accuracy.

That was a lot of theory and I can almost hear you saying: “How can this help me synthesize sounds?”
It’s actually pretty simple. With the above information we know that we just need to generate a sine wave oscillating at a specific frequency and feed it to the computer’s sound card.
Almost all programming languages offer a function to compute the sine of an angle, and so does AS3.
Test the following code:
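A minimal sketch of such a test might look like this (the loop bounds and variable names are assumptions of mine):

```actionscript
// Trace one full cycle of a sine wave, one value per degree.
for (var degrees:int = 0; degrees <= 360; degrees++)
{
    // Math.sin() works with radians, so convert degrees first.
    var radians:Number = degrees * Math.PI / 180;
    trace(Math.sin(radians));
}
```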

AS3’s Math.sin() function expects the input angle to be in radians, so we’re converting degrees 0-360 to radians by multiplying them by PI/180.
If you look at your debug panel you’ll see a lot of numbers ranging from -1 to 1. Yeah, even if you have a lot of imagination I doubt you can see a sine wave right now.
Let’s make a better example involving the BitmapData class.
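A sketch of what such a document class could look like (the class name SineDrawer and the canvas size are assumptions; drawSine() is the function name the post refers to further below):

```actionscript
package
{
    import flash.display.Bitmap;
    import flash.display.BitmapData;
    import flash.display.Sprite;

    public class SineDrawer extends Sprite
    {
        // 360 pixels wide: one pixel per degree of the sine wave.
        private var canvas:BitmapData = new BitmapData(360, 100, false, 0xFFFFFF);

        public function SineDrawer()
        {
            addChild(new Bitmap(canvas));
            drawSine();
        }

        private function drawSine():void
        {
            for (var i:int = 0; i < 360; i++)
            {
                // Sine value between -1 and +1 ...
                var sample:Number = Math.sin(i * Math.PI / 180);
                // ... mapped to a vertical pixel position (50 is the center line).
                canvas.setPixel(i, 50 - Math.round(sample * 45), 0x000000);
            }
        }
    }
}
```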

If you run the code you should see the following image:
as3Sine

Wow! That looks familiar! This is what we want! At least it’s kind of what we want. By feeding the values 0-360 to the Math.sin function we get a complete cycle of a sine wave – but at what frequency? A samplerate of 44100 is equivalent to 1 second of audio. If we divide the sampleRate by 360 we get 122.5, which is roughly the note B (123.47 Hz).
Replace the drawSine() function with the following:
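A sketch of what the replacement could look like, assuming the SineDrawer class above; the frequency variable is the knob we’ll be playing with below:

```actionscript
private function drawSine():void
{
    var sampleRate:int = 44100;
    var frequency:Number = 440;

    for (var i:int = 0; i < canvas.width; i++)
    {
        // Each pixel now represents one sample taken at 44100 samples per second.
        var sample:Number = Math.sin(2 * Math.PI * frequency * i / sampleRate);
        canvas.setPixel(i, 50 - Math.round(sample * 45), 0x000000);
    }
}
```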

Voila! Now we’ve got a perfect 440 Hz sine wave.

as3Sine440

440 Hz

What about some experiments now? Try setting the frequency to 880 and re-run.
Hm, it looks kinda different now. Let’s compare the two images.

as3Sine880A

880 Hz

If you count the peaks in both images you’ll notice that there are twice as many. Congratulations! You just discovered pitch! If we go back to our tuning fork, it would vibrate 880 times a second now but still sound the note A, shifted up one octave. Likewise, if you set the frequency to 220, you’ll end up with an image containing half the peaks.

as3Sine220A

220 Hz

Pitch is a topic I’ll cover in another part of the series however. For the moment keep in mind that we can produce different notes by changing the frequency.

Looking at pictures is pretty interesting for sure but a tutorial about sound should eventually utilize your speakers, right? How can we synthesize a sound using all of this information?

AS3 provides a powerful event which lets us generate sound on the fly: the SampleDataEvent, which we listen for on a Sound object. If we do, the Sound object periodically requests data in chunks of a specific size. This size is between 2048 and 8192 samples and is called the buffer. Remember the samplerate? This is getting important now. As you know, there should be at least 44100 samples per second. This is the standard for CD quality audio. There are other common samplerates though. If you’re watching digital TV via satellite for example, the samplerate is 48000, while Blu-ray movies might contain audio tracks utilizing a samplerate of 96000 or 192000. Well, we’re using ActionScript and the Flash Player has a fixed samplerate of 44100.
Anyway, say we set the buffer to 8192 – what does that mean? We’re feeding around 186 milliseconds of audio data to the Sound object.

44100 samples per second == 1 second of audio (1000 milliseconds)
8192 / 44100 * 1000 = 185.759637188209 milliseconds

Every 186 milliseconds we need to provide new samples. The reason we can set this buffer to as low as 2048 samples is another phenomenon called latency. To get a better understanding, picture this: you’re playing the note A on a flute, quickly releasing a finger to play the note B. You wouldn’t hear a change in sound until 186 milliseconds have passed. Even for our brains this delay is easily noticeable. If we change the buffer to 2048 samples, the delay would be around 46 milliseconds. So why don’t we hardcode the buffer to 2048? If we do, the Sound object requests new data roughly 21 times a second. Depending on factors like the computer the program is running on and CPU usage, this might cause clicks and pops in your sound because it can’t keep up. So it’s safer to set it to a higher value. For this tutorial the latency isn’t important, so we can safely use a buffer of 8192.
Now that we know that we need to provide 8192 samples to the Sound object – in which form do we do this? It’s expecting a ByteArray, which essentially is a collection of data arranged in bytes.
Let’s compare a plain Array to a ByteArray.
Using AS3 we can create an Array like this:
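For example (the variable name animals is an assumption; the contents match the trace output below):

```actionscript
var animals:Array = new Array("cats", "dogs", "birds");
```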

If we want to trace the whole content of this array, we simply write:
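Assuming the array from above is called animals, that’s simply:

```actionscript
trace(animals);
```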

and get the following output in the debug panel:
cats,dogs,birds

Let’s do the same thing using a ByteArray.
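A sketch using ByteArray.writeUTFBytes(), which appends the raw bytes of each string and reproduces the output shown below:

```actionscript
import flash.utils.ByteArray;

var ba:ByteArray = new ByteArray();
ba.writeUTFBytes("cats");
ba.writeUTFBytes("dogs");
ba.writeUTFBytes("birds");
```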

If we trace its contents again we’re getting
catsdogsbirds
this time.

Looks like it’s working. Our Sound object isn’t expecting animals though, it wants numbers – more precisely decimals ranging from -1.0 to +1.0. Those values represent the amplitude of a sample. We’re not finished yet, I’m afraid. Because music is usually in stereo (left and right), it needs a decimal for the left channel and another decimal for the right channel.
The even-numbered samples refer to the left channel while the odd-numbered samples refer to the right channel. This is called interleaving. In our example above, “cats” would be left, “dogs” right and “birds” left again.
We’ll create another ByteArray containing numbers now.
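For example, writing a few sample values as 32-bit floats (the exact numbers used in the original are unknown; these are placeholders):

```actionscript
var samples:ByteArray = new ByteArray();
samples.writeFloat(0.5);
samples.writeFloat(-0.25);
samples.writeFloat(0.75);
```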

If we, again, trace its contents, we’ll see something odd now:
?Ăł\(Ă”Â\?Ì(Ă”Â\)ÂżĂŸĂĄGÂźzĂĄ?Ă 

That’s because it’s binary data and every byte is treated as a character in the current code set.
Don’t worry about this. We won’t get into detail. The data is still there, believe me!
Try this:
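A sketch of what that could be – rewinding the ByteArray and reading the values back (assuming the ByteArray above is called samples):

```actionscript
// Rewind to the beginning of the ByteArray ...
samples.position = 0;

// ... and read the values back as 32-bit floats.
while (samples.bytesAvailable > 0)
{
    trace(samples.readFloat());
}
```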

Now that we know a little bit more about ByteArrays, let’s put it all together and finally generate some sound!
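A minimal sketch of such a document class follows. Names like SineSynth, SAMPLE_RATE or onSampleData are assumptions of mine, and the line numbers discussed below refer to the author’s original listing, which this sketch only approximates.

```actionscript
package
{
    import flash.display.Sprite;
    import flash.events.SampleDataEvent;
    import flash.media.Sound;

    public class SineSynth extends Sprite
    {
        // Basic parameters: samplerate, buffer size, frequency and volume.
        private const SAMPLE_RATE:int = 44100;
        private const BUFFER_SIZE:int = 8192;
        private var frequency:Number = 440;
        private var volume:Number = 0.1;

        private var phase:Number = 0;
        private var sound:Sound;

        public function SineSynth()
        {
            // Create the Sound object, listen for sample requests and start playback.
            sound = new Sound();
            sound.addEventListener(SampleDataEvent.SAMPLE_DATA, onSampleData);
            sound.play();
        }

        private function onSampleData(e:SampleDataEvent):void
        {
            for (var i:int = 0; i < BUFFER_SIZE; i++)
            {
                // Current amplitude, scaled down to a tenth of the maximum.
                var sample:Number = Math.sin(phase) * volume;

                // Advance the phase: 440 full cycles (Math.PI * 2) per 44100 samples.
                phase += Math.PI * 2 * frequency / SAMPLE_RATE;
                if (phase > Math.PI * 2)
                {
                    phase -= Math.PI * 2;
                }

                // Write the same sample for the left and the right channel.
                e.data.writeFloat(sample);
                e.data.writeFloat(sample);
            }
        }
    }
}
```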

 

Sweet! That wasn’t too hard, was it?

Lines 10-14
Define some basic parameters like the samplerate and the buffer we’ve talked about.

Lines 28-30
Here we’re instantiating a new Sound object, adding a SampleDataEvent listener to it and, most importantly, calling the play() function of the Sound object. If we don’t, it won’t start requesting sample data!

Line 38
This line calculates the sine wave. The variable phase stores where in the current cycle of the oscillation we are. You might have noticed that there is an additional variable called volume. Remember the Sound object expects decimals between -1.0 and +1.0 as the amplitude? +/-1 would be the maximum, and by multiplying the sample by 0.1 we’re actually reducing the amplitude of our sine wave to a tenth, which we perceive as a change in volume.
reduceVolume

Line 39
This might be the most important line of all. Math.PI * 2 refers to a full cycle of a sine wave. If we want to hear the note A at 440 Hz, we know that the sine wave needs to oscillate 440 times a second. That means we need to multiply Math.PI * 2 by 440. If this value is then divided by the samplerate, we have the amount by which we need to increment phase for each sample.

Lines 40-43
If the value of phase exceeds Math.PI * 2 (~6.2832) we finished a complete cycle of our sine wave.

Lines 44-45
Yeah, you see double but it ain’t a typo. As I said, the samples inside the ByteArray are referring to the left and right channels. Here we’re writing the same sample twice for the left/right channel, resulting in a mono audio signal. If you change one of these lines to:
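For example, writing silence for one of the two channels (this assumes the sketch above, where e.data is the ByteArray provided by the event):

```actionscript
e.data.writeFloat(0);
```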

you’ll notice that the sound is just coming from the left/right speaker respectively.

See you in the next part of this series!
