Sounds are part of the fabric of our lives. From a ringing phone to a bus applying its brakes as it pulls up to a bus stop, we use these sounds to explore and understand our environment. Machine learning (ML) and artificial intelligence (AI) are fast-growing fields with many applications, from image and fraud detection to weather prediction. In this article, we’re focusing on audio classification.

Most ML and AI algorithms are for images and text data — for example, facial recognition, chatbots, and text summarizers. But what about sounds?

Audio classification is the process of sorting audio files into categories according to shared features. An audio file is a digital representation of sound waves. For an AI or ML algorithm to use this file, it has to be able to interpret this representation.

Thus, before an ML algorithm can be trained and make predictions, it needs labeled audio files converted to a format it understands.

How Audio Classification Works

AI can learn from anything that has patterns. Sounds are waves that vibrate at particular frequencies, and you can visualize them as waveforms. Below is the waveform of an audio file.

To make sense of this waveform, look at the following image:

The image uses three terms:

Period is the time taken to complete one wave cycle, that is, the time between two consecutive peaks.
Frequency is the number of cycles that occur per second.
Amplitude is the distance between the resting (zero) level and the peak of a wave.

The higher the frequency is, the higher-pitched the sound is.
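To make the three terms concrete, here is a minimal Python sketch that builds a sine wave from a chosen frequency and amplitude and derives its period. The helper name sine_wave, the 440 Hz tone, and the 8,000 Hz sample rate are illustrative assumptions, not values from this article.

```python
import math

def sine_wave(frequency_hz, amplitude, duration_s, sample_rate=8000):
    """Return samples of amplitude * sin(2*pi*f*t), taken sample_rate times per second."""
    n_samples = round(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate)
            for n in range(n_samples)]

frequency_hz = 440.0               # cycles per second (assumed example tone)
period_s = 1.0 / frequency_hz      # time needed for one full cycle
wave = sine_wave(frequency_hz, amplitude=0.5, duration_s=0.01)

print(f"period: {period_s * 1000:.3f} ms")
print(f"largest sampled value: {max(wave):.3f} (amplitude is 0.5)")
```

Doubling frequency_hz halves the period, which is why higher-frequency sounds complete more cycles in the same time.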

How AI and ML Understand and Classify Audio

In its raw form, a sound is an analog signal. To process it, you must convert it to digital. Analog signals are continuous waves: an analog wave has a value at every point in time. It's impossible to store every value in a digital representation because there are infinitely many points in a continuous wave. Instead, you convert the audio signal to digital through sampling and quantization.

Sampling means recording the value of the wave at regular time intervals. The number of samples recorded per second is the sample rate. The higher the sample rate, the less information is lost and the fewer errors are introduced. However, if the sample rate is too high, it increases the audio file size without significantly improving the audio.

Quantization means rounding the amplitude sampled at each interval to the nearest value that can be represented with a fixed number of bits (the bit depth). The greater the bit depth, the lower the quantization error. A larger bit depth also takes more memory, so the bit depth to use depends on the memory and storage you have available. If you want to learn more, this is a good explanation of analog to digital conversion.
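The two conversion steps can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how any particular audio converter works: the functions sample and quantize, the 5 Hz stand-in for an analog signal, and the uniform rounding scheme are all hypothetical choices made for the example.

```python
import math

def sample(signal, duration_s, sample_rate):
    """Record the value of a continuous signal(t) at evenly spaced times."""
    n_samples = round(duration_s * sample_rate)
    return [signal(n / sample_rate) for n in range(n_samples)]

def quantize(samples, bit_depth):
    """Round each sample in [-1.0, 1.0] to the nearest of 2**bit_depth levels."""
    levels = 2 ** bit_depth
    step = 2.0 / (levels - 1)      # spacing between adjacent levels
    return [round((s + 1.0) / step) * step - 1.0 for s in samples]

def tone(t):
    # Hypothetical "analog" input: a 5 Hz sine wave defined for any time t.
    return math.sin(2 * math.pi * 5 * t)

samples = sample(tone, duration_s=1.0, sample_rate=100)
for bit_depth in (3, 8, 16):
    quantized = quantize(samples, bit_depth)
    max_error = max(abs(q - s) for q, s in zip(quantized, samples))
    print(f"{bit_depth:2d}-bit: max quantization error = {max_error:.6f}")
```

Running it shows the rounding error shrinking as the bit depth grows, at the cost of more bits per sample.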

After conversion, the next step is to extract useful features from the audio. You can extract many features, so you must focus on the specific problem you want to solve. Here are some critical features used for audio classification with ML:

Time domain features are extracted directly from the raw audio (waveform). Examples are the amplitude envelope and the root mean square energy. On their own, time domain features are not enough to represent the sound because they do not include frequency information.
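As a rough sketch, both example features can be computed frame by frame from the raw samples. The frame size of 256, the hop size of 128, and the fading test tone below are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math

def frames(samples, frame_size, hop_size):
    """Split samples into overlapping frames (a trailing partial frame is dropped)."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, hop_size)]

def amplitude_envelope(samples, frame_size=256, hop_size=128):
    """Maximum absolute amplitude in each frame."""
    return [max(abs(s) for s in frame)
            for frame in frames(samples, frame_size, hop_size)]

def rms_energy(samples, frame_size=256, hop_size=128):
    """Root mean square of the amplitudes in each frame."""
    return [math.sqrt(sum(s * s for s in frame) / len(frame))
            for frame in frames(samples, frame_size, hop_size)]

# Hypothetical test signal: a 100 Hz tone at 8 kHz that fades out over one second.
sample_rate = 8000
signal = [(1 - n / sample_rate) * math.sin(2 * math.pi * 100 * n / sample_rate)
          for n in range(sample_rate)]

envelope = amplitude_envelope(signal)
rms = rms_energy(signal)
print(len(envelope), "frames")
print("first/last envelope:", round(envelope[0], 3), round(envelope[-1], 3))
print("first/last RMS:", round(rms[0], 3), round(rms[-1], 3))
```

Both features fall as the tone fades, but neither says anything about which frequencies are present.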

Frequency domain features describe the spectrum of the sound. Examples include band energy ratio, spectral flux, and spectral bandwidth. These features lack time representation. You extract them from a time domain representation through a Fourier transform, which converts a waveform from a function of time to a function of frequency. Learn more about Fourier Transforms here.
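The sketch below shows the idea with a naive discrete Fourier transform written in plain Python. It is deliberately slow and only meant to illustrate the conversion; real projects typically use an FFT routine from a numerical library. The two-tone test signal, the 100 Hz split point, and the simple low-over-high definition of the band energy ratio are assumptions for the example.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each frequency bin up to the Nyquist frequency (naive O(n^2) DFT)."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def band_energy_ratio(magnitudes, split_bin):
    """Energy below split_bin divided by energy at or above it."""
    low = sum(m * m for m in magnitudes[:split_bin])
    high = sum(m * m for m in magnitudes[split_bin:])
    return low / high if high else float("inf")

sample_rate = 1000
n = 200                            # 0.2 s of audio, so each bin is 5 Hz wide
signal = [math.sin(2 * math.pi * 50 * t / sample_rate)
          + 0.5 * math.sin(2 * math.pi * 200 * t / sample_rate)
          for t in range(n)]

mags = dft_magnitudes(signal)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
peak_hz = peak_bin * sample_rate / n
ratio = band_energy_ratio(mags, split_bin=20)   # bin 20 corresponds to 100 Hz
print("dominant frequency:", peak_hz, "Hz")
print("band energy ratio (below vs. above 100 Hz):", round(ratio, 2))
```

The output identifies which frequencies carry the energy, but not when they occur, which is the limitation described above.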

Time and frequency domain features come from a spectrogram, which you compute from the waveform using a short-time Fourier transform. Examples are the spectrogram and the Mel spectrogram. A spectrogram shows how the frequency content of a sound changes over time, so it covers both domains. Spectrograms represent frequency linearly, but humans perceive pitch roughly logarithmically. This means that humans can easily tell the difference between lower frequencies, such as 500 Hertz (Hz) and 1,000 Hz, but find it harder to differentiate between sounds at higher frequencies, such as 10,000 Hz and 15,000 Hz. Learn more about how humans respond to various sound ranges here. This difficulty is why the Mel scale was introduced. A Mel spectrogram is a spectrogram on the Mel scale, which measures how different pitches sound to the listener: pitches that are equally spaced on the Mel scale sound equally far apart to listeners.
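A short sketch of the Mel scale itself may help here. The article does not give a formula, so the code assumes one widely used variant, mel = 2595 * log10(1 + f / 700); other variants exist, and the function names are hypothetical.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert Hz to Mels using the common formula 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

low_step = hz_to_mel(1000) - hz_to_mel(500)        # a 500 Hz step at low frequencies
high_step = hz_to_mel(15000) - hz_to_mel(10000)    # a 5,000 Hz step at high frequencies

print(f"500 -> 1,000 Hz: {low_step:.0f} Mels ({low_step / 500:.3f} Mels per Hz)")
print(f"10,000 -> 15,000 Hz: {high_step:.0f} Mels ({high_step / 5000:.3f} Mels per Hz)")
```

Under this formula, each Hertz at low frequencies covers far more of the Mel scale than each Hertz at high frequencies, which mirrors how listeners perceive pitch differences.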

In a traditional ML approach, these features are combined as the input for the ML model. Features from the time and frequency domains, such as the amplitude envelope, root mean square energy, and band energy ratio, are extracted and used to train an ML model. Speech recognition for smart home security is one application for this.

With a deep learning approach, feature extraction is automatic. You feed the spectrograms (as images) into a neural network, which learns from the patterns and corresponding labels to make predictions.

A deep learning task that distinguishes a crying baby from a barking dog might go through a process like this:

Gather many audio samples of a crying baby and a barking dog and label them accordingly.
Check whether each sample is mono or stereo, and ensure all samples have the same number of channels.
Resample the audio to a common sample rate, such as 22,050 Hz.
Convert the audio samples to Mel spectrograms. Optionally, derive Mel-frequency cepstral coefficients (MFCCs) from them.
Feed these spectrograms into the neural network and train it.
Make predictions using the trained model.
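The channel and resampling steps above can be sketched as follows. This is a simplified illustration: the helpers to_mono, resample_linear, and preprocess are hypothetical, the 22,050 Hz target is an assumed common choice, and linear interpolation skips the anti-aliasing filter that audio libraries apply when resampling.

```python
def to_mono(channels):
    """Average equal-length channels (lists of samples) into one mono channel."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]

def resample_linear(samples, source_rate, target_rate):
    """Resample by linear interpolation (no anti-aliasing filter; illustration only)."""
    n_out = len(samples) * target_rate // source_rate
    out = []
    for i in range(n_out):
        position = i * source_rate / target_rate   # fractional index into samples
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        weight = position - left
        out.append((1 - weight) * samples[left] + weight * samples[right])
    return out

def preprocess(channels, source_rate, target_rate=22050):
    """Force mono, then bring the clip to the shared target sample rate."""
    return resample_linear(to_mono(channels), source_rate, target_rate)

# Hypothetical one-second stereo clip recorded at 44,100 Hz.
left = [0.5] * 44100
right = [-0.1] * 44100
clip = preprocess([left, right], source_rate=44100)
print(len(clip), "samples after preprocessing")
```

After this step every clip has one channel and the same sample rate, so the spectrograms computed from them have comparable shapes.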

Types of ML Audio Classification Tasks

Here are some ML audio classification tasks:

Acoustic event detection processes sounds to identify the events that caused them, like a baby crying or a car horn. For example, Alexa has a home guard feature that sends notifications to users about possible break-ins or smoke alarms.
Environmental sound classification studies an environment and identifies the different activities going on in that environment. An example is detecting sounds made by animals to classify wildlife in an area.
Music classification automatically organizes tracks, which lets apps like Spotify and YouTube Music create playlists and recommendations.
Natural language classification detects language and sentiment in audio. You can use it to communicate with people who speak different languages. A good use case for this exists in virtual assistants like Siri and Alexa.

Language detection is much more complex than detecting “energy,” “loudness,” or “liveliness” in music. The basic features of audio that we’ve discussed are not enough to detect and interpret language.

Natural language processing (NLP) solves this problem by understanding the context in which a particular word occurs. The audio is first converted to text and preprocessed to extract features like local expressions, vocabulary, and meaning. Then a response is generated and converted back to speech. AI must also understand speech in various environments, for instance, sound occurring on a bus, in a market, at an office, or at a party. This is where audio classification and NLP relate and overlap.

Potential Use Cases for Audio Classification

Here are some use cases for audio classification:

Natural language tone detection identifies the emotion of speakers from vocal sounds at certain pitches and frequencies. You can do this by using an FFT spectrogram to extract the pitch period and matching it to the corresponding tone.
Noise-canceling is a feature of many video conferencing and voiceover applications where you must reduce background noise.
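To illustrate what extracting a pitch period involves for tone detection, here is a minimal sketch. Note that it swaps in a different technique from the FFT spectrogram mentioned above: it uses time domain autocorrelation, which is a simple alternative for clean signals. The function name, the 50 to 500 Hz search range, and the synthetic voice-like signal are assumptions for the example.

```python
import math

def estimate_pitch_hz(samples, sample_rate, min_hz=50, max_hz=500):
    """Estimate the fundamental frequency from the lag with the highest autocorrelation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    window = len(samples) - max_lag    # compare the same number of samples at every lag
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(window))
        # Small margin: equal peaks at multiples of the period must not
        # replace the shortest matching lag.
        if score > best_score + 1e-6:
            best_lag, best_score = lag, score
    return sample_rate / best_lag      # best_lag is the pitch period in samples

sample_rate = 8000
# Hypothetical voice-like signal: a 200 Hz fundamental plus a weaker 400 Hz harmonic.
voice_like = [math.sin(2 * math.pi * 200 * n / sample_rate)
              + 0.3 * math.sin(2 * math.pi * 400 * n / sample_rate)
              for n in range(960)]

pitch_hz = estimate_pitch_hz(voice_like, sample_rate)
print("estimated pitch:", pitch_hz, "Hz")
```

A tone detector would track this estimate over time and feed the pitch contour, along with other features, into a classifier. Real speech is noisier than this synthetic signal, so practical systems add voicing detection and smoothing.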

Conclusion

Audio classification is a fast-growing field with applications in many areas. Leveraging ML and AI has broken barriers in speech translation, improved customer service, and helped to monitor and understand environments. With deep learning, you no longer have to manually extract the features you think best represent the audio. Current audio applications automatically extract useful features.

Team Symbl

The writing team at Symbl.ai

Machine Learning: Crash Course In Audio Classification