AUDIO CLASSIFICATION FOR COUGH DETECTION
Description
This blog summarizes the project for our course IEE-03 (ANN), developed by me and my friend Keshav. The project aims to build a device that classifies an audio clip as either cough or normal noise, to be used for remote monitoring of patients suffering from COPD, where cough frequency is an important diagnostic parameter.
Content
Here we describe the model we used to achieve the above objective. The blog is divided into the following sections -
- Preparing the Dataset
- Preprocessing Audio Files using MFCC
- Classification of Data using a CNN-based classifier
Preparing the Dataset
We aggregated data from various sources such as Kaggle, Freesound.org, and GitHub. To maintain uniformity, we split each recording into 8-second segments with a small script and added the label to the file name. The final prepared dataset is available in our GitHub repository.
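As a rough sketch of how this segmentation could be scripted (assuming pydub and ffmpeg are available; the paths, file names, and labelling convention below are illustrative, not our exact script):

```python
import os
from pydub import AudioSegment  # pip install pydub (needs ffmpeg for .mp3 input)

SEGMENT_MS = 8 * 1000  # 8-second segments

def split_into_segments(in_path, out_dir, label):
    """Cut one recording into fixed 8 s chunks and embed the label in the file name."""
    audio = AudioSegment.from_file(in_path)
    base = os.path.splitext(os.path.basename(in_path))[0]
    # Step through the clip in 8 s strides; any trailing partial chunk is dropped.
    for i, start in enumerate(range(0, len(audio) - SEGMENT_MS + 1, SEGMENT_MS)):
        chunk = audio[start:start + SEGMENT_MS]
        chunk.export(os.path.join(out_dir, f"{label}_{base}_{i}.wav"), format="wav")

# e.g. split_into_segments("raw/cough_001.mp3", "dataset/", label="cough")
```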
Preprocessing
Preprocessing refers to converting the data from its raw format (.mp3 files here) into usable features for our classifier. Audio files are recorded by a microphone at a fixed sampling frequency and store the loudness, or energy, of the sound wave at each instant of time, so in the time domain a file looks something like this.
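For illustration, here is a minimal snippet that loads one clip with librosa and plots its time-domain waveform (the file name is a placeholder):

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Load one 8 s clip at a fixed sampling rate and plot amplitude against time.
y, sr = librosa.load("dataset/cough_001_0.wav", sr=22050)
t = np.arange(len(y)) / sr
plt.plot(t, y)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
```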
One way to analyze the file is purely along the temporal dimension, but that is not sufficient to identify the category of sound, so we also need the frequency-domain view of the data. To convert the file from the time domain to the frequency domain we use the Fourier Transform, after which the file looks like this.
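A minimal sketch of this conversion using NumPy's FFT (again, the file name is just a placeholder):

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("dataset/cough_001_0.wav", sr=22050)
spectrum = np.abs(np.fft.rfft(y))            # magnitude of the one-sided FFT
freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)  # corresponding frequency bins in Hz
plt.plot(freqs, spectrum)
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.show()
```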
Either representation used on its own hides information that the other carries, so we need a way to use both simultaneously, and spectrograms are the tool for that.
A Spectrogram is a visual way of representing the signal strength, or “loudness”, of a signal over time at various frequencies present in a particular waveform. Not only can one see whether there is more or less energy at, for example, 2 Hz vs 10 Hz, but one can also see how energy levels vary over time.
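A spectrogram is obtained from a short-time Fourier transform; a minimal sketch with librosa (the window and hop sizes are illustrative defaults, not necessarily the values we used) could look like this:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("dataset/cough_001_0.wav", sr=22050)   # placeholder file name
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2  # power spectrogram
S_db = librosa.power_to_db(S, ref=np.max)                     # log scale for display
librosa.display.specshow(S_db, sr=sr, hop_length=512, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram")
plt.show()
```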
Now, humans perceive sound logarithmically while frequencies are measured linearly, so we need a logarithmic scale, called the mel scale. An empirical formula, commonly m = 2595 · log10(1 + f/700), is used to map ordinary frequency bands onto mel frequency bands.
Hence we build filter banks, i.e. we select frequency bands and associate each band with the mel value of its center frequency, as sketched below.
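A small sketch of the mel mapping and of a librosa-built mel filter bank, assuming the common 2595 · log10(1 + f/700) form of the formula and illustrative parameter choices (40 filters, 22,050 Hz sampling rate):

```python
import numpy as np
import librosa

def hz_to_mel(f):
    """Empirical Hz -> mel mapping (the common 2595*log10(1 + f/700) variant)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hz_to_mel(1000))   # ~1000 mel by construction
print(hz_to_mel(4000))   # ~2146 mel: higher frequencies get compressed

# Triangular mel filter bank: 40 filters spread evenly on the mel scale.
mel_fb = librosa.filters.mel(sr=22050, n_fft=2048, n_mels=40)
print(mel_fb.shape)      # (40, 1025): one row of weights per mel band
```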
Now we move on to the Mel Frequency Cepstral Coefficients. Mel-frequency cepstral coefficients (MFCCs) are the coefficients that collectively make up a mel-frequency cepstrum (MFC). They are derived from a type of cepstral representation of the audio clip (a nonlinear “spectrum-of-a-spectrum”).
What we want here is to extract the vocal envelope of the speech. Speech is made up of two parts: one component corresponds to the roughly constant, high-frequency glottal pulse, and the other is the energy shaped into the envelope by the vocal tract changing its shape to produce different sounds.
Looking at the workflow, we first perform a discrete Fourier transform on a short window and compute its power spectrum, then take the logarithm of the power spectrum. The logarithm is what lets us separate the high-frequency glottal pulse from the envelope of the speech.
After converting the frequencies to mel frequencies we get the power in each mel bin, and we then apply a discrete cosine transform. Since the lower coefficients carry the information of interest, we keep only the first few cepstral coefficients, producing a spectrogram of Mel Frequency Cepstral Coefficients.
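Putting the steps together, here is a minimal MFCC sketch, first step by step and then as the librosa one-liner (parameters such as the 2048-sample window and 13 coefficients are illustrative, not necessarily the exact values we used):

```python
import numpy as np
import librosa
from scipy.fft import dct

y, sr = librosa.load("dataset/cough_001_0.wav", sr=22050)   # placeholder file name

# 1) Short-time Fourier transform on overlapping windows, then the power spectrum.
power = np.abs(librosa.stft(y, n_fft=2048, hop_length=512)) ** 2

# 2) Pool the power spectrum into mel bands and take the logarithm.
mel_fb = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=40)
log_mel = np.log(mel_fb @ power + 1e-10)

# 3) Discrete cosine transform along the mel axis; keep the first 13 coefficients.
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13]
print(mfcc.shape)        # (13, number_of_frames)

# Roughly equivalent one-liner with librosa (internal scaling details differ slightly):
mfcc_lib = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)
```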
After this we get something like the image below (the y-axis is the MFCC index, the x-axis is the frame number, and the pixel value is the corresponding power).
For deeper insight into audio processing, watch this awesome YouTube playlist — Audio Signal Processing for Machine Learning
Classification of Data
For the binary classification problem, we used a CNN-based network with the Adam optimizer and the binary cross-entropy loss function.
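We do not reproduce the exact architecture here, but a minimal Keras sketch of such a CNN classifier, with an illustrative input shape (13 coefficients × 345 frames) and illustrative layer sizes, could look like this:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# A minimal CNN sketch for binary classification of MFCC "images".
# Input shape and layer sizes are assumptions, not the exact trained architecture.
model = models.Sequential([
    layers.Input(shape=(13, 345, 1)),
    layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # cough vs. normal noise
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.003),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```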
Hyperparameter Tuning
To tune the hyperparameters, we used TensorBoard to plot epoch loss and epoch accuracy for various learning rates and activation functions, and found the best results with ReLU activation and lr = 0.003.
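A sketch of how such a sweep can be logged to TensorBoard (build_model, x_train, y_train, x_val, y_val are placeholders for the project's own code, and the grid of values is illustrative):

```python
import tensorflow as tf

# One TensorBoard log directory per (learning rate, activation) pair,
# so the epoch loss/accuracy curves can be compared side by side.
for lr in [0.01, 0.003, 0.001]:
    for activation in ["relu", "tanh"]:
        model = build_model(activation=activation)   # hypothetical helper
        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
            loss="binary_crossentropy",
            metrics=["accuracy"],
        )
        tb = tf.keras.callbacks.TensorBoard(log_dir=f"logs/lr{lr}_{activation}")
        model.fit(x_train, y_train, validation_data=(x_val, y_val),
                  epochs=30, callbacks=[tb])
```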
Results
For the trained model, we achieved an F1 score of 94.95%, with a confusion matrix on the test data of array([[123, 4], [8, 113]]). Hence we can conclude that, overall, the model was fairly accurate at distinguishing cough sounds from other noises.
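For reference, metrics like these can be computed with scikit-learn (model, x_test, and y_test are placeholders for the trained model and test split):

```python
from sklearn.metrics import f1_score, confusion_matrix

# Threshold the sigmoid outputs at 0.5 to get hard labels, then score them.
y_pred = (model.predict(x_test) > 0.5).astype(int).ravel()
print(f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```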
Code
All the code for the project can be found on GitHub.