Lab 1. Cats and Dogs binary classification
Let us look at the files in the audio data set folders we previously downloaded.
The Cats-Vs-Dogs dataset is a simple audio data set with two classes: sounds of cats meowing and of dogs barking. The data was created as an easy binary example of how classification can work. The data set is about 1.5 GB and has samples of varying length. The samples are all WAV files with at least 16-bit depth and at least 44.1 kHz sampling rate; most have a sampling rate of 48 kHz and a bit depth of 24.
While high bit depth is not as important (a little noise is a good thing for neural nets), the sampling rate is very important: the higher the Nyquist frequency, the more information can be represented in the spectrogram.
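As a quick sanity check, you can inspect the sampling rate and bit depth of any file before generating spectrograms. Below is a minimal sketch using the soundfile library; the file path is just a placeholder.

```python
# Minimal sketch: inspect a WAV file's format with the soundfile library.
# The path below is a placeholder -- point it at any file in AudioData/.
import soundfile as sf

info = sf.info("AudioData/Cats/cat_example.wav")  # hypothetical file name
print("sample rate :", info.samplerate)   # e.g. 44100 or 48000 Hz
print("bit depth   :", info.subtype)      # e.g. PCM_16 or PCM_24
print("duration (s):", info.duration)
print("channels    :", info.channels)
```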
This data set was collected from freesound.org. Most of the sounds are under public license and have been recorded on very different audio gear, in different contexts etc.
We define classes of data by sorting them into different folders; data that should be classified in the same class goes in the same folder. All class folders should be placed inside the AudioData folder.
For Cats vs. Dogs, the data folders look like this:
AudioData/
├── Cats
└── Dogs
Take a little time to look at the number of files, and try playing some of them. How many files are there? How big are they? Are they tightly clipped around the barks and meows?
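If you'd rather answer those questions programmatically, a short sketch like the one below (assuming the AudioData/Cats and AudioData/Dogs layout above) counts the files and sums their sizes.

```python
# Minimal sketch: count files and total size per class folder.
# Assumes the AudioData/Cats and AudioData/Dogs layout shown above.
from pathlib import Path

for class_dir in Path("AudioData").iterdir():
    if not class_dir.is_dir():
        continue
    wavs = list(class_dir.glob("*.wav"))
    total_mb = sum(f.stat().st_size for f in wavs) / 1e6
    print(f"{class_dir.name}: {len(wavs)} files, {total_mb:.1f} MB")
```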
Neural networks are currently very good at detecting images. The Convolutional Neural Networks (CNNs) used for image recognition have become almost ubiquitous and are therefore very easy to play with.
Ironically, then, one of the best ways to classify audio data is to convert the audio into spectrograms, a visual representation of audio snippets. After we generate these audio images, we can retrain any standard image classifier to work with them and help us classify our audio data.
So, the first step is to compute images from the audio data.
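The notebook below handles this for you, but conceptually the conversion looks something like the following sketch: load the audio, compute a mel spectrogram, and save it as an image. This is an illustrative sketch using librosa and matplotlib, not the exact code in the notebook; the paths and parameters are placeholders.

```python
# Illustrative sketch (not the notebook's exact code): turn one WAV file
# into a spectrogram image. Paths and parameters are placeholders.
from pathlib import Path
import librosa
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("AudioData/Dogs/dog_example.wav", sr=None)  # keep native rate
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)  # convert power to decibels

# Save the spectrogram as a plain image so an image classifier can use it.
Path("GeneratedData/Dogs").mkdir(parents=True, exist_ok=True)
plt.imsave("GeneratedData/Dogs/dog_example.png", S_db, origin="lower", cmap="magma")
```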
We prepared the notebook GeneratingSpectrums in the 01_Spectrum Generation folder for this task.
Here is an example of how to run the notebook from the Mac Terminal command-line:
$ ls
00_Setup 03_Running
01_Spectrum Generation README.md
02_Training doc
$ cd 01_Spectrum\ Generation/
$ ls
GeneratingSpectrums.ipynb SpectrumsSettingsTool.ipynb
GeneratingSpectrums_edit.ipynb Standard.SpecVar
$ jupyter notebook GeneratingSpectrums.ipynb
Other ways of launching the GeneratingSpectrums notebook will be demonstrated in lecture.
Run the first two cells, which load libraries and define folder paths: make sure the first one is selected (remember, a green or blue line on the left side means edit or command mode) and press shift+return twice.
It might take a while, but you should see the indicator in the top left corner next to each cell change from empty to an asterisk to a number, like this:
In []: # This code block has not been executed.
In [*]: # This code is being executed but has not finished.
In [1]: #This code block is finished and was the first one to finish.
The third cell will start the processing and convert the audio data into spectrograms.
To check if it's done, look at the In [*]: box in the top left corner of the cell. If it turns into a number, it is finished. The last step will take several minutes, and may generate warning messages even if everything is working.
By the end of this step, we have images of all of the sounds, stored in the GeneratedData folder.
We will be using the ResNet CNN, which was pretrained on more than a million images from the ImageNet database. Basically, this is a generic CNN that has been trained to "see." We will retrain the last layer of this network to find differences in our spectrograms.
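The idea of "retraining the last layer" looks roughly like the PyTorch sketch below: load a pretrained ResNet, freeze its convolutional layers, and swap in a new final layer with one output per class. This is a hedged illustration of transfer learning in general, not the notebook's exact code.

```python
# Illustrative transfer-learning sketch (not the notebook's exact code).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained on ImageNet
                                                  # (older torchvision: pretrained=True)

# Freeze the pretrained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer: 2 outputs for Cats vs. Dogs.
model.fc = nn.Linear(model.fc.in_features, 2)
```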
Please open the notebook Training the Network in the folder 02_Training. (See the previous step for an example of how to run a Jupyter Notebook.)
Run the first two cells in that notebook. You should now see a list selection with both Cats and Dogs. Check both boxes and run the rest of the cells in the notebook. This will load the data and train the network. The complete execution will probably take a while (up to 30 minutes depending on your computer's speed and number of cores).
While the code is running, let's take some time to go through the notebook and understand what is going on.
Here are a few interesting sections to look at:
Parameters
The Preview sample training images cell takes an entire training batch of images and displays them with their respective labels. Have a look and verify that you indeed see spectrogram images similar to what you saw earlier.
The Training the network cell is the main cell; it takes the longest to compute (5-10 minutes) and actually trains the network.
The Post training analytics cells then assess the performance of the newly trained neural network.
In the '# Print predicted and actual labels for Spectragrams' cell, we load a batch of images and display both the ground truth and the predicted label. This is very valuable to look at, as you can sometimes spot issues, especially when the network consistently misclassifies something.
The '# Network analytics' cell runs the complete test dataset through the algorithm, produces a confusion matrix and calculates accuracy. When this notebook is done, you should see values like this.
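If you want to reproduce those numbers yourself, the metrics boil down to something like the following sketch with scikit-learn, where y_true and y_pred stand in for the test-set labels and the network's predictions.

```python
# Minimal sketch: confusion matrix and accuracy from predicted labels.
# y_true / y_pred are placeholders for the test labels and model predictions.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["cat", "cat", "dog", "dog", "cat"]   # ground truth (example values)
y_pred = ["cat", "dog", "dog", "dog", "cat"]   # network output (example values)

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog"]))
print("accuracy:", accuracy_score(y_true, y_pred))
```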
The next step, running the network on live audio, is also known as inference. In this step, we basically put the neural net to a real-world test: we let the network infer from the incoming audio which class it thinks is the best fit.
This involves creating an audio buffer that we continually update with information from the microphone, then creating an image and running it through the neural net. This happens as fast as possible over and over again.
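Conceptually, the loop looks like the sketch below: record a short buffer from the microphone, turn it into a spectrogram, and ask the (already trained) model for a prediction. This uses the sounddevice and librosa libraries as stand-ins; the actual helper code in the repo differs.

```python
# Conceptual sketch of the real-time loop (the repo's actual code differs).
import numpy as np
import sounddevice as sd
import librosa
import torch

SR = 48000          # assumed sample rate
WINDOW_SECONDS = 1  # length of the rolling audio buffer

def predict_once(model, class_names):
    # Record one window of audio from the default microphone.
    audio = sd.rec(int(WINDOW_SECONDS * SR), samplerate=SR, channels=1)
    sd.wait()
    y = audio.flatten()

    # Convert the buffer to a spectrogram "image" and add batch/channel dims.
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=SR), ref=np.max)
    x = torch.tensor(S, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
    x = x.repeat(1, 3, 1, 1)  # fake 3 channels for an RGB image model

    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return class_names[int(probs.argmax())]
```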
The underlying Python code is a bit more complex; Lab 3 will address the details. For now, we have a simple Jupyter Notebook that just lets us test our network. So, please open the notebook Inference in the folder 03_Running.
When you run the first two cells you will see that the program will run for 30 seconds and display the most relevant class below the cell.
RTA.RunProgram(targetLength=30,k=1)
You can run the second cell over and over again. Better, however, is to change the targetLength= variable to something higher, e.g. targetLength=60 for a one-minute run (if you enter 0, the program will never stop). If you want to see the second or third most likely class prediction, increase the k value. In our case with Cats-Vs-Dogs, the highest meaningful value is 2, as we only have 2 classes.
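Under the hood, the k parameter corresponds to a top-k selection over the class probabilities, roughly like this sketch (logits is a placeholder for the network's raw output):

```python
# Minimal sketch of a top-k prediction (logits is a placeholder tensor).
import torch

logits = torch.tensor([[2.3, -0.7]])   # raw network output for one clip
probs = torch.softmax(logits, dim=1)   # convert to probabilities
top = torch.topk(probs, k=2)           # k=2 is the maximum for two classes

class_names = ["Cats", "Dogs"]
for p, idx in zip(top.values[0], top.indices[0]):
    print(f"{class_names[int(idx)]}: {float(p):.2f}")
```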
It helps to get an intuitive feeling for how the sounds get translated to spectrograms. Use Spectrums Settings Exploration.ipynb from the 01_Spectrum Generation folder to explore the mapping between the sound and the image of the spectrogram.
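If you want to experiment outside that notebook, varying the FFT size and hop length shows how the time/frequency trade-off changes the image. A rough sketch (the file path is a placeholder):

```python
# Rough sketch: compare spectrograms computed with different FFT settings.
# The file path is a placeholder; pick any clip from AudioData/.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("AudioData/Cats/cat_example.wav", sr=None)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, n_fft in zip(axes, [1024, 2048, 8192]):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=n_fft // 4, n_mels=64)
    S_db = librosa.power_to_db(S, ref=np.max)
    librosa.display.specshow(S_db, sr=sr, hop_length=n_fft // 4,
                             x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(f"n_fft = {n_fft}")
plt.tight_layout()
plt.show()
```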
Using online datasets such as Kaggle, AudioSet, Freesound, and Mivia (and others!), find sounds to create your own audio classification task. Target having ~125 samples of each of two categories.
Converting a CSV-labelled dataset to folders (see the sketch below)
Making an Audio Dataset from a Video Dataset
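For the first of those guides, the core idea is simply to copy each file into a folder named after its label. Below is a hedged sketch, assuming a CSV with filename and label columns (column names and paths are placeholders, not the guide's actual code):

```python
# Hedged sketch: sort a CSV-labelled dataset into per-class folders.
# Column names ("filename", "label") and paths are placeholder assumptions.
import shutil
from pathlib import Path
import pandas as pd

labels = pd.read_csv("labels.csv")      # e.g. columns: filename,label
source_dir = Path("downloaded_audio")   # where the raw files live
target_root = Path("AudioData")

for _, row in labels.iterrows():
    class_dir = target_root / str(row["label"])
    class_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(source_dir / row["filename"], class_dir / row["filename"])
```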
Modify the Cats vs. Dogs Python notebook inputs to try out your own classification! What works better? What doesn't work as well? Post your learnings on the workshop Discord!
To push changes onto GitHub, you will need to git add, git commit, and then git push your changes. See [here](https://dont-be-afraid-to-commit.readthedocs.io/en/latest/git/commandlinegit.html) for more details on how to do this.
This binary audio classification task is an homage to the canonical tutorial for basic binary classification: Cats vs. Dogs (computer vision).