Lab 1. Cats and Dogs binary classification
Let us look at the files in the audio data set folders we previously downloaded.
The Cats-Vs-Dogs dataset is a simple audio data set with two classes: sounds of cats meowing and of dogs barking. The data was created as an easy binary example of how classification can work. The data set is about 1.5 GB and has samples of varying length. The samples are all WAV files with at least 16-bit depth and at least 44.1 kHz sampling rate; most have a sampling rate of 48 kHz and a bit depth of 24.
While high bit depth is not as important (a little noise is a good thing for neural nets), the sampling rate is very important: the higher the Nyquist frequency, the more information can be represented in the spectrogram.
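As a quick sanity check, you can inspect the sampling rate and bit depth of any file before generating spectrograms. Below is a minimal sketch using the soundfile library; the file path is just a placeholder.

```python
# Minimal sketch: inspect a WAV file's format with the soundfile library.
# The path below is a placeholder -- point it at any file in AudioData/.
import soundfile as sf

info = sf.info("AudioData/Cats/cat_example.wav")  # hypothetical file name
print("sample rate :", info.samplerate)   # e.g. 44100 or 48000 Hz
print("bit depth   :", info.subtype)      # e.g. PCM_16 or PCM_24
print("duration (s):", info.duration)
print("channels    :", info.channels)
```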
This data set was collected from freesound.org. Most of the sounds are under public license and have been recorded on very different audio gear, in different contexts etc.
We define classes of data by sorting them into different folders; data that should be classified in the same class goes in the same folder. All class folders should be placed inside the AudioData folder.
For Cats vs. Dogs, the data folders look like this:
AudioData/
├── Cats
└── Dogs
Take a little time to look at the number of files, and try playing some of them. How many files are there? How big are they? Are they tightly clipped around the barks and meows?
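If you'd rather answer those questions programmatically, a short sketch like the one below (assuming the AudioData/Cats and AudioData/Dogs layout above) counts the files and sums their sizes.

```python
# Minimal sketch: count files and total size per class folder.
# Assumes the AudioData/Cats and AudioData/Dogs layout shown above.
from pathlib import Path

for class_dir in Path("AudioData").iterdir():
    if not class_dir.is_dir():
        continue
    wavs = list(class_dir.glob("*.wav"))
    total_mb = sum(f.stat().st_size for f in wavs) / 1e6
    print(f"{class_dir.name}: {len(wavs)} files, {total_mb:.1f} MB")
```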
Neural networks are currently very good at detecting images. The Convolutional Neural Networks (CNNs) used for image recognition have become almost ubiquitous and are therefore very easy to play with.
Ironically, then, one of the best ways to classify audio data is to convert the audio into spectrograms, a visual representation of audio snippets. After we generate these audio images, we can retrain any standard image classifier to work with them and help us classify our audio data.
So, the first step is to compute images from the audio data.
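The notebook below handles this for you, but conceptually the conversion looks something like the following sketch: load the audio, compute a mel spectrogram, and save it as an image. This is an illustrative sketch using librosa and matplotlib, not the exact code in the notebook; the paths and parameters are placeholders.

```python
# Illustrative sketch (not the notebook's exact code): turn one WAV file
# into a spectrogram image. Paths and parameters are placeholders.
from pathlib import Path
import librosa
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("AudioData/Dogs/dog_example.wav", sr=None)  # keep native rate
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)
S_db = librosa.power_to_db(S, ref=np.max)  # convert power to decibels

# Save the spectrogram as a plain image so an image classifier can use it.
Path("GeneratedData/Dogs").mkdir(parents=True, exist_ok=True)
plt.imsave("GeneratedData/Dogs/dog_example.png", S_db, origin="lower", cmap="magma")
```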
We prepared the notebook GeneratingSpectrums in the 01_Spectrum Generation folder for this task.
Here is an example of how to run the notebook from the Mac Terminal command-line:
$ ls
00_Setup 03_Running
01_Spectrum Generation README.md
02_Training doc
$ cd 01_Spectrum\ Generation/
$ ls
GeneratingSpectrums.ipynb SpectrumsSettingsTool.ipynb
GeneratingSpectrums_edit.ipynb Standard.SpecVar
$ jupyter notebook GeneratingSpectrums.ipynb
Other ways of launching the GeneratingSpectrums notebook will be demonstrated in lecture.
Run the first two cells, which load libraries and define folder paths: make sure the first one is selected (remember, a green or blue line on the left side means edit or command mode) and press shift+return twice.
It might take a while, but you should see the indicator in the top left corner next to each cell change from empty to an asterisk to a number, like this:
In []: # This code block has not been executed.
In [*]: # This code is being executed but has not finished.
In [1]: #This code block is finished and was the first one to finish.
The third cell will start the processing and convert the audio data into spectrograms.
To check if it's done, look at the In [*]: box in the top left corner of the cell. If it turns into a number, it is finished. The last step will take several minutes, and may generate warning messages even if everything is working.
By the end of this step, we have images of all of the sounds, stored in the GeneratedData folder.
We will be using the ResNet CNN, which was pretrained on more than a million images from the ImageNet database. Basically, this is a generic CNN that has been trained to "see." We will retrain the last layer of this network to find differences in our spectrograms.
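The idea of "retraining the last layer" looks roughly like the PyTorch sketch below: load a pretrained ResNet, freeze its convolutional layers, and swap in a new final layer with one output per class. This is a hedged illustration of transfer learning in general, not the notebook's exact code.

```python
# Illustrative transfer-learning sketch (not the notebook's exact code).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")  # pretrained on ImageNet
                                                  # (older torchvision: pretrained=True)

# Freeze the pretrained feature extractor so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer: 2 outputs for Cats vs. Dogs.
model.fc = nn.Linear(model.fc.in_features, 2)
```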
Please open the notebook Training the Network in the folder 02_Training. (See the previous step for an example of how to run a Jupyter Notebook.)
Run the first two cells in that notebook. You should now see a list selection with both Cats and Dogs. Check both boxes and run the rest of the cells in the notebook. This will load the data and train the network. The complete execution will probably take a while (up to 30 minutes depending on your computer's speed and number of cores).
While the code is running, let's take some time to go through the notebook and understand what is going on.
Here are a few interesting sections to look at:
Parameters
The Preview sample training images cell takes an entire training batch of images and displays them with their respective labels. Have a look and verify that you indeed see spectrogram images similar to what you saw earlier.
The Training the network cell is the main cell; it takes the longest to compute (5-10 minutes) and actually trains the network.
The Post training analytics cells then assess the performance of the newly trained neural network.
In the '# Print predicted and actual labels for Spectragrams' cell, we load a batch of images and display both the ground truth and the predicted label. This is very valuable to look at, as you can sometimes spot issues, especially when the network consistently misclassifies something.
The '# Network analytics' cell runs the complete test dataset through the algorithm, produces a confusion matrix and calculates accuracy. When this notebook is done, you should see values like this.
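If you want to reproduce those numbers yourself, the metrics boil down to something like the following sketch with scikit-learn, where y_true and y_pred stand in for the test-set labels and the network's predictions.

```python
# Minimal sketch: confusion matrix and accuracy from predicted labels.
# y_true / y_pred are placeholders for the test labels and model predictions.
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["cat", "cat", "dog", "dog", "cat"]   # ground truth (example values)
y_pred = ["cat", "dog", "dog", "dog", "cat"]   # network output (example values)

print(confusion_matrix(y_true, y_pred, labels=["cat", "dog"]))
print("accuracy:", accuracy_score(y_true, y_pred))
```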
The next step, running the network on live audio, is also known as inference. In this step, we basically put the neural net to a real-world test: we let the network infer from the incoming audio which class it thinks is the best fit.
This involves creating an audio buffer that we continually update with information from the microphone, then creating an image and running it through the neural net. This happens as fast as possible over and over again.
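Conceptually, the loop looks like the sketch below: record a short buffer from the microphone, turn it into a spectrogram, and ask the (already trained) model for a prediction. This uses the sounddevice and librosa libraries as stand-ins; the actual helper code in the repo differs.

```python
# Conceptual sketch of the real-time loop (the repo's actual code differs).
import numpy as np
import sounddevice as sd
import librosa
import torch

SR = 48000          # assumed sample rate
WINDOW_SECONDS = 1  # length of the rolling audio buffer

def predict_once(model, class_names):
    # Record one window of audio from the default microphone.
    audio = sd.rec(int(WINDOW_SECONDS * SR), samplerate=SR, channels=1)
    sd.wait()
    y = audio.flatten()

    # Convert the buffer to a spectrogram "image" and add batch/channel dims.
    S = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=SR), ref=np.max)
    x = torch.tensor(S, dtype=torch.float32).unsqueeze(0).unsqueeze(0)
    x = x.repeat(1, 3, 1, 1)  # fake 3 channels for an RGB image model

    with torch.no_grad():
        probs = torch.softmax(model(x), dim=1)
    return class_names[int(probs.argmax())]
```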
The underlying Python code is a bit more complex; Lab 3 will address the details. For now, we have a simple Jupyter Notebook that just lets us test our network. So, please open the notebook Inference in the folder 03_Running.
When you run the first two cells you will see that the program will run for 30 seconds and display the most relevant class below the cell.
RTA.RunProgram(targetLength=30,k=1)
You can run the second cell over and over again. Better, however, is to change the targetLength= variable to something higher, e.g. targetLength=60 for a one-minute run (if you enter 0, the program will never stop). If you want to see the second or third most likely class prediction, increase the k value. In our case with Cats-Vs-Dogs, the highest meaningful value is 2, as we only have 2 classes.
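Under the hood, the k parameter corresponds to a top-k selection over the class probabilities, roughly like this sketch (logits is a placeholder for the network's raw output):

```python
# Minimal sketch of a top-k prediction (logits is a placeholder tensor).
import torch

logits = torch.tensor([[2.3, -0.7]])   # raw network output for one clip
probs = torch.softmax(logits, dim=1)   # convert to probabilities
top = torch.topk(probs, k=2)           # k=2 is the maximum for two classes

class_names = ["Cats", "Dogs"]
for p, idx in zip(top.values[0], top.indices[0]):
    print(f"{class_names[int(idx)]}: {float(p):.2f}")
```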
It helps to get an intuitive feeling for how the sounds get translated to spectrograms. Use Spectrums Settings Exploration.ipynb from the 01_Spectrum Generation folder to explore the mapping between the sound and the image of the spectrogram.
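If you want to experiment outside that notebook, varying the FFT size and hop length shows how the time/frequency trade-off changes the image. A rough sketch (the file path is a placeholder):

```python
# Rough sketch: compare spectrograms computed with different FFT settings.
# The file path is a placeholder; pick any clip from AudioData/.
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

y, sr = librosa.load("AudioData/Cats/cat_example.wav", sr=None)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, n_fft in zip(axes, [1024, 2048, 8192]):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=n_fft // 4, n_mels=64)
    S_db = librosa.power_to_db(S, ref=np.max)
    librosa.display.specshow(S_db, sr=sr, hop_length=n_fft // 4,
                             x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(f"n_fft = {n_fft}")
plt.tight_layout()
plt.show()
```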
Using online datasets such as Kaggle, AudioSet, Freesound, and Mivia (and others!), find sounds to create your own audio classification task. Target having ~125 samples of each of two categories.
Converting a CSV-labelled dataset to folders (see the sketch below)
Making an Audio Dataset from a Video Dataset
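For the first of those guides, the core idea is simply to copy each file into a folder named after its label. Below is a hedged sketch, assuming a CSV with filename and label columns (column names and paths are placeholders, not the guide's actual code):

```python
# Hedged sketch: sort a CSV-labelled dataset into per-class folders.
# Column names ("filename", "label") and paths are placeholder assumptions.
import shutil
from pathlib import Path
import pandas as pd

labels = pd.read_csv("labels.csv")      # e.g. columns: filename,label
source_dir = Path("downloaded_audio")   # where the raw files live
target_root = Path("AudioData")

for _, row in labels.iterrows():
    class_dir = target_root / str(row["label"])
    class_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(source_dir / row["filename"], class_dir / row["filename"])
```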
Modify the Cats vs. Dogs Python notebook inputs to try out your own classification! What works better? What doesn't work as well? Post your learnings on the workshop Discord!
To push changes onto GitHub, you will need to git add, git commit, and then git push your changes. See [here](https://dont-be-afraid-to-commit.readthedocs.io/en/latest/git/commandlinegit.html) for more details on how to do this.
This binary audio classification task is an homage to the canonical tutorial for basic binary classification: Cats vs. Dogs (computer vision).