
I built an Emotion Guesser from audio using a CNN and Mel-spectrograms (67% Accuracy on RAVDESS)
I built a live emotion guesser that captures microphone input, converts the audio into Mel-spectrograms, and feeds them through a deep CNN to classify 8 different human emotions (Happy, Sad, Angry, Fearful, Disgust, Surprised, Calm, and Neutral).
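For anyone curious what the Mel-spectrogram step involves, here is a minimal NumPy-only sketch. It is not the code from the repository (a library such as librosa would be the more usual choice); the function names, FFT size, hop length, and number of Mel bands are all assumptions picked for illustration.

```python
import numpy as np

def hz_to_mel(hz):
    # Common HTK-style Hz -> Mel conversion
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sample_rate, n_fft, n_mels):
    # Triangular filters whose edges are evenly spaced on the Mel scale
    edges_hz = mel_to_hz(
        np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    )
    bin_hz = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    bank = np.zeros((n_mels, bin_hz.size))
    for i in range(n_mels):
        low, center, high = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_hz - low) / (center - low)
        falling = (high - bin_hz) / (high - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(audio, sample_rate, n_fft=1024, hop=256, n_mels=64):
    # Hypothetical defaults; the real project may use different settings
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack(
        [audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft//2 + 1)
    mel_power = mel_filterbank(sample_rate, n_fft, n_mels) @ power.T
    return 10.0 * np.log10(np.maximum(mel_power, 1e-10))  # dB, (n_mels, n_frames)

# Usage: one second of a 440 Hz tone
sr = 16000
tone = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
spec = log_mel_spectrogram(tone, sr)
print(spec.shape)  # (n_mels, n_frames)
```

The result is a 2D array (Mel bands by time frames) that can be treated like a single-channel image and passed to a CNN.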
Initially, I was only hitting around 29% accuracy on the RAVDESS dataset using a standard 3-block CNN with a Flatten layer. The model was massively overfitting: flattening the final convolutional feature maps produced a huge input vector for the layer after it, which created way too many parameters.
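To get a feel for the scale of the problem, here is a back-of-the-envelope parameter count. All the shapes are assumptions for illustration (a 128x128 single-channel spectrogram, 3x3 convolutions with 32/64/128 filters, 2x2 pooling after each block, and a hypothetical 256-unit dense layer after Flatten), not the exact numbers from the repository.

```python
def conv_params(in_channels, out_channels, kernel=3):
    # weights + one bias per output channel
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    return n_in * n_out + n_out

# Assumed 3-block stack: 1 -> 32 -> 64 -> 128 channels
channels = [1, 32, 64, 128]
conv_total = sum(conv_params(a, b) for a, b in zip(channels, channels[1:]))

# Each block halves height and width: 128 -> 64 -> 32 -> 16
side = 128 // (2 ** 3)
flat_features = side * side * channels[-1]        # 16 * 16 * 128
flatten_dense = dense_params(flat_features, 256)  # hypothetical Dense(256)

print(f"conv layers:       {conv_total:,} parameters")
print(f"dense after Flatten: {flatten_dense:,} parameters")
```

Under these assumptions the single dense layer after Flatten holds roughly 90 times as many parameters as all three convolutional layers combined.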
To fix this, I made a few architectural upgrades:
- Added a 4th Convolutional block (256 filters) to capture deeper frequency abstractions.
- Swapped the Flatten layer out for Global Average Pooling 2D. This drastically reduced the parameter count and acted as a structural regularizer against overfitting.
- Added Early Stopping and ReduceLROnPlateau callbacks: the first halts training once validation performance stops improving, and the second lowers the learning rate dynamically when it plateaus.
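The changes above could look roughly like this in Keras. This is a sketch under assumptions, not the repository's actual model: the framework, input shape, filter counts for the first three blocks, kernel size, optimizer, and callback settings are all guesses. Only the 4th block with 256 filters, the Global Average Pooling layer, the 8 output classes, and the two callbacks come from the description.

```python
import tensorflow as tf
from tensorflow.keras import layers, callbacks

def build_model(input_shape=(128, 128, 1), num_classes=8):
    # input_shape is an assumed spectrogram size (Mel bands, frames, channels)
    model = tf.keras.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128, 256):  # 4th block uses 256 filters
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2))
    # One averaged value per channel instead of a huge flattened vector
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Assumed monitor/patience values; tune these for your own runs
training_callbacks = [
    callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True
    ),
    callbacks.ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=4, min_lr=1e-6
    ),
]

model = build_model()
model.summary()
```

The callbacks would then be passed to training with something like `model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100, callbacks=training_callbacks)`, where the data arrays are placeholders.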
These changes pushed the model's accuracy from around 29% to over 67%.
I've also written a quick script that uses the sounddevice library to record live audio from your microphone and run predictions through the model in real time.
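A live loop along these lines might be structured as follows. This is a minimal sketch and not the script from the repository: the clip length, sample rate, helper names, and the `predict_fn` callable (which stands in for feature extraction plus the trained model) are all hypothetical. The label order follows the RAVDESS emotion codes, and assuming the model's output indices use the same order is also a guess.

```python
import numpy as np

# RAVDESS emotion codes 01-08 in order; model index order is an assumption
EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def record_clip(seconds=3.0, sample_rate=22050):
    """Record a mono clip from the default microphone."""
    # Imported here so the helpers below work on machines without audio hardware
    import sounddevice as sd
    frames = int(seconds * sample_rate)
    audio = sd.rec(frames, samplerate=sample_rate, channels=1, dtype="float32")
    sd.wait()  # block until the recording has finished
    return audio[:, 0]

def normalize(audio):
    """Scale the clip so its loudest sample has magnitude 1."""
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio

def top_emotion(probabilities):
    """Return the most likely label and its probability."""
    probabilities = np.asarray(probabilities)
    index = int(np.argmax(probabilities))
    return EMOTIONS[index], float(probabilities[index])

def listen(predict_fn, seconds=3.0, sample_rate=22050):
    """Record, predict, and print until interrupted with Ctrl+C.

    predict_fn(clip, sample_rate) is a hypothetical callable that turns a
    raw clip into 8 class probabilities (spectrogram + model inference).
    """
    while True:
        clip = normalize(record_clip(seconds, sample_rate))
        label, confidence = top_emotion(predict_fn(clip, sample_rate))
        print(f"{label} ({confidence:.0%})")
```

Recording in fixed-length clips keeps the example simple; a streaming callback would lower latency but needs more buffering logic.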
The code is fully open source. If you are interested in audio processing or CNNs, feel free to check out the repository, and let me know if you have any suggestions for improving the accuracy even further!
GitHub Link: https://github.com/bugbutcherr/emotion-recognizer-model