From raw audio to screening result in under 3 seconds
SickNote is a binary cough classifier that distinguishes healthy coughs from abnormal ones. It converts audio recordings into mel spectrograms — visual representations of sound frequencies — and feeds them through an ensemble of five convolutional neural networks. Each prediction includes a Grad-CAM heatmap showing which spectrogram regions drove the result.
The COUGHVID dataset contains ~34,400 crowdsourced cough recordings. Of these, 2,841 were reviewed by four expert physicians who provided diagnostic labels.
Total recordings: 34,400
Expert-labeled: 2,841
After filtering: ~2,300
Filtering criteria
Class distribution
The classifier is a small convolutional neural network trained from scratch. Five copies are trained with different random seeds, and their predicted probabilities are averaged for robustness. Each result includes a Grad-CAM heatmap showing which spectrogram regions influenced the classification.
Input: (1, 64, T) mel spectrogram
↓
Conv2d(8) + BatchNorm + ReLU + MaxPool
Conv2d(16) + BatchNorm + ReLU + MaxPool
Conv2d(32) + BatchNorm + ReLU + MaxPool
↓
Flatten
Linear(128) + ReLU + Dropout(0.5)
Linear(1) → logit
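The size of the vector that reaches Linear(128) depends on the number of spectrogram frames T. The sketch below walks the shapes through the three blocks and counts trainable parameters. It assumes 3x3 kernels with padding 1 and 2x2 max pooling with stride 2, none of which are stated in the diagram, and the function names are illustrative.

```python
# Assumptions (not stated in the source): 3x3 convolutions with padding 1,
# so each conv preserves height and width, and 2x2 max pooling with stride 2.

CHANNELS = [8, 16, 32]  # channel widths from the diagram above
N_MELS = 64             # mel bins, from the input shape (1, 64, T)
HIDDEN = 128            # width of the first linear layer


def flattened_features(n_frames: int, n_mels: int = N_MELS) -> int:
    """Length of the Flatten output for an input of shape (1, n_mels, n_frames)."""
    height, width = n_mels, n_frames
    for _ in CHANNELS:
        # The conv keeps (height, width); the pool halves both, rounding down.
        height, width = height // 2, width // 2
    return CHANNELS[-1] * height * width


def parameter_count(n_frames: int) -> int:
    """Trainable parameters under the assumptions listed at the top."""
    total = 0
    in_channels = 1
    for out_channels in CHANNELS:
        total += out_channels * in_channels * 3 * 3 + out_channels  # conv weights + bias
        total += 2 * out_channels                                   # BatchNorm scale + shift
        in_channels = out_channels
    total += flattened_features(n_frames) * HIDDEN + HIDDEN         # Linear(128)
    total += HIDDEN * 1 + 1                                         # Linear(1)
    return total
```

With a hypothetical T of 128 frames, `flattened_features(128)` is 4096 and `parameter_count(128)` is 530,545, of which 524,416 sit in Linear(128). Under these assumptions the first linear layer dominates the model, and because Flatten feeds a fixed-size layer, every spectrogram would need to be padded or cropped to the same T.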
Loss function: BCEWithLogitsLoss + pos_weight
Optimizer: Adam (lr=3e-4, wd=1e-4)
Ensemble: 5 models, different seeds, averaged probabilities
Explainability: Grad-CAM heatmaps on last conv block
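To make the role of pos_weight concrete, here is a plain-Python sketch of the per-sample loss that BCEWithLogitsLoss computes when pos_weight is set. The class counts in the usage note are hypothetical, since the class distribution is not given here, and treating "abnormal" as the positive class is also an assumption.

```python
import math


def pos_weight_from_counts(n_negative: int, n_positive: int) -> float:
    """A common choice: weight positives by the negative-to-positive ratio."""
    return n_negative / n_positive


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def weighted_bce_with_logits(logit: float, target: float, pos_weight: float) -> float:
    """Per-sample loss: -[pos_weight * y * log s(x) + (1 - y) * log(1 - s(x))].

    Uses the identity log(1 - sigmoid(x)) = log(sigmoid(-x)).
    """
    positive_term = pos_weight * target * log_sigmoid(logit)
    negative_term = (1.0 - target) * log_sigmoid(-logit)
    return -(positive_term + negative_term)
```

For example, with a hypothetical 300 healthy and 100 abnormal training samples, `pos_weight_from_counts(300, 100)` gives 3.0, so a misclassified abnormal cough costs three times as much as it would unweighted. This pushes the model toward higher sensitivity on the minority class.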
| Metric | Actual | Target |
|---|---|---|
| AUC-ROC | 0.73 | > 0.82 |
| Accuracy | 76% | > 78% |
| Sensitivity | 0.68 | > 0.75 |
| Specificity | 0.77 | > 0.70 |
The classification threshold was moved from 0.50 to 0.52 using Youden's J statistic to balance sensitivity and specificity. Metrics are reported on the held-out test set (15% of the data, never seen during training).
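A threshold search with Youden's J (sensitivity + specificity - 1) can be sketched as follows. The probabilities, labels, and candidate grid are toy values invented for illustration. The text does not say which split the search ran on, so the sketch assumes validation data, which keeps the test set untouched.

```python
def sensitivity_specificity(probs, labels, threshold):
    """Sensitivity and specificity when prob >= threshold means 'abnormal' (label 1)."""
    tp = fn = tn = fp = 0
    for prob, label in zip(probs, labels):
        predicted_abnormal = prob >= threshold
        if label == 1 and predicted_abnormal:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_abnormal:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def best_threshold_by_youden(probs, labels, candidates):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    best_threshold, best_j = None, float("-inf")
    for threshold in candidates:
        sens, spec = sensitivity_specificity(probs, labels, threshold)
        j = sens + spec - 1.0
        if j > best_j:  # strict comparison: ties keep the earliest candidate
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy validation data (hypothetical): four healthy (0) and four abnormal (1) coughs.
val_probs = [0.10, 0.20, 0.30, 0.65, 0.45, 0.60, 0.70, 0.90]
val_labels = [0, 0, 0, 0, 1, 1, 1, 1]
grid = [0.3, 0.4, 0.5, 0.6, 0.7]
threshold, j = best_threshold_by_youden(val_probs, val_labels, grid)
```

On this toy data the search selects 0.4 with J = 0.75. A finer grid, such as steps of 0.01, would be needed to land on a value like 0.52.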
The engineering choices behind SickNote
Architecture
Multi-class classification (COVID, URTI, LRTI, obstructive disease, where URTI and LRTI are upper and lower respiratory tract infections) reduced accuracy sharply because pathological coughs occupy overlapping feature space in the spectrogram domain. Binary classification, healthy vs. abnormal, is medically honest and performs better with limited data. The output is "something sounds off," not a specific diagnosis.
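Collapsing the diagnostic classes into a binary target could look like the sketch below. The label strings are assumptions about how the expert diagnoses are encoded and may not match the actual metadata fields. The function name is illustrative.

```python
# Hypothetical label strings; adjust to whatever the dataset metadata actually uses.
HEALTHY_LABEL = "healthy_cough"
ABNORMAL_LABELS = {
    "COVID-19",
    "upper_infection",      # URTI
    "lower_infection",      # LRTI
    "obstructive_disease",
}


def to_binary_label(diagnosis: str) -> int:
    """Map an expert diagnosis to 0 (healthy) or 1 (abnormal)."""
    if diagnosis == HEALTHY_LABEL:
        return 0
    if diagnosis in ABNORMAL_LABELS:
        return 1
    # Fail loudly on unknown labels instead of silently guessing a class.
    raise ValueError(f"unrecognized diagnosis label: {diagnosis!r}")
```

Raising on unknown strings is a deliberate choice in this sketch: with a dataset this small, a silently mislabeled handful of samples could noticeably shift the class balance.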
Architecture
With only ~2,300 expert-labeled samples, a small 3-layer CNN with aggressive dropout (0.5) and reduced channel widths [8, 16, 32] was the right fit. Larger architectures overfit before learning useful features. We tuned every hyperparameter against real dataset statistics from explore.py before writing a single training loop.
Training
We attempted to replace our CNN with a pretrained ResNet18 backbone to leverage features learned from millions of ImageNet images. The plan: freeze pretrained layers, train only a classifier head on our cough spectrograms. With ~1,600 training samples, the model's higher capacity worked against us — it overfit faster than our from-scratch CNN. The pretrained features were too general for the narrow spectrogram patterns that distinguish healthy from abnormal coughs. We reverted to the original architecture.
Data
Standard audio augmentation (noise injection, time stretching) on a dataset this small amplified noise rather than adding signal. The augmented samples were too similar to the originals to provide new information, and synthetic noise patterns confused the model. We removed augmentation entirely and kept the raw expert-labeled data.
Training
Rather than rely on a single model, we train five models with different random seeds and average their predicted probabilities. This smooths out the variance inherent to our small dataset. We then moved the classification threshold from 0.50 to 0.52 using Youden's J statistic (sensitivity + specificity - 1), which picks the threshold that best balances sensitivity (catching abnormal coughs) against specificity (correctly identifying healthy ones).
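The averaging step can be sketched in a few lines. Each model emits one logit per recording, the logits are converted to probabilities, and the mean probability is compared with the tuned threshold. The logits below are made-up values and the function names are illustrative.

```python
import math

THRESHOLD = 0.52  # tuned threshold reported above


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def ensemble_probability(logits) -> float:
    """Average the per-model probabilities (not the raw logits)."""
    probabilities = [sigmoid(logit) for logit in logits]
    return sum(probabilities) / len(probabilities)


def is_abnormal(logits, threshold: float = THRESHOLD) -> bool:
    """Flag the cough as abnormal when the mean probability reaches the threshold."""
    return ensemble_probability(logits) >= threshold


# Hypothetical logits from the five seeds for one recording.
five_model_logits = [0.3, -0.1, 0.4, 0.2, 0.1]
flagged = is_abnormal(five_model_logits)
```

Averaging probabilities instead of logits keeps one overconfident model from dominating the result, since each model's contribution is bounded between 0 and 1.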