How SickNote works

From raw audio to screening result in under 3 seconds

Overview

SickNote is a binary cough classifier that distinguishes healthy coughs from abnormal ones. It converts audio recordings into mel spectrograms — visual representations of sound frequencies — and feeds them through an ensemble of five convolutional neural networks. Each prediction includes a Grad-CAM heatmap showing which spectrogram regions drove the result.
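To make the audio-to-spectrogram step concrete, here is a minimal NumPy sketch of a log-mel spectrogram. Only the 64 mel bands come from the input shape stated on this page; the sample rate, FFT size, hop length, and all function names are assumptions chosen for illustration, not SickNote's actual settings.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale: (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(audio, sr=16000, n_fft=1024, hop=256, n_mels=64):
    """Return a (1, n_mels, T) log-mel spectrogram in decibels.

    sr, n_fft and hop are assumed values; audio must hold at least n_fft samples.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (T, n_fft // 2 + 1)
    mel = mel_filterbank(sr, n_fft, n_mels) @ power.T       # (n_mels, T)
    log_mel = 10.0 * np.log10(mel + 1e-10)                  # epsilon avoids log(0)
    return log_mel[np.newaxis, :, :]                        # add the channel axis
```

Under these assumed settings, one second of 16 kHz audio yields an array with 64 mel rows and one column per 256-sample hop.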

Dataset

The COUGHVID dataset contains ~34,400 crowdsourced cough recordings. Of these, 2,841 were reviewed by four expert physicians who provided diagnostic labels.

Total recordings: 34,400
Expert-labeled: 2,841
After filtering: ~2,300

Filtering criteria

Cough detection confidence > 0.8
At least one expert diagnosis present
Quality rated acceptable by majority of experts
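The three criteria above can be sketched as a single predicate over one metadata row. The field names (`cough_detected`, `diagnoses`, `qualities`) and the set of quality labels treated as acceptable are hypothetical stand-ins, and "majority" is assumed to mean more than half of the experts who actually rated the recording.

```python
def keep_recording(row, min_confidence=0.8, acceptable=("good", "ok")):
    """Return True if a metadata row passes all three filtering criteria.

    `row` is a dict with hypothetical keys:
      cough_detected: float, cough detection confidence
      diagnoses:      list of expert diagnoses, None where an expert gave none
      qualities:      list of expert quality ratings, None where missing
    """
    # 1. Cough detection confidence must exceed the cutoff.
    if row["cough_detected"] <= min_confidence:
        return False

    # 2. At least one expert diagnosis must be present.
    if not any(d is not None for d in row["diagnoses"]):
        return False

    # 3. More than half of the experts who rated quality found it acceptable.
    ratings = [q for q in row["qualities"] if q is not None]
    n_acceptable = sum(q in acceptable for q in ratings)
    return n_acceptable > len(ratings) / 2
```

A recording with no quality ratings at all fails the third check under this reading, since zero acceptable ratings is not a majority.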

Class distribution

78% abnormal, 22% healthy

Processing pipeline

Raw Audio → Filter → Label → Spectrogram → Normalize → Split
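The final Split stage could look like the sketch below. The 15% test fraction is stated on this page. The 15% validation fraction is an assumption inferred from roughly 1,600 of 2,300 samples going to training. Stratifying by class is also an assumption, picked here so that each split keeps the 78/22 class ratio.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.15, val_frac=0.15, seed=0):
    """Split sample indices into train/val/test, preserving class proportions."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    split = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        split["test"] += indices[:n_test]
        split["val"] += indices[n_test:n_test + n_val]
        split["train"] += indices[n_test + n_val:]
    return split

# Illustrative label list matching the stated 78% / 22% distribution.
labels = [1] * 1794 + [0] * 506
split = stratified_split(labels)
```

If spectrograms are normalized with dataset-level statistics, computing those statistics on the training indices only keeps information from the test set out of training.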

Model architecture

A small convolutional neural network trained from scratch. Five models are trained with different random seeds and their predictions are averaged for robustness. Each result includes a Grad-CAM heatmap showing which spectrogram regions influenced the classification.

Input: (1, 64, T) mel spectrogram

↓

Conv2d(8) + BatchNorm + ReLU + MaxPool

Conv2d(16) + BatchNorm + ReLU + MaxPool

Conv2d(32) + BatchNorm + ReLU + MaxPool

↓

Flatten

Linear(128) + ReLU + Dropout(0.5)

Linear(1) → logit
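The layer stack above could be written in PyTorch roughly as follows. The channel widths, dropout rate, and layer order come from the diagram. The 3x3 kernels, padding of 1, 2x2 pooling, fixed frame count, and the class name `CoughCNN` are assumptions made so the sketch runs end to end.

```python
import torch
from torch import nn

def conv_block(in_ch, out_ch):
    """Conv2d + BatchNorm + ReLU + MaxPool; kernel and pool sizes are assumed."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

class CoughCNN(nn.Module):
    def __init__(self, n_mels=64, n_frames=128):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(1, 8),
            conv_block(8, 16),
            conv_block(16, 32),
        )
        # Three 2x2 poolings shrink each spatial dimension by a factor of 8.
        flat = 32 * (n_mels // 8) * (n_frames // 8)
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 128),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(128, 1),  # single logit; apply sigmoid for a probability
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CoughCNN()
model.eval()
with torch.no_grad():
    logits = model(torch.zeros(2, 1, 64, 128))  # batch of two (1, 64, T) inputs
```

Because the first Linear layer needs a fixed input size, this sketch assumes every spectrogram is padded or cropped to the same number of frames.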

Loss function: BCEWithLogitsLoss + pos_weight
Optimizer: Adam (lr=3e-4, wd=1e-4)
Ensemble: 5 models, different seeds, averaged probabilities
Explainability: Grad-CAM heatmaps on last conv block
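As a worked example of the pos_weight term, PyTorch's BCEWithLogitsLoss documentation suggests setting it to the ratio of negative to positive examples. The sketch below applies that ratio to the class distribution stated on this page. It assumes that "abnormal" is the positive class and uses the full ~2,300 samples for illustration; the value SickNote actually uses is not given here.

```python
def pos_weight_for(n_positive, n_negative):
    """Negative-to-positive ratio, the convention suggested for pos_weight."""
    return n_negative / n_positive

n_total = 2300                          # approximate count after filtering
n_abnormal = round(0.78 * n_total)      # assumed positive class
n_healthy = n_total - n_abnormal        # assumed negative class

w = pos_weight_for(n_abnormal, n_healthy)
print(f"abnormal={n_abnormal} healthy={n_healthy} pos_weight={w:.3f}")
```

With the majority class as the positive class the weight falls below 1, which down-weights errors on abnormal coughs relative to errors on the rarer healthy ones. In practice the counts would be taken from the training split only.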

Evaluation metrics

Metric	Actual	Target
AUC-ROC	0.73	> 0.82
Accuracy	76%	> 78%
Sensitivity	0.68	> 0.75
Specificity	0.77	> 0.70

The classification threshold was raised from 0.50 to 0.52 using Youden's J statistic to balance sensitivity and specificity. Metrics are reported on the held-out test set (15% of the data, never seen during training).
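Youden's J is defined as sensitivity + specificity − 1, and threshold tuning amounts to scanning candidate thresholds for the one with the highest J. The following is a minimal sketch of that scan; the function names, the 0.01 grid, and the toy data are illustrative assumptions. The page does not say which split the threshold was tuned on.

```python
def sensitivity_specificity(labels, probs, threshold):
    """Labels are 1 (abnormal) or 0 (healthy); predict abnormal when p >= threshold."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= threshold)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < threshold)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < threshold)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold(labels, probs, candidates=None):
    """Return (threshold, J) maximizing Youden's J; ties keep the lowest threshold."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_t, best_j = None, float("-inf")
    for t in candidates:
        sens, spec = sensitivity_specificity(labels, probs, t)
        j = sens + spec - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy data for illustration only, not SickNote predictions.
toy_labels = [0, 0, 0, 1, 1, 1]
toy_probs = [0.1, 0.3, 0.6, 0.55, 0.7, 0.9]
t, j = best_threshold(toy_labels, toy_probs)
```

Choosing the threshold on data other than the test set keeps the reported test metrics unbiased.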

Known limitations

All COUGHVID recordings are voluntary, intentional coughs, which may differ acoustically from spontaneous ones
~2,300 expert-labeled samples, split into ~1,600 for training — small by production ML standards
Class imbalance: ~78% abnormal / ~22% healthy after expert filtering
No external validation dataset — generalization to new devices unknown
Binary only — does not distinguish COVID vs URTI vs LRTI vs other
COUGHVID was collected during the COVID pandemic — label distribution reflects that context
Screening tool only — not a diagnostic

Design decisions

The engineering choices behind SickNote

Architecture

Why binary classification

Multi-class classification (COVID, URTI, LRTI, obstructive disease) dropped accuracy sharply because pathological coughs occupy overlapping feature space in the spectrogram domain. Binary classification — healthy vs. abnormal — is medically honest and performs significantly better with limited data. The output is "something sounds off," not a specific diagnosis.

Architecture

Why we built a CNN from scratch

With only ~2,300 expert-labeled samples, a small CNN with three convolutional blocks, aggressive dropout (0.5), and reduced channel widths [8, 16, 32] was the right fit. Larger architectures overfit before learning useful features. We tuned every hyperparameter against real dataset statistics from explore.py before writing a single training loop.

Training

Why we tried transfer learning and reverted

We attempted to replace our CNN with a pretrained ResNet18 backbone to leverage features learned from millions of ImageNet images. The plan: freeze pretrained layers, train only a classifier head on our cough spectrograms. With ~1,600 training samples, the model's higher capacity worked against us — it overfit faster than our from-scratch CNN. The pretrained features were too general for the narrow spectrogram patterns that distinguish healthy from abnormal coughs. We reverted to the original architecture.

Data

Why data augmentation didn't help

Standard audio augmentation (noise injection, time stretching) on a dataset this small amplified noise rather than adding signal. The augmented samples were too similar to the originals to provide new information, and synthetic noise patterns confused the model. We removed augmentation entirely and kept the raw expert-labeled data.

Training

Why ensemble + threshold tuning

Instead of relying on a single model's opinion, we train five models with different random seeds and average their predicted probabilities. This smooths out the variance inherent to our small dataset. We then moved the classification threshold from 0.50 to 0.52 using Youden's J statistic, which picks the threshold that maximizes sensitivity + specificity − 1, where sensitivity is the share of abnormal coughs caught and specificity is the share of healthy coughs correctly identified.
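The averaging step can be sketched in a few lines: each model's logit is converted to a probability, the five probabilities are averaged, and the mean is compared with the tuned threshold. The function names and the example logits are hypothetical, and the sketch assumes the positive class is "abnormal".

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ensemble_predict(logits, threshold=0.52):
    """Average per-model probabilities, then apply the classification threshold.

    `logits` holds one raw output per ensemble member for a single recording.
    """
    probs = [sigmoid(z) for z in logits]
    mean_prob = sum(probs) / len(probs)
    label = "abnormal" if mean_prob >= threshold else "healthy"
    return mean_prob, label

# Hypothetical logits from five models for one recording.
mean_prob, label = ensemble_predict([2.0, 1.0, 0.5, -0.2, 1.5])
```

Averaging probabilities rather than hard votes lets a confident model outweigh a borderline one, and a single model that disagrees (the −0.2 logit here) does not flip the result on its own.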