This project is a multi-modal machine learning pipeline for detecting Alzheimer's Disease (AD) from spontaneous speech. By combining deep acoustic representations, contextual linguistic embeddings, and handcrafted prosodic and paralinguistic features, the system analyzes both what a patient says and how they say it. A co-attention mechanism aligns the verbal content with vocal cues, targeting the subtle cognitive and linguistic markers of dementia in clinical audio samples.
Combines raw speech audio, text transcripts, prosodic metrics (pitch, jitter), and paralinguistic traits (speech rate, pauses).
Uses 3 stacked Co-Attention blocks to dynamically align acoustic cues with text.
Utilizes frozen, large pre-trained encoders (~144M parameters) paired with a lightweight, trainable fusion layer (~6M parameters) to prevent overfitting.
Automatically extracts and normalizes handcrafted voice rhythm and linguistic fluency metrics.
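The exact feature set lives in the extraction code; as a rough sketch of the idea, the snippet below computes a few such metrics (mean pitch, a jitter proxy, pause ratio) with librosa and z-score normalizes them. The helper names and thresholds here are illustrative assumptions, not the project's actual implementation.

```python
# Illustrative sketch only: a handful of prosody/fluency metrics via librosa,
# plus z-score normalization. Thresholds and the feature list are assumptions.
import librosa
import numpy as np

def extract_voice_metrics(path: str, sr: int = 16000) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)

    # Fundamental frequency (pitch) track via probabilistic YIN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    f0_voiced = f0[~np.isnan(f0)]
    mean_pitch = f0_voiced.mean() if f0_voiced.size else 0.0

    # Jitter approximated as mean absolute period-to-period variation.
    periods = 1.0 / f0_voiced if f0_voiced.size > 1 else np.array([0.0, 0.0])
    jitter = np.abs(np.diff(periods)).mean()

    # Pause statistics from non-silent intervals (30 dB threshold is arbitrary).
    intervals = librosa.effects.split(y, top_db=30)
    speech_time = sum((end - start) for start, end in intervals) / sr
    total_time = len(y) / sr
    pause_ratio = 1.0 - speech_time / max(total_time, 1e-6)

    return np.array([mean_pitch, jitter, pause_ratio, total_time])

def zscore(features: np.ndarray) -> np.ndarray:
    # Normalize each feature column across the dataset (features: [n_samples, n_feats]).
    return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
```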
Includes 10-fold cross-validation, mixed precision (FP16) training, and advanced learning rate scheduling.
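A minimal sketch of what that training setup could look like, assuming a PyTorch model factory and a dataset that yields (features, label) pairs; the hyperparameters are placeholders, not the project's settings.

```python
# Sketch of the training loop shape: stratified 10-fold CV, FP16 autocast,
# and cosine LR scheduling. `build_model` and `dataset` are supplied by the caller.
import numpy as np
import torch
from torch.cuda.amp import GradScaler, autocast
from torch.utils.data import DataLoader, Subset
from sklearn.model_selection import StratifiedKFold

def train_cv(dataset, labels, build_model, epochs=10, lr=1e-4, device="cuda"):
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros(len(labels)), labels)):
        model = build_model().to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
        scaler = GradScaler()
        loader = DataLoader(Subset(dataset, train_idx), batch_size=8, shuffle=True)

        for epoch in range(epochs):
            model.train()
            for features, targets in loader:          # dataset yields (features, label)
                features, targets = features.to(device), targets.to(device)
                optimizer.zero_grad()
                with autocast():                      # FP16 forward/backward
                    logits = model(features)
                    loss = torch.nn.functional.cross_entropy(logits, targets)
                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()
            scheduler.step()
        # ... evaluate on val_idx and aggregate metrics across the 10 folds ...
```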
Ingests raw audio waveforms (16 kHz), CHAT-formatted transcripts, 6 prosodic features, and 4 paralinguistic features.
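A hedged sketch of that ingestion step, assuming torchaudio for resampling and a simplified regex-based cleanup of the CHAT participant tier (the project's actual preprocessing may differ):

```python
# Illustrative input loading: resample audio to 16 kHz and pull the patient's
# utterances out of a CHAT (.cha) transcript. The CHAT cleanup is a simplification.
import re
import torch
import torchaudio

def load_audio_16k(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    return waveform.mean(dim=0)  # collapse to mono

def load_chat_transcript(path: str) -> str:
    utterances = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("*PAR:"):                 # participant tier only
                text = line.split(":", 1)[1]
                # Drop CHAT annotations, fillers, and time-alignment markers.
                text = re.sub(r"\[.*?\]|&\S+|\x15.*?\x15", " ", text)
                utterances.append(" ".join(text.split()))
    return " ".join(utterances)
```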
Utilizes a frozen wav2vec2-base-960h model for acoustic embeddings and bert-base-uncased for contextual word embeddings.
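Loading and freezing those two checkpoints with Hugging Face transformers might look like the sketch below; the dummy inputs and downstream details are assumptions.

```python
# Load the two encoders named above and freeze them; only the fusion head trains.
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor, BertModel, BertTokenizer

audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

for encoder in (audio_encoder, text_encoder):
    for param in encoder.parameters():
        param.requires_grad = False      # keep the pre-trained weights frozen
    encoder.eval()

# Stand-ins for an actual sample (1 s of 16 kHz audio and a transcript).
waveform = torch.zeros(16000)
transcript = "the boy is reaching for the cookie jar"

with torch.no_grad():
    audio_inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
    acoustic_seq = audio_encoder(**audio_inputs).last_hidden_state       # [1, T_a, 768]
    text_inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    linguistic_seq = text_encoder(**text_inputs).last_hidden_state       # [1, T_t, 768]
```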
Instead of simple concatenation, the model uses stacked Multi-Head Cross-Modal Attention, allowing the audio stream to attend to the text stream and vice versa.
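One such block could be sketched with nn.MultiheadAttention as below; the residual/normalization layout and layer names are assumptions, with three blocks stacked as described.

```python
# Sketch of a single co-attention block: audio queries attend over text keys/values
# and vice versa. The project stacks 3 of these blocks.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, acoustic, linguistic):
        a_attended, _ = self.audio_to_text(acoustic, linguistic, linguistic)
        t_attended, _ = self.text_to_audio(linguistic, acoustic, acoustic)
        acoustic = self.norm_a(acoustic + a_attended)       # residual + norm
        linguistic = self.norm_t(linguistic + t_attended)
        return acoustic, linguistic

co_attention = nn.ModuleList([CoAttentionBlock() for _ in range(3)])
```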
Summarizes the attended sequences via acoustic pooling and linguistic pooling. The two pooled 768-dim vectors are concatenated with the prosodic/paralinguistic features (1546 dimensions in total) and compressed by a dense projection layer.
A final linear layer produces two logits, which a softmax converts into probabilities for Control (0) vs. Dementia (1).
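Putting the last two steps together, a sketch of the fusion-and-classification head might look like the following; the hidden size and dropout are assumptions.

```python
# Fusion head implied by the description: mean-pool both 768-dim streams,
# concatenate with the 6 prosodic + 4 paralinguistic features (768 + 768 + 10 = 1546),
# project, then classify into Control (0) vs. Dementia (1).
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, encoder_dim: int = 768, handcrafted_dim: int = 10, hidden: int = 256):
        super().__init__()
        self.project = nn.Sequential(
            nn.Linear(encoder_dim * 2 + handcrafted_dim, hidden),  # 1546 -> hidden
            nn.ReLU(),
            nn.Dropout(0.3),
        )
        self.classifier = nn.Linear(hidden, 2)

    def forward(self, acoustic_seq, linguistic_seq, handcrafted):
        acoustic_vec = acoustic_seq.mean(dim=1)       # acoustic pooling   -> [B, 768]
        linguistic_vec = linguistic_seq.mean(dim=1)   # linguistic pooling -> [B, 768]
        fused = torch.cat([acoustic_vec, linguistic_vec, handcrafted], dim=-1)
        logits = self.classifier(self.project(fused))
        probs = torch.softmax(logits, dim=-1)         # P(Control), P(Dementia)
        return logits, probs
```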
Develop a lightweight streaming API to process patient audio and provide live diagnostic probabilities during clinical assessments.
Expand the binary classifier to detect and categorize varying stages of cognitive decline.
Build a visualization tool for clinicians that highlights the exact transcript words and audio segments that contributed most to the prediction.
Interested in this project?