Early Alzheimer’s Disease Detection from Short Speech Samples Using Lightweight, Interpretable Linguistic Markers
1. Introduction
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder characterized by impairments in episodic memory, semantic processing, attention, and executive control [1]-[5]. Although clinical diagnosis typically relies on neuropsychological testing, neuroimaging, or cerebrospinal biomarkers, these approaches are costly, invasive, and often detect impairment only after substantial neural damage has occurred [6]-[10]. Detecting cognitive decline at an earlier stage, particularly during the transition from healthy aging to mild cognitive impairment and early AD, remains a critical challenge for effective intervention and monitoring [11]-[16].
Spontaneous speech has emerged as a promising non-invasive biomarker, reflecting distributed cognitive processes such as lexical retrieval, semantic organization, discourse planning, and working memory [17]-[20]. Prior research demonstrates that early-stage AD is associated with frequent lexical pauses, increased fillers, reduced informational density, pronoun overuse, and shorter, syntactically simpler utterances [21]-[27]. Importantly, these changes often manifest months or years before measurable decline on standard neuropsychological assessments. Despite this potential, however, many computational approaches rely on deep neural architectures that are difficult to interpret, computationally expensive, and unsuitable for low-resource clinical settings [28]-[31].
To support real-world adoption, clinicians require models that are transparent, reproducible, and linguistically meaningful [32]-[36]. Therefore, rather than optimizing solely for black-box predictive accuracy, it is essential to develop systems that expose which linguistic behaviours differentiate pathological speech from healthy aging [37]-[40]. In this work, we investigate whether short picture-description recordings and their transcripts can accurately discriminate early AD from cognitively normal older adults using lightweight, interpretable linguistic features and a simple linear classifier. The feature set prioritizes clinically intuitive constructs such as disfluencies, pronoun usage, sentence complexity, readability, and idea density, allowing direct interpretation of the model’s decisions. Our goal is to evaluate whether these measurable linguistic cues provide reliable diagnostic signal and to determine the extent to which a transparent model can approach state-of-the-art performance without sacrificing interpretability.
2. Methodology
2.1. Data and Study Design
The dataset used in this study consists entirely of synthetically generated speech transcripts, created to emulate the spontaneous picture-description narratives typically used in early Alzheimer’s disease assessment (e.g., Cookie Theft-style prompts); no real audio recordings or human participants were involved. Synthetic narratives were produced using large language models configured to simulate linguistic patterns characteristic of early Alzheimer’s disease (AD) and cognitively normal aging, based on patterns reported in prior literature. This synthetic design ensures full reproducibility, avoids privacy and ethical concerns, and provides controlled variation in lexical, syntactic, and discourse-level behaviours. Each sample is a short, unconstrained verbal description elicited by a standardized visual prompt, a setting known to elicit rich lexical and syntactic behaviour while minimizing interviewer-induced bias, and is labelled as either Early AD or Cognitively Normal Control. For the present analysis, the validation set contains 110 independent speech samples, stratified evenly across diagnostic groups (55 Early AD, 55 Control), as reflected in the confusion matrix. The training set contains N = 440 samples (220 Early AD, 220 Control), yielding a total dataset size of N_total = 440 + 110 = 550. No samples were shared between training and validation, ensuring strictly disjoint evaluation.
Each narrative is treated as a standalone observational unit, and all linguistic features are extracted at the narrative level. To avoid inadvertent data leakage, no aggregation across sessions or across multiple narratives from the same source is performed [41]-[46]. When multiple samples originated from the same simulated speaker, they were retained entirely within a single partition (training or validation), ensuring strict subject-disjoint evaluation [47]-[49]. All transcripts were produced and formatted through a single uniform pipeline; applying the same protocol across classes prevents systematic formatting artifacts from confounding diagnosis. Preprocessing steps included lower-casing, removal of non-speech annotations, normalization of punctuation, expansion of contractions, rule-based sentence segmentation, and tokenization. Tokenization and sentence segmentation were implemented using the spaCy v3.6 library (en_core_web_sm model), with supplemental rule-based cleaning performed via NLTK and regex-based preprocessing. All scripts were implemented in Python 3.10. Samples with insufficient lexical content (<5 content-bearing tokens) or missing diagnostic labels were excluded. No demographic variables (age, sex, education, first language) were incorporated into the feature set, eliminating shortcut learning through population differences.
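For concreteness, the sketch below illustrates this preprocessing under the stated tooling (spaCy en_core_web_sm, Python 3.10). The annotation pattern and the five-content-token exclusion rule follow the text; the helper name and the annotation inventory are illustrative rather than the exact scripts used.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # tokenization and sentence segmentation

# Non-speech annotations such as "[pause]" or "[laughter]" are stripped;
# this inventory is an illustrative assumption.
NON_SPEECH = re.compile(r"\[(?:pause|laughter|inaudible)\]", re.IGNORECASE)

def preprocess(transcript: str):
    """Lower-case, remove non-speech tags, normalize whitespace, segment."""
    text = NON_SPEECH.sub(" ", transcript.lower())
    text = re.sub(r"\s+", " ", text).strip()
    doc = nlp(text)
    # Exclusion rule from Section 2.1: fewer than 5 content-bearing tokens.
    if sum(t.is_alpha and not t.is_stop for t in doc) < 5:
        return None
    return doc
```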
2.2. Linguistic Feature Extraction
The analytical objective was not to construct a black-box classifier, but to identify clinically interpretable linguistic behaviours distinguishing early AD from healthy aging. Accordingly, we extracted features grounded in the psycholinguistic and neurolinguistic literature, grouped into the six categories shown in Table 1:
Table 1. Linguistic feature categories, representative measures, and their neurocognitive interpretation.
| Category | Representative Features | Neurocognitive Interpretation |
|---|---|---|
| Disfluencies | pauses per sentence; fillers (um, uh, er); repetition rate | impaired lexical access and disrupted planning |
| Lexical Selection | pronoun ratio; content-word ratio | semantic degradation; reduced specificity |
| Syntactic Complexity | mean sentence length; clause density (when parsable) | impaired working-memory load and sentence planning |
| Readability | Flesch Reading Ease | fragmentation and syntactic breakdown in early AD |
| Idea Density | propositions per 10 words | reduced informational richness and conceptual structure |
| High-coverage Function Tokens | frequency of “a, the, it, then” | stylistic shifts and content impoverishment |
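As an illustration of how the Table 1 categories translate into measurable quantities, the sketch below computes simplified proxies from a preprocessed spaCy Doc. The filler inventory, the POS-based proposition proxy for idea density, and the use of the textstat package for Flesch Reading Ease are assumptions, not necessarily the study's exact operationalizations.

```python
import textstat  # assumed dependency for the Flesch Reading Ease score

FILLERS = {"um", "uh", "er"}  # illustrative filler inventory
# CPIDR-style POS proxy for propositions (an assumption, not the exact method).
PROP_POS = {"VERB", "ADJ", "ADV", "ADP", "CCONJ", "SCONJ"}

def extract_features(doc, raw_text: str) -> dict:
    """Compute simplified proxies for several Table 1 features."""
    tokens = [t for t in doc if not t.is_space]
    n_tok = max(len(tokens), 1)
    n_sent = max(sum(1 for _ in doc.sents), 1)
    return {
        "fillers_per_sentence": sum(t.lower_ in FILLERS for t in tokens) / n_sent,
        # Text-based pause proxy, counted on the raw transcript before
        # annotation stripping (see Section 5 for its limitations).
        "pauses_per_sentence": raw_text.lower().count("[pause]") / n_sent,
        "pronoun_ratio": sum(t.pos_ == "PRON" for t in tokens) / n_tok,
        "content_word_ratio": sum(t.pos_ in {"NOUN", "VERB", "ADJ", "ADV"}
                                  for t in tokens) / n_tok,
        "mean_sentence_length": n_tok / n_sent,
        "idea_density_per_10w": 10 * sum(t.pos_ in PROP_POS for t in tokens) / n_tok,
        "flesch_reading_ease": textstat.flesch_reading_ease(raw_text),
    }
```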
All count-based features were normalized per-sentence or per-token to eliminate length confounds. Outliers were Winsorized at the 1st and 99th percentiles. Missing syllable counts and other lexical attributes were imputed with training-set medians. Every feature was standardized using training-set means and variances, and the identical transformation was applied to validation samples. This feature architecture ensures interpretability: each coefficient corresponds to a linguistically meaningful behaviour, enabling transparent clinical explanation. These relationships later manifest in the coefficient plots and permutation-based robustness analysis.
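A minimal sketch of this normalization follows, assuming feature matrices as NumPy arrays. Imputation is applied before clipping here, which matches the described procedure whenever training-set medians fall within the Winsorization bounds; all statistics are fit on the training set only and reused unchanged for validation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_normalizer(X_train: np.ndarray):
    """Fit Winsorization bounds, median imputation, and z-scoring on training data."""
    lo, hi = np.nanpercentile(X_train, [1, 99], axis=0)  # 1st/99th percentiles
    med = np.nanmedian(X_train, axis=0)                  # per-feature medians
    X = np.clip(np.where(np.isnan(X_train), med, X_train), lo, hi)
    scaler = StandardScaler().fit(X)                     # training means/variances
    return lo, hi, med, scaler

def apply_normalizer(X: np.ndarray, lo, hi, med, scaler):
    """Apply the identical transformation to validation samples."""
    X = np.clip(np.where(np.isnan(X), med, X), lo, hi)
    return scaler.transform(X)
```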
2.3. Classification Model
A logistic regression model with L2 regularization was employed. This classifier was selected deliberately: it produces stable, calibrated probability estimates, minimizes overfitting in low-dimensional settings, and yields a coefficient vector interpretable as log-odds shifts in diagnostic direction [50]-[52]. Logistic regression was selected over other interpretable models such as linear Support Vector Machines (SVMs) or decision trees for several reasons. Linear SVMs optimize margin but do not produce calibrated probabilities needed for clinical decision support, and decision trees are prone to instability and overfitting in low-sample, high-noise linguistic settings. In contrast, logistic regression yields smoothly varying, directly interpretable coefficients and naturally produces well-calibrated probability estimates essential for downstream risk assessment. Hyperparameters were tuned via inner five-fold cross-validation on the training set using negative log-likelihood as the objective function [53]-[58]. Class-weighting was evaluated, but because the validation data were perfectly balanced, the final model employed unweighted classes. Optimization was performed with the lbfgs solver, and convergence was reached reliably across folds.
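The following sketch mirrors this setup with scikit-learn. The L2 penalty, lbfgs solver, inner five-fold cross-validation, and negative log-likelihood objective follow the text; the regularization grid and the `X_train`/`y_train` names are illustrative assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Inner five-fold CV over the L2 strength, scored by negative log-likelihood.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid
search = GridSearchCV(
    LogisticRegression(penalty="l2", solver="lbfgs", max_iter=1000),
    param_grid,
    scoring="neg_log_loss",
    cv=5,
)
search.fit(X_train, y_train)  # features standardized as in Section 2.2
clf = search.best_estimator_
```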
2.4. Evaluation
Performance was assessed along four methodological dimensions:
1) Discriminative ability: Receiver Operating Characteristic (ROC) and Precision-Recall curves were generated, yielding AUC = 1.000 and AP = 1.000 (Figure 1, Figure 2). These threshold-free metrics quantify separability independent of a decision boundary.
2) Threshold-level classification: Applying a fixed 0.5 probability threshold results in perfect classification for both classes (Figure 3), with 100% sensitivity and 100% specificity, an outcome requiring further scrutiny and addressed in the Discussion.
3) Probability calibration: A reliability diagram (Figure 4) compares predicted probabilities with empirical outcome frequencies. The calibration curve lies close to the identity line, indicating that the model’s confidence estimates are well-behaved, not overconfident.
4) Representational geometry: To examine how linguistic features structure the sample space, we applied t-SNE projection (Figure 5), revealing two compact, clearly separated clusters. This separation visually reinforces that the extracted linguistic features encode distinct behavioural signatures of AD and Control speech.
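A minimal sketch of the threshold-free and threshold-level evaluations follows, assuming `clf`, `X_val`, and `y_val` from the training step; calibration and t-SNE sketches appear alongside the corresponding results in Section 3.

```python
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             confusion_matrix, classification_report)

proba = clf.predict_proba(X_val)[:, 1]  # predicted P(Early AD)
print("ROC AUC:", roc_auc_score(y_val, proba))
print("AP:", average_precision_score(y_val, proba))

pred = (proba >= 0.5).astype(int)       # fixed 0.5 decision threshold
print(confusion_matrix(y_val, pred))
print(classification_report(y_val, pred, target_names=["Control", "Early AD"]))
```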
2.5. Interpretability and Robustness Diagnostics
Model interpretability was central to the methodological design. Coefficient magnitudes and signs (Figure 6 and Figure 7) directly quantify how each feature shifts diagnostic likelihood. For robustness, permutation importance was computed (Figure 8) by repeatedly shuffling a single feature and measuring its influence on ROC AUC. The small marginal contribution of individual tokens confirms that classification does not hinge on dataset-specific artifacts. The global sparsity of weights (Figure 9) indicates stability and reduces the likelihood of spurious correlations.
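The sketch below shows one way to compute this diagnostic with scikit-learn's `permutation_importance`; the repeat count and the `feature_names` list are assumptions.

```python
from sklearn.inspection import permutation_importance

# Shuffle one feature at a time and measure the mean drop in ROC AUC.
result = permutation_importance(clf, X_val, y_val, scoring="roc_auc",
                                n_repeats=30, random_state=0)
for name, imp in sorted(zip(feature_names, result.importances_mean),
                        key=lambda pair: -pair[1]):
    print(f"{name}: {imp:.4f}")
```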
3. Results
3.1. Discriminative Performance
The proposed model demonstrates exceptionally strong discriminative ability between Early Alzheimer’s Disease (AD) and cognitively normal controls. The Receiver Operating Characteristic (ROC) curve (Figure 1) exhibits a contour that adheres tightly to the upper-left boundary of the ROC plane, yielding an Area Under the Curve (AUC) of 1.000. Such a configuration indicates complete separability, with every Early AD sample assigned a higher predicted probability than every control sample.
Figure 1. ROC curve.
A complementary Precision-Recall (PR) curve (Figure 2) yields an Average Precision (AP) of 1.000, confirming perfect precision at all observed recall levels. No false positives or false negatives were observed within the validation set.
Figure 2. Precision-Recall curve.
To evaluate threshold-specific diagnostic performance, we applied a fixed decision boundary of 0.50. The resulting confusion matrix (Figure 3) demonstrates 100% accuracy, sensitivity, and specificity, with all 55 Early AD samples and all 55 control samples correctly classified. No misclassifications were recorded.
Figure 3. Confusion matrix.
Although this level of performance exceeds typical results observed in clinical datasets and thus requires careful validation (addressed in Section 4), it confirms that the extracted linguistic markers contain strong discriminative signal within the present sample.
3.2. Probability Calibration
Beyond correct classification, clinically deployed systems must produce reliable probability estimates. To evaluate confidence calibration, predicted probabilities were binned into equal-width intervals and compared with empirical outcome frequencies. The reliability curve (Figure 4) closely follows the identity line, indicating that probabilities produced by the classifier approximate true outcome frequencies: for example, samples receiving a predicted AD probability of approximately 0.8 were diagnosed with AD roughly 80% of the time.
Figure 4. Calibration curve.
This alignment suggests that the model’s output scores reflect true risk rather than overconfident overfitting, a desirable property for clinical triage and decision-support applications.
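The reliability diagram can be reproduced with scikit-learn's `calibration_curve`; the bin count here is an assumption, and `proba`/`y_val` are the validation-set quantities from Section 2.4.

```python
from sklearn.calibration import calibration_curve

# Equal-width probability bins versus empirical outcome frequencies;
# values near the identity line indicate well-calibrated confidence.
frac_pos, mean_pred = calibration_curve(y_val, proba, n_bins=10,
                                        strategy="uniform")
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```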
3.3. Structure of the Linguistic Feature Space
To examine whether AD- and control-associated linguistic behaviours form separable patterns in feature space, we projected the standardized feature vectors into two dimensions using t-distributed stochastic neighbour embedding (t-SNE). The resulting projection (Figure 5) reveals two compact, non-overlapping clusters, with Early AD samples forming a distinct region separable from controls.
Figure 5. t-SNE feature-space visualization.
This separation indicates that the selected linguistic features encode coherent and class-specific information, consistent with neurocognitive theories of early AD speech impairment.
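A sketch of the projection follows, assuming the standardized validation matrix `X_val` and integer labels `y_val` as NumPy arrays; the perplexity value is illustrative, and t-SNE geometry is sensitive to it.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project standardized feature vectors to two dimensions.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_val)
for label, marker, name in [(0, "o", "Control"), (1, "s", "Early AD")]:
    pts = emb[y_val == label]
    plt.scatter(pts[:, 0], pts[:, 1], marker=marker, label=name)
plt.legend()
plt.title("t-SNE of standardized linguistic features")
plt.show()
```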
3.4. Interpretable Linguistic Markers
3.4.1. Features Predictive of Early AD
The signed coefficients of the logistic regression model (Figure 6) identify linguistic behaviours that increase the likelihood of Early AD. The strongest positive coefficients correspond to:
1) Pauses per sentence
2) Fillers per sentence
3) Pronoun ratio
4) Lower Flesch Reading Ease
These markers align with established clinical findings: increased pausing and filler use reflect slowed lexical retrieval and disrupted fluency, while elevated pronoun usage and reduced readability suggest loss of semantic specificity and syntactic structure. The convergence of statistical inference and linguistic theory strengthens confidence that the model captures meaningful disease-related behaviour rather than spurious dataset patterns.
Figure 6. Top coefficients favouring early AD.
3.4.2. Features Predictive of Normal Cognition
Conversely, several features strongly predict the Control class, as shown in the negative portion of the coefficient spectrum (Figure 7). The most influential indicators of preserved cognitive function include:
1) Longer mean sentence length
2) Higher idea density
3) Higher content-word ratio
Figure 7. Top coefficients favouring control.
Control participants produce richer, more syntactically complete utterances with denser informational content, patterns consistent with intact lexical retrieval, working memory, and discourse planning.
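Both coefficient rankings can be read directly off the fitted model; a sketch follows, assuming `clf` and a `feature_names` list aligned with the feature matrix columns.

```python
import numpy as np

# Signed coefficients: positive values shift log-odds toward Early AD,
# negative values toward Control.
coefs = clf.coef_.ravel()
order = np.argsort(coefs)

print("Top Control-leaning features:")
for i in order[:3]:
    print(f"  {feature_names[i]}: {coefs[i]:+.3f}")

print("Top Early-AD-leaning features:")
for i in order[-4:][::-1]:
    print(f"  {feature_names[i]}: {coefs[i]:+.3f}")
```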
3.4.3. Robustness to Individual Lexical Artifacts
To assess whether performance was driven by isolated keywords or dataset-specific phrasing, we performed permutation importance analysis (Figure 8). Shuffling any single unigram feature produced negligible degradation in ROC AUC, indicating that the classifier learns broad linguistic structure rather than overfitting to accidental lexical cues.
Figure 8. Permutation importance.
3.4.4. Global Sparsity and Model Stability
The coefficient magnitude distribution (Figure 9) is highly sparse, with a small number of large-effect predictors and many weights near zero. This sparsity facilitates interpretability, reduces the likelihood of unstable multi-collinearity effects, and results in a compact decision rule that can be communicated clearly in clinical settings.
Figure 9. Coefficient distribution.
4. Discussion
4.1. Clinical Interpretation of Linguistic Markers
All speech transcripts in this study are synthetically generated based on linguistic patterns documented in the literature. Although this allows full control of linguistic variability and avoids privacy concerns, it also means that the observed separability (e.g., perfect AUC) may partly reflect the structured nature of synthetic examples rather than real-world clinical variability. Future work must therefore validate the approach on authentic speech data.

The linguistic signature emerging from this model is highly consistent with established neuropathological and psycholinguistic findings in early Alzheimer’s disease. The strongest positive predictors of AD (increased pausing, frequent fillers, elevated pronoun usage, and reduced readability) correspond to well-documented impairments in lexical retrieval, semantic selection, and discourse planning. Pauses and hesitation markers reflect slowed lexical access and impaired self-monitoring of speech. Excessive reliance on pronouns rather than concrete nouns is characteristic of semantic degradation, representing a shift from referential precision toward vague deixis. Similarly, lowered readability and shortened, fragmented sentences mirror reduced working-memory capacity for syntactic maintenance. Conversely, the dominant predictors of normal cognition (longer sentences, higher idea density, and greater content-word ratios) reflect preserved lexical richness, adequate working-memory resources, and intact conceptual structuring. These linguistic markers align with prior reports that early AD selectively targets semantic networks while sparing basic articulation and phonology in the early stages.

Importantly, these signals were extracted solely from text, without acoustic prosody, pitch, speaking rate, or pause-duration measurements. This suggests that automatic transcript-based screening tools may be viable for remote clinical monitoring, telemedicine, or low-resource settings where audio capture is impractical. The interpretability of individual features also provides a transparent basis for clinician-patient communication, enabling the model not only to detect impairment but also to explain why a sample appears cognitively abnormal.
4.2. Interpreting “Perfect” Accuracy: Plausible Signal or Methodological Artifact?
Although the discriminative performance is striking (AUC = 1.000, Average Precision = 1.000, 100% accuracy, and complete separation in ROC and t-SNE space), such perfection is extremely rare in clinical speech-language datasets. These results imply one of two possibilities:
1) the linguistic phenotype of early AD in this dataset is exceptionally separable,
or
2) there exists unintentional data leakage or dataset confounding.
Several potential leakage channels, shown in Table 2, must therefore be considered:
Table 2. Potential sources of data leakage and mechanisms through which they may inflate classification performance.
| Potential Leakage Source | Mechanism |
|---|---|
| Subject overlap | same participant represented in both training and validation |
| ASR or transcription differences | systematic formatting, casing, punctuation, or diarization differences by class |
| Recording environment | microphone type, background noise, clinician prompting cues |
| Narrative duplication | multiple utterances from the same storytelling session split across partitions |
| Metadata leakage | filename patterns, word count, transcript length encoding diagnosis |
Because Figures 1-5 show near-perfect margins and t-SNE reveals visually complete separation, it is statistically more likely that leakage or dataset artifacts contribute to the observed performance than that true clinical separability is absolute. Consequently, these findings should be interpreted cautiously; publication claims cannot rely on this validation alone. Although demographic variables (age, education, sex, linguistic background) were intentionally excluded to prevent shortcut learning, future work could incorporate these features responsibly through post-hoc subgroup analysis rather than as predictive inputs. Evaluating model performance across demographic strata would help identify potential biases, ensure fairness, and guide the design of demographically robust screening tools. Such analyses can highlight whether the linguistic markers captured by the model generalize equally well across population subgroups without reinforcing pre-existing clinical disparities.
4.3. Required Validation to Confirm Genuine Signal
To determine whether the model captures real-cognitive-linguistic pathology or benefits from confounds, rigorous validation is essential:
1) Strict subject-wise cross-validation: No recordings from a single individual may appear across folds. Even minimal cross-speaker leakage can inflate performance dramatically.
2) Utterance-adjacency control: If multiple narrative segments originate from one storytelling session, the entire session must remain within a single partition. Splitting individual utterances between train and test simulates speaker overlap.
3) Text normalization stress testing: Re-evaluate performance after progressively stripping formatting cues (e.g., casing, punctuation, and non-speech annotations, following the channels in Table 2). If accuracy remains near 100% under these perturbations, the classifier is exploiting metadata rather than linguistic structure.
4) Metadata-only leakage probes: Train auxiliary classifiers using only shallow metadata such as filename patterns, word count, and transcript length (see the sketch following this list). If these probes yield AUC > 0.60, dataset artifacts are predictively informative and must be corrected.
5) Frozen-model external validation: The ultimate test is evaluation on an independent, subject-disjoint dataset collected under different recording conditions and transcribed independently. Only consistent cross-site performance would support claims of clinical generalizability.
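As an example of item 4, the sketch below trains a probe on shallow metadata only (transcript length and word count, following Table 2). The 0.60 threshold is the one stated above; `transcripts` and `y` are assumed to be the raw texts and diagnostic labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Probe: can diagnosis be predicted from shallow metadata alone?
meta = np.column_stack([
    [len(t) for t in transcripts],          # character length
    [len(t.split()) for t in transcripts],  # word count
])
probe_scores = cross_val_predict(LogisticRegression(max_iter=1000), meta, y,
                                 cv=5, method="predict_proba")[:, 1]
# AUC markedly above chance (e.g., > 0.60) flags an artifact to correct.
print("Metadata-only AUC:", roc_auc_score(y, probe_scores))
```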
These steps are non-optional for scientific credibility. Given the extraordinary performance observed, rigorous leakage interrogation is not merely advisable but statistically necessary before the model can be interpreted as capturing true linguistic biomarkers of Alzheimer’s disease.
5. Limitations
Despite the promising results, several methodological limitations constrain the generalizability and clinical interpretability of the present findings.

First, all performance metrics are derived from a single stratified validation split, rather than cross-validated or externally validated estimates. Single-split evaluation is known to produce optimistic bias, particularly in small datasets or settings with latent confounding structure [59]-[61]. As such, the reported near-perfect performance likely reflects best-case behaviour rather than a stable estimate of real-world discriminative capacity.

Second, the model does not incorporate demographic covariates such as age, education, or native language. These factors strongly influence lexical richness, syntactic complexity, and pausing behaviour. Without adjusting for them or ensuring demographic balance across diagnostic groups, it remains unclear whether observed linguistic differences arise strictly from neurocognitive decline or from population heterogeneity.

Third, although pauses and fillers contribute meaningfully to classification, these features are derived from textual markers rather than acoustic measurements. Text-based pause proxies (e.g., “[pause]” labels) may not accurately reflect true temporal hesitation structure and can be inconsistently annotated across speakers or transcription systems. Incorporating prosodic features (pause duration, articulation rate, pitch variability, jitter/shimmer) would provide a more faithful representation of motor-speech dynamics.

Fourth, the dataset size is modest, which increases the risk of spurious separability and amplifies vulnerability to data leakage. Perfect separation in ROC, PR, and t-SNE space strongly suggests that the dataset may encode artifacts beyond genuine linguistic pathology. Until validated on larger and more heterogeneous cohorts, the true clinical signal remains uncertain [62]-[64].

Finally, the study focuses exclusively on short picture-description narratives, which capture one domain of spontaneous speech production. Linguistic impairments in Alzheimer’s disease are known to fluctuate depending on task demands, discourse length, and conversational interactivity. Future work should therefore examine longer open-ended speech, dialogue-based elicitation, and longitudinal monitoring to determine whether the observed markers are robust across communicative contexts.
6. Future Work
Several directions are necessary to establish the clinical reliability and translational value of the proposed approach.

First, the model must be evaluated on external, independently collected datasets, ideally from different clinical sites, recording conditions, and transcription pipelines. A frozen-model evaluation on an unseen corpus is the most direct test of generalizability. Success under domain shift would indicate that the learned linguistic markers capture genuine cognitive impairment rather than dataset-specific artifacts.

Second, although this study demonstrates that text alone can yield strong discriminative signal, integrating acoustic-prosodic features such as pause duration, articulation rate, pitch variability, and voice tremor would enable a richer characterization of the speech production mechanisms affected in early Alzheimer’s disease. Acoustic metrics can capture subtle motor-speech and timing impairments not visible in transcripts, and prior work suggests that prosody and lexical content offer complementary diagnostic value.

Third, future investigations should adopt personalized longitudinal modelling, where each individual serves as their own baseline. Speech patterns in early AD progress gradually; within-subject change detection may therefore provide greater sensitivity than cross-sectional classification, while also reducing confounding by education, personality, or dialect. Sequential latent models, Bayesian updating, and mixed-effects frameworks could support this direction.

Fourth, to facilitate real clinical deployment, the system should include explainability and interpretability tools tailored for clinicians and caregivers. Feature attribution reports, natural-language explanations, or interpretable dashboards can help clinicians understand why a particular speech sample is flagged as high-risk, improving trust, transparency, and clinical decision-making.

Finally, a practical deployment pathway lies in telemedicine and remote cognitive monitoring. Integrating automatic speech recognition (ASR) with this linguistic pipeline could enable smartphone- or tablet-based screening in home environments, with minimal patient burden and no clinician supervision. Real-time automatic transcription and scoring may support low-cost longitudinal monitoring, early detection, and timely referral to specialist assessment.
7. Conclusion
This work presents a transparent, linguistically interpretable model for classifying early Alzheimer’s disease using short, spontaneous speech samples. By leveraging clinically meaningful features (pausing behaviour, filler frequency, pronoun usage, sentence complexity, and idea density), the model achieves near-perfect discrimination on the present validation set. The decision profile is neurocognitively plausible: linguistic markers associated with semantic degradation and impaired lexical retrieval show strong positive association with Early AD, whereas richer, syntactically structured, and informationally dense language strongly predicts normal cognition. Importantly, these signals are derived entirely from text-based features, requiring no specialized sensors, laboratory infrastructure, or acoustic analysis, which positions this approach as a potentially scalable tool for remote or low-resource assessment.

However, the level of performance observed (AUC = 1.000, AP = 1.000, no misclassifications, and fully separated clusters in t-SNE space) is exceedingly rare in real-world clinical data. Such results warrant cautious interpretation and necessitate rigorous validation to rule out methodological artifacts or data leakage. As outlined in Section 4, subject-disjoint cross-validation, normalization stress tests, metadata leakage probes, and independent external testing are essential steps before these findings can be considered reliable.

If performance remains robust under these stringent conditions, the proposed approach offers a compelling path toward explainable, low-cost screening and longitudinal monitoring of cognitive decline. The model’s interpretability makes it suitable not only for diagnostic support but also for transparent communication of linguistic changes to clinicians, caregivers, and patients. Ultimately, text-based neurocognitive assessment may complement traditional clinical workflows by enabling early detection, more frequent monitoring, and improved accessibility in both clinical and telemedicine contexts.