# Leakage and evaluation
Subject-level data leakage is the single most common reason that an EEG decoder claims a high cross-validated accuracy and then collapses on a new participant. It is also the reason a tutorial that scores 0.95 on a random window split can score 0.55 on a subject-disjoint split using the same data and the same model. This page explains the failure mode, why it is specific to physiological signals, and how the within-subject / cross-session / cross-subject distinction maps onto the splitters EEGDash provides.
The advice below is consistent with Tip 9 of Cisotto & Chicco (2024) [1], which explicitly identifies subject-aware cross-validation as the only defensible default for clinical EEG. Tutorials that demonstrate the problem live in *How do I split EEG data without subject leakage?*, *When is within-subject decoding the right scientific question?*, *How well does an EEG decoder generalise to a never-seen subject?*, and *How much does a within-session decoder drift across sessions of the same subject?*.
## Why subject leakage destroys generalisation claims
EEG signals carry strong, idiosyncratic, subject-specific statistics: skull thickness, hair impedance, electrode placement, baseline alpha power, blink habits, posture. A neural network with a few thousand free parameters easily latches onto those identity features because they generalise perfectly within a subject — every window from subject A “smells like” subject A.
Now imagine a binary decoder for “eyes open vs. eyes closed”. You shuffle all windows across all subjects and split 80/20 randomly. The classifier quickly discovers that a few windows from each subject are in the test set, and its best strategy is to memorise the spectral fingerprint of subject A and reuse it for the held-out windows from subject A. It then reports an apparent accuracy of, say, 0.94. Unfortunately, this number conflates subject identification with the actual condition classification — and on a fresh participant the model may do no better than chance.
The cleanest demonstration of this is to train the same architecture on two splits of the same dataset: a leaky random window split and a subject-disjoint split. The accuracy gap is the leakage tax, and it is typically 0.20–0.40 in absolute accuracy on real EEG decoding problems.
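This comparison can be sketched with scikit-learn alone, on synthetic data in which each “subject” has a strong identity fingerprint and a subject-specific label effect (both hypothetical constructions for illustration, not part of EEGDash). A nearest-neighbour classifier makes the memorisation mechanism explicit:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects, n_windows, n_features = 20, 40, 16

X, y, groups = [], [], []
for s in range(n_subjects):
    fingerprint = rng.normal(0.0, 3.0, n_features)  # subject identity
    label_dir = rng.normal(0.0, 1.5, n_features)    # subject-specific label effect
    for _ in range(n_windows):
        lab = int(rng.integers(0, 2))
        X.append(fingerprint + lab * label_dir + rng.normal(0.0, 0.5, n_features))
        y.append(lab)
        groups.append(s)
X, y, groups = np.asarray(X), np.asarray(y), np.asarray(groups)

clf = KNeighborsClassifier(n_neighbors=5)
# Leaky: windows from the same subject land on both sides of the split.
leaky = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean()
# Subject-disjoint: the grouped splitter keeps every subject in one fold.
clean = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5), groups=groups).mean()
print(f"random-window split: {leaky:.2f}   subject-disjoint split: {clean:.2f}")
```

On the random split the nearest neighbours of a test window are the same subject's training windows, so the score is high; on the grouped split they come from other subjects and the score collapses toward chance.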
## Why random window splits are unsafe
A window is a short, overlapping slice of a recording. If you make 2-second windows with 50% overlap, neighbouring windows share a full second of samples; their feature vectors differ only by smoothing. When you assign one to “train” and the other to “test”, the test score largely measures how well the model reproduces samples it has effectively already seen, not how well it generalises.
This problem exists on top of the subject leakage problem: even within a single subject, randomising windows leaks information across the train/test boundary because the windows overlap in time. The mitigation is twofold:

1. Split at the recording or session level, not the window level.
2. If a single recording must be split, choose a splitter that respects temporal contiguity (e.g., the first 80% by time for train, the last 20% for test).
EEGDash defers the actual splitting to braindecode and MOABB, but the conceptual rule is the same regardless of library: a window must inherit the train/test label of its parent recording, never be assigned independently.
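A minimal sketch of the temporal-contiguity mitigation, assuming plain NumPy and a hypothetical `contiguous_split` helper (not an EEGDash or braindecode API). It also purges test windows that overlap the last training window, so no raw samples cross the boundary:

```python
import numpy as np

def contiguous_split(starts, win_len, train_frac=0.8):
    """Earliest windows train, latest windows test, with a purge at the boundary."""
    order = np.argsort(starts)
    cut = int(len(order) * train_frac)
    train, test = order[:cut], order[cut:]
    # Drop test windows that overlap the last training window in time.
    boundary = starts[train[-1]] + win_len
    return train, test[starts[test] >= boundary]

# 2-second windows with 50% overlap: one window start every second.
starts = np.arange(0.0, 20.0, 1.0)
train_idx, test_idx = contiguous_split(starts, win_len=2.0)
```

Here the window starting at t = 16 s overlaps the last training window (t = 15–17 s) and is discarded, leaving a test set that shares no samples with the training set.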
## Within-subject vs. cross-session vs. cross-subject
These three terms describe what kind of generalisation you are claiming to measure. They differ in which axis the held-out fold spans:
- **Within-subject** evaluation holds out time within a single participant. Train on the first portion of a recording, test on the last portion. Answer: can the model decode this person’s signal later in the same session? This is the easiest setting and the one most clinical BCI demos report.
- **Cross-session** evaluation holds out a different session of the same participant. Train on session 1, test on session 2 (typically collected on a different day, with re-applied electrodes). Answer: does the model survive electrode re-application and day-to-day drift? This is the relevant setting for repeated-use BCIs and for any real-world deployment where calibration is rare.
- **Cross-subject** evaluation holds out different participants. Train on subjects A–T, test on subjects U–Z. Answer: does the model generalise to a person it has never seen? This is the standard for any “subject-invariant” or “foundation-model” claim.
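Assuming a metadata table shaped like the `ds.description` frame used elsewhere on this page (the values here are synthetic), the latter two settings reduce to different masks over that table:

```python
import pandas as pd

# Hypothetical per-recording metadata, one row per recording.
meta = pd.DataFrame({
    "subject": ["A", "A", "B", "B", "C", "C"],
    "session": [1, 2, 1, 2, 1, 2],
})

# Cross-session: every subject's session 1 trains, session 2 tests.
cs_train = meta[meta["session"] == 1]
cs_test = meta[meta["session"] == 2]

# Cross-subject: whole participants are held out.
held_out = {"C"}
xs_train = meta[~meta["subject"].isin(held_out)]
xs_test = meta[meta["subject"].isin(held_out)]
assert set(xs_train["subject"]).isdisjoint(xs_test["subject"])
```

Within-subject splitting is the only one of the three that operates inside a recording, which is why it needs the temporal-contiguity rule from the previous section rather than a metadata mask.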
Each setting answers a different scientific question, so no single one is universally correct. The mistake is to report one and implicitly claim another. A paper that splits randomly and then advertises a “general-purpose decoder” is overstating the evaluation; a paper that holds out a session and accurately calls it cross-session is doing honest work even if the number is lower.
## Practical guidance
- Always inspect `ds.description["subject"]` and `ds.description["session"]` before choosing a splitter. If any subject appears in more than one fold, the split is leaky by construction.
- Treat the split function as part of the experiment, not a utility. Print, log, and version-control the participants in each fold. Tutorials such as *How do I split EEG data without subject leakage?* include an audit step that verifies disjointness.
- Pick a splitter that matches your scientific question — within-subject, cross-session, or cross-subject. If you cannot decide, default to cross-subject; it is the strictest of the three and rarely misleading.
- Always include a chance-level baseline and, where possible, a simple feature baseline (see Features vs. deep learning). A neural network that beats random by 3 points but loses to a logistic regression on band power has not learned the task.
- Report variance across folds, not just the mean. Subject-level variance dominates EEG; a mean accuracy with no error bars is not a measurement.
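The audit, chance baseline, and fold-variance advice above fit in a few lines of scikit-learn. The features and labels here are random stand-ins; the pattern, not the numbers, is the point:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # stand-in window features
y = rng.integers(0, 2, 200)             # stand-in labels
groups = np.repeat(np.arange(10), 20)   # 10 subjects, 20 windows each

cv = GroupKFold(n_splits=5)
# Audit: every fold must keep subjects disjoint between train and test.
for train_idx, test_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, groups=groups)
chance = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=cv, groups=groups)
print(f"model:  {scores.mean():.2f} ± {scores.std():.2f} across folds")
print(f"chance: {chance.mean():.2f} ± {chance.std():.2f} across folds")
```

Logging the per-fold `scores` array alongside the subject lists makes the split reproducible and the error bars honest.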
## What “metric leakage” looks like in practice
A few diagnostic patterns you should watch for:
- **Suspiciously high accuracy on hard problems.** A decoder for emotional state from 30 seconds of resting EEG that scores 0.92 is almost certainly leaking subject identity.
- **Accuracy drops on new subjects.** A 25-point drop between cross-validated and held-out cohorts is a leakage signal, not an overfitting signal.
- **Random labels still score above chance.** If you shuffle the labels per recording but keep them constant within a recording, a leaky pipeline still scores well above chance because it is fitting recording identity.
When you see one of those, re-read this page and re-check the splitter.
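The third diagnostic is easy to automate. A sketch with a hypothetical `permute_recording_labels` helper (not an EEGDash API) that shuffles labels across recordings while keeping them constant within each:

```python
import numpy as np

def permute_recording_labels(y, recording_ids, rng):
    """Shuffle labels across recordings, constant within each recording."""
    recs = np.unique(recording_ids)
    rec_labels = np.array([y[recording_ids == r][0] for r in recs])
    shuffled = rng.permutation(rec_labels)
    y_perm = np.empty_like(y)
    for r, lab in zip(recs, shuffled):
        y_perm[recording_ids == r] = lab
    return y_perm

rng = np.random.default_rng(0)
recording_ids = np.repeat(np.arange(6), 10)   # 6 recordings, 10 windows each
y = np.repeat([0, 1, 0, 1, 0, 1], 10)         # one label per recording
y_perm = permute_recording_labels(y, recording_ids, rng)
# Retrain on (X, y_perm): a clean pipeline falls to ~chance;
# a leaky one still scores high, because it is fitting recording identity.
```

The permutation preserves the label balance across recordings, so any remaining above-chance score cannot come from the task itself.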
## Further reading
- Saeb, S., Lonini, L., Jayaraman, A., Mohr, D. C., & Kording, K. P. (2017). The need to approximate the use-case in clinical machine learning. GigaScience, 6(5), 1–9. https://doi.org/10.1093/gigascience/gix019
- Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T. H., & Faubert, J. (2019). Deep learning-based electroencephalography analysis: a systematic review. Journal of Neural Engineering, 16(5), 051001. https://doi.org/10.1088/1741-2552/ab260c
- Pernet, C. R., et al. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6(1), 103. https://doi.org/10.1038/s41597-019-0104-8