# Leakage and evaluation
Subject-level data leakage is the single most common reason that an EEG decoder claims a high cross-validated accuracy and then collapses on a new participant. It is also the reason a tutorial that scores 0.95 on a random window split can score 0.55 on a subject-disjoint split using the same data and the same model. This page explains the failure mode, why it is specific to physiological signals, and how the within-subject / cross-session / cross-subject distinction maps onto the splitters EEGDash provides.
The advice below is consistent with Tip 9 of Cisotto & Chicco (2024) [1], which explicitly identifies subject-aware cross-validation as the only defensible default for clinical EEG. Tutorials that demonstrate the problem live in *How do I split EEG data without subject leakage?*, *When is within-subject decoding the right scientific question?*, *How well does an EEG decoder generalise to a never-seen subject?*, and *How much does a within-session decoder drift across sessions of the same subject?*.
## Why subject leakage destroys generalisation claims
EEG signals carry strong, idiosyncratic, subject-specific statistics: skull thickness, hair impedance, electrode placement, baseline alpha power, blink habits, posture. A neural network with a few thousand free parameters easily latches onto those identity features because they generalise perfectly within a subject — every window from subject A “smells like” subject A.
Now imagine a binary decoder for “eyes open vs. eyes closed”. You shuffle all windows across all subjects and split 80/20 randomly. The classifier quickly discovers that a few windows from each subject are in the test set, and its best strategy is to memorise the spectral fingerprint of subject A and reuse it for the held-out windows from subject A. It then reports an apparent accuracy of, say, 0.94. Unfortunately, this number conflates subject identification with the actual condition classification — and on a fresh participant the model may do no better than chance.
The cleanest demonstration of this is to train the same architecture on two splits of the same dataset: a leaky random window split and a subject-disjoint split. The accuracy gap is the leakage tax, and it is typically 0.20–0.40 in absolute accuracy on real EEG decoding problems.
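This comparison can be sketched with scikit-learn alone, on synthetic data in which each “subject” has a strong identity fingerprint and a subject-specific label effect (both hypothetical constructions for illustration, not part of EEGDash). A nearest-neighbour classifier makes the memorisation mechanism explicit:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects, n_windows, n_features = 20, 40, 16

X, y, groups = [], [], []
for s in range(n_subjects):
    fingerprint = rng.normal(0.0, 3.0, n_features)  # subject identity
    label_dir = rng.normal(0.0, 1.5, n_features)    # subject-specific label effect
    for _ in range(n_windows):
        lab = int(rng.integers(0, 2))
        X.append(fingerprint + lab * label_dir + rng.normal(0.0, 0.5, n_features))
        y.append(lab)
        groups.append(s)
X, y, groups = np.asarray(X), np.asarray(y), np.asarray(groups)

clf = KNeighborsClassifier(n_neighbors=5)
# Leaky: windows from the same subject land on both sides of the split.
leaky = cross_val_score(clf, X, y, cv=KFold(5, shuffle=True, random_state=0)).mean()
# Subject-disjoint: the grouped splitter keeps every subject in one fold.
clean = cross_val_score(clf, X, y, cv=GroupKFold(n_splits=5), groups=groups).mean()
print(f"random-window split: {leaky:.2f}   subject-disjoint split: {clean:.2f}")
```

On the random split the nearest neighbours of a test window are the same subject's training windows, so the score is high; on the grouped split they come from other subjects and the score collapses toward chance.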
## Why random window splits are unsafe
A window is a short, overlapping slice of a recording. If you make 2-second windows with 50% overlap, neighbouring windows share a full second of samples; their feature vectors differ only by smoothing. When you assign one to “train” and the other to “test”, the test score largely measures how well the model reproduces samples it has effectively already seen, not how well it generalises.
This problem exists on top of the subject leakage problem: even within a single subject, randomising windows leaks information across the train/test boundary because the windows overlap in time. The mitigation is twofold:

1. Split at the recording or session level, not the window level.
2. If a single recording must be split, choose a splitter that respects temporal contiguity (e.g., the first 80% by time for train, the last 20% for test).
EEGDash defers the actual splitting to braindecode and MOABB, but the conceptual rule is the same regardless of library: a window must inherit the train/test label of its parent recording, never be assigned independently.
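A minimal sketch of the temporal-contiguity mitigation, assuming plain NumPy and a hypothetical `contiguous_split` helper (not an EEGDash or braindecode API). It also purges test windows that overlap the last training window, so no raw samples cross the boundary:

```python
import numpy as np

def contiguous_split(starts, win_len, train_frac=0.8):
    """Earliest windows train, latest windows test, with a purge at the boundary."""
    order = np.argsort(starts)
    cut = int(len(order) * train_frac)
    train, test = order[:cut], order[cut:]
    # Drop test windows that overlap the last training window in time.
    boundary = starts[train[-1]] + win_len
    return train, test[starts[test] >= boundary]

# 2-second windows with 50% overlap: one window start every second.
starts = np.arange(0.0, 20.0, 1.0)
train_idx, test_idx = contiguous_split(starts, win_len=2.0)
```

Here the window starting at t = 16 s overlaps the last training window (t = 15–17 s) and is discarded, leaving a test set that shares no samples with the training set.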
## Within-subject vs. cross-session vs. cross-subject
These three terms describe what kind of generalisation you are claiming to measure. They differ in which axis the held-out fold spans:
- **Within-subject** evaluation holds out time within a single participant. Train on the first portion of a recording, test on the last portion. Answer: can the model decode this person’s signal later in the same session? This is the easiest setting and the one most clinical BCI demos report.
- **Cross-session** evaluation holds out a different session of the same participant. Train on session 1, test on session 2 (typically collected on a different day, with re-applied electrodes). Answer: does the model survive electrode re-application and day-to-day drift? This is the relevant setting for repeated-use BCIs and for any real-world deployment where calibration is rare.
- **Cross-subject** evaluation holds out different participants. Train on subjects A–T, test on subjects U–Z. Answer: does the model generalise to a person it has never seen? This is the standard for any “subject-invariant” or “foundation-model” claim.
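Assuming a metadata table shaped like the `ds.description` frame used elsewhere on this page (the values here are synthetic), the latter two settings reduce to different masks over that table:

```python
import pandas as pd

# Hypothetical per-recording metadata, one row per recording.
meta = pd.DataFrame({
    "subject": ["A", "A", "B", "B", "C", "C"],
    "session": [1, 2, 1, 2, 1, 2],
})

# Cross-session: every subject's session 1 trains, session 2 tests.
cs_train = meta[meta["session"] == 1]
cs_test = meta[meta["session"] == 2]

# Cross-subject: whole participants are held out.
held_out = {"C"}
xs_train = meta[~meta["subject"].isin(held_out)]
xs_test = meta[meta["subject"].isin(held_out)]
assert set(xs_train["subject"]).isdisjoint(xs_test["subject"])
```

Within-subject splitting is the only one of the three that operates inside a recording, which is why it needs the temporal-contiguity rule from the previous section rather than a metadata mask.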
Each setting answers a different scientific question, so no single one is universally correct. The mistake is to report one and implicitly claim another. A paper that splits randomly and then advertises a “general-purpose decoder” is overstating the evaluation; a paper that holds out a session and accurately calls it cross-session is doing honest work even if the number is lower.
## Practical guidance
- Always inspect `ds.description["subject"]` and `ds.description["session"]` before choosing a splitter. If any subject appears in more than one fold, the split is leaky by construction.
- Treat the split function as part of the experiment, not a utility. Print, log, and version-control the participants in each fold. Tutorials such as *How do I split EEG data without subject leakage?* include an audit step that verifies disjointness.
- Pick a splitter that matches your scientific question — within-subject, cross-session, or cross-subject. If you cannot decide, default to cross-subject; it is the strictest of the three and rarely misleading.
- Always include a chance-level baseline and, where possible, a simple feature baseline (see Features vs. deep learning). A neural network that beats random by 3 points but loses to a logistic regression on band power has not learned the task.
- Report variance across folds, not just the mean. Subject-level variance dominates EEG; a mean accuracy with no error bars is not a measurement.
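The audit, chance baseline, and fold-variance advice above fit in a few lines of scikit-learn. The features and labels here are random stand-ins; the pattern, not the numbers, is the point:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))           # stand-in window features
y = rng.integers(0, 2, 200)             # stand-in labels
groups = np.repeat(np.arange(10), 20)   # 10 subjects, 20 windows each

cv = GroupKFold(n_splits=5)
# Audit: every fold must keep subjects disjoint between train and test.
for train_idx, test_idx in cv.split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv, groups=groups)
chance = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=cv, groups=groups)
print(f"model:  {scores.mean():.2f} ± {scores.std():.2f} across folds")
print(f"chance: {chance.mean():.2f} ± {chance.std():.2f} across folds")
```

Logging the per-fold `scores` array alongside the subject lists makes the split reproducible and the error bars honest.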
## What “metric leakage” looks like in practice
A few diagnostic patterns you should watch for:
- **Suspiciously high accuracy on hard problems.** A decoder for emotional state from 30 seconds of resting EEG that scores 0.92 is almost certainly leaking subject identity.
- **Accuracy drops on new subjects.** A 25-point drop between cross-validated and held-out cohorts is a leakage signal, not an overfitting signal.
- **Random labels still score above chance.** If you shuffle the labels per recording but keep them constant within a recording, a leaky pipeline still scores well above chance because it is fitting recording identity.
When you see one of those, re-read this page and re-check the splitter.
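The third diagnostic is easy to automate. A sketch with a hypothetical `permute_recording_labels` helper (not an EEGDash API) that shuffles labels across recordings while keeping them constant within each:

```python
import numpy as np

def permute_recording_labels(y, recording_ids, rng):
    """Shuffle labels across recordings, constant within each recording."""
    recs = np.unique(recording_ids)
    rec_labels = np.array([y[recording_ids == r][0] for r in recs])
    shuffled = rng.permutation(rec_labels)
    y_perm = np.empty_like(y)
    for r, lab in zip(recs, shuffled):
        y_perm[recording_ids == r] = lab
    return y_perm

rng = np.random.default_rng(0)
recording_ids = np.repeat(np.arange(6), 10)   # 6 recordings, 10 windows each
y = np.repeat([0, 1, 0, 1, 0, 1], 10)         # one label per recording
y_perm = permute_recording_labels(y, recording_ids, rng)
# Retrain on (X, y_perm): a clean pipeline falls to ~chance;
# a leaky one still scores high, because it is fitting recording identity.
```

The permutation preserves the label balance across recordings, so any remaining above-chance score cannot come from the task itself.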
## Further reading
- Saeb, S., Lonini, L., Jayaraman, A., Mohr, D. C., & Kording, K. P. (2017). The need to approximate the use-case in clinical machine learning. GigaScience, 6(5), 1–9. https://doi.org/10.1093/gigascience/gix019
- Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T. H., & Faubert, J. (2019). Deep learning-based electroencephalography analysis: a systematic review. Journal of Neural Engineering, 16(5), 051001. https://doi.org/10.1088/1741-2552/ab260c
- Pernet, C. R., et al. (2019). EEG-BIDS, an extension to the brain imaging data structure for electroencephalography. Scientific Data, 6(1), 103. https://doi.org/10.1038/s41597-019-0104-8