.. _concepts-leakage-and-evaluation:

Leakage and evaluation
======================

Subject-level data leakage is the single most common reason that an EEG decoder
claims a high cross-validated accuracy and then collapses on a new participant.
It is also the reason a tutorial that scores 0.95 on a random window split can
score 0.55 on a subject-disjoint split using the *same* data and the *same*
model. This page explains the failure mode, why it is so severe for
physiological signals, and how the within-subject / cross-session /
cross-subject distinction maps onto the splitters EEGDash provides.

The advice below is consistent with Tip 9 of Cisotto & Chicco (2024) [1]_,
which explicitly identifies subject-aware cross-validation as the only
defensible default for clinical EEG. Tutorials that demonstrate the problem
live in
:doc:`/generated/auto_examples/tutorials/10_core_workflow/plot_11_leakage_safe_split`,
:doc:`/generated/auto_examples/tutorials/50_evaluation/plot_50_within_subject_evaluation`,
:doc:`/generated/auto_examples/tutorials/50_evaluation/plot_51_cross_subject_evaluation`, and
:doc:`/generated/auto_examples/tutorials/50_evaluation/plot_52_cross_session_evaluation`.

Why subject leakage destroys generalization claims
--------------------------------------------------

EEG signals carry strong, idiosyncratic, *subject-specific* statistics: skull
thickness, scalp-electrode impedance, electrode placement, baseline alpha
power, blink habits, posture. A neural network with a few thousand free
parameters easily latches onto these identity features because they generalise
perfectly within a subject: every window from subject A "smells like"
subject A.

Now imagine a binary decoder for "eyes open vs. eyes closed". You shuffle all
windows across all subjects and split 80/20 at random. Because every subject
now contributes windows to both folds, the classifier's best strategy is to
memorise each subject's spectral fingerprint and reuse it for that subject's
held-out windows. It then reports an apparent accuracy of, say, 0.94.
Unfortunately, that number conflates subject *identification* with the
condition classification it claims to measure, and on a fresh participant the
model may do no better than chance.

The cleanest demonstration is to train the same architecture on two splits of
the same dataset: a leaky random window split and a subject-disjoint split.
The accuracy gap is the leakage tax, and it is typically 0.20–0.40 in absolute
accuracy on real EEG decoding problems.

Why random window splits are unsafe
-----------------------------------

A *window* is a short slice of a recording, usually cut with overlap. If you
make 2-second windows with 50% overlap, neighbouring windows share a full
second of samples and their feature vectors are nearly identical. When you
assign one to "train" and its neighbour to "test", the test score mostly
measures memorisation of shared samples, not generalisation. This problem sits
on top of subject leakage: even within a single subject, randomising windows
leaks information across the train/test boundary because the windows overlap
in time.

The mitigation is twofold:

1. Split at the **recording or session level**, not the window level (a
   minimal sketch follows this list).
2. If a single recording must be split, choose a splitter that respects
   temporal contiguity (e.g., the first 80% by time for train, the last 20%
   for test).
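As a minimal sketch of rule 1, assuming you already have a window array, a
label vector, and a parallel array of subject IDs (all placeholder names, not
the EEGDash or braindecode API), scikit-learn's ``GroupShuffleSplit`` assigns
whole groups, here subjects, to a single fold, so every window inherits the
fold of its parent recording:

.. code-block:: python

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    # Placeholder data: 1000 two-second windows (22 channels, 256 samples)
    # from 20 subjects.  In a real pipeline X, y and `subjects` come from
    # your windowing step and the recording metadata.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 22, 256))
    y = rng.integers(0, 2, size=1000)           # e.g. eyes open vs. closed
    subjects = rng.integers(0, 20, size=1000)   # parent subject of each window

    # Subject-disjoint split: whole subjects go to train or test, never both.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(X, y, groups=subjects))

    # Audit step: the two folds must share no subjects.
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])

The same mechanism covers recording- or session-level splits: pass those IDs
as ``groups`` instead of the subject IDs.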
EEGDash defers the actual splitting to braindecode and MOABB, but the
conceptual rule is the same regardless of library: a window must inherit the
train/test assignment of its parent recording, never receive one
independently.

Within-subject vs. cross-session vs. cross-subject
---------------------------------------------------

These three terms describe what kind of generalisation you are claiming to
measure. They differ in which axis the held-out fold spans:

- **Within-subject** evaluation holds out *time* within a single participant.
  Train on the first portion of a recording, test on the last portion.
  Answer: *can the model decode this person's signal later in the same
  session?* This is the easiest setting and the one most clinical BCI demos
  report.
- **Cross-session** evaluation holds out a *different session* of the same
  participant. Train on session 1, test on session 2 (typically collected on
  a different day, with re-applied electrodes). Answer: *does the model
  survive electrode re-application and day-to-day drift?* This is the relevant
  setting for repeated-use BCIs and for any real-world deployment where
  calibration is rare.
- **Cross-subject** evaluation holds out *different participants*. Train on
  subjects A–T, test on subjects U–Z. Answer: *does the model generalise to a
  person it has never seen?* This is the standard for any "subject-invariant"
  or "foundation-model" claim.

Each setting answers a different scientific question, so no single one is
universally correct. The mistake is to *report* one and *implicitly claim*
another. A paper that splits randomly and then advertises a "general-purpose
decoder" is overstating its evaluation; a paper that holds out a session and
accurately calls the result cross-session is doing honest work even if the
number is lower.

Practical guidance
------------------

1. Always inspect ``ds.description["subject"]`` and ``ds.description["session"]``
   before choosing a splitter. If any subject appears in more than one fold,
   the split is leaky by construction.
2. Treat the split function as part of the experiment, not a utility. Print,
   log, and version-control the participants in each fold. Tutorials such as
   :doc:`/generated/auto_examples/tutorials/10_core_workflow/plot_11_leakage_safe_split`
   include an audit step that verifies disjointness.
3. Pick a splitter that matches your scientific question: within-subject,
   cross-session, or cross-subject. If you cannot decide, default to
   cross-subject; it is the strictest of the three and rarely misleading.
4. Always include a chance-level baseline and, where possible, a simple
   feature baseline (see :doc:`features_vs_deep_learning`). A neural network
   that beats chance by three points but loses to a logistic regression on
   band power has not learned the task.
5. Report variance across folds, not just the mean. Subject-level variance
   dominates EEG results; a mean accuracy with no error bars is not a
   measurement.

What "metric leakage" looks like in practice
--------------------------------------------

A few diagnostic patterns to watch for:

- **Suspiciously high accuracy on hard problems.** A decoder for emotional
  state from 30 seconds of resting EEG that scores 0.92 is almost certainly
  leaking subject identity.
- **A large accuracy drop on new subjects.** A 25-point gap between the
  cross-validated estimate and a held-out cohort is a leakage signal, not
  merely an overfitting signal.
- **Random labels still score above chance.** If you shuffle the labels per
  recording but keep them constant within a recording, a leaky pipeline still
  scores well above chance because it is fitting recording identity (a sketch
  of this control appears below).

When you see one of these, re-read this page and re-check the splitter.
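The third diagnostic can be run as an explicit control. The helper below is a
rough sketch (the function name and arguments are invented for illustration):
it gives every recording one randomly drawn label, constant within the
recording, so no genuine task signal remains and a leakage-safe pipeline
should fall to chance.

.. code-block:: python

    import numpy as np

    def shuffle_labels_per_recording(labels, recording_ids, seed=0):
        """Assign each recording a single randomly drawn label, constant
        within the recording, as a leakage control."""
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        recording_ids = np.asarray(recording_ids)
        shuffled = np.empty_like(labels)
        for rec in np.unique(recording_ids):
            shuffled[recording_ids == rec] = rng.choice(labels)
        return shuffled

    # Re-run the full pipeline with these labels: a leakage-safe split should
    # score at chance, while a leaky window-level split can still score well
    # above chance by recognising recordings rather than the task.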
Related tutorials
-----------------

- :doc:`/generated/auto_examples/tutorials/10_core_workflow/plot_11_leakage_safe_split`
  is the canonical demonstration of leaky vs. safe splits on EEGDash data.
- :doc:`/generated/auto_examples/tutorials/50_evaluation/plot_50_within_subject_evaluation`,
  :doc:`/generated/auto_examples/tutorials/50_evaluation/plot_51_cross_subject_evaluation`, and
  :doc:`/generated/auto_examples/tutorials/50_evaluation/plot_52_cross_session_evaluation`
  show the same dataset evaluated under each protocol.
- :doc:`/generated/auto_examples/tutorials/50_evaluation/plot_53_learning_curves`
  illustrates how leakage interacts with sample-size effects.
- :doc:`/generated/auto_examples/tutorials/50_evaluation/plot_54_compare_two_pipelines`
  shows how to compare pipelines once a defensible split is in place.

Further reading
---------------

.. [1] Cisotto, G., & Chicco, D. (2024). Ten quick tips for clinical
   electroencephalographic (EEG) data acquisition and signal processing.
   *PeerJ Computer Science*, 10, e2256. https://doi.org/10.7717/peerj-cs.2256

- Saeb, S., Lonini, L., Jayaraman, A., Mohr, D. C., & Kording, K. P. (2017).
  The need to approximate the use-case in clinical machine learning.
  *GigaScience*, 6(5), 1–9. https://doi.org/10.1093/gigascience/gix019
- Roy, Y., Banville, H., Albuquerque, I., Gramfort, A., Falk, T. H., &
  Faubert, J. (2019). Deep learning-based electroencephalography analysis: a
  systematic review. *Journal of Neural Engineering*, 16(5), 051001.
  https://doi.org/10.1088/1741-2552/ab260c
- Pernet, C. R., et al. (2019). EEG-BIDS, an extension to the brain imaging
  data structure for electroencephalography. *Scientific Data*, 6(1), 103.
  https://doi.org/10.1038/s41597-019-0104-8