Age regression from EEG (applied case study)#
Can a feature-based regression head predict a child’s age from a few
seconds of resting-state EEG, on subjects the model has never seen?
This applied case study takes the Healthy Brain Network release
ds005505 (Alexander et al. 2017, doi:10.1038/sdata.2017.181), surfaced
through NEMAR (Delorme et al. 2022, doi:10.1093/nargab/lqac023), wires up
an eegdash.EEGDashDataset query, builds a strict subject-aware
split (Cisotto and Chicco 2024, doi:10.7717/peerj-cs.2256), fits a
sklearn.linear_model.Ridge head on band-power features, and
reports Pearson r, Spearman rho, R^2, and MAE against a median-baseline
predictor. EEG-based brain-age regression has a long line of prior work
(Zoubi et al. 2018, doi:10.3389/fnhum.2018.00461). The headline
question here is not whether we beat the published literature; the
honest one is, does the model beat the train-set median predictor on
never-seen subjects?
# Difficulty: 3-star (advanced applied project)
Learning objectives#
After this case study you will be able to:
load HBN ds005505 via eegdash.EEGDashDataset with age and sex on description_fields.
build a subject-aware split with sklearn.model_selection.GroupKFold keyed on subject.
fit a sklearn.linear_model.Ridge head on a feature table and report r2, MAE, and a median-baseline reference.
read a three-panel diagnostic plate (predicted vs true, per-subject signed residual, error histogram) for an EEG regression.
Requirements#
~30 s on CPU when the gallery uses the offline feature path (default).
Real-data path needs network access to NEMAR plus braindecode[hub].
Prereqs: How do I turn one EEG recording into a PyTorch DataLoader? and How do I push EEGDash features through a scikit-learn Pipeline?
Concept: Leakage and evaluation for the subject-aware evaluation rationale that the figure plate is built around.
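The leakage rationale can be seen in a few lines on toy data (not the study data): with interleaved subjects, a plain sklearn.model_selection.KFold puts windows from the same subject on both sides of every fold, while GroupKFold never does.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold

# Toy illustration: 4 subjects x 3 windows each, interleaved in time.
groups = np.tile([f"sub-{i}" for i in range(4)], 3)
X_toy = np.zeros((len(groups), 1))

def n_leaky_folds(splitter):
    """Count folds where at least one subject sits in both train and test."""
    return sum(
        bool(set(groups[tr]) & set(groups[te]))
        for tr, te in splitter.split(X_toy, groups=groups)
    )

print(n_leaky_folds(KFold(n_splits=4)))       # every fold leaks a subject -> 4
print(n_leaky_folds(GroupKFold(n_splits=4)))  # disjoint by contract -> 0
```

This is the contract the assertion in step 5 enforces on the real split.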
Step 1. Setup, seeds, parametrised cache#
import os
import warnings
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from eegdash.viz import use_eegdash_style
use_eegdash_style()
warnings.simplefilter("ignore", category=FutureWarning)
warnings.simplefilter("ignore", category=UserWarning)
SEED = 42
np.random.seed(SEED)
rng = np.random.default_rng(SEED)
cache_dir = Path(os.environ.get("EEGDASH_CACHE_DIR", Path.cwd() / "eegdash_cache"))
cache_dir.mkdir(parents=True, exist_ok=True)
Step 2. Predict, what do you expect?#
Predict. Constant predictors set the chance level. A model that
always returns the test-set mean scores r2 = 0 by definition; a
constant at the train-set median typically lands at or below that.
Brain-age regression from faint resting EEG is a hard inverse
problem; what r2 do you expect a Ridge head on ~12 features to reach
with only ~12 held-out subjects: 0.10? 0.40? 0.80? Write your guess
before scrolling.
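A quick check of that chance level on made-up ages (the values below are illustrative, not study data): the test-set-mean constant scores exactly zero, and any other constant scores below it.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative test ages; a constant at their mean scores r2 = 0 exactly.
y_test = np.array([6.0, 9.0, 12.0, 15.0, 18.0])
const_mean = np.full_like(y_test, y_test.mean())
print(r2_score(y_test, const_mean))    # 0.0 by definition

# A pretend train-set median that misses the test mean scores below 0.
const_median = np.full_like(y_test, 10.0)
print(r2_score(y_test, const_median))  # negative: worse than the test mean
```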
Step 3. Load HBN ds005505 (real path) or synthesise a feature table (gallery)#
In production, eegdash.EEGDashDataset exposes age and
sex through description_fields so they surface as per-recording
columns. The canonical call is shown below; this case study then
synthesises a feature table with the same column layout so the
rendered gallery runs offline and the smoke test stays fast.
from eegdash import EEGDashDataset
from braindecode.preprocessing import (
Preprocessor, create_fixed_length_windows, preprocess,
)
from braindecode.features import extract_features
ds = EEGDashDataset(
dataset="ds005505",
cache_dir=cache_dir,
description_fields=["subject", "session", "task", "age", "sex"],
)
preprocess(ds, [
Preprocessor("pick_channels", ch_names=["Cz", "Pz", "Oz", "Fz"]),
Preprocessor("resample", sfreq=128),
Preprocessor("filter", l_freq=1.0, h_freq=55.0),
], n_jobs=4)
windows = create_fixed_length_windows(
ds, window_size_samples=256, window_stride_samples=256,
drop_last_window=True, preload=False,
)
features = extract_features(windows, feature_groups=["spectral"])
We mirror the resulting shape: 12 subjects, ~24 windows per subject,
12 features per window (band-power proxies + variance on Cz/Pz/Oz),
plus a subject-level age drawn from the HBN child cohort range
(roughly 6 to 18 years).
N_SUBJECTS, N_WINDOWS = 12, 24
BANDS = ("delta", "theta", "alpha", "beta")
CH_NAMES = ("Cz", "Pz", "Oz")
subject_age = rng.uniform(6.0, 18.0, size=N_SUBJECTS)
subject_sex = rng.choice(["F", "M"], size=N_SUBJECTS)
rows: list[dict] = []
for s in range(N_SUBJECTS):
a = float(subject_age[s])
sx = str(subject_sex[s])
for w in range(N_WINDOWS):
# Per-window features: a faint age signal in alpha / beta plus a
# mild subject-level offset (the part the cross-subject loop must
# not memorise) and a noise floor that keeps R^2 modest.
bias = 0.04 * (s - N_SUBJECTS / 2)
row = {
"subject": f"sub-NDARAA{s:04d}",
"session": "ses-01",
"run": "run-01",
"dataset": "ds005505",
"sample_id": f"sub-NDARAA{s:04d}__w{w:03d}",
"age": a,
"sex": sx,
}
for ch in CH_NAMES:
row[f"var_{ch}"] = float(rng.gamma(2.0, 1.0) + bias)
for band in BANDS:
base = rng.normal(0.0, 1.0)
if band in ("alpha", "beta"):
# Age tracks alpha / beta in resting EEG :cite:`alzoubi2018ageeeg`.
base += 0.18 * (a - 12.0)
row[f"spec_{band}_{ch}"] = float(base + 0.4 * bias)
rows.append(row)
feature_table = pd.DataFrame(rows)
feature_cols = [c for c in feature_table.columns if c.startswith(("var_", "spec_"))]
metadata = feature_table[
["subject", "session", "run", "dataset", "sample_id", "age", "sex"]
].copy()
metadata["target"] = metadata["age"].astype(float)
y = metadata["target"].to_numpy()
X = feature_table[feature_cols].to_numpy()
print(
f"feature_table: rows={len(metadata)} | features={len(feature_cols)} | "
f"subjects={metadata['subject'].nunique()} | "
f"age_range=[{metadata['age'].min():.1f}, {metadata['age'].max():.1f}] yr"
)
assert metadata["age"].notna().all(), "age has NaN rows"
assert pd.api.types.is_float_dtype(metadata["age"]), "age is not float"
Step 4. Build a Ridge regression head on top of features#
A sklearn.preprocessing.StandardScaler then
sklearn.linear_model.Ridge sklearn.pipeline.Pipeline
(Pedregosa et al. 2011, doi:10.5555/1953048.2078195) is the regression
analogue of a logistic head. alpha=1.0 is a defensible default for
a 12-feature table, and random_state=42 keeps the loop byte stable.
def make_regressor() -> Pipeline:
"""Return a fresh ``StandardScaler -> Ridge`` Pipeline."""
return Pipeline(
[
("scaler", StandardScaler()),
("clf", Ridge(alpha=1.0, random_state=SEED)),
]
)
Step 5. Cross-subject 5-fold split, no subject leakage#
Run. sklearn.model_selection.GroupKFold keyed on
subject is the contract: every subject appears in exactly one
test fold. We assert zero subject overlap before scoring, and we
pool the held-out predictions across folds so the diagnostic plate
in step 7 sees every subject once. Subject leakage routinely
produces optimistic accuracies (Cisotto and Chicco 2024 Tip 9), so
the assertion below is non-negotiable.
groups = metadata["subject"].to_numpy()
unique_subjects = np.unique(groups)
n_folds = min(5, len(unique_subjects))
splitter = GroupKFold(n_splits=n_folds)
fold_r2: list[float] = []
fold_mae: list[float] = []
fold_baseline_mae: list[float] = []
held_records: list[tuple[str, str, float, float]] = []
for k, (tr, te) in enumerate(splitter.split(X, y, groups=groups)):
train_subj = set(groups[tr].tolist())
test_subj = set(groups[te].tolist())
overlap = train_subj & test_subj
assert not overlap, f"Fold {k} leaked subjects: {sorted(overlap)}"
pipe = make_regressor().fit(X[tr], y[tr])
y_pred = pipe.predict(X[te])
fold_r2.append(float(r2_score(y[te], y_pred)))
fold_mae.append(float(mean_absolute_error(y[te], y_pred)))
median_pred = float(np.median(y[tr]))
fold_baseline_mae.append(
float(mean_absolute_error(y[te], np.full_like(y[te], median_pred)))
)
for sid, sx, true_val, pred_val in zip(
metadata.loc[te, "subject"].tolist(),
metadata.loc[te, "sex"].tolist(),
y[te].tolist(),
y_pred.tolist(),
):
held_records.append((sid, sx, float(true_val), float(pred_val)))
print(
f"Fold {k}: r2={fold_r2[-1]:+.3f} | mae={fold_mae[-1]:.3f} yr | "
f"baseline_mae={fold_baseline_mae[-1]:.3f} yr"
)
mean_r2 = float(np.mean(fold_r2))
std_r2 = float(np.std(fold_r2, ddof=1))
mean_mae = float(np.mean(fold_mae))
mean_baseline_mae = float(np.mean(fold_baseline_mae))
Step 6. A common mistake, and how to recover#
Run. A frequent slip is wiring a non-numeric age column into
sklearn.linear_model.Ridge. age arrives as strings if a
CSV was loaded without dtype hints (or if a NEMAR sidecar serialises
it as text), and Ridge then refuses to fit. We trigger the failure
on purpose with try / except so the error message and the
recovery sit next to each other in the rendered gallery.
try:
bad_y = metadata["age"].astype(str).to_numpy()
Ridge(alpha=1.0, random_state=SEED).fit(X[:8], bad_y[:8])
except (ValueError, TypeError) as exc:
print(f"Caught {type(exc).__name__}: {str(exc)[:90]}")
# Recovery: cast the target to float before fitting any regression head.
fixed_y = pd.to_numeric(metadata["age"], errors="coerce").to_numpy()
Ridge(alpha=1.0, random_state=SEED).fit(X[:8], fixed_y[:8])
print(f"Recovery: cast age to float (dtype={fixed_y.dtype}); Ridge fit.")
Step 7. Investigate, three-panel diagnostic plate#
Investigate. A single MAE number hides every failure mode that
matters. The sibling _age_regression_figure.py renders three
panels: predicted vs true age (one point per held-out subject, with
Pearson r, Spearman rho, R^2, and MAE in a corner box, points colored
by sex), per-subject signed residuals sorted by |residual| (orange
for over-prediction, blue for under-prediction), and a histogram of
window-level errors with a Gaussian density overlay so the bias
(mean) and the spread (std) read at a glance. This is the kind of
diagnostic Pernet et al. 2019 (doi:10.1038/s41597-019-0104-8)
recommend before any clinical claim.
held_df = pd.DataFrame(held_records, columns=["subject", "sex", "true", "pred"])
y_true_pooled = held_df["true"].to_numpy()
y_pred_pooled = held_df["pred"].to_numpy()
subject_pooled = held_df["subject"].tolist()
sex_pooled = held_df["sex"].tolist()
from _age_regression_figure import draw_age_regression_figure
fig = draw_age_regression_figure(
y_true_subj=y_true_pooled,
y_pred_subj=y_pred_pooled,
subject_ids=subject_pooled,
sex_or_fold=sex_pooled,
plot_id="project_age_regression",
)
plt.show()
# Pull figure-side metrics back into the script namespace so the
# wrap-up print line stays consistent with the corner box.
fig_metrics = fig._eegdash_age_metrics
print(
f"figure metrics: r={fig_metrics['pearson_r']:+.3f} | "
f"rho={fig_metrics['spearman_rho']:+.3f} | "
f"R^2={fig_metrics['r2']:+.3f} | "
f"MAE={fig_metrics['mae']:.3f} yr | "
f"n_subjects={fig_metrics['n_subjects']}"
)
Modify, where to take this next (concept only)#
Modify (concept). Two upgrades close most of the gap to a real
scientific result. First, scale the cohort: HBN ships several
thousand recordings, and reliable brain-age estimates need at least
a few hundred subjects per fold. Second, swap Ridge for a deep
encoder, braindecode.models.EEGConformer (Song et al. 2023,
doi:10.1109/TNSRE.2022.3230250) trained on raw windows, with the
same cross-subject contract enforced at the dataset level. We do not
run it here (gpu_required + multi-hour fit out of scope), but the
subject-aware split contract above is the floor any such system must
still satisfy.
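That floor can be sketched independently of any model. Below, a hedged sketch of enforcing the same subject-aware contract at the dataset level before any deep-encoder training; window_subjects is a hypothetical stand-in for the per-window subject labels a braindecode windows dataset would carry.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical per-window subject labels: 12 subjects x 24 windows.
window_subjects = np.repeat([f"sub-{i:02d}" for i in range(12)], 24)

# One subject-aware hold-out split, keyed on subject, before training.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(window_subjects, groups=window_subjects))

# The contract any deep model must still satisfy: zero subject overlap.
assert not set(window_subjects[train_idx]) & set(window_subjects[test_idx])
print(len(set(window_subjects[train_idx])), len(set(window_subjects[test_idx])))
```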
Result, model vs median baseline#
Five folds, disjoint subject test sets; we report mean +/- std
against the regression chance level. The median-baseline MAE is the
floor the trained model must come in below. Comparing the
pooled-window and subject-level correlations separates two failure
modes: window-level noise versus subject-level rank reversal
(Cisotto and Chicco 2024 Tip 7).
pearson_pooled = float(pearsonr(y_true_pooled, y_pred_pooled).statistic)
spearman_pooled = float(spearmanr(y_true_pooled, y_pred_pooled).statistic)
print(
f"Cross-subject 5-fold age regression: r2={mean_r2:+.3f} +/- {std_r2:.3f} "
f"| mae={mean_mae:.3f} yr | baseline_mae={mean_baseline_mae:.3f} yr "
f"| metric: r2"
)
print(
f"pooled-window r={pearson_pooled:+.3f} | rho={spearman_pooled:+.3f} | "
f"subject r={fig_metrics['pearson_r']:+.3f} | "
f"subject rho={fig_metrics['spearman_rho']:+.3f}"
)
print(
f"chance: predicting the train-median age scores baseline_mae="
f"{mean_baseline_mae:.3f} yr on test."
)
assert mean_mae < mean_baseline_mae, "Model MAE must be below the median-baseline MAE."
Wrap-up#
We loaded an HBN-shaped feature table with age as a float column,
built a 5-fold GroupKFold split keyed on subject, asserted
zero subject leakage, fit a sklearn.linear_model.Ridge head
per fold, and reported R^2 plus MAE alongside the median-baseline
predictor. The three-panel figure is the diagnostic that surfaces
what a single MAE hides: a flat predicted-vs-true cloud means the
model regresses toward the train-set mean, a one-sided residual bar
chart means a single subject drags the error, and a wide Gaussian on
the histogram means the per-window spread leaves little room for
clinical claims. Age in HBN is a derived metadata column, not a
diagnosis; any clinical framing belongs in a follow-up study with
much larger N.
Try it yourself#
Swap sklearn.linear_model.Ridge for sklearn.neural_network.MLPRegressor (still random_state=42).
Pre-train a braindecode.models.EEGConformer on raw windows and feed encoder activations into the same Ridge head.
Bump n_folds to N_SUBJECTS for leave-one-subject-out variance.
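The leave-one-subject-out variant needs no new splitter: with n_splits equal to the number of unique subjects, every GroupKFold test fold is exactly one subject. A minimal sketch on stand-in labels:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Stand-in per-window subject labels mirroring the case-study shape.
groups = np.repeat([f"sub-{i:02d}" for i in range(12)], 24)
X = np.zeros((len(groups), 1))

loso = GroupKFold(n_splits=len(np.unique(groups)))
test_subjects = [set(groups[te]) for _, te in loso.split(X, groups=groups)]
assert all(len(s) == 1 for s in test_subjects)  # one subject per test fold
print(len(test_subjects))  # 12 folds, one per held-out subject
```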
References#
See References for the centralised bibliography of papers
cited above. Add or amend an entry once in
docs/source/refs.bib; every tutorial inherits the update.