Predicting p-factor from EEG with hand-crafted features (project starter)#
Companion to project_pfactor_deep.py: same target, different model
class. The deep variant asks whether an EEGConformer can learn the
p_factor from raw windows; this one asks which interpretable EEG
features carry the signal. The p-factor (Caspi et al. 2014,
doi:10.1177/2167702613497473) is a transdiagnostic mental-health summary
score derived from parent-and-child psychiatric questionnaires; the EEG
side comes from the Healthy Brain Network release distributed on
OpenNeuro and on the EEG2025 Challenge mirror as EEG2025r5
(Alexander et al. 2017, doi:10.1038/sdata.2017.181), surfaced through
NEMAR (Delorme et al. 2022, doi:10.1093/nargab/lqac023). Splits stay
strictly subject-disjoint per Cisotto and Chicco 2024 Tip 9
(doi:10.7717/peerj-cs.2256).
The deliverable is a feature-importance panel plus a regression diagnostic on never-seen subjects: which feature family (band power, connectivity, entropy) carries the most weight in a sklearn ridge or random-forest head (Pedregosa et al. 2011, doi:10.5555/1953048.2078195), and how flat is the predicted-vs-true cloud at the subject level?
Can hand-crafted spectral and signal features beat the train-median predictor on held-out subjects, and which family pulls the most weight?
Learning objectives#
After this study you will be able to:
- load EEG2025r5 metadata with eegdash.EEGChallengeDataset and attach the p_factor target via description_fields.
- assemble a multi-family feature table (band power, connectivity, entropy) with eegdash.features.extract_features().
- build a strict cross-subject split with CrossSubjectSplitter (GroupKFold keyed on subject) and verify zero subject overlap.
- fit a sklearn ridge / random-forest head, pull importances, and report r / R^2 / MAE versus the train-median baseline.
- read a three-panel diagnostic that answers which feature family wins and whether the head beats the train-median predictor.
Requirements#
Around 30 s on CPU. No GPU. No live network.
Prereqs: /auto_examples/tutorials/40_features/plot_42_features_to_sklearn for the features-to-sklearn idiom and /auto_examples/tutorials/70_transfer_foundation/plot_72_subject_invariant_regression for the cross-subject regression contract this study satisfies.
Concept page: Leakage and evaluation.
Setup. Seed (E3.21) and a parametrised cache directory (E3.24) keep the study reproducible; warnings are tightened so sklearn convergence chatter does not drown the result print.
import os
import warnings
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from moabb.evaluations.splitters import CrossSubjectSplitter
from sklearn.model_selection import GroupKFold
from eegdash.viz import use_eegdash_style
use_eegdash_style()
warnings.simplefilter("ignore", category=FutureWarning)
warnings.simplefilter("ignore", category=UserWarning)
SEED = 42
np.random.seed(SEED)
rng = np.random.default_rng(SEED)
cache_dir = Path(os.environ.get("EEGDASH_CACHE_DIR", Path.cwd() / "eegdash_cache"))
cache_dir.mkdir(parents=True, exist_ok=True)
Step 1. Load EEG2025r5 with the p-factor target#
In production, eegdash.EEGChallengeDataset exposes p_factor
through description_fields so it surfaces as a per-recording column
(Alexander et al. 2017; Delorme et al. 2022). The canonical call is
below; the study then synthesises a feature table with the same column
layout so the gallery runs offline (E3.24). The synthetic data follows
the EEG2025r5 layout: 16 subjects, 24 windows per subject, four band-
power features per channel (delta/theta/alpha/beta), two connectivity
features (alpha and beta coherence between Cz and Pz), and two entropy
features (Cz, Pz). The p-factor signal lives in the alpha and beta
band-power features, with a small leak into entropy and a generous
noise floor that keeps R^2 modest.
from eegdash import EEGChallengeDataset
# Canonical production call (downloads metadata on first use); the synthetic
# table below mirrors its column layout so the rest of the study runs offline.
ds = EEGChallengeDataset(
release="R5", task="contrastChangeDetection", mini=True,
cache_dir=cache_dir,
description_fields=["subject", "session", "task",
"age", "sex", "p_factor"],
)
N_SUBJECTS, N_WINDOWS = 16, 24
BANDS = ("delta", "theta", "alpha", "beta")
CH_NAMES = ("Cz", "Pz")
subject_p = rng.normal(0.0, 1.0, size=N_SUBJECTS)
rows: list[dict] = []
for s in range(N_SUBJECTS):
p = float(subject_p[s])
for w in range(N_WINDOWS):
# Subject-level offset (the leakage adversary) plus a noise floor.
bias = 0.10 * (s - N_SUBJECTS / 2)
row = {
"subject": f"sub-{s:02d}",
"session": "ses-01",
"run": "run-01",
"dataset": "EEG2025r5",
"sample_id": f"sub-{s:02d}__w{w:03d}",
"p_factor": p,
}
# Band-power features: alpha and beta carry the p-factor signal.
for ch in CH_NAMES:
for band in BANDS:
base = rng.normal(0.0, 1.0)
if band in ("alpha", "beta"):
base += 0.55 * p
row[f"band_{band}_{ch}"] = float(base + 0.4 * bias)
# Connectivity features: alpha and beta coherence between Cz and Pz.
for band in ("alpha", "beta"):
base = rng.normal(0.0, 1.0)
base += 0.25 * p
row[f"conn_coh_{band}_Cz_Pz"] = float(base + 0.2 * bias)
# Entropy features: a faint signal, kept smaller than band-power.
for ch in CH_NAMES:
base = rng.normal(0.0, 1.0)
base += 0.20 * p
row[f"ent_spectral_{ch}"] = float(base + 0.2 * bias)
rows.append(row)
feature_table = pd.DataFrame(rows)
feature_cols = [
c for c in feature_table.columns if c.startswith(("band_", "conn_", "ent_"))
]
metadata = feature_table[
["subject", "session", "run", "dataset", "sample_id", "p_factor"]
].copy()
metadata["target"] = metadata["p_factor"].astype(float)
y = metadata["target"].to_numpy()
X = feature_table[feature_cols].to_numpy()
print(
f"feature_table: rows={len(metadata)} | features={len(feature_cols)} | "
f"subjects={metadata['subject'].nunique()} | "
f"families=band_power({sum(c.startswith('band_') for c in feature_cols)}), "
f"connectivity({sum(c.startswith('conn_') for c in feature_cols)}), "
f"entropy({sum(c.startswith('ent_') for c in feature_cols)})"
)
assert metadata["p_factor"].notna().all(), "p_factor has NaN rows"
assert pd.api.types.is_float_dtype(metadata["p_factor"]), "p_factor not float"
Step 2. Predict: which family will the model lean on?#
Predict. The generator concentrates the signal in alpha and beta band power (0.55*p per feature), with weaker leaks into connectivity (0.25*p) and entropy (0.20*p). Before fitting anything, write down your guess for the top-3 features. Will the importance bars cluster on band-power blue, on connectivity orange, or on entropy purple? What R^2 do you expect on held-out subjects: 0.05? 0.20? 0.50?
Step 3. Build the regression head#
Two heads, one row of importances each. A ridge regressor
(sklearn.linear_model.Ridge) keeps the coefficient signs so
the bars carry direction; a random forest
(sklearn.ensemble.RandomForestRegressor) supplies the
impurity-based feature_importances_ attribute as a quick,
non-negative ranking across the correlated band-power columns
[Pedregosa et al., 2011] (the Modify section below covers the less
biased permutation variant). The
sklearn.pipeline.Pipeline chains sklearn.preprocessing.StandardScaler
in front of the ridge so the regularisation strength compares across
columns of different scale.
def make_ridge_pipeline() -> Pipeline:
"""Return a fresh ``StandardScaler -> Ridge`` Pipeline (E3.21)."""
return Pipeline(
[
("scaler", StandardScaler()),
("reg", Ridge(alpha=1.0, random_state=SEED)),
]
)
def make_forest() -> RandomForestRegressor:
"""Return a fresh small ``RandomForestRegressor`` (E3.21).
``n_estimators=200`` gives a stable importance ranking on this size of
table; ``max_depth=4`` keeps the trees shallow enough that the report
runs in seconds on CPU.
"""
return RandomForestRegressor(
n_estimators=200,
max_depth=4,
min_samples_leaf=4,
random_state=SEED,
n_jobs=1,
)
Step 4. Cross-subject split and the zero-overlap leakage check#
Run. CrossSubjectSplitter(cv_class=GroupKFold, n_splits=5) from MOABB
wraps sklearn's sklearn.model_selection.GroupKFold keyed on
subject. Each fold is frozen into a pair of boolean masks, the
maximum train/test subject overlap across folds is asserted to be
zero, and the summary line that downstream tooling parses is
printed. With 16 subjects each fold tests on three or four unseen
subjects.
splitter = CrossSubjectSplitter(cv_class=GroupKFold, n_splits=5)
n_rows = len(metadata)
folds: list[tuple[np.ndarray, np.ndarray]] = []
for tr_idx, te_idx in splitter.split(y, metadata):
tr_mask = np.zeros(n_rows, dtype=bool)
tr_mask[tr_idx] = True
te_mask = np.zeros(n_rows, dtype=bool)
te_mask[te_idx] = True
folds.append((tr_mask, te_mask))
overlap = max(
len(set(metadata.loc[tr, "subject"]) & set(metadata.loc[te, "subject"]))
for tr, te in folds
)
assert overlap == 0, "cross_subject manifest leaked subjects"
assert len(folds) >= 5, "need at least 5 folds for mean +/- std"
print(
f"Splitter: {type(splitter).__name__} | folds: {len(folds)} | "
f"max subject overlap: {overlap}"
)
Step 5. Fit, score, accumulate per-fold importances#
The loop fits both heads per fold, scores on the held-out subjects with
sklearn.metrics.r2_score() and sklearn.metrics.mean_absolute_error(),
and computes the train-median baseline for the chance level
alongside (E5.43 forbids reporting a regression score without one). The
random-forest feature_importances_ are averaged across folds; the
average is what the figure plots so a single fold cannot dominate the
bar order.
fold_r2: list[float] = []
fold_mae: list[float] = []
fold_chance_r2: list[float] = []
fold_baseline_mae: list[float] = []
test_residuals: list[tuple[str, float, float]] = []
ridge_coefs = np.zeros(len(feature_cols), dtype=float)
forest_imp = np.zeros(len(feature_cols), dtype=float)
for k, (tr, te) in enumerate(folds):
ridge = make_ridge_pipeline().fit(X[tr], y[tr])
forest = make_forest().fit(X[tr], y[tr])
# The forest predicts directly on raw features. We use it as the
# primary regressor so the figure carries non-negative importances by
# default; ridge coefficients are tracked alongside for sign reading.
y_pred = forest.predict(X[te])
fold_r2.append(float(r2_score(y[te], y_pred)))
fold_mae.append(float(mean_absolute_error(y[te], y_pred)))
train_median = float(np.median(y[tr]))
ss_res = float(np.sum((y[te] - train_median) ** 2))
ss_tot = float(np.sum((y[te] - float(np.mean(y[te]))) ** 2))
fold_chance_r2.append(0.0 if ss_tot == 0.0 else 1.0 - ss_res / ss_tot)
fold_baseline_mae.append(
float(mean_absolute_error(y[te], np.full_like(y[te], np.median(y[tr]))))
)
forest_imp += forest.feature_importances_
ridge_coefs += ridge.named_steps["reg"].coef_
for sid, true_val, pred_val in zip(
metadata.loc[te, "subject"].tolist(),
y[te].tolist(),
y_pred.tolist(),
):
test_residuals.append((sid, float(true_val), float(pred_val)))
print(
f"Fold {k}: r2={fold_r2[-1]:+.3f} | mae={fold_mae[-1]:.3f} | "
f"baseline_r2={fold_chance_r2[-1]:+.3f}"
)
forest_imp /= len(folds)
ridge_coefs /= len(folds)
mean_r2 = float(np.mean(fold_r2))
std_r2 = float(np.std(fold_r2, ddof=1))
mean_mae = float(np.mean(fold_mae))
mean_chance_r2 = float(np.mean(fold_chance_r2))
mean_baseline_mae = float(np.mean(fold_baseline_mae))
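Before the figure, a family-level aggregation already answers the headline question numerically. A minimal sketch, grouping the fold-averaged forest importances by the column prefixes defined in Step 1 (the family labels are illustrative):
# Sum the fold-averaged forest importances within each feature family.
family_totals = {
    fam: float(
        sum(imp for c, imp in zip(feature_cols, forest_imp) if c.startswith(pre))
    )
    for fam, pre in (
        ("band_power", "band_"),
        ("connectivity", "conn_"),
        ("entropy", "ent_"),
    )
}
print("family importance totals:", {k: round(v, 3) for k, v in family_totals.items()})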
Step 6. Investigate the importance bars (figure)#
Investigate. The headline question is which feature family wins.
The sibling _pfactor_features_figure.py renders three panels:
top-K random-forest importances colour-coded by family (band-power
blue, connectivity orange, entropy purple), predicted vs true at the
subject level with Pearson r, R^2, MAE in the corner, and a window-
level error histogram with a Gaussian overlay so bias and spread land
on the same axis. This is the diagnostic Pernet et al. 2019
(doi:10.1038/s41597-019-0104-8) recommend before any clinical claim.
res_df = pd.DataFrame(test_residuals, columns=["subject", "true", "pred"])
y_true_pooled = res_df["true"].to_numpy()
y_pred_pooled = res_df["pred"].to_numpy()
subject_pooled = res_df["subject"].tolist()
from _pfactor_features_figure import draw_pfactor_features_figure # noqa: E402
fig = draw_pfactor_features_figure(
feature_importances=forest_imp,
feature_names=feature_cols,
y_true_subj=y_true_pooled,
y_pred_subj=y_pred_pooled,
subject_ids=subject_pooled,
top_k=10,
plot_id="project_pfactor_features",
)
plt.show()
fig_metrics = fig._eegdash_pfactor_features_metrics
top_idx = np.argsort(np.abs(forest_imp))[::-1][:3]
top3 = [(feature_cols[i], float(forest_imp[i])) for i in top_idx]
print(
f"figure metrics: r={fig_metrics['pearson_r']:+.3f} | "
f"R^2={fig_metrics['r2']:+.3f} | MAE={fig_metrics['mae']:.3f} | "
f"n_subjects={fig_metrics['n_subjects']} | "
f"n_features={fig_metrics['n_features']} | "
f"bias={fig_metrics['bias']:+.3f} | spread={fig_metrics['spread']:.3f}"
)
print(f"top-3 features: {top3}")
A common mistake, and how to recover#
Run. A frequent slip on a feature table is leaving infinity or NaN
values in a band-power column (a Welch estimate that hit a flat
segment, an entropy estimate over a constant window). Sklearn refuses
to fit with an "Input contains NaN, infinity or a value too large"
error. The fix is the standard one: replace +/-inf with NaN, then fill
NaN with zero (or the per-feature mean) before scaling. The cell below
triggers the failure on purpose with try/except so the recovery
path is visible (Nederbragt et al. 2020,
doi:10.1371/journal.pcbi.1008090).
try:
bad_X = X.copy()
bad_X[0, 0] = np.inf # one infinity is enough to break sklearn
bad_X[1, 0] = np.nan
Ridge(alpha=1.0, random_state=SEED).fit(bad_X[:8], y[:8])
except (ValueError, TypeError) as exc:
print(f"Caught {type(exc).__name__}: {str(exc)[:90]}")
# Recovery: replace inf, fill NaN, then refit. Pandas' ``.replace`` /
# ``.fillna`` chain mirrors what eegdash.features users typically do
# on the tidy DataFrame before handing it to sklearn.
fixed = pd.DataFrame(bad_X[:8], columns=feature_cols)
fixed = fixed.replace([np.inf, -np.inf], np.nan).fillna(0.0)
Ridge(alpha=1.0, random_state=SEED).fit(fixed.to_numpy(), y[:8])
print(f"Recovery: replace inf, fillna(0), refit (rows={len(fixed)}); Ridge fit.")
Modify: gradient boosting and permutation importance (concept only)#
Modify (concept). Two natural extensions sit one import away, as
sketched below. sklearn.ensemble.GradientBoostingRegressor (or
LightGBM / XGBoost) often closes the small R^2 gap a random forest
leaves on correlated band-power columns.
sklearn.inspection.permutation_importance() replaces the impurity-based
feature_importances_ with a permute-one-feature score that is less
biased toward features with many split points [Pedregosa et al., 2011].
Both keep the cross-subject contract above, so dropping them in only
changes the importance panel.
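A minimal sketch of both extensions, reusing the first frozen fold from Step 4; n_repeats=10 and the default R^2 scoring are illustrative choices, not tuned values:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
# Hedged sketch: fit gradient boosting on fold 0 and rank features by
# permutation importance on that fold's held-out subjects.
tr0, te0 = folds[0]
gbr = GradientBoostingRegressor(random_state=SEED).fit(X[tr0], y[tr0])
print(f"GradientBoosting fold-0 r2={r2_score(y[te0], gbr.predict(X[te0])):+.3f}")
perm = permutation_importance(gbr, X[te0], y[te0], n_repeats=10, random_state=SEED)
for i in np.argsort(perm.importances_mean)[::-1][:5]:
    print(
        f"  {feature_cols[i]}: {perm.importances_mean[i]:+.4f} "
        f"+/- {perm.importances_std[i]:.4f}"
    )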
Result: feature-based R^2 vs median baseline (chance)#
Five folds, disjoint subject test sets, mean +/- std against the
regression chance level. The baseline is the train-median predictor's
R^2 on the test set; an honest model must beat it, not just match it.
The print line carries the keyword baseline (E5.43).
print(
f"Cross-subject 5-fold p-factor regression: r2={mean_r2:+.3f} +/- {std_r2:.3f} "
f"| mae={mean_mae:.3f} | baseline_r2={mean_chance_r2:+.3f} "
f"| baseline_mae={mean_baseline_mae:.3f} | metric: r2"
)
pearson_pooled = float(pearsonr(y_true_pooled, y_pred_pooled).statistic)
print(
f"pooled-window r={pearson_pooled:+.3f} | subject r={fig_metrics['pearson_r']:+.3f}"
)
print(
f"chance: predicting the train-median scores baseline_r2={mean_chance_r2:+.3f} on test."
)
assert mean_mae < mean_baseline_mae, "Model MAE must be below the median-baseline MAE."
Wrap-up#
We loaded an EEG2025r5-shaped feature table (with p_factor as a
float column), assembled a multi-family feature set, built a 5-fold
cross_subject manifest, asserted zero subject leakage, fit a ridge
and a random-forest head per fold, and reported R^2 +/- std alongside
the train-median chance level. The three-panel figure
is the feature-based diagnostic: family-coloured importance bars on
the left, subject-aggregated predicted-vs-true in the middle, and a
window-level error histogram on the right. Treat the importance bars
as a hypothesis ranker: a feature that wins on R5 only earns a
clinical claim once it survives an external cohort (Cisotto and Chicco
2024 Tip 9). p_factor is a derived score, not a diagnosis; any clinical
framing belongs in a follow-up study with a much larger N.
Try it yourself#
- Swap sklearn.ensemble.RandomForestRegressor for sklearn.ensemble.GradientBoostingRegressor (still random_state=42). The importance panel should redraw with the same family colours.
- Replace the impurity-based importance with sklearn.inspection.permutation_importance() and compare the top-K ordering.
- Bump n_splits to N_SUBJECTS for leave-one-subject-out variance and read how the per-fold error cloud widens (with a single subject per test set the per-fold R^2 is degenerate; the sketch after this list scores MAE instead).
- Reload the deep companion project_pfactor_deep.py and overlay its predicted-vs-true scatter on the same axes; how often do the feature head and the deep head disagree on the same subject?
References#
See References for the centralised bibliography of papers
cited above. Add or amend an entry once in
docs/source/refs.bib; every tutorial inherits the update.