Load one EEG recording#

Estimated reading time:18 minutes

Difficulty 1 | Runtime: 2m | Compute: CPU

Once a search returns a candidate dataset (see plot_00), the next question is practical: what does one recording actually contain? This tutorial loads a single BIDS file from OpenNeuro ds004504 (Miltiadous et al. 2023; Alzheimer / frontotemporal dementia / healthy controls) through the catalogue shared with NEMAR [Delorme et al., 2022], unwraps the mne.io.Raw object, and inspects channels, montage, and spectrum. The dataset is small (88 subjects, ~10 min per subject, 19 channels at 500 Hz on a standard 10-20 layout) so the first fetch is under 25 MB and every later step is a hot-cache read.

Keywords: loading, BIDS

Learning objectives#

  • Build an EEGDashDataset and read its HTML repr.

  • Pick one record by indexing into dataset.datasets.

  • Read sampling rate, channel count, and duration from the underlying mne.io.Raw.

  • Look at the signal (raw.plot), the montage (raw.plot_sensors), and the spectrum (raw.compute_psd().plot).

Requirements#

  • About 2 min on CPU on first run; under 10 s once cached.

  • Network: ~80 MB on the first call (downloaded once into the cache).

  • Cache directory: EEGDASH_CACHE_DIR env var, defaults to ~/.eegdash_cache.

  • Prerequisite: plot_00_first_search.

Setup. No randomness here, so no seed.

import os
from pathlib import Path

import matplotlib.pyplot as plt
import mne
import numpy as np
import pandas as pd

import eegdash
from eegdash import EEGDashDataset
from eegdash.viz import style_figure, use_eegdash_style

# Force MNE's matplotlib browser so raw.plot returns a Figure (the Qt
# browser returns an MNEQtBrowser instance and skips static capture).
mne.viz.set_browser_backend("matplotlib")

use_eegdash_style()
cache_dir = os.environ.get("EEGDASH_CACHE_DIR", str(Path.home() / ".eegdash_cache"))
print(f"eegdash {eegdash.__version__}; cache_dir={cache_dir}")
Using matplotlib as 2D backend.
eegdash 0.8.2; cache_dir=/home/runner/eegdash_cache

Concepts behind EEGDashDataset#

A few ideas that the rest of the tutorial assumes:

  • Metadata index vs. file blobs. EEGDash separates records (small JSON documents in MongoDB) from files (BIDS payloads on OpenNeuro / S3). A query touches only the index; the bytes wait.

  • Lazy by default. EEGDashDataset(...) returns immediately; no signal moves over the network. Each per-record entry exposes raw as a property, so the download and open happen the first time you read it (and only that record). The constructor accepts download=False to enforce strict offline mode against an already populated cache; otherwise EEGDash fetches missing files into cache_dir on demand. download_all() lets you front-load explicitly when a pipeline cannot afford a cold cache.

  • Inherits from Braindecode. EEGDashDataset is a braindecode.datasets.BaseConcatDataset; each entry under datasets is a braindecode.datasets.BaseDataset subclass. Any code that accepts those types (create_fixed_length_windows(ds, ...), DataLoader(ds, ...), Preprocessor(...) pipelines) accepts an EEGDashDataset without changes.

  • Pull and push from a Hub. push_to_hub() / pull_from_hub round-trip a derived dataset (preprocessed signals, windows, or a feature table) to a HuggingFace Hub repo. The metadata index stays the source of truth for raw recordings; the Hub is for the versioned, ML-ready artefacts you build on top of them.

  • BIDS entities are the query language. dataset, subject, task, session, run flow through to the index unchanged [Pernet et al., 2019].

Step 1: Build the dataset (lazy)#

Constructing an EEGDashDataset only queries the metadata catalogue. No EEG bytes move yet.

Predict. How many records will the query for ds004504 subject 001 return: 1 eyes-closed file, several tasks, or the whole dataset?

Run. Build the object; the HTML repr below shows what matched.

DATASET = "ds004504"  # Miltiadous AD/FTD/HC, 19 ch standard 10-20, 500 Hz
SUBJECT = "001"
dataset = EEGDashDataset(cache_dir=cache_dir, dataset=DATASET, subject=SUBJECT)
dataset
BaseConcatDataset
TypeBaseConcatDataset of EEGDashRaw
Recordings1
Total samples299900
Sfreq*500.0 Hz
Channels*19 (19 EEG)
Ch. names*Fp1, Fp2, F3, F4, C3, C4, P3, P4, O1, O2, ... (+9 more)
Montage*head
Duration*599.8 s
* from first recording
Description1 recordings × 9 columns [subject, task, age, gender, group, mmse, session, run, sex]


Step 2: Summarize without touching the network#

Every matched record exposes its BIDS metadata on description, a pandas.DataFrame materialised without any network call. Useful for sanity-checking a query before paying any download cost.

description = dataset.description
first_record = dataset.records[0] if getattr(dataset, "records", None) else {}


def _nunique(col: str) -> int:
    return description[col].nunique() if col in description.columns else 0


pd.Series(
    {
        "n_records": len(dataset),
        "n_subjects": _nunique("subject"),
        "n_tasks": _nunique("task"),
        "n_sessions": _nunique("session"),
        "channels": int(first_record.get("nchans", 0)),
        "sampling_frequency_hz": float(first_record.get("sampling_frequency", 0.0)),
    },
    name="value",
).to_frame()
value
n_records 299900.0
n_subjects 1.0
n_tasks 1.0
n_sessions 0.0
channels 19.0
sampling_frequency_hz 500.0


Step 3: Pick one record#

EEGDashDataset wraps a list of per-recording entries on datasets. Each entry is an EEGDashBaseDataset (a Braindecode BaseDataset subclass) carrying a BIDS path, a description, and the lazy raw property. Indexing into that list is the standard Python idiom for grabbing one; record.raw is what triggers the download and opens the file with MNE-Python [Gramfort et al., 2013].

record = dataset.datasets[0]
raw = record.raw

Step 4: BIDS metadata as a DataFrame#

Every recording carries the same BIDS envelope [Pernet et al., 2019]. record.description already is a pandas.Series; keeping the fields we care about and adding the duration gives a readable view.

keep = [
    "dataset",
    "subject",
    "task",
    "session",
    "run",
    "sampling_frequency",
    "nchans",
    "ntimes",
    "datatype",
]
meta_view = record.description.reindex(keep).to_frame("value")
meta_view.loc["duration (s)"] = round(raw.times[-1] - raw.times[0], 1)
meta_view.loc["annotations"] = len(raw.annotations)
meta_view
value
dataset NaN
subject 001
task eyesclosed
session None
run None
sampling_frequency NaN
nchans NaN
ntimes NaN
datatype NaN
duration (s) 599.8
annotations 0


Step 5: Inspect the MNE Raw object#

MNE-Python ships its own HTML repr for Raw: a small table with sampling rate, projections, bad channels, and channel types.

Investigate. Read the rate, the channel-type breakdown, and check the “bad channels” row: empty here (no bad channels marked yet).

raw
General
Filename(s) sub-001_task-eyesclosed_eeg.set
MNE object type RawEEGLAB
Measurement date Unknown
Participant
Experimenter Unknown
Acquisition
Duration 00:10:00 (HH:MM:SS)
Sampling frequency 500.00 Hz
Time points 299,900
Channels
EEG
Head & sensor digitization 22 points
Filters
Highpass 0.00 Hz
Lowpass 250.00 Hz


Step 6: Plot the signal#

Run. mne.io.Raw.plot() renders an 8 s window across a small midline + temporal selection, the most readable view at this scale. Explicit scalings and a longer duration make the morphology legible; default settings squeeze too many channels into the same strip and flatten the signal.

midline_picks = [
    ch
    for ch in ("Fz", "Cz", "Pz", "Oz", "FCz", "CPz", "POz", "T7", "T8")
    if ch in raw.ch_names
][:8]
fig_raw = raw.plot(
    start=10.0,
    duration=8.0,
    picks=midline_picks if midline_picks else None,
    n_channels=8,
    scalings={"eeg": 50e-6},
    show=False,
    show_scrollbars=False,
    show_scalebars=False,
    title="",
)
plot 01 first recording

Step 7: Sensor topology#

Where the electrodes sit on the scalp matters for every later analysis. mne.io.Raw.plot_sensors() draws the montage as a 2-D head schematic. With 74 channels, channel-name overlays clutter the view; we drop them and let the spatial pattern speak.

ds004504 uses the standard 10-20 layout (Fp1, Fp2, F3, F4, … Cz, Pz, 19 electrodes); standard_1020 covers all of them. plot_sensors is rendered only when positions actually got attached, so the cell stays robust against arbitrary channel-naming.

raw.set_montage(
    "standard_1020",
    match_case=False,
    match_alias=True,
    on_missing="ignore",
)
has_positions = any(not np.allclose(ch["loc"][:3], 0.0) for ch in raw.info["chs"])
fig_sens, ax_sens = plt.subplots(figsize=(5.0, 5.0))
if has_positions:
    raw.plot_sensors(
        kind="topomap",
        show_names=False,
        axes=ax_sens,
        show=False,
    )
    ax_sens.set_title("")
else:
    ax_sens.text(
        0.5,
        0.5,
        "Sensor positions unavailable in this build's montage",
        ha="center",
        va="center",
    )
    ax_sens.set_axis_off()
fig_sens.tight_layout()
plot 01 first recording

Step 8: Power spectral density#

A PSD is the cleanest first look at signal quality: line noise spikes at 50 or 60 Hz, slow drifts below 1 Hz, broadband 1/f. We compute Welch’s PSD on EEG channels only and plot in dB.

psd = raw.copy().pick("eeg").compute_psd(fmax=80.0, verbose=False)
fig_psd = psd.plot(picks="eeg", average=True, show=False)
style_figure(
    fig_psd,
    title="Power spectral density (Welch)",
    subtitle=f"{DATASET} sub-{SUBJECT} | {len(raw.copy().pick('eeg').ch_names)} EEG channels | fmax=80 Hz",
    source=f"EEGDash plot_01 | OpenNeuro {DATASET} :cite:`miltiadous2023`",
)
plt.show()
plot 01 first recording
Plotting power spectral density (dB=True).
/home/runner/work/EEGDash/EEGDash/eegdash/viz/identity.py:178: UserWarning: This figure was using a layout engine that is incompatible with subplots_adjust and/or tight_layout; not calling subplots_adjust.
  fig.subplots_adjust(top=0.84, bottom=0.18, left=0.12, right=0.95)

A common mistake, and how to recover#

Run. Indexing past the end of dataset.datasets raises IndexError, the standard Python contract. We trigger it on purpose so the failure mode is visible [Nederbragt et al., 2020].

try:
    _ = dataset.datasets[999]
except IndexError as exc:
    print(f"Caught IndexError: {exc}")
    safe_idx = min(999, len(dataset.datasets) - 1)
    print(f"Recovery: dataset.datasets[{safe_idx}] instead.")
Caught IndexError: list index out of range
Recovery: dataset.datasets[0] instead.

Modify#

Your turn. Change SUBJECT to a different one and re-run Step 3. The acquisition protocol is shared, so sampling rate and channel count match; small differences in run length per subject are normal.

SECOND_SUBJECT = "002"
raw_b = (
    EEGDashDataset(cache_dir=cache_dir, dataset=DATASET, subject=SECOND_SUBJECT)
    .datasets[0]
    .raw
)
pd.DataFrame(
    {
        f"sub-{SUBJECT}": [
            raw.info["sfreq"],
            raw.info["nchan"],
            round(raw.times[-1], 1),
        ],
        f"sub-{SECOND_SUBJECT}": [
            raw_b.info["sfreq"],
            raw_b.info["nchan"],
            round(raw_b.times[-1], 1),
        ],
    },
    index=["sfreq (Hz)", "nchan", "duration (s)"],
)
Downloading sub-002_task-eyesclosed_channels.tsv:   0%|          | 0.00/284 [00:00<?, ?B/s]
Downloading sub-002_task-eyesclosed_channels.tsv: 100%|██████████| 284/284 [00:00<00:00, 955kB/s]

Downloading sub-002_task-eyesclosed_eeg.json:   0%|          | 0.00/868 [00:00<?, ?B/s]
Downloading sub-002_task-eyesclosed_eeg.json: 100%|██████████| 868/868 [00:00<00:00, 2.91MB/s]
[06/09/26 20:58:31] WARNING  File not found on S3, skipping:   downloader.py:163
                             s3://openneuro.org/ds004504/sub-0
                             02/eeg/sub-002_task-eyesclosed_ee
                             g.fdt

Downloading sub-002_task-eyesclosed_eeg.set:   0%|          | 0.00/30.3M [00:00<?, ?B/s]
Downloading sub-002_task-eyesclosed_eeg.set:  39%|███▉      | 11.9M/30.3M [00:00<00:00, 62.1MB/s]
Downloading sub-002_task-eyesclosed_eeg.set:  60%|█████▉    | 18.1M/30.3M [00:00<00:00, 46.1MB/s]
Downloading sub-002_task-eyesclosed_eeg.set:  85%|████████▍ | 25.6M/30.3M [00:00<00:00, 43.7MB/s]
Downloading sub-002_task-eyesclosed_eeg.set: 100%|██████████| 30.3M/30.3M [00:00<00:00, 48.9MB/s]
sub-001 sub-002
sfreq (Hz) 500.0 500.0
nchan 19.0 19.0
duration (s) 599.8 793.1


Make#

Mini-project. Pick a different OpenNeuro dataset (start with ds002893 for a 32-channel auditory study) and repeat Steps 1, 3, and 5. Compare the montage and the PSD shape.

Wrap-up#

We loaded one recording end-to-end without writing a download script: EEGDashDataset resolved the BIDS query against the NEMAR / OpenNeuro catalogue, dataset.datasets[0].raw materialised the file lazily, and MNE-Python carried the rest. The signal is unprocessed; line noise and drifts are still in the view. Preprocessing belongs in plot_10_preprocess_and_window. Next: plot_02_dataset_to_dataloader wraps these recordings in a PyTorch DataLoader.

References#

See References for the centralized bibliography of papers cited above. Add or amend an entry once in docs/source/refs.bib; every tutorial inherits the update.

Total running time of the script: (0 minutes 5.265 seconds)