EEGDash objects: EEGDash, EEGDashDataset, EEGChallengeDataset#

The library exposes three top-level objects that look similar at first glance but answer very different questions. Picking the wrong one is the most common source of confusion for new users. This page explains what each one is, what it gives you back, and why each exists.

In short:

  • EEGDash is a catalogue client. It talks to the metadata service and returns records (dicts of metadata). Nothing is downloaded.

  • EEGDashDataset is a PyTorch-compatible dataset. It turns a catalogue query into a list of recordings that can be loaded, preprocessed, windowed, and iterated over.

  • EEGChallengeDataset is a frozen, derivative dataset used in shared-benchmark contexts (currently the EEG 2025 Competition). It loads pre-resampled, pre-filtered, pre-cut data so that every participant is evaluated against an identical signal.

Records vs. datasets#

A record is a metadata document for one BIDS recording: which dataset it belongs to, which subject, task, session, run, channel count, sampling frequency, the path on S3, and so on. A record does not contain the samples themselves. EEGDash.find() returns records.
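To make this concrete, a record can be pictured as a plain dict. The sketch below is illustrative only: the exact field names and values in a real record may differ from what is shown here.

```python
# Illustrative stand-in for one record returned by EEGDash.find();
# field names are assumptions, not the library's exact schema.
record = {
    "dataset": "ds002718",
    "subject": "sub-002",
    "session": "01",
    "task": "FacePerception",
    "run": "1",
    "sampling_frequency": 250.0,
    "bidspath": "ds002718/sub-002/eeg/",  # where the samples live on S3
}

# Because records are plain dicts, they can be inspected without any I/O.
print(record["dataset"], record["subject"], record["task"])
```

Note that nothing in this dict is signal data; the samples stay on S3 until a dataset object asks for them.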

A dataset is a Python object that lazily resolves records into actual EEG recordings (typically wrapping mne.io.Raw via the EEGDashRaw adapter). EEGDashDataset returns datasets. The first time you access .raw on one of its entries, the underlying file is downloaded into the local cache; subsequent accesses are offline.

This split exists because metadata is small and cheap (you can search 700+ datasets in seconds), but raw EEG is large and slow (one HBN session is hundreds of MB). You usually want to inspect metadata first, decide what to keep, and only then trigger downloads.

Typical use of EEGDash#

Use EEGDash when you want to browse the catalogue without committing to a download. The result is a list of dicts; you can filter it in pure Python before doing anything heavyweight.

from eegdash import EEGDash

client = EEGDash()

# Discover datasets matching loose, human-friendly filters.
datasets_df = client.search_datasets(modality="eeg", task="rest",
                                     n_subjects_min=20)
print(datasets_df[["dataset", "n_subjects", "task"]].head())

# Drill into one dataset and look at individual recordings.
records = client.find({"dataset": "ds002718", "task": "FacePerception"})
print(f"Found {len(records)} recordings.")
print(records[0].keys())  # subject, session, run, sampling_frequency, ...

No EEG samples were downloaded by either of those calls. The catalogue is the API surface; downloads are explicit and live one layer deeper.
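Because find() hands back plain dicts, narrowing the result set needs no library support at all. The sketch below filters a mocked record list standing in for real client.find output (field names are illustrative):

```python
# Stand-in for the output of client.find(...); field names are assumptions.
records = [
    {"subject": "sub-002", "session": "01", "sampling_frequency": 250.0},
    {"subject": "sub-003", "session": "01", "sampling_frequency": 500.0},
    {"subject": "sub-004", "session": "02", "sampling_frequency": 250.0},
]

# Keep only recordings at the sampling rate we plan to model.
keep = [r for r in records if r["sampling_frequency"] == 250.0]
subjects = sorted({r["subject"] for r in keep})
print(subjects)  # ['sub-002', 'sub-004']
```

Filtering like this before constructing a dataset is how you avoid paying for downloads you will discard.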

Typical use of EEGDashDataset#

Use EEGDashDataset when you want a real, indexable dataset that you can hand to braindecode preprocessing or a PyTorch DataLoader. It accepts the same filter keywords as EEGDash.find plus a cache_dir and a small set of dataset-construction options.

from eegdash import EEGDashDataset

ds = EEGDashDataset(
    cache_dir="./eegdash_cache",
    dataset="ds002718",
    task="FacePerception",
    subject=["sub-002", "sub-003"],
    description_fields=["subject", "session", "task", "age"],
)

print(len(ds))                  # number of recordings
print(ds.description.head())    # tidy metadata table
raw = ds[0].raw                 # triggers the first download
raw.filter(0.5, 40)             # mne.io.Raw operations
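The description table is a tidy per-recording frame, which makes subject-level splits straightforward. The sketch below applies the idea to a stand-in DataFrame rather than a live ds.description (the column values are illustrative):

```python
import pandas as pd

# Stand-in for ds.description: one row per recording, with the columns
# requested via description_fields above. Values are illustrative.
desc = pd.DataFrame({
    "subject": ["sub-002", "sub-002", "sub-003", "sub-003"],
    "session": ["01", "02", "01", "02"],
    "task": ["FacePerception"] * 4,
    "age": [24, 24, 31, 31],
})

# Split recording indices by subject so no subject leaks across folds.
train_idx = desc.index[desc["subject"] != "sub-003"].tolist()
test_idx = desc.index[desc["subject"] == "sub-003"].tolist()
print(train_idx, test_idx)  # [0, 1] [2, 3]
```

The indices can then be used to select the corresponding entries of the dataset itself.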

Lazy loading and caching#

The dataset is lazy. Construction merely resolves the metadata; the raw attribute on each entry is materialised on first access and then held in memory for the lifetime of that object. The sample bytes are written to cache_dir / <dataset_id>, mirroring the BIDS layout, so a second run with the same query is offline. If you set download=False and the cache already has the data, the catalogue is bypassed entirely and EEGDashDataset reads the local BIDS tree directly. This makes it straightforward to share a cache between machines or to work without network access once the first run completes.
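Under that layout, you can predict where a recording's files will land and check the cache without touching the network. A minimal sketch with pathlib, assuming illustrative file names under the cache_dir / <dataset_id> convention described above:

```python
from pathlib import Path

# Sketch of the expected cache layout (BIDS-style tree under
# cache_dir/<dataset_id>); the file name below is an assumption.
cache_dir = Path("./eegdash_cache")
expected = cache_dir / "ds002718" / "sub-002" / "eeg" / "sub-002_task-FacePerception_eeg.set"

# A second run can check the cache before deciding whether to download.
already_cached = expected.exists()
print(expected, already_cached)
```

This is also the path you would sync between machines to share a cache.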

The constructor also accepts an on_error policy. In pipelines that scan many recordings, on_error="skip" flags problem entries with a _skipped attribute instead of raising, so you can drop them afterwards with a list comprehension (ds.datasets = [d for d in ds.datasets if not getattr(d, "_skipped", False)]); this is handy when a few files in a release are known to be corrupt.
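The skip-and-filter pattern can be exercised without any downloads; the sketch below uses stand-in objects that mimic the _skipped flag:

```python
# Stand-in for dataset entries; a real EEGDashDataset holds its
# recordings in ds.datasets.
class FakeRecording:
    def __init__(self, name, skipped=False):
        self.name = name
        if skipped:
            self._skipped = True  # flag set by on_error="skip" on failure

datasets = [FakeRecording("a"), FakeRecording("b", skipped=True), FakeRecording("c")]

# Drop entries flagged during the scan; getattr covers entries that
# never had the attribute set.
good = [d for d in datasets if not getattr(d, "_skipped", False)]
print([d.name for d in good])  # ['a', 'c']
```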

When to use EEGChallengeDataset#

EEGChallengeDataset is a thin wrapper over the same machinery, but points at a frozen, preprocessed bucket: the data are downsampled to a fixed rate, filtered with a fixed band, and cut into the canonical task blocks used by the EEG 2025 Competition. If you are participating in the competition, you must use EEGChallengeDataset; otherwise your local results are not comparable to the public leaderboard. If you are not participating, EEGChallengeDataset is still useful when you want a fully reproducible benchmark: every user sees identical bytes.

To steer users away from this footgun, the library prints a notice whenever competition releases are loaded through plain EEGDashDataset.

Choosing among the three#

A practical decision tree:

  • “I just want to know what is out there.” → EEGDash.

  • “I want a PyTorch-style dataset for my own analysis.” → EEGDashDataset.

  • “I am running EEG 2025 Competition code, or I want strictly identical preprocessing across users.” → EEGChallengeDataset.

Mixing modes inside a single experiment is fine: a common workflow uses EEGDash.search_datasets to find candidate datasets, then constructs EEGDashDataset instances for the few you actually want to model.

Further reading#