.. _concepts-eegdash-objects:

EEGDash objects: ``EEGDash``, ``EEGDashDataset``, ``EEGChallengeDataset``
=========================================================================

The library exposes three top-level objects that look similar at first
glance but answer very different questions. Picking the wrong one is the
most common source of confusion for new users. This page explains what
each one is, what it gives you back, and why each exists.

In short:

- ``EEGDash`` is a **catalogue client**. It talks to the metadata service
  and returns *records* (dicts of metadata). Nothing is downloaded.
- ``EEGDashDataset`` is a **PyTorch-compatible dataset**. It turns a
  catalogue query into a list of recordings that can be loaded,
  preprocessed, windowed, and iterated over.
- ``EEGChallengeDataset`` is a **frozen, derivative dataset**, used for
  shared-benchmark contexts (currently the EEG 2025 Competition). It
  loads pre-resampled, pre-filtered, pre-cut data so that every
  participant is evaluated against an identical signal.

Records vs. datasets
--------------------

A *record* is a metadata document for one BIDS recording: which dataset
it belongs to, which subject, task, session, run, channel count, sampling
frequency, the path on S3, and so on. A record does **not** contain the
samples themselves. ``EEGDash.find()`` returns records.

A *dataset* is a Python object that lazily resolves records into actual
EEG recordings (typically wrapping ``mne.io.Raw`` via the ``EEGDashRaw``
adapter). ``EEGDashDataset`` returns datasets. The first time you access
``.raw`` on one of its entries, the underlying file is downloaded into
the local cache; subsequent accesses are offline.

This split exists because metadata is small and cheap (you can search
700+ datasets in seconds), but raw EEG is large and slow (one HBN session
is hundreds of MB). You usually want to inspect metadata first, decide
what to keep, and only then trigger downloads.
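Because records are plain dicts, the metadata-first workflow amounts to
ordinary Python filtering. A minimal sketch with hand-written stand-in
records (the keys mirror those named above; real records come from
``EEGDash.find()`` and carry many more fields):

.. code-block:: python

    # Stand-in records shaped like the metadata documents described above.
    records = [
        {"dataset": "ds002718", "subject": "sub-002",
         "task": "FacePerception", "sampling_frequency": 512.0},
        {"dataset": "ds002718", "subject": "sub-003",
         "task": "FacePerception", "sampling_frequency": 256.0},
    ]

    # Inspect metadata and decide what to keep -- no EEG bytes involved.
    keep = [r for r in records if r["sampling_frequency"] >= 500.0]
    subjects = sorted(r["subject"] for r in keep)
    print(subjects)  # ['sub-002']

Only after this kind of triage would you hand the surviving filters to
``EEGDashDataset`` and pay the download cost.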
Typical use of ``EEGDash``
--------------------------

Use ``EEGDash`` when you want to *browse* the catalogue without
committing to a download. The result is a list of dicts; you can filter
it in pure Python before doing anything heavyweight.

.. code-block:: python

    from eegdash import EEGDash

    client = EEGDash()

    # Discover datasets matching loose, human-friendly filters.
    datasets_df = client.search_datasets(
        modality="eeg", task="rest", n_subjects_min=20
    )
    print(datasets_df[["dataset", "n_subjects", "task"]].head())

    # Drill into one dataset and look at individual recordings.
    records = client.find({"dataset": "ds002718", "task": "FacePerception"})
    print(f"Found {len(records)} recordings.")
    print(records[0].keys())  # subject, session, run, sampling_frequency, ...

No EEG samples were downloaded by either of those calls. The catalogue is
the API surface; downloads are explicit and live one layer deeper.

Typical use of ``EEGDashDataset``
---------------------------------

Use ``EEGDashDataset`` when you want a real, indexable dataset that you
can hand to braindecode preprocessing or a PyTorch ``DataLoader``. It
accepts the same filter keywords as ``EEGDash.find`` plus a ``cache_dir``
and a small set of dataset-construction options.

.. code-block:: python

    from eegdash import EEGDashDataset

    ds = EEGDashDataset(
        cache_dir="./eegdash_cache",
        dataset="ds002718",
        task="FacePerception",
        subject=["sub-002", "sub-003"],
        description_fields=["subject", "session", "task", "age"],
    )

    print(len(ds))                # number of recordings
    print(ds.description.head())  # tidy metadata table

    raw = ds[0].raw      # triggers the first download
    raw.filter(0.5, 40)  # mne.io.Raw operations

Lazy loading and caching
------------------------

The dataset is *lazy*. Construction merely resolves the metadata; the
``raw`` attribute on each entry is materialised on first access and then
held in memory for the lifetime of that object.
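The materialise-on-first-access behaviour is a standard caching pattern.
A self-contained illustration of the idea (not eegdash's actual
implementation; ``LazyRecording`` and ``load_signal`` are hypothetical
stand-ins):

.. code-block:: python

    class LazyRecording:
        """Illustration of materialise-on-first-access caching."""

        def __init__(self, record):
            self.record = record  # metadata only; cheap to hold
            self._raw = None      # signal not loaded yet

        @property
        def raw(self):
            if self._raw is None:                  # first access: load
                self._raw = load_signal(self.record)
            return self._raw                       # later accesses: cached


    def load_signal(record):
        # Stand-in for the real download + mne.io read step.
        return f"signal for {record['subject']}"


    rec = LazyRecording({"subject": "sub-002"})
    first = rec.raw    # triggers the (stand-in) load
    second = rec.raw   # served from memory
    print(first is second)  # True: held for the object's lifetime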
The sample bytes are written under ``cache_dir``, mirroring the BIDS
layout, so a second run with the same query is offline. If you set
``download=False`` and the cache already has the data, the catalogue is
bypassed entirely and ``EEGDashDataset`` reads the local BIDS tree
directly. This makes it straightforward to share a cache between machines
or to work without network access once the first run completes.

The lazy mode also lets you pass an ``on_error`` policy: in pipelines
that scan many recordings, ``on_error="skip"`` flags problem files via
``ds._skipped`` so you can filter them out with a list comprehension
(``ds.datasets = [d for d in ds.datasets if not getattr(d, "_skipped", False)]``)
when a few files in a release are known to be corrupt.

When to use ``EEGChallengeDataset``
-----------------------------------

``EEGChallengeDataset`` is a thin wrapper over the same machinery, but it
points at a frozen, preprocessed bucket: the data are downsampled to a
fixed rate, filtered with a fixed band, and cut into the canonical task
blocks used by the EEG 2025 Competition.

If you are participating in the competition, you **must** use
``EEGChallengeDataset``; otherwise your local results are not comparable
to the public leaderboard. If you are not participating,
``EEGChallengeDataset`` is still useful when you want a fully
reproducible benchmark: every user sees identical bytes. The library
prints a notice when you try to load competition releases through plain
``EEGDashDataset`` to nudge users away from this footgun.

Choosing among the three
------------------------

A practical decision tree:

- "I just want to know what is out there." → ``EEGDash``.
- "I want a PyTorch-style dataset for my own analysis." →
  ``EEGDashDataset``.
- "I am running EEG 2025 Competition code, or I want strictly identical
  preprocessing across users." → ``EEGChallengeDataset``.
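The ``on_error="skip"`` cleanup described under lazy loading is a plain
attribute filter. With hypothetical stand-in entries in place of
``ds.datasets``, the same list comprehension from the text reduces to:

.. code-block:: python

    from types import SimpleNamespace

    # Stand-ins for ds.datasets entries; with on_error="skip", eegdash
    # marks problem files by setting a _skipped attribute on them.
    datasets = [
        SimpleNamespace(name="sub-002"),
        SimpleNamespace(name="sub-003", _skipped=True),
        SimpleNamespace(name="sub-004"),
    ]

    # Same pattern as in the text: drop anything flagged as skipped.
    kept = [d for d in datasets if not getattr(d, "_skipped", False)]
    print([d.name for d in kept])  # ['sub-002', 'sub-004']

``getattr`` with a ``False`` default matters here: healthy entries never
get a ``_skipped`` attribute at all, so a direct ``d._skipped`` lookup
would raise ``AttributeError``.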
Mixing modes inside a single experiment is fine: a common workflow uses
``EEGDash.search_datasets`` to find candidate datasets, then constructs
``EEGDashDataset`` instances for the few you actually want to model.

Related tutorials
-----------------

- :doc:`/generated/auto_examples/tutorials/00_start_here/plot_00_first_search`
  walks through metadata-only catalogue exploration with ``EEGDash``.
- :doc:`/generated/auto_examples/tutorials/00_start_here/plot_01_first_recording`
  contrasts the catalogue view with a first ``EEGDashDataset`` load.
- :doc:`/generated/auto_examples/tutorials/00_start_here/plot_02_dataset_to_dataloader`
  builds a PyTorch ``DataLoader`` on top of an ``EEGDashDataset``.
- :doc:`/generated/auto_examples/tutorials/10_core_workflow/plot_13_save_and_reuse_prepared_data`
  shows how to persist a preprocessed dataset to disk and reload it
  without re-running the catalogue query.

Further reading
---------------

- Pernet, C. R., et al. (2019). EEG-BIDS, an extension to the brain
  imaging data structure for electroencephalography. *Scientific Data*,
  6(1), 103. https://doi.org/10.1038/s41597-019-0104-8
- Gramfort, A., et al. (2013). MEG and EEG data analysis with MNE-Python.
  *Frontiers in Neuroscience*, 7, 267.
  https://doi.org/10.3389/fnins.2013.00267
- Cisotto, G., & Chicco, D. (2024). Ten quick tips for clinical
  electroencephalographic (EEG) data acquisition and signal processing.
  *PeerJ Computer Science*, 10, e2256.
  https://doi.org/10.7717/peerj-cs.2256