Working Offline with EEGDash#

Estimated reading time: 3 minutes

Many HPC clusters restrict or block network access; it is common for internet-enabled jobs to run on dedicated queues separate from the GPU queues. This tutorial shows how to use EEGChallengeDataset offline once a dataset is present on disk.

from pathlib import Path
import platformdirs

from eegdash.const import RELEASE_TO_OPENNEURO_DATASET_MAP
from eegdash.dataset.dataset import EEGChallengeDataset


# We'll use Release R2 as an example (HBN subset).
# :doc:`EEGChallengeDataset </api/dataset/eegdash.dataset.EEGChallengeDataset>`
# uses a suffixed cache folder for the competition data (e.g., "-bdf-mini").
release = "R2"
dataset_id = RELEASE_TO_OPENNEURO_DATASET_MAP[release]
task = "RestingState"
# Choose a cache directory. This should be on a fast local filesystem.
cache_dir = Path(platformdirs.user_cache_dir("EEGDash"))
cache_dir.mkdir(parents=True, exist_ok=True)
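On a cluster you often want the cache on node-local scratch rather than a slow network home directory. The helper below is a minimal sketch of one way to make that configurable; the `EEGDASH_CACHE_DIR` environment variable is a hypothetical convention for this example, not a feature of EEGDash itself.

```python
import os
from pathlib import Path


def pick_cache_dir(env_var: str = "EEGDASH_CACHE_DIR") -> Path:
    """Prefer an explicit override (e.g. pointing at node-local scratch);
    otherwise fall back to a directory under the user's home.

    Note: the environment variable name is an assumption made for this
    example, not something EEGDash reads.
    """
    override = os.environ.get(env_var)
    if override:
        return Path(override)
    return Path.home() / ".cache" / "EEGDash"


cache_dir = pick_cache_dir()
cache_dir.mkdir(parents=True, exist_ok=True)
print(cache_dir)
```

In a batch script you would then export the variable before launching the job, e.g. `EEGDASH_CACHE_DIR=/scratch/$USER/eegdash python train.py`.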

Step 1: Populate the local cache (Online)#

This block downloads the dataset from S3 to your local cache directory. Run this part on a machine with internet access. If the dataset is already on your disk at the specified cache_dir, you can comment out or skip this section.

To keep this example self-contained, we prefetch the data here.

ds_online = EEGChallengeDataset(
    release=release,
    cache_dir=cache_dir,
    task=task,
    mini=True,
)

# Optional prefetch of all recordings (downloads everything to cache).
from joblib import Parallel, delayed

_ = Parallel(n_jobs=-1)(delayed(lambda d: d.raw)(d) for d in ds_online.datasets)
╭────────────────────── EEG 2025 Competition Data Notice ──────────────────────╮
│ This object loads the HBN dataset that has been preprocessed for the EEG     │
│ Challenge:                                                                   │
│   * Downsampled from 500Hz to 100Hz                                          │
│   * Bandpass filtered (0.5-50 Hz)                                            │
│                                                                              │
│ For full preprocessing applied for competition details, see:                 │
│   https://github.com/eeg2025/downsample-datasets                             │
│                                                                              │
│ The HBN dataset have some preprocessing applied by the HBN team:             │
│   * Re-reference (Cz Channel)                                                │
│                                                                              │
│ IMPORTANT: The data accessed via `EEGChallengeDataset` is NOT identical to   │
│ what you get from EEGDashDataset directly.                                   │
│ If you are participating in the competition, always use                      │
│ `EEGChallengeDataset` to ensure consistency with the challenge data.         │
╰──────────────────────── Source: EEGChallengeDataset ─────────────────────────╯
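The one-line joblib prefetch above has no error handling, so a single transient network failure can abort the whole download. A retry wrapper is a common remedy; the sketch below is a generic helper written for this tutorial (not part of the EEGDash API), shown here with linear backoff.

```python
import time


def with_retries(fn, attempts: int = 3, delay: float = 1.0):
    """Call ``fn`` up to ``attempts`` times, sleeping between failures.

    Returns the first successful result, or re-raises the last exception.
    """
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except Exception as err:  # in practice, catch the specific I/O errors
            last_err = err
            time.sleep(delay * (i + 1))  # simple linear backoff
    raise last_err


# With the prefetch loop above, each recording could then be fetched as:
#   Parallel(n_jobs=-1)(
#       delayed(with_retries)(lambda d=d: d.raw) for d in ds_online.datasets
#   )
```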

Step 2: Basic Offline Usage#

Once the data is cached locally, you can interact with it without an internet connection. The key is to instantiate your dataset object with the download=False argument. This tells EEGChallengeDataset to look for data in cache_dir instead of contacting the database or S3.

# Here we check that the local cache folder exists
offline_root = cache_dir / f"{dataset_id}-bdf-mini"
print(f"Local dataset folder exists: {offline_root.exists()}\n{offline_root}")

ds_offline = EEGChallengeDataset(
    release=release,
    cache_dir=cache_dir,
    task=task,
    download=False,
)

print(f"Found {len(ds_offline.datasets)} recording(s) offline.")
if ds_offline.datasets:
    print("First record bidspath:", ds_offline.datasets[0].record["bidspath"])
Local dataset folder exists: True
/home/runner/.cache/EEGDash/ds005506-bdf-mini
Found 20 recording(s) offline.
First record bidspath: ds005506/sub-NDARAB793GL3/eeg/sub-NDARAB793GL3_task-RestingState_eeg.bdf
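Beyond checking that the folder exists, it can be useful to verify that the cache actually contains raw recordings before submitting an offline job. Counting `.bdf` files under the dataset root is a quick heuristic based on the bidspath shown above; the helper below is a sketch written for this tutorial, not an EEGDash function.

```python
from pathlib import Path


def count_raw_files(root: Path, suffix: str = ".bdf") -> int:
    """Count raw EEG files under a local BIDS-style tree."""
    return sum(1 for _ in root.rglob(f"*{suffix}"))


# Applied to the offline root used above:
#   n = count_raw_files(cache_dir / f"{dataset_id}-bdf-mini")
#   print(f"{n} raw file(s) cached")
```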

Step 3: Filtering Entities Offline#

Even without a database connection, you can still filter your dataset by BIDS entities like subject, session, or task. When download=False, EEGChallengeDataset uses the BIDS directory structure and filenames to apply these filters. This example shows how to load data for a specific subject from the local cache.

ds_offline_sub = EEGChallengeDataset(
    cache_dir=cache_dir,
    release=release,
    download=False,
    subject="NDARAB793GL3",
)

print(f"Filtered by subject=NDARAB793GL3: {len(ds_offline_sub.datasets)} recording(s).")
if ds_offline_sub.datasets:
    keys = ("dataset", "subject", "task", "run")
    print("Records (dataset, subject, task, run):")
    for idx, base_ds in enumerate(ds_offline_sub.datasets, start=1):
        rec = base_ds.record
        summary = ", ".join(f"{k}={rec.get(k)}" for k in keys)
        print(f"  {idx:03d}: {summary}")
Filtered by subject=NDARAB793GL3: 1 recording(s).
Records (dataset, subject, task, run):
  001: dataset=ds005506, subject=NDARAB793GL3, task=RestingState, run=None
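As noted above, offline filtering works by parsing BIDS entities out of filenames and directory structure. To make that concrete, here is a minimal sketch of what such parsing can look like; it is an illustration for this tutorial, not EEGDash's actual implementation.

```python
import re


def parse_bids_entities(filename: str) -> dict:
    """Extract key-value entity pairs (sub-*, ses-*, task-*, run-*, ...)
    from a BIDS filename such as 'sub-X_task-Y_eeg.bdf'."""
    return dict(re.findall(r"([a-zA-Z0-9]+)-([a-zA-Z0-9]+)", filename))


entities = parse_bids_entities("sub-NDARAB793GL3_task-RestingState_eeg.bdf")
print(entities)  # {'sub': 'NDARAB793GL3', 'task': 'RestingState'}
```

An offline loader can then match these parsed entities against the filters you pass (e.g. subject="NDARAB793GL3") without ever touching the database.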

Step 4: Comparing Online vs. Offline Data#

As a sanity check, you can verify that the data loaded from your local cache matches the data fetched from the online sources. This section compares the shapes of the raw data from the online and offline datasets; matching shapes do not prove the recordings are identical, but they are a quick way to catch an incomplete or corrupted cache.

This comparison requires the online dataset from Step 1, so run it on a machine with network access.

raw_online = ds_online.datasets[0].raw
raw_offline = ds_offline.datasets[0].raw
print("online shape:", raw_online.get_data().shape)
print("offline shape:", raw_offline.get_data().shape)
print("shapes equal:", raw_online.get_data().shape == raw_offline.get_data().shape)
online shape: (129, 40800)
offline shape: (129, 40800)
shapes equal: True

Step 4.1: Comparing Online vs. Offline Descriptions#

You can apply the same check to the dataset descriptions: both should contain the same metadata table regardless of how the data was loaded.

description_online = ds_online.description
description_offline = ds_offline.description
print(description_offline)
print(description_online)
print("Online description shape:", description_online.shape)
print("Offline description shape:", description_offline.shape)
print("Descriptions equal:", description_online.equals(description_offline))
         subject          task  ...  seqlearning8target symbolsearch
0   NDARAB793GL3  RestingState  ...           available    available
1   NDARAM675UR8  RestingState  ...         unavailable    available
2   NDARBM839WR5  RestingState  ...           available    available
3   NDARBU730PN8  RestingState  ...           available    available
4   NDARCT974NAJ  RestingState  ...           available    available
5   NDARCW933FD5  RestingState  ...           available    available
6   NDARCZ770BRG  RestingState  ...           available    available
7   NDARDW741HCF  RestingState  ...         unavailable    available
8   NDARDZ058NZN  RestingState  ...         unavailable    available
9   NDAREC377AU2  RestingState  ...           available    available
10  NDAREM500WWH  RestingState  ...         unavailable    available
11  NDAREV527ZRF  RestingState  ...           available    available
12  NDAREV601CE7  RestingState  ...           available    available
13  NDARFF070XHV  RestingState  ...           available    available
14  NDARFR108JNB  RestingState  ...         unavailable    available
15  NDARFT305CG1  RestingState  ...         unavailable    available
16  NDARGA056TMW  RestingState  ...           available    available
17  NDARGH775KF5  RestingState  ...           available    available
18  NDARGJ878ZP4  RestingState  ...         unavailable    available
19  NDARHA387FPM  RestingState  ...           available    available

[20 rows x 25 columns]
         subject          task  ...  seqlearning8target symbolsearch
0   NDARAB793GL3  RestingState  ...           available    available
1   NDARAM675UR8  RestingState  ...         unavailable    available
2   NDARBM839WR5  RestingState  ...           available    available
3   NDARBU730PN8  RestingState  ...           available    available
4   NDARCT974NAJ  RestingState  ...           available    available
5   NDARCW933FD5  RestingState  ...           available    available
6   NDARCZ770BRG  RestingState  ...           available    available
7   NDARDW741HCF  RestingState  ...         unavailable    available
8   NDARDZ058NZN  RestingState  ...         unavailable    available
9   NDAREC377AU2  RestingState  ...           available    available
10  NDAREM500WWH  RestingState  ...         unavailable    available
11  NDAREV527ZRF  RestingState  ...           available    available
12  NDAREV601CE7  RestingState  ...           available    available
13  NDARFF070XHV  RestingState  ...           available    available
14  NDARFR108JNB  RestingState  ...         unavailable    available
15  NDARFT305CG1  RestingState  ...         unavailable    available
16  NDARGA056TMW  RestingState  ...           available    available
17  NDARGH775KF5  RestingState  ...           available    available
18  NDARGJ878ZP4  RestingState  ...         unavailable    available
19  NDARHA387FPM  RestingState  ...           available    available

[20 rows x 25 columns]
Online description shape: (20, 25)
Offline description shape: (20, 25)
Descriptions equal: True
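`equals` is an all-or-nothing check: if it returns False, it tells you nothing about where the tables diverge. pandas' `DataFrame.compare` shows the differing cells, which is useful when debugging a stale cache. The snippet below demonstrates this on two small synthetic frames standing in for the descriptions above.

```python
import pandas as pd

# Two toy "description" tables that differ in one cell.
a = pd.DataFrame({"subject": ["s1", "s2"], "task": ["rest", "rest"]})
b = pd.DataFrame({"subject": ["s1", "s2"], "task": ["rest", "video"]})

# .equals() is all-or-nothing; .compare() shows *where* the frames differ.
print(a.equals(b))  # False
print(a.compare(b))
```

If the real descriptions ever disagree, `description_online.compare(description_offline)` pinpoints the mismatched rows and columns.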

Notes and troubleshooting#

  • Working offline selects recordings by parsing BIDS filenames and directory structure. Some DB-only fields are unavailable; entity filters (subject, session, task, run) usually suffice.

  • If you encounter issues, please open a GitHub issue so we can discuss.

Total running time of the script: (1 minute 14.010 seconds)

Estimated memory usage: 1023 MB

Gallery generated by Sphinx-Gallery