How-to: work offline against a populated EEGDash cache#
Goal: instantiate EEGChallengeDataset with download=False and load
the same records as the online path, with no network calls.
Goal#
Load and filter EEGDash records from a local BIDS cache on an HPC node or air-gapped workstation, with zero network calls, and prove the cache is complete by comparing online vs. offline shape and metadata.
Prerequisites#
Estimated time: ~4 min on CPU (cache hit; one online prefetch on first run only).
You have already populated the cache via how_to_download_a_dataset (or download_all below).
Concept: [docs/source/concepts/lazy_loading_and_cache.rst](../../docs/source/concepts/lazy_loading_and_cache.rst).
Data: HBN release R2 (OpenNeuro ds005506), task RestingState, mini=True subset (<200 MB).
Setup – seed and resolve the cache directory from the environment.
import os
from pathlib import Path
import numpy as np
from eegdash import EEGChallengeDataset
from eegdash.const import RELEASE_TO_OPENNEURO_DATASET_MAP
from eegdash.paths import get_default_cache_dir
np.random.seed(42)
RELEASE = "R2"
TASK = "RestingState"
DATASET_ID = RELEASE_TO_OPENNEURO_DATASET_MAP[RELEASE] # "ds005506"
# Resolve cache from EEGDASH_CACHE_DIR if set, else the package default.
# Never hard-code paths -- HPC jobs override this per node.
cache_dir = Path(os.environ.get("EEGDASH_CACHE_DIR", get_default_cache_dir())).resolve()
cache_dir.mkdir(parents=True, exist_ok=True)
print(f"cache_dir = {cache_dir}")
Recipe#
Step 1 – Populate the cache (online, once)#
Run this block on a node with internet. download_all prefetches
every record so subsequent runs can use download=False. If your
cache is already populated, this is a near-instant no-op.
ds_online = EEGChallengeDataset(
release=RELEASE,
cache_dir=cache_dir,
task=TASK,
mini=True,
)
ds_online.download_all(n_jobs=-1)
print(f"online: {len(ds_online.datasets)} recording(s) cached.")
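To sanity-check the "<200 MB" claim after Step 1, a stdlib-only sketch (the helper name `cache_size_mb` is ours, not part of EEGDash) that sums file sizes under a cache folder:

```python
import tempfile
from pathlib import Path

def cache_size_mb(root: Path) -> float:
    """Sum file sizes under *root* in megabytes (0.0 if the folder is missing)."""
    if not root.exists():
        return 0.0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file()) / 1e6

# Demonstrate against a throwaway directory with one fake 1 KiB recording:
tmp = Path(tempfile.mkdtemp())
(tmp / "sub-01_task-RestingState_eeg.bdf").write_bytes(b"\x00" * 1024)
print(f"{cache_size_mb(tmp):.3f} MB")
```

On a real run you would point it at the mini cache folder, e.g. `cache_size_mb(cache_dir / f"{DATASET_ID}-bdf-mini")`, and expect well under 200 MB.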
Step 2 – Load offline with download=False#
This is the air-gapped path: EEGDash parses BIDS filenames in the cache
instead of querying the database or S3. The challenge subset lives at
<cache_dir>/<dataset_id>-bdf-mini; check it exists before loading.
offline_root = cache_dir / f"{DATASET_ID}-bdf-mini"
assert offline_root.exists(), f"missing cache folder: {offline_root}"
ds_offline = EEGChallengeDataset(
release=RELEASE,
cache_dir=cache_dir,
task=TASK,
download=False,
)
print(f"offline: {len(ds_offline.datasets)} recording(s) loaded.")
if ds_offline.datasets:
print("first bidspath:", ds_offline.datasets[0].record["bidspath"])
Step 3 – Filter by BIDS entity offline#
With download=False you can still filter by subject, session,
task, and run – those entities live in the BIDS filenames, not
the database. Database-only fields (e.g., modality aliases) are not
available offline.
ds_offline_sub = EEGChallengeDataset(
release=RELEASE,
cache_dir=cache_dir,
task=TASK,
download=False,
subject="NDARAB793GL3",
)
print(f"subject filter: {len(ds_offline_sub.datasets)} recording(s).")
assert len(ds_offline_sub.datasets) <= len(ds_offline.datasets), (
"filtered set must be a subset of the unfiltered offline records"
)
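Offline entity filtering works because subject, session, task, and run are encoded directly in BIDS filenames. A minimal illustration of that idea (not EEGDash's actual parser, just a regex sketch):

```python
import re

def parse_bids_entities(filename: str) -> dict:
    """Extract the key-value BIDS entities (sub-, ses-, task-, run-) from a filename."""
    return dict(re.findall(r"(sub|ses|task|run)-([A-Za-z0-9]+)", filename))

# Everything a subject/task filter needs is recoverable from the name alone:
entities = parse_bids_entities("sub-NDARAB793GL3_task-RestingState_eeg.bdf")
print(entities)  # {'sub': 'NDARAB793GL3', 'task': 'RestingState'}
```

Database-only fields have no counterpart in the filename, which is why they cannot be filtered on when download=False.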
Step 4 – Verify the cache is complete#
Compare record counts, raw-data shapes, and the description tables. If
any of these diverge, the cache is partial – re-run download_all or
clear the suffixed folder and start over.
assert len(ds_offline.datasets) == len(ds_online.datasets), (
"offline record count must match online; cache is partial"
)
shape_online = ds_online.datasets[0].raw.get_data().shape
shape_offline = ds_offline.datasets[0].raw.get_data().shape
print(f"online shape : {shape_online}")
print(f"offline shape: {shape_offline}")
assert shape_online == shape_offline, "raw shape mismatch"
desc_online = ds_online.description
desc_offline = ds_offline.description
print(f"description shapes: online={desc_online.shape} offline={desc_offline.shape}")
assert desc_offline.equals(desc_online), "description metadata diverges"
print("offline cache is complete.")
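If the final description assert fails, a bare "metadata diverges" message is hard to act on. A small hypothetical helper (ours, not EEGDash's) that names the offending columns, shown here on toy frames:

```python
import pandas as pd

def describe_divergence(a: pd.DataFrame, b: pd.DataFrame) -> list[str]:
    """Name the columns (or the shape) on which two description tables disagree."""
    if a.shape != b.shape:
        return [f"shape {a.shape} != {b.shape}"]
    return [c for c in a.columns if not a[c].equals(b[c])]

# Toy stand-ins for ds_online.description / ds_offline.description:
online = pd.DataFrame({"subject": ["01", "02"], "task": ["RestingState"] * 2})
offline = pd.DataFrame({"subject": ["01", "02"], "task": ["RestingState", "rest"]})
print(describe_divergence(online, offline))  # ['task']
```

In this guide you would call it as `describe_divergence(desc_online, desc_offline)` before deciding whether to re-run download_all.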
Result#
len(ds_offline.datasets) == len(ds_online.datasets) (cache complete).
raw.get_data().shape matches across paths.
description.equals(...) is True – offline parses identical metadata.
Subject filter returns a subset of the unfiltered records (asserted in Step 3).
No network call after Step 1 (network_mb == 0 for Steps 2-4).
Source: HBN release R2 (OpenNeuro ds005506), task RestingState, mini=True.
Common pitfalls#
If cache_dir does not exist, EEGDashDataset will silently re-download. Always create it first and set EEGDASH_OFFLINE=1 (or download=False) on air-gapped nodes – belt and braces.
The challenge subset lives under <cache_dir>/<dataset_id>-bdf-mini, not <cache_dir>/<dataset_id>. Mixing mini=True online with mini=False offline (or vice versa) loads zero records without an obvious error – always pass the same release suffix on both paths.
Filtering offline only honours BIDS-entity fields (subject, session, task, run). Database-only filters (e.g., custom modality aliases) silently match nothing; pre-stage a derived manifest if you need them.
download=False skips S3 but still walks the BIDS tree on instantiation. On Lustre/NFS this can stall; stage the cache to local NVMe (see how_to_use_hpc_cache) before training.
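The first two pitfalls can be caught with a fail-fast guard before instantiating the dataset. A stdlib-only sketch (the helper name and the non-mini layout assumption are ours; the doc only guarantees the `-bdf-mini` suffix for the mini subset):

```python
from pathlib import Path

def assert_offline_ready(cache_dir: Path, dataset_id: str, mini: bool = True) -> Path:
    """Raise early if the expected (possibly mini-suffixed) cache folder is absent or empty."""
    suffix = "-bdf-mini" if mini else ""
    root = Path(cache_dir) / f"{dataset_id}{suffix}"
    if not root.is_dir() or not any(root.iterdir()):
        raise FileNotFoundError(
            f"offline cache not ready: {root} is missing or empty; "
            "re-run download_all() on a connected node"
        )
    return root

# Usage in this guide's terms (commented out -- needs a populated cache):
# offline_root = assert_offline_ready(cache_dir, DATASET_ID, mini=True)
```

This turns a silent zero-record load (or an accidental re-download) into an immediate, explicit error.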
See also#
[how_to_download_a_dataset](how_to_download_a_dataset.py) – populate the cache before going offline.
[how_to_use_hpc_cache](how_to_use_hpc_cache.py) – stage the cache onto local-node storage for IO-bound jobs.
Concept: [docs/source/concepts/lazy_loading_and_cache.rst](../../docs/source/concepts/lazy_loading_and_cache.rst).
References#
Pernet et al. 2019, EEG-BIDS, Sci. Data 6:103. https://doi.org/10.1038/s41597-019-0104-8 – the BIDS-EEG layout that makes offline filename-based filtering possible.
Dataset: OpenNeuro ds005506 (HBN R2, RestingState). https://doi.org/10.18112/openneuro.ds005506.v1.0.0