Place the EEGDash cache on shared or local cluster storage#

Estimated reading time: 5 minutes

On HPC clusters, where you put the EEGDash cache is often the difference between a 30-minute and a 30-second epoch. Shared filesystems (Lustre, GPFS, NFS) survive job restarts but throttle under metadata-heavy access; node-local NVMe is fast but volatile; and $HOME is almost always too slow for training. This how-to shows how to point eegdash.paths.get_default_cache_dir() at the right tier, stage data once, and verify the cache before training.

The recipe assumes you already know how to populate the cache (see how_to_download_a_dataset) and load offline (see how_to_work_offline). We follow the cluster best practices summarised by Cisotto and Chicco (2024, doi:10.3389/fninf.2024.1338139): keep heavy IO on node-local storage, stage in at job start, and never read training data from a network-mounted home directory.


Goal#

Resolve cache_dir from a SLURM environment variable, stage data from a shared persistent location to per-node fast scratch at job start, and verify the cache hit on subsequent runs without contacting S3.

Prerequisites#

  • A SLURM/LSF/PBS account with one shared filesystem (e.g. /scratch or $SCRATCH) and one node-local fast disk ($TMPDIR, /local/$SLURM_JOB_ID, or an NVMe mount).

  • eegdash installed in the activated environment.

  • The dataset of interest already populated once on the shared filesystem (head node with internet, or via how_to_download_a_dataset).

Recipe#

Step 1 – Identify your storage tiers#

Most schedulers expose three useful paths. $HOME is shared and slow; never put the cache there. $SCRATCH (or /scratch/$USER) is shared and fast-ish but throttles under metadata-heavy reads. Per-job local scratch ($TMPDIR on Slurm with --tmp, or /local/$SLURM_JOB_ID) is the fastest option but is wiped at job exit. Inspect them in your job script:

# In your sbatch script
echo "HOME    = $HOME"
echo "SCRATCH = ${SCRATCH:-/scratch/$USER}"
echo "TMPDIR  = ${TMPDIR:-/tmp}"
df -h "$TMPDIR" "${SCRATCH:-/scratch/$USER}"
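If your site only provisions node-local scratch on request, you can ask the scheduler for it up front. A minimal sketch, assuming a Slurm site where --tmp reserves node-local disk (the size, job name, and walltime are illustrative; check your cluster documentation):

#!/bin/bash
#SBATCH --job-name=eegdash-train
#SBATCH --time=04:00:00
#SBATCH --tmp=50G   # require at least 50 GB of temporary disk on the node (site-specific)

# Fall back to /tmp if this partition does not set TMPDIR.
export TMPDIR="${TMPDIR:-/tmp}"
echo "Node-local scratch for this job: $TMPDIR"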

Step 2 – Set EEGDASH_CACHE_DIR to fast scratch#

eegdash.paths.get_default_cache_dir() honours the EEGDASH_CACHE_DIR environment variable first. Export it at the top of your sbatch script so every Python process in the job inherits the same path:

export EEGDASH_CACHE_DIR="${TMPDIR:-/tmp}/eegdash_cache"
mkdir -p "$EEGDASH_CACHE_DIR"

Verify from Python that the resolution works as expected.

import os
from pathlib import Path

from eegdash.paths import get_default_cache_dir

os.environ["EEGDASH_CACHE_DIR"] = str(Path.cwd() / ".eegdash_cache_local")
local_cache = get_default_cache_dir()
print(f"EEGDash will read/write under: {local_cache}")
assert local_cache == Path(os.environ["EEGDASH_CACHE_DIR"]).resolve()

Step 3 – Stage data from shared to node-local at job start#

The “stage-in” pattern is the workhorse of HPC IO: keep a single canonical copy of the dataset on shared scratch and rsync it to node-local disk at the start of each job. Reads during training then hit NVMe; the shared copy survives across jobs.

SHARED_CACHE="${SCRATCH:-/scratch/$USER}/eegdash_cache"   # persistent
LOCAL_CACHE="${TMPDIR:-/tmp}/eegdash_cache"               # volatile

mkdir -p "$LOCAL_CACHE"
# -a archive, --info=progress2 quieter than -v on large trees
rsync -a --info=progress2 "$SHARED_CACHE"/ "$LOCAL_CACHE"/
export EEGDASH_CACHE_DIR="$LOCAL_CACHE"

# Optional: stage-out fresh artefacts back, so the next job benefits.
trap 'rsync -a --update "$LOCAL_CACHE"/ "$SHARED_CACHE"/' EXIT
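On multi-node jobs each node has its own $TMPDIR, so the copy must run once per node. A sketch of the same stage-in with srun, assuming a standard Slurm allocation and the SHARED_CACHE/LOCAL_CACHE variables defined above:

# Populate every node's local disk: one stage-in task per allocated node.
srun --ntasks="$SLURM_NNODES" --ntasks-per-node=1 \
     bash -c "mkdir -p \"$LOCAL_CACHE\" && rsync -a \"$SHARED_CACHE/\" \"$LOCAL_CACHE/\""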

Step 4 – Verify cache hit on subsequent runs#

A correctly staged cache lets you instantiate the dataset with download=False and observe a non-zero record count without network IO. Use this as a smoke test at the top of your training script – if it fails, stage-in did not complete and you should fail fast rather than silently re-downloading from S3.

print("\nSimulating a stage-in verification (no real cluster needed):")
local_cache.mkdir(parents=True, exist_ok=True)
fake_record_dir = local_cache / "ds_demo" / "sub-01"
fake_record_dir.mkdir(parents=True, exist_ok=True)
(fake_record_dir / "sub-01_task-rest_eeg.bdf").touch()

n_records = sum(1 for _ in local_cache.rglob("*_eeg.bdf"))
assert n_records >= 1, "stage-in copied 0 records; abort job before training"
print(f"  cache_dir   = {local_cache}")
print(f"  records on disk = {n_records}")
print("  --> training can proceed offline (download=False)")

# In a real script you would now do, e.g.:
#
#     ds = EEGDashDataset(cache_dir=local_cache, dataset="ds005514",
#                         task="RestingState", download=False)
#     assert len(ds.datasets) == n_records
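The same fail-fast check can live in the job script itself, so the job aborts before any compute time is wasted. A sketch in bash, assuming the *_eeg.bdf pattern used in the demo above (adapt the glob to your dataset's recording format):

# Abort early if stage-in left no recordings on node-local disk.
n_records=$(find "$EEGDASH_CACHE_DIR" -type f -name '*_eeg.bdf' | wc -l)
if [ "$n_records" -eq 0 ]; then
    echo "stage-in copied 0 records; aborting before training" >&2
    exit 1
fi
echo "records on node-local disk: $n_records"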

Common pitfalls#

  • Home directory is shared and slow. Quotas on $HOME are tiny and the filesystem is not designed for thousands of concurrent reads. Putting the cache there is the most common cause of slow first-epoch IO.

  • Node-local cache disappears between jobs. $TMPDIR and /local/$SLURM_JOB_ID are wiped at job exit. Always keep the canonical copy on shared scratch and stage in fresh each job.

  • Race conditions when multiple jobs hit one cache. Two jobs writing into the same EEGDASH_CACHE_DIR can produce truncated files. Either give each job its own EEGDASH_CACHE_DIR (per-task subdirectory; see the sketch after this list) or pre-populate the shared cache once on a head node with internet access and run all subsequent jobs with download=False.

  • Metadata-server contention on Lustre/GPFS. Hundreds of small file stats during dataloading can throttle the whole filesystem. If first-epoch IO is slow but disk bandwidth is idle, the bottleneck is metadata, not throughput – move to node-local NVMe.
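For the race-condition pitfall, one option is to derive a per-task cache directory from scheduler variables, as sketched below (which variable applies depends on whether you launch job arrays or srun tasks):

# Per-task cache: array tasks and srun tasks each write into their own directory.
TASK_ID="${SLURM_ARRAY_TASK_ID:-${SLURM_PROCID:-0}}"
export EEGDASH_CACHE_DIR="${TMPDIR:-/tmp}/eegdash_cache_${TASK_ID}"
mkdir -p "$EEGDASH_CACHE_DIR"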

See also#

  • how_to_work_offline: how to run with download=False once the cache exists.

  • how_to_run_preprocessing_on_slurm: a full sbatch template wrapping the stage-in/stage-out pattern shown here.

References#

Cisotto, G., & Chicco, D. (2024). Ten quick tips for clinical electroencephalographic (EEG) data acquisition and signal processing. Frontiers in Neuroinformatics, 18, 1338139. doi:10.3389/fninf.2024.1338139