Place the EEGDash cache on shared or local cluster storage#
On HPC clusters, where you put the EEGDash cache is often the difference
between a 30-minute and a 30-second epoch. Shared filesystems (Lustre, GPFS,
NFS) survive job restarts but throttle under metadata-heavy access; node-local
NVMe is fast but volatile; and $HOME is almost always too slow for
training. This how-to shows how to point eegdash.paths.get_default_cache_dir()
at the right tier, stage data once, and verify the cache before training.
The recipe assumes you already know how to populate the cache (see
how_to_download_a_dataset) and load offline (see how_to_work_offline).
We follow the cluster-software best practices summarised by Cisotto and
Chicco (2024, doi:10.3389/fninf.2024.1338139): keep heavy IO on node-local
storage, stage in at job start, and never read training data from the
network-mounted home directory.
Goal#
Resolve cache_dir from a SLURM environment variable, stage data from a
shared persistent location to per-node fast scratch at job start, and verify
the cache hit on subsequent runs without contacting S3.
Prerequisites#
- A SLURM/LSF/PBS account with one shared filesystem (e.g. /scratch or $SCRATCH) and one node-local fast disk ($TMPDIR, /local/$SLURM_JOB_ID, or an NVMe mount).
- eegdash installed in the activated environment.
- The dataset of interest already populated once on the shared filesystem (from a head node with internet access, or via how_to_download_a_dataset).
Recipe#
Step 1 – Identify your storage tiers#
Most schedulers expose three useful paths. $HOME is shared and slow;
never put the cache there. $SCRATCH (or /scratch/$USER) is shared
and fast-ish but throttles under metadata-heavy reads. Per-job local scratch
($TMPDIR on Slurm with --tmp, or /local/$SLURM_JOB_ID) is the
fastest option but is wiped at job exit. Inspect them in your job script:
# In your sbatch script
echo "HOME = $HOME"
echo "SCRATCH = ${SCRATCH:-/scratch/$USER}"
echo "TMPDIR = ${TMPDIR:-/tmp}"
df -h "$TMPDIR" "${SCRATCH:-/scratch/$USER}"
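If you are unsure which tier is actually fastest on your site, a crude write
probe gives a first-order answer. The sketch below is an optional benchmark,
not part of the recipe: it assumes GNU dd and roughly 1 GiB of free quota per
tier (shrink count if $HOME quotas are tight).

# Rough write-throughput probe; GNU dd prints MB/s on its last line.
for d in "$HOME" "${SCRATCH:-/scratch/$USER}" "${TMPDIR:-/tmp}"; do
    f="$d/.eegdash_io_probe.$$"
    dd if=/dev/zero of="$f" bs=1M count=1024 conv=fsync 2>&1 | tail -n 1
    rm -f "$f"
done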
Step 2 – Set EEGDASH_CACHE_DIR to fast scratch#
eegdash.paths.get_default_cache_dir() honours the EEGDASH_CACHE_DIR
environment variable first. Export it at the top of your sbatch script so
every Python process in the job inherits the same path:
export EEGDASH_CACHE_DIR="${TMPDIR:-/tmp}/eegdash_cache"
mkdir -p "$EEGDASH_CACHE_DIR"
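Everything downstream depends on that export, so it is worth making an unset
variable a hard error instead of letting the cache silently fall back to its
default location. A one-line guard using the standard shell :? expansion
(a sketch to adapt, not part of eegdash itself):

# Abort the job immediately if the export above was skipped or misspelled.
: "${EEGDASH_CACHE_DIR:?is unset - refusing to fall back to the default cache dir}"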
Verify from Python that the resolution works as expected.
import os
from pathlib import Path

from eegdash.paths import get_default_cache_dir

# For this demo we point the cache at a writable directory under the current
# working directory; in the job script this is the $TMPDIR path exported above.
os.environ["EEGDASH_CACHE_DIR"] = str(Path.cwd() / ".eegdash_cache_local")
local_cache = get_default_cache_dir()
print(f"EEGDash will read/write under: {local_cache}")

# get_default_cache_dir() must resolve to exactly what we exported.
assert local_cache == Path(os.environ["EEGDASH_CACHE_DIR"]).resolve()
Step 3 – Stage data from shared to node-local at job start#
The “stage-in” pattern is the workhorse of HPC IO: keep a single canonical
copy of the dataset on shared scratch and rsync it to node-local disk
at the start of each job. Reads during training then hit NVMe; the shared
copy survives across jobs.
SHARED_CACHE="${SCRATCH:-/scratch/$USER}/eegdash_cache" # persistent
LOCAL_CACHE="${TMPDIR:-/tmp}/eegdash_cache" # volatile
mkdir -p "$LOCAL_CACHE"
# -a archive, --info=progress2 quieter than -v on large trees
rsync -a --info=progress2 "$SHARED_CACHE"/ "$LOCAL_CACHE"/
export EEGDASH_CACHE_DIR="$LOCAL_CACHE"
# Optional: stage-out fresh artefacts back, so the next job benefits.
trap 'rsync -a --update "$LOCAL_CACHE"/ "$SHARED_CACHE"/' EXIT
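Putting Steps 1–3 together, a job script might look like the minimal sketch
below. The #SBATCH directives, the --tmp request, and train.py are
placeholders for your own site and entry point, and the find check
anticipates the verification in Step 4:

#!/bin/bash
#SBATCH --job-name=eegdash-train
#SBATCH --time=04:00:00
#SBATCH --tmp=100G                     # node-local scratch request; site-specific

SHARED_CACHE="${SCRATCH:-/scratch/$USER}/eegdash_cache"   # persistent copy
LOCAL_CACHE="${TMPDIR:-/tmp}/eegdash_cache"               # fast, volatile

mkdir -p "$LOCAL_CACHE"
rsync -a --info=progress2 "$SHARED_CACHE"/ "$LOCAL_CACHE"/
export EEGDASH_CACHE_DIR="$LOCAL_CACHE"
trap 'rsync -a --update "$LOCAL_CACHE"/ "$SHARED_CACHE"/' EXIT

# Shell-level smoke test before spending compute time (mirrors Step 4).
[ "$(find "$LOCAL_CACHE" -name '*_eeg.*' | wc -l)" -ge 1 ] \
    || { echo "stage-in copied 0 records; aborting" >&2; exit 1; }

python train.py                        # reads the cache offline from here on

Because the trap fires on normal exit and on most failures, fresh artefacts
are copied back to shared scratch even when training crashes partway.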
Step 4 – Verify cache hit on subsequent runs#
A correctly staged cache lets you instantiate the dataset with
download=False and observe a non-zero record count without network IO.
Use this as a smoke test at the top of your training script – if it fails,
stage-in did not complete and you should fail fast rather than silently
re-downloading from S3.
print("\nSimulating a stage-in verification (no real cluster needed):")
local_cache.mkdir(parents=True, exist_ok=True)
fake_record_dir = local_cache / "ds_demo" / "sub-01"
fake_record_dir.mkdir(parents=True, exist_ok=True)
(fake_record_dir / "sub-01_task-rest_eeg.bdf").touch()
n_records = sum(1 for _ in local_cache.rglob("*_eeg.bdf"))
assert n_records >= 1, "stage-in copied 0 records; abort job before training"
print(f" cache_dir = {local_cache}")
print(f" records on disk = {n_records}")
print(" --> training can proceed offline (download=False)")
# In a real script you would now do, e.g.:
#
# ds = EEGDashDataset(cache_dir=local_cache, dataset="ds005514",
# task="RestingState", download=False)
# assert len(ds.datasets) == n_records
Common pitfalls#
- Home directory is shared and slow. Quotas on $HOME are tiny and the filesystem is not designed for thousands of concurrent reads. Putting the cache there is the most common cause of slow first-epoch IO.
- Node-local cache disappears between jobs. $TMPDIR and /local/$SLURM_JOB_ID are wiped at job exit. Always keep the canonical copy on shared scratch and stage in fresh each job.
- Race conditions when multiple jobs hit one cache. Two jobs writing into the same EEGDASH_CACHE_DIR can produce truncated files. Either give each job its own EEGDASH_CACHE_DIR (per-task subdirectory, as sketched after this list) or pre-populate the shared cache once on a head node with internet access and run all subsequent jobs with download=False.
- Metadata-server contention on Lustre/GPFS. Hundreds of small file stats during dataloading can throttle the whole filesystem. If first-epoch IO is slow but disk bandwidth is idle, the bottleneck is metadata, not throughput – move to node-local NVMe.
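For the race-condition pitfall, one way to isolate concurrent array tasks is
to key the cache path on the task ID. A minimal sketch, assuming a Slurm
array job (SLURM_ARRAY_TASK_ID is only set inside arrays, hence the :-0
fallback):

# One cache per array task: no two tasks ever write the same files.
export EEGDASH_CACHE_DIR="${TMPDIR:-/tmp}/eegdash_cache_${SLURM_ARRAY_TASK_ID:-0}"
mkdir -p "$EEGDASH_CACHE_DIR"
rsync -a "${SCRATCH:-/scratch/$USER}/eegdash_cache"/ "$EEGDASH_CACHE_DIR"/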
See also#
- how_to_work_offline: run with download=False once the cache exists.
- how_to_run_preprocessing_on_slurm: a full sbatch template wrapping the stage-in/stage-out pattern shown here.
References#
Cisotto, G., & Chicco, D. (2024). Ten quick tips for clinical electroencephalographic (EEG) data acquisition and signal processing. Frontiers in Neuroinformatics, 18, 1338139. doi:10.3389/fninf.2024.1338139