EEGDashDataset

class eegdash.EEGDashDataset(cache_dir: str | Path, query: dict[str, Any] = None, description_fields: list[str] | None = None, s3_bucket: str | None = None, records: list[dict] | None = None, download: bool = True, n_jobs: int = -1, eeg_dash_instance: Any = None, database: str | None = None, auth_token: str | None = None, on_error: str = 'raise', max_concurrency: int = 20, description_precedence: str = 'participant_tsv', **kwargs)

Bases: BaseConcatDataset

Create a new EEGDashDataset from a given query, or from a local BIDS dataset directory and dataset name. An EEGDashDataset is a pooled collection of EEGDashBaseDataset instances (individual recordings) and is a subclass of braindecode’s BaseConcatDataset.

Examples

Basic usage with dataset and subject filtering:

>>> from eegdash import EEGDashDataset
>>> dataset = EEGDashDataset(
...     cache_dir="./data",
...     dataset="ds002718",
...     subject="012"
... )
>>> print(f"Number of recordings: {len(dataset)}")

Filter by multiple subjects and specific task:

>>> subjects = ["012", "013", "014"]
>>> dataset = EEGDashDataset(
...     cache_dir="./data",
...     dataset="ds002718",
...     subject=subjects,
...     task="RestingState"
... )

Load and inspect EEG data from recordings:

>>> if len(dataset) > 0:
...     recording = dataset[0]
...     raw = recording.load()
...     print(f"Sampling rate: {raw.info['sfreq']} Hz")
...     print(f"Number of channels: {len(raw.ch_names)}")
...     print(f"Duration: {raw.times[-1]:.1f} seconds")

Advanced filtering with raw MongoDB queries:

>>> from eegdash import EEGDashDataset
>>> query = {
...     "dataset": "ds002718",
...     "subject": {"$in": ["012", "013"]},
...     "task": "RestingState"
... }
>>> dataset = EEGDashDataset(cache_dir="./data", query=query)

Working with dataset collections and braindecode integration:

>>> # EEGDashDataset is a braindecode BaseConcatDataset
>>> for i, recording in enumerate(dataset):
...     if i >= 2:  # limit output
...         break
...     print(f"Recording {i}: {recording.description}")
...     raw = recording.load()
...     print(f"  Channels: {len(raw.ch_names)}, Duration: {raw.times[-1]:.1f}s")
Parameters:
  • cache_dir (str | Path) – Directory where data are cached locally.

  • query (dict | None) – Raw MongoDB query to filter records. If provided, it is merged with keyword filtering arguments (see **kwargs) using logical AND. You must provide at least a dataset (either in query or as a keyword argument). Only fields in ALLOWED_QUERY_FIELDS are considered for filtering.

  • dataset (str) – Dataset identifier (e.g., "ds002718"). Required if query does not already specify a dataset.

  • task (str | list[str]) – Task name(s) to filter by (e.g., "RestingState").

  • subject (str | list[str]) – Subject identifier(s) to filter by (e.g., "NDARCA153NKE").

  • session (str | list[str]) – Session identifier(s) to filter by (e.g., "1").

  • run (str | list[str]) – Run identifier(s) to filter by (e.g., "1").

  • description_fields (list[str]) – Fields to extract from each record and include in dataset descriptions (e.g., “subject”, “session”, “run”, “task”).

  • s3_bucket (str | None) – Optional S3 bucket URI (e.g., “s3://mybucket”) to use instead of the default OpenNeuro bucket when downloading data files.

  • records (list[dict] | None) – Pre-fetched metadata records. If provided, the dataset is constructed directly from these records and no MongoDB query is performed.

  • download (bool, default True) – If False, load from local BIDS files only. Local data are expected under cache_dir / dataset; no DB or S3 access is attempted.

  • n_jobs (int) – Number of parallel jobs to use where applicable (-1 uses all cores).

  • eeg_dash_instance (EEGDash | None) – Optional existing EEGDash client to reuse for DB queries. If None, a new client is created on demand. Not used when download is False.

  • database (str | None) – Database name to use (e.g., “eegdash”, “eegdash_staging”). If None, uses the default database.

  • auth_token (str | None) – Authentication token for accessing protected databases. Required for staging or admin operations.

  • max_concurrency (int, default 20) – Maximum number of parallel S3 transfer connections used when downloading data. Higher values speed up large/multi-file downloads but consume more bandwidth.

  • on_error (str, default "raise") –

    How to handle DataIntegrityError when accessing .raw on individual recordings:

    • "raise" (default): propagate the exception.

    • "warn": log the error as a warning and set .raw to None.

    • "skip": silently set .raw to None.

    Skipped recordings are flagged via ds._skipped so callers can filter them out with a list comprehension after iteration.

  • description_precedence (str, default "participant_tsv") –

    Which source wins when the same field appears in both the record and the embedded participant_tsv data:

    • "participant_tsv" (default): the participant_tsv value overwrites the record value, including None values.

    • "record": the record-level value is kept.

    Raises ValueError if not one of the above.

  • **kwargs (dict) –

    Additional keyword arguments serving two purposes:

    • Filtering: any keys present in ALLOWED_QUERY_FIELDS are treated as query filters (e.g., dataset, subject, task, …).

    • Dataset options: remaining keys are forwarded to EEGDashRaw.
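
The way query and keyword arguments are combined, as described for query and **kwargs above, can be sketched roughly as follows. This is an illustration only, not eegdash's implementation: the contents of ALLOWED_QUERY_FIELDS, the helper name split_and_merge, the conversion of list values to "$in", the use of "$and" for the merge, and the preload option are all assumptions made for the example.

```python
from __future__ import annotations

from typing import Any

# Hypothetical stand-in; the real ALLOWED_QUERY_FIELDS lives in eegdash and may differ.
ALLOWED_QUERY_FIELDS = {"dataset", "subject", "task", "session", "run"}


def split_and_merge(query: dict[str, Any] | None = None, **kwargs: Any):
    """Return (mongo_query, dataset_options) for the given arguments."""
    filters: dict[str, Any] = {}
    options: dict[str, Any] = {}
    for key, value in kwargs.items():
        if key in ALLOWED_QUERY_FIELDS:
            # Assumption: a list of values means "match any of these".
            filters[key] = {"$in": list(value)} if isinstance(value, (list, tuple)) else value
        else:
            # Everything else is treated as a dataset option.
            options[key] = value

    if query and filters:
        # Logical AND of the raw query and the keyword filters.
        merged = {"$and": [query, filters]}
    else:
        merged = dict(query or filters)
    return merged, options


merged, options = split_and_merge(
    query={"dataset": "ds002718"},
    subject=["012", "013"],
    task="RestingState",
    preload=False,  # hypothetical dataset option, not a query field
)
print(merged)
print(options)
```

In this sketch the keyword filters never overwrite keys of the raw query; both sides have to match.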

property cumulative_sizes: list[int]

Recompute cumulative sizes from current dataset lengths.

Overrides the cached version from BaseConcatDataset because individual dataset lengths can change after lazy raw loading (estimated ntimes from JSON metadata may differ from actual n_times in the raw file).
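
A minimal sketch of why recomputing matters, using stand-in classes rather than the real eegdash or braindecode types. LazyRecording and ConcatSketch are hypothetical names; the only point is that a property computed on access follows length changes, whereas a list cached at construction time would not.

```python
from itertools import accumulate


class LazyRecording:
    """Stand-in for a recording whose length is only an estimate until loaded."""

    def __init__(self, estimated: int, actual: int):
        self._estimated = estimated
        self._actual = actual
        self.loaded = False

    def load(self) -> None:
        self.loaded = True

    def __len__(self) -> int:
        return self._actual if self.loaded else self._estimated


class ConcatSketch:
    def __init__(self, datasets):
        self.datasets = list(datasets)

    @property
    def cumulative_sizes(self) -> list[int]:
        # Recomputed on every access, so it follows length changes.
        return list(accumulate(len(ds) for ds in self.datasets))


concat = ConcatSketch([LazyRecording(1000, 998), LazyRecording(500, 500)])
before = concat.cumulative_sizes   # based on estimated lengths
concat.datasets[0].load()          # the first recording now reports its real length
after = concat.cumulative_sizes
print(before, after)
```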

download_all(n_jobs: int | None = None) -> None

Download missing remote files in parallel.

Parameters:

n_jobs (int | None) – Number of parallel workers to use. If None, defaults to self.n_jobs.

property cummulative_sizes

static cumsum(sequence)

property description: DataFrame

get_metadata() -> DataFrame

Concatenate the metadata and description of the wrapped Epochs.

Returns:

metadata – DataFrame containing as many rows as there are windows in the BaseConcatDataset, with the metadata and description information for each window.

Return type:

pd.DataFrame

classmethod pull_from_hub(repo_id: str, preload: bool = True, token: str | None = None, cache_dir: str | Path | None = None, force_download: bool = False, revision: str | None = None, **kwargs)

Load a dataset from the Hugging Face Hub.

Parameters:
  • repo_id (str) – Repository ID on the Hugging Face Hub (e.g., “username/dataset-name”).

  • preload (bool, default=True) – Whether to preload the data into memory. If False, uses lazy loading (when supported by the format).

  • token (str | None) – Hugging Face API token. If None, uses cached token.

  • cache_dir (str | Path | None) – Directory to cache the downloaded dataset. If None, uses default cache directory (~/.cache/huggingface/datasets).

  • force_download (bool, default=False) – Whether to force re-download even if cached.

  • revision (str | None, default=None) – Specific branch, tag, or commit to download. If None, uses the repository’s default revision.

  • **kwargs – Additional arguments (currently unused).

Returns:

The loaded dataset.

Return type:

BaseConcatDataset

Raises:

Examples

>>> from braindecode.datasets import BaseConcatDataset
>>> dataset = BaseConcatDataset.pull_from_hub("username/nmt-dataset")
>>> print(f"Loaded {len(dataset)} windows")
>>>
>>> # Use with PyTorch
>>> from torch.utils.data import DataLoader
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
push_to_hub(repo_id: str, private: bool = False, token: str | None = None, compression: str = 'blosc', compression_level: int = 5, pipeline_name: str = 'braindecode', chunk_size: int = 5000000, local_cache_dir: str | Path | None = None, **kwargs) -> str

Upload the dataset to the Hugging Face Hub in BIDS-like Zarr format.

The dataset is converted to Zarr format with blosc compression, which provides optimal random access performance for PyTorch training. The data is stored in a BIDS sourcedata-like structure with events.tsv, channels.tsv, and participants.tsv sidecar files.

Parameters:
  • repo_id (str) – Repository ID on the Hugging Face Hub (e.g., “username/dataset-name”).

  • private (bool, default=False) – Whether to create a private repository.

  • token (str | None) – Hugging Face API token. If None, uses cached token.

  • compression (str, default="blosc") – Compression algorithm for Zarr. Options: “blosc”, “zstd”, “gzip”, None.

  • compression_level (int, default=5) – Compression level (0-9). The default of 5 is a compromise between compression ratio and speed.

  • pipeline_name (str, default="braindecode") – Name of the processing pipeline for BIDS sourcedata.

  • chunk_size (int, default=5_000_000) – Number of samples per chunk in Zarr along the time/window dimension. Larger chunk sizes create fewer but larger chunks/files. This parameter is used for both continuous data (e.g., RawDataset, EEGWindowsDataset) and pre-cut windows (WindowsDataset). For WindowsDataset, multiple windows may be stored in a single chunk depending on their duration and the chosen chunk_size.

  • local_cache_dir (str | Path | None) –

    Local directory to use for temporary files during upload. If None, uses the system temp directory and cleans it up after upload. If provided, the directory is used as a persistent cache:

    • If the directory is empty (or does not exist), the cache is built there and a lock file (format_info.json) is written once the cache is complete, before the upload starts. The file contains the zarr conversion parameters as JSON.

    • If the lock file is present and its JSON parameters match the current call, cache creation is skipped and the upload resumes directly (useful for retrying interrupted uploads).

    • If the lock file is present but its JSON parameters differ from the current call, a ValueError is raised.

    • If the directory is non-empty but the lock file is absent, a ValueError is raised listing the files found.

  • **kwargs – Additional arguments passed to huggingface_hub.upload_large_folder().
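
The four local_cache_dir cases listed above can be read as a small decision procedure. The sketch below mirrors that description with the standard library only. The helper name cache_action and the exact set of parameters stored in the lock file are assumptions for illustration; the actual braindecode code may be organized differently.

```python
import json
import tempfile
from pathlib import Path

LOCK_FILE = "format_info.json"  # file name taken from the parameter description


def cache_action(cache_dir: Path, params: dict) -> str:
    """Return "build" or "resume", or raise ValueError for an unusable cache."""
    existing = sorted(p.name for p in cache_dir.iterdir()) if cache_dir.exists() else []
    if not existing:
        return "build"  # empty or missing directory: create the cache
    lock = cache_dir / LOCK_FILE
    if not lock.exists():
        raise ValueError(f"Cache directory is not empty and has no lock file: {existing}")
    if json.loads(lock.read_text()) != params:
        raise ValueError("Cached conversion parameters differ from the current call")
    return "resume"  # complete cache with matching parameters


# Assumed parameter set; which values are really recorded is not stated here.
params = {"compression": "blosc", "compression_level": 5, "chunk_size": 5_000_000}
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "cache"
    first = cache_action(cache, params)        # directory does not exist yet
    cache.mkdir()
    (cache / LOCK_FILE).write_text(json.dumps(params))
    second = cache_action(cache, params)       # lock file present and matching
    try:
        cache_action(cache, {**params, "compression_level": 9})
        third = "resume"
    except ValueError:
        third = "error"                        # lock file present but different
print(first, second, third)
```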

Returns:

URL of the uploaded dataset on the Hub.

Return type:

str

Raises:
  • ImportError – If huggingface-hub is not installed.

  • ValueError – If the dataset is empty or format is invalid.

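Examples

>>> from braindecode.datasets import NMT
>>> # path is assumed to point to a local copy of the NMT dataset
>>> dataset = NMT(path=path, preload=True)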
>>> # Upload with BIDS-like structure
>>> url = dataset.push_to_hub(
...     repo_id="myusername/nmt-dataset",
... )
save(path: str, overwrite: bool = False, offset: int = 0)

Save datasets to files by creating one subdirectory for each dataset:

path/
    0/
        0-raw.fif | 0-epo.fif
        description.json
        raw_preproc_kwargs.json (if raws were preprocessed)
        window_kwargs.json (if this is a windowed dataset)
        window_preproc_kwargs.json  (if windows were preprocessed)
        target_name.json (if target_name is not None and dataset is raw)
    1/
        1-raw.fif | 1-epo.fif
        description.json
        raw_preproc_kwargs.json (if raws were preprocessed)
        window_kwargs.json (if this is a windowed dataset)
        window_preproc_kwargs.json  (if windows were preprocessed)
        target_name.json (if target_name is not None and dataset is raw)
Parameters:
  • path (str) – Directory in which subdirectories are created to store the -raw.fif | -epo.fif and .json files.

  • overwrite (bool) – Whether to delete old subdirectories that will be saved to in this call.

  • offset (int) – If provided, the integer is added to the id of the dataset in the concat. This is useful in the setting of very large datasets, where one dataset has to be processed and saved at a time to account for its original position.
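
To make the effect of offset concrete, here is a small sketch that only computes the subdirectory names a call would target, following the layout shown above. target_dirs is a hypothetical helper written for this example and is not part of braindecode.

```python
from pathlib import Path


def target_dirs(path: str, n_datasets: int, offset: int = 0) -> list[Path]:
    """Subdirectories that a save call with this offset would write to."""
    return [Path(path) / str(i + offset) for i in range(n_datasets)]


# Two batches of recordings saved one after the other into the same root.
first_batch = target_dirs("out", n_datasets=3)
second_batch = target_dirs("out", n_datasets=2, offset=3)
print([p.name for p in first_batch + second_batch])
```

With the real method, the second batch would be written with something like concat.save(path, offset=3), so that its subdirectories continue the numbering of the first batch instead of colliding with it.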

set_description(description: dict | DataFrame, overwrite: bool = False)

Update (add or overwrite) the dataset description.

Parameters:
  • description (dict | pd.DataFrame) – Description in the form key: value where the length of the value has to match the number of datasets.

  • overwrite (bool) – Has to be True if a key in description already exists in the dataset description.
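
The two constraints above (one value per dataset, and overwrite=True for existing keys) can be illustrated with plain dictionaries. This is a simplified model, not the braindecode implementation: descriptions are represented as one dict per dataset, and the choice of ValueError for both failure cases is an assumption.

```python
def set_description(descriptions: list[dict], new: dict[str, list], overwrite: bool = False) -> None:
    """Add one value per dataset for every key in `new`."""
    n = len(descriptions)
    # Validate everything first so a failure leaves the descriptions untouched.
    for key, values in new.items():
        if len(values) != n:
            raise ValueError(f"{key!r} has {len(values)} values for {n} datasets")
        if not overwrite and any(key in d for d in descriptions):
            raise ValueError(f"{key!r} already exists; pass overwrite=True")
    for key, values in new.items():
        for desc, value in zip(descriptions, values):
            desc[key] = value


descriptions = [{"subject": "012"}, {"subject": "013"}]
set_description(descriptions, {"age": [31, 27]})                  # new key
set_description(descriptions, {"age": [32, 28]}, overwrite=True)  # existing key
print(descriptions)
```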

set_target(column: Hashable) -> BaseConcatDataset

Use column as the target y for every subdataset.

Dispatches on the subdataset type:

  • For WindowsDataset / EEGWindowsDataset, column is looked up in per-window metadata first, then in the per-record description (broadcast to every window). The resolved values overwrite ds.metadata['target'] and ds.y. For WindowsDataset, the underlying ds.windows.metadata is kept in sync so get_metadata() and the repr reflect the new target.

  • For RawDataset, column must exist on the description. ds.target_name is set to column so __getitem__ reads description[column] as y on every access — no rebuild needed.

Parameters:

column (Hashable) – Name of a metadata column or description field (BIDS entity, participants.tsv extra, …). Typically a string, but any hashable that pandas accepts as a column label is allowed.

Returns:

self

Return type:

BaseConcatDataset

Raises:
  • TypeError – If any subdataset is not a WindowsDataset, EEGWindowsDataset, or RawDataset, or if a windowed subdataset has lazy (non-DataFrame) metadata.

  • ValueError – If column is not present on a subdataset’s metadata or description, or if a windowed subdataset has targets_from='channels' (which would make this a silent no-op since __getitem__ reads y from misc channels, not from metadata['target']).
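
The lookup order described for windowed datasets (per-window metadata first, then the per-record description broadcast to every window) can be sketched with plain Python containers. resolve_target is a hypothetical helper; the real method operates on pandas objects and performs the additional type checks listed under Raises.

```python
def resolve_target(column, window_metadata: list[dict], description: dict) -> list:
    """Return one target value per window for `column`."""
    if window_metadata and all(column in row for row in window_metadata):
        # Per-window metadata takes priority.
        return [row[column] for row in window_metadata]
    if column in description:
        # A per-record value is broadcast to every window.
        return [description[column]] * len(window_metadata)
    raise ValueError(f"{column!r} not found in metadata or description")


metadata = [{"trial_type": "left"}, {"trial_type": "right"}]
description = {"subject": "012", "age": 31}
per_window = resolve_target("trial_type", metadata, description)
broadcast = resolve_target("age", metadata, description)
print(per_window, broadcast)
```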

split(by: str | list[int] | list[list[int]] | dict[str, list[int]] | None = None, property: str | None = None, split_ids: list[int] | list[list[int]] | dict[str, list[int]] | None = None) -> dict[str, BaseConcatDataset]

Split the dataset based on information listed in its description.

The split can be defined either by a column of the description DataFrame or by explicit indices.

Parameters:
  • by (str | list | dict) – If by is a string, splitting is performed based on the description DataFrame column with this name. If by is a (list of) list of integers, the position in the first list corresponds to the split id and the integers to the datapoints of that split. If a dict then each key will be used in the returned splits dict and each value should be a list of int.

  • property (str) –

    Deprecated

    Some property which is listed in the info DataFrame.

  • split_ids (list | dict) –

    Deprecated

    List of indices to be combined in a subset. It can be a list of int or a list of list of int.

Returns:

splits – A dictionary with the name of the split (a string) as key and the dataset as value.

Return type:

dict
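
The three accepted forms of by can be compared side by side with a simplified model that returns dataset indices instead of BaseConcatDataset objects. split_indices is a hypothetical helper, and treating a flat list of integers as a single split named "0" is an assumption of this sketch.

```python
def split_indices(descriptions: list[dict], by) -> dict[str, list[int]]:
    """Map split names to dataset indices for the three forms of `by`."""
    if isinstance(by, str):
        # Group datasets by the value of a description column.
        groups: dict[str, list[int]] = {}
        for i, desc in enumerate(descriptions):
            groups.setdefault(str(desc[by]), []).append(i)
        return groups
    if isinstance(by, dict):
        # Keys become split names, values are the indices.
        return {str(name): list(ids) for name, ids in by.items()}
    if by and isinstance(by[0], int):
        by = [by]  # a flat list of ints is treated as a single split
    # The position in the outer list becomes the split name.
    return {str(pos): list(ids) for pos, ids in enumerate(by)}


descriptions = [{"subject": "012"}, {"subject": "013"}, {"subject": "012"}]
by_column = split_indices(descriptions, "subject")
by_lists = split_indices(descriptions, [[0, 1], [2]])
by_dict = split_indices(descriptions, {"train": [0, 1], "valid": [2]})
print(by_column, by_lists, by_dict)
```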

property target_transform

to_epochs_dataset() -> BaseConcatDataset[WindowsDataset]

Converts this BaseConcatDataset such that all datasets are WindowsDataset with mne.Epochs.

In Braindecode, the data can either be stored as mne.io.Raw (in EEGWindowsDataset) or as mne.Epochs (in WindowsDataset). This function converts all the underlying datasets to WindowsDataset with mne.Epochs. This can be useful for reducing disk space when you want to save a dataset.

Returns:

A new BaseConcatDataset where all datasets are WindowsDataset with mne.Epochs.

Return type:

BaseConcatDataset[WindowsDataset]

Raises:

ValueError – If any of the underlying datasets is a RawDataset or any other type that is not EEGWindowsDataset or WindowsDataset, as they cannot be converted to epochs.

property transform

datasets

Usage Example

from eegdash import EEGDashDataset

dataset = EEGDashDataset(cache_dir="./data", dataset="ds002718")
print(f"Number of recordings: {len(dataset)}")
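
When a dataset is created with on_error="skip" or on_error="warn", the on_error parameter description notes that skipped recordings are flagged via ds._skipped and can be filtered out with a list comprehension. The sketch below shows that filtering pattern on stand-in objects; the SimpleNamespace records are placeholders for real recordings, and only the _skipped attribute name comes from the documentation above.

```python
from types import SimpleNamespace

# Stand-ins for recordings after iterating a dataset built with on_error="skip".
recordings = [
    SimpleNamespace(name="rec-0", raw=object(), _skipped=False),
    SimpleNamespace(name="rec-1", raw=None, _skipped=True),   # failed integrity check
    SimpleNamespace(name="rec-2", raw=object(), _skipped=False),
]

# Keep only recordings that were not flagged; getattr guards against a missing flag.
usable = [rec for rec in recordings if not getattr(rec, "_skipped", False)]
print([rec.name for rec in usable])
```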

See Also