EEGDashDataset#

class eegdash.EEGDashDataset(cache_dir: str | Path, query: dict[str, Any] = None, description_fields: list[str] | None = None, s3_bucket: str | None = None, records: list[dict] | None = None, download: bool = True, n_jobs: int = -1, eeg_dash_instance: Any = None, database: str | None = None, auth_token: str | None = None, **kwargs)[source]#

Bases: BaseConcatDataset

Create a new EEGDashDataset from a database query, or from a local BIDS dataset directory and dataset name. An EEGDashDataset is a pooled collection of EEGDashBaseDataset instances (individual recordings) and is a subclass of braindecode’s BaseConcatDataset.

Examples

Basic usage with dataset and subject filtering:

>>> from eegdash import EEGDashDataset
>>> dataset = EEGDashDataset(
...     cache_dir="./data",
...     dataset="ds002718",
...     subject="012"
... )
>>> print(f"Number of recordings: {len(dataset)}")

Filter by multiple subjects and specific task:

>>> subjects = ["012", "013", "014"]
>>> dataset = EEGDashDataset(
...     cache_dir="./data",
...     dataset="ds002718",
...     subject=subjects,
...     task="RestingState"
... )

Load and inspect EEG data from recordings:

>>> if len(dataset) > 0:
...     recording = dataset[0]
...     raw = recording.load()
...     print(f"Sampling rate: {raw.info['sfreq']} Hz")
...     print(f"Number of channels: {len(raw.ch_names)}")
...     print(f"Duration: {raw.times[-1]:.1f} seconds")

Advanced filtering with raw MongoDB queries:

>>> from eegdash import EEGDashDataset
>>> query = {
...     "dataset": "ds002718",
...     "subject": {"$in": ["012", "013"]},
...     "task": "RestingState"
... }
>>> dataset = EEGDashDataset(cache_dir="./data", query=query)

Working with dataset collections and braindecode integration:

>>> # EEGDashDataset is a braindecode BaseConcatDataset
>>> for i, recording in enumerate(dataset):
...     if i >= 2:  # limit output
...         break
...     print(f"Recording {i}: {recording.description}")
...     raw = recording.load()
...     print(f"  Channels: {len(raw.ch_names)}, Duration: {raw.times[-1]:.1f}s")
Parameters:
  • cache_dir (str | Path) – Directory where data are cached locally.

  • query (dict | None) – Raw MongoDB query to filter records. If provided, it is merged with keyword filtering arguments (see **kwargs) using logical AND. You must provide at least a dataset (either in query or as a keyword argument). Only fields in ALLOWED_QUERY_FIELDS are considered for filtering.

  • dataset (str) – Dataset identifier (e.g., "ds002718"). Required if query does not already specify a dataset.

  • task (str | list[str]) – Task name(s) to filter by (e.g., "RestingState").

  • subject (str | list[str]) – Subject identifier(s) to filter by (e.g., "NDARCA153NKE").

  • session (str | list[str]) – Session identifier(s) to filter by (e.g., "1").

  • run (str | list[str]) – Run identifier(s) to filter by (e.g., "1").

  • description_fields (list[str]) – Fields to extract from each record and include in dataset descriptions (e.g., “subject”, “session”, “run”, “task”).

  • s3_bucket (str | None) – Optional S3 bucket URI (e.g., “s3://mybucket”) to use instead of the default OpenNeuro bucket when downloading data files.

  • records (list[dict] | None) – Pre-fetched metadata records. If provided, the dataset is constructed directly from these records and no MongoDB query is performed.

  • download (bool, default True) – If False, load from local BIDS files only. Local data are expected under cache_dir / dataset; no DB or S3 access is attempted.

  • n_jobs (int) – Number of parallel jobs to use where applicable (-1 uses all cores).

  • eeg_dash_instance (EEGDash | None) – Optional existing EEGDash client to reuse for DB queries. If None, a new client is created on demand. Not used when download=False.

  • database (str | None) – Database name to use (e.g., “eegdash”, “eegdash_staging”). If None, uses the default database.

  • auth_token (str | None) – Authentication token for accessing protected databases. Required for staging or admin operations.

  • **kwargs (dict) –

    Additional keyword arguments serving two purposes:

    • Filtering: any keys present in ALLOWED_QUERY_FIELDS are treated as query filters (e.g., dataset, subject, task, …); see the sketch after this parameter list.

    • Dataset options: remaining keys are forwarded to EEGDashRaw.
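
For instance, a raw query can be combined with keyword filters (they are merged with logical AND), and download=False restricts loading to already-cached files. A minimal sketch reusing the illustrative identifiers from the examples above:

>>> from eegdash import EEGDashDataset
>>> dataset = EEGDashDataset(
...     cache_dir="./data",
...     query={"dataset": "ds002718"},
...     subject=["012", "013"],   # ALLOWED_QUERY_FIELDS key, treated as a filter
...     task="RestingState",
... )
>>> # Offline use: read previously cached BIDS files under ./data/ds002718
>>> local = EEGDashDataset(cache_dir="./data", dataset="ds002718", download=False)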

download_all(n_jobs: int | None = None) None[source]#

Download missing remote files in parallel.

Parameters:

n_jobs (int | None) – Number of parallel workers to use. If None, defaults to self.n_jobs.
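
A brief usage sketch (the worker count is arbitrary):

>>> dataset.download_all(n_jobs=4)  # fetch missing files with 4 workers
>>> dataset.download_all()          # fall back to self.n_jobs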

save(path, overwrite=False)[source]#

Save the dataset to disk.

Parameters:
  • path (str or Path) – Destination file path.

  • overwrite (bool, default False) – If True, overwrite existing file.

Return type:

None
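
A minimal usage sketch, assuming dataset was constructed as in the examples above (the output path is illustrative):

>>> dataset.save("./saved_dataset", overwrite=True)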

property cummulative_sizes#
static cumsum(sequence)[source]#
property cumulative_sizes: list[int]#

Cumulative sizes of the underlying datasets.

When the dataset is created with lazy=True, the cumulative sizes are computed on first access and then cached.
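
As with PyTorch’s ConcatDataset, the cumulative sizes can be used to map a flat index back to the underlying dataset that contains it; a minimal sketch with illustrative numbers:

>>> import bisect
>>> sizes = dataset.cumulative_sizes              # e.g., [1200, 2400, 3600]
>>> idx = 1500                                    # flat index into the concatenation
>>> rec = bisect.bisect_right(sizes, idx)         # index of the containing dataset
>>> offset = idx - (sizes[rec - 1] if rec > 0 else 0)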

property description: DataFrame#
get_metadata() DataFrame[source]#

Concatenate the metadata and description of the wrapped Epochs.

Returns:

metadata – DataFrame containing as many rows as there are windows in the BaseConcatDataset, with the metadata and description information for each window.

Return type:

pd.DataFrame
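
This is typically called after windowing. A sketch assuming the recordings contain events, using braindecode’s create_windows_from_events with default offsets:

>>> from braindecode.preprocessing import create_windows_from_events
>>> windows = create_windows_from_events(dataset, preload=False)
>>> metadata = windows.get_metadata()
>>> metadata.head()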

classmethod pull_from_hub(repo_id: str, preload: bool = True, token: str | None = None, cache_dir: str | Path | None = None, force_download: bool = False, **kwargs)[source]#

Load a dataset from the Hugging Face Hub.

Parameters:
  • repo_id (str) – Repository ID on the Hugging Face Hub (e.g., “username/dataset-name”).

  • preload (bool, default=True) – Whether to preload the data into memory. If False, uses lazy loading (when supported by the format).

  • token (str | None) – Hugging Face API token. If None, uses cached token.

  • cache_dir (str | Path | None) – Directory to cache the downloaded dataset. If None, uses default cache directory (~/.cache/huggingface/datasets).

  • force_download (bool, default=False) – Whether to force re-download even if cached.

  • **kwargs – Additional arguments (currently unused).

Returns:

The loaded dataset.

Return type:

BaseConcatDataset

Raises:
  • ImportError – If huggingface-hub is not installed.

  • FileNotFoundError – If the repository or dataset files are not found.

Examples

>>> from braindecode.datasets import BaseConcatDataset
>>> dataset = BaseConcatDataset.pull_from_hub("username/nmt-dataset")
>>> print(f"Loaded {len(dataset)} windows")
>>>
>>> # Use with PyTorch
>>> from torch.utils.data import DataLoader
>>> loader = DataLoader(dataset, batch_size=32, shuffle=True)
push_to_hub(repo_id: str, commit_message: str | None = None, private: bool = False, token: str | None = None, create_pr: bool = False, compression: str = 'blosc', compression_level: int = 5, pipeline_name: str = 'braindecode') str[source]#

Upload the dataset to the Hugging Face Hub in BIDS-like Zarr format.

The dataset is converted to Zarr format with blosc compression, which provides optimal random access performance for PyTorch training. The data is stored in a BIDS sourcedata-like structure with events.tsv, channels.tsv, and participants.tsv sidecar files.

Parameters:
  • repo_id (str) – Repository ID on the Hugging Face Hub (e.g., “username/dataset-name”).

  • commit_message (str | None) – Commit message. If None, a default message is generated.

  • private (bool, default=False) – Whether to create a private repository.

  • token (str | None) – Hugging Face API token. If None, uses cached token.

  • create_pr (bool, default=False) – Whether to create a Pull Request instead of directly committing.

  • compression (str, default="blosc") – Compression algorithm for Zarr. Options: “blosc”, “zstd”, “gzip”, None.

  • compression_level (int, default=5) – Compression level (0-9). Level 5 provides a good balance between compression ratio and speed.

  • pipeline_name (str, default="braindecode") – Name of the processing pipeline for BIDS sourcedata.

Returns:

URL of the uploaded dataset on the Hub.

Return type:

str

Raises:
  • ImportError – If huggingface-hub is not installed.

  • ValueError – If the dataset is empty or format is invalid.

Examples

>>> dataset = NMT(path=path, preload=True)
>>> # Upload with BIDS-like structure
>>> url = dataset.push_to_hub(
...     repo_id="myusername/nmt-dataset",
...     commit_message="Upload NMT EEG dataset"
... )
set_description(description: dict | DataFrame, overwrite: bool = False)[source]#

Update (add or overwrite) the dataset description.

Parameters:
  • description (dict | pd.DataFrame) – Description in the form key: value, where the length of each value must match the number of datasets.

  • overwrite (bool) – Must be True if a key in description already exists in the dataset description.
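
A short sketch; the key and values are illustrative, and each value list must have one entry per recording:

>>> n = len(dataset.datasets)
>>> dataset.set_description({"group": ["control"] * n})
>>> dataset.set_description({"group": ["patient"] * n}, overwrite=True)  # key already exists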

split(by: str | list[int] | list[list[int]] | dict[str, list[int]] | None = None, property: str | None = None, split_ids: list[int] | list[list[int]] | dict[str, list[int]] | None = None) dict[str, BaseConcatDataset][source]#

Split the dataset based on information listed in its description.

Splits can be defined either by a column of the description DataFrame or by explicit dataset indices.

Parameters:
  • by (str | list | dict) – If by is a string, splitting is performed based on the description DataFrame column with this name. If by is a (list of) list of integers, the position in the first list corresponds to the split id and the integers to the datapoints of that split. If by is a dict, each key is used as a split name in the returned dict and each value should be a list of int.

  • property (str) – Some property which is listed in the info DataFrame.

  • split_ids (list | dict) – List of indices to be combined in a subset. It can be a list of int or a list of list of int.

Returns:

splits – A dictionary with the name of the split (a string) as key and the dataset as value.

Return type:

dict
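
A sketch of the common patterns; the column name and indices are illustrative:

>>> by_subject = dataset.split("subject")                  # one split per unique subject value
>>> by_index = dataset.split([[0, 1], [2]])                # splits named "0" and "1"
>>> named = dataset.split({"train": [0, 1], "test": [2]})  # splits named "train" and "test"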

property target_transform#
property transform#
datasets: list[T]#

Usage Example#

from eegdash import EEGDashDataset

dataset = EEGDashDataset(cache_dir="./data", dataset="ds002718")
print(f"Number of recordings: {len(dataset)}")

See Also#