eegdash package

Submodules

Module contents

EEGDash: A comprehensive platform for EEG data management and analysis.

EEGDash provides a unified interface for accessing, querying, and analyzing large-scale EEG datasets. It integrates with cloud storage and REST APIs to streamline EEG research workflows.

Bases: EEGDashError

Raised when a dataset record has known data integrity issues.

This exception is raised when attempting to load a record that has been flagged during ingestion as having missing companion files or other integrity problems.

record

The problematic record metadata.

Type: dict

issues

List of specific integrity issues found.

Type: list[str]

authors

Dataset authors who can be contacted about the issue.

Type: list[str]

contact_info

Contact information for reporting the issue.

Type: list[str] | None

source_url

URL to the dataset source for reporting issues.

Type: str | None

Examples

>>> try:
...     dataset.raw  # Attempt to load data
... except DataIntegrityError as e:
...     print(f"Cannot load: {e.issues}")
...     print(f"Contact authors: {e.authors}")

classmethod from_record(record: dict[str, Any]) → DataIntegrityError

Create a DataIntegrityError from a record with integrity issues.

Parameters: record (dict) – Record containing _data_integrity_issues and optionally _dataset_authors, _dataset_contact, _source_url.
Returns: Exception with all relevant context.
Return type: DataIntegrityError
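
The mapping from record keys to exception attributes can be sketched roughly as follows. This is a minimal illustration using a hypothetical stand-in class, IntegrityErrorSketch; it is not the eegdash implementation, and the defaults used for missing keys are assumptions. Only the four underscore-prefixed key names come from the description above.

```python
from __future__ import annotations

from typing import Any


class IntegrityErrorSketch(Exception):
    """Hypothetical stand-in for DataIntegrityError (not the eegdash class)."""

    def __init__(self, record, issues, authors=None, contact_info=None, source_url=None):
        super().__init__(f"Data integrity issues: {'; '.join(issues)}")
        self.record = record
        self.issues = issues
        self.authors = authors or []
        self.contact_info = contact_info
        self.source_url = source_url

    @classmethod
    def from_record(cls, record: dict[str, Any]) -> "IntegrityErrorSketch":
        # Key names follow the documentation; the fallback values are assumed.
        return cls(
            record=record,
            issues=list(record.get("_data_integrity_issues", [])),
            authors=list(record.get("_dataset_authors", [])),
            contact_info=record.get("_dataset_contact"),
            source_url=record.get("_source_url"),
        )


record = {
    "dataset": "ds002718",
    "_data_integrity_issues": ["missing companion file"],
    "_dataset_authors": ["A. Author"],
}
err = IntegrityErrorSketch.from_record(record)
print(err.issues, err.authors, err.source_url)
```

Optional keys that are absent simply leave the corresponding attribute empty or None in this sketch, which matches the optional types listed for contact_info and source_url.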

log_error() → None

Log the error using the EEGDash logger with rich formatting.

log_warning() → None

Log the integrity issues as warnings (non-blocking).

print_rich(console: Console | None = None) → None

Print a rich formatted version of the error to the console.

Parameters: console (Console, optional) – Rich console to print to. If None, creates a new one.

classmethod warn_from_record(record: dict[str, Any]) → None

Log a warning about data integrity issues without raising an exception.

Use this when you want to warn about issues but still allow loading.

Parameters: record (dict) – Record containing _data_integrity_issues and optionally _dataset_authors, _dataset_contact, _source_url.

class eegdash.EEGChallengeDataset(release: str, cache_dir: str, mini: bool = True, query: dict | None = None, s3_bucket: str | None = None, **kwargs)

Bases: EEGDashDataset

A dataset helper for the EEG 2025 Challenge.

This class simplifies access to the EEG 2025 Challenge datasets. It is a specialized version of EEGDashDataset that is pre-configured for the challenge’s data releases. It automatically maps a release name (e.g., “R1”) to the corresponding OpenNeuro dataset and handles the selection of subject subsets (e.g., “mini” release).

Parameters:

release (str) – The name of the challenge release to load. Must be one of the keys in RELEASE_TO_OPENNEURO_DATASET_MAP (e.g., “R1”, “R2”, …, “R11”).
cache_dir (str) – The local directory where the dataset will be downloaded and cached.
mini (bool, default True) – If True, the dataset is restricted to the official “mini” subset of subjects for the specified release. If False, all subjects for the release are included.
query (dict, optional) – An additional MongoDB-style query to apply as a filter. This query is combined with the release and subject filters using a logical AND. The query must not contain the dataset key, as this is determined by the release parameter.
s3_bucket (str, optional) – The base S3 bucket URI where the challenge data is stored. Defaults to the official challenge bucket.
**kwargs – Additional keyword arguments passed directly to the EEGDashDataset constructor. This includes the keyword filters (task, subject, session, run, modality, …; see ALLOWED_QUERY_FIELDS), each accepting a scalar or a list ($in), as well as target_name which is forwarded to braindecode.

Raises:

ValueError – If the specified release is unknown, or if the query argument contains a dataset key. Also raised if mini is True and a requested subject is not part of the official mini-release subset.

See also

EEGDashDataset: The base class for creating datasets from queries.
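
The two query-related ValueError conditions and the AND combination described above can be sketched as follows. This is a hypothetical illustration only: RELEASE_MAP_SKETCH holds placeholder values rather than the real contents of RELEASE_TO_OPENNEURO_DATASET_MAP, the helper name challenge_query is invented, and encoding the AND as a "$and" clause is an assumption.

```python
# Placeholder mapping: the real dataset identifiers are not reproduced here.
RELEASE_MAP_SKETCH = {"R1": "<dataset-for-R1>", "R2": "<dataset-for-R2>"}


def challenge_query(release, query=None):
    """Hypothetical validation mirroring the documented ValueError cases."""
    if release not in RELEASE_MAP_SKETCH:
        raise ValueError(f"Unknown release: {release!r}")
    if query and "dataset" in query:
        raise ValueError("query must not contain 'dataset'; it is set by release")
    base = {"dataset": RELEASE_MAP_SKETCH[release]}
    # The user query is combined with the release filter using a logical AND.
    return {"$and": [base, query]} if query else base


combined = challenge_query("R1", {"task": "RestingState"})
print(combined)
```

The mini-subset subject check is omitted because the list of mini-release subjects is not given in this section.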

class eegdash.EEGDash(*, database: str = 'eegdash', api_url: str | None = None, auth_token: str | None = None)

Bases: object

High-level interface to the EEGDash metadata database.

Provides methods to query, insert, and update metadata records stored in the EEGDash database via a REST API gateway.

For working with collections of recordings as PyTorch datasets, prefer EEGDashDataset.

Create a new EEGDash client.

Parameters:

database (str, default "eegdash") – Name of the MongoDB database to connect to. Common values: "eegdash" (production), "eegdash_staging" (staging), "eegdash_v1" (legacy archive).
api_url (str, optional) – Override the default API URL. If not provided, uses the default public endpoint or the EEGDASH_API_URL environment variable.
auth_token (str, optional) – Authentication token for admin write operations. Not required for public read operations.

Examples

>>> eegdash = EEGDash()  # production
>>> eegdash = EEGDash(database="eegdash_staging")  # staging
>>> records = eegdash.find({"dataset": "ds002718"})

count(query: dict[str, Any] = None, /, **kwargs) → int

Count documents matching the query.

Parameters:

query (dict, optional) – Complete query dictionary. This is a positional-only argument.
**kwargs – User-friendly field filters (same as find()).

Returns:

Number of matching documents.

Return type:

int

Examples

>>> eeg = EEGDash()
>>> count = eeg.count({})  # count all
>>> count = eeg.count(dataset="ds002718")  # count by dataset

exists(query: dict[str, Any] = None, /, **kwargs) → bool

Check if at least one record matches the query.

Parameters:

query (dict, optional) – Complete query dictionary. This is a positional-only argument.
**kwargs – User-friendly field filters (same as find()).

Returns:

True if at least one matching record exists; False otherwise.

Return type:

bool

Examples

>>> eeg = EEGDash()
>>> eeg.exists(dataset="ds002718")  # check by dataset
>>> eeg.exists({"data_name": "ds002718_sub-001_eeg.set"})  # check by data_name

find(query: dict[str, Any] = None, /, **kwargs) → list[Mapping[str, Any]]

Find records in the collection.

Examples

>>> from eegdash import EEGDash
>>> eegdash = EEGDash()
>>> eegdash.find({"dataset": "ds002718", "subject": {"$in": ["012", "013"]}})  # pre-built query
>>> eegdash.find(dataset="ds002718", subject="012")  # keyword filters
>>> eegdash.find(dataset="ds002718", subject=["012", "013"])  # sequence -> $in
>>> eegdash.find({})  # fetch all (use with care)
>>> eegdash.find({"dataset": "ds002718"}, subject=["012", "013"])  # combine query + kwargs (AND)

Parameters:

query (dict, optional) – Complete MongoDB query dictionary. This is a positional-only argument.
**kwargs – User-friendly field filters that are converted to a MongoDB query. Values can be scalars (e.g., "sub-01") or sequences (translated to $in queries). Special parameters: limit (int) and skip (int) for pagination.

Returns:

DB records that match the query.

Return type:

list of dict
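
The keyword-to-query translation described above (scalars as exact matches, sequences as $in, query and keywords joined with AND) can be sketched as below. build_query is a hypothetical helper written for illustration, not the function eegdash uses internally; expressing the AND as a "$and" clause and dropping limit and skip from the filter are assumptions.

```python
from __future__ import annotations

from typing import Any

# Pagination parameters are documented as special, so they are not filters.
PAGINATION_KEYS = {"limit", "skip"}


def build_query(query: dict[str, Any] | None = None, /, **filters: Any) -> dict[str, Any]:
    """Hypothetical helper: turn keyword filters into a MongoDB-style query."""
    clauses: dict[str, Any] = {}
    for field, value in filters.items():
        if field in PAGINATION_KEYS:
            continue
        if isinstance(value, (list, tuple, set)):
            # Sequences become an $in match.
            clauses[field] = {"$in": list(value)}
        else:
            # Scalars become an exact match.
            clauses[field] = value
    if query and clauses:
        # Both supplied: combine them with a logical AND.
        return {"$and": [query, clauses]}
    return dict(query or clauses)


print(build_query(dataset="ds002718", subject=["012", "013"]))
print(build_query({"dataset": "ds002718"}, subject="012", limit=10))
```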

find_datasets(query: dict[str, Any] | None = None, limit: int = 1000) → list[Mapping[str, Any]]

Find datasets matching query.

Parameters:

query (dict) – Filter query.
limit (int) – Max number of datasets to return.

Returns:

List of dataset metadata documents.

Return type:

list of dict

find_one(query: dict[str, Any] = None, /, **kwargs) → Mapping[str, Any] | None

Find a single record matching the query.

Parameters:

query (dict, optional) – Complete query dictionary. This is a positional-only argument.
**kwargs – User-friendly field filters (same as find()).

Returns:

The first matching record, or None if no match.

Return type:

dict or None

Examples

>>> eeg = EEGDash()
>>> record = eeg.find_one(data_name="ds002718_sub-001_eeg.set")

get_dataset(dataset_id: str) → Mapping[str, Any] | None

Fetch metadata for a specific dataset.

Parameters: dataset_id (str) – The unique identifier of the dataset (e.g., ‘ds002718’).
Returns: The dataset metadata document, or None if not found.
Return type: dict or None

insert(records: dict[str, Any] | list[dict[str, Any]]) → int

Insert one or more records (requires auth_token).

Parameters: records (dict or list of dict) – A single record or list of records to insert.
Returns: Number of records inserted.
Return type: int

Examples

>>> eeg = EEGDash(auth_token="...")
>>> eeg.insert({"dataset": "ds001", "subject": "01", ...})  # single
>>> eeg.insert([record1, record2, record3])  # batch

Search the dataset catalogue with friendly keyword filters.

Convenience wrapper around find_datasets() that translates a small set of human-friendly keyword arguments into a MongoDB-style query and returns a tidy summary pandas.DataFrame. This is the metadata-only entry point used by tutorials such as plot_00_first_search.

Parameters:

modality (str, optional) – Filter by recording modality (e.g., "eeg", "meeg"). Matched case-insensitively against the modality field.
task (str, optional) – Filter by BIDS task name (e.g., "rest", "FacePerception").
clinical_group (str, optional) – Filter by clinical cohort label (e.g., "healthy", "adhd"). Matched against clinical.group (nested) and falls back to the flat clinical_group field.
source (str, optional) – Filter by data source (e.g., "openneuro", "nemar", "hbn"). Matched against source and provider fields.
n_subjects_min (int, optional) – Minimum number of subjects in the dataset. Maps to {"n_subjects": {"$gte": n_subjects_min}}.
license (str, optional) – Filter by data license (e.g., "CC0", "CC-BY-4.0"). Matched against the license field.
limit (int, default 100) – Maximum number of datasets to return.

Returns:

One row per matching dataset with summary columns: dataset_id, name, modality, task, n_subjects, source, license, dataset_doi. Missing fields surface as None. The frame is empty (zero rows) when nothing matches.

Return type:

pandas.DataFrame

Notes

search_datasets does not download any signal bytes; only small JSON catalogue documents are transferred. Pair with EEGDashDataset once a candidate dataset is chosen.

Examples

>>> client = EEGDash()
>>> df = client.search_datasets(modality="eeg", n_subjects_min=10)
>>> df = client.search_datasets(task="rest", source="openneuro")
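
One way the friendly keywords could map onto a MongoDB-style query is sketched below. Only the n_subjects_min mapping is stated in the parameter list above; everything else here is an assumption made for illustration, including the helper name build_catalogue_query, the anchored case-insensitive regex for modality, and the use of "$or" for fields that are matched in two places.

```python
from __future__ import annotations

import re
from typing import Any


def build_catalogue_query(
    modality: str | None = None,
    task: str | None = None,
    clinical_group: str | None = None,
    source: str | None = None,
    n_subjects_min: int | None = None,
    license: str | None = None,
) -> dict[str, Any]:
    """Hypothetical translation of friendly keywords into a query dict."""
    clauses: list[dict[str, Any]] = []
    if modality is not None:
        # Assumed encoding of case-insensitive matching: an anchored regex.
        pattern = f"^{re.escape(modality)}$"
        clauses.append({"modality": {"$regex": pattern, "$options": "i"}})
    if task is not None:
        clauses.append({"task": task})
    if clinical_group is not None:
        # Nested field first, flat field as the fallback.
        clauses.append({"$or": [{"clinical.group": clinical_group},
                                {"clinical_group": clinical_group}]})
    if source is not None:
        clauses.append({"$or": [{"source": source}, {"provider": source}]})
    if n_subjects_min is not None:
        # This mapping is the one given in the parameter description.
        clauses.append({"n_subjects": {"$gte": n_subjects_min}})
    if license is not None:
        clauses.append({"license": license})
    if not clauses:
        return {}
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


print(build_catalogue_query(modality="eeg", n_subjects_min=10))
```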

update_dataset(dataset_id: str, update: dict[str, Any]) → int

Update metadata for a specific dataset (requires auth_token).

Parameters:

dataset_id (str) – The unique identifier of the dataset (e.g., ‘ds002718’).
update (dict) – Dictionary of fields to update.

Returns:

Number of documents modified (0 or 1).

Return type:

int

Examples

>>> eeg = EEGDash(auth_token="...")
>>> eeg.update_dataset("ds002718", {"clinical.is_clinical": True})

update_field(query: dict[str, Any] = None, /, *, update: dict[str, Any], **kwargs) → tuple[int, int]

Update fields on records matching the query (requires auth_token).

Use this to add or modify fields across matching records, e.g., after re-extracting entities with an improved algorithm.

Parameters:

query (dict, optional) – Filter query to match records. This is a positional-only argument.
update (dict) – Fields to update. Keys are field names, values are new values.
**kwargs – User-friendly field filters (same as find()).

Returns:

Number of records matched and actually modified.

Return type:

tuple of (matched_count, modified_count)

Examples

>>> eeg = EEGDash(auth_token="...")
>>> # Update entities for all records in a dataset
>>> eeg.update_field({"dataset": "ds002718"}, update={"entities": {"subject": "01"}})
>>> # Using kwargs for filter
>>> eeg.update_field(dataset="ds002718", update={"entities": new_entities})
>>> # Combine query + kwargs
>>> eeg.update_field({"dataset": "ds002718"}, subject="01", update={"entities": new_entities})
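
The difference between the two returned counts can be illustrated with an in-memory analogue. The sketch below does not call the EEGDash API; update_matching is a hypothetical function over a plain list of dicts, and it assumes that a record counts as modified only when at least one of its values actually changes.

```python
def update_matching(records, query, update):
    """Hypothetical in-memory analogue returning (matched_count, modified_count)."""
    matched = modified = 0
    for rec in records:
        # Exact-match filter on every key of the query.
        if all(rec.get(key) == value for key, value in query.items()):
            matched += 1
            # Count the record as modified only if something would change.
            if any(rec.get(key) != value for key, value in update.items()):
                rec.update(update)
                modified += 1
    return matched, modified


records = [
    {"dataset": "ds002718", "subject": "01", "task": "rest"},
    {"dataset": "ds002718", "subject": "02", "task": "RestingState"},
    {"dataset": "ds000001", "subject": "01", "task": "rest"},
]
counts = update_matching(records, {"dataset": "ds002718"}, {"task": "RestingState"})
print(counts)
```

Here two records match the filter, but only one of them is changed, because the other already holds the new value.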

class eegdash.EEGDashDataset(cache_dir: str | Path, query: dict[str, Any] = None, description_fields: list[str] | None = None, s3_bucket: str | None = None, records: list[dict] | None = None, download: bool = True, n_jobs: int = -1, eeg_dash_instance: Any = None, database: str | None = None, auth_token: str | None = None, on_error: str = 'raise', max_concurrency: int = 20, description_precedence: str = 'participant_tsv', remove_nan_targets: bool = False, **kwargs)

Bases: BaseConcatDataset

Create a new EEGDashDataset from a given query or local BIDS dataset directory and dataset name. An EEGDashDataset is a pooled collection of EEGDashBaseDataset instances (individual recordings) and is a subclass of braindecode’s BaseConcatDataset.

Examples

Basic usage with dataset and subject filtering:

>>> from eegdash import EEGDashDataset
>>> dataset = EEGDashDataset(
...     cache_dir="./data",
...     dataset="ds002718",
...     subject="012"
... )
>>> print(f"Number of recordings: {len(dataset)}")

Filter by multiple subjects and specific task:

>>> subjects = ["012", "013", "014"]
>>> dataset = EEGDashDataset(
...     cache_dir="./data",
...     dataset="ds002718",
...     subject=subjects,
...     task="RestingState"
... )

Load and inspect EEG data from recordings:

>>> if len(dataset) > 0:
...     recording = dataset[0]
...     raw = recording.load()
...     print(f"Sampling rate: {raw.info['sfreq']} Hz")
...     print(f"Number of channels: {len(raw.ch_names)}")
...     print(f"Duration: {raw.times[-1]:.1f} seconds")

Advanced filtering with raw MongoDB queries:

>>> from eegdash import EEGDashDataset
>>> query = {
...     "dataset": "ds002718",
...     "subject": {"$in": ["012", "013"]},
...     "task": "RestingState"
... }
>>> dataset = EEGDashDataset(cache_dir="./data", query=query)

Working with dataset collections and braindecode integration:

>>> # EEGDashDataset is a braindecode BaseConcatDataset
>>> for i, recording in enumerate(dataset):
...     if i >= 2:  # limit output
...         break
...     print(f"Recording {i}: {recording.description}")
...     raw = recording.load()
...     print(f"  Channels: {len(raw.ch_names)}, Duration: {raw.times[-1]:.1f}s")

Parameters:

cache_dir (str | Path) – Directory where data are cached locally.
query (dict | None) – Raw MongoDB query to filter records. If provided, it is merged with keyword filtering arguments (see **kwargs) using logical AND. You must provide at least a dataset (either in query or as a keyword argument). Only fields in ALLOWED_QUERY_FIELDS are considered for filtering.
dataset (str) – Dataset identifier (e.g., "ds002718"). Required if query does not already specify a dataset.
task (str | list[str]) – Task name(s) to filter by (e.g., "RestingState").
subject (str | list[str]) – Subject identifier(s) to filter by (e.g., "NDARCA153NKE").
session (str | list[str]) – Session identifier(s) to filter by (e.g., "1").
run (str | list[str]) – Run identifier(s) to filter by (e.g., "1").
target_name (str | list[str] | None) – Name of the description field to expose as the braindecode prediction target. The field is automatically added to description_fields so the column is populated. A ValueError is raised (listing the available fields) when the target is missing for every recording — typically a misspelled name such as "p-factor" for "p_factor".
remove_nan_targets (bool, default False) – When target_name is set, drop recordings whose target value is missing (None/NaN) and emit a warning. Defaults to False to keep existing behaviour (such recordings are kept); a ValueError is still raised when all recordings have a missing target regardless of this flag.
modality (str | list[str]) – Recording modality to filter by (e.g., "eeg").
sampling_frequency, nchans, ntimes (scalar) – Additional numeric record fields that may be used as filters.

Every keyword filter above accepts a scalar (exact match) or a list/tuple/set ($in match). The complete set of filterable keys is ALLOWED_QUERY_FIELDS (dataset, subject, task, session, run, modality, sampling_frequency, nchans, ntimes, data_name). Any keyword that is not in that set is forwarded to EEGDashRaw (and on to braindecode) instead of being used as a filter, for example target_name. A keyword that is meant to be a filter but is misspelled therefore silently becomes a forwarded option rather than raising.
target_name – Name of the description field to expose as the braindecode prediction target. Forwarded to EEGDashRaw; the named field must be one of the recording’s description_fields for indexing to succeed.
description_fields (list[str]) – Fields to extract from each record and include in dataset descriptions (e.g., “subject”, “session”, “run”, “task”).
s3_bucket (str | None) – Optional S3 bucket URI (e.g., “s3://mybucket”) to use instead of the default OpenNeuro bucket when downloading data files.
records (list[dict] | None) – Pre-fetched metadata records. If provided, the dataset is constructed directly from these records and no MongoDB query is performed.
download (bool, default True) – If False, load from local BIDS files only. Local data are expected under cache_dir / dataset; no DB or S3 access is attempted.
n_jobs (int) – Number of parallel jobs to use where applicable (-1 uses all cores).
eeg_dash_instance (EEGDash | None) – Optional existing EEGDash client to reuse for DB queries. If None, a new client is created on demand. Ignored when download is False, since no DB access is attempted in that case.
database (str | None) – Database name to use (e.g., “eegdash”, “eegdash_staging”). If None, uses the default database.
auth_token (str | None) – Authentication token for accessing protected databases. Required for staging or admin operations.
max_concurrency (int, default 20) – Maximum number of parallel S3 transfer connections used when downloading data. Higher values speed up large/multi-file downloads but consume more bandwidth.
on_error (str, default "raise") –
How to handle DataIntegrityError when accessing .raw on individual recordings:
- "raise" (default): propagate the exception.
- "warn": log the error as a warning and set .raw to None.
- "skip": silently set .raw to None.
Skipped recordings are flagged via ds._skipped so callers can filter them out with a list comprehension after iteration.
description_precedence (str, default "participant_tsv") –
Which source wins when the same field appears in both the record and the embedded participant_tsv data:
- "participant_tsv" (default): the participant_tsv value overwrites the record value, including None values.
- "record": the record-level value is kept.
Raises ValueError if not one of the above.
**kwargs (dict) –
Additional keyword arguments serving two purposes:
- Filtering: any keys present in ALLOWED_QUERY_FIELDS are treated as query filters (e.g., dataset, subject, task, …).
- Dataset options: remaining keys are forwarded to EEGDashRaw.
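
The two roles of **kwargs, and the misspelling pitfall noted in the parameter list, can be sketched as follows. split_kwargs is a hypothetical helper for illustration, not part of eegdash; the set of filterable keys is copied from the description of ALLOWED_QUERY_FIELDS above.

```python
# Filterable keys as listed in the parameter description above.
ALLOWED_QUERY_FIELDS = {
    "dataset", "subject", "task", "session", "run",
    "modality", "sampling_frequency", "nchans", "ntimes", "data_name",
}


def split_kwargs(**kwargs):
    """Hypothetical helper: separate query filters from forwarded options."""
    filters = {k: v for k, v in kwargs.items() if k in ALLOWED_QUERY_FIELDS}
    forwarded = {k: v for k, v in kwargs.items() if k not in ALLOWED_QUERY_FIELDS}
    return filters, forwarded


# "subjects" is a misspelling of "subject", so it is not treated as a filter.
filters, forwarded = split_kwargs(
    dataset="ds002718", subjects=["012"], target_name="p_factor"
)
print(filters)
print(forwarded)
```

Because the misspelled key lands in the forwarded options, the resulting query filters on the dataset alone. Checking keyword spelling against ALLOWED_QUERY_FIELDS before constructing the dataset is one way to catch this early.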

property cumulative_sizes: list[int]

Recompute cumulative sizes from current dataset lengths.

Overrides the cached version from BaseConcatDataset because individual dataset lengths can change after lazy raw loading (estimated ntimes from JSON metadata may differ from actual n_times in the raw file).
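
The idea of recomputing rather than caching can be sketched with a small container. ConcatSketch is a hypothetical class used only for illustration; it is neither the braindecode BaseConcatDataset nor the eegdash override, and plain lists stand in for recordings.

```python
from itertools import accumulate


class ConcatSketch:
    """Hypothetical container whose cumulative sizes are never cached."""

    def __init__(self, datasets):
        self.datasets = datasets

    @property
    def cumulative_sizes(self):
        # Recomputed on every access, so later length changes are picked up.
        return list(accumulate(len(ds) for ds in self.datasets))


parts = [[0] * 100, [0] * 250]
concat = ConcatSketch(parts)
before = concat.cumulative_sizes
# Simulate a recording whose actual length differs from the initial estimate.
parts[1].extend([0] * 6)
after = concat.cumulative_sizes
print(before, after)
```

A value cached at construction time would still report the estimate after the second list grows, which is the situation the override is described as avoiding.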

download_all(n_jobs: int | None = None) → None

Download missing remote files in parallel.

Parameters: n_jobs (int | None) – Number of parallel workers to use. If None, defaults to self.n_jobs.
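
A rough sketch of the pattern (skip files already on disk, fetch the rest with a bounded worker pool, and fall back to a default worker count when n_jobs is None) is shown below. It is an assumption-laden illustration: the thread pool, the helper names resolve_workers and fetch_missing, and the fake_fetch stand-in are all hypothetical, and the mechanism eegdash actually uses for parallelism may differ.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def resolve_workers(n_jobs, default):
    """Fall back to the default when n_jobs is None; -1 means all cores."""
    n = default if n_jobs is None else n_jobs
    return (os.cpu_count() or 1) if n == -1 else max(1, n)


def fetch_missing(paths, fetch, n_jobs=None, default_jobs=-1):
    """Hypothetical sketch: call fetch() only for files not yet on disk."""
    missing = [p for p in paths if not p.exists()]
    workers = resolve_workers(n_jobs, default_jobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the iterator so exceptions raised in workers surface here.
        list(pool.map(fetch, missing))
    return missing


def fake_fetch(path):
    # Stand-in for a real transfer: just create an empty file.
    path.write_bytes(b"")


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "a.set"
    cached.write_bytes(b"x")
    wanted = [cached, Path(tmp) / "b.set", Path(tmp) / "c.set"]
    fetched = fetch_missing(wanted, fake_fetch, n_jobs=2)
    all_present = all(p.exists() for p in wanted)
    print(len(fetched), all_present)
```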
