eegdash.schemas

EEGDash Data Schemas

This module defines the core data structures used throughout EEGDash to represent neuroimaging datasets and individual recording files.

It provides two types of schemas for each core object:

  1. Pydantic Models (*Model): Used for strict data validation, serialization, and schema generation (e.g., for APIs).

  2. TypedDict Definitions: Used for high-performance internal usage, static type checking, and efficient loading of large metadata collections.

Core Concepts

The data model is organized into a two-level hierarchy:

  • Dataset: Represents a collection of data (e.g., “ds001785”). It contains study-level metadata such as:

      • Identity (ID, name, source)
      • Demographics (subject ages, sex distribution)
      • Clinical (diagnosis, purpose)
      • Experiment Paradigm (tasks, stimuli)
      • Provenance (timestamps, authors)

  • Record: Represents a single data file within a dataset (e.g., a specific .vhdr or .edf file). It is optimized for fast access and contains:

      • File location (storage backend, path)
      • BIDS Entities (subject, session, task, run)
      • Basic signal properties (sampling rate, channel names)
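
The two-level hierarchy can be pictured as plain dictionaries linked by a dataset identifier. The sketch below uses simplified stand-in TypedDicts (MiniDataset and MiniRecord are hypothetical names, not the eegdash.schemas classes) with only a handful of the documented fields.

```python
from typing import TypedDict


class MiniDataset(TypedDict):
    """Simplified stand-in for a dataset-level document (hypothetical)."""

    dataset_id: str
    name: str
    source: str


class MiniRecord(TypedDict):
    """Simplified stand-in for a per-file record (hypothetical)."""

    dataset: str  # refers back to MiniDataset["dataset_id"]
    bids_relpath: str


dataset: MiniDataset = {"dataset_id": "ds001", "name": "My Study", "source": "openneuro"}

records: list[MiniRecord] = [
    {"dataset": "ds001", "bids_relpath": "sub-01/eeg/sub-01_task-rest_eeg.edf"},
    {"dataset": "ds001", "bids_relpath": "sub-02/eeg/sub-02_task-rest_eeg.edf"},
    {"dataset": "ds002", "bids_relpath": "sub-01/eeg/sub-01_task-oddball_eeg.edf"},
]

# Select the files that belong to one dataset by matching the identifier.
files_in_dataset = [
    r["bids_relpath"] for r in records if r["dataset"] == dataset["dataset_id"]
]
```

Study-level metadata is stored once on the dataset, so each record only needs to carry the identifier plus its own file-level fields.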

Usage

Creating a Dataset:

from eegdash.schemas import create_dataset

ds = create_dataset(
    dataset_id="ds001",
    name="My Study",
    subjects_count=20,
    ages=[20, 25, 30],
    recording_modality=["eeg"],
)

Creating a Record:

from eegdash.schemas import create_record

rec = create_record(
    dataset="ds001",
    storage_base="https://my.storage.com",
    bids_relpath="sub-01/eeg/sub-01_task-rest_eeg.edf",
    subject="01",
    task="rest",
)

Functions

create_dataset(*, dataset_id[, name, ...])

Create a Dataset document.

create_record(*, dataset, storage_base, ...)

Create an EEGDash record.

validate_dataset(dataset)

Validate that a dataset has the required fields.

validate_record(record)

Validate that a record has the required fields.

Classes

DatasetModel(*, dataset_id, source, ...[, ...])

Pydantic model for dataset-level metadata.

RecordModel(*, dataset, bids_relpath, ...[, ...])

Pydantic model for a single recording file.

StorageModel(*, backend, ...)

Pydantic model for storage location details.

EntitiesModel(*[, subject, session, task, run])

Pydantic model for BIDS entities.

ManifestModel(*[, source])

Pydantic model for a dataset file manifest.

ManifestFileModel(*[, path, name])

Pydantic model for a file entry in a manifest.

Dataset

TypedDict schema for a full Dataset document.

Record

TypedDict schema for a Record document.

Storage

Remote storage location details.

Entities

BIDS entities parsed from the file path.

Demographics

Subject demographics summary for a dataset.

Clinical

Clinical classification metadata (dataset-level).

Paradigm

Experimental paradigm classification (dataset-level).

ExternalLinks

Relevant external hyperlinks for the dataset.

RepositoryStats

Statistics for git-based repositories (e.g. GIN).

Timestamps

Processing and lifecycle timestamps.

class eegdash.schemas.DatasetModel(*, dataset_id: Annotated[str, MinLen(min_length=1)], source: Annotated[str, MinLen(min_length=1)], recording_modality: Annotated[list[str], MinLen(min_length=1)], ingestion_fingerprint: str | None = None, senior_author: str | None = None, contact_info: list[str] | None = None, timestamps: dict[str, Any] | None = None, storage: StorageModel | None = None, **extra_data: Any)[source]

Bases: BaseModel

Pydantic model for dataset-level metadata.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

dataset_id: str
source: str
recording_modality: list[str]
ingestion_fingerprint: str | None
senior_author: str | None
contact_info: list[str] | None
timestamps: dict[str, Any] | None
storage: StorageModel | None
class eegdash.schemas.RecordModel(*, dataset: Annotated[str, MinLen(min_length=1)], bids_relpath: Annotated[str, MinLen(min_length=1)], storage: StorageModel, recording_modality: Annotated[list[str], MinLen(min_length=1)], datatype: str | None = None, suffix: str | None = None, extension: str | None = None, entities: EntitiesModel | dict[str, Any] | None = None, **extra_data: Any)[source]

Bases: BaseModel

Pydantic model for a single recording file.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

dataset: str
bids_relpath: str
storage: StorageModel
recording_modality: list[str]
datatype: str | None
suffix: str | None
extension: str | None
entities: EntitiesModel | dict[str, Any] | None
class eegdash.schemas.StorageModel(*, backend: Annotated[str, MinLen(min_length=1)], base: Annotated[str, MinLen(min_length=1)], raw_key: Annotated[str, MinLen(min_length=1)], dep_keys: list[str] = <factory>, **extra_data: Any)[source]

Bases: BaseModel

Pydantic model for storage location details.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

backend: str
base: str
raw_key: str
dep_keys: list[str]
class eegdash.schemas.EntitiesModel(*, subject: str | None = None, session: str | None = None, task: str | None = None, run: str | None = None, **extra_data: Any)[source]

Bases: BaseModel

Pydantic model for BIDS entities.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

subject: str | None
session: str | None
task: str | None
run: str | None
class eegdash.schemas.ManifestModel(*, source: str | None = None, files: list[str | ManifestFileModel], **extra_data: Any)[source]

Bases: BaseModel

Pydantic model for a dataset file manifest.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

source: str | None
files: list[str | ManifestFileModel]
class eegdash.schemas.ManifestFileModel(*, path: str | None = None, name: str | None = None, **extra_data: Any)[source]

Bases: BaseModel

Pydantic model for a file entry in a manifest.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config: ClassVar[ConfigDict] = {'extra': 'allow'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

path: str | None
name: str | None
path_or_name() → str[source]

Return the path or name of the file.
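
Because a manifest's files list may mix plain strings with file entries, callers often want a single list of strings. The sketch below shows one way to normalize such a list with a standard-library stand-in; FileEntry and manifest_paths are hypothetical names, and the precedence used here (path first, then name, then an empty string) is an assumption rather than the documented behavior of ManifestFileModel.path_or_name.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class FileEntry:
    """Hypothetical stand-in for a manifest file entry."""

    path: Optional[str] = None
    name: Optional[str] = None

    def path_or_name(self) -> str:
        # Assumed precedence: prefer path, fall back to name, else "".
        return self.path or self.name or ""


def manifest_paths(files: list[Union[str, FileEntry]]) -> list[str]:
    """Flatten a mixed list of strings and entries into plain strings."""
    return [f if isinstance(f, str) else f.path_or_name() for f in files]


paths = manifest_paths(
    ["a.edf", FileEntry(path="sub-01/b.edf"), FileEntry(name="c.edf")]
)
```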

class eegdash.schemas.Dataset[source]

Bases: TypedDict

TypedDict schema for a full Dataset document.

This dictionary represents all metadata available for a study or dataset.

dataset_id

Unique identifier (e.g., “ds001785”).

Type:

str

name

Descriptive title of the dataset.

Type:

str

source

Origin source (e.g., “openneuro”, “nemar”).

Type:

str

readme

Content of the dataset’s README file.

Type:

str | None

recording_modality

List of recording modalities (e.g., [“eeg”, “meg”]).

Type:

list[str]

datatypes

BIDS datatypes present (e.g., [“eeg”, “anat”]).

Type:

list[str]

experimental_modalities

Stimulus types used (e.g., [“visual”, “auditory”]).

Type:

list[str] | None

bids_version

Version of the BIDS standard used.

Type:

str | None

license

License string (e.g., “CC0”).

Type:

str | None

authors

List of author names.

Type:

list[str]

funding

List of funding sources.

Type:

list[str]

dataset_doi

Digital Object Identifier for the dataset.

Type:

str | None

associated_paper_doi

DOI of the paper associated with the dataset.

Type:

str | None

tasks

List of task names found in the dataset.

Type:

list[str]

sessions

List of session names.

Type:

list[str]

total_files

Total file count.

Type:

int | None

size_bytes

Total dataset size in bytes.

Type:

int | None

data_processed

Indicates if the data has been pre-processed.

Type:

bool | None

study_domain

General domain of the study.

Type:

str | None

study_design

Description of the study design.

Type:

str | None

contributing_labs

List of labs contributing to the dataset.

Type:

list[str] | None

n_contributing_labs

Count of contributing labs.

Type:

int | None

demographics

Summary of subject demographics.

Type:

Demographics

clinical

Clinical classification details.

Type:

Clinical

paradigm

Experimental paradigm details.

Type:

Paradigm

external_links

Links to external resources.

Type:

ExternalLinks

repository_stats

Stats for the source repository (if applicable).

Type:

RepositoryStats | None

senior_author

Name of the senior author.

Type:

str | None

contact_info

Contact emails or names.

Type:

list[str] | None

timestamps

Timestamps for data processing and creation.

Type:

Timestamps

dataset_id: str
name: str
source: str
readme: str | None
recording_modality: list[str]
datatypes: list[str]
experimental_modalities: list[str] | None
bids_version: str | None
license: str | None
authors: list[str]
funding: list[str]
dataset_doi: str | None
associated_paper_doi: str | None
tasks: list[str]
sessions: list[str]
total_files: int | None
size_bytes: int | None
data_processed: bool | None
study_domain: str | None
study_design: str | None
contributing_labs: list[str] | None
n_contributing_labs: int | None
demographics: Demographics
clinical: Clinical
paradigm: Paradigm
external_links: ExternalLinks
repository_stats: RepositoryStats | None
senior_author: str | None
contact_info: list[str] | None
timestamps: Timestamps
storage: Storage | None
class eegdash.schemas.Record[source]

Bases: TypedDict

TypedDict schema for a Record document.

Represents a single data file and its metadata. This structure is kept flat and minimal to ensure fast loading times when querying millions of records.

dataset

Foreign key matching Dataset.dataset_id.

Type:

str

data_name

Unique name for the data item (e.g., “ds001_sub-01_task-rest”).

Type:

str

bidspath

Legacy path identifier (e.g., “ds001/sub-01/eeg/…”).

Type:

str

bids_relpath

Standard BIDS relative path (e.g., “sub-01/eeg/…”).

Type:

str

datatype

BIDS datatype (e.g., “eeg”).

Type:

str

suffix

Filename suffix (e.g., “eeg”).

Type:

str

extension

File extension (e.g., “.vhdr”).

Type:

str

recording_modality

Modality of the recording.

Type:

list[str] | None

entities

BIDS entities dict (subject, session, etc.).

Type:

Entities

entities_mne

BIDS entities sanitized for compatibility with MNE-Python (e.g., numeric runs).

Type:

Entities

storage

Storage location details.

Type:

Storage

ch_names

List of channel names.

Type:

list[str] | None

sampling_frequency

Sampling rate in Hz.

Type:

float | None

nchans

Channel count.

Type:

int | None

ntimes

Number of time points.

Type:

int | None

digested_at

Timestamp of when this record was processed.

Type:

str

dataset: str
data_name: str
bidspath: str
bids_relpath: str
datatype: str
suffix: str
extension: str
recording_modality: list[str] | None
entities: Entities
entities_mne: Entities
storage: Storage
ch_names: list[str] | None
sampling_frequency: float | None
nchans: int | None
ntimes: int | None
digested_at: str
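
Several Record fields look derivable from dataset and bids_relpath, judging by the example values given for them. The sketch below shows one plausible derivation using only the standard library; derive_path_fields is a hypothetical helper, and whether EEGDash computes these fields this way is an assumption.

```python
from pathlib import PurePosixPath


def derive_path_fields(dataset: str, bids_relpath: str) -> dict:
    """Hypothetical helper: split a BIDS-relative path into record fields."""
    path = PurePosixPath(bids_relpath)
    extension = path.suffix  # last extension only, e.g. ".vhdr"
    stem = path.name[: -len(extension)] if extension else path.name
    return {
        "bidspath": f"{dataset}/{bids_relpath}",  # assumed: dataset-prefixed path
        "datatype": path.parent.name,  # BIDS datatype folder, e.g. "eeg"
        "suffix": stem.rsplit("_", 1)[-1],  # last underscore-separated token
        "extension": extension,
    }


fields = derive_path_fields("ds001", "sub-01/eeg/sub-01_task-rest_eeg.vhdr")
```

This simple version keeps only the final extension, so compound extensions such as ".nii.gz" would need extra handling.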
class eegdash.schemas.Storage[source]

Bases: TypedDict

Remote storage location details.

backend

Storage backend protocol.

Type:

{‘s3’, ‘https’, ‘local’}

base

Base URI (e.g., “s3://openneuro.org/ds000001”).

Type:

str

raw_key

Path relative to base to reach the file.

Type:

str

dep_keys

Paths relative to base for sidecar files (e.g., .json, .vhdr).

Type:

list[str]

backend: Literal['s3', 'https', 'local']
base: str
raw_key: str
dep_keys: list[str]
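
Since raw_key and dep_keys are described as paths relative to base, a full location can be formed by joining them. The sketch below is a minimal illustration under the assumption that a single "/" separates base from each key; resolve_uris is a hypothetical helper, not part of eegdash.schemas, and the sidecar file names are made up for the example.

```python
def resolve_uris(storage: dict) -> dict:
    """Hypothetical helper: join base with raw_key and each dep_key."""
    base = storage["base"].rstrip("/")

    def join(key: str) -> str:
        # Assumed convention: exactly one "/" between base and key.
        return f"{base}/{key.lstrip('/')}"

    return {
        "raw": join(storage["raw_key"]),
        "deps": [join(key) for key in storage["dep_keys"]],
    }


storage = {
    "backend": "s3",
    "base": "s3://openneuro.org/ds000001",
    "raw_key": "sub-01/eeg/sub-01_task-rest_eeg.vhdr",
    "dep_keys": ["sub-01/eeg/sub-01_task-rest_eeg.vmrk"],
}
resolved = resolve_uris(storage)
```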
class eegdash.schemas.Entities[source]

Bases: TypedDict

BIDS entities parsed from the file path.

subject

Subject label (e.g., “01”).

Type:

str | None

session

Session label (e.g., “pre”).

Type:

str | None

task

Task label (e.g., “rest”).

Type:

str | None

run

Run label (e.g., “1” or “01”).

Type:

str | None

subject: str | None
session: str | None
task: str | None
run: str | None
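
BIDS file names encode entities as key-value pairs separated by underscores (for example sub-01_task-rest). The sketch below shows how the four entities listed above could be extracted with a regular expression; parse_entities is a hypothetical helper and not necessarily how EEGDash parses paths.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# BIDS key prefixes mapped to the entity names used in this schema.
_ENTITY_KEYS = {"sub": "subject", "ses": "session", "task": "task", "run": "run"}


def parse_entities(bids_relpath: str) -> dict[str, Optional[str]]:
    """Hypothetical helper: pull subject/session/task/run from a file name."""
    filename = PurePosixPath(bids_relpath).name
    entities: dict[str, Optional[str]] = {name: None for name in _ENTITY_KEYS.values()}
    # Each entity looks like "<key>-<label>" at the start or after an underscore.
    for key, label in re.findall(r"(?:^|_)([a-zA-Z]+)-([a-zA-Z0-9]+)", filename):
        if key in _ENTITY_KEYS:
            entities[_ENTITY_KEYS[key]] = label
    return entities


full = parse_entities("sub-01/ses-pre/eeg/sub-01_ses-pre_task-rest_run-01_eeg.vhdr")
partial = parse_entities("sub-01/eeg/sub-01_task-rest_eeg.edf")
```

Entities that do not appear in the file name stay None, matching the optional types in the schema.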
class eegdash.schemas.Demographics[source]

Bases: TypedDict

Subject demographics summary for a dataset.

subjects_count

Total number of subjects.

Type:

int

ages

List of all subject ages (if available).

Type:

list[int]

age_min

Minimum age in the cohort.

Type:

int | None

age_max

Maximum age in the cohort.

Type:

int | None

age_mean

Mean age of subjects.

Type:

float | None

species

Species of subjects (e.g., “Human”, “Mouse”).

Type:

str | None

sex_distribution

Count of subjects by sex (e.g., {“m”: 50, “f”: 45}).

Type:

dict[str, int] | None

handedness_distribution

Count of subjects by handedness (e.g., {“r”: 80, “l”: 15}).

Type:

dict[str, int] | None

subjects_count: int
ages: list[int]
age_min: int | None
age_max: int | None
age_mean: float | None
species: str | None
sex_distribution: dict[str, int] | None
handedness_distribution: dict[str, int] | None
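
The summary fields age_min, age_max, and age_mean can be computed from the ages list. The sketch below shows that computation with the standard library; summarize_ages is a hypothetical helper, and whether EEGDash derives these values from ages in this way is an assumption.

```python
from statistics import mean


def summarize_ages(ages: list[int]) -> dict:
    """Hypothetical helper: derive min, max, and mean from a list of ages."""
    if not ages:
        # No ages available: leave every summary field unset.
        return {"age_min": None, "age_max": None, "age_mean": None}
    return {
        "age_min": min(ages),
        "age_max": max(ages),
        "age_mean": float(mean(ages)),
    }


summary = summarize_ages([20, 25, 30])
empty = summarize_ages([])
```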
class eegdash.schemas.Clinical[source]

Bases: TypedDict

Clinical classification metadata (dataset-level).

is_clinical

True if the dataset contains clinical population data.

Type:

bool

purpose

The clinical condition or purpose (e.g., “epilepsy”, “depression”).

Type:

str | None

is_clinical: bool
purpose: str | None
class eegdash.schemas.Paradigm[source]

Bases: TypedDict

Experimental paradigm classification (dataset-level).

modality

The sensory or experimental modality (e.g., “visual”, “auditory”, “resting_state”).

Type:

str | None

cognitive_domain

The cognitive domain investigated (e.g., “memory”, “language”, “emotion”).

Type:

str | None

is_10_20_system

True if electrodes are positioned according to the standard 10-20 system.

Type:

bool | None

modality: str | None
cognitive_domain: str | None
is_10_20_system: bool | None
class eegdash.schemas.ExternalLinks[source]

Bases: TypedDict

Relevant external hyperlinks for the dataset.

source_url

URL to the primary data source (e.g. OpenNeuro page).

Type:

str | None

osf_url

URL to the Open Science Framework project.

Type:

str | None

github_url

URL to the associated GitHub repository.

Type:

str | None

paper_url

URL to the primary publication.

Type:

str | None

source_url: str | None
osf_url: str | None
github_url: str | None
paper_url: str | None
class eegdash.schemas.RepositoryStats[source]

Bases: TypedDict

Statistics for git-based repositories (e.g. GIN).

stars

Number of stars.

Type:

int

forks

Number of forks.

Type:

int

watchers

Number of watchers.

Type:

int

stars: int
forks: int
watchers: int
class eegdash.schemas.Timestamps[source]

Bases: TypedDict

Processing and lifecycle timestamps.

digested_at

ISO 8601 timestamp of when the data was processed by EEGDash.

Type:

str

dataset_created_at

ISO 8601 timestamp of when the dataset was originally created.

Type:

str | None

dataset_modified_at

ISO 8601 timestamp of when the dataset was last updated.

Type:

str | None

digested_at: str
dataset_created_at: str | None
dataset_modified_at: str | None
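
An ISO 8601 timestamp of the kind described above can be produced and parsed with the standard library. This is a minimal sketch; utc_now_iso is a hypothetical helper, and the choice of UTC is an assumption, since the timezone convention is not stated here.

```python
from datetime import datetime, timezone


def utc_now_iso() -> str:
    """Hypothetical helper: current time as a timezone-aware ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


stamp = utc_now_iso()

# The string round-trips back into a timezone-aware datetime.
parsed = datetime.fromisoformat(stamp)
```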
eegdash.schemas.create_dataset(*, dataset_id: str, name: str | None = None, source: str = 'openneuro', readme: str | None = None, recording_modality: list[str] | None = None, datatypes: list[str] | None = None, modalities: list[str] | None = None, experimental_modalities: list[str] | None = None, bids_version: str | None = None, license: str | None = None, authors: list[str] | None = None, funding: list[str] | None = None, dataset_doi: str | None = None, associated_paper_doi: str | None = None, tasks: list[str] | None = None, sessions: list[str] | None = None, total_files: int | None = None, size_bytes: int | None = None, data_processed: bool | None = None, study_domain: str | None = None, study_design: str | None = None, subjects_count: int | None = None, ages: list[int] | None = None, age_mean: float | None = None, species: str | None = None, sex_distribution: dict[str, int] | None = None, handedness_distribution: dict[str, int] | None = None, contributing_labs: list[str] | None = None, is_clinical: bool | None = None, clinical_purpose: str | None = None, paradigm_modality: str | None = None, cognitive_domain: str | None = None, is_10_20_system: bool | None = None, source_url: str | None = None, osf_url: str | None = None, github_url: str | None = None, paper_url: str | None = None, stars: int | None = None, forks: int | None = None, watchers: int | None = None, senior_author: str | None = None, contact_info: list[str] | None = None, digested_at: str | None = None, dataset_created_at: str | None = None, dataset_modified_at: str | None = None, storage: Storage | None = None) → Dataset[source]

Create a Dataset document.

This helper function constructs a Dataset TypedDict with default values and logic to handle nested structures like demographics, clinical info, and external links.

Parameters:
  • dataset_id (str) – Dataset identifier (e.g., “ds001785”).

  • name (str, optional) – Dataset title/name.

  • source (str, default "openneuro") – Data source (“openneuro”, “nemar”, “gin”).

  • recording_modality (list[str], optional) – Recording types (e.g., [“eeg”, “meg”, “ieeg”]).

  • datatypes (list[str], optional) – BIDS datatypes present in the dataset (e.g., [“eeg”, “anat”, “beh”]).

  • experimental_modalities (list[str], optional) – Stimulus/experimental modalities (e.g., [“visual”, “auditory”, “tactile”]).

  • bids_version (str, optional) – BIDS version of the dataset.

  • license (str, optional) – Dataset license (e.g., “CC0”, “CC-BY-4.0”).

  • authors (list[str], optional) – Dataset authors.

  • funding (list[str], optional) – Funding sources.

  • dataset_doi (str, optional) – Dataset DOI.

  • associated_paper_doi (str, optional) – DOI of associated publication.

  • tasks (list[str], optional) – Tasks in the dataset.

  • sessions (list[str], optional) – Sessions in the dataset.

  • total_files (int, optional) – Total number of files.

  • size_bytes (int, optional) – Total size in bytes.

  • data_processed (bool, optional) – Whether data is processed.

  • study_domain (str, optional) – Study domain/topic.

  • study_design (str, optional) – Study design description.

  • subjects_count (int, optional) – Number of subjects.

  • ages (list[int], optional) – Subject ages.

  • age_mean (float, optional) – Mean age of subjects.

  • species (str, optional) – Species (e.g., “Human”).

  • sex_distribution (dict[str, int], optional) – Sex distribution (e.g., {“m”: 50, “f”: 45}).

  • handedness_distribution (dict[str, int], optional) – Handedness distribution (e.g., {“r”: 80, “l”: 15}).

  • contributing_labs (list[str], optional) – Labs that contributed data (for multi-site studies).

  • is_clinical (bool, optional) – Whether this is clinical data.

  • clinical_purpose (str, optional) – Clinical purpose (e.g., “epilepsy”, “depression”).

  • paradigm_modality (str, optional) – Experimental modality (e.g., “visual”, “auditory”, “resting_state”).

  • cognitive_domain (str, optional) – Cognitive domain (e.g., “attention”, “memory”, “motor”).

  • is_10_20_system (bool, optional) – Whether electrodes follow the 10-20 system.

  • source_url (str, optional) – Primary URL to the dataset source.

  • osf_url (str, optional) – Open Science Framework URL.

  • github_url (str, optional) – GitHub repository URL.

  • paper_url (str, optional) – URL to associated paper.

  • stars (int, optional) – Repository stars count (for git-based sources).

  • forks (int, optional) – Repository forks count.

  • watchers (int, optional) – Repository watchers count.

  • digested_at (str, optional) – ISO 8601 timestamp. If not provided, no timestamp is set (for deterministic output).

  • dataset_modified_at (str, optional) – Last modification timestamp.

Returns:

A fully populated Dataset document.

Return type:

Dataset
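
The flat keyword arguments map onto the nested Clinical, Paradigm, and ExternalLinks structures. The sketch below illustrates that grouping for a few of the parameters; nest_dataset_fields is a hypothetical helper, and details such as treating a missing is_clinical as False are assumptions rather than the documented behavior of create_dataset.

```python
from typing import Any, Optional


def nest_dataset_fields(
    *,
    dataset_id: str,
    is_clinical: Optional[bool] = None,
    clinical_purpose: Optional[str] = None,
    paradigm_modality: Optional[str] = None,
    cognitive_domain: Optional[str] = None,
    source_url: Optional[str] = None,
    github_url: Optional[str] = None,
) -> dict[str, Any]:
    """Hypothetical helper: group flat arguments into nested sub-documents."""
    return {
        "dataset_id": dataset_id,
        # Assumption: an unspecified is_clinical is stored as False.
        "clinical": {"is_clinical": bool(is_clinical), "purpose": clinical_purpose},
        "paradigm": {"modality": paradigm_modality, "cognitive_domain": cognitive_domain},
        "external_links": {"source_url": source_url, "github_url": github_url},
    }


doc = nest_dataset_fields(
    dataset_id="ds001",
    is_clinical=True,
    clinical_purpose="epilepsy",
    paradigm_modality="resting_state",
)
```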

eegdash.schemas.create_record(*, dataset: str, storage_base: str, bids_relpath: str, subject: str | None = None, session: str | None = None, task: str | None = None, run: str | None = None, dep_keys: list[str] | None = None, datatype: str = 'eeg', suffix: str = 'eeg', storage_backend: Literal['s3', 'https', 'local'] = 's3', recording_modality: list[str] | None = None, ch_names: list[str] | None = None, sampling_frequency: float | None = None, nchans: int | None = None, ntimes: int | None = None, digested_at: str | None = None) → Record[source]

Create an EEGDash record.

Helper to construct a valid Record TypedDict.

Parameters:
  • dataset (str) – Dataset identifier (e.g., “ds000001”).

  • storage_base (str) – Remote storage base URI (e.g., “s3://openneuro.org/ds000001”).

  • bids_relpath (str) – BIDS-relative path to the raw file (e.g., “sub-01/eeg/sub-01_task-rest_eeg.vhdr”).

  • subject (str, optional) – BIDS subject label (e.g., “01”).

  • session (str, optional) – BIDS session label (e.g., “pre”).

  • task (str, optional) – BIDS task label (e.g., “rest”).

  • run (str, optional) – BIDS run label (e.g., “1” or “01”).

  • dep_keys (list[str], optional) – Dependency paths relative to storage_base.

  • datatype (str, default "eeg") – BIDS datatype.

  • suffix (str, default "eeg") – BIDS suffix.

  • storage_backend ({"s3", "https", "local"}, default "s3") – Storage backend type.

  • recording_modality (list[str], optional) – Recording modalities (e.g., [“eeg”, “meg”, “ieeg”]).

  • digested_at (str, optional) – ISO 8601 timestamp. Defaults to current time.

Returns:

A slim EEGDash record optimized for loading.

Return type:

Record

Notes

Clinical and paradigm info is stored at the Dataset level, not per-file.

Examples

>>> record = create_record(
...     dataset="ds000001",
...     storage_base="s3://openneuro.org/ds000001",
...     bids_relpath="sub-01/eeg/sub-01_task-rest_eeg.vhdr",
...     subject="01",
...     task="rest",
... )
eegdash.schemas.validate_dataset(dataset: dict[str, Any]) → list[str][source]

Validate that a dataset has the required fields. Returns a list of errors.

eegdash.schemas.validate_record(record: dict[str, Any]) → list[str][source]

Validate that a record has the required fields. Returns a list of errors.
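
Returning a list of errors instead of raising lets a caller collect every problem in one pass. The sketch below illustrates that pattern with a standard-library stand-in; find_missing_fields is a hypothetical helper, and the field list is taken from the required parameters of RecordModel, which may differ from what validate_record actually checks.

```python
from typing import Any

# Assumption: mirrors the required parameters of RecordModel.
REQUIRED_RECORD_FIELDS = ("dataset", "bids_relpath", "storage", "recording_modality")


def find_missing_fields(doc: dict[str, Any], required: tuple[str, ...]) -> list[str]:
    """Hypothetical helper: report each required field that is absent or empty."""
    errors = []
    for field in required:
        if field not in doc or doc[field] in (None, "", []):
            errors.append(f"missing required field: {field}")
    return errors


record = {
    "dataset": "ds001",
    "bids_relpath": "",
    "storage": {"backend": "s3", "base": "s3://bucket", "raw_key": "a.edf", "dep_keys": []},
}
errors = find_missing_fields(record, REQUIRED_RECORD_FIELDS)
```

In this sketch, an empty list means no problems were found.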