eegdash.schemas
EEGDash Data Schemas
This module defines the core data structures used throughout EEGDash to represent neuroimaging datasets and individual recording files.
It provides two types of schemas for each core object:
- Pydantic Models (*Model): used for strict data validation, serialization, and schema generation (e.g., for APIs).
- TypedDict Definitions: used for high-performance internal usage, static type checking, and efficient loading of large metadata collections.
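For illustration, a minimal sketch of the two flavors (field values are placeholders; the signatures are documented further down this page):

from eegdash.schemas import DatasetModel, create_dataset

# Strict path: the Pydantic model validates input and raises
# pydantic.ValidationError if required fields are missing or malformed.
model = DatasetModel(
    dataset_id="ds001785",
    source="openneuro",
    recording_modality=["eeg"],
)

# Lightweight path: the TypedDict-based document is a plain dict at runtime,
# built here via the create_dataset helper.
doc = create_dataset(dataset_id="ds001785", recording_modality=["eeg"])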
Core Concepts
The data model is organized into a two-level hierarchy:
Dataset: Represents a collection of data (e.g., “ds001785”). It contains study-level metadata such as:
- Identity (ID, name, source)
- Demographics (subject ages, sex distribution)
- Clinical (diagnosis, purpose)
- Experiment Paradigm (tasks, stimuli)
- Provenance (timestamps, authors)
Record: Represents a single data file within a dataset (e.g., a specific .vhdr or .edf file). It is optimized for fast access and contains:
- File location (storage backend, path)
- BIDS Entities (subject, session, task, run)
- Basic signal properties (sampling rate, channel names)
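The two levels are linked by the dataset identifier: each Record carries a dataset field that is a foreign key matching Dataset.dataset_id. A minimal sketch using the helper functions documented below (all values are placeholders):

from eegdash.schemas import create_dataset, create_record

ds = create_dataset(dataset_id="ds001785", recording_modality=["eeg"])
rec = create_record(
    dataset="ds001785",
    storage_base="s3://openneuro.org/ds001785",
    bids_relpath="sub-01/eeg/sub-01_task-rest_eeg.vhdr",
    subject="01",
    task="rest",
)
assert rec["dataset"] == ds["dataset_id"]  # record -> dataset linkage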
Usage
Creating a Dataset:
from eegdash.schemas import create_dataset
ds = create_dataset(
    dataset_id="ds001",
    name="My Study",
    subjects_count=20,
    ages=[20, 25, 30],
    recording_modality=["eeg"],
)
Creating a Record:
from eegdash.schemas import create_record
rec = create_record(
    dataset="ds001",
    storage_base="https://my.storage.com",
    bids_relpath="sub-01/eeg/sub-01_task-rest_eeg.edf",
    subject="01",
    task="rest",
)
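Validating documents (a sketch using the validator helpers listed under Functions, applied to the ds and rec built above; treating an empty error list as "valid" is an assumption):

from eegdash.schemas import validate_dataset, validate_record

errors = validate_dataset(ds) + validate_record(rec)
if errors:
    raise ValueError(f"Schema validation failed: {errors}")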
Functions
| create_dataset | Create a Dataset document. |
| create_record | Create an EEGDash record. |
| validate_dataset | Validate a dataset has required fields. |
| validate_record | Validate a record has required fields. |
Classes
| DatasetModel | Pydantic model for dataset-level metadata. |
| RecordModel | Pydantic model for a single recording file. |
| StorageModel | Pydantic model for storage location details. |
| EntitiesModel | Pydantic model for BIDS entities. |
| ManifestModel | Pydantic model for a dataset file manifest. |
| ManifestFileModel | Pydantic model for a file entry in a manifest. |
| Dataset | TypedDict schema for a full Dataset document. |
| Record | TypedDict schema for a Record document. |
| Storage | Remote storage location details. |
| Entities | BIDS entities parsed from the file path. |
| Demographics | Subject demographics summary for a dataset. |
| Clinical | Clinical classification metadata (dataset-level). |
| Paradigm | Experimental paradigm classification (dataset-level). |
| ExternalLinks | Relevant external hyperlinks for the dataset. |
| RepositoryStats | Statistics for git-based repositories (e.g. GIN). |
| Timestamps | Processing and lifecycle timestamps. |
- class eegdash.schemas.DatasetModel(*, dataset_id: Annotated[str, MinLen(min_length=1)], source: Annotated[str, MinLen(min_length=1)], recording_modality: Annotated[list[str], MinLen(min_length=1)], ingestion_fingerprint: str | None = None, senior_author: str | None = None, contact_info: list[str] | None = None, timestamps: dict[str, Any] | None = None, storage: StorageModel | None = None, **extra_data: Any)[source]
Bases: BaseModel

Pydantic model for dataset-level metadata.

Create a new model by parsing and validating input data from keyword arguments. Raises pydantic.ValidationError if the input data cannot be validated to form a valid model.

- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
Configuration for the model; a dictionary conforming to pydantic.ConfigDict. Extra fields are allowed.
- dataset_id: str
- source: str
- recording_modality: list[str]
- ingestion_fingerprint: str | None
- senior_author: str | None
- contact_info: list[str] | None
- timestamps: dict[str, Any] | None
- storage: StorageModel | None
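Because extra='allow', keys beyond the declared fields are accepted and retained. A small sketch (the custom_tag key is purely illustrative):

from eegdash.schemas import DatasetModel

ds = DatasetModel(
    dataset_id="ds001785",
    source="openneuro",
    recording_modality=["eeg"],
    custom_tag="pilot",  # illustrative extra key, kept thanks to extra='allow'
)
print(ds.model_dump()["custom_tag"])  # -> 'pilot'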
- class eegdash.schemas.RecordModel(*, dataset: Annotated[str, MinLen(min_length=1)], bids_relpath: Annotated[str, MinLen(min_length=1)], storage: StorageModel, recording_modality: Annotated[list[str], MinLen(min_length=1)], datatype: str | None = None, suffix: str | None = None, extension: str | None = None, entities: EntitiesModel | dict[str, Any] | None = None, **extra_data: Any)[source]
Bases: BaseModel

Pydantic model for a single recording file.

- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
- dataset: str
- bids_relpath: str
- storage: StorageModel
- recording_modality: list[str]
- datatype: str | None
- suffix: str | None
- extension: str | None
- entities: EntitiesModel | dict[str, Any] | None
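For illustration, a RecordModel can be built with a nested StorageModel, and entities may be given either as an EntitiesModel or a plain dict (all values below are placeholders):

from eegdash.schemas import RecordModel, StorageModel

rec = RecordModel(
    dataset="ds001785",
    bids_relpath="sub-01/eeg/sub-01_task-rest_eeg.edf",
    storage=StorageModel(
        backend="s3",
        base="s3://openneuro.org/ds001785",
        raw_key="sub-01/eeg/sub-01_task-rest_eeg.edf",
    ),
    recording_modality=["eeg"],
    entities={"subject": "01", "task": "rest"},  # dict form; EntitiesModel also accepted
)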
- class eegdash.schemas.StorageModel(*, backend: Annotated[str, MinLen(min_length=1)], base: Annotated[str, MinLen(min_length=1)], raw_key: Annotated[str, MinLen(min_length=1)], dep_keys: list[str] = <factory>, **extra_data: Any)[source]
Bases: BaseModel

Pydantic model for storage location details.

- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
- backend: str
- base: str
- raw_key: str
- dep_keys: list[str]
- class eegdash.schemas.EntitiesModel(*, subject: str | None = None, session: str | None = None, task: str | None = None, run: str | None = None, **extra_data: Any)[source]
Bases: BaseModel

Pydantic model for BIDS entities.

- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
- subject: str | None
- session: str | None
- task: str | None
- run: str | None
- class eegdash.schemas.ManifestModel(*, source: str | None = None, files: list[str | ManifestFileModel], **extra_data: Any)[source]
Bases: BaseModel

Pydantic model for a dataset file manifest.

- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
- source: str | None
- files: list[str | ManifestFileModel]
- class eegdash.schemas.ManifestFileModel(*, path: str | None = None, name: str | None = None, **extra_data: Any)[source]
Bases: BaseModel

Pydantic model for a file entry in a manifest.

- model_config: ClassVar[ConfigDict] = {'extra': 'allow'}
- path: str | None
- name: str | None
- path_or_name() str[source]
Return the path or name of the file.
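For illustration, a manifest can mix plain path strings with structured entries, and path_or_name() offers a uniform way to get a usable identifier (the file names below are placeholders, and the exact return value shown is an assumption):

from eegdash.schemas import ManifestModel, ManifestFileModel

manifest = ManifestModel(
    source="openneuro",
    files=[
        "sub-01/eeg/sub-01_task-rest_eeg.edf",       # plain string entry
        ManifestFileModel(name="participants.tsv"),  # structured entry
    ],
)
entry = manifest.files[1]
print(entry.path_or_name())  # expected to return 'participants.tsv' here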
- class eegdash.schemas.Dataset[source]
Bases: TypedDict

TypedDict schema for a full Dataset document.

This dictionary represents all metadata available for a study/dataset.
- dataset_id
Unique identifier (e.g., “ds001785”).
- Type:
str
- name
Descriptive title of the dataset.
- Type:
str
- source
Origin source (e.g., “openneuro”, “nemar”).
- Type:
str
- readme
Content of the dataset’s README file.
- Type:
str | None
- recording_modality
List of recording modalities (e.g., [“eeg”, “meg”]).
- Type:
list[str]
- datatypes
BIDS datatypes present (e.g., [“eeg”, “anat”]).
- Type:
list[str]
- experimental_modalities
Stimulus types used (e.g., [“visual”, “auditory”]).
- Type:
list[str] | None
- bids_version
Version of the BIDS standard used.
- Type:
str | None
- license
License string (e.g., “CC0”).
- Type:
str | None
- authors
List of author names.
- Type:
list[str]
- funding
List of funding sources.
- Type:
list[str]
- dataset_doi
Digital Object Identifier for the dataset.
- Type:
str | None
- associated_paper_doi
DOI of the paper associated with the dataset.
- Type:
str | None
- tasks
List of task names found in the dataset.
- Type:
list[str]
- sessions
List of session names.
- Type:
list[str]
- total_files
Total file count.
- Type:
int | None
- size_bytes
Total dataset size in bytes.
- Type:
int | None
- data_processed
Indicates if the data has been pre-processed.
- Type:
bool | None
- study_domain
General domain of the study.
- Type:
str | None
- study_design
Description of the study design.
- Type:
str | None
- contributing_labs
List of labs contributing to the dataset.
- Type:
list[str] | None
- n_contributing_labs
Count of contributing labs.
- Type:
int | None
- demographics
Summary of subject demographics.
- Type:
Demographics
- clinical
Clinical classification details.
- Type:
Clinical
- paradigm
Experimental paradigm details.
- Type:
Paradigm
- external_links
Links to external resources.
- Type:
ExternalLinks
- repository_stats
Stats for the source repository (if applicable).
- Type:
RepositoryStats | None
- senior_author
Name of the senior author.
- Type:
str | None
- contact_info
Contact emails or names.
- Type:
list[str] | None
- timestamps
Timestamps for data processing and creation.
- Type:
Timestamps
- dataset_id: str
- name: str
- source: str
- readme: str | None
- recording_modality: list[str]
- datatypes: list[str]
- experimental_modalities: list[str] | None
- bids_version: str | None
- license: str | None
- authors: list[str]
- funding: list[str]
- dataset_doi: str | None
- associated_paper_doi: str | None
- tasks: list[str]
- sessions: list[str]
- total_files: int | None
- size_bytes: int | None
- data_processed: bool | None
- study_domain: str | None
- study_design: str | None
- contributing_labs: list[str] | None
- n_contributing_labs: int | None
- demographics: Demographics
- clinical: Clinical
- paradigm: Paradigm
- external_links: ExternalLinks
- repository_stats: RepositoryStats | None
- senior_author: str | None
- contact_info: list[str] | None
- timestamps: Timestamps
- storage: Storage | None
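For illustration, since a TypedDict is a plain dict at runtime, ordinary dict access applies to Dataset documents; exactly which nested summary fields get filled in depends on what was passed to the helper:

from eegdash.schemas import create_dataset

ds = create_dataset(dataset_id="ds001785", subjects_count=20, ages=[20, 25, 30])

print(ds["dataset_id"])            # 'ds001785'
demo = ds.get("demographics", {})  # nested Demographics summary
print(demo.get("subjects_count"), demo.get("age_min"), demo.get("age_max"))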
- class eegdash.schemas.Record[source]
Bases: TypedDict

TypedDict schema for a Record document.

Represents a single data file and its metadata. This structure is kept flat and minimal to ensure fast loading times when querying millions of records.
- dataset
Foreign key matching Dataset.dataset_id.
- Type:
str
- data_name
Unique name for the data item (e.g., “ds001_sub-01_task-rest”).
- Type:
str
- bidspath
Legacy path identifier (e.g., “ds001/sub-01/eeg/…”).
- Type:
str
- bids_relpath
Standard BIDS relative path (e.g., “sub-01/eeg/…”).
- Type:
str
- datatype
BIDS datatype (e.g., “eeg”).
- Type:
str
- suffix
Filename suffix (e.g., “eeg”).
- Type:
str
- extension
File extension (e.g., “.vhdr”).
- Type:
str
- recording_modality
Modality of the recording.
- Type:
list[str] | None
- entities
BIDS entities dict (subject, session, etc.).
- Type:
Entities
- entities_mne
BIDS entities sanitized for compatibility with MNE-Python (e.g., numeric runs).
- Type:
Entities
- storage
Storage location details.
- Type:
Storage
- ch_names
List of channel names.
- Type:
list[str] | None
- sampling_frequency
Sampling rate in Hz.
- Type:
float | None
- nchans
Channel count.
- Type:
int | None
- ntimes
Number of time points.
- Type:
int | None
- digested_at
Timestamp of when this record was processed.
- Type:
str
- dataset: str
- data_name: str
- bidspath: str
- bids_relpath: str
- datatype: str
- suffix: str
- extension: str
- recording_modality: list[str] | None
- entities: Entities
- entities_mne: Entities
- storage: Storage
- ch_names: list[str] | None
- sampling_frequency: float | None
- nchans: int | None
- ntimes: int | None
- digested_at: str
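To illustrate the flat layout, a client-side filter over a list of Record dicts only needs a few key lookups; the records list, task value, and rate threshold below are illustrative:

def rest_eeg_records(records, min_sfreq=250.0):
    """Keep resting-state records with a sufficient sampling rate."""
    return [
        r for r in records
        if r.get("entities", {}).get("task") == "rest"
        and (r.get("sampling_frequency") or 0.0) >= min_sfreq
    ]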
- class eegdash.schemas.Storage[source]
Bases: TypedDict

Remote storage location details.
- backend
Storage backend protocol.
- Type:
{‘s3’, ‘https’, ‘local’}
- base
Base URI (e.g., “s3://openneuro.org/ds000001”).
- Type:
str
- raw_key
Path relative to base to reach the file.
- Type:
str
- dep_keys
Paths relative to base for sidecar files (e.g., .json, .vhdr).
- Type:
list[str]
- backend: Literal['s3', 'https', 'local']
- base: str
- raw_key: str
- dep_keys: list[str]
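For illustration, the raw and sidecar file URIs can be resolved by joining raw_key and dep_keys onto base; the simple '/'-joining shown here is an assumption, not the library's own resolver:

storage = {
    "backend": "s3",
    "base": "s3://openneuro.org/ds000001",
    "raw_key": "sub-01/eeg/sub-01_task-rest_eeg.vhdr",
    "dep_keys": ["sub-01/eeg/sub-01_task-rest_eeg.json"],
}

raw_uri = f"{storage['base'].rstrip('/')}/{storage['raw_key']}"
dep_uris = [f"{storage['base'].rstrip('/')}/{key}" for key in storage["dep_keys"]]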
- class eegdash.schemas.Entities[source]
Bases: TypedDict

BIDS entities parsed from the file path.
- subject
Subject label (e.g., “01”).
- Type:
str | None
- session
Session label (e.g., “pre”).
- Type:
str | None
- task
Task label (e.g., “rest”).
- Type:
str | None
- run
Run label (e.g., “1” or “01”).
- Type:
str | None
- subject: str | None
- session: str | None
- task: str | None
- run: str | None
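For illustration, these entities map onto the usual BIDS key-value pairs in a filename; this is a rough sketch only, and real path handling should go through a BIDS-aware library:

entities = {"subject": "01", "session": None, "task": "rest", "run": "1"}
bids_keys = {"subject": "sub", "session": "ses", "task": "task", "run": "run"}

basename = "_".join(
    f"{bids_keys[name]}-{value}"
    for name, value in entities.items()
    if value is not None
)
print(basename)  # 'sub-01_task-rest_run-1'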
- class eegdash.schemas.Demographics[source]
Bases: TypedDict

Subject demographics summary for a dataset.
- subjects_count
Total number of subjects.
- Type:
int
- ages
List of all subject ages (if available).
- Type:
list[int]
- age_min
Minimum age in the cohort.
- Type:
int | None
- age_max
Maximum age in the cohort.
- Type:
int | None
- age_mean
Mean age of subjects.
- Type:
float | None
- species
Species of subjects (e.g., “Human”, “Mouse”).
- Type:
str | None
- sex_distribution
Count of subjects by sex (e.g., {“m”: 50, “f”: 45}).
- Type:
dict[str, int] | None
- handedness_distribution
Count of subjects by handedness (e.g., {“r”: 80, “l”: 15}).
- Type:
dict[str, int] | None
- subjects_count: int
- ages: list[int]
- age_min: int | None
- age_max: int | None
- age_mean: float | None
- species: str | None
- sex_distribution: dict[str, int] | None
- handedness_distribution: dict[str, int] | None
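For illustration, the summary fields are simple aggregates over the raw ages list; this sketch only shows how the fields relate to each other, not how EEGDash computes them internally:

ages = [20, 25, 30]
demographics = {
    "subjects_count": len(ages),
    "ages": ages,
    "age_min": min(ages),
    "age_max": max(ages),
    "age_mean": sum(ages) / len(ages),  # 25.0
    "species": "Human",
    "sex_distribution": {"m": 2, "f": 1},
    "handedness_distribution": {"r": 3},
}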
- class eegdash.schemas.Clinical[source]
Bases: TypedDict

Clinical classification metadata (dataset-level).
- is_clinical
True if the dataset contains clinical population data.
- Type:
bool
- purpose
The clinical condition or purpose (e.g., “epilepsy”, “depression”).
- Type:
str | None
- is_clinical: bool
- purpose: str | None
- class eegdash.schemas.Paradigm[source]
Bases: TypedDict

Experimental paradigm classification (dataset-level).
- modality
The sensory or experimental modality (e.g., “visual”, “auditory”, “resting_state”).
- Type:
str | None
- cognitive_domain
The cognitive domain investigated (e.g., “memory”, “language”, “emotion”).
- Type:
str | None
- is_10_20_system
True if electrodes are positioned according to the standard 10-20 system.
- Type:
bool | None
- modality: str | None
- cognitive_domain: str | None
- is_10_20_system: bool | None
- class eegdash.schemas.ExternalLinks[source]
Bases: TypedDict

Relevant external hyperlinks for the dataset.
- source_url
URL to the primary data source (e.g. OpenNeuro page).
- Type:
str | None
- osf_url
URL to the Open Science Framework project.
- Type:
str | None
- github_url
URL to the associated GitHub repository.
- Type:
str | None
- paper_url
URL to the primary publication.
- Type:
str | None
- source_url: str | None
- osf_url: str | None
- github_url: str | None
- paper_url: str | None
- class eegdash.schemas.RepositoryStats[source]
Bases: TypedDict

Statistics for git-based repositories (e.g. GIN).
- stars
Number of stars.
- Type:
int
- forks
Number of forks.
- Type:
int
- watchers
Number of watchers.
- Type:
int
- stars: int
- forks: int
- watchers: int
- class eegdash.schemas.Timestamps[source]
Bases: TypedDict

Processing and lifecycle timestamps.
- digested_at
ISO 8601 timestamp of when the data was processed by EEGDash.
- Type:
str
- dataset_created_at
ISO 8601 timestamp of when the dataset was originally created.
- Type:
str | None
- dataset_modified_at
ISO 8601 timestamp of when the dataset was last updated.
- Type:
str | None
- digested_at: str
- dataset_created_at: str | None
- dataset_modified_at: str | None
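For illustration, ISO 8601 strings of the kind expected here can be produced with the standard library; the creation/modification values below are placeholders:

from datetime import datetime, timezone

timestamps = {
    "digested_at": datetime.now(timezone.utc).isoformat(),  # e.g. '2024-05-01T12:34:56+00:00'
    "dataset_created_at": "2021-06-01T00:00:00+00:00",
    "dataset_modified_at": None,
}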
- eegdash.schemas.create_dataset(*, dataset_id: str, name: str | None = None, source: str = 'openneuro', readme: str | None = None, recording_modality: list[str] | None = None, datatypes: list[str] | None = None, modalities: list[str] | None = None, experimental_modalities: list[str] | None = None, bids_version: str | None = None, license: str | None = None, authors: list[str] | None = None, funding: list[str] | None = None, dataset_doi: str | None = None, associated_paper_doi: str | None = None, tasks: list[str] | None = None, sessions: list[str] | None = None, total_files: int | None = None, size_bytes: int | None = None, data_processed: bool | None = None, study_domain: str | None = None, study_design: str | None = None, subjects_count: int | None = None, ages: list[int] | None = None, age_mean: float | None = None, species: str | None = None, sex_distribution: dict[str, int] | None = None, handedness_distribution: dict[str, int] | None = None, contributing_labs: list[str] | None = None, is_clinical: bool | None = None, clinical_purpose: str | None = None, paradigm_modality: str | None = None, cognitive_domain: str | None = None, is_10_20_system: bool | None = None, source_url: str | None = None, osf_url: str | None = None, github_url: str | None = None, paper_url: str | None = None, stars: int | None = None, forks: int | None = None, watchers: int | None = None, senior_author: str | None = None, contact_info: list[str] | None = None, digested_at: str | None = None, dataset_created_at: str | None = None, dataset_modified_at: str | None = None, storage: Storage | None = None) Dataset[source]
Create a Dataset document.
This helper function constructs a Dataset TypedDict with default values and logic to handle nested structures like demographics, clinical info, and external links.
- Parameters:
dataset_id (str) – Dataset identifier (e.g., “ds001785”).
name (str, optional) – Dataset title/name.
source (str, default "openneuro") – Data source (“openneuro”, “nemar”, “gin”).
recording_modality (list[str], optional) – Recording types (e.g., [“eeg”, “meg”, “ieeg”]).
datatypes (list[str], optional) – BIDS datatypes present in the dataset (e.g., [“eeg”, “anat”, “beh”]).
experimental_modalities (list[str], optional) – Stimulus/experimental modalities (e.g., [“visual”, “auditory”, “tactile”]).
bids_version (str, optional) – BIDS version of the dataset.
license (str, optional) – Dataset license (e.g., “CC0”, “CC-BY-4.0”).
authors (list[str], optional) – Dataset authors.
funding (list[str], optional) – Funding sources.
dataset_doi (str, optional) – Dataset DOI.
associated_paper_doi (str, optional) – DOI of associated publication.
tasks (list[str], optional) – Tasks in the dataset.
sessions (list[str], optional) – Sessions in the dataset.
total_files (int, optional) – Total number of files.
size_bytes (int, optional) – Total size in bytes.
data_processed (bool, optional) – Whether data is processed.
study_domain (str, optional) – Study domain/topic.
study_design (str, optional) – Study design description.
subjects_count (int, optional) – Number of subjects.
ages (list[int], optional) – Subject ages.
age_mean (float, optional) – Mean age of subjects.
species (str, optional) – Species (e.g., “Human”).
sex_distribution (dict[str, int], optional) – Sex distribution (e.g., {“m”: 50, “f”: 45}).
handedness_distribution (dict[str, int], optional) – Handedness distribution (e.g., {“r”: 80, “l”: 15}).
contributing_labs (list[str], optional) – Labs that contributed data (for multi-site studies).
is_clinical (bool, optional) – Whether this is clinical data.
clinical_purpose (str, optional) – Clinical purpose (e.g., “epilepsy”, “depression”).
paradigm_modality (str, optional) – Experimental modality (e.g., “visual”, “auditory”, “resting_state”).
cognitive_domain (str, optional) – Cognitive domain (e.g., “attention”, “memory”, “motor”).
is_10_20_system (bool, optional) – Whether electrodes follow the 10-20 system.
source_url (str, optional) – Primary URL to the dataset source.
osf_url (str, optional) – Open Science Framework URL.
github_url (str, optional) – GitHub repository URL.
paper_url (str, optional) – URL to associated paper.
stars (int, optional) – Repository stars count (for git-based sources).
forks (int, optional) – Repository forks count.
watchers (int, optional) – Repository watchers count.
digested_at (str, optional) – ISO 8601 timestamp. If not provided, no timestamp is set (for deterministic output).
dataset_created_at (str, optional) – ISO 8601 timestamp of when the dataset was originally created.
dataset_modified_at (str, optional) – Last modification timestamp.
- Returns:
A fully populated Dataset document.
- Return type:
Dataset
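For illustration, the flat clinical/paradigm keyword arguments are folded into the nested Clinical, Paradigm, and ExternalLinks sub-documents described above; the dataset values here are placeholders, and the exact nesting shown in the comments is an assumption based on those schemas:

from eegdash.schemas import create_dataset

ds = create_dataset(
    dataset_id="ds001785",
    is_clinical=True,
    clinical_purpose="epilepsy",
    paradigm_modality="resting_state",
    cognitive_domain="memory",
    source_url="https://openneuro.org/datasets/ds001785",
)
# Expected shape, per the Clinical / Paradigm / ExternalLinks schemas:
#   ds["clinical"]   -> {"is_clinical": True, "purpose": "epilepsy"}
#   ds["paradigm"]   -> {"modality": "resting_state", "cognitive_domain": "memory", ...}
#   ds["external_links"]["source_url"] -> "https://openneuro.org/datasets/ds001785"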
- eegdash.schemas.create_record(*, dataset: str, storage_base: str, bids_relpath: str, subject: str | None = None, session: str | None = None, task: str | None = None, run: str | None = None, dep_keys: list[str] | None = None, datatype: str = 'eeg', suffix: str = 'eeg', storage_backend: Literal['s3', 'https', 'local'] = 's3', recording_modality: list[str] | None = None, ch_names: list[str] | None = None, sampling_frequency: float | None = None, nchans: int | None = None, ntimes: int | None = None, digested_at: str | None = None) Record[source]
Create an EEGDash record.
Helper to construct a valid Record TypedDict.
- Parameters:
dataset (str) – Dataset identifier (e.g., “ds000001”).
storage_base (str) – Remote storage base URI (e.g., “s3://openneuro.org/ds000001”).
bids_relpath (str) – BIDS-relative path to the raw file (e.g., “sub-01/eeg/sub-01_task-rest_eeg.vhdr”).
subject (str, optional) – BIDS subject label (e.g., “01”).
session (str, optional) – BIDS session label (e.g., “pre”).
task (str, optional) – BIDS task label (e.g., “rest”).
run (str, optional) – BIDS run label (e.g., “1”).
dep_keys (list[str], optional) – Dependency paths relative to storage_base.
datatype (str, default "eeg") – BIDS datatype.
suffix (str, default "eeg") – BIDS suffix.
storage_backend ({"s3", "https", "local"}, default "s3") – Storage backend type.
recording_modality (list[str], optional) – Recording modalities (e.g., [“eeg”, “meg”, “ieeg”]).
digested_at (str, optional) – ISO 8601 timestamp. Defaults to current time.
- Returns:
A slim EEGDash record optimized for loading.
- Return type:
Record
Notes
Clinical and paradigm info is stored at the Dataset level, not per-file.
Examples
>>> record = create_record(
...     dataset="ds000001",
...     storage_base="s3://openneuro.org/ds000001",
...     bids_relpath="sub-01/eeg/sub-01_task-rest_eeg.vhdr",
...     subject="01",
...     task="rest",
... )
- eegdash.schemas.validate_dataset(dataset: dict[str, Any]) list[str][source]
Validate a dataset has required fields. Returns list of errors.
- eegdash.schemas.validate_record(record: dict[str, Any]) list[str][source]
Validate a record has required fields. Returns list of errors.