Iss. 104 · 108 subjects · 1136 recordings · CC-BY-NC-SA-4.0

Dataset Brief · emg2qwerty

NM000104: EMG dataset, 108 subjects

emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography

Citation: Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R. Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, Michael I. Mandel (2024). emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography. 10.82901/nemar.nm000104

108-participant EMG dataset — emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography.

Data & curation: Viswanath Sivakumar · Jeffrey Seely · Alan Du · Sean R. Bittner · Adam Berenzweig · Anuoluwapo Bolarinwa · …
Year: 2024 · Distributed via NeMAR
Funding: Meta Reality Labs

EMG · 32 ch · 2000 Hz · BIDS 1.11.0 · Task: typing · 1135 sessions

Layer 01 · Study

What was asked

Hypothesis, independent & dependent variables, paradigm, cohort, and the editorial caveats around what the recordings can and cannot answer.

Layer 02 · Signal · BIDS

What was recorded

Sidecars, channels & electrodes, coordinate system, event semantics, and quality stats from the NEMAR pipeline when available.

Layer 03 · Training · ML

What you can train on

Recommended access modes — MNE Raw, braindecode windows, PyTorch DataLoader — plus the targets the metadata makes addressable.

§ 01 Access · Get started

Quickstart

Get Started

Install

pip install eegdash

Access the data

from eegdash.dataset import NM000104

dataset = NM000104(cache_dir="./data")
# Get the raw object of the first recording
raw = dataset.datasets[0].raw
print(raw.info)

Query & Filter

Filter by subject

dataset = NM000104(cache_dir="./data", subject="01")

Advanced query

dataset = NM000104(
    cache_dir="./data",
    query={"subject": {"$in": ["01", "02"]}},
)
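To make the `$in` operator concrete, here is a minimal sketch of how a MongoDB-style filter selects records. The `matches` helper and the sample records are hypothetical illustrations and are not part of EEGDash, which evaluates the query against its own catalog.

```python
def matches(record: dict, query: dict) -> bool:
    """Return True if `record` satisfies every condition in `query`.

    Toy subset of MongoDB semantics: plain equality and the `$in` operator.
    Top-level fields are combined with a logical AND.
    """
    for field, condition in query.items():
        value = record.get(field)
        if isinstance(condition, dict) and "$in" in condition:
            if value not in condition["$in"]:
                return False
        elif value != condition:
            return False
    return True


# Made-up records standing in for catalog entries.
records = [{"subject": "01"}, {"subject": "02"}, {"subject": "03"}]
query = {"subject": {"$in": ["01", "02"]}}

selected = [r["subject"] for r in records if matches(r, query)]
print(selected)
```

Under these assumptions the filter keeps the records for subjects "01" and "02" and drops "03".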

Iterate recordings

# Iterate over the per-recording entries (the same objects as dataset.datasets[0] above)
for rec in dataset.datasets:
    print(rec.subject, rec.raw.info['sfreq'])

Cite This Dataset

If you use this dataset in your research, please cite the original authors.

BibTeX

@dataset{nm000104,
  title = {emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography},
  author = {Viswanath Sivakumar and Jeffrey Seely and Alan Du and Sean R. Bittner and Adam Berenzweig and Anuoluwapo Bolarinwa and Alexandre Gramfort and Michael I. Mandel},
  doi = {10.82901/nemar.nm000104},
  url = {https://doi.org/10.82901/nemar.nm000104},
}

§ 02 Study · The README

About This Dataset

Dataset: emg2qwerty - Touch typing from wrist-based surface electromyography

Task: Touch typing on QWERTY keyboard
Participants: 108 subjects
Sessions: 1,135 total (average 10 per subject, range 1-18)
Duration: 346.4 hours total (9.5-47.5 min per session)
Publication: Sivakumar et al., 2024 - “emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography”

This dataset captures wrist-based sEMG signals during touch typing on a physical keyboard. The goal is to enable keyboard-free text input by decoding typing intent directly from neuromuscular activity, with applications in AR/VR, mobile computing, and brain-computer interfaces.

emg2qwerty: Touch Typing from Surface Electromyography

Overview

This is the largest public sEMG dataset to date, specifically designed to study:
- Cross-user generalization
- Cross-session adaptation (domain shift from electrode placement)
- Sequence-to-sequence learning (analogous to automatic speech recognition)
- High-bandwidth neuromotor interfaces

Dataset Details

Participants

Sample size: 108 participants
Demographics: Not available (age, sex, handedness marked as n/a)
Screening: Touch typists with >90% correct finger-to-key mapping
Typing speed: 130-439 keys/min (mean: 265 keys/min, ~4.4 keys/sec)
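The typing-speed figures can be converted into per-second rates and average gaps between keystrokes with a few lines of arithmetic. This is only a unit conversion of the numbers quoted above; the helper names are made up for the example.

```python
# Typing speeds quoted in the README, in keys per minute.
KEYS_PER_MIN = {"slowest": 130, "mean": 265, "fastest": 439}


def keys_per_second(keys_per_min: float) -> float:
    """Convert a keys/min rate to keys/sec."""
    return keys_per_min / 60.0


def mean_interval_ms(keys_per_min: float) -> float:
    """Average time between consecutive keystrokes, in milliseconds."""
    return 60_000.0 / keys_per_min


for label, kpm in KEYS_PER_MIN.items():
    print(f"{label}: {keys_per_second(kpm):.1f} keys/s, "
          f"{mean_interval_ms(kpm):.0f} ms between keys")
```

At the mean rate of 265 keys/min, consecutive keystrokes are roughly 226 ms apart on average.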

Hardware

Device: sEMG Research Device (sEMG-RD)
Configuration: Two wristbands (left and right wrists)
Channels: 32 total (16 per wrist)
Sampling rate: 2000 Hz
Bit depth: 12 bits
Dynamic range: ±6.6 mV
Bandwidth: 20-850 Hz
Connectivity: Bluetooth
Electrode type: Dry gold-plated differential pairs
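A quick back-of-the-envelope calculation relates the bit depth, dynamic range, and sampling rate. The sketch assumes a uniform quantizer spanning the full ±6.6 mV range, which the README does not state explicitly.

```python
BIT_DEPTH = 12            # bits per sample (from the hardware list)
FULL_SCALE_MV = 6.6       # dynamic range is +/- this value, in millivolts
SFREQ_HZ = 2000.0         # sampling rate
BAND_HZ = (20.0, 850.0)   # analog bandwidth

# Number of quantization levels and the size of one step (LSB) in microvolts,
# assuming uniform quantization over the full range.
levels = 2 ** BIT_DEPTH
lsb_uv = (2 * FULL_SCALE_MV * 1000.0) / levels

# The Nyquist frequency must sit above the upper band edge to avoid aliasing.
nyquist_hz = SFREQ_HZ / 2.0

print(f"{levels} levels, about {lsb_uv:.2f} uV per step")
print(f"Nyquist {nyquist_hz:.0f} Hz, upper band edge {BAND_HZ[1]:.0f} Hz")
```

Under that assumption one quantization step is about 3.22 µV, and the 850 Hz upper band edge lies below the 1000 Hz Nyquist frequency.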

Recording Setup

Keyboard: Apple Magic Keyboard (US English)
Text prompts:
- Random words from dictionary
- Sentences from English Wikipedia
- Filtered for offensive terms
- Lowercase with basic punctuation only

Ground truth: Keylogger recording key-down and key-up timestamps (±0.5 ms precision)
Backspace usage: Allowed (natural typing behavior)

Session Protocol

Participant dons two sEMG-RDs (one per wrist)

Types prompted text on physical keyboard

Keylogger records all keystrokes with timestamps

sEMG signals streamed via Bluetooth

Between sessions: Bands doffed and re-donned (realistic electrode placement variability)

Session duration: 9.5-47.5 minutes (depends on typing speed)
Inter-session protocol: Complete band removal and replacement to simulate real-world usage

Data Contents

Files per Session
sub-XXXXXXXX/ses-YYYYYYYYYY/emg/
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_emg.edf
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_emg.json
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_channels.tsv
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_events.tsv
└── sub-XXXXXXXX_ses-YYYYYYYYYY_electrodes.tsv
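The naming pattern in the tree above can be generated programmatically. The sketch below builds the five expected paths for one session; the subject and session identifiers are placeholders invented for the example, and `session_files` is a hypothetical helper, not part of any library.

```python
from pathlib import PurePosixPath


def session_files(subject: str, session: str, task: str = "typing") -> list:
    """Return the per-session file paths following the layout shown above."""
    folder = PurePosixPath(f"sub-{subject}") / f"ses-{session}" / "emg"
    stem = f"sub-{subject}_ses-{session}"
    return [
        folder / f"{stem}_task-{task}_emg.edf",       # signal data
        folder / f"{stem}_task-{task}_emg.json",      # recording sidecar
        folder / f"{stem}_task-{task}_channels.tsv",  # channel table
        folder / f"{stem}_task-{task}_events.tsv",    # keystroke and prompt events
        folder / f"{stem}_electrodes.tsv",            # electrode positions (no task entity)
    ]


# Placeholder identifiers, not real subject or session labels.
files = session_files("00000001", "0000000001")
for path in files:
    print(path)
```

Note that the electrodes file carries no `task-` entity, matching the tree.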
Channel Configuration

Total channels: 32
- EMG0-EMG15: Left wrist
- EMG16-EMG31: Right wrist

Channel naming: Unique across entire dataset (EMG0-EMG31)
Electrode naming: E0-E15 (reused for left and right wrists)
Reference: Bipolar (differential sensing)

channels.tsv columns:
- name: Channel identifier (EMG0-EMG31)
- type: EMG
- units: V
- signal_electrode: Physical electrode name (E0-E15)
- reference: bipolar
- group: left or right (wrist)
- target_muscle: forearm muscles

electrodes.tsv columns:
- name: Electrode identifier (E0-E15)
- x, y, z: 3D coordinates (percent units, no decimals)
- coordinate_system: leftForearm or rightForearm
- group: left or right
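As an illustration of how the `group` column separates the two wristbands, the following sketch parses a small in-memory stand-in for channels.tsv. The sample rows are invented, and the specific channel-to-electrode pairing they show (for example EMG16 mapping to E0) is an assumption made for the example rather than something stated in the README.

```python
import csv
import io
from collections import defaultdict

# Invented excerpt with a subset of the documented columns.
SAMPLE_CHANNELS_TSV = (
    "name\ttype\tunits\tsignal_electrode\treference\tgroup\n"
    "EMG0\tEMG\tV\tE0\tbipolar\tleft\n"
    "EMG15\tEMG\tV\tE15\tbipolar\tleft\n"
    "EMG16\tEMG\tV\tE0\tbipolar\tright\n"
    "EMG31\tEMG\tV\tE15\tbipolar\tright\n"
)


def channels_by_wrist(tsv_text: str) -> dict:
    """Group channel names by the value of the `group` column."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        groups[row["group"]].append(row["name"])
    return dict(groups)


by_wrist = channels_by_wrist(SAMPLE_CHANNELS_TSV)
print(by_wrist)
```

With a real file, the same function could be fed the contents of a session's `_channels.tsv`.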

Events

events.tsv contains:
- Keystroke events: Individual key-press and key-release
  - type: keystroke_X (where X is the key character)
  - latency: Sample index of keystroke
  - duration: Samples from press to release
  - key: Character typed
- Prompt events: Text prompts shown to participant
  - type: prompt
  - prompt_text: Displayed text

Total keystrokes: 5,262,671 across all sessions
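Because keystroke latency and duration are expressed in samples, they need to be divided by the 2000 Hz sampling rate to obtain times. The sketch below does this for an invented events table; the rows, the `n/a` placeholders, and the exact column order are assumptions for illustration only.

```python
import csv
import io

SFREQ_HZ = 2000.0  # sampling rate stated in the hardware section

# Invented rows using the column names described above.
SAMPLE_EVENTS_TSV = (
    "type\tlatency\tduration\tkey\tprompt_text\n"
    "prompt\t0\t0\tn/a\thello\n"
    "keystroke_h\t4000\t180\th\tn/a\n"
    "keystroke_e\t4400\t160\te\tn/a\n"
)


def keystrokes(tsv_text: str, sfreq: float = SFREQ_HZ) -> list:
    """Return (key, onset in seconds, hold time in ms) for keystroke rows."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = []
    for row in reader:
        if not row["type"].startswith("keystroke_"):
            continue  # skip prompt events
        onset_s = int(row["latency"]) / sfreq
        hold_ms = int(row["duration"]) * 1000.0 / sfreq
        result.append((row["key"], onset_s, hold_ms))
    return result


events = keystrokes(SAMPLE_EVENTS_TSV)
for key, onset_s, hold_ms in events:
    print(f"{key!r}: pressed at {onset_s:.3f} s, held for {hold_ms:.0f} ms")
```

In this toy table, sample index 4000 corresponds to 2.0 s and a duration of 180 samples to a 90 ms key hold.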

Coordinate Systems

Two separate coordinate systems (space entities):

Left Forearm (space-leftForearm_coordsystem.json):
EMGCoordinateSystem: Other
EMGCoordinateUnits: percent
X: USP → RSP (0-100%)
Y: Right-hand rule perpendicular (limits: Olecranon Process → Cubital Fossa)
Z: Midpoint RSP-USP → Lateral Humeral Epicondyle

Right Forearm (space-rightForearm_coordsystem.json):
EMGCoordinateSystem: Other
EMGCoordinateUnits: percent
X: RSP → USP (0-100%, reversed from left)
Y: Right-hand rule perpendicular (limits: Olecranon Process → Cubital Fossa)
Z: Midpoint RSP-USP → Lateral Humeral Epicondyle

Anatomical landmarks:
- RSP: Radial Styloid Process
- USP: Ulnar Styloid Process
- LHE: Lateral Humeral Epicondyle

Note: Same physical device worn on both wrists with reversed differential polarity

Signal Processing

Preprocessing Applied

High-pass filtering: 40 Hz cutoff (removes DC drift, motion artifacts)

Clock drift correction: Synchronization between devices and laptop

Temporal alignment: Left/right wristband sample alignment (±0.5 ms)

Irregular sampling handling: Resampling applied when deviation >1%
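To give a feel for what a 40 Hz high-pass does to a 2000 Hz signal, here is a minimal first-order filter in pure Python. The README does not say which filter design was used, so this is a stand-in for illustration and not the authors' preprocessing.

```python
import math


def highpass_first_order(samples, cutoff_hz=40.0, sfreq=2000.0):
    """Apply a simple first-order (RC-style) high-pass filter.

    Illustrative only: the actual filter type and order used for the
    dataset are not specified in the README.
    """
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sfreq
    alpha = rc / (rc + dt)
    out = []
    prev_x = samples[0]
    prev_y = 0.0
    for x in samples:
        # The output follows changes in the input and decays toward zero otherwise.
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


# A constant offset (DC) is removed entirely; a step decays back to zero.
dc_out = highpass_first_order([1.0] * 100)
step_out = highpass_first_order([0.0] * 10 + [1.0] * 2000)
print(max(abs(v) for v in dc_out), step_out[10], step_out[-1])
```

The step response shows the intended behaviour: the jump passes through, then the slow component is suppressed within a few tens of milliseconds.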

Signal Characteristics

Typical features:
- Muscle activation precedes keystroke by ~tens of milliseconds
- Different muscles activate for different fingers
- “Co-articulation” effects: sEMG affected by adjacent keystrokes
- Bigram/trigram context important for fast typists

Receptive field: Models typically need ~1 second context
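A context of about 1 second corresponds to 2000 samples per channel at 2000 Hz. The sketch below computes the start indices of overlapping windows of that length; the 0.5 s stride is an arbitrary choice for the example and is not taken from the paper.

```python
SFREQ_HZ = 2000   # sampling rate of the recordings
N_CHANNELS = 32   # each window would have shape (N_CHANNELS, window length)


def window_starts(n_samples, window_s=1.0, stride_s=0.5, sfreq=SFREQ_HZ):
    """Return start sample indices of fixed-length sliding windows."""
    window = int(round(window_s * sfreq))
    stride = int(round(stride_s * sfreq))
    if n_samples < window:
        return []  # recording shorter than one window
    return list(range(0, n_samples - window + 1, stride))


starts = window_starts(5000)  # a 2.5 s toy recording
print(starts)
```

Each start index `s` would select samples `s` to `s + 2000` across all 32 channels.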

Baseline Performance

Published Results (Sivakumar et al., 2024)

Generic Model (100 training users):
- Validation CER: 52.10 ± 5.54% (with 6-gram LM)
- Test CER: 51.78 ± 4.61% (with 6-gram LM)
- Interpretation: Unusable without personalization

Personalized Model (finetuned from generic):
- Validation CER: 8.31 ± 3.19% (with 6-gram LM)
- Test CER: 6.95 ± 3.61% (with 6-gram LM)
- Best user: 3.16% CER
- Usability threshold: ~10% CER

Model architecture: Time Depth Separable ConvNets (TDS)
Loss function: Connectionist Temporal Classification (CTC)
Language model: 6-gram modified Kneser-Ney (trained on WikiText-103)
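The results above are reported as character error rate (CER). A common definition is the Levenshtein edit distance between the decoded text and the reference, divided by the reference length; the sketch below implements that definition and is not taken from the authors' evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(f"{cer('hello world', 'hello wrld'):.2f}%")
```

One missing character in an 11-character reference gives a CER of about 9.09%, close to the ~10% usability threshold quoted above.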

Key Findings

Generalization emerges at scale: 100+ users needed for meaningful representations

Personalization essential: Generic model alone has >50% CER

Domain shift is severe: Cross-user variation much larger than cross-session

No obvious user clusters: Every user requires individual adaptation

Data Splits

Benchmark Setup (from paper)

Training set: 100 users (all sessions except 2 validation per user)
Validation set: 2 sessions from each of 100 training users
Test set: 8 held-out users
- Each test user: Multiple sessions split into train/val/test
- Used for personalization experiments

Note: This split ensures test users don’t influence generic model hyperparameters
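The shape of this split can be sketched on toy data. The code below holds out 8 users and reserves 2 sessions per remaining user for validation; it picks users and sessions at random, whereas the benchmark presumably uses a fixed assignment published by the authors, so treat it as an illustration of the structure only. All identifiers are invented, and every toy user has 10 sessions even though real users have between 1 and 18.

```python
import random


def make_splits(sessions_by_user, n_test_users=8, n_val_sessions=2, seed=0):
    """Split (user, session) pairs into train/val lists plus held-out test users."""
    rng = random.Random(seed)
    users = sorted(sessions_by_user)
    test_users = set(rng.sample(users, n_test_users))
    train, val = [], []
    for user in users:
        if user in test_users:
            continue  # held-out users never contribute to the generic model
        sessions = list(sessions_by_user[user])
        rng.shuffle(sessions)
        val += [(user, s) for s in sessions[:n_val_sessions]]
        train += [(user, s) for s in sessions[n_val_sessions:]]
    return train, val, sorted(test_users)


# Toy data: 108 invented users with 10 invented sessions each.
toy = {f"user{u:03d}": [f"ses{s}" for s in range(10)] for u in range(108)}
train, val, test_users = make_splits(toy)
print(len(train), len(val), len(test_users))
```

With 108 users this leaves 100 training users, giving 200 validation sessions and 800 training sessions in the toy setting.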

Use Cases

Machine Learning

Sequence-to-sequence learning: Similar to ASR but with different generative process

Domain adaptation: Cross-user, cross-session generalization

Transfer learning: Generic models with user-specific fine-tuning

Few-shot learning: Data-efficient personalization

Language modeling: Backspace-aware beam search decoding
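Backspaces make the keystroke sequence differ from the text that ends up on screen, which is why decoding has to account for them. The sketch below applies backspaces to a keystroke list; the `<BS>` token name is a placeholder chosen for the example, and this is not the beam search decoder used in the paper.

```python
def apply_backspaces(keys, backspace="<BS>"):
    """Replay a keystroke sequence and return the resulting text."""
    text = []
    for key in keys:
        if key == backspace:
            if text:
                text.pop()  # delete the previous character, if any
        else:
            text.append(key)
    return "".join(text)


typed = list("helo") + ["<BS>", "l", "o"]
print(apply_backspaces(typed))
```

Here seven keystrokes produce the five-character string "hello", so a decoder that emits keystrokes must track how each backspace rewrites the history.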

Neuroscience

Motor control: Understand muscle coordination during fine motor tasks

Motor learning: Track typing skill changes across sessions

Neuromuscular variability: Study individual differences in muscle recruitment

Applications

Keyboard-free typing: Text entry without physical keyboard

AR/VR interfaces: Text input for head-mounted displays

Silent communication: Private text entry in public spaces

Accessibility: Alternative input for users with limited mobility

Known Issues and Limitations

By Design

Touch typing required: Not representative of hunt-and-peck typists

English only: Language-specific

Physical keyboard: Not actual keyboard-free typing

Typing style variation: Individual strategies differ (especially non-fluent typists)

No demographic data: Age, sex, handedness not collected

Technical

Domain shift: Large variations across users and sessions

Signal amplitude: Varies with typing force (not normalized)

Backspace handling: More complex than speech (can modify history)

Hardware unavailable: sEMG-RD not commercially available

Data Quality

Irregular sampling: Some sessions required resampling (up to 9290% deviation detected)

Electrode placement: Intentionally varies across sessions (creates realistic challenge)

Session length: Varies by typing speed (9.5-47.5 min)
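One way to quantify irregular sampling is to compare each inter-sample interval against the nominal 0.5 ms period. The README does not define how its deviation percentage was computed, so the definition below (largest relative interval error, in percent) is an assumption for illustration.

```python
def max_interval_deviation_pct(timestamps_s, sfreq=2000.0):
    """Largest relative deviation of an inter-sample interval from nominal, in percent."""
    nominal = 1.0 / sfreq
    worst = 0.0
    for earlier, later in zip(timestamps_s, timestamps_s[1:]):
        deviation = abs((later - earlier) - nominal) / nominal * 100.0
        worst = max(worst, deviation)
    return worst


def needs_resampling(timestamps_s, sfreq=2000.0, threshold_pct=1.0):
    """Flag a recording whose worst interval deviates by more than the threshold."""
    return max_interval_deviation_pct(timestamps_s, sfreq) > threshold_pct


regular = [i / 2000.0 for i in range(100)]
gappy = [0.0, 0.0005, 0.0010, 0.0020]  # last interval is twice the nominal period

print(needs_resampling(regular), needs_resampling(gappy))
```

Under this definition a single dropped sample already yields a 100% deviation, so very large percentages can result from short gaps in the timestamp stream.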

Access and Contact

Original data: facebookresearch/emg2qwerty
BIDS conversion: Custom MATLAB tools using EEGLAB BIDS plugin
Data curator: Yahya Shirazi, SCCN, INC, UCSD
Contact: See original publication for corresponding author

License

Non-Commercial, Share Alike CC-BY-NC-SA 4.0

Citation
Sivakumar, V., Seely, J., Du, A., Bittner, S.R., Berenzweig, A.,
Bolarinwa, A., Gramfort, A., & Mandel, M.I. (2024).
emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography.
arXiv:2410.20081. https://github.com/facebookresearch/emg2qwerty
Data Curator

Yahya Shirazi
SCCN (Swartz Center for Computational Neuroscience)
INC (Institute for Neural Computation)
University of California San Diego

Version History

v1.0 (2025-10-01): Initial BIDS conversion

BIDS Version: 1.11 | EMG-BIDS: BEP-042 | Updated: Oct 1, 2025

§ 03 Cohort · Participants

Cohort

Dataset Statistics

Channel counts: 32 ch (n=1136 recordings)

Sampling frequencies: 2000.0 Hz (n=1136 recordings)

Total recording duration: 346 h

§ 04 Signal · Electrodes & trace

Signal · Electrodes & live trace

Fig. 01 Signal & montage 32 ch · EMG · 2000 Hz · 108 subjects, 1136 recordings

NEMAR Processing Statistics

The plots below are generated by NEMAR’s automated EEG pipeline. The histogram shows pipeline success for data cleaning and ICA decomposition, the percentage of data frames and EEG channels retained after artefact removal, line noise per channel (RMS, dB), and the age/gender distribution of participants.

HED event descriptors word cloud

§ 05 Manifest · BIDS tree

Manifest

File Explorer

Browse the BIDS file structure of this dataset. Records are fetched on demand from the EEGDash catalog the first time you open the explorer.

§ 06 API · Programmatic access

API Reference

Signature

class eegdash.dataset.NM000104(cache_dir: str, query: dict | None = None, s3_bucket: str | None = None, **kwargs)

Bases: EEGDashDataset

Source: eegdash/dataset/registry.py

emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography

Study: nm000104 (NeMAR)
Author (year): Sivakumar2024
Canonical: —

Also importable as: NM000104, Sivakumar2024.

Modality: emg. Subjects: 108; recordings: 1136; tasks: 1.

Parameters:

cache_dir (str | Path) – Directory where data are cached locally.
query (dict | None) – Additional MongoDB-style filters to AND with the dataset selection. Must not contain the key dataset.
s3_bucket (str | None) – Base S3 bucket used to locate the data.
**kwargs (dict) – Additional keyword arguments forwarded to EEGDashDataset.

data_dir

Local dataset cache directory (cache_dir / dataset_id).

Type: Path

query

Merged query with the dataset filter applied.

Type: dict

records

Metadata records used to build the dataset, if pre-fetched.

Type: list[dict] | None

Notes

Each item is a recording; recording-level metadata are available via dataset.description. query supports MongoDB-style filters on fields in ALLOWED_QUERY_FIELDS and is combined with the dataset filter. Dataset-specific caveats are not provided in the summary metadata.
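The note that `query` is combined with the dataset filter and must not contain the key `dataset` can be pictured with a small sketch. `merge_query` is a hypothetical helper written for this illustration and is not EEGDash's implementation; it relies only on the fact that top-level fields in a MongoDB-style filter are implicitly ANDed.

```python
def merge_query(dataset_id, user_query=None):
    """Combine a dataset filter with extra user filters (illustrative only)."""
    user_query = dict(user_query or {})
    if "dataset" in user_query:
        # Mirrors the documented restriction on the `query` parameter.
        raise ValueError("query must not contain the key 'dataset'")
    # Top-level keys of a MongoDB-style filter are combined with AND.
    return {"dataset": dataset_id, **user_query}


merged = merge_query("nm000104", {"subject": {"$in": ["01", "02"]}})
print(merged)
```

The merged filter selects records that belong to this dataset and also satisfy the user-supplied conditions.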

References

OpenNeuro dataset: https://openneuro.org/datasets/nm000104
NeMAR dataset: https://nemar.org/dataexplorer/detail?dataset_id=nm000104
DOI: https://doi.org/10.82901/nemar.nm000104

Examples

>>> from eegdash.dataset import NM000104
>>> dataset = NM000104(cache_dir="./data")
>>> recording = dataset[0]
>>> raw = recording.load()

__init__(cache_dir: str, query: dict | None = None, s3_bucket: str | None = None, **kwargs)

save(path: str, overwrite: bool = False, offset: int = 0)

Save datasets to files by creating one subdirectory for each dataset:

path/
    0/
        0-raw.fif | 0-epo.fif
        description.json
        raw_preproc_kwargs.json (if raws were preprocessed)
        window_kwargs.json (if this is a windowed dataset)
        window_preproc_kwargs.json  (if windows were preprocessed)
        target_name.json (if target_name is not None and dataset is raw)
    1/
        1-raw.fif | 1-epo.fif
        description.json
        raw_preproc_kwargs.json (if raws were preprocessed)
        window_kwargs.json (if this is a windowed dataset)
        window_preproc_kwargs.json  (if windows were preprocessed)
        target_name.json (if target_name is not None and dataset is raw)

Parameters:

path (str) – Directory in which subdirectories are created to store the -raw.fif | -epo.fif and .json files.
overwrite (bool) – Whether to delete old subdirectories that will be saved to in this call.
offset (int) – If provided, the integer is added to the id of the dataset in the concat. This is useful in the setting of very large datasets, where one dataset has to be processed and saved at a time to account for its original position.
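The effect of `offset` on the directory layout can be shown with a short sketch. It only reproduces the naming described above (one numbered subdirectory per dataset, shifted by `offset`); `save_subdirs` is a made-up helper and does not call or replicate the library's save code.

```python
def save_subdirs(path, n_datasets, offset=0):
    """List the subdirectories that would hold each dataset, per the layout above."""
    return [f"{path}/{i + offset}" for i in range(n_datasets)]


# Saving a second batch of two recordings after ten were already written.
print(save_subdirs("out", 2, offset=10))
```

Processing a large collection in batches and passing the running count as `offset` keeps each recording in the subdirectory that matches its original position.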

Access modes: MNE → braindecode → PyTorch → ML

.raw (mne): MNE Raw object; standard tools (filter, epoch, ICA, plot_psd).

BaseConcatDataset (braindecode): Each record is a lazy BaseDataset from braindecode, windowed via create_windows_from_events.

DataLoader (pytorch): Wraps the windowed dataset into a PyTorch DataLoader; supports parallel workers and on-the-fly augmentations.

Zarr cache (zarr): Optional braindecode Zarr mirror for fast resume; persisted to cache_dir.

Hugging Face (huggingface): No per-dataset mirror published yet; browse the EEGDash org listing for sibling datasets. See the datasets loader API.

Croissant 1.0 (mlcommons): Machine-readable JSON-LD descriptor, NM000104.croissant.json (MLCommons schema, ingestible by PyTorch / TensorFlow / JAX).

Examples using EEGDash (curated · start here)

Find datasets with the EEGDash API: Query the catalogue, filter by task or modality, list candidates.

Load one EEG recording: Resolve a single record to an MNE Raw with channels and events.

EEG recording to PyTorch DataLoader: Wrap braindecode windows in a DataLoader for model training.

Preprocess EEG and create windows: Filter, resample, epoch, and persist the windowed dataset.

Save and reload prepared data: Cache a windowed dataset to disk and reattach it without recompute.

Download a dataset locally: Prefetch BIDS files to a local cache and validate the layout.

Swap any load_dataset(...) call for nm000104 to reproduce the tutorial on this dataset.

Provenance

¹Contributed to nemar in BIDS format.

²Curated & ingested by the EEGDash catalog; see CITATION.cff for canonical reference.

³Persistent identifier: 10.82901/nemar.nm000104.

Related & sibling datasets

NM000106: EMG · 100 subj
NM000107: EMG · 100 subj
NM000105: EMG · 100 subj
NM000108: EMG · 20 subj
NM000159: EMG · 16 subj

+ 1 more sibling dataset.

BIDS: BIDS 1.11.0
Sidecars: events · channels · electrodes
Provenance: CC-BY-NC-SA-4.0 · 10.82901/nemar.nm000104
Machine-readable: schema.org/Dataset · Croissant
Mirrors: OpenNeuro · NEMAR · HF org

Dataset ID	`NM000104`
Title	emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography
Author (year)	`Sivakumar2024`
Canonical	—
Importable as	`NM000104`, `Sivakumar2024`
Year	2024
Authors	Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R. Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, Michael I. Mandel
License	CC-BY-NC-SA-4.0
Citation / DOI	10.82901/nemar.nm000104
Source links	OpenNeuro \| NeMAR \| Source URL