Iss. 104 · 108 subjects · 1136 recordings · CC-BY-NC-SA-4.0
Dataset Brief · emg2qwerty

NM000104: EMG dataset, 108 subjects

emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography

Citation: Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R. Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, Michael I. Mandel (2024). emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography. 10.82901/nemar.nm000104

108-participant EMG dataset — emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography.

EMG · 32 ch · 2000 Hz · BIDS 1.11.0 · Task: typing · 1135 sessions
Layer 01 · Study
What was asked
Hypothesis, independent & dependent variables, paradigm, cohort, and the editorial caveats around what the recordings can and cannot answer.
Layer 02 · Signal · BIDS
What was recorded
Sidecars, channels & electrodes, coordinate system, event semantics, and quality stats from the NEMAR pipeline when available.
Layer 03 · Training · ML
What you can train on
Recommended access modes (MNE Raw, braindecode windows, PyTorch DataLoader), plus the targets the metadata makes addressable.
§ 01 Access · Get started

Quickstart

Install

pip install eegdash

Access the data

from eegdash.dataset import NM000104

dataset = NM000104(cache_dir="./data")
# Get the raw object of the first recording
raw = dataset.datasets[0].raw
print(raw.info)

Filter by subject

dataset = NM000104(cache_dir="./data", subject="01")

Advanced query

dataset = NM000104(
    cache_dir="./data",
    query={"subject": {"$in": ["01", "02"]}},
)

Iterate recordings

for rec in dataset:
    print(rec.subject, rec.raw.info['sfreq'])

If you use this dataset in your research, please cite the original authors.

BibTeX

@dataset{nm000104,
  title = {emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography},
  author = {Viswanath Sivakumar and Jeffrey Seely and Alan Du and Sean R. Bittner and Adam Berenzweig and Anuoluwapo Bolarinwa and Alexandre Gramfort and Michael I. Mandel},
  doi = {10.82901/nemar.nm000104},
  url = {https://doi.org/10.82901/nemar.nm000104},
}
§ 02 Study · The README

About This Dataset

Dataset: emg2qwerty - Touch typing from wrist-based surface electromyography

Task: Touch typing on QWERTY keyboard
Participants: 108 subjects
Sessions: 1,135 total (average 10 per subject, range 1-18)
Duration: 346.4 hours total (9.5-47.5 min per session)
Publication: Sivakumar et al., 2024, “emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography”
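As a quick consistency check on these figures, the sketch below derives per-subject and per-session averages from the stated totals. The keystroke total (5,262,671) comes from the Events section of the README; the variable names are illustrative only.

```python
# Totals as stated in the README.
n_subjects = 108
n_sessions = 1135
total_hours = 346.4
total_keystrokes = 5_262_671  # from the Events section

total_minutes = total_hours * 60
sessions_per_subject = n_sessions / n_subjects
minutes_per_session = total_minutes / n_sessions
keys_per_minute = total_keystrokes / total_minutes

print(f"{sessions_per_subject:.1f} sessions per subject")
print(f"{minutes_per_session:.1f} min per session on average")
print(f"{keys_per_minute:.0f} keys/min averaged over all recorded time")
```

The dataset-wide rate this yields is of the same order as the reported per-participant mean of 265 keys/min. The two need not match exactly, since recorded time may include pauses between prompts.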

This dataset captures wrist-based sEMG signals during touch typing on a physical keyboard. The goal is to enable keyboard-free text input by decoding typing intent directly from neuromuscular activity, with applications in AR/VR, mobile computing, and brain-computer interfaces.

emg2qwerty: Touch Typing from Surface Electromyography

Overview

This is the largest public sEMG dataset to date, specifically designed to study:

  • Cross-user generalization

  • Cross-session adaptation (domain shift from electrode placement)

  • Sequence-to-sequence learning (analogous to automatic speech recognition)

  • High-bandwidth neuromotor interfaces

Dataset Details

Participants

Sample size: 108 participants
Demographics: Not available (age, sex, handedness marked as n/a)
Screening: Touch typists with >90% correct finger-to-key mapping
Typing speed: 130-439 keys/min (mean: 265 keys/min, ~4.4 keys/sec)

Hardware

Device: sEMG Research Device (sEMG-RD)
Configuration: Two wristbands (left and right wrists)
Channels: 32 total (16 per wrist)
Sampling rate: 2000 Hz
Bit depth: 12 bits
Dynamic range: ±6.6 mV
Bandwidth: 20-850 Hz
Connectivity: Bluetooth
Electrode type: Dry gold-plated differential pairs
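The bit depth and dynamic range together imply an amplitude resolution, and the sampling rate implies a Nyquist limit. The sketch below works both out, assuming uniform quantization across the full ±6.6 mV span. That assumption is not stated in the README.

```python
bit_depth = 12
full_scale_mv = 6.6  # dynamic range is ±6.6 mV

levels = 2 ** bit_depth                # 4096 quantization levels
span_uv = 2 * full_scale_mv * 1000.0   # 13,200 µV peak-to-peak
lsb_uv = span_uv / levels              # µV per quantization step (assumed uniform)

sampling_rate_hz = 2000
nyquist_hz = sampling_rate_hz / 2

print(f"LSB ≈ {lsb_uv:.2f} µV, Nyquist = {nyquist_hz:.0f} Hz")
```

Under that assumption one quantization step is a little over 3 µV, and the stated 20-850 Hz bandwidth sits below the 1000 Hz Nyquist frequency.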

Recording Setup

Keyboard: Apple Magic Keyboard (US English)
Text prompts:

  • Random words from dictionary

  • Sentences from English Wikipedia

  • Filtered for offensive terms

  • Lowercase with basic punctuation only

Ground truth: Keylogger recording key-down and key-up timestamps (±0.5 ms precision)
Backspace usage: Allowed (natural typing behavior)

Session Protocol

  1. Participant dons two sEMG-RDs (one per wrist)

  2. Types prompted text on physical keyboard

  3. Keylogger records all keystrokes with timestamps

  4. sEMG signals streamed via Bluetooth

  5. Between sessions: Bands doffed and re-donned (realistic electrode placement variability)

Session duration: 9.5-47.5 minutes (depends on typing speed)
Inter-session protocol: Complete band removal and replacement to simulate real-world usage

Data Contents

Files per Session

sub-XXXXXXXX/ses-YYYYYYYYYY/emg/
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_emg.edf
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_emg.json
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_channels.tsv
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_events.tsv
└── sub-XXXXXXXX_ses-YYYYYYYYYY_electrodes.tsv
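The naming pattern above can be turned into paths programmatically. The helper below is a minimal sketch: `session_files`, the root directory and the session label are hypothetical, and only the subject label `70495563` appears elsewhere on this page.

```python
from pathlib import Path


def session_files(root, subject, session, task="typing"):
    """Return the expected BIDS file paths for one session."""
    emg_dir = Path(root) / f"sub-{subject}" / f"ses-{session}" / "emg"
    stem = f"sub-{subject}_ses-{session}"
    return {
        "emg": emg_dir / f"{stem}_task-{task}_emg.edf",
        "sidecar": emg_dir / f"{stem}_task-{task}_emg.json",
        "channels": emg_dir / f"{stem}_task-{task}_channels.tsv",
        "events": emg_dir / f"{stem}_task-{task}_events.tsv",
        # electrodes.tsv carries no task entity in the listing above
        "electrodes": emg_dir / f"{stem}_electrodes.tsv",
    }


# Placeholder root and session label, for illustration only.
files = session_files("./data/nm000104", "70495563", "0000000001")
for kind, path in files.items():
    print(kind, path)
```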

Channel Configuration

Total channels: 32

  • EMG0-EMG15: Left wrist

  • EMG16-EMG31: Right wrist

Channel naming: Unique across entire dataset (EMG0-EMG31)
Electrode naming: E0-E15 (reused for left and right wrists)
Reference: Bipolar (differential sensing)

channels.tsv columns:

  • name: Channel identifier (EMG0-EMG31)

  • type: EMG

  • units: V

  • signal_electrode: Physical electrode name (E0-E15)

  • reference: bipolar

  • group: left or right (wrist)

  • target_muscle: forearm muscles

electrodes.tsv columns:

  • name: Electrode identifier (E0-E15)

  • x, y, z: 3D coordinates (percent units, no decimals)

  • coordinate_system: leftForearm or rightForearm

  • group: left or right
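To show how the channels.tsv columns fit together, the sketch below groups channels by wrist using only the standard library. The two-row table is synthetic. The mapping of EMG16 to electrode E0 is an assumption based on the statement that electrode names are reused across wrists.

```python
import csv
import io

# Synthetic two-row excerpt; real files hold 32 rows (EMG0-EMG31).
SAMPLE_CHANNELS_TSV = (
    "name\ttype\tunits\tsignal_electrode\treference\tgroup\ttarget_muscle\n"
    "EMG0\tEMG\tV\tE0\tbipolar\tleft\tforearm muscles\n"
    "EMG16\tEMG\tV\tE0\tbipolar\tright\tforearm muscles\n"
)


def channels_by_wrist(tsv_text):
    """Map each wrist ('left'/'right') to its (channel, electrode) pairs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    groups = {"left": [], "right": []}
    for row in reader:
        groups[row["group"]].append((row["name"], row["signal_electrode"]))
    return groups


groups = channels_by_wrist(SAMPLE_CHANNELS_TSV)
print(groups)
```

Because electrode names repeat, the channel name (or the pair of electrode name and group) is the safer key when joining channels.tsv with electrodes.tsv.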

Events

events.tsv contains:

  • Keystroke events: Individual key-press and key-release

      • type: keystroke_X (where X is the key character)

      • latency: Sample index of keystroke

      • duration: Samples from press to release

      • key: Character typed

  • Prompt events: Text prompts shown to participant

      • type: prompt

      • prompt_text: Displayed text

Total keystrokes: 5,262,671 across all sessions
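Since latency and duration are given in samples, converting them to seconds only needs the 2000 Hz sampling rate. The following sketch does that for a synthetic events table. The column order, the `n/a` fill values and the example rows are assumptions made for illustration, not content copied from the dataset.

```python
import csv
import io

SFREQ_HZ = 2000.0

# Synthetic rows; column names follow the README description.
SAMPLE_EVENTS_TSV = (
    "type\tlatency\tduration\tkey\tprompt_text\n"
    "prompt\t0\t0\tn/a\thi\n"
    "keystroke_h\t4000\t160\th\tn/a\n"
    "keystroke_i\t4500\t140\ti\tn/a\n"
)


def keystrokes(tsv_text, sfreq=SFREQ_HZ):
    """Return (key, press_time_s, hold_time_s) for each keystroke row."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    out = []
    for row in reader:
        if row["type"].startswith("keystroke_"):
            press_s = int(row["latency"]) / sfreq   # samples -> seconds
            hold_s = int(row["duration"]) / sfreq   # press-to-release time
            out.append((row["key"], press_s, hold_s))
    return out


strokes = keystrokes(SAMPLE_EVENTS_TSV)
for key, press_s, hold_s in strokes:
    print(f"{key!r} pressed at {press_s:.3f} s, held {hold_s * 1000:.0f} ms")
```

Quoting is disabled in the reader so that a quote character typed as a key would not be interpreted as a CSV quote.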

Coordinate Systems

Two separate coordinate systems (space entities):

Left Forearm (space-leftForearm_coordsystem.json):

EMGCoordinateSystem: Other
EMGCoordinateUnits: percent
X: USP → RSP (0-100%)
Y: Right-hand rule perpendicular (limits: Olecranon Process → Cubital Fossa)
Z: Midpoint RSP-USP → Lateral Humeral Epicondyle

Right Forearm (space-rightForearm_coordsystem.json):

EMGCoordinateSystem: Other
EMGCoordinateUnits: percent
X: RSP → USP (0-100%, reversed from left)
Y: Right-hand rule perpendicular (limits: Olecranon Process → Cubital Fossa)
Z: Midpoint RSP-USP → Lateral Humeral Epicondyle

Anatomical landmarks:

  • RSP: Radial Styloid Process

  • USP: Ulnar Styloid Process

  • LHE: Lateral Humeral Epicondyle

Note: Same physical device worn on both wrists with reversed differential polarity

Signal Processing

Preprocessing Applied

  1. High-pass filtering: 40 Hz cutoff (removes DC drift, motion artifacts)

  2. Clock drift correction: Synchronization between devices and laptop

  3. Temporal alignment: Left/right wristband sample alignment (±0.5 ms)

  4. Irregular sampling handling: Resampling applied when deviation >1%
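To give a feel for what a 40 Hz high-pass does to a 2000 Hz signal, the sketch below uses a first-order (one-pole) filter in pure Python. This is a swapped-in textbook filter chosen for brevity; the README does not state the filter type or order used in the actual preprocessing. Because the released data are already filtered, this is illustrative rather than a step to re-apply.

```python
import math


def highpass_one_pole(samples, cutoff_hz=40.0, sfreq_hz=2000.0):
    """First-order (RC-style) high-pass filter applied to a list of floats."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sfreq_hz
    alpha = rc / (rc + dt)
    out = []
    prev_x = samples[0] if samples else 0.0
    prev_y = 0.0
    for x in samples:
        # Pass changes in the input, let the running output decay.
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out


def tone(freq_hz, n=2000, sfreq_hz=2000.0):
    """One second of a unit-amplitude sine at freq_hz."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sfreq_hz) for i in range(n)]


def tail_peak(signal):
    """Largest absolute value in the second half, after the transient settles."""
    return max(abs(v) for v in signal[len(signal) // 2:])


dc_out = highpass_one_pole([1.0] * 2000)   # constant offset (DC)
slow_out = highpass_one_pole(tone(5.0))    # slow drift-like component
fast_out = highpass_one_pole(tone(500.0))  # component well above the cutoff

print(tail_peak(dc_out), tail_peak(slow_out), tail_peak(fast_out))
```

The constant offset is removed entirely, the 5 Hz component is strongly attenuated, and the 500 Hz component passes almost unchanged.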

Signal Characteristics

Typical features:

  • Muscle activation precedes keystroke by ~tens of milliseconds

  • Different muscles activate for different fingers

  • “Co-articulation” effects: sEMG affected by adjacent keystrokes

  • Bigram/trigram context important for fast typists

Receptive field: Models typically need ~1 second context

Baseline Performance

Published Results (Sivakumar et al., 2024)

Generic Model (100 training users):

  • Validation CER: 52.10 ± 5.54% (with 6-gram LM)

  • Test CER: 51.78 ± 4.61% (with 6-gram LM)

  • Interpretation: Unusable without personalization

Personalized Model (finetuned from generic):

  • Validation CER: 8.31 ± 3.19% (with 6-gram LM)

  • Test CER: 6.95 ± 3.61% (with 6-gram LM)

  • Best user: 3.16% CER

  • Usability threshold: ~10% CER

Model architecture: Time Depth Separable ConvNets (TDS)
Loss function: Connectionist Temporal Classification (CTC)
Language model: 6-gram modified Kneser-Ney (trained on WikiText-103)
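The results above are reported as character error rate (CER). A common definition is the Levenshtein edit distance between the decoded text and the prompt, divided by the prompt length. The sketch below implements that definition. It is assumed, not confirmed by this README, that the paper's evaluation scripts follow it exactly.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(f"{cer('hello world', 'helo wrld'):.2f}% CER")
```

Note that CER computed this way can exceed 100% when the hypothesis contains many insertions.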

Key Findings

  1. Generalization emerges at scale: 100+ users needed for meaningful representations

  2. Personalization essential: Generic model alone has >50% CER

  3. Domain shift is severe: Cross-user variation much larger than cross-session

  4. No obvious user clusters: Every user requires individual adaptation

Data Splits

Benchmark Setup (from paper)

Training set: 100 users (all sessions except 2 validation per user)
Validation set: 2 sessions from each of 100 training users
Test set: 8 held-out users

  • Each test user: Multiple sessions split into train/val/test

  • Used for personalization experiments

Note: This split ensures test users don’t influence generic model hyperparameters
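The structure of this split can be sketched as follows. The assignment here is random and runs on a synthetic cohort, so it reproduces only the shape of the benchmark (8 held-out users, 2 validation sessions per training user), not the paper's actual user and session lists. `make_splits` and the user and session labels are hypothetical.

```python
import random


def make_splits(sessions_by_user, n_test_users=8, n_val_sessions=2, seed=0):
    """Split sessions into generic-train, generic-val and held-out test users."""
    rng = random.Random(seed)
    users = sorted(sessions_by_user)
    test_users = set(rng.sample(users, n_test_users))
    splits = {"train": [], "val": [], "test": []}
    for user in users:
        sessions = sorted(sessions_by_user[user])
        if user in test_users:
            splits["test"] += [(user, s) for s in sessions]
            continue
        # Keep at least one training session per user.
        n_val = max(0, min(n_val_sessions, len(sessions) - 1))
        val_sessions = set(rng.sample(sessions, n_val))
        for s in sessions:
            splits["val" if s in val_sessions else "train"].append((user, s))
    return splits, test_users


# Synthetic cohort: 108 users with 4 sessions each (real counts range 1-18).
cohort = {f"user{u:03d}": [f"ses{s}" for s in range(4)] for u in range(108)}
splits, test_users = make_splits(cohort)
print({name: len(items) for name, items in splits.items()})
```

Holding out whole users, not individual sessions, is what keeps the test users from influencing the generic model.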

Use Cases

Machine Learning

  • Sequence-to-sequence learning: Similar to ASR but with different generative process

  • Domain adaptation: Cross-user, cross-session generalization

  • Transfer learning: Generic models with user-specific fine-tuning

  • Few-shot learning: Data-efficient personalization

  • Language modeling: Backspace-aware beam search decoding
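Backspace-aware decoding depends on the editing semantics of the keystroke stream, which a short sketch can make concrete. The function below is not the beam search itself; it only resolves backspaces into final text. Representing backspace as `"\b"` is an assumption, since the README does not say how the key is labeled in events.tsv.

```python
BACKSPACE = "\b"  # assumed token; the dataset's own label may differ


def apply_backspaces(keys):
    """Resolve a keystroke sequence into the text left on screen."""
    text = []
    for key in keys:
        if key == BACKSPACE:
            if text:          # backspace on empty text is a no-op
                text.pop()
        else:
            text.append(key)
    return "".join(text)


typed = list("helo") + [BACKSPACE] + list("lo")
print(apply_backspaces(typed))
```

Unlike speech recognition output, a later keystroke can retract an earlier one, so a decoder has to score the resolved text, not the raw key sequence.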

Neuroscience

  • Motor control: Understand muscle coordination during fine motor tasks

  • Motor learning: Track typing skill changes across sessions

  • Neuromuscular variability: Study individual differences in muscle recruitment

Applications

  • Keyboard-free typing: Text entry without physical keyboard

  • AR/VR interfaces: Text input for head-mounted displays

  • Silent communication: Private text entry in public spaces

  • Accessibility: Alternative input for users with limited mobility

Known Issues and Limitations

By Design

  • Touch typing required: Not representative of hunt-and-peck typists

  • English only: Language-specific

  • Physical keyboard: Not actual keyboard-free typing

  • Typing style variation: Individual strategies differ (especially non-fluent typists)

  • No demographic data: Age, sex, handedness not collected

Technical

  • Domain shift: Large variations across users and sessions

  • Signal amplitude: Varies with typing force (not normalized)

  • Backspace handling: More complex than speech (can modify history)

  • Hardware unavailable: sEMG-RD not commercially available

Data Quality

  • Irregular sampling: Some sessions required resampling (up to 9290% deviation detected)

  • Electrode placement: Intentionally varies across sessions (creates realistic challenge)

  • Session length: Varies by typing speed (9.5-47.5 min)

Access and Contact

Original data: facebookresearch/emg2qwerty
BIDS conversion: Custom MATLAB tools using EEGLAB BIDS plugin
Data curator: Yahya Shirazi, SCCN, INC, UCSD
Contact: See original publication for corresponding author

License

Non-Commercial, Share Alike CC-BY-NC-SA 4.0

Citation

Sivakumar, V., Seely, J., Du, A., Bittner, S.R., Berenzweig, A.,
Bolarinwa, A., Gramfort, A., & Mandel, M.I. (2024).
emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography.
arXiv:2410.20081. https://github.com/facebookresearch/emg2qwerty

Data Curator

Yahya Shirazi
SCCN (Swartz Center for Computational Neuroscience)
INC (Institute for Neural Computation)
University of California San Diego

Version History

v1.0 (2025-10-01): Initial BIDS conversion

BIDS Version: 1.11 | EMG-BIDS: BEP-042 | Updated: Oct 1, 2025

§ 03 Cohort · Participants

Cohort

Dataset Statistics

Channel counts: 32 ch (n=1136 recordings)

Sampling frequencies: 2000.0 Hz (n=1136 recordings)

Total recording duration: 346 h

§ 04 Signal · Electrodes & trace

Signal · Electrodes & live trace

Fig. 01. Signal & montage: 32 ch · EMG · 2000 Hz · 108 subjects, 1136 recordings. The trace shown is one representative recording (sub-70495563, task-typing); the full set can be browsed on OpenNeuro. Electrode layout: 32 EMG sensors.

NEMAR Processing Statistics

The plots below are generated by NEMAR’s automated EEG pipeline. The histogram shows pipeline success for data cleaning and ICA decomposition, the percentage of data frames and EEG channels retained after artefact removal, line noise per channel (RMS, dB), and the age/gender distribution of participants.

HED event descriptors word cloud (NM000104)
§ 05 Manifest · BIDS tree

Manifest

File Explorer

Browse the BIDS file structure of this dataset. Records are fetched on demand from the EEGDash catalog the first time you open the explorer.

Full dataset metadata table

Dataset ID: NM000104
Title: emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography
Author (year): Sivakumar2024
Canonical:
Importable as: NM000104, Sivakumar2024
Year: 2024
Authors: Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R. Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, Michael I. Mandel
License: CC-BY-NC-SA-4.0
Citation / DOI: 10.82901/nemar.nm000104
Source links: OpenNeuro | NeMAR | Source URL

§ 06 API · Programmatic access

API Reference

class eegdash.dataset.NM000104(cache_dir: str, query: dict | None = None, s3_bucket: str | None = None, **kwargs)

Module: eegdash.dataset
Bases: EEGDashDataset
Author (year): Sivakumar2024
Importable as: NM000104 · Sivakumar2024
Source: eegdash/dataset/registry.py

emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography

Study: nm000104 (NeMAR)
Author (year): Sivakumar2024
Canonical:
Also importable as: NM000104, Sivakumar2024.

Modality: emg. Subjects: 108; recordings: 1136; tasks: 1.

Parameters:
  • cache_dir (str | Path) – Directory where data are cached locally.

  • query (dict | None) – Additional MongoDB-style filters to AND with the dataset selection. Must not contain the key dataset.

  • s3_bucket (str | None) – Base S3 bucket used to locate the data.

  • **kwargs (dict) – Additional keyword arguments forwarded to EEGDashDataset.

Attributes:
  • data_dir (Path) – Local dataset cache directory (cache_dir / dataset_id).

  • query (dict) – Merged query with the dataset filter applied.

  • records (list[dict] | None) – Metadata records used to build the dataset, if pre-fetched.

Notes

Each item is a recording; recording-level metadata are available via dataset.description. query supports MongoDB-style filters on fields in ALLOWED_QUERY_FIELDS and is combined with the dataset filter. Dataset-specific caveats are not provided in the summary metadata.

References

OpenNeuro dataset: https://openneuro.org/datasets/nm000104
NeMAR dataset: https://nemar.org/dataexplorer/detail?dataset_id=nm000104
DOI: https://doi.org/10.82901/nemar.nm000104

Examples

>>> from eegdash.dataset import NM000104
>>> dataset = NM000104(cache_dir="./data")
>>> recording = dataset[0]
>>> raw = recording.load()

__init__(cache_dir: str, query: dict | None = None, s3_bucket: str | None = None, **kwargs)

save(path: str, overwrite: bool = False, offset: int = 0)

Save datasets to files by creating one subdirectory for each dataset:

path/
    0/
        0-raw.fif | 0-epo.fif
        description.json
        raw_preproc_kwargs.json (if raws were preprocessed)
        window_kwargs.json (if this is a windowed dataset)
        window_preproc_kwargs.json  (if windows were preprocessed)
        target_name.json (if target_name is not None and dataset is raw)
    1/
        1-raw.fif | 1-epo.fif
        description.json
        raw_preproc_kwargs.json (if raws were preprocessed)
        window_kwargs.json (if this is a windowed dataset)
        window_preproc_kwargs.json  (if windows were preprocessed)
        target_name.json (if target_name is not None and dataset is raw)
Parameters:
  • path (str) – Directory in which subdirectories are created to store the -raw.fif | -epo.fif and .json files.

  • overwrite (bool) – Whether to delete old subdirectories that will be saved to in this call.

  • offset (int) – If provided, the integer is added to the id of the dataset in the concat. This is useful in the setting of very large datasets, where one dataset has to be processed and saved at a time to account for its original position.

Access modes: MNE → braindecode → PyTorch → ML

  • .raw (mne): MNE Raw object; standard tools (filter, epoch, ICA, plot_psd).

  • DataLoader (pytorch): Wraps the windowed dataset into a PyTorch DataLoader; supports parallel workers and on-the-fly augmentations.

  • Zarr cache (zarr): Optional braindecode Zarr mirror for fast resume; persisted to cache_dir.

  • Hugging Face (huggingface): No per-dataset mirror published yet; browse the EEGDash org listing for sibling datasets. See the datasets loader API.

  • Croissant 1.0 (mlcommons): Machine-readable JSON-LD descriptor NM000104.croissant.json (MLCommons schema, ingestible by PyTorch / TensorFlow / JAX).
Examples using EEGDash (curated · start here)

Swap any load_dataset(...) call for nm000104 to reproduce the tutorial on this dataset.


Provenance

¹Contributed to nemar in BIDS format.

²Curated & ingested by the EEGDash catalog; see CITATION.cff for canonical reference.

³Persistent identifier: 10.82901/nemar.nm000104.

BIDS: 1.11.0
Sidecars: events · channels · electrodes
Provenance: CC-BY-NC-SA-4.0 · 10.82901/nemar.nm000104

See Also