NM000104#

emg2qwerty: Touch Typing from Surface Electromyography

Access recordings and metadata through EEGDash.

Citation: Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R. Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, Michael I. Mandel (2025). emg2qwerty: Touch Typing from Surface Electromyography. DOI: 10.5281/zenodo.17287903

Modality: emg | Subjects: 108 | Recordings: 1135 | License: CC-BY-NC-SA-4.0 | Source: nemar

Metadata: Complete (100%)

Quickstart#

Install

pip install eegdash

Access the data

from eegdash.dataset import NM000104

dataset = NM000104(cache_dir="./data")
# Get the raw object of the first recording
raw = dataset.datasets[0].raw
print(raw.info)

Filter by subject

dataset = NM000104(cache_dir="./data", subject="01")

Advanced query

dataset = NM000104(
    cache_dir="./data",
    query={"subject": {"$in": ["01", "02"]}},
)

Iterate recordings

for rec in dataset:
    print(rec.subject, rec.raw.info['sfreq'])

If you use this dataset in your research, please cite the original authors.

BibTeX

@dataset{nm000104,
  title = {emg2qwerty: Touch Typing from Surface Electromyography},
  author = {Viswanath Sivakumar and Jeffrey Seely and Alan Du and Sean R. Bittner and Adam Berenzweig and Anuoluwapo Bolarinwa and Alexandre Gramfort and Michael I. Mandel},
  doi = {10.5281/zenodo.17287903},
  url = {https://doi.org/10.5281/zenodo.17287903},
}

About This Dataset#

emg2qwerty: Touch Typing from Surface Electromyography

Overview

Dataset: emg2qwerty - Touch typing from wrist-based surface electromyography
Task: Touch typing on a QWERTY keyboard
Participants: 108 subjects
Sessions: 1,135 total (average 10 per subject, range 1-18)
Duration: 346.4 hours total (9.5-47.5 min per session)
Publication: Sivakumar et al., 2024 - “emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography”

Purpose

This dataset captures wrist-based sEMG signals during touch typing on a physical keyboard. The goal is to enable keyboard-free text input by decoding typing intent directly from neuromuscular activity, with applications in AR/VR, mobile computing, and brain-computer interfaces. The authors describe it as the largest public sEMG dataset to date. It is specifically designed to study:

  • Cross-user generalization

  • Cross-session adaptation (domain shift from electrode placement)

  • Sequence-to-sequence learning (analogous to automatic speech recognition)

  • High-bandwidth neuromotor interfaces

Dataset Details

Participants

Sample size: 108 participants
Demographics: Not available (age, sex, handedness marked as n/a)
Screening: Touch typists with >90% correct finger-to-key mapping
Typing speed: 130-439 keys/min (mean: 265 keys/min, ~4.4 keys/sec)

Hardware

Device: sEMG Research Device (sEMG-RD)
Configuration: Two wristbands (left and right wrists)
Channels: 32 total (16 per wrist)
Sampling rate: 2000 Hz
Bit depth: 12 bits
Dynamic range: ±6.6 mV
Bandwidth: 20-850 Hz
Connectivity: Bluetooth
Electrode type: Dry gold-plated differential pairs
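As a rough sanity check on the hardware figures, the quantization step and the raw data rate can be derived from the stated bit depth, dynamic range, channel count, and sampling rate. This is only a sketch: it assumes an ideal uniform quantizer spanning the full ±6.6 mV range and counts raw payload bits with no Bluetooth framing or compression, neither of which is confirmed by the dataset description.

```python
# Figures taken from the hardware description.
BIT_DEPTH = 12
RANGE_MV = 6.6          # dynamic range is +/- 6.6 mV
N_CHANNELS = 32
SFREQ_HZ = 2000

# Assumption: ideal uniform quantizer over the full +/- range.
levels = 2 ** BIT_DEPTH
lsb_uv = (2 * RANGE_MV * 1000) / levels      # microvolts per quantization step

# Assumption: raw payload only, no protocol overhead or compression.
raw_kbit_per_s = N_CHANNELS * SFREQ_HZ * BIT_DEPTH / 1000

print(f"{levels} levels, about {lsb_uv:.2f} uV per step")
print(f"about {raw_kbit_per_s:.0f} kbit/s raw across both wristbands")
```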

Recording Setup

Keyboard: Apple Magic Keyboard (US English)

Text prompts:

  • Random words from dictionary

  • Sentences from English Wikipedia

  • Filtered for offensive terms

  • Lowercase with basic punctuation only

Ground truth: Keylogger recording key-down and key-up timestamps (±0.5 ms precision)
Backspace usage: Allowed (natural typing behavior)

Session Protocol

  1. Participant dons two sEMG-RDs (one per wrist)

  2. Types prompted text on physical keyboard

  3. Keylogger records all keystrokes with timestamps

  4. sEMG signals streamed via Bluetooth

  5. Between sessions: Bands doffed and re-donned (realistic electrode placement variability)

Session duration: 9.5-47.5 minutes (depends on typing speed)
Inter-session protocol: Complete band removal and replacement to simulate real-world usage

Data Contents

Files per Session

sub-XXXXXXXX/ses-YYYYYYYYYY/emg/
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_emg.edf
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_emg.json
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_channels.tsv
├── sub-XXXXXXXX_ses-YYYYYYYYYY_task-typing_events.tsv
└── sub-XXXXXXXX_ses-YYYYYYYYYY_electrodes.tsv
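The layout above can be turned into a small path helper. The sketch below builds the five expected file paths for one session; the root directory and the subject and session identifiers are placeholders chosen for illustration, and the helper name `session_files` is hypothetical rather than part of EEGDash.

```python
from pathlib import Path

def session_files(root, subject, session, task="typing"):
    """Return the expected per-session file paths, keyed by a short label."""
    emg_dir = Path(root) / f"sub-{subject}" / f"ses-{session}" / "emg"
    stem = f"sub-{subject}_ses-{session}"
    return {
        "edf": emg_dir / f"{stem}_task-{task}_emg.edf",
        "sidecar": emg_dir / f"{stem}_task-{task}_emg.json",
        "channels": emg_dir / f"{stem}_task-{task}_channels.tsv",
        "events": emg_dir / f"{stem}_task-{task}_events.tsv",
        # electrodes.tsv carries no task entity in the layout shown above
        "electrodes": emg_dir / f"{stem}_electrodes.tsv",
    }

# "EXAMPLESUB", "EXAMPLESES" and the root path are placeholders, not real identifiers.
files = session_files("./data/nm000104", "EXAMPLESUB", "EXAMPLESES")
for label, path in files.items():
    print(label, path.name)
```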

Channel Configuration

Total channels: 32

  • EMG0-EMG15: Left wrist

  • EMG16-EMG31: Right wrist

Channel naming: Unique across entire dataset (EMG0-EMG31)
Electrode naming: E0-E15 (reused for left and right wrists)
Reference: Bipolar (differential sensing)

channels.tsv columns:

  • name: Channel identifier (EMG0-EMG31)

  • type: EMG

  • units: V

  • signal_electrode: Physical electrode name (E0-E15)

  • reference: bipolar

  • group: left or right (wrist)

  • target_muscle: forearm muscles

electrodes.tsv columns:

  • name: Electrode identifier (E0-E15)

  • x, y, z: 3D coordinates (percent units, no decimals)

  • coordinate_system: leftForearm or rightForearm

  • group: left or right
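Because electrode names repeat across wrists while channel names do not, it can help to group channels by the group column before pairing them with electrodes. The sketch below parses a small, made-up channels.tsv excerpt with the standard library. The column names follow the description above, but the rows (including the assumed pairing of EMG16 with E0) are illustrative and not copied from the dataset.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of a channels.tsv file; values are illustrative only.
SAMPLE_TSV = (
    "name\ttype\tunits\tsignal_electrode\treference\tgroup\ttarget_muscle\n"
    "EMG0\tEMG\tV\tE0\tbipolar\tleft\tforearm muscles\n"
    "EMG15\tEMG\tV\tE15\tbipolar\tleft\tforearm muscles\n"
    "EMG16\tEMG\tV\tE0\tbipolar\tright\tforearm muscles\n"
    "EMG31\tEMG\tV\tE15\tbipolar\tright\tforearm muscles\n"
)

def channels_by_wrist(tsv_text):
    """Map each wrist ("left"/"right") to a list of (channel, electrode) pairs."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        groups[row["group"]].append((row["name"], row["signal_electrode"]))
    return dict(groups)

by_wrist = channels_by_wrist(SAMPLE_TSV)
print(by_wrist["left"])
print(by_wrist["right"])
```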

Events

events.tsv contains:

  • Keystroke events: Individual key presses and releases
      - type: keystroke_X (where X is the key character)
      - latency: Sample index of keystroke
      - duration: Samples from press to release
      - key: Character typed

  • Prompt events: Text prompts shown to participant
      - type: prompt
      - prompt_text: Displayed text

Total keystrokes: 5,262,671 across all sessions
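Since latency and duration are expressed in samples, converting them to seconds only requires the 2000 Hz sampling rate. The sketch below does this for a made-up events.tsv excerpt. It assumes latency is counted from the first sample of the recording and that prompt rows can be skipped by their type; the sample rows are invented for illustration.

```python
import csv
import io

SFREQ_HZ = 2000  # sampling rate stated in the hardware section

# Hypothetical events.tsv excerpt; column names follow the description,
# the rows themselves are made up for illustration.
SAMPLE_EVENTS = (
    "type\tlatency\tduration\tkey\n"
    "keystroke_h\t4000\t180\th\n"
    "keystroke_i\t4500\t160\ti\n"
)

def keystroke_times(tsv_text, sfreq=SFREQ_HZ):
    """Return (key, press_time_s, release_time_s) for each keystroke row."""
    out = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        if not row["type"].startswith("keystroke_"):
            continue  # skip prompt events
        press = int(row["latency"]) / sfreq
        release = press + int(row["duration"]) / sfreq
        out.append((row["key"], press, release))
    return out

times = keystroke_times(SAMPLE_EVENTS)
for key, press, release in times:
    print(f"{key}: down {press:.3f} s, up {release:.3f} s")
```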

Coordinate Systems

Two separate coordinate systems (space entities):

Left Forearm (space-leftForearm_coordsystem.json):

EMGCoordinateSystem: Other
EMGCoordinateUnits: percent
X: USP → RSP (0-100%)
Y: Right-hand rule perpendicular (limits: Olecranon Process → Cubital Fossa)
Z: Midpoint RSP-USP → Lateral Humeral Epicondyle

Right Forearm (space-rightForearm_coordsystem.json):

EMGCoordinateSystem: Other
EMGCoordinateUnits: percent
X: RSP → USP (0-100%, reversed from left)
Y: Right-hand rule perpendicular (limits: Olecranon Process → Cubital Fossa)
Z: Midpoint RSP-USP → Lateral Humeral Epicondyle

Anatomical landmarks:

  • RSP: Radial Styloid Process

  • USP: Ulnar Styloid Process

  • LHE: Lateral Humeral Epicondyle

Note: Same physical device worn on both wrists with reversed differential polarity
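The two X axes run in opposite directions (USP → RSP on the left, RSP → USP on the right), so a percentage along one axis corresponds to 100 minus that percentage along the other. The sketch below shows only this axis flip; the function name is hypothetical, and it makes no claim that the flipped positions are anatomically equivalent across arms.

```python
def right_x_to_left_convention(x_percent):
    """Re-express a right-forearm X value (RSP -> USP) on a USP -> RSP axis."""
    if not 0 <= x_percent <= 100:
        raise ValueError("X is expressed as a percentage in [0, 100]")
    return 100 - x_percent

# A point 20% of the way from RSP to USP lies 80% of the way from USP to RSP.
print(right_x_to_left_convention(20))
```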

Signal Processing

Preprocessing Applied

  1. High-pass filtering: 40 Hz cutoff (removes DC drift, motion artifacts)

  2. Clock drift correction: Synchronization between devices and laptop

  3. Temporal alignment: Left/right wristband sample alignment (±0.5 ms)

  4. Irregular sampling handling: Resampling applied when deviation >1%
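The irregular-sampling check in step 4 can be sketched as a comparison between the effective and nominal sampling rates. The exact deviation measure used during conversion is not stated, so the definition below (relative difference between the mean effective rate and 2000 Hz) is an assumption, as are the function names.

```python
def sampling_deviation(timestamps_s, nominal_hz=2000.0):
    """Relative deviation of the effective sampling rate from the nominal one."""
    if len(timestamps_s) < 2:
        raise ValueError("need at least two timestamps")
    span = timestamps_s[-1] - timestamps_s[0]
    effective_hz = (len(timestamps_s) - 1) / span
    return abs(effective_hz - nominal_hz) / nominal_hz

def needs_resampling(timestamps_s, nominal_hz=2000.0, threshold=0.01):
    """Flag a recording whose deviation exceeds the threshold (1% by default)."""
    return sampling_deviation(timestamps_s, nominal_hz) > threshold

regular = [i / 2000.0 for i in range(2001)]   # exactly 2000 Hz
slow = [i / 1900.0 for i in range(2001)]      # about 5% below nominal
print(needs_resampling(regular), needs_resampling(slow))
```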

Signal Characteristics

Typical features:

  • Muscle activation precedes the keystroke by tens of milliseconds

  • Different muscles activate for different fingers

  • “Co-articulation” effects: sEMG affected by adjacent keystrokes

  • Bigram/trigram context important for fast typists

Receptive field: Models typically need ~1 second context
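At 2000 Hz, a context of about 1 second corresponds to roughly 2000 samples per channel. The sketch below computes start indices for fixed-length windows of that size; the 0.5 s hop is an arbitrary choice for illustration and is not taken from the paper.

```python
def window_starts(n_samples, sfreq=2000, window_s=1.0, hop_s=0.5):
    """Start indices of fixed-length windows that fit fully inside the signal."""
    win = int(round(window_s * sfreq))
    hop = int(round(hop_s * sfreq))
    if win <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    return list(range(0, n_samples - win + 1, hop))

# Ten seconds of signal, 1 s windows, 0.5 s hop (the hop is an arbitrary choice).
starts = window_starts(n_samples=20_000)
print(len(starts), starts[:3], starts[-1])
```

Each start index `s` then selects samples `s` to `s + 2000` across all 32 channels.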

Baseline Performance

Published Results (Sivakumar et al., 2024)

Generic Model (100 training users):

  • Validation CER: 52.10 ± 5.54% (with 6-gram LM)

  • Test CER: 51.78 ± 4.61% (with 6-gram LM)

  • Interpretation: Unusable without personalization

Personalized Model (finetuned from generic):

  • Validation CER: 8.31 ± 3.19% (with 6-gram LM)

  • Test CER: 6.95 ± 3.61% (with 6-gram LM)

  • Best user: 3.16% CER

  • Usability threshold: ~10% CER
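CER (character error rate) is conventionally the Levenshtein edit distance between the decoded text and the reference, divided by the reference length. The sketch below implements that conventional definition with the standard library; it is not the paper's evaluation code, which may differ in details such as text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,              # deletion
                curr[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),   # substitution or match
            ))
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(f"{cer('the quick brown fox', 'the quikc brown fx'):.2f}%")
```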

Model architecture: Time Depth Separable ConvNets (TDS)
Loss function: Connectionist Temporal Classification (CTC)
Language model: 6-gram modified Kneser-Ney (trained on WikiText-103)
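A CTC-trained model emits one label per frame, including a blank symbol, and decoding collapses that sequence into text. The sketch below shows the simplest variant, greedy collapse (merge repeated labels, then drop blanks). The reported results use a 6-gram language model, so this is not the decoder behind those numbers, and the `_` blank symbol is a placeholder chosen for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real models use a reserved class index

def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)

# A blank between the two "l" runs keeps the double letter.
frames = list("__hh_e_ll_l__oo_")
print(ctc_greedy_collapse(frames))
```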

Key Findings

  1. Generalization emerges at scale: 100+ users needed for meaningful representations

  2. Personalization essential: Generic model alone has >50% CER

  3. Domain shift is severe: Cross-user variation much larger than cross-session

  4. No obvious user clusters: Every user requires individual adaptation

Data Splits

Benchmark Setup (from paper)

Training set: 100 users (all sessions except 2 validation sessions per user)
Validation set: 2 sessions from each of the 100 training users
Test set: 8 held-out users

  • Each test user: Multiple sessions split into train/val/test

  • Used for personalization experiments

Note: This split ensures test users don’t influence generic model hyperparameters
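The split logic can be sketched as follows: held-out users go entirely to the test pool, and every other user contributes two sessions to validation and the rest to training. Which users and which sessions the paper actually selected is not stated here, so the choice of the last two sessions for validation and all identifiers below are assumptions for illustration.

```python
def benchmark_split(sessions_by_user, test_users, n_val_per_user=2):
    """Split {user: [session, ...]} into train/val dicts and a held-out test dict."""
    train, val, test = {}, {}, {}
    for user, sessions in sessions_by_user.items():
        if user in test_users:
            test[user] = list(sessions)
            continue
        if len(sessions) <= n_val_per_user:
            raise ValueError(f"user {user!r} has too few sessions for validation")
        # Assumption: the last sessions are used for validation.
        train[user] = list(sessions[:-n_val_per_user])
        val[user] = list(sessions[-n_val_per_user:])
    return train, val, test

# Synthetic identifiers for illustration only.
data = {
    "userA": ["s1", "s2", "s3", "s4"],
    "userB": ["s1", "s2", "s3"],
    "userC": ["s1", "s2", "s3", "s4", "s5"],
}
train, val, test = benchmark_split(data, test_users={"userC"})
print(train, val, test, sep="\n")
```

Sessions of each test user would then be split again into train/val/test for the personalization experiments.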

Use Cases

Machine Learning

  • Sequence-to-sequence learning: Similar to ASR but with different generative process

  • Domain adaptation: Cross-user, cross-session generalization

  • Transfer learning: Generic models with user-specific fine-tuning

  • Few-shot learning: Data-efficient personalization

  • Language modeling: Backspace-aware beam search decoding
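Backspace makes the keystroke stream differ from the final text, which is why decoding needs to be backspace-aware. The sketch below replays a keystroke sequence to recover the text left on screen; the `"\b"` backspace token is a placeholder, and the label used for backspace in the dataset's events may differ.

```python
BACKSPACE = "\b"  # hypothetical token; the dataset's own key label may differ

def apply_backspaces(keys, backspace=BACKSPACE):
    """Replay a keystroke sequence and return the text left on screen."""
    text = []
    for key in keys:
        if key == backspace:
            if text:
                text.pop()   # backspace on empty text is a no-op
        else:
            text.append(key)
    return "".join(text)

typed = ["h", "e", "k", BACKSPACE, "l", "l", "o"]
print(apply_backspaces(typed))
```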

Neuroscience

  • Motor control: Understand muscle coordination during fine motor tasks

  • Motor learning: Track typing skill changes across sessions

  • Neuromuscular variability: Study individual differences in muscle recruitment

Applications

  • Keyboard-free typing: Text entry without physical keyboard

  • AR/VR interfaces: Text input for head-mounted displays

  • Silent communication: Private text entry in public spaces

  • Accessibility: Alternative input for users with limited mobility

Known Issues and Limitations

By Design

  • Touch typing required: Not representative of hunt-and-peck typists

  • English only: Language-specific

  • Physical keyboard: Not actual keyboard-free typing

  • Typing style variation: Individual strategies differ (especially non-fluent typists)

  • No demographic data: Age, sex, handedness not collected

Technical

  • Domain shift: Large variations across users and sessions

  • Signal amplitude: Varies with typing force (not normalized)

  • Backspace handling: More complex than speech (can modify history)

  • Hardware unavailable: sEMG-RD not commercially available

Data Quality

  • Irregular sampling: Some sessions required resampling (up to 9290% deviation detected)

  • Electrode placement: Intentionally varies across sessions (creates realistic challenge)

  • Session length: Varies by typing speed (9.5-47.5 min)

Access and Contact

Original data: facebookresearch/emg2qwerty
BIDS conversion: Custom MATLAB tools using EEGLAB BIDS plugin
Data curator: Yahya Shirazi, SCCN, INC, UCSD
Contact: See original publication for corresponding author

License

CC-BY-NC-SA 4.0 (Attribution, Non-Commercial, Share Alike)

Citation

Sivakumar, V., Seely, J., Du, A., Bittner, S.R., Berenzweig, A.,
Bolarinwa, A., Gramfort, A., & Mandel, M.I. (2024).
emg2qwerty: A Large Dataset with Baselines for Touch Typing using Surface Electromyography.
arXiv:2410.20081. https://github.com/facebookresearch/emg2qwerty

Data Curator

Yahya Shirazi
SCCN (Swartz Center for Computational Neuroscience)
INC (Institute for Neural Computation)
University of California San Diego

Version History

v1.0 (2025-10-01): Initial BIDS conversion

BIDS Version: 1.11 | EMG-BIDS: BEP-042 | Updated: Oct 1, 2025

Dataset Information#

Dataset ID

NM000104

Title

emg2qwerty: Touch Typing from Surface Electromyography

Year

2025

Authors

Viswanath Sivakumar, Jeffrey Seely, Alan Du, Sean R. Bittner, Adam Berenzweig, Anuoluwapo Bolarinwa, Alexandre Gramfort, Michael I. Mandel

License

CC-BY-NC-SA-4.0

Citation / DOI

10.5281/zenodo.17287903

Source links

OpenNeuro | NeMAR | Source URL



Technical Details#

Subjects & recordings
  • Subjects: 108

  • Recordings: 1135

  • Tasks: 1

Channels & sampling rate
  • Channels: 32

  • Sampling rate (Hz): 2000.0

  • Duration (hours): 346.4 (total stated in the README; the summary metadata field reports 0.0)

Tags
  • Pathology: Healthy

  • Modality: Visual

  • Type: Motor

Files & format
  • Size on disk: 290.5 MB

  • File count: 1135

  • Format: BIDS

License & citation
  • License: CC-BY-NC-SA-4.0

  • DOI: 10.5281/zenodo.17287903

API Reference#

Use the NM000104 class to access this dataset programmatically.

class eegdash.dataset.NM000104(cache_dir: str, query: dict | None = None, s3_bucket: str | None = None, **kwargs)

Bases: EEGDashDataset

OpenNeuro dataset nm000104. Modality: emg; Experiment type: Motor; Subject type: Healthy. Subjects: 108; recordings: 1135; tasks: 1.

Parameters:
  • cache_dir (str | Path) – Directory where data are cached locally.

  • query (dict | None) – Additional MongoDB-style filters to AND with the dataset selection. Must not contain the key dataset.

  • s3_bucket (str | None) – Base S3 bucket used to locate the data.

  • **kwargs (dict) – Additional keyword arguments forwarded to EEGDashDataset.

data_dir#

Local dataset cache directory (cache_dir / dataset_id).

Type:

Path

query#

Merged query with the dataset filter applied.

Type:

dict

records#

Metadata records used to build the dataset, if pre-fetched.

Type:

list[dict] | None

Notes

Each item is a recording; recording-level metadata are available via dataset.description. query supports MongoDB-style filters on fields in ALLOWED_QUERY_FIELDS and is combined with the dataset filter. Dataset-specific caveats are not provided in the summary metadata.
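The way a user query is combined with the dataset filter can be illustrated with a plain dictionary merge. The helper below is hypothetical and is not EEGDash's implementation; it only shows the two stated rules: the user query must not contain the key dataset, and the remaining filters are ANDed with the dataset selection (keys in one MongoDB-style filter document are implicitly ANDed).

```python
def merge_query(dataset_id, query=None):
    """AND a user query with a fixed dataset filter (hypothetical helper)."""
    query = dict(query or {})
    if "dataset" in query:
        raise ValueError("query must not contain the key 'dataset'")
    # Keys in one MongoDB-style filter document are implicitly ANDed.
    return {"dataset": dataset_id, **query}

merged = merge_query("nm000104", {"subject": {"$in": ["01", "02"]}})
print(merged)
```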

References

OpenNeuro dataset: https://openneuro.org/datasets/nm000104
NeMAR dataset: https://nemar.org/dataexplorer/detail?dataset_id=nm000104

Examples

>>> from eegdash.dataset import NM000104
>>> dataset = NM000104(cache_dir="./data")
>>> recording = dataset[0]
>>> raw = recording.load()

__init__(cache_dir: str, query: dict | None = None, s3_bucket: str | None = None, **kwargs)

save(path, overwrite=False)

Save the dataset to disk.

Parameters:
  • path (str or Path) – Destination file path.

  • overwrite (bool, default False) – If True, overwrite existing file.

Return type:

None
