DS004718#

Le Petit Prince Hong Kong: Naturalistic fMRI and EEG dataset from older Cantonese speakers

Access recordings and metadata through EEGDash.

Citation: Mohammad Momenian, Zhengwu Ma, Shuyi Wu, Chengcheng Wang, Jixing Li (2023). Le Petit Prince Hong Kong: Naturalistic fMRI and EEG dataset from older Cantonese speakers. 10.18112/openneuro.ds004718.v1.1.2

Modality: eeg Subjects: 52 Recordings: 758 License: CC0 Source: openneuro Citations: 1

Metadata: Complete (100%)

Quickstart#

Install

pip install eegdash

Access the data

from eegdash.dataset import DS004718

dataset = DS004718(cache_dir="./data")
# Get the raw object of the first recording
raw = dataset.datasets[0].raw
print(raw.info)

Filter by subject

dataset = DS004718(cache_dir="./data", subject="01")

Advanced query

dataset = DS004718(
    cache_dir="./data",
    query={"subject": {"$in": ["01", "02"]}},
)

Iterate recordings

for rec in dataset:
    print(rec.subject, rec.raw.info['sfreq'])

If you use this dataset in your research, please cite the original authors.

BibTeX

@dataset{ds004718,
  title = {Le Petit Prince Hong Kong: Naturalistic fMRI and EEG dataset from older Cantonese speakers},
  author = {Mohammad Momenian and Zhengwu Ma and Shuyi Wu and Chengcheng Wang and Jixing Li},
  doi = {10.18112/openneuro.ds004718.v1.1.2},
  url = {https://doi.org/10.18112/openneuro.ds004718.v1.1.2},
}

About This Dataset#

Update note

Because the auditory stimuli were presented sentence by sentence, we include the original audio files rather than a single continuous file. The story was presented in 4 sections, each followed by 5 comprehension-check questions. The file “lppHK_timing_word_information.xlsx” contains the timing information for each section of the story; each column is documented within the file itself. A second file, “EEG_trigger_and_sentence_number.xlsx”, describes how to match sentence IDs to trigger numbers in the EEG data. Both files are useful for EEG data analysis. For timing in fMRI analysis, we include the E-Prime output files, which contain all the information needed for alignment. Because E-Prime typically introduces a small delay when presenting audio files, accounting for this delay in the analysis can improve alignment.

Per OpenNeuro’s new formatting requirements, all annotation, quiz, and stimuli files are located in the “sourcedata” folder.

Overview

In the field of neurobiology of language, existing research predominantly focuses on data from a limited number of Indo-European languages and primarily involves younger adults, overlooking other age groups. This experiment aims to address these gaps by creating a comprehensive multimodal database. The primary goal is to advance our understanding of language processing in older adults and the impact of healthy aging on brain-behavior relationships.

The experiment involves collecting task-based and resting-state fMRI, structural MRI, and EEG data from 52 healthy right-handed older Cantonese participants over 65 years old as they listen to excerpts from “The Little Prince” in Cantonese. Additionally, the database includes detailed information on participants’ language history, lifetime experiences, linguistic and cognitive skills, as well as extensive audio and text annotations, such as time-aligned speech segmentation and prosodic features, along with word-by-word predictors from natural language processing (NLP) tools. Quality diagnostics of the MRI and EEG data confirm their robustness, positioning this database as a valuable resource for studying the spatiotemporal dynamics of language comprehension in older adults.

Methods

Participants

We recruited 52 healthy, right-handed older Cantonese participants (40 females, mean age = 69.12, SD = 3.52) from Hong Kong for the experiment, which consisted of an fMRI and an EEG session. In both sessions, participants listened to the same sections of The Little Prince in Cantonese for approximately 20 minutes. We verified that each participant was right-handed and a native Cantonese speaker using the Language History Questionnaire (LHQ3). Participants reported normal or corrected-to-normal hearing and confirmed they had no cognitive decline. Two participants did not take part in the fMRI session, and a further 4 participants’ fMRI data were removed due to excessive head movement, leaving 46 participants (39 females, mean age = 69.08 yrs, SD = 3.58) for the fMRI session and 52 participants (40 females, mean age = 69.12 yrs, SD = 3.52) for the EEG session. All participants provided written informed consent prior to the experiment and received monetary compensation after each session. Ethical approval was obtained from the Human Subjects Ethics Application Committee at the Hong Kong Polytechnic University (application number HSEARS20210302001). The study was performed in accordance with the Declaration of Helsinki and all other regulations set by the Ethics Committee.

Experiment Procedures

The study consisted of an fMRI session and an EEG session. The order of the EEG and fMRI sessions was counterbalanced across all participants, and a minimum two-week interval was maintained between sessions.

fMRI experiment

Before the scanning day, an MRI safety screening form was sent to the participants to make sure MRI scanning was safe for them. We also sent them simple readings and videos about MRI scanning so that they could form an idea of what it would be like to be in a scanner. On the day of scanning, participants were introduced to the MRI facility and comfortably positioned inside the scanner, with their heads securely supported by padding. MRI-safe headphones (Sinorad package) were provided for participants to wear inside the head coil. The audio volume for the listening task was adjusted to ensure audibility for each participant. A mirror attached to the head coil allowed participants to view the stimuli presented on a screen. Participants were instructed to stay focused on the visual fixation sign while listening to the audiobook. The scanning session commenced with the acquisition of structural (T1-weighted) scans. Participants then performed the listening task concurrently with fMRI scanning. The task-based fMRI experiment was divided into four runs, each corresponding to a section of the audiobook. Comprehension was assessed after each run by 5 yes/no questions (20 questions in total) on the content participants had listened to. These questions were presented on the screen, with participants indicating their answers by pressing a button. The session concluded with the collection of resting-state fMRI data.

Cognitive tasks

Four cognitive tasks were selected to assess participants’ cognitive abilities in various domains, including the forward digit span task, picture naming task, verbal fluency task, and Flanker task. These tasks were delivered after the fMRI session in a separate soundproof booth.

EEG experiment

During the EEG experiment, participants were seated comfortably in a quiet room, and standard procedures were followed for electrode placement and EEG cap preparation. Participants were instructed to focus on a fixation sign displayed on a monitor while the EEG was recorded and the audiobook played. Before the recording, the audio volume was adapted to each participant’s hearing ability using a different set of stimuli. We used foam ear inserts (medium, 14 mm). As in the fMRI experiment, participants listened to four sections of the audiobook, each lasting approximately 5 minutes. After each run, participants answered 5 yes/no questions (20 questions in total across runs), indicating their answers by pressing a button. The EEG recording ran continuously across all four runs until their completion.

Questionnaires.

We administered the LHQ3 and the Lifetime of Experiences Questionnaire (LEQ) during EEG cap preparation. Participants did not need to move or fill in these questionnaires themselves; a research assistant asked the questions one by one in Cantonese and entered the responses in an online Google form. The LHQ documents language history by producing aggregate scores for language proficiency, exposure, and dominance in all the languages spoken by a participant. The LEQ documents the activities (e.g., sports, music, education, profession) participants have engaged in over their lifetime. It measures lifetime experiences in three periods: ages 13 to 30 (young adulthood), 30 to 65 (midlife), and after 65 (late life), and produces a total score (see participants.tsv) that serves as an indicator of cognitive activity. Together, these two questionnaires gave us a richer description of our participants’ linguistic, social, and cognitive experiences.

Acquisition

The MRI data were collected at the University Research Facility in Behavioral and Systems Neuroscience (UBSN) at The Hong Kong Polytechnic University. EEG data were collected at the Speech and Language Sciences Laboratory within the Department of Chinese and Bilingual Studies at the same university. Data acquisition for this project started in July 2021 and ended in December 2022.

fMRI data.

MRI data were acquired using a 3T Siemens MAGNETOM Prisma scanner with a 20-channel coil. Structural MRI was acquired for each participant using a T1-weighted sequence with the following parameters: repetition time (TR) = 2,500 ms, echo time (TE) = 2.22 ms, inversion time (TI) = 1,120 ms, flip angle (FA) = 8°, field of view (FOV) = 240 × 256 × 167 mm, resolution = 0.8 mm isotropic, acquisition time = 4 min 32 s. The acquisition parameters for echo planar T2-weighted imaging (EPI) were as follows: 60 oblique axial slices, TR = 2000 ms, TE = 22 ms, FA = 80°, FOV = 204 × 204 × 165 mm, resolution = 2.5 mm isotropic, and acceleration factor 3. E-Prime 2.0 (Psychology Software Tools) was used to present the stimuli.

EEG data.

A gel-based 64-channel Neuroscan system with electrodes arranged on a 10-20 template was used for data acquisition, at a sampling rate of 1000 Hz. Triggers were set at the beginning of each sentence to mark sentence onsets. STIM2 software (Compumedics Neuroscan) was used for stimulus presentation.

Stimuli

The experimental stimuli used in both the EEG and fMRI sessions consisted of approximately 20 minutes of The Little Prince as a Cantonese audiobook, translated and narrated by a native male speaker. The stimuli comprise 4,473 words and 535 sentences in total. To facilitate data analysis and participant engagement, the stimuli were segmented into four sections, each spanning nearly five minutes. To assess listening comprehension, participants answered five yes/no questions after each section, for a total of 20 questions across the experiment. To ensure the narration speed was appropriate, we asked several people who were not participants in this study to judge the speed and comprehensibility; all reported that the speed was normal, neither too slow nor too fast.

Annotation

We present audio and text annotations, including time-aligned speech segmentation and prosodic information, as well as word-by-word predictors derived from natural language processing (NLP) tools. These predictors include aspects of lexical semantic information, such as part-of-speech (POS) tagging and word frequency.

Prosodic information.

We extracted the root mean square (RMS) intensity and the fundamental frequency (f0) from every 10 ms interval of the audio segments using the Voicebox toolbox (http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html). The peak RMS intensity and peak f0 of each word in the naturalistic stimuli were used to represent that word’s intensity and pitch.
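The per-frame RMS computation described above is easy to reproduce; here is a NumPy sketch (the function name is ours, and f0 extraction itself requires a pitch tracker such as the Voicebox routines, which this sketch does not attempt):

```python
import numpy as np

def frame_rms(signal, sr, frame_ms=10):
    """RMS intensity per non-overlapping frame of `frame_ms` milliseconds."""
    hop = int(sr * frame_ms / 1000)
    n_frames = len(signal) // hop
    frames = signal[: n_frames * hop].reshape(n_frames, hop)
    return np.sqrt((frames ** 2).mean(axis=1))

# Toy 1 s, 220 Hz tone at 16 kHz standing in for one word of audio.
sr = 16_000
t = np.arange(sr) / sr
rms = frame_rms(np.sin(2 * np.pi * 220 * t), sr)
peak_rms = rms.max()  # the per-word peak used as the intensity predictor
```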

Word frequency.

Word segmentation was performed manually by two native Cantonese speakers. The log-transformed frequency of each word was estimated using PyCantonese, version 3.4.0 (https://pycantonese.org/). The built-in corpus in PyCantonese is the Hong Kong Cantonese Corpus (HKCancor), collected from transcribed conversations between March 1997 and August 1998.
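The log-frequency transform itself can be mirrored with the standard library; the toy token list below stands in for the HKCancor tokens that PyCantonese actually supplies:

```python
import math
from collections import Counter

# Toy Cantonese tokens standing in for the HKCancor corpus contents.
tokens = ["我", "你", "我", "係", "我", "你"]
counts = Counter(tokens)

# Log-transformed raw frequency, as described for the word predictors.
log_freq = {word: math.log(count) for word, count in counts.items()}
```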

Part-of-speech tagging.

Part-of-speech (POS) tagging for each word in the stimuli was performed using PyCantonese, version 3.4.0 (https://pycantonese.org/). Following manual word segmentation, we input the segmented words into PyCantonese, a Cantonese-specific NLP tool, which provided POS tags for each word according to the Universal Dependencies v2 tagset (UDv2).

Preprocessing

All MRI data were preprocessed using the NeuroScholar cloud platform (http://www.humanbrain.cn, Beijing Intelligent Brain Cloud, Inc.), provided by The Hong Kong Polytechnic University. This platform uses an enhanced pipeline based on fMRIPrep 20.2.6 (RRID: SCR_016216), supported by Nipype 1.7.0 (RRID: SCR_002502). We then used the pydeface package (poldracklab/pydeface) to remove face voxels from both the anatomical and preprocessed data, anonymizing participants’ facial information.

Anatomical MRI.

The structural MRI data underwent intensity non-uniformity correction, skull-stripping, and brain tissue segmentation into cerebrospinal fluid (CSF), white matter (WM), and gray matter (GM) based on the reference T1w image. The resulting anatomical images were nonlinearly aligned to the ICBM 152 Nonlinear Asymmetrical template, version 2009c (MNI152NLin2009cAsym). Radiological reviews of the MRI images were performed by a medical specialist in the lab. Incidental findings were noted for participants sub-HK031 and sub-HK049: sub-HK031 had a sub-centimeter (0.7 cm) blooming artefact in the right putamen, likely a cavernoma, while sub-HK049 had a left thalamic (0.7 cm) oval-shaped susceptibility artefact and a 2.6 cm cystic collection in the right posterior fossa.

Functional MRI.

The preprocessing of both resting-state and task-based fMRI data included the following steps: (1) skull-stripping, (2) slice-timing correction with temporal realignment of slices to the reference slice, (3) co-registration of the BOLD time series to the T1w reference image, (4) head-motion estimation and spatial realignment to adjust for linear head motion, (5) application of the parameters from the structural images to spatially normalize the functional images into the Montreal Neurological Institute (MNI) template, and (6) smoothing with a 6 mm FWHM (full-width at half-maximum) Gaussian kernel.
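The 6 mm FWHM of the smoothing kernel in step (6) maps to a Gaussian standard deviation via FWHM = 2·sqrt(2·ln 2)·σ; a quick sketch of the conversion (the helper name is ours, not part of any pipeline):

```python
import math

def fwhm_to_sigma(fwhm_mm: float) -> float:
    # For a Gaussian kernel, FWHM = 2 * sqrt(2 * ln 2) * sigma (~2.355 * sigma).
    return fwhm_mm / (2 * math.sqrt(2 * math.log(2)))

sigma = fwhm_to_sigma(6.0)  # sigma behind the 6 mm FWHM kernel, ~2.55 mm
```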

EEG.

Pre-processing was carried out using EEGLAB and in-house MATLAB functions and included the following steps: (1) a band-pass filter with a 1 Hz high-pass and 40 Hz low-pass cut-off, followed by a notch filter at 50 Hz to reduce electrical line noise; (2) identification and removal of bad channels using a kurtosis measure; (3) application of the RUNICA algorithm (EEGLAB toolbox, 2023 version), which evaluates ICA-derived components, for automated rejection of artifacts, including eye and muscle noise, high-amplitude artifacts (e.g., blinks), and signal discontinuities (e.g., electrodes losing contact with the scalp); (4) interpolation of bad channels using spherical splines for each segment; (5) re-referencing of all channels to electrodes M1 and M2; and (6) down-sampling of all data to 250 Hz.

Dataset Structure

Participant responses

  1. Location: participants.json, participants.tsv

  2. File format: tab-separated value

  3. Participants’ sex, age, quiz-question accuracy for the fMRI and EEG experiments, scan number, and LEQ scores, in tab-separated value (tsv) files. Data are structured as one line per participant.
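Being tab-separated, participants.tsv reads cleanly with the standard library's csv module; the column names and values below are illustrative stand-ins, not the exact headers of the released file:

```python
import csv
import io

# Inline stand-in for participants.tsv; in practice, open the real file.
tsv_text = (
    "participant_id\tsex\tage\n"
    "sub-HK001\tF\t68\n"
    "sub-HK002\tM\t72\n"
)
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
ages = [int(row["age"]) for row in rows]  # one row per participant
```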

Audio files

  1. Location: sourcedata/stimuli/task-lppHK_run-1[2-4].wav

  2. File format: wav

  3. The 4-section audiobook from The Little Prince in Cantonese

Anatomical data files

  1. Location: sub-HK<ID>/anat/sub-HK<ID>_T1w.nii.gz

  2. File format: NIfTI, gzip-compressed

  3. The raw high-resolution anatomical image after defacing

Functional data files

  1. Location: sub-HK<ID>/func/sub-HK<ID>_task-lppHK_run-1[2–4]_bold.nii.gz

  2. File format: NIfTI, gzip-compressed.

  3. Sequence protocol: sub-HK<ID>/func/sub-HK<ID>_task-lppHK_run-1[2–4]_bold.json.

  4. The preprocessed data are also available as: derivatives/sub-HK<ID>/func/sub-HK<ID>_task-lppHK_run-1[2-4]_desc-preprocessed_bold.nii.gz

Resting-state MRI data files

  1. Location: sub-HK<ID>/func/sub-HK<ID>_task-rest_bold.nii.gz

  2. File format: NIfTI, gzip-compressed

  3. Sequence protocol: sub-HK<ID>/func/sub-HK<ID>_task-rest_bold.json.

  4. The preprocessed data are also available as: derivatives/sub-HK<ID>/func/sub-HK<ID>_rest_bold.nii.gz

EEG data files

  1. Location: sub-HK<ID>/eeg/sub-HK<ID>_task-lppHK_eeg.set

  2. File format: EEGLAB .set (a MATLAB-format file, accompanied by a .fdt file containing the raw data)

  3. The preprocessed data are also available as: derivatives/sub-HK<ID>/eeg/sub-HK<ID>_task-lppHK_eeg.set (together with a .fdt file containing the raw data)

Annotations

  1. Location: annotation/snts.txt, annotation/lppHK_word_information.txt, annotation/wav_acoustic.csv

  2. File format: comma-separated value

  3. Annotation of speech and linguistic features for the audio and text of the stimuli

Quiz questions

  1. Location: quiz/lppHK_quiz_questions.csv

  2. File format: comma-separated value

  3. The 20 yes/no quiz questions employed in both the fMRI and EEG experiments

Usage Note

For more details about the dataset, please refer to our paper “Le Petit Prince Hong Kong (LPPHK): Naturalistic fMRI and EEG Data from Older Cantonese Speakers”, https://doi.org/10.1101/2024.04.24.590842

This dataset is still under maintenance.

Contact

For any questions regarding this dataset, please contact:

  1. Dr. Mohammad Momenian, momenian@hku.hk

  2. Ms. Zhengwu Ma, zhengwu.ma@my.cityu.edu.hk

  3. Ms. Shuyi Wu, shuyiwu2017@gmail.com

  4. Ms. Chengcheng Wang, cwang495-c@my.cityu.edu.hk

  5. Dr. Jixing Li, jixingli@cityu.edu.hk

Dataset Information#

Dataset ID

DS004718

Title

Le Petit Prince Hong Kong: Naturalistic fMRI and EEG dataset from older Cantonese speakers

Year

2023

Authors

Mohammad Momenian, Zhengwu Ma, Shuyi Wu, Chengcheng Wang, Jixing Li

License

CC0

Citation / DOI

doi:10.18112/openneuro.ds004718.v1.1.2

Source links

OpenNeuro | NeMAR | Source URL

Copy-paste BibTeX
@dataset{ds004718,
  title = {Le Petit Prince Hong Kong: Naturalistic fMRI and EEG dataset from older Cantonese speakers},
  author = {Mohammad Momenian and Zhengwu Ma and Shuyi Wu and Chengcheng Wang and Jixing Li},
  doi = {10.18112/openneuro.ds004718.v1.1.2},
  url = {https://doi.org/10.18112/openneuro.ds004718.v1.1.2},
}

Found an issue with this dataset?

If you encounter any problems with this dataset (missing files, incorrect metadata, loading errors, etc.), please let us know!

Report an Issue on GitHub

Technical Details#

Subjects & recordings
  • Subjects: 52

  • Recordings: 758

  • Tasks: 2

Channels & sampling rate
  • Channels: 64

  • Sampling rate (Hz): 1000.0

  • Duration (hours): 0.0

Tags
  • Pathology: Not specified

  • Modality: —

  • Type: —

Files & format
  • Size on disk: 37.0 GB

  • File count: 758

  • Format: BIDS

License & citation
  • License: CC0

  • DOI: doi:10.18112/openneuro.ds004718.v1.1.2

Provenance

API Reference#

Use the DS004718 class to access this dataset programmatically.

class eegdash.dataset.DS004718(cache_dir: str, query: dict | None = None, s3_bucket: str | None = None, **kwargs)[source]#

Bases: EEGDashDataset

OpenNeuro dataset ds004718. Modality: eeg; Experiment type: Learning; Subject type: Healthy. Subjects: 51; recordings: 51; tasks: 1.

Parameters:
  • cache_dir (str | Path) – Directory where data are cached locally.

  • query (dict | None) – Additional MongoDB-style filters to AND with the dataset selection. Must not contain the key dataset.

  • s3_bucket (str | None) – Base S3 bucket used to locate the data.

  • **kwargs (dict) – Additional keyword arguments forwarded to EEGDashDataset.

data_dir#

Local dataset cache directory (cache_dir / dataset_id).

Type:

Path

query#

Merged query with the dataset filter applied.

Type:

dict

records#

Metadata records used to build the dataset, if pre-fetched.

Type:

list[dict] | None

Notes

Each item is a recording; recording-level metadata are available via dataset.description. query supports MongoDB-style filters on fields in ALLOWED_QUERY_FIELDS and is combined with the dataset filter. Dataset-specific caveats are not provided in the summary metadata.

References

OpenNeuro dataset: https://openneuro.org/datasets/ds004718

NeMAR dataset: https://nemar.org/dataexplorer/detail?dataset_id=ds004718

Examples

>>> from eegdash.dataset import DS004718
>>> dataset = DS004718(cache_dir="./data")
>>> recording = dataset[0]
>>> raw = recording.load()
__init__(cache_dir: str, query: dict | None = None, s3_bucket: str | None = None, **kwargs)[source]#
save(path, overwrite=False)[source]#

Save the dataset to disk.

Parameters:
  • path (str or Path) – Destination file path.

  • overwrite (bool, default False) – If True, overwrite existing file.

Return type:

None

See Also#