The HRI ASR Challenge provides an evaluation database that consists of 12 hours of multichannel evaluation data recorded in a real HRI scenario. The database was recorded using a PR2 robot that performed different translational and azimuthal movements.


1. Evaluation Database

The evaluation database consists of 16 test sets that are re-recorded versions of the clean test set from the Aurora-4 database. The test sets were recorded in a meeting room under different relative movements between the speech source and a PR2 robot. Recordings were made with the PR2’s Kinect sensor, which has a four-microphone array, yielding a multichannel test set.

To obtain the test files, please write an e-mail to:
nbecerra@ing.uchile.cl

You will be asked to submit proof that you have obtained the right to use the WSJ database from LDC (ref. CSR-I (WSJ0), LDC93S6). This may be a signed license agreement or an LDC invoice.

If you use this evaluation data in your research, please cite it as:
J. Novoa, J. Wuth, J. P. Escudero, J. Fredes, R. Mahu and N. Becerra Yoma, “Multichannel Robot Speech Recognition Database: MChRSR,” ArXiv e-prints: 1801.00061, 2017.


2. Training Database

A first baseline ASR system that uses the Evaluation Database was presented at the HRI 2018 conference.
In this section you can download the impulse responses (IRs), noises and scripts to replicate the baseline experiment.

2.1. CLEAN DATABASE

We used the clean Aurora-4 training database, which in turn is based on the November 1992 ARPA Continuous Speech Recognition Corpora. Specifically, the ARPA database is the WSJ0 SI 84, also called SI-short Sennheiser training. The clean database consists of 7138 utterances (15.2 hours) from 83 native English speakers and contains only data recorded with a high-quality microphone: the Sennheiser HMD-414.

2.2. ENVIRONMENT-BASED DATABASE

To generate the Environment-based Training (EbT) set, we changed the clean training database as follows:

  • 25% of the utterances were convolved with an IR estimated when the robot-source distance was 1 m and the angle between the robot head and the source was 0°.
  • The remaining utterances were convolved with 32 IRs that model the time-varying acoustic channel caused by the robot’s movements. In addition, recorded noise from the robot’s motors was added using the Filtering and Noise Adding Tool (FaNT) at SNRs between 10 and 20 dB.
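
The two operations described above (convolution with an IR, then adding noise at a target SNR) can be illustrated with a minimal sketch. This is not FaNT and not the baseline recipe: the signals are toy sequences, the SNR is assumed to be drawn uniformly from the 10 to 20 dB range, and the noise gain is computed from plain mean power, so FaNT's own level measurement and filtering options are not reproduced. All function and variable names are hypothetical.

```python
import math
import random


def convolve(signal, ir):
    """Full linear convolution of two sequences (direct form)."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(ir):
            out[i + j] += s * h
    return out


def power(x):
    """Mean power of a sequence."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio is `snr_db`, then mix.

    Returns the noisy signal and the gain applied to the noise.
    """
    noise = noise[:len(speech)]
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)], gain


rng = random.Random(0)

# Toy stand-ins (assumptions): a sine "utterance", a short IR, Gaussian "motor noise".
clean = [math.sin(2 * math.pi * 0.05 * n) for n in range(400)]
ir = [1.0, 0.0, 0.5, 0.0, 0.25]
noise = [rng.gauss(0.0, 1.0) for _ in range(len(clean) + len(ir) - 1)]

reverberant = convolve(clean, ir)
snr_db = rng.uniform(10.0, 20.0)  # assumed uniform draw within 10-20 dB
noisy, gain = add_noise_at_snr(reverberant, noise, snr_db)

# Verify that the scaled noise yields the requested SNR.
achieved = 10 * math.log10(power(reverberant) / power([gain * n for n in noise]))
print(f"target SNR = {snr_db:.2f} dB, achieved SNR = {achieved:.2f} dB")
```

For real audio, an FFT-based convolution from a numerical library would be the practical choice; the direct form is used here only to keep the sketch free of dependencies.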

2.3. SCRIPTS AND RECIPES

2.4. IMPULSE RESPONSES
A total of 33 IRs were computed with the robot placed at three fixed positions. For each robot position, the head was oriented at 11 different angles with respect to the source, from –150° to 150° in steps of 30°. The impulse responses were estimated using Farina’s sine sweep method.
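
As a quick check of the arithmetic, the measurement grid can be enumerated as follows. The position labels are hypothetical placeholders, since the actual robot positions are not given in this section.

```python
# Head angles from -150 to 150 degrees in steps of 30 degrees.
angles_deg = list(range(-150, 151, 30))

# Hypothetical labels for the three fixed robot positions.
positions = ["P1", "P2", "P3"]

# One IR per (position, head angle) pair.
ir_grid = [(p, a) for p in positions for a in angles_deg]

print(len(angles_deg), "angles per position")
print(len(ir_grid), "IRs in total")
```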

 

2.5. PR2 ROBOT’S NOISE
The robot’s noise was recorded with the Kinect microphone array under 16 robot movement conditions, as listed in the following table:

               ω=0.00 [rad/s]   ω=0.28 [rad/s]   ω=0.42 [rad/s]   ω=0.56 [rad/s]
ν=0.00 [m/s]   PR2_DY0_MH0      PR2_DY0_MH1      PR2_DY0_MH2      PR2_DY0_MH3
ν=0.30 [m/s]   PR2_DY1_MH0      PR2_DY1_MH1      PR2_DY1_MH2      PR2_DY1_MH3
ν=0.45 [m/s]   PR2_DY2_MH0      PR2_DY2_MH1      PR2_DY2_MH2      PR2_DY2_MH3
ν=0.60 [m/s]   PR2_DY3_MH0      PR2_DY3_MH1      PR2_DY3_MH2      PR2_DY3_MH3

DY: This specifies the displacement speed of PR2 when the noise was recorded.
MH: This specifies the angular speed of the PR2’s head when the noise was recorded.
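
The naming scheme in the table can be expressed as a small lookup. The sketch below only reproduces the labels shown above; the helper name `noise_label` is hypothetical, and no file extension or directory layout is assumed.

```python
# Speeds taken from the table above; the list index is the DY or MH number.
DISPLACEMENT_SPEEDS = [0.00, 0.30, 0.45, 0.60]   # ν in m/s  -> DY0..DY3
HEAD_ANGULAR_SPEEDS = [0.00, 0.28, 0.42, 0.56]   # ω in rad/s -> MH0..MH3


def noise_label(v, w):
    """Return the condition label for displacement speed v and head angular speed w."""
    dy = DISPLACEMENT_SPEEDS.index(v)
    mh = HEAD_ANGULAR_SPEEDS.index(w)
    return f"PR2_DY{dy}_MH{mh}"


all_labels = [noise_label(v, w)
              for v in DISPLACEMENT_SPEEDS
              for w in HEAD_ANGULAR_SPEEDS]

print(noise_label(0.45, 0.28))
print(len(all_labels), "conditions")
```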

Link to download the PR2’s noise:

 


3. External Tools

To replicate the paper’s experiments, you need to install the following external tools: