This repository makes publicly available all the data collected by the Let’s Go dialog system and its derivatives. Follow the instructions below to download the dataset. If you encounter problems, please open a GitHub issue and our team will assist you as soon as possible. To get started, go to introduction.
Let’s Go! is a spoken dialog system that can be used by the general public. While there has been success in building spoken dialog systems that can interact well with people, these systems often work only for a limited group of people. The system we are developing for Let’s Go! is designed to work with a much wider population, including groups that typically have trouble interacting with dialog systems, such as non-native English speakers and the elderly. Let’s Go! works in the domain of bus information for Pittsburgh’s Port Authority Transit bus system, providing a telephone-based interface to access bus schedules and route information.
The project page is at http://www.speech.cs.cmu.edu/letsgo/
NEWS: Let’s Go has been integrated into the DialPort project; talk to it here!
There are eight important components of the corpus that you can download and use for research:
The complete Let’s Go dataset includes every session the system recorded from 08/01/2005 to 03/15/2016. Each session includes the WAV file, the log file, and labels automatically generated by the ASR (Sphinx).
The dataset is divided by month. Each month of data has the following directory structure (an example from July 2014):
201407
│
└───20140701
│   │   index.html (the summary sheet for the day)
│   │   other files for index.html
│   │
│   └───000 (a folder for each session)
│   │   │   index.html (session summary)
│   │   │   *.raw (raw speech data)
│   │   │   *.txt (automatic generated labels)
│   │   │   *.log (system logs including ASR results)
│   │   │   other data by system variants
│   │
│   └───001
│   │   │
│   │
│   └─── ...
│
└───20140702
│   ...
│
└───20140731
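The layout above can be walked with ordinary shell globbing. The following is a minimal sketch, not part of the repository: `list_sessions` is a hypothetical helper, and it assumes that day folders are always eight digits (YYYYMMDD) and session folders always three digits, as in the example tree.

```shell
# list_sessions MONTH_DIR
# Print the path of every session folder under an extracted month directory,
# e.g. 201407/20140701/000. Hypothetical helper; the folder-name patterns are
# assumptions taken from the example tree above.
list_sessions() {
    month_dir=$1
    for session in "$month_dir"/[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/[0-9][0-9][0-9]; do
        # When nothing matches, the glob stays unexpanded; skip it.
        if [ -d "$session" ]; then
            printf '%s\n' "$session"
        fi
    done
    return 0
}
```

For example, `list_sessions letsgo_dataset/201407 | wc -l` would count the sessions recorded in July 2014, assuming the month was extracted into `letsgo_dataset`.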
To learn about how to download the complete Let’s Go dataset, please go to Download.
The Spoken Dialog Challenge was an exercise to investigate how different spoken dialog systems perform on the same task. The existing Let’s Go Pittsburgh Bus Information System was used as a task and four teams provided systems that were first tested in controlled conditions with speech researchers as users. The three most stable systems were then deployed to real callers.
The goal of the Spoken Dialog Challenge (SDC) is to investigate how different dialog systems perform on a similar task. It is designed as a regularly recurring challenge. The first one took place in 2010. SDC participants were to provide one or more of three things: a system, a simulated user, and/or an evaluation metric. The task chosen for the first SDC was one that already had a large number of real callers: the CMU Let’s Go Bus Information system (Black et al. 2010 and Black et al. 2011).
To download The Spoken Dialog Challenge data, please use the script
The script will create one directory named letsgo_sdc in your current path and download the Spoken Dialog Challenge data into the new directory. The new directory will also contain the readme description of the dataset and the license.
The Dialog State Tracking Challenge (DSTC) is an ongoing series of research community challenge tasks. Each task released dialog data labeled with dialog state information, such as the user’s desired restaurant search query given all of the dialog history up to the current turn. The challenge is to create a “tracker” that can predict the dialog state for new dialogs. In each challenge, trackers are evaluated using held-out dialog data. (Williams et al. 2013)
DSTC1 used human-computer dialogs in the CMU Let’s Go Bus Information system.
To download DSTC1 data, please go to https://www.microsoft.com/en-us/research/event/dialog-state-tracking-challenge/
An Excel file that describes all significant changes to the system, as well as events such as challenges that occurred.
The log file can be found in the directory
This contains word transcriptions of each dialog from 200810 to 200909. It includes the WAV file id, ASR outputs with confidence, and crowdsourced transcriptions with confidence. (Parent et al. 2010)
The log file can be found in the directory
The Let’s Go Daily Report 2006 - 2016 provides the statistics of the Let’s Go bus information system for each day that it was deployed from 2006 to 2016. It also includes the weekly summaries (see the links in each file).
The emails reside in the project repository under
You can obtain the complete Let’s Go dataset using the shell script we provide in the repository. For example, to get the Let’s Go data recorded from July 2014 to August 2014, simply do:
bash get_letsgo_raw_data.sh 201407 201408
The script will create one directory named letsgo_dataset in your current path and download all the data specified within that time range into the new directory. To uncompress the data, you need to run a simple tar command, e.g. for 201407:
tar xvjf 201407.tar.bz2
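When several months have been downloaded, the same tar command can be applied to each archive in turn. A minimal sketch, assuming the archives are named `YYYYMM.tar.bz2` and sit directly in `letsgo_dataset` as described above; `extract_all` is a hypothetical helper, not part of the repository:

```shell
# extract_all [DATASET_DIR]
# Uncompress every *.tar.bz2 archive found in DATASET_DIR (default:
# letsgo_dataset) into that same directory. Hypothetical helper.
extract_all() {
    dataset_dir=${1:-letsgo_dataset}
    for archive in "$dataset_dir"/*.tar.bz2; do
        # When nothing matches, the glob stays unexpanded; skip it.
        [ -e "$archive" ] || continue
        # x = extract, j = bzip2, f = archive file; -C sets the output directory.
        tar xjf "$archive" -C "$dataset_dir"
    done
}
```

Running `extract_all` with no argument from the directory where the download script was invoked would unpack every month that was fetched.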
Install GNU coreutils (brew install coreutils) to use the script properly. After GNU coreutils is installed, simply change date in the script to
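The month range given to the download script (for example 201407 201408) can also be enumerated with plain shell arithmetic, which works the same regardless of which date implementation is installed. This is only a sketch under that assumption: `months_between` is a hypothetical helper and does not reflect how the repository's script is actually written.

```shell
# months_between START END
# Print every month from START to END inclusive, one YYYYMM value per line.
# Hypothetical helper; uses only POSIX parameter expansion and arithmetic.
months_between() {
    start=$1
    end=$2
    year=${start%??}          # drop the last two characters -> YYYY
    month=${start#????}       # drop the first four characters -> MM
    month=${month#0}          # strip a leading zero so 08/09 are not read as octal
    end_year=${end%??}
    end_month=${end#????}
    end_month=${end_month#0}
    while [ "$year" -lt "$end_year" ] ||
          { [ "$year" -eq "$end_year" ] && [ "$month" -le "$end_month" ]; }; do
        printf '%04d%02d\n' "$year" "$month"
        month=$((month + 1))
        if [ "$month" -gt 12 ]; then
            month=1
            year=$((year + 1))
        fi
    done
}
```

For instance, `months_between 201411 201502` prints 201411, 201412, 201501 and 201502, which are the archive names one would expect to find after downloading that range.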
If you have more questions about the Let’s Go systems and dataset, please contact us:
Yulun Du (Carnegie Mellon University)
Alan W. Black (Carnegie Mellon University)
Maxine Eskenazi (Carnegie Mellon University)
Antoine Raux, Dan Bohus, Brian Langner, Alan W Black, and Maxine Eskenazi. Doing Research on a Deployed Spoken Dialogue System: One Year of Let’s Go! Experience. In Proc. of Interspeech, 2006.
Antoine Raux, Brian Langner, Alan W Black, and Maxine Eskenazi. LET’S GO: Improving Spoken Dialog Systems for the Elderly and Non-Natives. In Proc. of Eurospeech, 2003.
Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. Let’s Go Public! Taking a Spoken Dialog System to the Real World. In Proc. of Interspeech, 2005.
Alan W Black, Susanne Burger, Brian Langner, Gabriel Parent, and Maxine Eskenazi. Spoken Dialog Challenge 2010. In Proc. of SLT, 2010.
Alan W Black, Susanne Burger, Alistair Conkie, Helen Hastie, Simon Keizer, Oliver Lemon, Nicolas Merigaud, Gabriel Parent, Gabriel Schubiner, Blaise Thomson, Jason D Williams, Kai Yu, Steve Young, and Maxine Eskenazi. Spoken Dialog Challenge 2010: Comparison of Live and Control Test Results. In Proc. of SIGDIAL, 2011.
Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. The Dialog State Tracking Challenge. In Proc. of SIGDIAL, 2013.
Gabriel Parent and Maxine Eskenazi. Toward Better Crowdsourced Transcription: Transcription of a Year of the Let’s Go Bus Information System Data. In Proc. of SLT, 2010.
Please download and agree to the license.
If you download and use the Let’s Go data, you agree that you will cite it in all publications resulting from its use.
This work was supported by the US National Science Foundation under grants number 0208835 and 0855058, “LET’S GO: Improved Speech Interfaces For The General Public” and “CI-ADDO-NEW: Dialog Research Center (DialRC)”. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
We would like to thank the following researchers for their contributions to the Let’s Go system and dataset: Antoine Raux, Brian Langner, Dan Bohus, Gabriel Parent, Jim Valenti, Gabriel Schubiner, Sungjin Lee, Yulun Du, Alan Black, Maxine Eskenazi