A group of eleven major universities and 30 top speech science and technology experts from around Australia obtained funding under a Linkage Infrastructure, Equipment and Facilities (LIEF) grant from the Australian Government and a number of sponsors to create the largest-ever auditory-visual database of Australian speech. The project involved building twelve identical stand-alone recording set-ups and shipping these around Australia to collect audio-visual speech in each capital city and in various regional centres across the country (17 different collections in 15 locations).
A standard set of tasks was collaboratively designed and used at all collection sites. These standardised recording conditions and tasks allow regional and social variation to be captured consistently across the country.
The recording set-ups will remain at each centre for subsequent data collection projects. The AusTalk corpus of audio-visual speech data, incorporating annotations, transcriptions and metadata, is accessible via a centralised storage facility. The original corpus, and later additions to it, will provide a significant resource for speech science, speech technology, and human communication research in Australia.
The AusTalk participants were not necessarily born in Australia, but they all completed their schooling in Australia, ensuring inclusion of a range of speakers from various cultural backgrounds.
As part of the anonymisation of the data, each participant was given a unique identifier consisting of a colour name followed by the name of an Australian animal. These identifiers also have a numerical short form used as the participant's short-form name. For example, participant <strong>Gold - Fuscous Honeyeater</strong> is also identified as 1_371. We expect that most researchers will use the short-form numerical identifier, but the colour-animal names are retained so that participants can identify their contribution to the corpus and gain access to their own recordings (through the Participant Portal).
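The two identifier forms can be sketched as a simple lookup. This is illustrative only: the actual association between colour-animal names and numeric identifiers is held in the corpus metadata, and the dictionary below contains just the single example given above.

```python
# Illustrative mapping from long-form (colour-animal) participant names
# to short-form numeric identifiers. In the real corpus this association
# is stored in the metadata records, not hard-coded.
long_to_short = {
    "Gold - Fuscous Honeyeater": "1_371",
}

def short_id(long_form):
    """Look up the short-form numeric identifier for a participant."""
    return long_to_short[long_form]

print(short_id("Gold - Fuscous Honeyeater"))  # 1_371
```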
Prior to the recording session, each speaker completed an extensive online questionnaire to collect a comprehensive set of demographic, family, historical and language background data. Each speaker was recorded over three 1-hour sessions, separated by at least one week to capture natural variation in voice quality. Each session comprised a series of both read and spontaneous speech tasks to capture style shifting, from highly formal word-list reading to more informal spontaneous conversation. In the third and final session, speakers were paired for two Map Tasks along the lines of Anderson et al. (1991) but re-designed for Australian English.
The components of the AusTalk corpus and the time taken for each task across the three recording sessions (S1, S2, S3) are shown in the Table below.
Spontaneous speech makes up approximately half of the collected data: a minimum of 40 minutes per speaker (Yes/No responses, Interview with RA, Re-told Story), plus 40 minutes for two Map Task interactions with another participant as partner, followed by 5 minutes of conversation with that partner (see Wagner et al., 2010 and Burnham et al., 2011 for details).
All recordings were made on the Black Box, a dedicated computer with audio and video interfaces configured in a portable equipment rack that could be moved between sites if needed.
(Image: the Black Box closed for transportation and storage.)
(Image: the Black Box deployed for recording.)
Specialised software was designed to run the collection protocol on the Black Box and display prompts simultaneously on two screens: one for the RA running the session and one for the participant being recorded. The software managed the components listed in the table above by sequentially prompting for each word or sentence and recording the audio and video channels directly to disk. After each item was recorded, the files were saved to disk and a metadata record (including the time of recording and the text of the prompt) was written. The file names used to save the data were structured to encode information about the item and its metadata. For example, the file 1_207_1_11_002-ch6-speaker.wav was recorded from speaker 1_207, in session 1, component 11, item 2, and contains audio from channel 6 (the speaker headset microphone). Files are grouped into a separate directory per component, and these in turn are grouped by session and by speaker.
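The file-naming convention above can be unpacked programmatically. The following is a minimal sketch assuming names follow the pattern `<speaker>_<session>_<component>_<item>-ch<channel>-<source>.wav`, generalised from the single example given; file names in the corpus that deviate from this pattern would need additional handling.

```python
import re

# Assumed pattern, generalised from the example 1_207_1_11_002-ch6-speaker.wav:
# <speaker>_<session>_<component>_<item>-ch<channel>-<source>.wav
FILENAME_RE = re.compile(
    r"(?P<speaker>\d+_\d+)_(?P<session>\d+)_(?P<component>\d+)_(?P<item>\d+)"
    r"-ch(?P<channel>\d+)-(?P<source>\w+)\.wav$"
)

def parse_austalk_name(name):
    """Return the metadata fields encoded in an AusTalk file name."""
    m = FILENAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognised file name: {name}")
    fields = m.groupdict()
    # Numeric fields are zero-padded in the name; convert them to ints.
    for key in ("session", "component", "item", "channel"):
        fields[key] = int(fields[key])
    return fields

print(parse_austalk_name("1_207_1_11_002-ch6-speaker.wav"))
```

Running this on the example file name yields speaker `1_207`, session 1, component 11, item 2, channel 6, source `speaker`, matching the description above.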
The complete corpus of audio-visual data collected for AusTalk is freely available to researchers from all the original Big ASC project partner institutions. This includes the demographic metadata provided by the participants.
For researchers from other institutions, all the audio data is available upon request and after agreeing to the AusTalk Corpus Licence Terms. At this time, only some of the video data can be released outside the project partners, due to legal issues concerning the original participant consent form.
Some of the AusTalk audio data has been transcribed and annotated, both manually and automatically. The long-term goal is to make all AusTalk annotations available via the Alveo Virtual Lab for use via its API. Until then, we offer downloads of the various AusTalk annotations that have already been created.
To access the AusTalk data, please register here.