Finding the Right Words

Investigating Machine-Generated Video Description Quality Using a Corpus-Based Approach




Keywords: audio description, video captioning, automation, audiovisual content, corpus-based approach


Abstract

This paper examines the first steps in identifying and compiling human-generated corpora for the purpose of determining the quality of computer-generated video descriptions. This is part of a study whose general ambition is to broaden the reach of accessible audiovisual content through semi-automation of its description for the benefit of both end-users (content consumers) and industry professionals (content creators). Working in parallel with machine-derived video and image description datasets created for the purposes of advancing computer vision research, such as Microsoft COCO (Lin et al., 2015) and TGIF (Li et al., 2016), we examine the usefulness of audio descriptive texts as a direct comparator. Cognisant of the limitations of this approach, we also explore alternative human-generated video description datasets, including bespoke content description. Our research forms part of the MeMAD (Methods for Managing Audiovisual Data) project, funded by the EU Horizon 2020 programme.



Author Biographies

Sabine Braun, University of Surrey

Sabine Braun is Professor of Translation Studies and Director of the Centre for Translation Studies at the University of Surrey (UK). Her research focuses on new modalities and socio-technological practices of interpreting and audiovisual translation. She has led several multi-national research projects on video-mediated distance interpreting and interpreting in virtual-reality environments (AVIDICUS 1-3, IVY, EVIVA), while contributing her expertise in technology-mediated interpreting to many other projects (e.g. QUALITAS, Understanding Justice, SHIFT in Orality) with the aim of investigating and informing the integration of communication technologies into professional interpreting practice as a means of improving access to public services. She is also currently a partner in a European H2020 project, MeMAD, which combines computer vision technologies, machine learning approaches and human input to create semi-automatic descriptions of audiovisual content as a way to improve media access; she is responsible for Work Package 5 (Human processing in multimodal content description and translation).

Kim Starr, University of Surrey

Kim Starr is a Research Fellow in the Centre for Translation Studies at the University of Surrey (UK). She previously worked in the financial and broadcast television sectors, finding time along the way to pursue a degree in politics and law (Queen Mary, University of London) and Master's degrees in journalism (Westminster) and audiovisual translation (Surrey). She was awarded a doctoral scholarship by the Arts and Humanities Research Council/TECHNE, completing her PhD in audio description for cognitive diversity in late 2017. Her doctoral research focused on remodelling AD as a bespoke accessibility service for young autistic audiences experiencing emotion recognition difficulties. She maintains an interest in multimodal and intersemiotic translation services for cognitive inclusivity. For the past two years, Kim has worked on the EU-funded Methods for Managing Audiovisual Data (MeMAD) project.




How to Cite

Braun, S., & Starr, K. (2019). Finding the Right Words: Investigating Machine-Generated Video Description Quality Using a Corpus-Based Approach. Journal of Audiovisual Translation, 2(2), 11–35.