Taking a Cue From the Human

Linguistic and Visual Prompts for the Automatic Sequencing of Multimodal Narrative





Keywords: audiovisual translation, computer vision, video description, audio description, machine learning, audiovisual content, accessibility, content description, content retrieval, MeMAD, automatic captioning


Human beings find the process of narrative sequencing in written texts and moving imagery a relatively simple task. Key to the success of this activity is establishing coherence by using critical cues to identify key characters, objects, actions and locations as they contribute to plot development. In the drive to make audiovisual media more widely accessible (through audio description), and media archives more searchable (through content description), computer vision experts strive to automate video captioning in order to supplement human description activities. Existing models for automating video descriptions employ deep convolutional neural networks for encoding visual material and feature extraction (Krizhevsky, Sutskever, & Hinton, 2012; Szegedy et al., 2015; He, Zhang, Ren, & Sun, 2016). Recurrent neural networks decode the visual encodings and supply a sentence that describes the moving images in a manner mimicking human performance. However, these descriptions are currently “blind” to narrative coherence.
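The encoder-decoder architecture referenced above (a deep convolutional network that encodes visual input, followed by a recurrent network that decodes the encoding into a sentence) can be illustrated with a minimal sketch. This is a hypothetical, heavily simplified stand-in written in PyTorch, not the models cited in the abstract: the tiny convolutional encoder is a placeholder for a deep CNN such as ResNet (He, Zhang, Ren, & Sun, 2016), and all layer names, sizes and vocabulary are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Toy stand-in for a deep CNN that encodes a video frame as a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling to one value per channel
        )
        self.fc = nn.Linear(16, feat_dim)

    def forward(self, frames):            # frames: (batch, 3, H, W)
        x = self.conv(frames).flatten(1)  # (batch, 16)
        return self.fc(x)                 # (batch, feat_dim)

class CaptionDecoder(nn.Module):
    """RNN decoder that turns a visual encoding plus previous words into word scores."""
    def __init__(self, vocab_size=1000, feat_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, features, tokens):
        # Prepend the visual encoding as the first step of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), self.embed(tokens)], dim=1)
        h, _ = self.gru(inputs)
        return self.out(h)                 # (batch, seq_len + 1, vocab_size)

encoder = FrameEncoder()
decoder = CaptionDecoder()
frames = torch.randn(2, 3, 64, 64)         # two dummy RGB frames
tokens = torch.randint(0, 1000, (2, 5))    # dummy partial captions
logits = decoder(encoder(frames), tokens)  # word scores per decoding step
print(logits.shape)                        # torch.Size([2, 6, 1000])
```

Note that a decoder of this kind scores one sentence per clip in isolation, which is precisely why, as the abstract observes, such descriptions remain "blind" to narrative coherence across shots.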

Our study examines the human approach to narrative sequencing and coherence creation using the MeMAD (Methods for Managing Audiovisual Data: Combining Automatic Efficiency with Human Accuracy) film corpus, comprising five hundred extracts chosen as stand-alone narrative arcs. We examine character recognition, object detection and temporal continuity as indicators of coherence, using linguistic analysis and qualitative assessments to inform the development of more narratively sophisticated computer models in the future.



Author Biographies

Kim Linda Starr, University of Surrey

Kim is a Research Fellow in the Centre for Translation Studies at the University of Surrey in the UK. She previously worked in the financial and broadcast television sectors, finding time along the way to pursue a degree in politics and law (Queen Mary, University of London), and Master’s degrees in journalism (Westminster) and audiovisual translation (Surrey). She was awarded a doctoral scholarship by the Arts and Humanities Research Council/TECHNE, completing her PhD in audio description for cognitive diversity in late 2017. Her doctoral research focused on remodelling AD as a bespoke accessibility service for young autistic audiences experiencing emotion recognition difficulties. She maintains an interest in multimodal and intersemiotic translation services for cognitive inclusivity. For the past two years, Kim has worked on the EU-funded ‘Methods for Managing Audiovisual Data’ (MeMAD) project.

Sabine Braun, University of Surrey

Sabine Braun is Professor of Translation Studies and Director of the Centre for Translation Studies at the University of Surrey (UK). Her research focuses on new modalities and socio-technological practices of interpreting and audiovisual translation. She has led several multi-national research projects on video-mediated distance interpreting and interpreting in virtual-reality environments (AVIDICUS 1-3, IVY, EVIVA), while contributing her expertise in technology-mediated interpreting to many other projects (e.g. QUALITAS, Understanding Justice, SHIFT In Orality) with the aim of investigating and informing the integration of communication technologies into professional interpreting practice as a means of improving access to public services. She is also currently a partner in a European H2020 project which combines computer vision technologies, machine learning approaches and human input to create semi-automatic descriptions of audiovisual content (MeMAD) as a way to improve media access; she is responsible for Workpackage 5 (Human processing in multimodal content description and translation).

Jaleh Delfani, University of Surrey

Dr Jaleh Delfani is a post-doctoral research fellow at the Centre for Translation Studies (University of Surrey, UK), working on the EU-funded H2020 research project ‘MeMAD - Methods for Managing Audiovisual Data: Combining Automatic Efficiency with Human Accuracy’, which involves researching human-generated vs. machine-generated descriptions of moving images. She previously gained research experience in a related form of audiovisual translation by investigating the transfer of extralinguistic cultural references in unofficial interlingual subtitling practices.




How to Cite

Starr, K. L., Braun, S., & Delfani, J. (2020). Taking a Cue From the Human: Linguistic and Visual Prompts for the Automatic Sequencing of Multimodal Narrative. Journal of Audiovisual Translation, 3(2), 140–169. https://doi.org/10.47476/jat.v3i2.2020.138



Special Issue: November 2020