Methodological Preparation of a Within-Subject Audiovisual Cognition, Reception and Perception Study

Abstract

In the past decade, cognitive empirical AVT research has been on the rise. The majority of these studies are between-subject studies focused on subtitles for the deaf and hard of hearing (SDH). The few experimental studies aimed at other audiences tend to have small sample sizes. Within-subject studies are rarely used in experimental AVT cognition, reception and perception research, although they can increase statistical power through repeated testing and shed light on the idiosyncratic nature of the matter. This paper pleads for the introduction of complementary within-subject designs by illustrating the contrasts between the within-subject and between-subject research design. Drawing from the broader spectrum of Translation Studies and the case of the Subtitles for Access to Education (S4AE) research project, this paper highlights obstacles in the preparation of a within-subject AVT cognition, reception and perception experiment and proposes a possible approach to prepare similar studies.


Introduction
Audiovisual translation (AVT) has become a booming and multi-faceted research field over the past decades (Díaz-Cintas, 2020). The start of the new millennium saw the emergence of cognitive and empirical AVT studies, which tend to focus on subtitles for the deaf and hard of hearing (SDH) and audio description (Díaz-Cintas, 2020). Experimental research into the reception of AVT for other audiences and purposes other than language learning remains scarce (Díaz-Cintas, 2020; Díaz-Cintas & Szarkowska, 2020). However, as Díaz-Cintas and Szarkowska (2020) point out, there is a need for such experimental research as it not only allows us to test new practices, but also enables us to verify old assumptions and theories. This research could "feed back straight into professional practices and processes" (Díaz-Cintas, 2020, p. 222). These scholars also underline the importance of sound methodologies, replicability and reproducibility in said research.
Adhering to the aforementioned importance of methodological transparency, replicability and reproducibility, the aim of this paper is to present the methodological preparation of a large-scale, within-subject (repeated measures) study into the reception and perception of and cognitive load posed by subtitles, the so-called Subtitles for Access to Education (S4AE) project. This article follows in the footsteps of a number of publications that lay out possible methods and methodologies or recommend certain approaches for experimental AVT reception research (e.g., Kruger et al., 2015). Another important precursor is the position paper by Orero et al. (2018), which can be used as a solid guideline for research as it lists many previously conducted AVT studies, proposes numerous measurement tools and recommends various approaches and research designs. One design, however, receives relatively little attention in these publications, namely the within-subject design. What is more, within-subject designs appear to be scant in AVT cognition, reception and perception studies (for brevity purposes: AVTCRP studies) as a whole, with exceptions such as Jensema et al. (2000), Tsaousi (2016), Montero Perez (2019) and Liao et al. (2020). Slightly more frequent is the use of mixed designs, including both within-subject and between-subject components, e.g., Orrego-Carmona (2015) and Szarkowska and Gerber-Morón (2018a, 2018b). These are, however, also limited in number. This article aims to shed light on the advantages and drawbacks of a within-subject design and the possible challenges that arise when preparing such a study. This paper is structured as follows: Section 2 elucidates the contrast between within-subject and between-subject designs, based on literature sourced from the broader field of Translation Studies. In Section 3, the design of the S4AE project and the methodological preparations are explained in detail.
The paper concludes with some methodological recommendations for future within-subject studies as well as a discussion of some limitations in our study.

Designs in Experimental AV Cognition, Reception and Perception Studies
The design of any experimental study is determined by the main research question. Balling and Hvelplund (2015) distinguish three types of research design: (a) an independent (or between) groups design, comparing two groups; (b) a within-subject (repeated measures) design, examining the same group in various conditions; and (c) a functional relations design, focusing on relations between variables rather than on participants' behaviour in various conditions. Combinations of these designs, mixed designs, are also possible. In this paper, we will mainly focus on the repeated measures design, contrasting its characteristics with those of the between-subject design. We chose this focus as we expect most readers to be familiar with between-subject designs, but not necessarily with within-subject designs, especially given the scarcity of such designs in experimental AVT research. For the basis of this paper, we draw on research in AVT as well as on the broader field of Translation Studies.
A between-group (or between-subject) design is commonplace in AVTCRP studies. It tests different participants in various conditions or in a single condition. There are numerous ways to plan a between-subject design, ranging from using a test group and a control group in a regular and a doctored condition (e.g., Bisson et al., 2014; Kruger & Steyn, 2014; Montero Perez, 2020; Szarkowska et al., 2011) to comparing participants across conditions without control groups (e.g., Moreno & Mayer, 2002; Perego et al., 2010; Vulchanova et al., 2015). In contrast, a within-subject (or repeated measures) design is an experimental design in which the same participants are tested a number of times. Again, the specifics may vary depending on the research goal. Researchers can, for example, test the same participants in multiple conditions to examine how varying situations influence the participants (e.g., the S4AE project, see Section 3.1) or they can compare before and after data in one condition (e.g., Montero Perez, 2019). All tests may take place in one session (e.g., the pilot tests of the S4AE project) or may span a longer period of time to assess developments (e.g., Moreno et al., 2011).
These designs have contrasting advantages and disadvantages. The largest advantage of a within-subject design is the mitigation of variability, since the same participants are used in each condition (Mellinger & Hanson, 2017, p. 137). As a result of this lowered variability, fewer participants are required to draw reliable conclusions, which is also attractive in terms of participant recruitment and its costs. Between-subject designs are limited in their ability to account for differences between participants, which reduces statistical power in the case of smaller sample sizes (Mellinger & Hanson, 2017). Díaz-Cintas (2020, p. 7) noted that limited sample sizes are a recurring problem in the few experimental AVTCRP studies that are not focused on SDH. Complementary within-subject designs could therefore be a means to increase validity and reliability in experimental AVT research. Though the repeated testing might increase internal validity (i.e., accurate measurement) and reliability (i.e., experimental replicability and reproducibility) to some extent, by repeatedly confirming certain findings, revealing patterns or showing consistency, it reduces external validity, i.e., ecological validity, as it is evidently conducted in a more experimental setting than a between-subject study (Frey et al., 1991; Saldanha & O'Brien, 2013, p. 33). The mitigation of personal variability is also beneficial given the idiosyncratic nature of particular research topics, such as perception and cognition, which we expect to differ for every individual. Within-subject designs could filter out these undesired individual influences and could, in combination with biographical surveys or participant profiling, also help identify influencing factors.
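The sample-size advantage can be made concrete with a back-of-the-envelope power calculation. The sketch below uses a simple normal approximation; the effect size (d = 0.5), the correlation between conditions (ρ = 0.5), alpha and power are illustrative assumptions, not values from our study:

```python
from scipy.stats import norm

def n_between(d, alpha=0.05, power=0.80):
    """Approximate per-group n for an independent-groups t-test
    (normal approximation), with d = Cohen's d."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z / d) ** 2

def n_within(d, rho=0.5, alpha=0.05, power=0.80):
    """Approximate n for a paired (repeated measures) comparison:
    the variance of the difference scores shrinks by 2 * (1 - rho),
    where rho is the correlation between the two conditions."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (1 - rho) * (z / d) ** 2

print(round(n_between(0.5)))  # 63 participants per group
print(round(n_within(0.5)))   # 31 participants, tested twice
```

With conditions correlated at ρ = 0.5, the paired design needs roughly half the observations of a single group in the independent design, precisely because between-participant variability is removed from the comparison.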
In sum, within-subject designs are a viable option to strengthen studies with smaller samples and to mitigate, and possibly identify, influences resulting from personal differences. These two advantages were already highlighted by Bernardini (2001), when she addressed the frequent use of between-subject designs in TAP (Think-Aloud Protocol) based translation process research, often conducted with a very limited number of participants. Another advantage of a within-subject design is that it generally does not require control groups, which reduces the chance of contamination. Contamination occurs when an experimental group (un)intentionally passes on essential information about the experiment to the control group or vice versa, which may mask the actual effects of what is tested. The reduced chance of contamination in within-subject designs can be considered a substantial advantage, although it is difficult to estimate how realistic and/or frequent the risk of data contamination is, since there have, to our knowledge, been no such reports in AVT research.
However, a within-subject design also has a number of drawbacks compared to a between-subject design. One contrast is the time required to adequately set up and execute an experiment. As Section 3.2 will reveal, it takes considerably more effort to prepare a within-subject study than a between-subject study. The repeated testing also lengthens the experiment. Another contrast is that, due to this extended length, a within-subject experiment becomes more prone to attrition and data loss (Mellinger & Hanson, 2017, pp. 7, 105). In the case of multiple tests at different points in time, participants may simply not be present for the repeated tests. Additionally, multiple tests increase the chances of data being unusable, especially in the case of eye tracking with poor calibration or low tracking ratios. A third drawback of the repeated testing is the influence of certain confounding variables (Charness et al., 2012). Mellinger and Hanson (2017, pp. 7, 105) distinguish three of these variables: (a) fatigue, (b) order effects, and (c) carryover effects. The multitude of tests can be tiresome for participants, which in turn may lead to decreased concentration and/or motivation, especially in later stages of the experiment. The participants' behaviour may also differ depending on the order of the tests. Carryover effects imply that participants learn and improve over the course of an experiment, e.g., by conversing with one another, reading or watching relevant material (outside the experimental design) or becoming familiar with the way of testing, which may result in higher scores in later stages. Evidently, these confounding variables can significantly influence the results of a within-subject study, whereas they are less important in a between-subject study. One common solution is to employ counterbalancing. Nevertheless, Mellinger and Hanson (2018, p. 16) warn that these confounding variables may still be present.
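Complete counterbalancing of a three-material, three-condition study can be enumerated exhaustively. The sketch below (the labels are ours) crosses every condition-to-material assignment with every presentation order; when 36 sequences are impractical for the available sample, a balanced Latin-square subset is a common alternative:

```python
from itertools import permutations

# Hypothetical labels for three lectures and three conditions.
lectures = ("Piketty", "Rousseau", "Tocqueville")
conditions = ("intralingual", "interlingual", "no subtitles")

# Cross every condition-to-lecture assignment with every
# presentation order: complete counterbalancing.
sequences = []
for cond_order in permutations(conditions):
    pairing = tuple(zip(lectures, cond_order))
    for order in permutations(pairing):
        sequences.append(order)

print(len(sequences))  # 6 assignments x 6 orders = 36 sequences
# Participants can then be cycled through `sequences` on arrival.
```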

Research Background, Goal and Design
To introduce the S4AE project, we would first like to illustrate its research background. Following modern globalization and migration, higher education institutions (HEIs) face increasingly multilingual and multicultural audiences. To cater to these audiences, many HEIs are starting to use English as a medium of instruction (EMI) (Wächter & Maiworm, 2014). The introduction of EMI, however, may have a negative impact on comprehension, cognitive load and retention for students less proficient in L2 English. Subtitles might help to overcome these language barriers and make EMI lectures more accessible. However, adding subtitles to the classroom implies that students suddenly must process a new source of visual information alongside the already present audiovisual information from the lecturer, the lecture slides, the whiteboard, etc. This increases the amount of information that needs to be processed and might thus be more cognitively demanding for students. Delving into this matter, the S4AE project builds on three previous studies exploring the effects of subtitles on comprehension and cognition in a standard educational context (Chan et al., 2019; Kruger & Steyn, 2014) and aims to answer the following question: To what extent do the presence of subtitles (present/not present), the subtitle language (L1/L2), the level of L2 proficiency and the students' prior knowledge influence (1) the (perception of) cognitive load and (2) the comprehension and retention of an L2 English lecture?
Interestingly, studies into subtitle processing and the effects of subtitles on comprehension and retention in an educational context that is aimed at content and not language learning seem scarce. We know only of the three studies mentioned earlier (Chan et al., 2019; Kruger & Steyn, 2014). These use self-report effort, frustration and comprehension questionnaires, comprehension tests and eye tracking. They also distinguish visual attention from actual subtitle reading, using the Reading Index for Dynamic Texts (RIDT) developed by Kruger and Steyn (2014).
Complementing these three (between-subject) studies, the S4AE project revolves around a central within-subject design. However, following the advice of Mellinger and Hanson (2017, pp. 163-164), we extended the initial within-subject design with between-group independent variables, which allows us to also assess the interactions of cognitive load and comprehension with the students' L2 proficiency levels and prior knowledge of the subject. The inclusion of these variables does not alter the advantages, disadvantages or necessary preparation of a within-subject study that this paper discusses.
In this design, Dutch (Flemish) students will view three different recorded EMI lectures. These lectures will be provided in three conditions: (a) with intralingual (English) subtitles; (b) with interlingual (Dutch) subtitles; (c) with no subtitles. To minimise fatigue, order and carryover effects (Mellinger & Hanson, 2017, p. 105), the order of the lectures and the conditions will be counterbalanced completely. The students will watch the lectures individually in an eye tracking laboratory. Eye tracking will allow us to measure cognitive load and actual subtitle reading, using Kruger and Steyn's (2014) RIDT as a complementary tool. After each lecture, the students will fill out an extended version of the psychometric questionnaire on cognitive load validated by Leppink and van den Heuvel (2015) and, subsequently, a comprehension test. Using both a psychometric self-report questionnaire and eye tracking to assess cognitive load allows triangulation of data from objective and subjective measures, as recommended by Orero et al. (2018). One month after the experiment, all participants will complete the same comprehension tests again to measure retention. The scores on the psychometric questionnaires and comprehension tests, as well as the eye tracking data, will be correlated with the students' biographical data, language proficiency and learning preferences, which will be collected one month prior to the experiment.
Although within-subject designs, and mixed designs for that matter, remove personal variability, they may be prone to influences originating from the materials used in the experiment. Therefore, meticulous preparation, preferably including pre-testing, and analysis of the materials is required. The aim of this paper is to show how this may be carried out.

The Ten Steps
A number of preparatory steps need to be taken to ensure the use of comparable materials in a within-subject AVTCRP study and to safeguard the validity of future results. Based on our own experiences, we propose to divide the initial process of preparation into the ten distinct steps listed below:
(a) Careful preparation of materials
(b) Lecture content and feature analyses
(c) First pilot study
(d) Re-evaluation
(e) Optimization
(f) Second pilot study
(g) Production of comparable subtitles
(h) Subtitle analyses
(i) Third pilot study with subtitles
(j) Finalisation of materials
In the following paragraphs, the first six steps will be explained in detail, integrating relevant research. Each step will generate results which (if applicable) might be carried over and integrated into the next step. Given the limited scope of this article, we will focus exclusively on the preparation of the lectures (which can be considered source texts) and the comprehension tests (steps a-f). The complex production, analyses and testing of comparable interlingual and intralingual subtitles are beyond the scope of this paper and will be published in a future article.

Careful Preparation of Materials
Comparable materials are of the utmost importance for a within-subject design. In the S4AE project we examine the effect of no subtitles, interlingual (Dutch) subtitles and intralingual (English) subtitles. This implies we need three lectures that are comparable in content and language (complexity), length, style, etc. Content-wise, all three lectures focused on philosophy, which was realistic and viable, since optional courses in philosophy are part of the study program of the intended participants. Professor Frank Albers, philosophy lecturer at the University of Antwerp, wrote three comparable lectures on the views on inequality of three renowned philosophers, Thomas Piketty, Jean-Jacques Rousseau and Alexis de Tocqueville. The lecture texts were subsequently analysed and recorded (see Section 3.2.2).
In addition to the lecture texts, the measurement tools had to be selected and prepared. We used eye tracking and an existing psychometric self-report questionnaire to measure cognitive load (Leppink & van den Heuvel, 2015). This validated questionnaire consisted of eight general questions for which each participant had to rate complexity on a scale from 1 to 10, 1 representing low complexity and 10 representing high complexity. The first four questions asked about content complexity and as such provided insight into the overall perceived intrinsic load. The last four concerned instructional complexity and thus provided data on perceived extraneous load. To measure retention, we used a (repeated) comprehension test, a tool that has frequently been used successfully in earlier AVTCRP research (e.g., Lavaur & Bairstow, 2011; Montero Perez et al., 2014). We designed the comprehension tests as if they were exams for a philosophy course. All three tests consisted of twelve questions and had equal numbers of multiple-choice questions, input questions, memory questions and insight questions². Finally, we used a biographical survey and would employ additional tests in the main experiment, e.g., proficiency tests aimed at assessing listening and reading competences in both English and Dutch and supplementary surveys, to accurately examine the participants' profiles, proficiency levels and prior knowledge.
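Based on the subscale mapping just described, the questionnaire can be scored as sketched below. The item wording and validation details are in Leppink and van den Heuvel (2015) and are not reproduced here:

```python
def score_leppink(ratings):
    """Score the eight 1-10 complexity ratings as described above:
    items 1-4 tap content complexity (intrinsic load), items 5-8
    instructional complexity (extraneous load); the mean of all
    eight items serves as total load."""
    assert len(ratings) == 8 and all(1 <= r <= 10 for r in ratings)
    return {
        "intrinsic": sum(ratings[:4]) / 4,
        "extraneous": sum(ratings[4:]) / 4,
        "total": sum(ratings) / 8,
    }

print(score_leppink([3, 4, 2, 3, 6, 5, 7, 6]))
# {'intrinsic': 3.0, 'extraneous': 6.0, 'total': 4.5}
```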

Lecture Content and Feature Analyses
The lecture texts were first compared in terms of readability to ensure their comparability³. To this end, we used the Flesch Reading Ease, the Flesch-Kincaid Grade Level and the New Dale-Chall formula. The first two calculate readability based on the average sentence length and the average number of syllables per word. The Flesch Reading Ease yields a score out of 100, where above 90 is considered very easy and below 30 very hard; the Flesch-Kincaid Grade Level indicates the American grade-school level required to read the text. Sentence and word length are considered accurate indicators of readability (Smeuninx, 2018), but to include a different type of measure, we also added the New Dale-Chall formula, which calculates readability based on a list of familiar words and the average sentence length, and yields a score from 0 to 10 and above that corresponds to a grade level (Table 1 reports the grade level). As shown in Table 1, the texts receive very similar scores and are estimated to be difficult texts aimed at twelfth-grade (17- to 18-year-old) students. We then analysed the texts using Perego et al.'s (2018) construct for film complexity. These researchers distinguish three types of complexity: (a) structural-informative complexity, i.e., the number of cuts as a measure of newly introduced information, pace, and the total number of one- and two-line subtitles; (b) linguistic complexity, i.e., total word count, standardised type-token ratio (TTR), words per minute (WPM), total sentence count and average sentence length; and (c) narrative complexity, i.e., the number of film locations, characters and flashbacks. Structural-informative complexity is not relevant at this stage given the absence of subtitles and cuts. Table 2 shows the relevant indices for linguistic complexity, with the word count and standardised TTR being very similar.
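For illustration, the two Flesch formulas can be computed directly from sentence, word and syllable counts. The sketch below uses a naive vowel-group syllable heuristic; our own analyses relied on dedicated readability tools, so exact scores may differ:

```python
import re

def count_syllables(word):
    """Naive vowel-group heuristic with a rough silent-e
    correction; dedicated tools use proper syllabifiers."""
    n = len(re.findall(r"[aeiouy]+", word.lower()))
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_scores(text):
    """Flesch Reading Ease and Flesch-Kincaid Grade Level from the
    average sentence length (ASL) and syllables per word (ASW)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)
    asw = sum(map(count_syllables, words)) / len(words)
    reading_ease = 206.835 - 1.015 * asl - 84.6 * asw
    grade_level = 0.39 * asl + 11.8 * asw - 15.59
    return reading_ease, grade_level
```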
Sentence count and length vary, but this is deemed less important as these texts will be recorded as lectures (oral texts). WPM/WPS is discussed below (Table 3). Perego et al. (2018) mention chronology and amount of information as key aspects of narrative complexity. After analysing the texts, we concluded similar information was presented in a comparable order. The lectures were subsequently recorded in a recording studio using an identical format. In each of the three lecture recordings, Professor Albers is shown against a black background. This talking head format is, of course, a more artificial setting than a normal classroom environment, i.e., lower external validity, but the research project aims to assess the impact of subtitles in a more controlled environment. Additionally, minimising the effects of the lecturer also reduces extraneous load and increases information transfer following the coherence effect (Mayer & Moreno, 2003). This may enable the students to read and process the subtitles better, which has been shown to correlate directly with performance (Kruger & Steyn, 2014).
Finally, the lecture recordings were analysed. Each lecture is approximately 7 minutes long. The professor uses no hand gestures, does not cough, maintains a constant intonation, rarely stutters and keeps a relatively constant facial expression across all three lectures. One notable difference from the lecture texts is that the professor tends to explicitly mention quotation marks or add various expressions for indirect speech to signal quotes. This results in a slightly different total number of words per lecture. Table 3 shows the length of each recording, the adjusted word count, the overall speech rate in words per second (WPS) and the mean speech rate across 14 intervals of 30 seconds in WPS. Based on these aspects, our team considered the lecture recordings comparable.
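The linguistic indices reported in Tables 2 and 3 can be approximated with short routines such as the ones below. The 100-word chunk size for the standardised TTR is our assumption here: common STTR implementations use 1,000-word windows, which would exceed these roughly 7-minute lectures:

```python
def standardised_ttr(words, chunk=100):
    """Mean type-token ratio over consecutive equal-sized chunks,
    ignoring the trailing partial chunk. Chunking keeps the TTR
    comparable across texts of different lengths."""
    ttrs = [
        len({w.lower() for w in words[i:i + chunk]}) / chunk
        for i in range(0, len(words) - chunk + 1, chunk)
    ]
    return sum(ttrs) / len(ttrs)

def speech_rate(word_count, duration_s):
    """Overall speech rate in words per second (WPS)."""
    return word_count / duration_s
```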

First Pilot Study
To verify the conclusions drawn from step 2, a first pilot study without subtitles was set up and conducted in May 2018 with 75 second-year students of the BA in Applied Linguistics at the University of Antwerp. They all completed the biographical survey, the self-report cognitive load questionnaires and the comprehension tests. Eye tracking, pre-testing and post-testing were excluded to focus on the materials themselves and to keep data analysis feasible. For the statistical analyses, we have one within-subject variable with three levels, i.e., the three lectures, and two independent between-group variables with two levels: the study of English (i.e., studying English in their BA or not) and prior knowledge of philosophy (i.e., having followed an optional philosophy course taught by the professor featured in the lectures or not)⁴. We consistently use mixed ANOVAs as these can compare the mean differences between the lectures while taking into account the two between-group variables. However, it is important to note that these between-group variables only provide rough indications of the students' profiles based on the biographical survey, since extensive pre-testing (which will be done in the main study) was forgone at this stage. Consequently, we mainly focus on the within-subject effects for all participants and will only briefly discuss interactions with these between-group variables.
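The within-subject logic behind these analyses can be illustrated with a one-way repeated-measures ANOVA computed by hand. The pilot analyses themselves used mixed ANOVAs in standard statistical software; this minimal numpy sketch ignores the between-group factors and only shows how subject variability is removed from the error term:

```python
import numpy as np

def rm_anova(data):
    """One-way repeated-measures ANOVA on a subjects x conditions
    array. The subject sum of squares is partitioned out of the
    error term, which is what gives within-subject designs their
    statistical power."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand = data.mean()
    ss_cond = n * ((data.mean(axis=0) - grand) ** 2).sum()
    ss_subj = k * ((data.mean(axis=1) - grand) ** 2).sum()
    ss_error = ((data - grand) ** 2).sum() - ss_cond - ss_subj
    df_cond, df_error = k - 1, (n - 1) * (k - 1)
    return (ss_cond / df_cond) / (ss_error / df_error), df_cond, df_error

# Toy ratings for 3 participants across 3 conditions:
F, df1, df2 = rm_anova([[1, 2, 3], [2, 3, 4], [3, 5, 4]])
print(F, df1, df2)  # F(2, 4) = 7.0
```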
A number of conclusions could be drawn from this experiment⁵. Firstly, T3 appears to induce a significantly lower total load (mean of all questions in the psychometric self-report) than P1 and R2 for all participants (Appendix, Table 5.3). The same can be observed for intrinsic load (mean of questions 1-4; Appendix, Table 6.3). In contrast, no significant main effects were found for extraneous load (mean of questions 5-8; Appendix, Table 7.2). As far as interaction effects are concerned, we observed a significant interaction between total load and the between-group philosophy variable (Appendix, Table 5.2), and between the extraneous load ratings and philosophy (Appendix, Table 7.2). In terms of between-group effects, those studying English show significantly lower total load ratings than those who do not (Appendix, Table 7.3).
For comprehension, we found a significant main effect of the lectures, but no significant interactions when the between-group variables are considered (Appendix, Table 8.2). Participants scored significantly lower on R2 than on P1 or T3 (Appendix, Table 8.3). As for the between-group effects, those studying philosophy were found to perform better than those who did not (Appendix, Table 8.4).
In this pilot study, we were mainly interested in the differences regardless of groups, which is why the comprehension results are particularly problematic. These tests need revising, since the lack of comparability might not reside in the lectures but in the comprehension questions themselves. In this light, the overall difference in total load, and consequently intrinsic load, between T3 and the other lectures may also be problematic, since it might indeed hint at a difference between the lectures. However, we believe that data noise could be an issue. By data noise we mean the data produced by participants who did not follow the instructions properly⁶, e.g., a participant rating all psychometric questions with the same number just to be done with the experiment or always choosing the first multiple-choice answer in the comprehension tests. We did not verify whether the participants actually watched the videos or followed the instructions and were therefore unable to filter this possibly conflicting, inaccurate or meaningless data. Accordingly, we will first focus on the revision of the comprehension tests and implement some form of participant surveillance.

Re-evaluation
Following the results from the first pilot study, all materials were re-evaluated in an attempt to pinpoint a possible cause for the differences. Our team of researchers unanimously agreed that, although T3 could be considered slightly easier content-wise due to it being less philosophical and more focused on political rather than monetary (in)equality, the main problem resided in the comprehension tests and the lack of data noise prevention. Consequently, the need for optimization of the comprehension tests arose.

Optimization
We recomposed the tests in view of our within-subject component. We no longer focused on creating tests similar to actual lecture exams, but instead aimed to strengthen the comparability of the questions across all lectures, covering not only main ideas but also secondary details. Due to a lack of research on how to develop comparable within-subject comprehension tests, we devised our own approach. First, all originally used questions were considered, disregarding scores, to establish so-called matches (i.e., comparable questions across the three tests), using a large number of variables such as question type, answer type, question length, answer length, in-text location of the first mention of the answer, in-text repetition of the answer and "hearing guesses" (i.e., the probability of guessing correctly based on listening to the lecture). If no match could be found for a particular question, it was discarded. If a match could be found between two lectures only, we explored the possibility of creating a similar question for the remaining lecture.⁷ Consequently, each test contained twelve questions comparable to the questions in the other two tests. Although this may have eliminated undesirable influences from varying degrees of difficulty in the comprehension tests, we expect a possible increase in order and/or carryover effects (Mellinger & Hanson, 2017) and will verify this in the statistical analyses. Lastly, we logged mouse activity to check whether participants watched the entire video and monitored participants more closely to prevent inattentive behaviour.
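The core of this matching procedure can be sketched as a grouping operation over question attributes. The records and attribute names below are hypothetical simplifications; the actual matching considered many more variables and involved manual judgement:

```python
from collections import defaultdict

# Hypothetical, heavily simplified question records.
questions = [
    {"lecture": "P1", "qtype": "multiple-choice", "answer": "name"},
    {"lecture": "R2", "qtype": "multiple-choice", "answer": "name"},
    {"lecture": "T3", "qtype": "multiple-choice", "answer": "name"},
    {"lecture": "P1", "qtype": "input", "answer": "date"},
    {"lecture": "R2", "qtype": "input", "answer": "date"},
]

# Group questions by their comparability profile.
by_profile = defaultdict(dict)
for q in questions:
    by_profile[(q["qtype"], q["answer"])][q["lecture"]] = q

# Keep a match only when all three tests have a comparable item;
# pairs flag where a third question might still be written.
matches = {k: v for k, v in by_profile.items() if len(v) == 3}
pairs = {k: v for k, v in by_profile.items() if len(v) == 2}
print(len(matches), len(pairs))  # 1 complete match, 1 pair
```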

Second Pilot Study Without Subtitles
To test the optimised comprehension tests, we conducted a second pilot study without subtitles in March 2019 with 50 second-year students of the BA in Applied Linguistics of the University of Antwerp (33 female, 17 male)⁸. The same within-subject (the lectures) and between-group (English and philosophy) variables from the first pilot study were used. The participants first filled in the biographical survey. They then watched the three lectures, each time followed by the psychometric questionnaire (Leppink & van den Heuvel, 2015) and the respective comprehension test. As in the first pilot study, mixed ANOVAs were used to analyse the data.
The mean total, intrinsic and extraneous load were relatively similar for all three lectures (Appendix, Tables 9.1, 10.1 and 11.1). Additionally, the average scores for the three types of cognitive load for each lecture individually were very similar to the scores from the first pilot study.
For intrinsic load (Appendix, Table 10.1), Mauchly's test of sphericity revealed a violation of sphericity, χ²(2) = 9.018, p = 0.011. With a Greenhouse-Geisser correction for non-spherical data, a mixed ANOVA showed no statistically significant main within-subject effect of the lectures on intrinsic load for all participants, F(1.676, 72.074) = 2.913, p = 0.070, and no interaction effects (Appendix, Table 10.2). Furthermore, no significant between-group effects were found (Appendix, Table 10.3).
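For reference, the Greenhouse-Geisser epsilon used in this correction can be estimated from the covariance matrix of the repeated measures. Statistical packages report it automatically; the minimal sketch below only shows where the value comes from:

```python
import numpy as np

def gg_epsilon(data):
    """Greenhouse-Geisser epsilon from a subjects x conditions
    array: 1.0 means sphericity holds, and the lower bound is
    1 / (k - 1). Within-subject degrees of freedom are multiplied
    by epsilon to correct for non-sphericity."""
    data = np.asarray(data, dtype=float)
    k = data.shape[1]
    S = np.cov(data, rowvar=False)
    # Double-centre the covariance matrix.
    C = S - S.mean(axis=0) - S.mean(axis=1)[:, None] + S.mean()
    return float(np.trace(C) ** 2 / ((k - 1) * (C ** 2).sum()))

rng = np.random.default_rng(1)
eps = gg_epsilon(rng.normal(size=(45, 3)))
print(eps)  # somewhere in [0.5, 1.0] for k = 3
```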
Lastly, we looked at extraneous load (Appendix, Table 11.1). Sphericity was assumed, as Mauchly's test of sphericity did not reveal a violation, χ²(2) = 5.082, p = 0.079, and a mixed ANOVA indicated that the lectures did not significantly differ in extraneous load for all participants either, F(2, 86) = 1.581, p = 0.212 (Appendix, Table 11.2). The analysis only showed a significant interaction between the extraneous load ratings and the English variable, F(2, 86) = 6.567, p = 0.002 (Appendix, Table 11.2). We do not consider this interaction with the English variable problematic, since more extensive proficiency testing will be done for the main experiment and the extraneous load ratings regardless of groups do not differ significantly. In terms of between-group effects, those studying English perceived a significantly lower extraneous load across the videos, F(1, 43) = 9.414, p = 0.004, r = 0.47, than the others. A significantly lower extraneous load was also found for the philosophy students when compared to those who did not follow any philosophy course, F(1, 43) = 5.535, p = 0.023, r = 0.36 (Appendix, Table 11.3). It could be expected that the English students had fewer problems with a course taught in English (extraneous load) given their proficiency level. Experience with philosophy, on the other hand, was expected to affect content comprehension (intrinsic load) rather than extraneous load, which was not found.
In contrast to the findings of the first pilot study, we no longer found a difference between the cognitive loads of the three lectures for all participants. However, we found two significant interaction effects with the between-group English variable.
We draw two conclusions from these findings. Firstly, the importance of participant monitoring and of double-checking mechanisms to verify that participants watched the videos and properly answered the questionnaires and tests cannot be overstated. The significant main effect of the lectures on total and intrinsic load in the first pilot study disappeared in the second pilot study. The contrast between these findings, with the filtering of data noise being the only difference, is striking. Secondly, in similar within-subject studies it is key to accurately assess participant profiles, prior knowledge and language proficiency.
With regard to the comprehension scores, R2 again received the lowest mean score and T3 the highest, with P1 scoring in between (Appendix, Table 12.1). However, a mixed ANOVA revealed no significant main within-subject effects for all participants and no interaction effects (Appendix, Table 12.2). The optimisation of the comprehension tests for this particular within-subject experiment clearly helped. Both the cognitive load ratings and comprehension test scores indicate that the lectures and the comprehension tests are comparable.
Despite these already promising results, we decided to improve the comprehension tests even further to flatten out minor, insignificant differences that may still be present between them. Based on several guidelines (Demeuse & Henry, 2004; Professional Testing, 2020), advice from statisticians at the University of Antwerp on test item analyses and elements of Item Response Theory (Baker, 2001), we decided to disregard a number of questions in each test. Three variables were used to decide which questions to disregard: difficulty, discrimination and reliability.
The difficulty score of a question is the percentage of examinees who answered that question correctly. Since the threshold for what counts as a difficult or an easy question is arbitrary, we adhered to the guidelines of the University of Antwerp: a question is considered difficult if fewer than 10% of the participants answer it correctly, and easy if more than 90% do. Questions that are too difficult or too easy would no longer be considered in future testing. The discrimination score reveals whether a question is in line with what the test assesses, on the assumption that an examinee with a high overall test score has a higher chance of answering any given question correctly. If a question tends to be answered correctly by examinees with lower overall scores while the better examinees tend to answer it incorrectly, that question can be considered non-discriminating and out of line with what is assessed. In our university guidelines, the discrimination score for each question is calculated by subtracting the number of correct answers among the worst-scoring 25% of participants from the number of correct answers among the best-scoring 25%, and dividing the result by the larger of those two numbers. It is advised that questions with discrimination scores below 0.20 be disregarded in future testing. Our university guidelines determine reliability/consistency with the Pearson point-biserial correlation coefficient between the question scores and the total scores, which should ideally be equal to or higher than 0.15. Like the discrimination score, this variable reveals whether the question is in line with what the entire test aims to assess.
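The three item indices described above can be computed with a short script. The sketch below uses made-up 0/1 response data and the thresholds cited above (difficulty outside the 10-90% range, discrimination below 0.20, point-biserial below 0.15); it is an illustration of the procedure, not the project's actual analysis code:

```python
from statistics import mean, pstdev

def item_stats(responses):
    """Per-question difficulty, discrimination and point-biserial reliability.

    responses: one list of 0/1 item scores per participant.
    Returns, per question, the three indices plus how many thresholds it fails.
    """
    n = len(responses)
    totals = [sum(r) for r in responses]
    quarter = max(1, n // 4)
    ranked = sorted(range(n), key=lambda i: totals[i])
    bottom, top = ranked[:quarter], ranked[-quarter:]
    results = []
    for q in range(len(responses[0])):
        item = [r[q] for r in responses]
        difficulty = sum(item) / n  # proportion answered correctly
        top_correct = sum(responses[i][q] for i in top)
        bot_correct = sum(responses[i][q] for i in bottom)
        discrimination = (top_correct - bot_correct) / (max(top_correct, bot_correct) or 1)
        # point-biserial: Pearson r between the 0/1 item scores and the totals
        cov = sum((x - mean(item)) * (t - mean(totals)) for x, t in zip(item, totals)) / n
        sd_i, sd_t = pstdev(item), pstdev(totals)
        r_pb = cov / (sd_i * sd_t) if sd_i and sd_t else 0.0
        flags = sum([difficulty < 0.10 or difficulty > 0.90,
                     discrimination < 0.20,
                     r_pb < 0.15])
        results.append({"difficulty": difficulty, "discrimination": discrimination,
                        "r_pb": r_pb, "flags": flags})
    return results
```

In a real item analysis the quartile split and the handling of ties would follow the institution's exact guidelines; the sketch simply ranks participants by total score and compares the top and bottom quarters.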
When a question was flagged on two of the three variables, we disregarded it in further analyses. We chose to exempt questions rather than delete them outright in order to maintain an equal number of questions and safeguard the similarities between the comprehension tests for the three lectures. Each test thus kept its twelve questions, but only the scores of 10 distinct questions were considered for P1, 10 for R2 and 11 for T3. When we compared the newly weighted average scores of the three tests (Appendix, Table 13.1), they were highly similar. After verifying sphericity with Mauchly's test of sphericity, χ²(2) = 2.503, p = 0.286, a mixed ANOVA again revealed no significant main within-subject effect for all participants, F(2, 88) = 0.469, p = 0.627, and no interaction effects (Appendix, Table 13.2).
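Re-scoring on the retained questions only is then straightforward; in the hypothetical sketch below, the `keep` list of retained question indices is a placeholder for the outcome of the item screening:

```python
def rescore(responses, keep):
    """Proportion-correct score per participant, computed on retained items only.

    responses: one list of 0/1 item scores per participant
    keep: indices of the questions retained after item screening
    """
    return [sum(r[q] for q in keep) / len(keep) for r in responses]
```

Expressing the re-scored results as proportions keeps the three tests comparable even though different numbers of questions (10, 10 and 11) were retained.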
Although this additional enhancement was not required, it clearly strengthened the similarity of the tests in terms of test scores and can be used as another example of adjusting the materials to benefit comparability in within-subject studies.
Since Mellinger and Hanson (2018, p. 16) warned that there might still be order effects despite having counterbalanced orders, we checked whether psychometric ratings, comprehension scores and recoded comprehension scores for each lecture differed when it was watched first compared to when it was watched second or third. No real pattern in cognitive load or comprehension could be detected (see Table 4).
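For three lectures, full counterbalancing means cycling participants through all six possible presentation orders. A generic sketch of such an assignment (participant IDs and lecture labels are illustrative, not the project's actual allocation script):

```python
from itertools import permutations

def assign_orders(participant_ids, items):
    """Cycle participants through every possible presentation order."""
    orders = list(permutations(items))
    return {pid: orders[i % len(orders)] for i, pid in enumerate(participant_ids)}
```

With three lectures there are six orders, so recruiting in multiples of six keeps the design balanced; order effects can then still be checked afterwards, as done above.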

Conclusions
The use of within-subject designs is rather scarce in the body of research into AVT cognition, reception and perception. The aim of this article is not to plead for within-subject studies to take over the world of AVTCRP research. Rather, as Bernardini did in 2001 for TAP-based research, we advocate the more frequent use of within-subject designs in AVT research, conducted alongside between-subject studies. Within-subject designs could give additional insight into the idiosyncratic nature of perception, cognition and comprehension of AVT, and could increase statistical power in studies with limited sample sizes.
However, to safeguard validity, careful preparation and pre-testing of the research materials and experimental set-up (preferably using at least two pilot studies) is key. A within-subject design may minimise the influence of participant characteristics, since the same participants are tested in all conditions, but it carries a higher risk of undesirable influences from the materials or the experimental set-up. In this paper, we proposed a ten-step preparation of a within-subject AVTCRP study, which may guide or inspire future research. Based on the experience gained in the S4AE project, we can conclude that it is rather challenging to develop materials and tools that are comparable in content and language (complexity), style, length, etc. We have also demonstrated the necessity of being cautious with initial subjective or intuitive assessments of comparability and of pre-testing materials and measurement tools using objective measures. Additionally, we have shown that, instead of creating new materials or refurbishing measurement tools, there are other options that allow for valid and reliable results, for example recoding comprehension test scores based on an approach from educational research. Methodological input from the aforementioned field of education, from other domains of Translation Studies (e.g., translation process research) or even from other fields (e.g., psychology) may be useful to guide this pre-testing phase.
We acknowledge that the ten-step proposal needs adaptation depending on specific research goals, as well as further refinement. We also acknowledge limitations in our approach, such as potential bias in the initial preparatory steps and the relatively small participant sub-groups, particularly in the second pilot study. However, we hope that this proposal will spark a disciplinary debate on the use of within-subject (or mixed) designs in AVT research and on the ways in which methodological preparations can be approached.