Towards testing post-editing performance: a futureproof diagnostic tool

Abstract
In this article we describe a diagnostic tool for assessing post-editing performance. Although examples of such tools are available, empirical studies are rarely used as a basis for evaluating them. We hope that our tool can help to select translation professionals or translation students suited to post-editing projects by detecting the knowledge, skills or attitudes that are important for this work and that have so far been lacking in candidates' behaviour.


Resum
En aquest article descrivim un instrument de diagnòstic per avaluar la pràctica de la post-edició. Tot i que hi ha exemples d'aquests instruments a l'abast, rara vegada es fan servir els estudis empírics com a base per a avaluar-lo. Esperem que la nostra eina pugui ajudar a seleccionar professionals de la traducció o alumnat de traducció adequats per a projectes de post-edició amb la detecció dels coneixements, les competències o les actituds importants per fer aquesta feina i que fins ara faltaven en el comportament dels aspirants a fer-la.

Resumen
En el presente artículo perfilamos una herramienta de diagnóstico para evaluar la práctica de la posedición. No es habitual que se utilicen estudios empíricos como base para su evaluación, a pesar de disponer de ejemplos de tales instrumentos. Esperamos que nuestra herramienta ayude a seleccionar profesionales de la traducción o alumnado de traducción adecuado para proyectos de posedición con la detección de conocimientos, competencias o actitudes importantes para este trabajo que no se encontraban hasta ahora en el comportamiento de los aspirantes a desempeñarlo.

Introduction
In recent years, the topics of machine translation (MT) and post-editing (PE) have cut an exceptionally wide swath in Translation Studies. These topics have gained prominence in large part because the quality of MT output has improved significantly with the advent of phrase-based statistical MT and, more recently, neural MT. Previous research has shown that post-editing of MT output can be less time-consuming than "from scratch" human translation with no negative effect on product quality (e.g. Green et al., 2013; Daems et al., 2016; Plitt and Masselot, 2010). This gives MT the potential to be integrated into human translation workflows in ways that ultimately change the degree of human intervention in translation production. In addition, language service providers and software developers have found ways to tap into new markets where texts of suboptimal quality do not impede communication and do not affect (business) relationships (see Massardo and Van der Meer, 2017). For these reasons, it is safe to assume that MT and PE have been true sources of disruption, not just in the traditional sense (disturbance, commotion) but also in the technical sense of the word (for a detailed discussion of "disruptive (technological) change", see Christensen, 1997).
On the other hand, despite considerable technological advances made in the past few years, improvements in MT quality and the use of MT in PE tasks have been broadly in line with industry predictions. In some respects, the integration of MT into the translation workflow can be seen as just another step in the direction of (further) automation of translation services, a process that was set in motion decades ago (see Bowker and Fischer, 2010). The same can be said of the alleged popularity of (target) texts of suboptimal quality. It has been claimed that, as the cultural focus shifts from the written word to the multimodal digital environment, reader expectations of linguistic quality have been in decline, and clients and other stakeholders are becoming more willing to make do with low(er)-quality translations (Massardo and Van der Meer, 2017).
While, to our knowledge, longitudinal trends in reader expectations have not to date been confirmed by strong empirical evidence, the fact remains that MT diversifies the notion of quality and gives new status to fitness for purpose and content perishability as quality criteria (see Way, 2018). From this angle, 'disruptive' may no longer apply to the use of MT in the industry, as this is now increasingly commonplace.
Regardless of one's perspective on or understanding of disruption, it remains undeniable that finding language professionals who are not only qualified, but also willing to incorporate MT into their processes is a crucial aspect of coping with increasing demands for post-edited content (Global Market Research, 2016; Lommel and DePalma, 2016). The issue of willingness is difficult to tackle: for decades, PE has often been regarded as an activity that consisted of little more than "cleaning up" (Paice, 2017). Even as PE becomes a service in its own right (ISO, 2017) and machine learning continues to improve, it will not be easy to overcome this bad reputation, which is potentially why some stakeholders, like TAUS, veer away from the term "post-editing". At the same time, the line between translation and post-editing is increasingly blurred. In addition, new means have been developed to remove the drudgeries from the PE process and to improve human-computer interaction, so that the post-editor can focus more on improving the target text in a more authentic "translation-like" manner that involves text-tailoring rather than correcting basic errors (e.g. automatic PE and interactive MT) (O'Brien and Moorkens, 2014; Forcada, 2016). We believe these recent developments call for a new take on post-editing performance.
In the remainder of this article, we first seek to formulate an answer to the question "who should become post-editors?" by revisiting some considerations and loci classici in PE research that can be said to form the theoretical backbone of our proposed PE performance test. We then sketch a brief outline of the test. Before concluding, we discuss how we hope to capitalise on the potential of this tool, and how we hope it will address the industry's needs in finding and selecting motivated post-editors.

Who should become post-editors?
The question of who should become post-editors has been lingering in the translation literature for some time. When advances in Artificial Intelligence were brought to bear on translation, some argued that the end of translation was near, and that PE held great promise (see Kenny, 2016). According to Pym, the time had come for a career switch for translators: "MT […] is destined to turn most translators into posteditors one day, perhaps soon" (Pym, 2013: 488).
The answer to the above question seemed obvious: translators should become post-editors. However, as previously argued, PE never seems to have enjoyed a good reputation among translators. Some translators took an instant dislike to it because it was slow (if the MT output was poor), repetitive and often very cognitively demanding, as it mainly involved extensive corrections of recurrent, basic linguistic errors (O'Brien, 2013, 2017; O'Brien and Moorkens, 2014).
As a result, the language industry seemed urged to source specialists in other domains, awaken new potential in the form of translation trainees and/or find the few translators that were less MT-averse. Now that the lines between traditional translation and PE are becoming increasingly blurred, the question demands reinvigorated attention. PE has not entailed the end of (specialised) translation. Rather than a peripheral activity, it is growing to become part of the core of translation practice (see Koponen, 2016a). Those who are willing to become post-editors have more working options in specialised translation.
Irrespective of the labels used to refer to those who incorporate MT output into their processes (e.g. "post-editor", "revisor" or simply "translator"), PE's expected move from the outskirts to the core of professional translation necessitates new ways of testing and measuring PE skills. However, the process of designing a PE test is fraught with difficulties. The main difficulties lie in defining the profile of the post-editor and the concept of PE competence. To date, there have been no attempts to empirically test (let alone corroborate) a competence model for PE. In the next section, we will present an attempt at sidestepping this issue.

A diagnostic tool
Although no PE competence model has been proposed, let alone validated, to date, PE has been the object of many empirical studies that can be used to piece together a reasonable and experimentally reliable basis for measuring PE performance (see below).
Still, it should be noted that, given the absence of a validated model of PE competence, the test design introduced below should be regarded as an incremental attempt at developing a diagnostic instrument rather than as a final proposal.
In the following subsections, we briefly outline what we envisage as the test's key modules. One crucial factor has not been included in this discussion: domain knowledge. As in most translation services, familiarity with the subject is believed to stand a post-editor in good stead. However, it is difficult to cover domain knowledge given the test's intended role as a general tool. As it stands, test takers would be assumed to have knowledge of the domain in which they are likely to operate, which those administering the test may wish to check via other means. Still, we believe that the tool will allow for customisation in the future.

Module 1: Keyboarding skills
It is common sense that good keyboarding skills are indicative of good PE performance, as they are likely to increase one's productivity (Sigla et al., 2014). In this module, we will check test takers' keyboarding efficiency. They will be presented with a rough MT passage and a corresponding edited version of the text that will act as a reference. Candidates' goal will be to match the reference with as few edits as possible. Their keyboarding and editing time will be measured. High typing speed and narrow differences between "minimal" and "real" edit distance (i.e. between the rough MT output and the reference version) will act as indicators of desirable performance in this module.

Module 2: Problem-solving/Decision-making skills
Like good keyboarding skills, the ability to quickly assess a situation and make a decision is important in PE (e.g. Offersgaard et al., 2008). Therefore, this module asks candidates to decide whether proposed MT solutions require any editing and where.
Candidates will not need to actually edit the MT output or justify the decision to flag a fragment; they will simply mark the parts of the MT output that require intervention.
This module will assume as a goal a level of edited quality that would be indistinguishable from human translation carried out from scratch, which will also be the case in the next module. As well as shedding light on candidates' decision-making patterns, this module is expected to give some insight into candidates' attitude to PE.
Negative attitudes would be expected to entail potential over-editing (De Almeida, 2013), which translation companies often mention as an issue (Vieira and Alonso, 2018).
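By way of illustration only, the marking task in this module could be scored by comparing the fragments a candidate flags with a reference set of fragments that genuinely require intervention. The sketch below assumes that such a reference set is available and that fragments are identified by simple numeric ids; the function name `flagging_scores` and the choice of precision and recall as measures are assumptions made for this example, not part of the test design described here.

```python
def flagging_scores(flagged, needs_edit):
    """Compare a candidate's flagged fragment ids with a reference set.

    Both arguments are iterables of hypothetical fragment ids.
    Low precision hints at over-flagging; low recall hints at under-flagging.
    """
    flagged, needs_edit = set(flagged), set(needs_edit)
    true_pos = len(flagged & needs_edit)
    # By convention here, an empty set scores 1.0 to avoid dividing by zero.
    precision = true_pos / len(flagged) if flagged else 1.0
    recall = true_pos / len(needs_edit) if needs_edit else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "over_flagged": sorted(flagged - needs_edit),  # flagged without need
        "missed": sorted(needs_edit - flagged),        # needed but not flagged
    }


if __name__ == "__main__":
    print(flagging_scores([1, 2, 3, 5], [2, 3, 4]))
```

Under these assumptions, a candidate who flags many fragments that the reference leaves untouched would show low precision, which might be read as a tendency towards over-editing.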

Module 3: Editing skills
A post-editor must be able to strike a happy medium between under-editing and over-editing. This can be problematic for several reasons. First, it is not always easy to distinguish between necessary and unnecessary changes to the MT output (Guzmán, 2007; De Almeida, 2013). Second, empirical research has shown that PE is cognitively demanding, and that over-editing not only reduces productivity but also harms text quality (e.g. Vieira, 2017; Koponen, 2016b). In this module, particular attention will be paid to candidates' editing skills. They will be asked to post-edit a short text without receiving editing-specific instructions other than that they should use as much of the MT output as possible in bringing the target text to human-level quality. This module will record candidates' keystrokes, which are hoped to offer a glimpse into the editing style of the candidate, and editing time at text and segment levels, which will give an idea of the speed at which the candidate can generate a post-edited text.
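The comparison between "minimal" and "real" editing effort mentioned for the keyboarding module, and the concern with over-editing raised here, can both be made concrete with a character-level edit distance. The following is a minimal sketch, assuming that the keystroke log can be reduced to a count of character-level edit operations; the function names and the `logged_edits` parameter are hypothetical and do not describe the logging tool's actual output.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of character insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


def surplus_edits(mt_output: str, final_text: str, logged_edits: int) -> int:
    """Difference between the edits actually made (assumed to be counted
    from a keystroke log) and the minimal edits that were required."""
    return logged_edits - levenshtein(mt_output, final_text)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(surplus_edits("the cat sit", "the cat sits", 5))
```

A surplus close to zero would indicate that the candidate reached the final text with little detour, whereas a large surplus might point to repeated retyping or unnecessary changes. Whether a word-level or character-level distance is more informative is left open here.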

Module 4: Perception of productivity
In recent years, it has been repeatedly reported that there can be a striking disparity between one's PE productivity and one's perception of PE productivity and that participants tend to be more productive than they think they are (Teixeira, 2014; Koponen, 2012, 2016a, 2016b). In some cases, the underestimation of one's productivity might be indicative of a negative attitude to MT and PE (Teixeira, 2014). In this module, candidates will be asked about their PE experience in Module 3, which might provide some further insight into candidates' attitude to, and readiness for, PE tasks.
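To illustrate how perceived and measured productivity might be set against each other, the sketch below assumes that the questionnaire asks candidates to estimate their throughput in words per hour and that word counts and editing times are available from Module 3. Both the questionnaire format and the function names are assumptions made for the sake of the example.

```python
def words_per_hour(word_count: int, seconds: float) -> float:
    """Measured throughput from a logged word count and editing time."""
    return word_count * 3600 / seconds


def perception_gap(perceived_wph: float, word_count: int, seconds: float) -> float:
    """Relative gap between self-estimated and measured throughput.

    Negative values mean the candidate underestimated their productivity.
    """
    measured = words_per_hour(word_count, seconds)
    return (perceived_wph - measured) / measured


if __name__ == "__main__":
    # Hypothetical candidate: 300 words edited in 30 minutes, estimate of 450 words/hour.
    print(words_per_hour(300, 1800))
    print(perception_gap(450, 300, 1800))
```

In this hypothetical case the candidate's estimate falls a quarter below the measured figure, the kind of underestimation that the studies cited above associate, in some cases, with a negative attitude to MT and PE.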

Module 5: Following guidelines
The success of PE projects depends greatly on the service provider's ability to deliver a text that is tailored to the client's (and end-user's) needs. While this is true of all translation tasks, when MT comes into the picture, these needs arguably vary across a wider spectrum of uses and quality levels. For a long time, the main qualitative distinction that was made between different PE services was the distinction between light (or rapid) and full PE. In practice, however, this dichotomy risks over-simplifying the problem. In their comparative review of the literature, Hu and Cadwell (2016) illustrated that describing PE as a simple "two-stop" service might be artificial and impossible to implement. Recent customer studies on MT, PE and usability arrived at similar conclusions. These studies suggest that every communicative situation requires a different level of quality and thus a different approach to PE (Pluymaekers and Van Egdom, 2016; Van Egdom, 2017). In the present module, therefore, candidates edit different but comparable texts by following text-specific guidelines. These guidelines may vary according to different contexts and client requirements. Test administrators may wish to use two or more texts/sets of guidelines, but a maximum of two texts is suggested to prevent a situation where candidates might conflate the instructions.
Based on the edited products and keyboard and time logs (tracked with SDL Qualitivity), it will be possible to observe candidates' tendency to stick to the guidelines provided.

Module 6: Perception of productivity
From Module 3 to 5, it may be that candidates improve their editing efficiency by editing as little as possible but as much as necessary to follow the guidelines. This would indicate an initial lack of procedural knowledge of PE rather than an unwillingness to use MT or an intrinsic lack of PE skills or aptitude. In the present module, candidates will be required to fill out the questionnaire on perceived productivity one more time. This time, candidates are asked to reflect on their PE productivity in Module 5 specifically, with questions that pertain to each text. It is hoped that results from this questionnaire might offer a glimpse of a learning effect developing between Modules 3 and 5. As seen in previous research (CasMaCAT, 2015), learning curves in human-machine interaction should not be overlooked. However, it should be noted that the design of the PE test does not allow for a longitudinal study of candidates' behaviour.

Module 7: Background questions
At the end of the test, candidates will be asked to provide background information (age, gender, nationality, mother tongue, prior professional experience, translation experience). This final module is added with the sole purpose of gathering relevant research data that enables us to flesh out (a) consistent, coherent and cogent PE profile(s).

Properties and uses
This PE performance test can serve many purposes. Translation students can be asked to take the test, which would allow teachers to gain a good handle on students' level of PE competence and their ability to function in a market that is highly susceptible to automation. The test can also be taken by experienced freelance translators who are considering offering PE as a service and want to know if this is a viable option for them. Lastly, but not exhaustively, the test can be of use to translation agencies that seek suitable PE candidates.
To fully capitalise on the potential of this diagnostic tool, the test report will not act as a summative result. In other words, it is not envisaged as a tool for establishing whether a candidate fits the description of a post-editor, but rather as a means of flagging knowledge, skills or attitudes relevant to post-editing that is/are found to be lacking in a candidate's PE performance.
At the time of writing, the modular results still need to be manually processed and pieced together. However, for the test to be fully geared to the needs of the above-mentioned target groups, automatic generation of a test report should be made possible. For this reason, we aim to develop an integrated approach that allows the results to be brought together in an automatically generated report where the candidate will find information on the criteria for measuring PE performance, plots with the candidate's personal scores and any personalised advice. Furthermore, to ensure the so-called "pragmatic validity" of our product, we hope to collaborate with potential buyers and translation institutions to develop customised, domain-specific modules.
Lastly, it should be noted that the reliability and validity of the information about candidates' PE performance hinges on findings yielded through empirical studies of PE.
To date, some skills remain under-researched and benchmarks for subtask performance may still be inaccurate or non-existent; other PE-related skills might become less relevant as a logical consequence of technological progress. We should therefore underline the fact that the strength of the test probably lies in its incremental design, which allows for its application in future translation contexts as well.
In the future, when the test can be said to meet the expectations of all parties, we hope to complement the test with tailor-made training material that can help remediate specific weaknesses in a candidate's performance in making use of MT.

Conclusion
In this article, we have provided a rationale and an outline for a diagnostic tool for testing PE performance. As MT becomes a prevalent feature of translation environments, the demand for PE skills will continue to rise in the (near) future. We argue that educators and language service providers could benefit from psychometrically sound tools to measure the PE skills of (aspiring) post-editors. This can help them identify suitable candidates for PE projects and flag relevant knowledge, skills or attitudes that is/are found to be lacking in candidates' PE behaviour. In addition, it is hoped that, through PE testing, we can yield relevant data that can shed more light on PE efficacy and help us set new benchmarks for PE performance.
Translation is acquiring many aspects of editing where translators are expected to interact with an array of existing textual suggestions and resources. We believe this trend is likely to continue and that effective decision-making and the ability to quickly judge the usefulness of various machine- and/or human-produced textual alternatives will become even more important in years to come (see Pym, 2013). Irrespective of the