Human Evaluation of NMT & Annual Progress Report: A Case Study on Spanish to Korean

This paper proposes the first evaluation of NMT in the Spanish-Korean language pair. Four types of human evaluation (Direct Assessment, Ranking Comparison, and MT Post-Editing (MTPE) time/effort) and one semi-automatic method are applied. The NMT engine under study is Google Translate in the newswire domain. Assessed by six professional translators, the engine demonstrates 78% performance and a 37% productivity gain in MTPE. Additionally, 40.249% of the engine's outputs were modified over an interval of 15 months, showing an 11% progress rate.


Introduction
The birth of Machine Translation (MT) dates back to March 4, 1947, when Warren Weaver framed translation as a problem of encoding and decoding (Weaver, 1949: p.16). Starting from Rule-based MT (RBMT), the field has witnessed two major turning points. The first came when Brown et al. (1988) presented Statistical MT (SMT).
Instead of crafting linguistic rules as in the previous approaches, the focal point of SMT was exploiting annotated data and matching equivalences. The second wave came from a technological angle: around 2014, Neural MT (NMT) was showcased (Bahdanau et al., 2014). In NMT, SMT's original concept of exploiting data remained, but the core technology originated from the field of Artificial Intelligence (AI). With its growing viability, just two years after its debut it became commercially available, starting with Google Translate (Wu et al., 2016), and spread at a remarkable rate.
The baseline technology of NMT is Artificial Neural Networks (ANNs), one of the Machine Learning approaches advocated by connectionists, who approach AI by imitating the interactions of the human brain (Domingos, 2015). Simply put, axons and dendrites transmit and receive chemical and electrical signals, adding or subtracting them via so-called action potentials (British Neuroscience Association, 2003). A neuron must reach a certain limit in order to fire and send the signals that strengthen its connections. This process is translated into the binary system of Computer Science as follows: a function f(x) decides the bond between nodes by weakening (y = 0) or strengthening (y = 1) the value, thereby updating information (Russell and Norvig, 1995). This is the gist of the threshold theory.
Frank Rosenblatt proposed a single-layer ANN by introducing the new concept of 'weight' into this theory, naming it the "perceptron".
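For illustration, the threshold behaviour described above can be sketched in a few lines of Python. The weights, inputs and threshold values here are arbitrary toy numbers, not drawn from any actual system.

```python
# Illustrative sketch of a threshold unit (perceptron-style): the unit
# "fires" (y = 1) only when the weighted sum of its inputs reaches the
# threshold; otherwise it stays silent (y = 0).

def threshold_unit(inputs, weights, threshold):
    """Return 1 if the weighted input sum reaches the threshold, else 0."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation >= threshold else 0

# A unit with weights [0.6, 0.6] and threshold 1.0 behaves like logical AND:
print(threshold_unit([1, 1], [0.6, 0.6], 1.0))  # fires: 1
print(threshold_unit([1, 0], [0.6, 0.6], 1.0))  # does not fire: 0
```

Rosenblatt's contribution was precisely that these weights need not be fixed by hand but can be learned from data.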
Since then, various types of ANNs have been developed and tested in a number of AI tasks, including MT. It started from hybrid SMT architectures in the realm of n-gram language models (Bengio et al., 2003; Schwenk, 2007). Furthermore, Devlin et al. (2014) applied ANNs in the decoding step as a fully integrated component applicable to any decoder. Upon their success, in 2014 several groups presented purely ANN-based MT models (Sutskever et al., 2014; Bahdanau et al., 2014; Cho et al., 2014), opening a new era in MT. As such, NMT was distinct in origin from the traditional MT paradigms.

Objective
Not only did the robustness of NMT have a significant impact on the MT field; its influence was also evident in the humanities. This study evaluates the engine with four human evaluation methods: Fluency Scoring, Adequacy Scoring, Ranking Comparison, and MTPE time/effort. The HTER score is also proposed as a semi-automatic metric that can mediate the methodological imbalance. In addition, a progress rate of the intended model is estimated by calculating how much its outputs have changed over the period of roughly a year. To the best of our knowledge, the current study is the first to apply multifaceted, standardized MT evaluation methods not only to the Korean language but also to the Spanish-Korean language combination.
In this respect, we differ from the studies presented in Chapter 1.3.
This paper constitutes a part of the doctoral thesis of Kim (2019), which includes an NMT performance evaluation as well as an error analysis. In the present work, we extend the evaluation study with an annual progress report. More intriguing than the upsurge of interest in NMT in Korea (Figure 1) was the little attention paid to SMT (Kim, 2015: p.34).

Fluency Scoring
In this test, an annotator is asked to judge whether the translation "can be sensibly interpreted by a native speaker" (Görög, 2014). He/she is instructed to give a rating on the 4-point scale of Table 2, sentence by sentence.

Adequacy Scoring
This test is based on the same architecture as the Fluency Scoring, but it concerns a different aspect of the sentence. An annotator is asked to "capture to what extent the meaning in the source text is expressed in the translation" (Görög, 2014).
As such, the current method takes both the source and target texts into consideration.
The rating scale is likewise 4-point, as given in Table 3.

Ranking Comparison
This test allows an indirect judgment of the engine by contrasting it with two other candidates: Kakao i (an online NMT engine) and a human translator. Provided anonymously with the three translations, an annotator ranks them from best to worst, with the possibility of a tie. The rankings are then converted to 3, 2 and 1 points, respectively, for the final score.
MTPE effort concerns temporal, cognitive and technical dimensions (Koponen, 2012). Edit distance is used to measure technical MTPE effort (Tatsumi, 2009).
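For illustration, the 3, 2 and 1 point scheme of the Ranking Comparison can be sketched as follows. How tied candidates are scored is our assumption here (tied candidates receive the points of their shared rank); the paper does not spell this out, and the example rankings are invented.

```python
# A minimal sketch of the 3/2/1 ranking scheme used in the Ranking
# Comparison. Tie handling (tied candidates share the points of their
# rank) is an assumption for this sketch.

def ranking_points(ranking):
    """ranking: list of (candidate, rank) pairs, rank 1 = best.
    Rank 1 earns 3 points, rank 2 earns 2, rank 3 earns 1."""
    points = {1: 3, 2: 2, 3: 1}
    return {cand: points[rank] for cand, rank in ranking}

totals = {"Google": 0, "Kakao i": 0, "Human": 0}
per_sentence = [
    [("Human", 1), ("Google", 2), ("Kakao i", 3)],
    [("Google", 1), ("Human", 1), ("Kakao i", 3)],  # a tie for first place
]
for sentence_ranking in per_sentence:
    for cand, pts in ranking_points(sentence_ranking).items():
        totals[cand] += pts
print(totals)  # {'Google': 5, 'Kakao i': 2, 'Human': 6}
```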

HTER
TER (Translation Error Rate) measures the similarity between a system translation and a reference translation by calculating the minimum number of deletions, insertions, substitutions and shifts (reorderings) (Snover et al., 2006). Going one step further, Human-mediated TER (HTER) improved TER's correlation with human judgment by filling the linguistic gap between the two texts (Snover et al., 2009). HTER replaces the reference translation with multiple MTPEs created intentionally for this purpose, commonly referred to as targeted references (Snover et al., 2009). The minimum score is selected and normalized by the number of words in the targeted reference.
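For illustration, a simplified word-level TER can be computed as below. This sketch covers only deletions, insertions and substitutions via edit distance; the shift operation of full TER is omitted for brevity, so the resulting score is an upper bound on real TER.

```python
# Simplified word-level TER: edit distance (insertions, deletions,
# substitutions) normalized by reference length. Full TER additionally
# searches for block shifts, which this sketch omits.

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def simplified_ter(hyp, ref):
    """Edits needed to turn hyp into ref, per reference word."""
    return edit_distance(hyp.split(), ref.split()) / len(ref.split())

print(simplified_ter("the cat sat", "the cat sat on the mat"))  # 0.5
```

HTER is then the same computation taken against the targeted reference instead of an independent reference translation.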

Dataset
One of the biggest challenges of the Spanish-Korean pair is a lack of parallel corpora.
They are so limited that a large part of the corpora available on popular platforms like Wikipedia or OPUS may already have been employed in the training of publicly popular MT systems. To alleviate this concern, we collected first-hand data and hired a professional human translator to create its reference translation. A total of 253 Spanish sentences were extracted from the Politics sections of three major outlets: ABC, El País and KBS World Radio. The main topic of all 11 articles is election-related. An example of the articles is given below for the readers' information.
The size of dataset is detailed in Table 4.

Fluency & Adequacy
Table 5 presents the Fluency & Adequacy scores of each annotator. The Google NMT system obtained mean scores of 3.12 for Fluency and 3.108 for Adequacy out of 4, equal to 78% and 77.7% respectively. With a margin of 0.3 percentage points, the engine was judged to be more fluent than adequate.
Taking a closer look, Figure 2 and Figure 3 illustrate the distributions of the Fluency and Adequacy scores. In the Ranking Comparison, the engine obtained the lowest ranking score (out of 3) and was considered the worst candidate, with 28.17% of preference, as in Table 6. In the meantime, it came to our attention that when a distinction between human and machine was drawn, the annotators preferred the two system translations (58.22%) to the human translation (41.78%). The reasons were unclear, but some possible scenarios are speculated upon in Kim (2019).
Subsequently, the result was organized by machine and ranking choice in Figure 4 and Figure 5.

MTPE Time & Effort
In comparison to translation from scratch (TS), Table 7 shows that MTPE was 37% faster on average, within a range of 12% to 53%. We could not, however, interpret the real significance of this 37% productivity gain, as no standard has yet been established in this regard. Groves and Schmidtke (2009) reported gains of 6.1% to 28.7%, while the cases of Plitt and Masselot (2010) and Skadina and Pinnis (2017) reached up to 118%. The closest study to ours was Zhechev (2014), who obtained an 81.93% gain in the English-Korean pair. Given these previous studies, it was too soon to declare that MTPE would always be more recommendable than TS in our language pair and environment, but MTPE was clearly more time-efficient than TS in our study. In relation to effort reduction in MTPE, the temporal effort was far lower for short sentences (l <= 13) and became higher from sentences of l = 31 onward, as in Figure 6. The tendency, however, was not proportionate. When measured with MTPE throughput (words per hour, WPH), no clear-cut correlation was observed, as in Figure 7. Interestingly though, all sentences required a certain degree of MTPE effort, with a minimum of 380 WPH.
Hence, it was estimated that MTPE was efficient in sentences of l <= 13 but inefficient in those of l >= 31. Such a finding also coincided with the comments of the annotators in Kim (2019: p.134-138).
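For illustration, the throughput (WPH) and productivity-gain figures discussed above can be computed as follows. The formulas are the conventional definitions, assumed rather than taken from the paper, and the sentence timings are invented.

```python
# Back-of-the-envelope productivity figures (assumed definitions):
# throughput in words per hour and the relative time saving of MTPE
# over translation from scratch (TS).

def wph(words, seconds):
    """Throughput in words per hour."""
    return words / (seconds / 3600)

def productivity_gain(ts_seconds, mtpe_seconds):
    """Relative time saved by post-editing instead of translating from scratch."""
    return (ts_seconds - mtpe_seconds) / ts_seconds

# Hypothetical 20-word sentence: 120 s from scratch vs 75.6 s post-edited.
print(round(wph(20, 75.6)))                    # 952 words per hour
print(round(productivity_gain(120, 75.6), 2))  # 0.37, i.e. a 37% gain
```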
In terms of the technical effort measured by edit distance, Table 8 shows that 25.9% of the dataset hardly required any MTPE. It was also noticeable that not a single sentence had to be entirely deleted and translated from scratch (d = 10) to satisfy the translation quality. We acknowledge, however, a potential bias in this result due to the characteristics of Korean as an agglutinative language, whose word-spacing unit does not match its part-of-speech tagging (Song and Park, 2020). For example, in Table 10, the first word (우리는) is composed of "we" (우리-) and the subject case marker (-는). A back translation to English is given for the readers' information.
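For illustration, the word-spacing bias can be demonstrated with a toy example. The Korean sentences and the hand-made morpheme segmentation below are ours, purely for exposition: at the level of space-delimited words, editing only a one-syllable marker counts as a whole-word substitution, which inflates the normalized edit distance.

```python
# Toy demonstration of the word-spacing bias in agglutinative Korean:
# changing only the case marker (-는 to -가) is one edit either way, but
# the normalization base differs between word and morpheme units.

def edit_distance(a, b):
    """Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

mt = "우리는 투표했다"   # "We voted", with the marker -는
pe = "우리가 투표했다"   # post-edit changes only the marker to -가

# Space-delimited words: half of the sentence appears edited.
print(edit_distance(mt.split(), pe.split()) / len(pe.split()))   # 0.5

# Hand-split morphemes: only one unit in four actually changed.
mt_morphs = ["우리", "는", "투표", "했다"]
pe_morphs = ["우리", "가", "투표", "했다"]
print(edit_distance(mt_morphs, pe_morphs) / len(pe_morphs))      # 0.25
```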

Annual Progress Report
The aforementioned experiment gave us insight into the status quo of the Google NMT engine in the Spanish-Korean pair, which can be summarized as follows:
• The Direct Assessment of the engine confirmed 78% performance.
• The comparative study suggested that, if human parity is defined as a system translation being ranked first, the given engine achieved 16.54% human parity.
• MTPE was 37% faster than TS. It was especially effective for short sentences.
Considering the past performance of SMT in Table 1, the fact that NMT achieved positive results at all was remarkable. At this point, we came to inquire how fast, and in what direction, NMT would grow further. To this end, the performance of NMT was compared chronologically at two points in time: November 2018 and February 2020. To the best of our knowledge, such a temporal approach is new in this area. From this mini-task, two questions were answered:
• How much did the result change?
• Were those changes positive or negative?
The two versions of the system translation (labelled Old and New), based on the same dataset as the evaluation experiment in Chapter 3 (6,426 words in the newswire domain), were prepared and analyzed. The comparison was based on the edit distance of TER on a sentence basis in TAUS DQF. Additionally, a Ranking Comparison was carried out manually, in which the author acted as the sole annotator.
As shown in Figure 8, a certain level of modification was performed throughout the whole dataset, with the exception of two sentences (TER = 0.0). The largest modification was witnessed in one case, with 80% of changes (TER = 0.8), as shown in Table 12. The modifications were witnessed not only in the lexicon but also in syntax.

Positivity
In the Ranking Comparison, the better option between the New and the Old was directly selected on a sentence basis. It turned out that, excluding one erroneous sentence, the New was preferred about 11% more often than the Old, at 55.65% versus 44.35% (Table 13).
13 sentences were of equal value. With these results at hand, our study confirmed a strong possibility of progress in the given engine. Taking a comprehensive stance, we conclude that understanding the meaning of a text with this engine is guaranteed in this setup. Our study showed that NMT is a breakthrough technology for this language combination. It also gives hope that MT is tearing down the language barrier. The question remains: is the performance good enough to replace TS with MTPE? There is no doubt that MTPE is strongly recommended, but at this level we encourage MTPE only for short sentences of up to 13 words.
As a mini project, we examined the progress rate of Google Translate over a period of 15 months. Compared to 2018, the engine made 40.35% of modifications to the same dataset in 2020, according to TER. From a quick comparison test, an 11% progress rate was estimated for the engine. Taking all of this into account, we expect a bright future for NMT in the Spanish-Korean language combination.

Future Research
Given the circumstance that the linguistic barrier between the two languages is now considerably resolved, we assume that automatic evaluation of NMT will be of utmost value. Our first aim is to organize an automatic evaluation of NMT in this language pair, with a larger dataset and, hopefully, more annotators. We are also interested in comparing the performance of NMT and SMT in the given environment.