Automatic assessment of non-native prosody by measuring distances on prosodic label sequences

The aim of this paper is to investigate how automatic prosodic labeling systems can contribute to the evaluation of non-native pronunciation. In particular, it examines how well a group of metrics evaluates the prosodic competence of non-native speakers, based on the information provided by sequences of labels obtained from both native and non-native speech. A set of Sp ToBI labels was obtained by means of an automatic labeling system for the speech of native and non-native speakers who read the same texts. The metrics assessed the differences between the prosodic labels of both speech samples. The results show that the metrics separate the two groups of speakers effectively. Furthermore, they show how non-native speakers (American and Japanese speakers) improved their Spanish productions after a set of listen-and-repeat activities. Finally, this study also shows that the results provided by the metrics are correlated with the scores given by human evaluators to the productions of the different speakers.


Introduction
Computer-assisted pronunciation training (CAPT) systems have proven attractive from both a pedagogical and a commercial point of view. These systems mainly focus on training phonetic pronunciation and pay less attention to prosodic aspects (with the sole exception of fluency). Nevertheless, prosody plays an important role in the evaluation protocols of L2 evaluators; for example, [1] establishes the minimum requirements of prosodic competence to assess the level of Spanish proficiency according to the Common European Framework of Reference for Languages (CEFR).
Several state-of-the-art approaches address the problem of evaluating L2 prosody [2]. These systems compare the prosodic acoustic characteristics of L2 utterances (such as F0, duration and energy) with the corresponding features of native speakers (generally those of a golden speaker who is considered to use the correct pronunciation). These approaches have an important limitation: they under-represent the variety of prosody, since the same prosodic function can be realized with more than one prosodic form [3]. This is challenging for CAPT systems because two prosodic productions of the same text can be different but valid at the same time. To address this problem, we have devised a twofold strategy in this work: on the one hand, we use prosodic labels (rather than prosodic acoustic features directly) to compare utterances; on the other hand, L2 utterances are compared not with those of a single golden speaker but with the productions of a set of reference speakers.
The usefulness of prosodic labels (a set of symbols for transcribing the intonation patterns and other prosodic aspects of utterances) has been well established in the context of L2 assessment [4,5,6]. In this respect, the ToBI system is a broadly accepted framework for the transcription of prosodic phenomena. It was originally developed for English, based on Pierrehumbert's autosegmental model, but it has since been applied to a large number of languages, Spanish among them [7]. In [4], a style identification experiment was presented that used the automatic ToBI labels described in [8]; the results showed 95% accuracy. When an utterance is annotated with prosodic labels, its representation is simplified, since the labels carry symbolic information that specifies the relevant prosodic functions present in the utterance. Automatic prosodic labeling systems are prepared to handle prosodic variety because they are trained with data that reflect the form-function multiplicity. In [9], we used the automatic Sp ToBI classifier presented in [10] to characterize the prosodic style of radio broadcasting by measuring the mutual information between sequences of prosodic labels. In this paper, we follow a similar approach to compute distances between native and non-native speakers, improving the mutual information metric used in [9] and applying a normalization that takes into account the joint entropy of the labels of the different types of speakers. The results show that these new metrics make it possible to identify non-native speakers with statistically significant confidence. The results are also consistent with the improvement in pronunciation expected a priori as the pronunciation exercises are repeated.
Unlike other studies that compare L2 prosodic contours with those of a single golden speaker, in this work we use a group of native speakers (as a whole, not individually) as the reference, as already done in [11]. This is motivated by our previous research in [9], where we showed the high variability that can exist between speakers of the same style when reading the same texts. This fact evidences the limitations of comparing the F0 of utterances with that of a single native speaker. The aim of this new study is to demonstrate that using the minimum and/or maximum distance between the L2 utterances of non-native speakers and the corresponding native utterances yields a better correlation between the objective quality metrics and the subjective scores assigned by human evaluators. Section 2 details the experimental procedure, presenting the corpus, the automatic prosodic labeling system and the metrics. Section 3 presents the results, and the paper ends with a discussion of the potential of prosodic labels to offer information on the limitations of non-native speakers' pronunciation.

The Corpus
In the framework of the SAMPLE research project, a corpus of spoken utterances produced by non-native (L2) speakers of Spanish was developed to support computer-assisted pronunciation training (CAPT) studies. The central part of the corpus includes a set of sentences and paragraphs selected from the news database of a popular Spanish radio news broadcasting station. The texts cover various information domains related to everyday life. They were obtained from the Glissando corpus [12], which was developed in connection with another project related to automatic prosodic labeling. The materials belong to the subset of prosodically balanced sentences in Glissando, which statistically resemble the prosodic variability found in Spanish [12]. The whole SAMPLE corpus is described in [13]. It contains different materials: sentences, Aesop's fable "The North Wind" and news paragraphs. In this study, we focus on the sentences, which were extracted from the news paragraphs of the Glissando corpus [12]. The list of sentences is given in [13] (see Table 1 of that paper). All sentences followed a phonetic coverage criterion. Fourteen speakers were recorded: 9 American English (AM) and 5 Japanese (JP). All of them were students of Spanish at university level. In this paper, we refer to the American speakers as AM1, AM2, AM3, AM4, AM5, AM6, AM7, AM8 and AM9 (corresponding to f01, f02, f04, f05, f06, f07, m08, f09 and f10 in the SAMPLE corpus), where f means female and m means male. Similarly, the Japanese speakers are referred to as JP1, JP2, JP3, JP4 and JP5 (corresponding to m03, f11, f12, f13 and f14 in the SAMPLE corpus).
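There were several repetitions of each of the fifteen sentences (s01-s15). Ten sentences (s01-s10) were read three times by the L2 speakers. Another group of ten sentences (s06-s15) was used for the listen-and-repeat task: a reference utterance of each sentence by a native professional speaker was presented to the non-native speakers, who had to listen and repeat it immediately afterwards. This task was also recorded three times. Therefore, we can define six blocks of sentences:
• BR1: read sentences s01-s10, first reading.
• BR2: read sentences s01-s10, second reading.
• BR3: read sentences s01-s10, third reading.
• BLR1: listen-and-repeat sentences s06-s15, first recording.
• BLR2: listen-and-repeat sentences s06-s15, second recording.
• BLR3: listen-and-repeat sentences s06-s15, third recording.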
The reference sentences of native pronunciation are the corresponding fifteen sentences extracted from the Glissando corpus. As the Glissando corpus includes recordings of eight different professional speakers, we have more than one reference against which to contrast the non-native pronunciation. In this paper, the professional speakers are referred to as SP1, SP2, SP3, SP4, SP5, SP6, SP7 and SP8 (corresponding to f16a, f11r, f13r, f15a, m09a, m10a, m12r and m14r in the Glissando corpus). As before, f means female and m means male. Furthermore, r stands for a radio speaker and a indicates an actor. The procedure followed for the subjective evaluation of the utterances of the corpus is described in [13]. As a result of this procedure, each speaker obtained several numeric scores representing the quality of their pronunciation, taking into account different aspects of both phonetic and prosodic proficiency.

Automatic prosodic labeling
For the labeling of the spoken material, the procedure described in [14] was used. An automatic labeling system was trained with a subcorpus of the Glissando corpus consisting of 60 news items recorded by five professional speakers (12 news items each). These news items include a total of 5,103 pitch accents and 2,835 boundary tones.
The automatic system is a pairwise coupling classifier that combines evidence from three complementary types of classifiers: artificial neural networks (NN), decision trees (DT) and support vector machines (SVM) [10]. To combine the three classification modules (DT, NN and SVM), we used the comprehensive fuzzy technique proposed in [15].
The reference unit for the automatic labeling system is the word. Every word is characterized in terms of prosodic information (F0, energy and duration features) and POS tags, as described in [10]. As a result, we obtain up to two Sp ToBI labels per word: one for the pitch accent and another one for the boundary tone. We use the following Sp ToBI pitch accents: H*}; and the following boundary tones: L%, H%, =%, !H%, LH% ={LH% ∪ L!H%}. Additionally, the label "none" represents the absence of tone.

The metrics
The output of the automatic labeling is a sequence of prosodic labels per utterance. By comparing the sequences of automatic labels that correspond to two different speakers, we should obtain an indication of the similarity of the prosodic productions of both speakers. By computing the mutual information between the sequences of prosodic labels of two speakers (as in [9]), we obtain a value that indicates the quantity of information that the speakers share. As the speakers read the same text, the prosodic sequences of the different speakers should have similar informational content. In this paper, we use metrics based on the mutual information between sequences of labels of native and non-native speakers as a measure of pronunciation quality.
Mutual information is defined as:
$$I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$$
where x and y are the prosodic labels of the utterances read by speakers X and Y, respectively. The higher the similarity between the sequences of labels, the higher the value of I(X; Y).
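As an illustration of the definition above, the following minimal sketch estimates I(X; Y) from two label sequences by counting co-occurrences. It assumes that the two sequences are aligned word by word (both speakers read the same text) and that each word carries a single label; the function name, the base-2 logarithm and the toy labels are assumptions made for the example, not details taken from the paper.

```python
from collections import Counter
from math import log2


def mutual_information(x_labels, y_labels):
    """Estimate I(X;Y) in bits from two word-aligned label sequences."""
    if len(x_labels) != len(y_labels):
        raise ValueError("sequences must be aligned word by word")
    n = len(x_labels)
    joint = Counter(zip(x_labels, y_labels))  # counts of (x, y) pairs
    count_x = Counter(x_labels)               # marginal counts for X
    count_y = Counter(y_labels)               # marginal counts for Y
    # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical toy sequences (not data from the corpus)
speaker_a = ["H*", "none", "H*", "none"]
speaker_b = ["H*", "none", "H*", "none"]
shared_bits = mutual_information(speaker_a, speaker_b)
```

With identical sequences, the estimate equals the entropy of the sequence itself; with statistically independent sequences, it drops to zero.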
A variant of mutual information named variation of information [16] satisfies the properties of a metric (triangle inequality, non-negativity, indiscernibility and symmetry):
$$d(X,Y) = H(X,Y) - I(X;Y)$$
where H(X, Y) is the joint entropy of the two label sequences. It can be normalized as:
$$D(X,Y) = \frac{d(X,Y)}{H(X,Y)}$$
For d(X, Y), D(X, Y) and the normalized version of I(X; Y), the closer the value is to zero, the more similar x and y are.
Table 1 shows the d(X, Y) distances between the different speakers of the corpus (native and non-native) and the native ones. The general tendency is that distances between non-native and native speakers are higher than distances between native speakers. For example, in column SP8, the distances for non-native speakers range from 1.94 to 2.68, whereas the distances for native speakers lie between 1.871 and 1.939. This tendency is magnified when the min and max columns are analyzed: the mean of the min column is 1.93 for non-native speakers (rows AM1..JP5) and 1.71 for native speakers (rows SP1..SP8). In the max column, the mean is 2.20 for non-native speakers and 1.99 for native speakers.
Table 2 shows the mean values and confidence intervals of the cells of distance tables such as Table 1, computed with the four metrics detailed in section 2.3 and applied to the six blocks of sentences detailed in section 2.1 (the full set of 24 tables is not presented for lack of space). The table compares the statistics of native and non-native speakers. As the native speakers did not do the repetitions, the values corresponding to BR1, BR2 and BR3 are the same, as are the values corresponding to BLR1, BLR2 and BLR3. The four metrics show statistically significant differences between non-native and native speakers in all the blocks when Student's t-test is applied (p-value << 0.001). Smaller values for native speakers (higher for I(X, Y)) indicate that the similarity between native speakers is higher than the similarity between non-native and native speakers.
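The distance computation and the min and max columns discussed above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes word-aligned label sequences, base-2 logarithms, the variation of information written as d(X, Y) = H(X, Y) - I(X; Y), and a normalization by the joint entropy H(X, Y). All function names and the toy sequences are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy (bits) of the empirical distribution of `symbols`."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def distances(x_labels, y_labels):
    """Return (d, D): variation of information and its normalized form."""
    h_x = entropy(x_labels)
    h_y = entropy(y_labels)
    h_xy = entropy(list(zip(x_labels, y_labels)))  # joint entropy H(X,Y)
    mi = h_x + h_y - h_xy                          # I(X;Y)
    d = h_xy - mi                                  # assumed form of d(X,Y)
    big_d = d / h_xy if h_xy > 0 else 0.0          # assumed normalization
    return d, big_d


def min_max_distance(speaker, references):
    """Smallest and largest d(X,Y) between one speaker and a reference set."""
    values = [distances(speaker, ref)[0] for ref in references]
    return min(values), max(values)


# Hypothetical toy sequences (not data from the corpus)
learner = ["H*", "H*", "none", "none"]
natives = [["H*", "H*", "none", "none"], ["L%", "H%", "L%", "H%"]]
closest, farthest = min_max_distance(learner, natives)
```

Comparing each learner against the whole reference set, rather than against one golden speaker, is what produces the min and max summaries: the closest native realization and the most distant one.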

Results
Additionally, the distances in Table 2 tend to decrease (increase in the case of I(X, Y)) when the reading activities are repeated: for example, the metric D(X, Y) is 0.592 for block BR1 and 0.588 for block BR3. Values are also generally smaller in the readings performed after the listening activities: for example, the normalized I(X, Y) is 0.432 for BR1 and 0.429 for BLR1.
The mean values in Table 2 show that the normalized versions of the metrics (D(X, Y) and the normalized I(X, Y)) have the highest degree of consistency, so that µ(BR1) > µ(BR2) > µ(BR3), µ(BLR1) > µ(BLR2) > µ(BLR3) and µ(BRi) > µ(BLRi) for i = 1, 2, 3.
Table 3 presents the correlation between the subjective scores assigned to the speakers by human evaluators and the objective distance between the evaluated speakers and the reference native ones. We select the scores assigned to prosody-related variables (Fluency, Accent, Rhythm) and the overall evaluation score named DELE. The correlation ranges from 0.39 to 0.53 in all cases. This correlation increases when the min and max rows are analyzed. Here, min and max indicate, respectively, the correlation between the subjective score and the minimum or maximum distance to any reference native speaker. The correlations range from 0.62 to 0.66 for min and reach 0.79 for max.
Table 4: Automatic prosodic labels obtained from the different speakers' utterances of the sentence "La coalición interpuso esta querella por prevaricación el viernes pasado" (The coalition interposed this complaint for prevarication last Friday).
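The correlation analysis summarized in Table 3 can be sketched as follows. The sketch assumes a Pearson coefficient, which is not confirmed by the text, and uses made-up distances and scores purely for illustration; `pearson`, `score_correlation` and the speaker identifiers are hypothetical names. Because the metrics are distances, better-rated speakers are expected to yield a negative coefficient under this convention.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def score_correlation(distance_rows, scores, reduce=min):
    """Correlate one distance per speaker (min or max over the reference
    speakers) with that speaker's subjective score."""
    speakers = sorted(scores)
    per_speaker = [reduce(distance_rows[s]) for s in speakers]
    return pearson(per_speaker, [scores[s] for s in speakers])


# Hypothetical toy data: distances from each learner to three reference
# speakers, and an invented subjective score per learner.
distance_rows = {
    "L2_a": [1.9, 2.1, 2.0],
    "L2_b": [2.3, 2.6, 2.4],
    "L2_c": [2.0, 2.2, 2.5],
}
scores = {"L2_a": 4.5, "L2_b": 2.0, "L2_c": 3.5}
r_min = score_correlation(distance_rows, scores, reduce=min)
r_max = score_correlation(distance_rows, scores, reduce=max)
```

Passing `reduce=min` or `reduce=max` switches between the distance to the closest and to the most distant reference speaker, mirroring the min and max rows of the table.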

Discussion
The results show that the use of mutual information as a distance measure between speakers, as in [9], is not the best option in this scenario. Instead, it is necessary to consider joint entropy and/or to normalize the metrics in order to increase the reliability of the results.
The four metrics tested in this study are useful to show the separation between the two groups of speakers (native and non-native), and the normalized metrics properly capture the improvements after repetitions.
The results highlight the risks of using a single reference speaker when assessing the quality of non-native prosody. This result was expected, since it is well known that the same sentence can be pronounced with different intonations by different speakers, all of them being valid pronunciations.
To account for prosodic variety and the diversity of possible realizations, it may be advantageous to consider the whole set of reference speakers instead of a single golden speaker. In this paper, we have shown that using the closest or the most distant speaker as the selection criterion is effective. Other aggregation scores will be tested in future work.
The example in Table 4 illustrates why the measures based on mutual information work. The sequences of the native speakers (SP1 to SP8) have more similarities among them (and thus more mutual information) than with the non-native speaker's sequence (JP1). The most revealing differences concern the presence or absence of pitch accents and the location of boundary tones. The monosyllabic function words "el" (the) and "por" (for) have been accented by the non-native speaker with a high tone H*, whereas no native speaker has placed an accent on these words. With respect to boundary tones, at the beginning of the sentence the non-native speaker clearly shows a preference for short prosodic groups, "la coalición / interpuso" (the coalition / interposed), and makes a prosodic mistake by inserting a boundary tone after the function word "esta" (this). This violates the well-formedness of prosodic groups in Spanish, since "esta" operates as a clitic word. In contrast to this phrasing, the native speakers agree in segmenting the sentence after "querella" (complaint).
As far as pitch accents are concerned, the inventory is reduced in the non-native pronunciations, since neither the default accent in Spanish prosody, L+>H* (a rising accent with peak displacement), nor L+!H* (a downstepped rising accent without peak displacement) appears in the data. By contrast, the final rising boundary tone (LH%), which is less used by native speakers, frequently appears in the non-native pronunciation. This can be a case of prosodic transfer. We are currently working on the use of this evidence to identify prosodic mistakes, in order to obtain a diagnosis of the specific problems of each speaker that allows us to give indications for further improvement.

Conclusions and future work
In this work we have presented the use of a set of metrics based on joint entropy for computing distances between sequences of prosodic labels. These metrics have proven effective at discriminating native from non-native utterances. It has also been shown that the metrics correlate with the subjective quality scores and that the computed distances are consistent with the results expected after the repetition exercises.
In future work, we will examine the combination of these metrics with other, possibly complementary, metrics in order to improve the results of the automatic assessment of pronunciation quality. We will also work on the development of a module for the diagnosis of pronunciation deficits that benefits from the expressiveness of the ToBI labels as a standard for representing the relationship between prosodic form and function.