Applying a fuzzy classifier to generate Sp ToBI annotation: preliminary results

One of the goals of the Glissando research project¹ is to enrich a radio news corpus [1] with Sp ToBI labels. In this paper we present the application of the automatic predictions of a fuzzy classifier to speed up the labeling process. The strategy is proposed after completing the following steps: a) manual annotation of a part of the Glissando corpus with Sp ToBI labels and checking of the coherence of the labels; b) training of the automatic system; c) validation or correction of the automatic system's predictions by a human expert. The automatic judgments of the classifier are enriched with confidence measures that are useful to represent uncertain situations concerning the label to be assigned. The main aim of the paper is to show that there exists a correspondence between the uncertain situations that are identified during an inter-transcriber experiment and the uncertain situations that the fuzzy classifier detects. The reduction in labeling time encourages the use of this strategy.


Introduction
Prosodic labeling aims to enrich spoken utterances with labels that are representative of the relationship between the prosodic form and function of the constituents of the message. Although prosodic labeling systems establish clear rules and protocols, the difficulty of the task and the inherent subjectivity of the labelers' judgments introduce a high number of inconsistencies. Prosodic labeling systems assume that uncertain situations can appear and reserve special symbols for representing them (like the symbol '?' in ToBI [2] and RaP [3], or the explicit computation of the transcriber disagreement in RPT [4]). Leaving aside these well-known difficulties of manual annotation, inter-transcriber consistency tests have identified cases where two different transcribers assign different labels to the same prosodic event. The use of alternative tiers for capturing ambiguities was suggested in [5]. These facts suggest the existence of an area of uncertainty across the categories, due to the perceptual and acoustic similarity of some pairs of labels [6]. Fuzzy set theory [7] has been widely used to represent those situations where it is difficult to classify a given element into the different possible categories.
It must be noted that recognizing uncertainty in the task of identifying some labels is not equivalent to saying that the prosodic categories used in the ToBI framework are fuzzy categories, since they are based on the description of the intonational phonology of the language and, as a consequence, each of them is related to a clear phonological content. However, the process of annotation (either manual or with the aid of semiautomatic tools) has shown that the resulting labels can carry uncertain information when they have to be associated with the acoustic signal. In [8] we showed how fuzzy sets can be used to represent situations where assigning a class to a given prosodic unit is difficult because of the high degree of uncertainty. Those experiments used the BURNC corpus [9], one of the most important references for studies on automatic ToBI prosodic labeling in English. In this paper, we present the application of the same fuzzy classifier to the news subset of the Glissando corpus [1], which aims to be a reference for studies on Spanish prosody.
¹ This work has been partially supported by Ministerio de Ciencia e Innovacion, Spanish Government (Glissando projects FFI2011-29559-C02-01,02).
Since manual annotation is a time-consuming process and very costly in terms of human resources, efforts have to concentrate on developing tools for automatic prosodic labeling or, at least, on helping the experts to speed up the process [10]. The state of the art in automatic prosodic labeling reports identification rates higher than 90% in binary decisions, such as the presence or absence of an accent, boundary or break. However, when the system is faced with the classification of pitch accents, boundary tones or levels of breaks, the rates decrease dramatically to about 70% (see [11] for a review of the state of the art). In [12], we showed that the reasons for these low accuracy rates are the high similarity among some pairs of classes and the imbalanced nature of prosodic corpora. As expected, the difficulties of manual annotation are reflected in how successful automatic approaches to prosodic labeling are. This paper aims to show that the use of a fuzzy classifier considerably increases the performance when soft classification is performed, and that there exists a correspondence between the uncertain situations that are identified during the inter-transcriber experiments and the uncertain situations that the fuzzy classifier detects. This correspondence is a reason to consider the application of the tool as a good strategy to speed up the process of manual annotation.
The strategy involves the following steps: a) manual annotation of part of the Glissando corpus with Sp ToBI labels and checking of the coherence of the labels; b) training of the automatic system; c) correction of the automatic system's predictions by a human expert. Section 2 describes the process of manual labeling of the training corpus and the quality assessment procedure that has been applied. Section 3 presents the architecture and methods of the fuzzy classifier. The small number of editing operations in the revision process (as detailed in Section 4) is evidence of the quality of the fuzzy predictions. Additionally, we show that the most uncertain predictions correspond to labels that are also problematic in the inter-transcriber consistency tests. We end with conclusions and the future work of this ongoing research.

The Sp ToBI manually labeled subcorpus
The Glissando news subcorpus contains recordings of eight different Spanish speakers, each of them reading more than 36 news items [1]. For our purposes, two of these speakers were chosen, taking into account differences in gender (i.e. male and female) and reading style (i.e. radio speaker and advertisement actor). The labeled corpus consists of 1100 seconds of read news speech recorded by two professional speakers: 12 news items read by a radio professional (female voice) and 12 news items read by an advertising professional (male voice). These news items include a total of 3202 words (7091 syllables) labeled with 2058 pitch accents, 1115 boundary tones and 1029 breaks. The news data set has been annotated using the Sp ToBI labels proposed in [16,17], with the modifications advanced in [18] and some adjustments needed for the speaking style, contained in the guidelines distributed at http://veus.glicom.upf.edu/. The tonal inventory is adapted to the specific phenomena pertaining to declarative utterances of a news data set in terms of a reduction of the tonal inventory and the definition and representation of boundary tones. In particular, the tag =% is associated with those cases where the pitch keeps the previous tone value (i.e. sustained pitch), and the parentheses stand for allotonic variations of L+H* and L+>H* (that is, when the fall is not perceived in the pre-stressed syllable, (L+)H* and (L+>)H* are used).
The procedure was perceptually based: the transcriber was encouraged to focus preferentially on perception. Her task consisted of listening carefully to the utterance in order to (a) mark the subjective sense of disjuncture between each pair of words and before each pause (break tier) and (b) mark prominences and tonal events (tone tier).
Since the ToBI framework is phonologically driven, various methods of estimating the consistency and stability of the labels assigned to the corpus were applied: (i) periodic meetings to define guidelines for annotating read news; (ii) discussion and resolution of differences in transcription throughout a six-month period; and (iii) validation of consistency among transcribers with an inter-reliability experiment.
In order to measure the confidence of the annotation, four experts independently labeled the same news item (108 words) read by a professional speaker. Pair-wise comparisons and values of the kappa index support the stability of the labels among transcribers in the main categories, but they also show that there is confusion among others. The results of the inter-transcriber consistency test can be seen in Table 1. Values of the kappa index between 0.6 and 0.8, like the ones we obtained, are commonly considered as substantial agreement. These consistency rates are comparable with the ones reported in similar studies for the prosodic labeling of other corpora in different languages (see Table 1). Uncertainty exists, which is the main argument that supports the use of a fuzzy classifier.
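The pair-wise agreement measure mentioned above can be illustrated with a minimal sketch. It assumes Cohen's kappa computed for one pair of transcribers; the text does not state which kappa variant was used, and the label sequences below are hypothetical toy data, not taken from the Glissando corpus.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two transcribers who labeled the same units."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both transcribers must label the same non-empty set of units")
    n = len(labels_a)
    # Observed agreement: fraction of units with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both transcribers used one single identical label
    return (observed - expected) / (1.0 - expected)

# Hypothetical pitch accent labels from two transcribers (illustration only).
transcriber_1 = ["L+H*", "L+H*", "none", "H*", "none", "L+!H*"]
transcriber_2 = ["L+H*", "L+!H*", "none", "H*", "none", "L+!H*"]
kappa = cohen_kappa(transcriber_1, transcriber_2)
```

With four transcribers, the same function would be applied to each of the six possible pairs.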

Automatic labeling with uncertainty
We approach the automatic labeling of ToBI events as a multi-class classification problem. The goal of multi-class classification is to assign a ToBI label to a given prosodic unit, which is typically a word or a syllable. The multi-class approach contrasts with binary classification, where the goal is to determine whether an accent or a boundary is present or not in the given prosodic unit.
In [11] we showed that multi-class identification of ToBI labels can be done efficiently by using pairwise coupling classification. By means of pairwise coupling, the complex multi-class classification problem is divided into several simpler problems. The basic idea is that it is easy for the machine to assign a label when only two classes are considered possible. For example, it is easy to assign the label L+H* or the label L* to a prosodic unit when these two classes are the only alternatives. However, assigning the label L+H* when the alternatives are H*, !H*, H+!H*, L+!H*, L* and L*+H is a much more challenging task. Our proposal is to combine several two-class classifiers (one for every pair of possible labels) in order to achieve the multi-class classification, because two-class problems provide higher accuracy. Furthermore, in [12] we observed that different types of classifiers behave differently in the classification of different tones. Decision trees seem to specialize in the identification of the most populated classes, while neural networks tend to balance the number of predicted labels across classes. This trade-off in terms of the number of samples in every class is important in automatic prosodic labeling because prosodic corpora are naturally imbalanced. For example, in the BURNC corpus, 77% of the words are labeled with two of the eight possible labels [12]. In this work, the complementarity between artificial neural network (NN), decision tree (DT) and support vector machine (SVM) classifiers has been exploited to improve the final system, combining their outputs using a fusion method.
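The decomposition into two-class problems described above can be sketched as follows. This is a simplified illustration that combines the pairwise decisions by majority voting, which is not necessarily the coupling rule used in [11]. The nearest-mean learner, the single numeric feature and the toy samples are hypothetical stand-ins for the DT, NN and SVM classifiers and the acoustic features of the paper.

```python
from collections import Counter
from itertools import combinations

def train_pairwise(samples, labels, fit_binary):
    """Train one two-class model for every pair of labels."""
    models = {}
    for a, b in combinations(sorted(set(labels)), 2):
        subset = [(x, y) for x, y in zip(samples, labels) if y in (a, b)]
        models[(a, b)] = fit_binary(subset)
    return models

def predict_by_voting(models, x):
    """Each two-class model votes for one label; the most voted label wins."""
    votes = Counter(model(x) for model in models.values())
    return votes.most_common(1)[0][0]

def fit_nearest_mean(subset):
    """Hypothetical two-class learner on a single numeric feature."""
    means = {}
    for label in {y for _, y in subset}:
        values = [x for x, y in subset if y == label]
        means[label] = sum(values) / len(values)
    return lambda x: min(means, key=lambda label: abs(x - means[label]))

# Toy data: one made-up feature value per prosodic unit (illustration only).
samples = [0.1, 0.2, 1.0, 1.1, 2.0, 2.2]
labels = ["L*", "L*", "H*", "H*", "L+H*", "L+H*"]
models = train_pairwise(samples, labels, fit_nearest_mean)
prediction = predict_by_voting(models, 1.9)
```

With k labels, k(k-1)/2 two-class models are trained, so the seven pitch accent labels listed above would need 21 models.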
In order to combine the decision scores that result from the three classification modules (DT, NN and SVM), we used the comprehensive fuzzy technique proposed in [8]. The fuzzy integral technique has proven useful for combining classifiers in several contexts [19,20,21,22,23,24]. We use the implementation of the Sugeno fuzzy integral [25] as described in [26]. As a result, each ToBI category is assigned to the words of the corpus with a degree of certainty. This degree of confidence is a numeric value in the [0,1] interval (the higher the value, the greater the certainty). The α-cut approach makes it possible to reduce the set of candidate labels assigned to every word. The α-cut value is empirically assigned as explained in [8].
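A minimal sketch of the fusion step follows, assuming the textbook definition of the Sugeno integral (the maximum, over classifiers ranked by decreasing score, of the minimum between the score and the fuzzy measure of the coalition formed so far). The fuzzy measure values, the per-classifier scores and the α value are hypothetical; the implementation described in [26] may derive the measure differently.

```python
def sugeno_integral(scores, measure):
    """Sugeno integral of per-classifier scores with respect to a fuzzy measure.

    scores:  dict mapping classifier name to its support in [0, 1] for one label.
    measure: dict mapping frozenset of classifier names to a value in [0, 1];
             it must be monotone and give 1.0 to the full set.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    best = 0.0
    for i in range(1, len(ranked) + 1):
        coalition = frozenset(ranked[:i])
        best = max(best, min(scores[ranked[i - 1]], measure[coalition]))
    return best

def alpha_cut(certainties, alpha, max_labels=3):
    """Keep the labels whose certainty reaches alpha, highest certainty first."""
    kept = [(label, c) for label, c in certainties.items() if c >= alpha]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_labels]

# Hypothetical fuzzy measure over the three classification modules.
measure = {
    frozenset(["DT"]): 0.40, frozenset(["NN"]): 0.35, frozenset(["SVM"]): 0.45,
    frozenset(["DT", "NN"]): 0.70, frozenset(["DT", "SVM"]): 0.80,
    frozenset(["NN", "SVM"]): 0.75, frozenset(["DT", "NN", "SVM"]): 1.00,
}
# Hypothetical supports given by each module to three boundary tone labels.
supports = {
    "H%": {"DT": 0.60, "NN": 0.40, "SVM": 0.50},
    "!H%": {"DT": 0.30, "NN": 0.50, "SVM": 0.45},
    "L%": {"DT": 0.10, "NN": 0.10, "SVM": 0.05},
}
certainties = {label: sugeno_integral(s, measure) for label, s in supports.items()}
candidates = alpha_cut(certainties, alpha=0.4)
```

In this toy case two labels survive the cut with similar certainty, which is the kind of output the reviewer is later asked to disambiguate. Each label is scored independently, so the certainties need not sum to 1.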
In order to train the classifiers, we have applied a multifold approach that divides the corpus into two sections: 90% training and 10% test. Some categories show a very low number of instances, so we decided to group them with similar types, thereby creating particular classes. To do that, we display the inter-label distances in a Multidimensional Scaling (MDS) 2D plot, following the perspective adopted in [27]. This MDS map is built from the confusion matrix of a decision tree classifier: the greater the inter-class confusion, the closer the labels in the map. This plot allows experts to make a decision regarding the different categories. The closest categories are good candidates to be collapsed into an alternative category. As a result, we use the following
The input of the classifier is composed of acoustic information (F0, energy and duration features) and POS tags, as detailed in [11]. More details about this system can be found in [28].
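The idea behind the class-merging step can be sketched without the MDS projection itself: the pairs that would appear closest in the map are those with the highest mutual confusion, so they can be ranked directly from the confusion matrix. The symmetric confusion rate used here is an assumption, since the text does not specify how confusion is converted into distance, and the matrix is hypothetical.

```python
from itertools import combinations

def merge_candidates(labels, confusion):
    """Rank label pairs by symmetric confusion rate, most confused first.

    confusion[i][j] counts units of true class i predicted as class j.
    """
    totals = [sum(row) for row in confusion]
    pairs = []
    for i, j in combinations(range(len(labels)), 2):
        # Average of the two directed confusion rates between classes i and j.
        rate = 0.5 * (confusion[i][j] / totals[i] + confusion[j][i] / totals[j])
        pairs.append((labels[i], labels[j], rate))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)

# Hypothetical confusion matrix of a decision tree (illustration only).
tone_labels = ["L+H*", "L+!H*", "L*"]
confusion = [
    [80, 18, 2],
    [20, 75, 5],
    [3, 2, 95],
]
ranked_pairs = merge_candidates(tone_labels, confusion)
```

The first entries of the ranking are the pairs an expert would inspect first when deciding which low-frequency categories to collapse.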

Procedure of revision of the automatic system's predictions
The fuzzy classifier has been applied to unseen samples of the Glissando news subcorpus read by voices different from those used in the manual annotation. A subset of 18 news items (6 news items read by each of 3 different voices, 2 female and 1 male) has been annotated by means of the predictions of the fuzzy classifier, and a human expert has reviewed all the tags. Figure 1 illustrates the graphical interface used to present the automatic system's predictions so that a human expert can verify or correct them. The visual interface aligns the tags predicted by the fuzzy classifier with each prosodic event: pitch accents are aligned with the stressed syllables and boundary tones with the end of the word. The classifier predicts the presence or absence of a break (with the tag "none") and, if a prosodic rupture exists, the type of boundary tone. In contrast, information related to the levels of prosodic rupture (break indices) is not presented. This information can be inferred from the type of boundary tone, since in our results there is a statistically significant association between boundary tone type and break index: BI3 is associated with H%, !H%, =% and L!H%, and BI4 with L% (Pearson's chi-squared test, X-squared = 742.1301, df = 5, p-value < 2.2e-16). There are only marginal cases in which L% is associated with BI3, and in which H%, !H%, =% or L!H% are associated with BI4.
Table 2: Inter-transcriber agreement per symbol is the percentage of times that two of the transcribers agree in assigning the same symbol to the same prosodic unit. Unique label predictions is the percentage of times that the fuzzy classifier predicts only one symbol per prosodic unit.
Compared with conventional crisp classifiers, the main advantage of a fuzzy classifier is that it can provide more than one label per prosodic unit (up to a maximum of three in our system), depending on the uncertainty of the predictions. Each tag is accompanied by a numerical value in the [0,1] interval (the higher the value, the greater the certainty), and tags are ordered from the highest to the lowest degree. At this point, it should be noted that the procedure, in accordance with fuzzy set theory, is not based on probabilities, since the degree of certainty is assigned independently to each category. This is the reason why the values can sum to more than 1: if three tags appear, we cannot infer that there are three complementary possibilities that sum to 1. Instead, having more than one tag in the output represents a difficult situation in which more than one label shows a degree of certainty over the threshold set by the α-cut. Another possible situation is that, even when only one tag is predicted, it is not necessarily accompanied by complete confidence (marked with 1).
The task of the human expert is to evaluate the candidates: he/she checks whether one of the proposed tags is the right one according to his/her perception and marks it with "+". If he/she does not agree with any of the proposed tags, he/she attaches "-" and writes a new option. Figure 1 illustrates different types of situations that the human expert can find in the process of reviewing the output of the fuzzy classifier and that reflect the uncertainty involved in using a phonologically-based system such as ToBI. With respect to boundary tones, the most certain decision is the label "none" associated with the words "una" (a) and "puede" (can): only one label is predicted, with a high degree of certainty (0.78). In this case, the tag "none" (meaning absence of a prosodic break and, as a consequence, absence of a boundary tone) proposed by the system is validated (+), since neither tonal nor segmental cues signal a prosodic break.
On the other hand, at the end of the prosodic group "una persona" (a person) the system suggests two candidates, associated to a very similar degree of certainty: a high tone, H% (0.49) or a mid tone, !H% (0.43). Crucially, the transcriber has to decide if the difference of range is phonologically significant in this context: since the transcriber perceives that the tone decreases to a mid-tone from an L+H* nuclear pitch accent, the second option is selected.
As far as pitch accents are concerned, the difficulty in discriminating between rising accents with or without peak displacement is evidenced in the word "puede" (can). As observed in the pitch track, the F0 peak is situated at the syllable border, that is, both the stressed and the post-stressed syllables show a high tone, a fact that makes the decision difficult. The transcriber chooses the rising tone without peak displacement (L+H*), because it is generally accepted that, for the accent with peak displacement, the high tone should be completely placed in the post-stressed syllable [29].
It may also happen that the proposals of the fuzzy classifier are wrong. This is the case for the pitch accent corresponding to the word "saber" (know). The system proposes two tags: "none", meaning absence of pitch accent, and a high tone (H*). Since the stressed syllable is tonally accented, but with a rising accent and not with a high tone, the transcriber dismisses both options and writes the correct one, which is L+!H*. The downstep relates to the immediately preceding high tone within the same prosodic group.
At this point, it should be said that adequate tuning of the α-cut value (which determines the number of candidates presented in the interface) is crucial for correct system operation. Lowering the value of the α-cut yields a higher number of positive cases, where a positive case is one in which the right label is found within the set of predicted labels (computed as the soft classification rate in [8]). For the reviewer it is important to know that the probability of having the right tag in the set of candidates is high; on the other hand, the more labels are presented, the harder the selection of the correct one. In our case, in the training stage we obtained soft classification rates of 82% for pitch accents and 85% for boundary tones. These rates are clearly higher than the accuracy rates that we obtain in classic non-fuzzy classification (69.2% and 81.2% respectively). The increase in the confidence rates is expected to improve the performance of the reviewing process.
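The soft classification rate described above can be sketched as the fraction of prosodic units whose reference label appears anywhere in the candidate set. The top-1 accuracy shown next to it is only a stand-in for the non-fuzzy accuracy, which in the paper comes from crisp classifiers. The candidate sets and reference labels below are hypothetical.

```python
def soft_classification_rate(candidate_sets, reference):
    """Fraction of units whose reference label is among the candidates."""
    hits = sum(ref in cands for cands, ref in zip(candidate_sets, reference))
    return hits / len(reference)

def top1_accuracy(candidate_sets, reference):
    """Fraction of units whose first (most certain) candidate is the reference."""
    hits = sum(cands[0] == ref for cands, ref in zip(candidate_sets, reference))
    return hits / len(reference)

# Hypothetical classifier output, candidates ordered by decreasing certainty.
candidate_sets = [["L+H*"], ["H%", "!H%"], ["none", "H*"], ["none"]]
reference = ["L+H*", "!H%", "L+!H*", "none"]

soft_rate = soft_classification_rate(candidate_sets, reference)
crisp_rate = top1_accuracy(candidate_sets, reference)
```

Lowering the α-cut enlarges the candidate sets, so the soft rate can only stay equal or grow, at the cost of more options for the reviewer to inspect.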
In the reviewed subset of 18 news, only one label is predicted in the majority of the cases (60% for boundary tones and 45.3% for pitch accents), and few cases have three labels (3.4% for boundary tones and 2.5% for pitch accents).

Results
The revision process shows that in most cases (81.8% for boundary tones and 72.6% for pitch accents) the labeler chooses the first candidate of the fuzzy classifier as the right option. Only 9.2% of the boundary tone labels and 13.5% of the pitch accent labels needed to be edited, that is, corrected with a different label. In a preliminary test, the real-time ratio of the labeling process was observed to be 1:66 when the template of predictions is used, in contrast with 1:80 without any supporting template. These ratios have been obtained by comparing the time that a transcriber needs to manually label a news item with the time that he/she needs to label a news item of comparable size with the aid of the fuzzy classifier: in the first case, 44.79 s of speech were labeled in 3600 s (including self-checking after the first annotation), whereas in the second case, 45.23 s of speech were labeled in 3000 s (also including self-checking). The experiment was carried out by the same transcriber on separate days, during an uninterrupted stretch of time.
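The reported real-time ratios can be checked with a short calculation, reading the figures as 44.79 s of speech labeled in 3600 s without the template and 45.23 s of speech labeled in 3000 s with it. This reading is the one consistent with the 1:80 and 1:66 ratios. The function name is introduced here for illustration only.

```python
def real_time_ratio(speech_seconds, labeling_seconds):
    """Seconds of labeling work needed per second of speech."""
    return labeling_seconds / speech_seconds

manual_ratio = real_time_ratio(44.79, 3600)    # about 80.4, i.e. 1:80
assisted_ratio = real_time_ratio(45.23, 3000)  # about 66.3, i.e. 1:66

# Relative reduction in labeling time per second of speech.
time_saving = 1 - assisted_ratio / manual_ratio
```

Under this reading, the template reduces the labeling time per second of speech by roughly 17% in this single preliminary test.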
Another encouraging result is that we found a clear correspondence between the results obtained by the fuzzy automatic classifier and the results obtained in the inter-transcriber agreement tests. In the inter-transcriber test, the pairs of labels most frequently confused by the human experts are L+H* vs. L+!H* for pitch accents and "none" (absence of a break) vs. !H% for boundaries (26.9% and 34.4% of the total disagreement, respectively). These pairs also have the highest frequency of appearance when the fuzzy classifier predicts more than one label per word (21.8% and 29.9%, respectively). Table 2 shows that the percentages are also similar for the certain cases: for each tag, the frequency with which the fuzzy classifier predicts a unique label is similar to the frequency of inter-transcriber agreement per symbol.

Conclusions and future work
We have presented the results of an experiment in which an automatic system has been applied to the Sp ToBI labeling of the Glissando corpus. The automatic system can generate more than one candidate label per prosodic unit according to its degree of confidence. The revision of the candidate labels provides an alternative way to speed up the labeling process.
The efficiency of this strategy to speed up the labeling process is supported by the fact that only a small proportion of the predicted labels is edited. Furthermore, in most cases, the reviewer selects the first label out of the set of predictions.
The use of fuzzy labels adequately reflects the uncertainty that characterizes the human prosodic labeling process in many situations. This is evidenced by the fact that the most uncertain situations for the automatic classifier correspond to the labels that are most frequently confused in manual inter-transcriber tests. This is part of ongoing work in which an iterative training and testing process is being applied in order to improve the predictions. The reviewed labels are reintroduced in the training stage of the classifier so that the knowledge of the system increases iteratively. We are currently investigating definitions of quality metrics that measure the goodness of this iterative approach. The improvement of the revision template interface (currently in Praat) following the suggestions of the transcribers is also current and future work.
The results of the labeling process (manual, reviewed and predicted labels in the different stages) are expected to be freely available for research purposes on the web page of the project, http://veus.glicom.upf.edu/.