This chapter is not a description of how each automatic annotation is implemented or how it works: the references are available for that specific purpose!
Instead, this chapter describes how each automatic annotation can be used in SPPAS, i.e. what is the goal of the annotation, what are the requirements, what kind of resources are used, and what is the expected result. Each automatic annotation is then illustrated as a workflow schema, where:
At the end of each automatic annotation process, SPPAS produces a Procedure Outcome Report that aims to be read!
Among others, SPPAS is able to automatically produce annotations from a recorded speech sound and its orthographic transcription. Let us first introduce the exact meaning of "recorded speech" and "orthographic transcription".
When using the Graphical User Interface, the file format for input and output can be fixed in the Settings and is applied to all annotations, and the file names of each annotation are already fixed and can't be changed. When using the Command-Line interface, or when using scripts, each annotation can be configured independently (file format and file names). In all cases, the names of the tiers are fixed and can't be changed!
First of all, only wav, aiff and au audio files, and only mono, are supported by SPPAS.
SPPAS verifies whether the wav file is 16 bits with a 16000 Hz sample rate; otherwise, it automatically converts it to this configuration. For very long files, this conversion may take time, so it is preferable to prepare the files in this format before launching SPPAS.
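If you prefer to prepare the files beforehand, the following sketch, which relies only on the Python standard library and is not part of SPPAS, converts a wav file to 16-bit, 16000 Hz, mono:

import wave, audioop

def to_16bit_16k_mono(src_path, dst_path):
    # read all frames and the source parameters
    with wave.open(src_path, "rb") as src:
        nchannels, width, rate = src.getnchannels(), src.getsampwidth(), src.getframerate()
        frames = src.readframes(src.getnframes())
    if nchannels == 2:                         # down-mix stereo to mono
        frames = audioop.tomono(frames, width, 0.5, 0.5)
    if width != 2:                             # convert samples to 16 bits
        frames = audioop.lin2lin(frames, width, 2)
    if rate != 16000:                          # resample to 16000 Hz
        frames, _ = audioop.ratecv(frames, 2, 1, rate, 16000, None)
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(16000)
        dst.writeframes(frames)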
Secondly, a relatively good recording quality is expected. Providing a precise guideline or recommendation for that is impossible, because it depends on the annotation: "IPU segmentation" requires a better quality than what is expected by "Alignment", and for the latter, it depends on the language.
Only UTF-8 encoding is supported by SPPAS.
Clearly, there are different ways to pronounce the same utterance. Different speakers have different accents and tend to speak at different rates. There are commonly two types of speech corpora. The first is related to "Read Speech", which includes book excerpts, broadcast news, lists of words and sequences of numbers. The second is often named "Spontaneous Speech", which includes dialogs between two or more people (including meetings), narratives (a person telling a story), map-tasks (one person explains a route on a map to another) and appointment-tasks (two people try to find a common meeting time based on individual schedules). One of the characteristics of spontaneous speech is an important gap between a word's phonological form and its phonetic realizations. Specific realizations due to elision or reduction processes are frequent in spontaneous data. It also presents other types of phenomena such as non-standard elisions, substitutions or additions of phonemes which intervene in the automatic phonetization and alignment tasks.
Consequently, when a speech corpus is transcribed into a written text, the transcriber is immediately confronted with the following question: how to reflect the orality of the corpus? Transcription conventions are then designed to provide rules for writing speech corpora. These conventions establish phenomena to transcribe and also how to annotate them.
In that sense, the orthographic transcription must be a representation of what is "perceived" in the signal. Consequently, it must include:
In speech (particularly in spontaneous speech), many phonetic variations occur. Some of these phonologically known variants are predictable and can be included in the pronunciation dictionary but many others are still unpredictable (especially invented words, regional words or words borrowed from another language).
SPPAS is the only automatic annotation software that deals with Enriched Orthographic Transcriptions.
The transcription must use the following convention:
SPPAS also allows the transcription to include:
The result is what we call an enriched orthographic transcription, from which two transcriptions are derived automatically: the standard transcription (the list of orthographic tokens) and a specific transcription, named the faked transcription, from which the phonetic tokens are obtained to be used by the grapheme-phoneme converter.
This is + hum... an enrich(ed) transcription {loud} number 1!
The derived transcriptions are:
The "IPUs segmentation" automatic annotation can perform 3 actions:
The IPUs Segmentation annotation performs a silence detection from a recorded file. This segmentation provides an annotated file with one tier named "IPU". The silence intervals are labelled with the "#" symbol, while speech intervals are labelled with "ipu_" followed by the IPU number.
The following parameters must be fixed:
Minimum volume value:
If this value is set to zero, the minimum volume is automatically adjusted for each sound file. Try this automatic adjustment first; then, if the automatic value is not correct, fix it manually. The Procedure Outcome Report indicates the value the system chose. The SndRoamer component can also be of great help: it indicates the min, max and mean volume values of the sound.
Minimum silence duration (in seconds): By default, this is fixed to 0.2 sec., an appropriate value for French. This value should be at least 0.25 sec. for English.
Minimum speech duration (in seconds): By default, this value is fixed to 0.3 sec. The most relevant value depends on the speech style: for isolated sentences, 0.5 sec is probably better, but it should be about 0.1 sec for spontaneous speech.
Speech boundary shift (in seconds): a duration which is systematically added to speech boundaries, to enlarge the speech interval.
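The following sketch, a minimal illustration in Python and not the SPPAS implementation, shows how these four parameters could drive a naive energy-based silence/speech segmentation (the function name and default values are placeholders):

import wave, audioop

def naive_ipu_segmentation(path, min_volume=1000, min_sil_dur=0.2,
                           min_ipu_dur=0.3, shift=0.02, win=0.01):
    with wave.open(path, "rb") as w:
        rate, width = w.getframerate(), w.getsampwidth()
        data = w.readframes(w.getnframes())
    step = int(rate * win) * width      # bytes per analysis window (10 ms)
    rms = [audioop.rms(data[i:i + step], width) for i in range(0, len(data), step)]
    # group contiguous windows of the same class (speech / silence) into intervals
    intervals, start = [], 0
    for i in range(1, len(rms) + 1):
        if i == len(rms) or (rms[i] > min_volume) != (rms[start] > min_volume):
            label = "speech" if rms[start] > min_volume else "#"
            intervals.append([start * win, i * win, label])
            start = i
    # discard intervals shorter than the minimum silence/speech durations
    intervals = [itv for itv in intervals
                 if itv[1] - itv[0] >= (min_sil_dur if itv[2] == "#" else min_ipu_dur)]
    # enlarge each speech interval by the boundary shift
    for itv in intervals:
        if itv[2] == "speech":
            itv[0], itv[1] = max(0.0, itv[0] - shift), itv[1] + shift
    return intervals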
The procedure outcome report indicates the values (volume, minimum durations) that were used by the system for each sound file given as input. It also mentions the name of the output file (the resulting file). The file format can be fixed in the Settings of SPPAS (xra, TextGrid, eaf, ...).
The annotated file can be checked manually (preferably with Praat rather than Elan or Anvil). If these values were not correct, then delete the annotated file that was previously created, change the default values and re-annotate.
Notice that the speech segments can be transcribed using the "IPUScribe" component.
Inter-Pausal Units segmentation can also consist in aligning macro-units of a document with the corresponding sound.
SPPAS identifies silent pauses in the signal and attempts to align them with the inter-pausal units proposed in the transcription file, under the assumption that each such unit is separated by a silent pause. This algorithm is language-independent: it can work on any language.
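As a purely illustrative strategy, and not necessarily the exact SPPAS procedure, one could search for a silence threshold that yields as many speech intervals as there are units in the transcription, reusing the naive_ipu_segmentation sketch given earlier:

def match_ipus_to_units(wav_path, n_units, thresholds=range(250, 4001, 250)):
    # hedged sketch: relies on the naive_ipu_segmentation function sketched above
    for threshold in thresholds:
        speech = [itv for itv in naive_ipu_segmentation(wav_path, min_volume=threshold)
                  if itv[2] == "speech"]
        if len(speech) == n_units:
            return speech
    return None      # no tested threshold yielded the expected number of IPUs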
In the transcription file, silent pauses must be indicated using one or both of the following solutions, which can be combined:
A recorded speech file must strictly correspond to a transcription file, whose extension is expected to be .txt. The segmentation provides an annotated file with one tier named "IPU". The silence intervals are labelled with the "#" symbol, while speech intervals are labelled with "ipu_" followed by the IPU number and then the corresponding transcription.
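For illustration, and assuming for this example that silences are marked in the .txt transcription with the "#" symbol, such a transcription could look like:

this is the first inter-pausal unit #
and + this is the second one #
a third and last unit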
The same parameters as those indicated in the previous section must be fixed.
Important: this segmentation was tested on documents no longer than one paragraph (about 1 minute of speech).
IPU segmentation can split the sound into multiple files (one per IPU), and it creates a text file for each of the tracks. The output file names are "track_0001", "track_0002", etc.
Optionally, if the input annotated file contains a tier named exactly "Name", then the content of this tier will be used to fix output file names.
In the example above, the automatic process will create 6 files: FLIGHT.wav, FLIGHT.txt, MOVIES.wav, MOVIES.txt, SLEEP.wav and SLEEP.txt. It is up to the user to perform another IPU segmentation of these files to get a file format other than txt (xra, TextGrid, ...), thanks to the previous section "Silence/Speech segmentation time-aligned with a transcription".
Tokenization, also known as "Text Normalization", is the process of segmenting a text into tokens. In principle, any system that deals with unrestricted text needs the text to be normalized. Texts contain a variety of "non-standard" token types such as digit sequences, words, acronyms and letter sequences in all capitals, mixed case words, abbreviations, roman numerals, URLs and e-mail addresses... Normalizing or rewriting such texts using ordinary words is then an important issue. The main steps of the text normalization proposed in SPPAS are:
For more details, see the following reference:
Brigitte Bigi (2011). A Multilingual Text Normalization Approach. 2nd Less-Resourced Languages workshop, 5th Language & Technology Conference, Poznań (Poland).
The SPPAS Tokenization system takes as input a file including a tier with the orthographic transcription. The name of this tier must contain one of the following strings:
The first tier that matches is used (case insensitive search).
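As an illustration of this matching, a minimal helper could look like the following sketch (the function and argument names are hypothetical; the actual strings to match are those listed above):

def find_transcription_tier(tier_names, patterns):
    # return the first tier whose name contains one of the expected strings,
    # searched case-insensitively; return None if no tier matches
    for name in tier_names:
        if any(p.lower() in name.lower() for p in patterns):
            return name
    return None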
By default, it produces a file including only one tier with the tokens. To get both the faked and standard transcription tiers, check the corresponding option!
Read the "Introduction" of this chapter to understand the difference between "standard" and "faked" transcriptions.
Phonetization, also called grapheme-phoneme conversion, is the process of representing sounds with phonetic signs.
SPPAS implements a dictionary-based solution which consists in storing a maximum of phonological knowledge in a lexicon. In this sense, this approach is language-independent. The SPPAS phonetization process is the equivalent of a sequence of dictionary look-ups.
The SPPAS phonetization takes as input an orthographic transcription previously normalized (by the Tokenization automatic system or manually). The name of this tier must contain one of the following strings:
The first tier that matches is used (case insensitive search).
The system produces a phonetic transcription.
Actually, some words can correspond to several entries in the dictionary with various pronunciations; all these variants are stored in the phonetization result. By convention, spaces separate words, dots separate phones and pipes separate phonetic variants of a word. For example, the transcription utterance:
the flight was twelve hours long
is phonetized as:
dh.ax|dh.ah|dh.iy f.l.ay.t w.aa.z|w.ah.z|w.ax.z|w.ao.z t.w.eh.l.v aw.er.z|aw.r.z l.ao.ng
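A small illustrative parser for this convention, not part of SPPAS, could be:

def parse_phonetization(line):
    # spaces separate words, pipes separate variants, dots separate phones
    return [[variant.split(".") for variant in word.split("|")]
            for word in line.split()]

# parse_phonetization("dh.ax|dh.ah|dh.iy f.l.ay.t")[0]
# returns [['dh', 'ax'], ['dh', 'ah'], ['dh', 'iy']]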
Many other systems assume that all words of the speech transcription are mentioned in the pronunciation dictionary. On the contrary, SPPAS includes a language-independent algorithm which is able to phonetize unknown words of any language as long as a dictionary is available! If such a case occurs during the phonetization process, a WARNING mentions it in the Procedure Outcome Report.
For details, see the following reference:
Brigitte Bigi (2013). A phonetization approach for the forced-alignment task, 3rd Less-Resourced Languages workshop, 6th Language & Technology Conference, Poznań (Poland).
Since the phonetization is only based on the use of a pronunciation dictionary, the quality of such a phonetization only depends on this resource. If a pronunciation is not as expected, it is up to the user to change it in the dictionary. All dictionaries are located in the sub-directory "dict" of the "resources" directory.
SPPAS uses the same dictionary format as proposed in VoxForge, i.e. the HTK ASCII format. Here is a piece of the eng.dict file:
THE [THE] D @
THE(2) [THE] D V
THE(3) [THE] D i:
THEA [THEA] T i: @
THEALL [THEALL] T i: l
THEANO [THEANO] T i: n @U
THEATER [THEATER] T i: @ 4 3:r
THEATER'S [THEATER'S] T i: @ 4 3:r z
The first column indicates the word, followed by the variant number (except for the first one). The second column indicates the word between brackets. The last columns are the succession of phones, separated by whitespace.
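As an illustration, a loader for this format, not part of SPPAS, could be sketched as follows (the file path in the usage comment is only an example):

import re
from collections import defaultdict

def load_htk_dict(path, encoding="utf-8"):
    # map each word to the list of its pronunciation variants
    pron = defaultdict(list)
    with open(path, encoding=encoding) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue
            word = re.sub(r"\(\d+\)$", "", parts[0])   # "THE(2)" -> "THE"
            pron[word].append(parts[2:])               # parts[1] is the word between brackets
    return pron

# load_htk_dict("resources/dict/eng.dict")["THE"]
# returns [['D', '@'], ['D', 'V'], ['D', 'i:']]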
Alignment, also called phonetic segmentation, is the process of aligning speech with its corresponding transcription at the phone level. The alignment problem consists in a time-matching between a given speech unit and a phonetic representation of that unit.
SPPAS is based on the Julius Speech Recognition Engine (SRE). Speech alignment also requires an acoustic model. An acoustic model is a file that contains statistical representations of each of the distinct sounds of one language. Each phoneme is represented by one of these statistical representations. SPPAS works with HTK-ASCII acoustic models, trained from 16 bits, 16000 Hz wav files.
Speech segmentation was evaluated for French: on average, the automatic speech segmentation is within 40 ms of the manual segmentation 95% of the time (tested on read speech and on conversational speech). Details about these results are available in the slides of the following reference:
Brigitte Bigi (2014). Automatic Speech Segmentation of French: Corpus Adaptation. 2nd Asian Pacific Corpus Linguistics Conference, p. 32, Hong Kong.
The SPPAS aligner takes as input the phonetization and optionally the tokenization. The name of the phonetization tier must contain the string "phon". The first tier that matches is used (case insensitive search).
The annotation provides one annotated file with 3 tiers:
The following options are available to configure alignment:
The syllabification of phonemes is performed with a rule-based system from time-aligned phonemes. This phoneme-to-syllable segmentation system is based on 2 main principles:
These two principles reduce the problem to the task of finding a syllabic boundary between two vowels. As in state-of-the-art systems, phonemes are grouped into classes and rules are established to deal with these classes. We defined general rules followed by a small number of exceptions. Consequently, the identification of relevant classes is important for such a system.
We propose the following classes, for both the French and Italian sets of rules:
The rules we propose follow usual phonological statements for most of the corpus. A configuration file indicates phonemes, classes and rules. This file can be edited and modified to adapt the syllabification.
For more details, see the following reference:
B. Bigi, C. Meunier, I. Nesterenko, R. Bertrand (2010). Automatic detection of syllable boundaries in spontaneous speech. Language Resources and Evaluation Conference, pp. 3285-3292, La Valletta, Malta.
The Syllabification annotation takes as input one file with (at least) one tier containing the time-aligned phonemes. The annotation provides one annotated file with 3 tiers (Syllables, Classes and Structures).
If the syllabification is not as expected, you can change the set of rules. The configuration file is located in the sub-directory "syll" of the "resources" directory.
The syllable configuration file is a simple ASCII text file that any user can change as needed. At first, the list of phonemes and the class symbol associated with each of the phonemes are described as, for example:
PHONCLASS e V
PHONCLASS p O
The phoneme/class pairs are made of 3 columns: the first column is the key-word PHONCLASS, the second column is the phoneme symbol, and the third column is the class symbol. The constraints on this definition are:
The second part of the configuration file contains the rules. The first column is a keyword, the second column describes the classes between two vowels and the third column is the boundary location. The first column can be: GENRULE, EXCRULE, or OTHRULE. In the third column, a 0 means the boundary is just after the first vowel, 1 means the boundary is one phoneme after the first vowel, etc. Here are some examples, corresponding to the rules described in the paper cited above for spontaneous French:
GENRULE VXV 0
GENRULE VXXV 1
EXCRULE VFLV 0
EXCRULE VOLGV 0
Finally, to adapt the rules to specific situations that the rules failed to model, we introduced phoneme sequences with their boundary definition. These specific rules contain only phonemes or the symbol "ANY", which means any phoneme. They consist of 7 columns: the first one is the key-word OTHRULE, the following 5 columns are a phoneme sequence where the boundary should be applied to the third one by the rules, and the last column is the shift to apply to this boundary. In the following example:
OTHRULE ANY ANY p s k -2
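To make the use of GENRULE and EXCRULE lines concrete, here is a hedged sketch, not the SPPAS syllabifier, of how a boundary could be looked up for the sequence of classes found between two vowels (OTHRULE post-processing is left out):

def find_boundary(classes_between_vowels, gen_rules, exc_rules):
    # return the boundary position relative to the first vowel
    # (0 = just after it, 1 = one phoneme later, etc.)
    sequence = "V" + classes_between_vowels + "V"
    if sequence in exc_rules:                  # e.g. {"VFLV": 0, "VOLGV": 0}
        return exc_rules[sequence]
    generic = "V" + "X" * len(classes_between_vowels) + "V"
    return gen_rules.get(generic, 0)           # e.g. {"VXV": 0, "VXXV": 1}

With the rules shown above, a V F L V sequence gets its boundary just after the first vowel (EXCRULE VFLV 0), whereas the general rule VXXV would have placed it one phoneme later.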
This automatic detection focuses on word repetitions, which can be an exact repetition (named strict echo) or a repetition with variation (named non-strict echo).
SPPAS implements self-repetition and other-repetition detection. The system is based only on lexical criteria. The proposed algorithm focuses on the detection of the source.
The Graphical User Interface only allows the detection of self-repetitions. Use the Command-Line User Interface if you want to get other-repetitions.
For more details, see the following paper:
Brigitte Bigi, Roxane Bertrand, Mathilde Guardiola (2014). Automatic detection of other-repetition occurrences: application to French conversational speech, 9th International conference on Language Resources and Evaluation (LREC), Reykjavik (Iceland).
The automatic annotation takes as input a file with (at least) one tier containing the time-aligned tokens of the speaker (and another file/tier for other-repetitions). The annotation provides one annotated file with 2 tiers (Sources and Repetitions).
This process requires a list of stop-words and a dictionary with lemmas (the system can work without the latter, but the results are better with it). Both lexicons are located in the "vocab" sub-directory of the "resources" directory.
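As a naive illustration of a lexical-criteria approach, and not the SPPAS algorithm itself, the following sketch detects strict self-repetitions, i.e. a token n-gram that occurs again later in the same token sequence, ignoring sources made only of stop-words (lemmatization is left out):

def strict_self_repetitions(tokens, stopwords, n=2):
    hits = []
    for i in range(len(tokens) - n + 1):
        source = tokens[i:i + n]
        if all(tok in stopwords for tok in source):
            continue                     # sources made only of stop-words are not relevant
        for j in range(i + n, len(tokens) - n + 1):
            if tokens[j:j + n] == source:
                hits.append((i, j, tuple(source)))
                break
    return hits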
Momel is an algorithm for the automatic modelling of fundamental frequency (F0) curves using a technique called asymmetric modal quadratic regression.
This technique makes it possible by an appropriate choice of parameters to factor an F0 curve into two components:
The algorithm which we call Asymmetrical Modal Regression comprises the following four stages:
For details, see the following reference:
Daniel Hirst and Robert Espesser (1993). Automatic modelling of fundamental frequency using a quadratic spline function. Travaux de l’Institut de Phonétique d’Aix. vol. 15, pages 71-85.
The SPPAS implementation of Momel requires a file with the F0 values, sampled at 10 ms. Two extensions are supported:
These options can be fixed:
INTSINT assumes that pitch patterns can be adequately described using a limited set of tonal symbols, T, M, B, H, S, L, U, D (standing for: Top, Mid, Bottom, Higher, Same, Lower, Up-stepped, Down-stepped respectively), each one of which characterises a point on the fundamental frequency curve.
The rationale behind the INTSINT system is that the F0 values of pitch targets are programmed in one of two ways: either as absolute tones T, M, B, which are assumed to refer to the speaker's overall pitch range (within the current Intonation Unit), or as relative tones H, S, L, U, D, assumed to refer only to the value of the preceding target point.
A distinction is made between non-iterative H, S, L and iterative U, D relative tones since in a number of descriptions it appears that iterative raising or lowering uses a smaller F0 interval than non-iterative raising or lowering. It is further assumed that the tone S has no iterative equivalent since there would be no means of deciding where intermediate tones are located.
D.-J. Hirst (2011). The analysis by synthesis of speech melody: from data to models, Journal of Speech Sciences, vol. 1(1), pages 55-83.