Information

The proposed methodology is designed for the human analyst (mostly researchers in Linguistics).
Therefore, we assume that the methodology is general enough to be useful for a broad class of research applications.
Different analytical domains - e.g. speech and gesture - and theoretical perspectives require a rigorous organization of the annotation procedure.

Which annotations (in general)?

A very large number of dimensions have been annotated in the past on mono and multimodal corpora. To quote only a few, some frequent speech or language based annotations are speech transcript, segmentation into words, utterances, turns, or topical episodes, labeling of dialogue acts, and summaries; among video-based ones are gesture, posture, facial expression [...]. (Popescu-Belis, 2010)

Which annotations (in this tutorial)?

IPUs segmentation
Speech transcript (manual)
Phonemes and words time-alignment
Syllables segmentation
Repetitions detection
Morpho-syntax
Momel and INTSINT
Gestures (manual)

The annotation workflow

The main principle is...

Garbage in, Garbage out.

Capturing and recording multimodal data

The capture of multimodal corpora requires complex settings such as instrumented lecture and meetings rooms, containing capture devices for each of the modalities that are intended to be recorded, but also, most challengingly, requiring hardware and software for digitizing and synchronizing the acquired signals. (Popescu-Belis, 2010)

Some advice:
- Audio: avoid noise; one channel per speaker; uncompressed; a 16000 Hz sample rate (or a multiple of it) is enough for automatic speech tools
- Video: prefer H.264 (standard); test the annotation software; avoid conversions (record directly into the expected format)
- Synchronization: use a regular "clap"
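
The audio advice can be checked automatically before annotation starts. The following is a minimal sketch using Python's standard `wave` module, which only reads uncompressed PCM WAV files; the function name `check_recording` and the wording of the warnings are illustrative assumptions, not part of any annotation tool.

```python
import wave


def check_recording(path, base_rate=16000):
    """Return a list of warnings for a WAV file (hypothetical helper)."""
    warnings = []
    with wave.open(path, "rb") as wav:
        # One channel per speaker: each file is expected to be mono.
        if wav.getnchannels() != 1:
            warnings.append("more than one channel in the file")
        # 16000 Hz, or a multiple of it, is enough for automatic speech tools.
        rate = wav.getframerate()
        if rate < base_rate or rate % base_rate != 0:
            warnings.append(f"{rate} Hz is not a multiple of {base_rate} Hz")
    return warnings
```

An empty list means that the file follows the channel and sample-rate advice; noise still has to be judged by listening.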

IPUs Segmentation: definition

also called Silence/Speech segmentation

Parameters to define manually:
- fix the minimum silence duration and the minimum speech duration
- both values depend on the language and the speech style
As a result:
- speech and silences are time-aligned and annotated automatically
- A manual verification is recommended

Example of IPUs segmentation: silences are annotated with # and speech intervals are filled with the IPU number
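
To make the role of the two durations concrete, here is a minimal sketch of a silence/speech segmentation. It assumes that the signal has already been reduced to one energy (RMS) value per 10 ms frame and that a fixed energy threshold separates speech from silence; the function names, the threshold and the default durations are illustrative assumptions, and the sketch does not reproduce the algorithm used by SPPAS.

```python
from itertools import groupby


def merge_runs(runs):
    """Merge adjacent runs that carry the same speech/silence flag."""
    merged = []
    for is_speech, count in runs:
        if merged and merged[-1][0] == is_speech:
            merged[-1][1] += count
        else:
            merged.append([is_speech, count])
    return merged


def segment_ipus(rms, frame_dur=0.01, threshold=0.02, min_sil=0.25, min_ipu=0.30):
    """Return (start, end, label) intervals: '#' for silence, 'ipu_N' for speech."""
    flags = (value >= threshold for value in rms)
    runs = [[flag, len(list(group))] for flag, group in groupby(flags)]
    # A silence shorter than min_sil between two speech runs is not a boundary.
    for i, run in enumerate(runs):
        if not run[0] and 0 < i < len(runs) - 1 and run[1] * frame_dur < min_sil:
            run[0] = True
    runs = merge_runs(runs)
    # A speech run shorter than min_ipu is treated as silence.
    for run in runs:
        if run[0] and run[1] * frame_dur < min_ipu:
            run[0] = False
    runs = merge_runs(runs)
    intervals, start, ipu = [], 0, 0
    for is_speech, count in runs:
        end = start + count
        if is_speech:
            ipu += 1
        label = f"ipu_{ipu}" if is_speech else "#"
        intervals.append((round(start * frame_dur, 3), round(end * frame_dur, 3), label))
        start = end
    return intervals
```

With a larger `min_sil`, more short pauses are absorbed into the surrounding speech, which is why the value has to be tuned to the language and the speech style, and why a manual check of the result remains useful.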

Orthographic Transcription

An orthographic transcription is the minimum requirement for a speech corpus,
- a better representation of pronunciation may be desired for most research questions
Orthographic transcription is at the top of the annotation procedure:
- and remember: "Garbage in, Garbage out."
Orthographic transcription of spoken language presents considerable challenges.

Orthographic Transcription

Speech may be annotated for:

phonetic transcription taking into account details of pronunciation
- allows a time-alignment at the phoneme-level
- which is extended to time-alignment at word-level and syllable-level.

⇒ Enriched Orthographic Transcription

Enriched Orthographic Transcription

In speech (particularly in spontaneous speech), many kinds of events can occur, such as breaths, laughter, ...

Enriched Orthographic Transcription

In speech (particularly in spontaneous speech), many phonetic variations occur:
- Some of these phonologically known variants are predictable
- but many others are still unpredictable (especially invented words, regional words or words borrowed from another language)
The orthographic transcription must be enriched:
- it must be a representation of what is “perceived” in the signal.

Enriched Orthographic Transcription

An EOT must include, at least:
- Filled pauses
- Short pauses
- Repeats
- Truncated words
- Noises
- Laughter
An EOT must also include:
- irregular elisions
- specific pronunciations
An EOT may include:
- all elisions

Enriched Orthographic Transcription: convention

Any EOT must follow a convention
The EOT is the input for automatic systems... and the transcription convention depends on the tool/software.
So... you must read the documentation before starting to transcribe!

First train yourself to transcribe and to use the annotation software!

SPPAS transcription convention

truncated words, noted as a ’-’ at the end of the token string (an ex- example)
noises, noted by a ’*’
laughs, noted by a ’@’
short pauses, noted by a ’+’
elisions, mentioned in parenthesis
specific pronunciations, noted with brackets [example,eczap]
comments are noted inside braces or brackets without using comma {this} or [this and this]
liaisons, noted between ’=’ (an =n= example)
morphological variants with <like,lie ok>
proper name annotation, like $John S. Doe$

Transcription example 1 (Conversational speech)

EOT:
donc + i- i(l) prend la è- recette et tout bon i(l) vé- i(l) dit bon [okay, k]
derived Standard orthography:
donc il prend la recette et tout bon il dit bon okay
derived Faked orthography:
donc + i i prend la è recette et tout bon i vé i dit bon k

Transcription example 2 (Conversational speech)

EOT:
ah mais justement c'était pour vous vendre bla bla bla bl(a) le mec i(l) te l'a emboucané + en plus i(l) lu(i) a [acheté,acheuté] le truc et le mec il est parti j(e) dis putain le mec i(l) voulait
Standard orthography:
ah mais justement c'était pour vous vendre bla bla bla bla le mec il te l'a emboucané en plus il lui a acheté le truc et le mec il est parti je dis putain le mec il voulait
Faked orthography:
ah mais justement c'était pour vous vendre bla bla bla bl le mec i te l'a emboucané + en plus i lu a acheuté le truc et le mec il est parti j dis putain le mec i voulait

Transcription example 3 (GrenelleII)

EOT:
euh les apiculteurs + et notamment b- on n(e) sait pas très bien + quelle est la cause de mortalité des abeilles m(ais) enfin il y a quand même + euh peut-êt(r)e des attaques systémiques
Standard orthography:
les apiculteurs et notamment on ne sait pas très bien quelle est la cause de mortalité des abeilles mais enfin il y a quand même peut-être des attaques systémiques
Faked orthography:
euh les apiculteurs + et notamment b on n sait pas très bien + quelle est la cause de mortalité des abeilles m enfin il y a quand même + euh peut-ête des attaques systémiques

Orthographic Transcription... to sum up

An Enriched Orthographic Transcription is required
The EOT of a corpus must follow a transcription convention
Manual Standard orthographic transcription takes 15-20 minutes / minute of speech.
Manual Enriched orthographic transcription takes 30-45 minutes / minute of speech.

The automatic systems must be adapted to deal with EOT

Phonemes/Tokens time-alignment

Phonemes and Tokens time-alignment

A problem divided into 3 sub-tasks:

tokenization: text normalization, word segmentation
phonetization: grapheme-phoneme conversion
alignment: phonetic segmentation

Tokenization

Tokenization is also known as "Text Normalization".

Tokenization is the process of segmenting a text into tokens.
In principle, any system that deals with unrestricted text needs the text to be normalized.
Automatic text normalization is mostly dedicated to written text, in the NLP community

The main steps in SPPAS are:

Remove punctuation
Convert the text to lowercase
Convert numbers to their written form
Replace symbols by their written form (like %, °, ...)
Word segmentation
- based on a lexicon.
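
The steps above can be sketched as follows. The number and symbol tables are tiny toy examples for English, and the greedy longest-match strategy used for lexicon-based word segmentation is an assumption made for illustration; none of this is taken from the SPPAS source code. Symbols are replaced before punctuation is removed, otherwise they would be stripped together with the punctuation.

```python
import re

# Toy conversion tables (assumptions); real tables are language-dependent.
NUMBERS = {"1": "one", "2": "two", "3": "three"}
SYMBOLS = {"%": "percent", "°": "degrees"}


def normalize(text):
    """Return the list of normalized tokens of a text."""
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")   # symbols -> written form
    text = re.sub(r"[^\w\s']", " ", text)           # remove punctuation
    tokens = text.lower().split()                   # lowercase, split on spaces
    return [NUMBERS.get(token, token) for token in tokens]  # numbers -> words


def segment(text, lexicon, max_len=4):
    """Greedy longest-match segmentation of a text written without spaces."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            # Fall back to a single character when nothing is in the lexicon.
            if text[i:i + size] in lexicon or size == 1:
                words.append(text[i:i + size])
                i += size
                break
    return words
```

The `segment` function matters mostly for languages whose writing system does not separate words with spaces; for the others, splitting on whitespace is usually a sufficient starting point.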

Tokenization in SPPAS

From an EOT, SPPAS produces 2 outputs:
- standard: the text normalization of the standard transcription,
- faked: the text normalization of the faked transcription.
Example:
This is + hum... an enrich(ed) transcription {loud} number 1!

standard:

this is hum an enriched transcription number one

faked:

this is + hum an enrich transcription number one

(Bigi 2011)

Phonetization

Phonetization is also known as grapheme-phoneme conversion

Phonetization is the process of representing sounds with phonetic signs.
Phonetic transcription of text is an indispensable component of text-to-speech (TTS) systems and is used in acoustic modeling for automatic speech recognition (ASR) and other natural language processing applications.

Converting written text into actual sounds, for any language, causes several problems that have their origins in the relative lack of correspondence between the spelling of the lexical items and their sound contents.

Phonetization in SPPAS

SPPAS implements: (Bigi 2013)

a dictionary-based solution
- consists in storing a maximum of phonological knowledge in a lexicon.
- the phonetization process is the equivalent of a sequence of dictionary look-ups
a language-independent algorithm to phonetize unknown words.

Convention: spaces separate words, dots separate phones and pipes separate phonetic variants
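
A dictionary-based phonetization that follows this output convention can be sketched in a few lines. The pronunciation dictionary below is a toy example with made-up entries, and unknown words are simply marked with a placeholder; the language-independent algorithm that SPPAS uses for unknown words is not reproduced here.

```python
# Toy pronunciation dictionary (hypothetical entries): word -> list of variants,
# each variant being a sequence of phones separated by dots.
TOY_DICT = {
    "the": ["D.@", "D.i"],
    "cat": ["k.a.t"],
}


def phonetize(tokens, pron_dict, unknown="<UNK>"):
    """Phonetize tokens by dictionary look-up.

    Spaces separate words, dots separate phones, pipes separate variants.
    """
    phonetized = []
    for token in tokens:
        if token in pron_dict:
            phonetized.append("|".join(pron_dict[token]))
        else:
            phonetized.append(unknown)  # placeholder for an unknown word
    return " ".join(phonetized)
```

Calling `phonetize(["the", "cat"], TOY_DICT)` gives one string in which the two variants of "the" are kept side by side, so that a later alignment step can choose between them.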

Impact of the Orthographic Transcription on automatic phonetization

In (Bigi et al. 2012), we compared 3 types of OT:
1. Standard orthographic transcription.
2. Enriched 1: Std-OT + short pauses, various noises, laughter, filled pauses, truncated words, repeats.
3. Enriched 2: Enriched 1 + elisions, particular pronunciations and unusual liaisons.
Evaluations compare a manually phonetized reference with the phonetizations obtained with SPPAS
⇒ the phonetization error rate is divided by a factor of 2 to 3 with the enriched transcriptions

Alignment

Alignment is also called phonetic segmentation
The alignment problem consists in time-matching a given speech unit with a phonetic representation of that unit.

Manual alignment has been reported to take between 11 and 30 seconds per phoneme. (Leung and Zue, 1984)

How to perform speech segmentation?

HTK - Hidden Markov Model Toolkit
CMU Sphinx
Open Source Large Vocabulary CSR Engine Julius

Prosodylab-Aligner: python+HTK
P2FA: python+HTK

Web-services:
- WebMAUS
- Train&Align

SPPAS (python+Julius), available for English, French, Italian, Spanish, Catalan, Polish, Japanese, Mandarin Chinese, Taiwanese, Cantonese

Alignment results in SPPAS

On average, the automatic speech segmentation of French is within 40 ms of the manual segmentation 95% of the time (SPPAS 1.5, September 2014):
- tested on read speech
- tested on conversational speech
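
A figure of this kind can be computed by comparing boundary times one by one. The sketch below assumes that the automatic and manual segmentations contain the same boundaries in the same order, with times in seconds; the function name is hypothetical and the exact evaluation protocol behind the reported result may differ.

```python
def within_tolerance(auto_times, manual_times, tolerance=0.040):
    """Percentage of automatic boundaries within `tolerance` seconds
    of the corresponding manual boundaries."""
    if not auto_times or len(auto_times) != len(manual_times):
        raise ValueError("both lists must be non-empty and of equal length")
    hits = sum(abs(a - m) <= tolerance for a, m in zip(auto_times, manual_times))
    return 100.0 * hits / len(auto_times)
```

With four boundaries of which three differ by less than 40 ms from the manual ones, the function returns 75.0.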

Results on vowels of French conversational speech

Syllables segmentation

Syllabification by SPPAS

Automatic annotation
A rule-based system
Rules available for:
- French
- Italian
This phoneme-to-syllable segmentation system is based on 2 main principles:
1. a syllable contains a vowel, and only one;
2. a pause is a syllable boundary.

(Bigi et al. 2010)

Syllabification by SPPAS

Phonemes are grouped into classes, for both French and Italian:
- V - Vowels,
- G - Glides,
- L - Liquids,
- O - Occlusives,
- F - Fricatives,
- N - Nasals.
Rules are defined to find the syllable boundary between two consecutive vowels
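
To show how class-based rules can place a boundary between two vowels, here is a minimal sketch. The three rules it applies (V.V, V.CV, and VC.CV unless the last two consonants form an occlusive or fricative followed by a liquid or glide) are simplified assumptions chosen for illustration; they are not the rule set distributed with SPPAS. A chunk without any vowel is returned as is.

```python
def split_chunk(chunk, classes):
    """Split a pause-free phoneme sequence into syllables (toy rules)."""
    vowels = [i for i, p in enumerate(chunk) if classes[p] == "V"]
    if len(vowels) < 2:
        return [chunk] if chunk else []
    cuts = []
    for left, right in zip(vowels, vowels[1:]):
        n_cons = right - left - 1
        if n_cons <= 1:
            cut = left + 1            # V.V or V.CV
        else:
            first, second = classes[chunk[right - 2]], classes[chunk[right - 1]]
            if first in {"O", "F"} and second in {"L", "G"}:
                cut = right - 2       # keep the cluster as onset: V.CCV
            else:
                cut = right - 1       # VC.CV
        cuts.append(cut)
    bounds = [0] + cuts + [len(chunk)]
    return [chunk[a:b] for a, b in zip(bounds, bounds[1:])]


def syllabify(phones, classes, pause="#"):
    """Syllabify a phoneme list; a pause is always a syllable boundary."""
    syllables, chunk = [], []
    for phone in phones + [pause]:
        if phone == pause:
            syllables.extend(split_chunk(chunk, classes))
            chunk = []
        else:
            chunk.append(phone)
    return syllables
```

With `classes = {"p": "O", "t": "O", "R": "L", "a": "V", "i": "V"}`, the sequence p a t R i is split as pa.tRi, whereas p a R t i is split as paR.ti.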

Repetitions detection

Repetitions

Repetition is the reproduction of something that has just been said.
- Other-repetition: reproduction by another speaker of what a first speaker has said
- Self-repetition: reproduction by the speaker of what he or she has said
Other-repetition has been identified as an important mechanism in face-to-face conversation through its discursive or communicative functions

(Bigi et al. 2014)

Repetitions with SPPAS

Semi-automatic annotation performed by SPPAS
SPPAS implements:
- self-repetitions,
- other-repetitions detection (CLI only).
The system is based only on lexical criteria, from the time-aligned tokens (or lemmas)
The system was used to propose a lexical characterization of other-repetitions:
⇒ various statistics were estimated on the detected other-repetitions
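
A purely lexical detection can be illustrated with a very small sketch: given the tokens of a first speaker and the tokens of the following turn by another speaker, it returns the longest sequence of tokens that both share. The function name and the minimum length of two tokens are assumptions, and the actual criteria used by SPPAS are richer than this.

```python
from difflib import SequenceMatcher


def other_repetition(source_tokens, echo_tokens, min_len=2):
    """Return the longest token sequence of the source that the echoing
    speaker reproduces, or None if it is shorter than `min_len`."""
    matcher = SequenceMatcher(None, source_tokens, echo_tokens, autojunk=False)
    match = matcher.find_longest_match(0, len(source_tokens), 0, len(echo_tokens))
    if match.size < min_len:
        return None
    return source_tokens[match.a:match.a + match.size]
```

Because the comparison works on tokens only, the same function can be applied to lemmas instead of tokens without any change.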

Morpho-syntax

Morpho-syntactic annotation is mostly dedicated to written text, in the NLP community
A system must be adapted to deal with speech, particularly for conversational speech:
- spoken data are time-aligned and we expect to get a time-aligned morpho-syntax!
- the lexicon and the probabilities of tokens are different between written texts and speech, so they must be updated.
At LPL, Stéphane Rauzy and G. de Montcheuil are proposing MarsaTag, for French:
- http://sldr.org/sldr000841

Example of Morpho-syntax in CID

Example of time-aligned morpho-syntax on conversational speech

Momel and INTSINT

Momel (modelling melody)
- algorithm modelling raw fundamental frequency curves with a quadratic spline function
- target F₀ Points
INTSINT: an INternational Transcription System for INTonation
- based on an inventory of minimal pitch contrasts found in published descriptions of intonation patterns
- surface phonological structure
- mapping from Momel target points to INTSINT tones

INTSINT

Absolute tones: T(op) M(id) B(ottom)
Relative tones: H(igher) S(ame) L(ower)
Iterative relative tones: U(pstepped) D(ownstepped)

Example of Momel and INTSINT

Momel and INTSINT: software

Momel and INTSINT are available:
- as a Praat plugin, developed by Daniel Hirst
- in SPPAS, developed by Brigitte Bigi

(Hirst and Espesser, 1993)

Gestures

What?
- shape: up/down/sideways/complex/..., single hand/both hands
- phase: preparation/stroke/post-stroke hold/retraction/...
- function: deictic/iconic/symbolic/feedback/...
- ...

(Tellier 2014)

Use/build an annotation scheme, adapted to the particular framework of your research
Define an unambiguous annotation guide
Validation: intercoder agreement, ...
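
One common measure of intercoder agreement for two annotators is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below is a plain implementation of that formula; it assumes that both annotators labeled the same items in the same order, and the choice of kappa is an illustration, since other measures exist.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators labeling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item list")
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random with
    # their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and a value of 0 means agreement no better than chance.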

Summary

Software
- Selection
- Examples: Praat, Elan, SPPAS
An annotation workflow
- Recording
- IPU & Transcription
- Phonetics: tokens, phonemes, syllables
- Morphosyntax: POS, lemma, chunk
- Discourse: repetitions
- Prosody: Momel & INTSINT
- Gestures
Exploring
Sharing
- Why? How?
- Data Repositories
- Metadata
