SPPAS Documentation

Brigitte Bigi

Version 1.7.5

Resources for Automatic Annotations

What are SPPAS resources and where do they come from?

Overview

All automatic annotations included in SPPAS are implemented with language-independent algorithms. This means that adding a new language to SPPAS only consists of adding the resources related to each annotation (lexicons, dictionaries, models, sets of rules, etc.).

All available resources to perform automatic annotations are located in the sub-directory 'resources'. There are 5 sub-directories:

Lexicons (lists of words used during Tokenization) are located in the vocab sub-directory.
Lists of replacements to perform during Tokenization are located in the repl sub-directory.
Pronunciation dictionaries (used during Phonetization) are located in the dict sub-directory.
Acoustic models (used during Alignment) are located in the models sub-directory.
Syllabification configuration files are located in the syll sub-directory.

All resources can be edited, modified, changed or deleted by any user.

Caution: all the files are in UTF-8, and this encoding must not be changed.
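Because the encoding matters, it can help to check a resources directory after editing files by hand. The following is a minimal sketch and not part of SPPAS: the function name find_non_utf8_files is hypothetical, and the code simply assumes that every file under the given directory is meant to be UTF-8 text.

```python
from pathlib import Path


def find_non_utf8_files(root):
    """Return the files under root whose content is not valid UTF-8."""
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # Decoding the raw bytes fails as soon as one byte sequence
            # is not valid UTF-8 (e.g. a file saved as Latin-1).
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path)
    return bad
```

Calling find_non_utf8_files("resources") returns an empty list when every file decodes cleanly; any path it returns should be re-saved as UTF-8 before running SPPAS.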

The language names are based on the ISO639-3 international standard. See http://www-01.sil.org/iso639-3/ for the list of all languages and codes. Here is the list of available languages in SPPAS resources:

French: fra
English: eng
Spanish: spa
Italian: ita
Catalan: cat
Japanese: jpn
Mandarin Chinese: cmn
Southern Min (or Min Nan): nan
Cantonese: yue

SPPAS can deal with a new language by simply adding the language resources to the appropriate sub-directories. Of course, the file formats must correspond to those expected by SPPAS! Lexicons and dictionaries can be edited, modified and saved with a simple text editor such as Notepad++ (under Windows, above all, do not use the Windows Notepad). The same applies to the syllabification rules.

The only step in the procedure which is probably beyond the means of a linguist without external aid is the creation of a new acoustic model when it does not yet exist for the language being analysed. This only needs to be carried out once for each language, though, and we provide detailed specifications of the information needed to train an acoustic model on an appropriate set of recordings and dictionaries or transcriptions. Acoustic models obtained by such a collaborative process will be made freely available to the scientific community.

The current acoustic models can be improved too (except for eng and jpn): send your data (wav and transcription files) to the author. Note that such data will not be published in any form without your authorization. They will be included in the training procedure to create a new (and better) acoustic model, which will be distributed in the next version of SPPAS.

About the phone sets

Most of the resources use SAMPA to represent phonemes.

In addition, all models (except jpn and yue) include the following fillers:

dummy: untranscribed speech
gb: garbage
@@: laughter

How to add a new language

1. Dictionary and lexicon:

1.1 Copy the phonetic dictionary LANG.dict into the dict directory

1.2 Copy the vocabulary list LANG.vocab into the vocab directory

2. Create a directory models/models-LANG; then copy the acoustic model into this directory

3. Optionally, copy the file syllConfig-LANG.txt into the syll directory.
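The copy steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: install_language is a hypothetical helper, not an SPPAS function, and it takes the file and directory names (LANG.dict, LANG.vocab, models/models-LANG, syllConfig-LANG.txt) exactly as listed in the procedure. It requires Python 3.8 or later for the dirs_exist_ok argument.

```python
import shutil
from pathlib import Path


def install_language(resources, lang, dict_file, vocab_file, model_dir, syll_file=None):
    """Copy the resources of one language into a 'resources' directory."""
    resources = Path(resources)
    for sub in ("dict", "vocab", "models"):
        (resources / sub).mkdir(parents=True, exist_ok=True)

    # Steps 1.1 and 1.2: pronunciation dictionary and vocabulary list.
    shutil.copy(dict_file, resources / "dict" / f"{lang}.dict")
    shutil.copy(vocab_file, resources / "vocab" / f"{lang}.vocab")

    # Acoustic model: one directory per language, named models-LANG.
    shutil.copytree(model_dir, resources / "models" / f"models-{lang}",
                    dirs_exist_ok=True)

    # Optional syllabification configuration file.
    if syll_file is not None:
        (resources / "syll").mkdir(parents=True, exist_ok=True)
        shutil.copy(syll_file, resources / "syll" / f"syllConfig-{lang}.txt")
```

The helper only copies files; it does not check that their content follows the formats required by SPPAS.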

Required Input Data formats:

The dictionary is in HTK ASCII format, like: word [word] phon1 phon2 phon3. Columns are separated by spaces.
Acoustic models are in HTK ASCII format (16 bits, 16000 Hz).
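To make the dictionary layout concrete, here is a minimal parsing sketch. It is not the reader used by SPPAS: parse_dict_line and load_dict are hypothetical names, and the code assumes one entry per line, columns separated by whitespace, and a bracketed second column that may be absent.

```python
def parse_dict_line(line):
    """Split one dictionary line into (word, variant, phonemes).

    Returns None for a blank line.
    """
    fields = line.split()
    if not fields:
        return None
    word, rest = fields[0], fields[1:]
    variant = word
    # The bracketed second column is treated as optional here (assumption).
    if rest and rest[0].startswith("[") and rest[0].endswith("]"):
        variant = rest[0][1:-1]
        rest = rest[1:]
    return word, variant, rest


def load_dict(lines):
    """Group pronunciations by word; a word may appear on several lines."""
    entries = {}
    for line in lines:
        parsed = parse_dict_line(line)
        if parsed is None:
            continue
        word, _, phonemes = parsed
        entries.setdefault(word, []).append(phonemes)
    return entries


# Placeholder entries, following the pattern given in the text.
sample = ["word [word] phon1 phon2 phon3", "word [word] phon1 phon4"]
print(load_dict(sample))
```

Keeping a list of pronunciations per word reflects the fact that the dictionaries contain pronunciation variants.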

Note that the Graphical User Interface dynamically creates the list of available languages by exploring the sub-directory "resources" included in the SPPAS package. This means that all changes in the "resources" directory will be automatically taken into account.
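The idea of discovering languages from the directory content can be illustrated as follows. This is a sketch under assumptions, not the code of the Graphical User Interface: available_languages is a hypothetical function, and it relies only on the naming patterns given in this section (dict/LANG.dict, vocab/LANG.vocab, models/models-LANG, syll/syllConfig-LANG.txt).

```python
from pathlib import Path


def available_languages(resources):
    """Map each resource type to the sorted language codes found for it."""
    root = Path(resources)
    found = {
        # dict/LANG.dict and vocab/LANG.vocab: the code is the file stem.
        "dict": {p.stem for p in root.glob("dict/*.dict")},
        "vocab": {p.stem for p in root.glob("vocab/*.vocab")},
        # models/models-LANG: the code follows the "models-" prefix.
        "models": {p.name[len("models-"):]
                   for p in root.glob("models/models-*") if p.is_dir()},
        # syll/syllConfig-LANG.txt: the code follows the "syllConfig-" prefix.
        "syll": {p.stem[len("syllConfig-"):]
                 for p in root.glob("syll/syllConfig-*.txt")},
    }
    return {kind: sorted(codes) for kind, codes in found.items()}
```

Because the result is computed from the file system at call time, a language added to the directories shows up on the next call without any other change.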

French Resources

(C) Laboratoire Parole et Langage, Aix-en-Provence, France.

Pronunciation Dictionary

The French dictionary is under the terms of the "GNU Public License".

The French pronunciation dictionary was created by Brigitte Bigi by merging some free dictionaries downloaded from the web. Some word pronunciations were added using the LIA_Phon tool. Many words were manually corrected and a large set of missing words and pronunciation variants was added.

Acoustic Model

See COPYRIGHT.txt (in the "model-fra" directory) for the details of the license: "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License".

The French acoustic model was created by Brigitte Bigi from various corpora recorded at Laboratoire Parole et Langage (see the "References" section of this documentation). It includes:

CID - Corpus of Interactional Data (7h30) http://www.sldr.org/sldr000027/,
Grenelle (7min) http://www.sldr.org/sldr000729/,
Broadcast news (40min) and Read speech (not public).

References to these corpora:

P. Blache, R. Bertrand, B. Bigi, E. Bruno, E. Cela, R. Espesser, G. Ferré, M. Guardiola, D. Hirst, E.-P. Magro, J.-C. Martin, C. Meunier, M.-A. Morel, E. Murisasco, I. Nesterenko, P. Nocera, B. Pallaud, L. Prévot, B. Priego-Valverde, J. Seinturier, N. Tan, M. Tellier, S. Rauzy (2010). Multimodal Annotation of Conversational Data, The Fourth Linguistic Annotation Workshop, ACL 2010, pages 186-191, Uppsala, Sweden.

B. Bigi, C. Portes, A. Steuckardt, M. Tellier (2011). Multimodal Annotations and Categorization for Political Debates, ICMI Workshop on Multimodal Corpora for Machine learning (ICMI-MMC), Alicante, Spain.

Here is the phoneset used in the acoustic model:

SPPAS	- IPA -	Examples
b	b	beau
d	d	doux
f	f	fête pharmacie
g	ɡ	gain guerre second
k	k	cabas psychologie quatre kelvin
l	l	loup
m	m	mou femme
n	n	nous bonne
p	p	passé
R	ʁ	roue rhume
s	s	sa hausse ce garçon option scie
S	ʃ	choux schème shampooing
t	t	tout thé grand-oncle
v	v	vous wagon neuf heures
z	z	hasard zéro transit
Z	ʒ	joue geai
N	ŋ	camping bingo
j	j	fief payer fille travail
w	w	oui loi moyen web whisky wagon
h	ɥ	huit Puy
a	a	patte là
a	ɑ	pâte glas
e	e	clé les chez aller pied journée
E	ɛ	crème est faite peine
E	ɛː	fête maître mètre reître reine caisse Lemaistre Lévesque
eu	ə	le reposer monsieur faisons
eu	ø	ceux jeûner queue deux
9	œ	sœur jeune neuf
i	i	si île régie y
o	o	sot hôtel haut bureau
o	ɔ	sort minimum
u	u	coup clown roue
y	y	tu sûr rue
a~	ɑ̃	sans champ vent temps Jean taon
U~	ɛ̃	vin impair pain daim plein Reims synthèse sympa
U~	œ̃	un parfum
o~	ɔ̃	son nom
fp		euh
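A table like the one above can be loaded into a lookup structure with a few lines of code. The sketch below is illustrative only: read_phoneset is a hypothetical function, and it assumes that each row holds the SPPAS symbol, the IPA symbol and the examples separated by tabs. Since one SPPAS symbol may correspond to several IPA symbols (for instance "a"), each symbol maps to a list.

```python
def read_phoneset(rows):
    """Map each SPPAS symbol to the list of IPA symbols given for it."""
    table = {}
    for row in rows:
        cells = row.rstrip("\n").split("\t")
        symbol = cells[0].strip()
        if not symbol:
            continue
        ipa = cells[1].strip() if len(cells) > 1 else ""
        values = table.setdefault(symbol, [])
        # A row may leave the IPA column empty (e.g. the filled pause "fp").
        if ipa and ipa not in values:
            values.append(ipa)
    return table


# Four rows copied from the French table.
rows = ["b\tb\tbeau", "a\ta\tpatte là", "a\tɑ\tpâte glas", "fp\t\teuh"]
print(read_phoneset(rows))
```

Symbols with more than one IPA value show where the acoustic model merges distinctions that the IPA keeps apart.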

Syllabification configuration

The syllabification configuration file corresponds to the rules defined in the following paper:

B. Bigi, C. Meunier, I. Nesterenko, R. Bertrand (2010). Automatic detection of syllable boundaries in spontaneous speech, Language Resources and Evaluation Conference, pp 3285-3292, Valletta, Malta.

Italian resources

(C) Laboratoire Parole et Langage, Aix-en-Provence, France.

Pronunciation Dictionary

The Italian dictionary is under the terms of the "GNU Public License".

The Italian dictionary was downloaded from the Festival synthesizer tool. Some of the phonetizations were manually corrected and a large set of missing words and pronunciation variants was added.

Acoustic Model

See COPYRIGHT.txt (in the "model-ita" directory) for the details of the license: "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License".

The Italian acoustic model was created during the Evalita 2011 evaluation campaign, from the CLIPS MapTask corpus (3h30).

Here is the phoneset used in the acoustic model:

SPPAS	- IPA -	Examples
b	b	banca cibo
d	d	dove idra
dz	dz	zaino zelare mezzo
dZ	dʒ	giungla magia fingere pagina
f	f	fatto fosforo
g	ɡ	gatto agro glifo ghetto
k	k	cavolo acuto anche quei
l	l	lato lievemente
L	ʎ	gli glielo maglia
m	m	mano amare campo
n	n	nano punto pensare anfibio
N	ŋ	fango unghia panchina dunque
J	ɲ	gnocco ogni
p	p	primo ampio copertura
r	r	Roma quattro morte
s	s	sano scatola presentire pasto
S	ʃ	scena sciame pesci
t	t	tranne mito
ts	ts	sozzo canzone marzo
tS	tʃ	Cennini cinque ciao
v	v	vado povero
z	z	sbavare presentare asma
j	j	ieri scoiattolo più Jesi
w	w	uovo fuoco qui
a	a	alto sarà
e	e	vero perché
E	ɛ	elica cioè
i	i	imposta colibrì zie
o	o	ombra come
o	ɔ	otto posso sarò
u	u	ultimo caucciù tuo

Syllabification configuration

The syllabification configuration file corresponds to the rules defined in the following paper:

B. Bigi, C. Petrone (2014). A generic tool for the automatic syllabification of Italian, Proceedings of the first Italian Computational Linguistics Conference, Pisa, Italy.

Spanish resources

Pronunciation Dictionary

The pronunciation dictionary was downloaded from the CMU web page. Brigitte Bigi converted this CMU phoneset to SAMPA, and changed the file format.

Acoustic Model

(C) Laboratoire Parole et Langage, Aix-en-Provence, France.

See COPYRIGHT.txt (in the "model-spa" directory) for the details of the license: "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License".

The acoustic model was trained from Glissando. We address special thanks to Juan-Maria Garrido for giving us access to this corpus: http://veus.barcelonamedia.org/glissando/node/10

GARRIDO, J. M. - ESCUDERO, D. - AGUILAR, L. - CARDEÑOSO, V. - RODERO, E. - DE-LA-MOTA, C. - GONZÁLEZ, C. - RUSTULLET, S. - LARREA, O. - LAPLAZA, Y. - VIZCAÍNO, F. - CABRERA, M. - BONAFONTE, A. (2013). Glissando: a corpus for multidisciplinary prosodic studies in Spanish and Catalan, Language Resources and Evaluation, DOI 10.1007/s10579-012-9213-0.

Here is the phoneset used in the acoustic model:

SPPAS	- IPA -	Examples
b	b	bestia embuste vaca
b	β	bebé obtuso vivir curva
d	d	dedo cuando aldaba
d	ð	dádiva arder admirar
f	f	fase café
g	ɡ	gato lengua guerra
g	ɣ	trigo amargo sigue signo
h	h	jamón eje reloj general México
k	k	caña laca quise kilo
l	l	lino alhaja principal
L	ʎ	llave pollo
m	m	madre comer campo convertir
n	n	nido anillo anhelo sin álbum
J	ɲ	ñandú cañón enyesar
N	ŋ	cinco venga conquista
p	p	pozo topo
rr	r	rumbo carro honra subrayo amor
r	ʁ	caro bravo amor eterno
s	s	saco zapato cientos espita xenón
T	θ	xenón cereal encima zorro enzima paz
t	t	tamiz átomo
tS	tʃ	chubasco acechar
x	x	jamón eje reloj general México
z	z	isla mismo deshuesar
S	ʃ	English abacaxi Shakira
ts	ts	Ertzaintza abertzale Pátzcuaro
j	j	aliada cielo amplio ciudad
w	w	cuadro fuego
a	a	azahar
e	e	vehemente
i	i	dimitir mío
o	o	boscoso
u	u	cucurucho dúo

Catalan resources

Pronunciation Dictionary

The Catalan dictionary is under the terms of the "GNU Public License".

The Catalan pronunciation dictionary was downloaded from the Ralf catalog of dictionaries for the Simon ASR system at http://spirit.blau.in/simon/import-pls-dictionary/. It was then converted (format and phoneset) by Brigitte Bigi. Some new words were also added and phonetized manually.

Acoustic Model

(C) Laboratoire Parole et Langage, Aix-en-Provence, France.

See COPYRIGHT.txt (in the "model-cat" directory) for the details of the license: "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License".

The acoustic model was not trained from data. Monophones of other models were cut and pasted to create this one.

Here is the phoneset used in the acoustic model:

SPPAS	- IPA -	Examples
@	ə	sec
E	ɛ	sec
D	ð
N	ŋ
J	ɲ
L	ʎ
S	ʃ
O	ɔ
Z	ʒ
U	ʊ
a	a	sac
b	b
b	β
b	v
d	d
d	ð
e	e	séc
f	f
g	g
g	ʝ
k	k
l	l
m	m
n	n
i	ɪ
i	i	sic
o	o	sóc
p	p
rr	r
r	ʀ
4	ɾ
s	s	si
t	t
tS	tʃ
x	x
x	ɣ
z	z
j	j
w	w
u	u	suc

English

Dictionary

The CMU Pronouncing Dictionary (also known as CMUdict) is a public domain pronouncing dictionary created by Carnegie Mellon University (CMU). It defines a mapping from English words to their North American pronunciations; it contains over 125,000 words and their transcriptions. See http://www.speech.cs.cmu.edu/cgi-bin/cmudict for details.

The Carnegie Mellon Pronouncing Dictionary, in its current and previous versions is Copyright (C) 1993-2008 by Carnegie Mellon University.
Use of this dictionary for any research or commercial purpose is completely unrestricted. If you make use of or redistribute this material, the CMU requests that you acknowledge its origin in your descriptions.

Acoustic Model

The acoustic model distributed in SPPAS resources was downloaded (in 2013) from the VoxForge project at http://www.voxforge.org/. It was then converted to SAMPA by Brigitte Bigi.

The English acoustic model is under the terms of the "GNU Public License".

Here is the phoneset used in the acoustic model:

SPPAS	- IPA -	Examples
b	b	buy cab
d	d	dye cad do
D	ð	thy breathe father
dZ	dʒ	giant badge jam
f	f	phi caff fan
g	ɡ	guy bag
h	h	high ahead
j	j	yes yacht
k	k	sky crack
l	l	lie sly gal
m	m	my smile cam
n	n	nigh snide can
N	ŋ	sang sink singer
T	θ	thigh math
p	p	pie spy cap
r	r	rye try very
s	s	sigh mass
S	ʃ	shy cash emotion
t	t	tie sty cat atom
tS	tʃ	China catch
v	v	vie have
w	w	wye swine
hw	hw	why
z	z	zoo has
Z	ʒ	equation pleasure vision beige
x	x	ugh loch Chanukah
A	ɑː	PALM father bra
A	ɒ	LOT pod John
{	æ	TRAP pad shall ban
aI	aɪ	PRICE ride file fine pie
aU	aʊ	MOUTH loud foul down how
@	ɛ	DRESS bed fell men
eI	eɪ	FACE made fail vein pay
I	ɪ	KIT lid fill bin
O:	ɔː	THOUGHT Maud dawn fall straw
OI	ɔɪ	CHOICE void foil coin boy
@U	oʊ	GOAT code foal bone go
U	ʊ	FOOT good full woman
u:	uː	GOOSE food soon chew do
V	ʌ	STRUT mud dull gun
i	i	HAPPY serious
i:	iː	FLEECE seed feel mean sea
3:r	ɜ:r	LINER foundered current
4	ɾ	Adam atom coda

Mandarin Chinese

(C) Laboratoire Parole et Langage, Aix-en-Provence, France.

Pronunciation dictionary

The pronunciation dictionary was manually created for the syllables by Zhi Na.

Acoustic model

The acoustic model was created by Brigitte Bigi from data recorded in Shanghai by Zhi Na, and from another set recorded by Hongwei Ding. We address special thanks to them for giving us access to the corpus. These recordings are a Chinese version of the Eurom1 corpus. See the following publication for details:

Daniel Hirst, Brigitte Bigi, Hyongsil Cho, Hongwei Ding, Sophie Herment, Ting Wang (2013). Building OMProDat: an open multilingual prosodic database, Proceedings of Tools and Resources for the Analysis of Speech Prosody, Aix-en-Provence, France, Eds B. Bigi and D. Hirst, ISBN: 978-2-7466-6443-2, pp. 11-14.

Note that the current model was trained from a very small amount of data, which affects the results: do not expect good performance from the automatic alignment.

More Mandarin Chinese data are welcome!

Here is the phoneset used in the acoustic model:

SPPAS	- IPA -	Examples
@
N	ŋ
S	ʃ
a	a
e	e
f	f	访
i	i	一诒
i_d		三次
i		三河市
k	k	诰
k_h
l	l	论
m	m
n	n
o	o	讴
p	p	诐
p_h
s	s	诉
S		讹
s		识说
ss		许
t	t	掉诋
t_h		一统一通
ts	ts	诅
ts_h		䌽吹
ts_h		串吹
ts_hs		诎㐤
ts		证诊
tss		讵讲
u	u	诬
x	x	诨诲
y	y	诩语
z

Southern Min (or Min Nan) resources

(C) Laboratoire Parole et Langage, Aix-en-Provence, France.

We address special thanks to S-F Wang for giving us access to the corpus.

S-F Wang, J. Fon (2013). A Taiwan Southern Min spontaneous speech corpus for discourse prosody, Proceedings of Tools and Resources for the Analysis of Speech Prosody, Aix-en-Provence, France, Eds B. Bigi and D. Hirst, ISBN: 978-2-7466-6443-2, pp. 20-23.

Cantonese resources

Dictionary

Here is the phoneset used in the dictionary:

SPPAS	- IPA -	Examples
p
p_h
m
f
t
t_h
n
l
k
k_h
N
h
k_w
k_h_w
w
ts
ts_h
s
j
a:
6
E:
e
i:
I
O:
o
u:
U
9:
y:
8
@
S
tS
tS_h

Acoustic Model

(C) DSP and Speech Technology Laboratory, Department of Electronic Engineering, the Chinese University of Hong Kong.

This is a monophone Cantonese acoustic model, based on Jyutping of the Linguistic Society of Hong Kong (LSHK). Each state is trained with 32 Gaussian mixtures. The model is trained with HTK 3.4.1. The corpus for training is CUSENT, also developed in our laboratory.

Generally speaking, you may use the model for non-commercial, academic or personal use.

See COPYRIGHT for the details of the license: "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License".

We also have other well-trained Cantonese acoustic models. If you would like to use the models and/or the CUSENT corpus for commercial applications or development, please contact Professor Tan LEE for appropriate license terms.

The character pronunciations come from the Jyutping phrase box of the Linguistic Society of Hong Kong.

"The copyright of the Jyutping phrase box belongs to the Linguistic Society of Hong Kong. We would like to thank the Jyutping Group of the Linguistic Society of Hong Kong for permission to use the electronic file in our research and/or product development."

If you use the Cantonese acoustic model for academic research, please cite:

Tan Lee, W.K. Lo, P.C. Ching, Helen Meng (2002). Spoken language resources for Cantonese speech processing, Speech Communication, Volume 36, Issues 3–4, Pages 327-342

Website: http://dsp.ee.cuhk.edu.hk
Email: tanlee@ee.cuhk.edu.hk

Polish Resources

(C) Brigitte Bigi

Pronunciation Dictionary

The Polish dictionary is under the terms of the "GNU Public License", v3.

The Polish pronunciation dictionary was downloaded from the Ralf catalog of dictionaries for the Simon ASR system at http://spirit.blau.in/simon/import-pls-dictionary/. It was then converted (format and phoneset) and corrected by Brigitte Bigi, thanks to the help of Katarzyna Klessa.

Acoustic Model

SPPAS	- IPA -	Examples
a	a
b	b
c	c
x	ç
d	d
dz	d͡z
dZ	d͡ʒ
E	ɛ
E~	ɛ̃
f	f
g	g
x	h
i	i
j	j
n	ɲ
k	k
l	l
m	m
n	n
N	ŋ
O	ɔ
O	ɔː
o~	ɔ̃
p	p
Q	Q
r	r
r	ʀ
s	s
s`	ɕ
S	ʃ
t	t
v	v
w	w
x	x
y	y
z	z
ts	t͡s
tS	t͡ʃ
s`	ʂ
u	u
z`	ʑ
Z	ʒ

Portuguese Resources

(C) Brigitte Bigi

Pronunciation Dictionary

The Portuguese dictionary is under the terms of the "GNU Public License", v3.

The Portuguese pronunciation dictionary was downloaded from the Ralf catalog of dictionaries for the Simon ASR system at http://spirit.blau.in/simon/import-pls-dictionary/. It was then converted (format and phoneset) and corrected by Brigitte Bigi.

Acoustic Model

(C) Laboratoire Parole et Langage, Aix-en-Provence, France.

See COPYRIGHT.txt (in the "model-cat" directory) for the details of the license: "Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License".

The acoustic model was not trained from data. Monophones of other models were cut and pasted to create this one.

Here is the phoneset used in the acoustic model:

SPPAS	- IPA -	Examples
a	a
a~	ɑ̃
b	b
d	d
e	e
E	ɛ
f	f
g	g
i	i
I
j	j
J	ɲ
k	k
l	l
L	ʎ
m	m
n	n
N	ŋ
o	o
O	ɔ
p	p
r	r
R	ʀ
s	s
S	ʃ
t	t
u	u
u~
U	ʊ
v	v
w	w
x	x
y	y
z	z
Z	ʒ