Grand corpus de sms, smslareunion, banque de corpus CoMeRe
This page: http://hdl.handle.net/11403/comere/cmr-smslareunion/cmr-smslareunion-tei-v1
Back to corpus: http://hdl.handle.net/11403/comere/cmr-smslareunion
How to cite this resource
Ledegen, G. (2014). Grand corpus de sms smslareunion. In Chanier T. (ed) Banque de
corpus CoMeRe. Ortolang : Nancy.
This form has been automatically extracted from the TEI file. For the full
contents, see http://hdl.handle.net/11403/comere/cmr-smslareunion/cmr-smslareunion-tei-v1.xml.
Overview of the corpus
The first version of the corpus was established in the context of
sms4science (Fairon 2006), a research program initiated in 2004 by
CENTAL (Centre de Traitement Automatique du Langage, Catholic University of Louvain,
Belgium). Conducted first in La Réunion, the project brought together 21,694 SMS
messages sent between April and June 2008 by 1,744 users, yielding 12,622
finalized SMS messages. The initial corpus was converted into TEI within the framework
of the CoMeRe (Communication médiée par les réseaux)
project. This project aims to assemble
different network-mediated communication corpora in French (Internet,
telecommunication), to structure them in a standard format and to release them in
an open access format for research purposes. The CoMeRe project has received support
from ORTOLANG and the national consortium Corpus-écrits.
Keywords: applied_linguistics; discourse_analysis; text_and_corpus_linguistics;
primary_text; dialogue; Communication Médiée par les Réseaux; CoMeRe; texto;
Computer Mediated Communication; Short Message Service
Ledegen, G. (2010). Contact de langues à La Réunion : « On ne débouche pas des
cadeaux. Ben i fé qoué alors ? ». Langues et Cité, ‘Langues en contact’, n° 16, 9-10
Ledegen, G. (2011). Résonance SMS : « Jc c koi mé javé pa rèalizé sur le coup! ».
LINX, n° 57, Gadet, F. Guérin, E. (Dirs), ‘Français parlé/français hors de France/créoles
à base française d'un point de vue syntaxique’, 101-112.
Ledegen, G., M. Blondel, J. Gonac’h et J. Seeli. (2011). « Contacts de langues dans
les SMS ‘sourds’ ». Langues et cité Bulletin de l’observatoire des pratiques
linguistiques, n° 19, ‘Parler (avec) plusieurs langues : l’alternance codique’, 10.
Rationale for this corpus
This corpus is a subpart of the CoMeRe corpus databank. The CoMeRe
(Communication Médiée par les Réseaux) project aims to
build a kernel corpus assembling existing corpora of different CMC (Computer-Mediated
Communication) genres and new corpora built on data extracted from the Internet. These
heterogeneous corpora will be structured and processed in a uniform way and complemented with
metadata. CoMeRe will be released as OpenData through the national infrastructure
Ortolang, following constraints which will be reused for the forthcoming “Corpus de
Référence du Français”. The project is supported by the national consortium Corpus-écrits, a sub-part of Huma-Num, and by Ortolang (French
correspondent to DARIAH).
The TEI structure used is an extension of TEI for CMC genres. This
extension is developed by a European project whose participants are: Michael Beißwenger
(DE), Thierry Chanier (FR), Isabella Chiari (IT), Maria Ermakova (DE), Maarten van Gompel
(NL), Iris Hendrickx (NL), Axel Herold (DE), Henk van den Heuvel (NL), Lothar Lemnitzer
(DE), Angelika Storrer (DE).
Description of the Interaction Space
CMC Environment
: Definition of the modality SMS. Type of messages used in SMS.
Structure of interactions
post: one post corresponds to one SMS.
- xml:id: ID of the posting.
- when: date on which the message was collected by the system. It depends on when the
participant sent the message to the system and is not the date of the conversation. Accordingly, a
participant may have sent her messages to her correspondents at different times, but
may have assembled them and sent them together to the server.
- who: anonymized telephone number. Hence one ID identifies one
participant over the whole corpus. Although messages sent by the same participant (sender)
may be studied, it should be noted that we have no information about the
- type: type of message, cf. taxonomy.
reg: This element appears inside the
- The attribute type with the value transortho indicates that it corresponds
to a normalization, from an orthographic standpoint, of the content of the
- del: corresponds to an omitted word such as ne, je/tu/il,
le/lui/en/y, à/sur, que/c'est/peux, etc.
- seg
xml:lang: phrase normalized either in French or French pidgin (Créole de
la Réunion)
- seg
cert: phrase (labelled as "flottant") where there is some uncertainty
about which language has been used. When unknown, the language cannot be
determined. When medium, the researcher guesses that it can be either
French or Pidgin (see the xml:lang).
add: This element appears inside
- type F: specifies an alternate interpretation of a previous
phrase ("flottant") which has been attributed to a given language: Pidgin if the
previous version was French, or vice versa. This time the seg appears with
a cert and a reduced degree of certainty: low
- type trad: translation into French of part of the SMS (not
necessarily the whole content of the SMS). There is no indication of the coverage of
the translation, i.e. whether it is a full or partial translation
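The structure described above can be read with a few lines of Python. The following is a minimal sketch, not the project's own tooling. It assumes that posts are encoded as post elements in the standard TEI namespace, that the raw text sits in a p element and the orthographic normalization in a reg element with type transortho somewhere below the post, and that the timestamp attribute is named when-iso as in the extracts further down. The sample fragment and the function name read_posts are hypothetical; check the actual nesting in the TEI file before relying on it.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical fragment: element nesting is an assumption, attribute values
# are taken from one of the extracts shown in this page.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <post xml:id="p_cmr-slr-c001-a00002" when-iso="2008-04-10T18:10:44"
        who="#cmr-slr-c001-p1001" type="sms">
    <p>Jarive</p>
    <reg type="transortho">J'arrive</reg>
  </post>
</body>"""


def text_of(element):
    """Return the full text content of an element, or None if it is missing."""
    if element is None:
        return None
    return "".join(element.itertext()).strip()


def read_posts(xml_text):
    """Collect the attributes, raw text and normalized text of every post."""
    root = ET.fromstring(xml_text)
    posts = []
    for post in root.iter(f"{{{TEI_NS}}}post"):
        raw = post.find(f".//{{{TEI_NS}}}p")
        reg = post.find(f".//{{{TEI_NS}}}reg[@type='transortho']")
        posts.append({
            "id": post.get(f"{{{XML_NS}}}id"),
            "when": post.get("when-iso"),
            # who is a pointer of the form "#participant-id"
            "who": (post.get("who") or "").lstrip("#"),
            "type": post.get("type"),
            "raw": text_of(raw),
            "normalized": text_of(reg),
        })
    return posts


POSTS = read_posts(SAMPLE)
```

Keep in mind that the timestamp records when the server received the message, not when the conversation took place, so it should not be used to reconstruct the order of exchanges.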
Data Collection
Data collected : From 2008-04-10 to 2008-06-30
A private company collected the messages and sent them to Laboratoire
de recherche sur les espaces Créolophones et Francophones, Université de
La Réunion, France
Language of the data:
Types of interaction
channel: mode: w
Short Message Service
constitution: Harvesting SMS requires the intervention of a technical partner, Cirrus
Informatique, which took charge of the reception of SMS and the transfer to the
derivation: type: original
domain: domain of a message : business or domestic
factuality: type: fact
interaction: type: complete
active: single
preparedness: type: spontaneous
purpose: open, i.e. several possible purposes
Participants (extract)
Questionnaire: Some participants answered a questionnaire. The
questionnaire is detailed in the document
cmr-smslareunion-tei-v1-questionnaire.pdf. Answers to the questionnaire
are in the document cmr-smslareunion-tei-v1-answers.csv. Please note
that persons who filled in the questionnaire may not have sent any SMS; hence they are not
listed here as participants (e.g. cmr-slr-c001-p005). Conversely, many participants
listed here have not filled in the questionnaire.
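Because the questionnaire respondents and the SMS senders only partly overlap, it can help to compare the two sets of person IDs before joining answers to messages. The sketch below is illustrative only: the function name compare_ids is hypothetical, the column layout of the answers file is not described here, and the ID lists are made up for the example (only cmr-slr-c001-p005 is documented above as a respondent who is not listed as a participant).

```python
def compare_ids(questionnaire_ids, sender_ids):
    """Split person IDs into three groups: respondents who also sent SMS,
    respondents who sent none, and senders who did not answer."""
    respondents = set(questionnaire_ids)
    senders = set(sender_ids)
    return {
        "both": sorted(respondents & senders),
        "questionnaire_only": sorted(respondents - senders),
        "senders_only": sorted(senders - respondents),
    }


# Illustrative input only: which IDs answered the questionnaire is an assumption.
QUESTIONNAIRE_IDS = ["cmr-slr-c001-p005", "cmr-slr-c001-p0011"]
SENDER_IDS = ["cmr-slr-c001-p0011", "cmr-slr-c001-p0012", "cmr-slr-c001-p0017"]

RESULT = compare_ids(QUESTIONNAIRE_IDS, SENDER_IDS)
```

Only the IDs in the "both" group can be used for analyses that relate questionnaire answers to message content.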
Person ID= cmr-slr-c001-p0011
Person ID= cmr-slr-c001-p0012
Person ID= cmr-slr-c001-p0017
Person ID= cmr-slr-c001-p0021
Extracts of Interactions
- POST: xml:id: p_cmr-slr-c001-a00001
| when-iso: 2008-04-10T10:57:44
| who: #cmr-slr-c001-p0143
| type: sms
p: Parfait. Au lcf au niveau -1.
reg: type: transortho,Parfait. Au LCF au niveau -1.
- POST: xml:id: p_cmr-slr-c001-a00002
| when-iso: 2008-04-10T18:10:44
| who: #cmr-slr-c001-p1001
| type: sms
p: Jarive
reg: type: transortho,J'arrive
- POST: xml:id: p_cmr-slr-c001-a00003
| when-iso: 2008-04-10T18:22:15
| who: #cmr-slr-c001-p1002
| type: sms
p: Koman i lé?
reg: type: transortho,Koman i lé?
- POST: xml:id: p_cmr-slr-c001-a00004
| when-iso: 2008-04-10T18:58:11
| who: #cmr-slr-c001-p1003
| type: sms
p: Nous avons recu vos voeux au dessus de l'afrique. Et vous souhaitons une traversée 2008
pleine de rires et de douceurs. Bises les dang
reg: type: transortho,Nous avons recu vos voeux au dessus de l'afrique. Et vous
souhaitons une traversée 2008 pleine de rires et de douceurs. Bises les dang
- POST: xml:id: p_cmr-slr-c001-a00005
| when-iso: 2008-04-10T19:04:02
| who: #cmr-slr-c001-p1004
| type: sms
p: "Slt,j't'ai envoyé 1 mèl èk ttes les infos du salon Actufac.En ce ki concerne la
bourse,tu peux cumuler le crous et le Conseil général.Belle jrnée"
reg: type: transortho,"Salut, je t'ai envoyé un mail avec toutes les informations du
salon Actufac. En ce qui concerne la bourse, tu peux cumuler le CROUS et le Conseil
Général. Belle journée"
- POST: xml:id: p_cmr-slr-c001-a00006
| when-iso: 2008-04-10T21:29:43
| who: #cmr-slr-c001-p1005
| type: sms
p: "Ben la tu va en chier paske jte doneré plu jamé de nouvèl.va voir ailleur si ji
sui,pauvre tache.je jouis mèm pa,c nimport kosa"
reg: type: transortho,"Ben là tu vas en chier parce que je [ne] te donnerai plus jamais
des nouvelles. Va voir ailleurs si j'y suis, pauvre tâche. Je [ne] jouis même pas, c'est
n'importe quoi."
- POST: xml:id: p_cmr-slr-c001-a00007
| when-iso: 2008-04-10T21:31:44
| who: #cmr-slr-c001-p1005
| type: sms
p: "Coucou
,skuz texto tar.le resto c Biologic food (Beau Bon Bien),tu peu pa le louper,il est
just a coté de la station essence derièr quick:fo ke tu rentre kom si tu alé fèr
dlessence et c just aprè.jvé envoyé ce texto o 3699 tiens!big biz a toi,a 2min"
reg: type: transortho,"Coucou
, excuse texto tard. Le restaurant c'est Biologic Food (Beau Bon Bien), tu [ne]
peux pas le louper ; il est juste à côté de la station essence derrière Quick : il faut
que tu rentres comme si tu allais faire de l'essence et c'est juste après. Je vais
envoyer ce texto au 3699, tiens. Grosses bises à toi, à dans 2 minutes."
- POST: xml:id: p_cmr-slr-c001-a00008
| when-iso: 2008-04-11T08:07:36
| who: #cmr-slr-c001-p1006
| type: sms
p: Bisoir c
lol __ pr te demander si ca te dit 2venir chez flo ce soir pcq c mort lé concerts
pr faire réseau warcraft?rép vite stp biz
reg: type: transortho,Bonsoir c'est
lol_pour te demander si ça te dit de venir chez
ce soir parce que c'est mort les concerts pour faire réseau Warcraft? Réponds vite
s.t.p. bises
- POST: xml:id: p_cmr-slr-c001-a00009
| when-iso: 2008-04-11T10:14:55
| who: #cmr-slr-c001-p1007
| type: sms
p: Viens me chercher à 3 h stp
reg: type: transortho,Viens me chercher à 3h s.t.p.
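The extracts above follow a regular flattened layout: a "- POST: xml:id:" line, attribute lines starting with "| ", a "p:" line with the raw SMS and a "reg: type: transortho," line with the normalized text, either of which may wrap onto following lines. As a rough sketch for working with this page when the TEI file is not at hand, the listing can be parsed as follows. The function name parse_extract is hypothetical, and the parser assumes that no wrapped line happens to begin with one of those markers.

```python
def parse_extract(text):
    """Parse the flattened POST listing of this page into a list of dicts."""
    posts, current, field = [], None, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("- POST: xml:id:"):
            current = {"id": line.split("xml:id:", 1)[1].strip(), "p": "", "reg": ""}
            posts.append(current)
            field = None
        elif current is None:
            continue  # ignore anything before the first post
        elif line.startswith("| "):
            # attribute line, e.g. "| who: #cmr-slr-c001-p1001"
            key, _, value = line[2:].partition(":")
            current[key.strip()] = value.strip()
        elif line.startswith("p: "):
            field = "p"
            current["p"] = line[3:]
        elif line.startswith("reg: type: transortho,"):
            field = "reg"
            current["reg"] = line.split(",", 1)[1]
        elif field:
            # wrapped continuation of the current p or reg text
            current[field] += " " + line
    return posts


SAMPLE = """- POST: xml:id: p_cmr-slr-c001-a00002
| when-iso: 2008-04-10T18:10:44
| who: #cmr-slr-c001-p1001
| type: sms
p: Jarive
reg: type: transortho,J'arrive
- POST: xml:id: p_cmr-slr-c001-a00004
| when-iso: 2008-04-10T18:58:11
| who: #cmr-slr-c001-p1003
| type: sms
p: Nous avons recu vos voeux au dessus de l'afrique. Et vous souhaitons une traversée 2008
pleine de rires et de douceurs. Bises les dang
reg: type: transortho,Nous avons recu vos voeux au dessus de l'afrique. Et vous
souhaitons une traversée 2008 pleine de rires et de douceurs. Bises les dang
"""

POSTS = parse_extract(SAMPLE)
```

For any systematic study, the TEI file itself remains the reference; this page is only an automatic extraction of it.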
Composition of the corpus
Download the whole corpus: http://hdl.handle.net/11403/comere/cmr-smslareunion/cmr-smslareunion-tei-v1.zip (ZIP file, 4.5 MB)
nbparticipants=884 ; nbmessages=12622
principal : Ledegen Gudrun, Chanier Thierry.
compiler : Ledegen Gudrun.
editor : Chanier Thierry.
data inputter : Hriba Linda, Jin Kun, Caron Gauthier, Corré Gaëlle, Guillemain Marie-Caroline.
developer : Lotin Paul.
participant : Longhi Julien.
publisher : ORTOLANG (Outils et Ressources pour un Traitement Optimisé de la
LANGue), Nancy:France
Publication Statement and Rights
Date: 2014-05-01
uri: cmr-smslareunion-tei-v1
short-uri: cmr-slr-c001
url: http://hdl.handle.net/11403/comere/cmr-smslareunion/cmr-smslareunion-tei-v1
Rights holders of this corpus are: LCF ; Gudrun Ledegen ; Thierry
This corpus can be freely distributed and shared subject only to
attribution. The way to reference / cite the corpus is given in the