Overview of 88milSMS-v1 corpus from CoMeRe repository

Panckhurst R., Détrie C., Lopez C., Moïse C., Roche M., Verine B. (2014-2016). 88milSMS. A corpus of authentic text messages in French (nouvelle version du corpus ISLRN : 024-713-187-947-8). In Chanier T. (ed) Banque de corpus CoMeRe. Ortolang : Nancy. [cmr-88milsms-tei-v1 ; https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1]

Overview of the corpus

Rationale for this corpus

Description du projet de 2014 : Une équipe pluridisciplinaire de linguistes et d’informaticiens a recueilli plus de 88 000 SMS authentiques en français à Montpellier, en 2011. Cette collecte a été effectuée dans le cadre du projet sud4science LR (, Sud4science Languedoc Roussillon. Mutation des pratiques scripturales en communication électronique médiée (financement principal : MSH-M)), lui-même faisant partie du projet international sms4science , coordonné par le CENTAL à l’Université catholique de Louvain (UCL) en Belgique. Lors du recueil des SMS, un questionnaire sociolinguistique a également été proposé aux participants. Les SMS du projet sud4science LR ont été ensuite anonymisés de manière semi-automatique (en collaboration avec des étudiants stagiaires et un juriste-CIL, Nicolas Hvoinsky, (DAJI, Université Paul-Valéry), puis partiellement transcodés (en français standardisé) et annotés (cf. Panckhurst et al. 2013, Panckhurst 2013, Panckhurst et Moïse 2014). La MSH-M (Maison des Sciences de l’Homme de Montpellier), la DGLFLF (Délégation générale à la langue française et aux langues de France) et le CNRS (PEPS ECOMESS, HuMaIn) ont soutenu ce travail. Les chercheurs remercient chaleureusement le correspondant informatique et libertés, CIL, Nicolas Hvoinsky de les avoir accompagnés et conseillés sur le plan juridique, ainsi que sa directrice, Stéphanie Delaunay (DAJI, Université Paul-Valéry Montpellier), tout au long de leur projet ; Cédrick Fairon, Louise-Amélie Cougnon, Hubert Naets, Cental, Université Catholique de Louvain, qui ont accompagné l'équipe dans le cadre du projet international SMS4science. 
Ils remercient vivement leurs étudiants stagiaires : Anthony Stifani (étudiant en Master Information et Communication à l’Université Paul-Valéry Montpellier 3), qui a manuellement analysé une partie des SMS, permettant ainsi d’évaluer le système d’anonymisation ; Pierre Accorsi et Namrata Patel (étudiants en Master d’Informatique à l’Université de Montpellier), qui ont développé le système informatisé Seek&Hide, permettant d’anonymiser le corpus ; Frédéric André, Yosra Ghliss, Camille Lagarde-Belleville et Michel Otell (étudiants en Master de Sciences du Langage à l’Université Paul-Valéry Montpellier 3) qui ont procédé à l’anonymisation manuelle en ligne à l’aide de Seek&Hide et à la vérification de l’anonymisation automatique du corpus ; Reda Bestandji, Ahmed Loudah, Aghiles Lounes, Zakaria Mokrani, Takfarinas Sider, Tarik Zaknoun (Master I Informatique, Spécialité : « Informatique pour les sciences », Université Montpellier) qui ont travaillé sur un système de transcodage automatique.

En 2016, le corpus initial a été mis au format TEI dans le cadre du projet CoMeRe (Communication médiée par les réseaux). La structure TEI utilisée est une extension de TEI pour les genres de CMC. Cette extension est développée par un projet européen dont les participants sont : Michael Beißwenger (DE), Thierry Chanier (FR), Isabella Chiari (IT), Maria Ermakova (DE), Maarten van Gompel (NL), Iris Hendrickx (NL), Axel Herold (DE), Henk van den Heuvel (NL), Lothar Lemnitzer (DE), Angelika Storrer (DE).

Description of the 2014 project: A multidisciplinary team of linguists and computer scientists collected more than 88,000 authentic French text messages in Montpellier in 2011, as part of the sud4science LR project (Sud4science Languedoc Roussillon. Mutation des pratiques scripturales en communication électronique médiée; main financial support: MSH-M). This project is itself part of the international sms4science project, coordinated by the CENTAL at the Université catholique de Louvain (UCL) in Belgium. Participants from the general public, who donated their SMS to science, were also invited to fill in a sociolinguistic questionnaire. The text messages from the sud4science LR project were then semi-automatically anonymised (in collaboration with student interns and a legal adviser (CIL), Nicolas Hvoinsky, DAJI, Université Paul-Valéry), before being partially transcoded (into standardised French) and annotated (cf. Panckhurst et al. 2013, Panckhurst 2013, Panckhurst and Moïse 2014). The MSH-M (Maison des Sciences de l’Homme de Montpellier), the DGLFLF (Délégation générale à la langue française et aux langues de France) and the CNRS (PEPS ECOMESS, HuMaIn) provided the main financial support for the project.
The researchers thank Nicolas Hvoinsky (CIL) and his director, Stéphanie Delaunay (DAJI, Université Paul-Valéry Montpellier 3), who accompanied and legally advised the team throughout the project, and Cédrick Fairon, Louise-Amélie Cougnon and Hubert Naets (Cental, Université Catholique de Louvain), who accompanied the team within the framework of the international SMS4science project. They also thank the students and interns who worked on the project: Anthony Stifani (Master’s student in Information and Communication, Université Paul-Valéry Montpellier 3), who manually analysed part of the text messages, thus allowing evaluation of the anonymisation system; Pierre Accorsi and Namrata Patel (Master’s students in Computer Science at the Université de Montpellier), who developed the ‘Seek&Hide’ software, used to anonymise the corpus; Michel Otell, Camille Lagarde-Belleville, Frédéric André and Yosra Ghliss (Master’s students in Language Sciences, Université Paul-Valéry Montpellier 3), who performed the online manual anonymisation with ‘Seek&Hide’ and verified the automatic anonymisation of the corpus; and Aghiles Lounes, Tarik Zaknoun, Zakaria Mokrani, Reda Bestandji, Takfarinas Sider and Ahmed Loudah (Master’s students in Computer Science, Université de Montpellier), who worked on an automatic transcoding system.

In 2016, the initial corpus was converted to the TEI format as part of the CoMeRe project (Communication Médiée par les Réseaux). The TEI structure used is an extension of TEI for CMC genres. This extension is being developed by a European project whose participants are: Michael Beißwenger (DE), Thierry Chanier (FR), Isabella Chiari (IT), Maria Ermakova (DE), Maarten van Gompel (NL), Iris Hendrickx (NL), Axel Herold (DE), Henk van den Heuvel (NL), Lothar Lemnitzer (DE), Angelika Storrer (DE).

Description of the Interaction Space

CMC Environment

sms: Definition of the modality SMS. Type of messages used in SMS.

Structure of interactions

post: one post corresponds to one SMS.
- xml:id: ID of the posting.
- when-iso: the date on which the system collected the message, i.e. the date the participant sent it to the system. It may differ from the date the message was sent to its addressee: a participant may have sent messages to her correspondents at different times, then gathered them and sent them to the system/server together.
- who: the anonymised telephone number. One ID therefore identifies one participant across the whole corpus. Although messages sent by the same participant (sender) can be studied together, no information is available about the receiver.
- type: type of message (cf. taxonomy).
distinct: This element has been used to tag emoticons and emojis, with a specific feature structure for the description of the emojis (see
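
The attributes listed above can be read with a few lines of Python. The sketch below is illustrative only: the XML fragment is hand-written from the attribute names given in this overview, and the assumption that post and p live in the standard TEI namespace has not been checked against the distributed corpus file. The function name read_posts is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

TEI_NS = "http://www.tei-c.org/ns/1.0"            # assumed namespace for <post> and <p>
XML_NS = "http://www.w3.org/XML/1998/namespace"   # namespace of the xml:id attribute

# Hand-written fragment for illustration; the real corpus layout may differ.
SAMPLE = f"""
<body xmlns="{TEI_NS}">
  <post xml:id="cmr-88milsms-a1509" when-iso="2011-09-18T20:28:14"
        who="#cmr-88milsms-p228" type="sms">
    <p>ça y est finis !</p>
  </post>
</body>
""".strip()


def read_posts(xml_text):
    """Return one dict per <post>, with the four attributes described above."""
    root = ET.fromstring(xml_text)
    posts = []
    for post in root.iter(f"{{{TEI_NS}}}post"):
        p = post.find(f"{{{TEI_NS}}}p")
        posts.append({
            "id": post.get(f"{{{XML_NS}}}id"),
            # when-iso is the collection time, not necessarily the sending time
            "collected_at": datetime.fromisoformat(post.get("when-iso")),
            # who points at a participant ID; drop the leading '#'
            "sender": post.get("who").lstrip("#"),
            "type": post.get("type"),
            "text": "".join(p.itertext()) if p is not None else "",
        })
    return posts


posts = read_posts(SAMPLE)
```

Because when-iso records when the server received the message, the collected_at values are better suited to ordering posts within the collection than to reconstructing the original conversation timing.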

Data Collection

Data collected: from 2011-09-15 to 2011-12-15
location: SMS collected at the Université Paul-Valéry Montpellier 3. Montpellier, France 7009369

Language of the data: French

Types of interaction

channel: mode: w, Short Message Service
constitution: type: single, Participants, mainly living in the area of Montpellier (see partDesc for detailed locations), agreed to give their SMS to .
derivation: type: original,
domain: domain of a message: business or domestic
factuality: type: fact,
interaction: type: complete, active: single,
preparedness: type: spontaneous,
purpose: open, i.e. several possible purposes

Extracts of Participants

Information on participants has been extracted from the file cmr-88milsms-participants_questionnaire_reponses.ods. Data were extracted and recomputed in 2016 during the TEI-CMC process; see cmr-88milsms-participants_questionnaire_reponses-v2.ods. occupation only indicates whether a participant was a student or pupil or had another occupation when the SMS were collected. The tag of langKnown is coded following ISO 639-3. The n of education corresponds to the codes for educational attainment in ISCED (International Standard Classification of Education, 2011, UNESCO). The content of education refers to the French educational system. To obtain n, we recomputed data from the previous ODS file, which did not explicitly mention whether people had attained an educational level or were still studying at that level. For information on phone usage, see fsdDecl.

Person ID: cmr-88milsms-p123
sex: male,
age: value: 24, ,
residence: France 34 ,
occupation: student_pupil,
langKnowledge: First language / mother tongue ,
education: n: 6, bac_5,
fs: phone_type( "keyboard" ), years_usage( "5+" ), week_posts( "20+" ), t9_usage( "no" ),
Person ID: cmr-88milsms-p271
sex: male,
age: value: 12, ,
residence: France 34 ,
occupation: non_student,
langKnowledge: First language / mother tongue ,
education: n: 2, college,
fs: phone_type( "smartphone_keyboard" ), years_usage( "1+" ), week_posts( "100+" ), t9_usage( "no" ),
Person ID: cmr-88milsms-p332
sex: female,
age: value: 20, ,
residence: France 34 ,
occupation: student_pupil,
langKnowledge: First language / mother tongue ,
education: n: 5, bac_3,
fs: phone_type( "smartphone_touchpad" ), years_usage( "5+" ), week_posts( "50+" ), t9_usage( "yes" ),
Person ID: cmr-88milsms-p452
sex: female,
age: value: 18, ,
residence: France 34 ,
occupation: non_student,
langKnowledge: First language / mother tongue ,
education: n: 3, bac_2,
fs: phone_type( "smartphone_touchpad" ), years_usage( "5+" ), week_posts( "20+" ), t9_usage( "yes" ),
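
The fs lines in these extracts flatten each participant's feature structure into name( "value" ) pairs. As a minimal sketch, assuming that flattened notation (which is specific to this overview, not the TEI encoding itself), the pairs can be turned into a dictionary. The names FEATURE_RE and parse_features are hypothetical.

```python
import re

# Matches pairs such as: phone_type( "keyboard" )
FEATURE_RE = re.compile(r'(\w+)\(\s*"([^"]*)"\s*\)')


def parse_features(fs_line):
    """Map each feature name in a flattened fs line to its quoted value."""
    return dict(FEATURE_RE.findall(fs_line))


line = 'phone_type( "keyboard" ), years_usage( "5+" ), week_posts( "20+" ), t9_usage( "no" ),'
features = parse_features(line)
```

Values such as "5+" or "20+" are kept as strings, since they denote open-ended ranges rather than exact counts.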

Extracts of Interactions

POST: xml:id: cmr-88milsms-a1504 | when-iso: 2011-09-18T20:02:40 | who: #cmr-88milsms-p228 | type: sms
p: Fouuuuuuu bisous doux a tte BB
POST: xml:id: cmr-88milsms-a1505 | when-iso: 2011-09-18T20:05:24 | who: #cmr-88milsms-p185 | type: sms
p: Zetes ou? Nou au ciné emo( ^^ )
POST: xml:id: cmr-88milsms-a1506 | when-iso: 2011-09-18T20:17:49 | who: #cmr-88milsms-p241 | type: sms
p: Dsl je viens de voir ton SMS. Mais te prend pas la tête! C'est pr les formalités! Fais simple
POST: xml:id: cmr-88milsms-a1507 | when-iso: 2011-09-18T20:20:51 | who: #cmr-88milsms-p477 | type: sms
p: Bon Courage Pepette pour demain et toute cette semaine! Bisoux
POST: xml:id: cmr-88milsms-a1508 | when-iso: 2011-09-18T20:26:46 | who: #cmr-88milsms-p136 | type: sms
p: Oh emo( :( ) ca va le moral quand meme ?
POST: xml:id: cmr-88milsms-a1509 | when-iso: 2011-09-18T20:28:14 | who: #cmr-88milsms-p228 | type: sms
p: ça y est finis !
POST: xml:id: cmr-88milsms-a1510 | when-iso: 2011-09-18T20:28:18 | who: #cmr-88milsms-p228 | type: sms
p: Ct bon emo( ;) )
POST: xml:id: cmr-88milsms-a1511 | when-iso: 2011-09-18T20:28:26 | who: #cmr-88milsms-p228 | type: sms
p: KEs tu fais BB ?
POST: xml:id: cmr-88milsms-a1512 | when-iso: 2011-09-18T20:36:32 | who: #cmr-88milsms-p228 | type: sms
p: Je vais lever la table
POST: xml:id: cmr-88milsms-a1513 | when-iso: 2011-09-18T20:36:58 | who: #cmr-88milsms-p228 | type: sms
p: Et après finir de bosser les cours !!! Mais aimerais être avc toi
POST: xml:id: cmr-88milsms-a1514 | when-iso: 2011-09-18T20:37:15 | who: #cmr-88milsms-p228 | type: sms
p: Demain on passe âprem ensemble mm si je dois bosser on sera ensemble
POST: xml:id: cmr-88milsms-a1515 | when-iso: 2011-09-18T20:37:27 | who: #cmr-88milsms-p401 | type: sms
p: : ( c'est nul
POST: xml:id: cmr-88milsms-a1516 | when-iso: 2011-09-18T20:37:33 | who: #cmr-88milsms-p228 | type: sms
p: Attention a ce ke tu dis matcho emo( ;) )
POST: xml:id: cmr-88milsms-a1517 | when-iso: 2011-09-18T20:38:40 | who: #cmr-88milsms-p401 | type: sms
p: A non en fait ça va
POST: xml:id: cmr-88milsms-a1518 | when-iso: 2011-09-18T20:40:48 | who: #cmr-88milsms-p477 | type: sms
p: Non je l'ai sur moi. Verifie avec les petits cables si ils n'y vont pas dans ton telephone. Le cable pour brancher la manette de la PS3 ( je crois que ca marche ). Bisoux
POST: xml:id: cmr-88milsms-a1519 | when-iso: 2011-09-18T20:41:28 | who: #cmr-88milsms-p341 | type: sms
p: Je vs souhaite un tres bon petit sejour les amoureux. Profitez bien. Pleins de gros bisous
POST: xml:id: cmr-88milsms-a1520 | when-iso: 2011-09-18T20:42:14 | who: #cmr-88milsms-p452 | type: sms
p: Tu me donneras ton secret pour [_forename_] , tu peux me le dire à moi avec quoi tu la shootes emo( face with stuck-out tongue and winking eye , U+1F61C ) emo( face with stuck-out tongue and tightly-closed eyes , U+1F61D ) emo( smiling face with open mouth and smiling eyes , U+1F604 ) !!! Bon dimanche à vous aussi, plein de bisous. emo( face throwing a kiss , U+1F618 )
POST: xml:id: cmr-88milsms-a1521 | when-iso: 2011-09-18T20:42:25 | who: #cmr-88milsms-p452 | type: sms
p: Tous rentrés !! Passez une très bonne soirée et plus si affinités emo( relieved face , U+1F60C ) !!! Plein de bisous emo( face throwing a kiss , U+1F618 )
POST: xml:id: cmr-88milsms-a1522 | when-iso: 2011-09-18T20:42:52 | who: #cmr-88milsms-p452 | type: sms
p: Avec plaisir ! Merci, bonne soirée à toi aussi. Bisous
POST: xml:id: cmr-88milsms-a1523 | when-iso: 2011-09-18T20:44:19 | who: #cmr-88milsms-p189 | type: sms
p: Slt tu es dispo demain matin pour un squash ou début d'aprem ? [_forename_]
POST: xml:id: cmr-88milsms-a1524 | when-iso: 2011-09-18T20:44:36 | who: #cmr-88milsms-p189 | type: sms
p: 10h10 ca te va?
POST: xml:id: cmr-88milsms-a1525 | when-iso: 2011-09-18T20:44:47 | who: #cmr-88milsms-p189 | type: sms
p: Ok je réserve a demain.
POST: xml:id: cmr-88milsms-a1526 | when-iso: 2011-09-18T20:47:43 | who: #cmr-88milsms-p477 | type: sms
p: Ah bon... Bah j'sais pas alors... Oui oui. Bisoux
POST: xml:id: cmr-88milsms-a1527 | when-iso: 2011-09-18T20:49:58 | who: #cmr-88milsms-p228 | type: sms
p: J'en envie de venir !!!!

POST: xml:id: cmr-88milsms-a1530 | when-iso: 2011-09-18T20:54:14 | who: #cmr-88milsms-p228 | type: sms
p: Nn pas du tt
POST: xml:id: cmr-88milsms-a1531 | when-iso: 2011-09-18T20:58:00 | who: #cmr-88milsms-p156 | type: sms
p: Je t'envoie un message quand je sors de la maison demain ! Ca sera pour 10h je pense
POST: xml:id: cmr-88milsms-a1533 | when-iso: 2011-09-18T20:59:49 | who: #cmr-88milsms-p477 | type: sms
p: Coucou toi! Comment tu vas? Alors cette rentree en cours, ca s'passe bien? Le boulot... Ca v mieux ou ton patron est toujours un tortionnaire?! Faudrait qu'on essais de se voir un de ces 4! Bisoux!
POST: xml:id: cmr-88milsms-a1534 | when-iso: 2011-09-18T21:01:00 | who: #cmr-88milsms-p341 | type: sms
p: Kikou. Oui ca va impec. Je t'ai appele pour prendre de tes nouvelles. J'espere que tu vas bien. Gros bisous
POST: xml:id: cmr-88milsms-a1535 | when-iso: 2011-09-18T21:13:57 | who: #cmr-88milsms-p136 | type: sms
p: Dans quel secteur le boulot ?
POST: xml:id: cmr-88milsms-a1536 | when-iso: 2011-09-18T21:14:10 | who: #cmr-88milsms-p136 | type: sms
p: emo( :( ) emo( <3 ) ! J'suis désolée [_nickname_] . Je langis de te voir !
POST: xml:id: cmr-88milsms-a1537 | when-iso: 2011-09-18T21:16:25 | who: #cmr-88milsms-p136 | type: sms
p: Ah ouais vraiment pas évident a trouver dans ce secteur !
POST: xml:id: cmr-88milsms-a1538 | when-iso: 2011-09-18T21:16:36 | who: #cmr-88milsms-p136 | type: sms
p: T'es toute seule la ?
POST: xml:id: cmr-88milsms-a1539 | when-iso: 2011-09-18T21:20:23 | who: #cmr-88milsms-p136 | type: sms
p: Oh [_nickname_] emo( :( ) je suis désolée j'aimerais etre la ! Il t'a ecrit quoi ? Ca arrive tout le temps les coups de blues emo( :( ) c'est nul d'etre loin de toi
POST: xml:id: cmr-88milsms-a1540 | when-iso: 2011-09-18T21:21:45 | who: #cmr-88milsms-p136 | type: sms
p: Y'a une formation dans le coin ? moi je te laisse aussi, je suis avec [_forename_] on regarde une serie ! A tout bientot jespere ma poulette ! Love
POST: xml:id: cmr-88milsms-a1541 | when-iso: 2011-09-18T21:30:18 | who: #cmr-88milsms-p123 | type: sms
p: Salut emo( ;-) ) Oui, au final j'ai regardé un film, le castor. C'etait pas mal. Pour l'asso, en fait ils ont pas trop critiqué, mais on etait presque les 2 seuls a manquer avec [_forename_] ... [_forename_] aurait dit qu'il pouvait pas trop voir les sejours car il etait tt le temps au meme endroit. Mais c'est n'importe quoi. Et ton we se passe bien a part ca ?
POST: xml:id: cmr-88milsms-a1542 | when-iso: 2011-09-18T21:30:37 | who: #cmr-88milsms-p123 | type: sms
p: Et oui ! Bonne soirée emo( :-) )
POST: xml:id: cmr-88milsms-a1543 | when-iso: 2011-09-18T21:37:18 | who: #cmr-88milsms-p332 | type: sms
p: Coucou, il me semble qu'aujourd'hui c'est ton anniversaire (sauf si c'était le 16 mais dans mon cerveau c'est aujourd'hui) alors pour la peine je te souhaite un très très très joyeux anniversaire!!! Et si je suis en retard j'espère que mes excuses seront accepté comme il se doit !!!!! bisoux!!! Et très joyeux anniversaire emo( ;) )
POST: xml:id: cmr-88milsms-a1544 | when-iso: 2011-09-18T21:39:33 | who: #cmr-88milsms-p136 | type: sms
p: Abusé le gars ... Genre comme si il pouvait pas etre respectueux et accepter le fait que tu ne veuilles pas le voir ... Heureusement que [_forename_] etait la deja
POST: xml:id: cmr-88milsms-a1545 | when-iso: 2011-09-18T21:39:40 | who: #cmr-88milsms-p136 | type: sms
p: Donnez vous rendez vous a la terrasse d'un café sinon ? En terrain neutre ! Je sais que c'est pas evident emo( :( )
POST: xml:id: cmr-88milsms-a1546 | when-iso: 2011-09-18T21:44:52 | who: #cmr-88milsms-p323 | type: sms
p: On vient juste d'arriver, on va manger un petit bout de fouace et au dodo. Merci pr la poche, la boite pr les pates est trop mimi! En tout cas on est content d'etre venus sr Rodez, ct vrmt chouette. Gros bisous, bonne nuit!
POST: xml:id: cmr-88milsms-a1547 | when-iso: 2011-09-18T21:46:24 | who: #cmr-88milsms-p136 | type: sms
p: Ben ca va si tu restes ferme en le voyant, mais apres c'est vraiment pas evident si il est odieux comme ca Avec toi ...
POST: xml:id: cmr-88milsms-a1548 | when-iso: 2011-09-18T21:48:05 | who: #cmr-88milsms-p287 | type: sms
p: Merci pr le lien. Maman m'a dit ton résultat, c'est pas mal vu le niveau des concurents, tu dois etre content. Tu vas bien dormir cette nuit.
POST: xml:id: cmr-88milsms-a1549 | when-iso: 2011-09-18T21:48:32 | who: #cmr-88milsms-p271 | type: sms
p: Tout se que ta l'habitude de faire dans une journee tt absolument tout
POST: xml:id: cmr-88milsms-a1550 | when-iso: 2011-09-18T21:48:50 | who: #cmr-88milsms-p271 | type: sms
p: Parle moi de toi ! Raconte ta vie sa me passionne
POST: xml:id: cmr-88milsms-a1551 | when-iso: 2011-09-18T21:49:09 | who: #cmr-88milsms-p271 | type: sms
p: Ya largo winch sur la 1
POST: xml:id: cmr-88milsms-a1552 | when-iso: 2011-09-18T21:49:22 | who: #cmr-88milsms-p271 | type: sms
p: On parle vers 21h15 j'ai pas fini de bouffer
POST: xml:id: cmr-88milsms-a1553 | when-iso: 2011-09-18T21:49:41 | who: #cmr-88milsms-p271 | type: sms
p: Nan se matin emo( ^^ ) mais la j'avais besoin de champoing et d'apres champoing sa met du temps emo( ^^ )
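
The extract lines above follow a regular pattern (a POST header line, then a p line with the message text), so they can be grouped by sender and scanned for tagged emoticons. The sketch below assumes that flattened display format, which is specific to this overview; the names POST_RE, EMO_RE and parse_extract are hypothetical, and the three sample posts are copied from the extract.

```python
import re
from collections import Counter

POST_RE = re.compile(
    r"POST: xml:id: (?P<id>\S+) \| when-iso: (?P<when>\S+) "
    r"\| who: #(?P<who>\S+) \| type: (?P<type>\S+)"
)
# Tagged emoticons and emojis appear as: emo( ;) ) or emo( relieved face , U+1F60C )
EMO_RE = re.compile(r"emo\( (.*?) \)")


def parse_extract(lines):
    """Pair each POST header with the p line that follows it."""
    posts, current = [], None
    for line in lines:
        header = POST_RE.match(line)
        if header:
            current = {**header.groupdict(), "text": "", "emo": []}
            posts.append(current)
        elif line.startswith("p: ") and current is not None:
            current["text"] = line[3:]
            current["emo"] = EMO_RE.findall(current["text"])
    return posts


SAMPLE = [
    "POST: xml:id: cmr-88milsms-a1509 | when-iso: 2011-09-18T20:28:14 | who: #cmr-88milsms-p228 | type: sms",
    "p: ça y est finis !",
    "POST: xml:id: cmr-88milsms-a1510 | when-iso: 2011-09-18T20:28:18 | who: #cmr-88milsms-p228 | type: sms",
    "p: Ct bon emo( ;) )",
    "POST: xml:id: cmr-88milsms-a1515 | when-iso: 2011-09-18T20:37:27 | who: #cmr-88milsms-p401 | type: sms",
    "p: : ( c'est nul",
]

posts = parse_extract(SAMPLE)
posts_per_sender = Counter(post["who"] for post in posts)
```

Only sequences wrapped in emo( ... ) are counted here; an untagged sequence such as ": (" in post a1515 is left in the text as-is.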
