88milSMS. A corpus of authentic text messages in French.
Open Resources and TOols for LANGuage
This page: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1
Back to corpus main page: https://hdl.handle.net/11403/comere/cmr-88milsms
Download the TEI file: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1.xml
Panckhurst R., Détrie C., Lopez C., Moïse C., Roche M., Verine B. (2014-2016). 88milSMS. A corpus of authentic text messages in French (nouvelle version du corpus ISLRN : 024-713-187-947-8). In Chanier T. (ed) Banque de corpus CoMeRe. Ortolang : Nancy. [cmr-88milsms-tei-v1 ; https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1]
Keywords: Computer Mediated Communication; CMC; Short Message Service
References
Composition
The whole corpus cmr-88milsms-tei-v1 includes the following elements
Download the corpus (without videos) corresponding to this topic: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1.zip
Coverage: nbparticipants=422; nbmessages=88522; nbemoticons-emojis=29563
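As a rough illustration, the coverage figures above can be read into numbers, for instance to estimate the average number of messages per participant. The helper name `parse_coverage` is hypothetical and not part of any corpus tooling; the input string is simply the coverage line as displayed on this page.

```python
def parse_coverage(line: str) -> dict[str, int]:
    """Parse a 'key=value; key=value' coverage string into a dict of ints."""
    pairs = (item.strip() for item in line.split(";"))
    return {
        key.strip(): int(value)
        for key, value in (pair.split("=", 1) for pair in pairs if pair)
    }


# Coverage line copied from this page (spacing around ';' is tolerated).
coverage = parse_coverage(
    "nbparticipants=422 ; nbmessages=88522; nbemoticons-emojis=29563"
)

# Average number of messages donated per participant.
messages_per_participant = coverage["nbmessages"] / coverage["nbparticipants"]
print(round(messages_per_participant, 1))  # 209.8
```

This is only an average; the page does not state how messages are distributed across participants.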
Description of the 2014 project: A multidisciplinary team of linguists and computer scientists collected more than 88,000 authentic French text messages in Montpellier in 2011, as part of the sud4science LR project (Sud4science Languedoc Roussillon. Mutation des pratiques scripturales en communication électronique médiée; main financial support: MSH-M). That project is itself part of the international sms4science project, coordinated by the CENTAL at the Université catholique de Louvain (UCL) in Belgium. Participants from the general public, who donated their SMS to science, were also invited to fill in a sociolinguistic questionnaire. The text messages from the sud4science LR project were then semi-automatically anonymised (in collaboration with student interns and a legal adviser and CIL, Nicolas Hvoinsky, DAJI, Université Paul-Valéry), then partially transcoded (into standardised French) and annotated (cf. Panckhurst et al. 2013, Panckhurst 2013, Panckhurst et Moïse 2014). The MSH-M (Maison des Sciences de l’Homme de Montpellier), the DGLFLF (Délégation générale à la langue française et aux langues de France) and the CNRS (PEPS ECOMESS, HuMaIn) provided the main financial support for the project.
The researchers warmly thank Nicolas Hvoinsky, CIL (correspondant informatique et libertés), and his director, Stéphanie Delaunay (DAJI, Université Paul-Valéry Montpellier 3), who accompanied and legally advised the team throughout the project; and Cédrick Fairon, Louise-Amélie Cougnon and Hubert Naets (Cental, Université Catholique de Louvain), who accompanied the team within the framework of the international SMS4science project.
They also thank their student interns: Anthony Stifani (Master’s student in Information and Communication, Université Paul-Valéry Montpellier 3), who manually analysed a portion of the text messages, which made it possible to evaluate the anonymisation system; Pierre Accorsi and Namrata Patel (Master’s students in Computer Science, Université de Montpellier), who developed the Seek&Hide software used to anonymise the corpus; Frédéric André, Yosra Ghliss, Camille Lagarde-Belleville and Michel Otell (Master’s students in Language Sciences, Université Paul-Valéry Montpellier 3), who performed the online manual anonymisation with Seek&Hide and verified the automatic anonymisation of the corpus; Reda Bestandji, Ahmed Loudah, Aghiles Lounes, Zakaria Mokrani, Takfarinas Sider and Tarik Zaknoun (first-year Master’s students in Computer Science, specialisation "Informatique pour les sciences", Université de Montpellier), who worked on an automatic transcoding system.
In 2016, the initial corpus was converted to TEI as part of the CoMeRe project (Communication Médiée par les Réseaux). The TEI structure used is an extension of TEI for CMC genres. This extension is being developed by a European project whose participants are: Michael Beißwenger (DE), Thierry Chanier (FR), Isabella Chiari (IT), Maria Ermakova (DE), Maarten van Gompel (NL), Iris Hendrickx (NL), Axel Herold (DE), Henk van den Heuvel (NL), Lothar Lemnitzer (DE), Angelika Storrer (DE).
Editorial procedures
The current corpus cmr-88milsms-tei-v1 was built from the 2014 version. The following changes have been made:
1) Changes to "88milSMS_88522_emoji-utf8.xml" to make it fully XML compatible
2) POS tagging of emoticons (may not be exhaustive). Note: emojis had already been tagged in the 2014 source file
3) Computations on participants and integration of their descriptions in particDesc
4) Transformation of the new XML file obtained in 1) into TEI-CMC
5) Description of the teiHeader
Message contents were anonymised by the corpus compiler in 2014. The encoding of anonymisation has been standardized across all CoMeRe corpora. See fsDecl for more details, as well as the document cmr-88milsms-guide.pdf.
Each post corresponds to one SMS.
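Since each post corresponds to one SMS, counting post elements is a simple sanity check on a downloaded copy of the TEI file. The sketch below is a minimal example under stated assumptions: the function names are hypothetical, the element is matched on its local name `post` whatever namespace it sits in, and the inline sample is a made-up stand-in with placeholder content, not an extract of the corpus.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def count_posts(source) -> int:
    """Stream through an XML file (path or file object) and count <post> elements."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "post":
            count += 1
            elem.clear()  # release the element's content to keep memory low
    return count


# Made-up miniature document standing in for the real TEI file.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><post>placeholder 1</post><post>placeholder 2</post></div>
</body></text></TEI>"""

print(count_posts(io.BytesIO(SAMPLE)))  # 2
```

On the real file, one would pass the local path of the downloaded XML instead of the in-memory sample; if every SMS is indeed encoded as one post, the result should presumably match the nbmessages figure given under Coverage.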
CMC Environment
Structure of interactions
Data Collection
Data collected: from 2011-09-15 to 2011-12-15
Types of interaction
Extracts of Participants
Information on participants was extracted from the file cmr-88milsms-participants_questionnaire_reponses.ods. The data were extracted and recomputed in 2016 during the TEI-CMC conversion; see cmr-88milsms-participants_questionnaire_reponses-v2.ods.
occupation only indicates whether a participant was a student or pupil, or had another occupation, when the SMS were collected.
The tag of langknown is coded following ISO 639-3.
The n of education corresponds to the codes for educational attainment of ISCED (International Standard Classification of Education, 2011, UNESCO). The content of education refers to the French educational system. To obtain n, we recomputed data from the previous ODS file, which did not explicitly state whether people had attained an educational level or were still studying at that level.
For information on phone usage, see fsdDecl.
Person ID: cmr-88milsms-p123
sex: male
age: value: 24
residence: France 34
occupation: student_pupil
langKnowledge: First language / mother tongue
education: n: 6, bac_5
fs: phone_type("keyboard"), years_usage("5+"), week_posts("20+"), t9_usage("no")
Person ID: cmr-88milsms-p271
sex: male
age: value: 12
residence: France 34
occupation: non_student
langKnowledge: First language / mother tongue
education: n: 2, college
fs: phone_type("smartphone_keyboard"), years_usage("1+"), week_posts("100+"), t9_usage("no")
Person ID: cmr-88milsms-p332
sex: female
age: value: 20
residence: France 34
occupation: student_pupil
langKnowledge: First language / mother tongue
education: n: 5, bac_3
fs: phone_type("smartphone_touchpad"), years_usage("5+"), week_posts("50+"), t9_usage("yes")
Person ID: cmr-88milsms-p452
sex: female
age: value: 18
residence: France 34
occupation: non_student
langKnowledge: First language / mother tongue
education: n: 3, bac_2
fs: phone_type("smartphone_touchpad"), years_usage("5+"), week_posts("20+"), t9_usage("yes")
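The participant extracts above can be turned into structured values with a few lines of code. The following is a minimal sketch that works on the flattened text shown on this page, not on the TEI encoding itself: `parse_features` is a hypothetical helper, and reading the n of education directly as an ISCED 2011 level number is an assumption based on the note above. The level labels are the standard ISCED 2011 level names.

```python
import re

# ISCED 2011 levels (UNESCO). Assumption: the n of education is the level number.
ISCED_2011 = {
    0: "Early childhood education",
    1: "Primary education",
    2: "Lower secondary education",
    3: "Upper secondary education",
    4: "Post-secondary non-tertiary education",
    5: "Short-cycle tertiary education",
    6: "Bachelor's or equivalent level",
    7: "Master's or equivalent level",
    8: "Doctoral or equivalent level",
}

# Matches name("value") with optional spaces inside the parentheses.
FEATURE = re.compile(r'(\w+)\(\s*"([^"]*)"\s*\)')


def parse_features(text: str) -> dict[str, str]:
    """Extract name/value pairs from a flattened feature-structure line."""
    return dict(FEATURE.findall(text))


# Phone-usage features as displayed for participant cmr-88milsms-p123.
fs_text = 'phone_type( "keyboard" ), years_usage( "5+" ), week_posts( "20+" ), t9_usage( "no" ),'
features = parse_features(fs_text)

print(features["phone_type"])  # keyboard
print(ISCED_2011[6])           # Bachelor's or equivalent level
```

For serious work it would be more robust to read these values from particDesc in the TEI file, where they are encoded explicitly, rather than from the page display.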
Publisher(s)
Identifier(s)
uri: cmr-88milsms-tei-v1
Licence
http://creativecommons.org/licenses/by/4.0/
This corpus can be freely distributed and shared subject only to attribution. The way to reference / cite the corpus is given in the titleStmt.
Credits