
88milSMS. A corpus of authentic text messages in French.

Open Resources and TOols for LANGuage

This page: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1
Back to corpus main page: https://hdl.handle.net/11403/comere/cmr-88milsms

Download the TEI file: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1.xml

How to cite this resource

Panckhurst R., Détrie C., Lopez C., Moïse C., Roche M., Verine B. (2014-2016). 88milSMS. A corpus of authentic text messages in French (nouvelle version du corpus ISLRN : 024-713-187-947-8). In Chanier T. (ed) Banque de corpus CoMeRe. Ortolang : Nancy. [cmr-88milsms-tei-v1 ; https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1]

Overview of the corpus

The first version of the corpus was produced in 2014 as part of the "sud4science LR" project. More than 88,000 authentic SMS, sent by hundreds of donors living mainly in the Montpellier area, were collected in 2011, then anonymised by the researchers, their student interns and a legal adviser (CIL). The initial corpus was then converted to the TEI standard in the CoMeRe project (Communication Médiée par les Réseaux). This project aims to build a kernel corpus assembling existing corpora of different CMC (Computer-Mediated Communication) genres and new corpora built on data extracted from the Internet. These heterogeneous corpora will be structured and processed in a uniform way and complemented with metadata. CoMeRe is released as OpenData through the national infrastructure Ortolang, following constraints which will be reused for the forthcoming “Corpus de Référence du Français”. The project is supported by the national consortium Corpus-écrits, a sub-part of Huma-Num, and Ortolang (French counterpart to DARIAH).

Keywords: Computer Mediated Communication; CMC; Short Message Service

References

Composition

The whole corpus cmr-88milsms-tei-v1 includes the following elements

Download the corpus (without videos) corresponding to this topic: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1.zip

Coverage: nbparticipants=422 ; nbmessages=88522; nbemoticons-emojis=29563
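
The coverage figures above can be read programmatically and used for simple derived statistics. The following is a minimal sketch, assuming the line keeps its `key=value` layout with semicolons as separators; the function name `parse_coverage` is hypothetical and not part of the corpus distribution.

```python
def parse_coverage(line: str) -> dict[str, int]:
    """Parse 'Coverage: k1=v1 ; k2=v2' into a dict of integer counts."""
    # Drop the leading 'Coverage:' label if it is present.
    _, _, body = line.rpartition("Coverage:")
    counts = {}
    for field in body.split(";"):
        field = field.strip()
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        counts[key.strip()] = int(value)
    return counts


coverage = parse_coverage(
    "Coverage: nbparticipants=422 ; nbmessages=88522; nbemoticons-emojis=29563"
)
# Mean number of messages donated per participant.
mean_per_participant = coverage["nbmessages"] / coverage["nbparticipants"]
print(coverage)
print(round(mean_per_participant, 1))
```

With the figures given, this works out to roughly 210 messages per participant on average; the distribution across individual donors is not stated on this page.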


Rationale for this corpus

Description du projet de 2014 : Une équipe pluridisciplinaire de linguistes et d’informaticiens a recueilli plus de 88 000 SMS authentiques en français à Montpellier, en 2011. Cette collecte a été effectuée dans le cadre du projet sud4science LR (Sud4science Languedoc Roussillon. Mutation des pratiques scripturales en communication électronique médiée (financement principal : MSH-M)), lui-même faisant partie du projet international sms4science, coordonné par le CENTAL à l’Université catholique de Louvain (UCL) en Belgique. Lors du recueil des SMS, un questionnaire sociolinguistique a également été proposé aux participants. Les SMS du projet sud4science LR ont été ensuite anonymisés de manière semi-automatique (en collaboration avec des étudiants stagiaires et un juriste-CIL, Nicolas Hvoinsky, DAJI, Université Paul-Valéry), puis partiellement transcodés (en français standardisé) et annotés (cf. Panckhurst et al. 2013, Panckhurst 2013, Panckhurst et Moïse 2014). La MSH-M (Maison des Sciences de l’Homme de Montpellier), la DGLFLF (Délégation générale à la langue française et aux langues de France) et le CNRS (PEPS ECOMESS, HuMaIn) ont soutenu ce travail. Les chercheurs remercient chaleureusement le correspondant informatique et libertés, CIL, Nicolas Hvoinsky de les avoir accompagnés et conseillés sur le plan juridique, ainsi que sa directrice, Stéphanie Delaunay (DAJI, Université Paul-Valéry Montpellier), tout au long de leur projet ; Cédrick Fairon, Louise-Amélie Cougnon, Hubert Naets, Cental, Université Catholique de Louvain, qui ont accompagné l'équipe dans le cadre du projet international SMS4science.
Ils remercient vivement leurs étudiants stagiaires : Anthony Stifani (étudiant en Master Information et Communication à l’Université Paul-Valéry Montpellier 3), qui a manuellement analysé une partie des SMS, permettant ainsi d’évaluer le système d’anonymisation ; Pierre Accorsi et Namrata Patel (étudiants en Master d’Informatique à l’Université de Montpellier), qui ont développé le système informatisé Seek&Hide, permettant d’anonymiser le corpus ; Frédéric André, Yosra Ghliss, Camille Lagarde-Belleville et Michel Otell (étudiants en Master de Sciences du Langage à l’Université Paul-Valéry Montpellier 3) qui ont procédé à l’anonymisation manuelle en ligne à l’aide de Seek&Hide et à la vérification de l’anonymisation automatique du corpus ; Reda Bestandji, Ahmed Loudah, Aghiles Lounes, Zakaria Mokrani, Takfarinas Sider, Tarik Zaknoun (Master I Informatique, Spécialité : « Informatique pour les sciences », Université Montpellier) qui ont travaillé sur un système de transcodage automatique.

En 2016, le corpus initial a été mis au format TEI dans le cadre du projet CoMeRe (Communication médiée par les réseaux). La structure TEI utilisée est une extension de TEI pour les genres de CMC. Cette extension est développée par un projet européen dont les participants sont : Michael Beißwenger (DE), Thierry Chanier (FR), Isabella Chiari (IT), Maria Ermakova (DE), Maarten van Gompel (NL), Iris Hendrickx (NL), Axel Herold (DE), Henk van den Heuvel (NL), Lothar Lemnitzer (DE), Angelika Storrer (DE).

Description of the 2014 project: A multidisciplinary team of linguists and computer scientists collected more than 88,000 authentic French text messages in Montpellier in 2011, as part of the sud4science LR project (Sud4science Languedoc Roussillon. Mutation des pratiques scripturales en communication électronique médiée (main financial support: MSH-M)). This project is part of a vast international project entitled sms4science, coordinated by the CENTAL at the Université catholique de Louvain (UCL) in Belgium. Participants from the general public, who donated their SMS to science, were also able to fill in a sociolinguistic questionnaire. The text messages from the sud4science LR project were then semi-automatically anonymised (in collaboration with student interns and a legal adviser-CIL, Nicolas Hvoinsky, DAJI, Université Paul-Valéry), before being partially transcoded (into standardised French) and annotated (cf. Panckhurst et al. 2013, Panckhurst 2013, Panckhurst and Moïse 2014). The MSH-M (Maison des Sciences de l’Homme de Montpellier), the DGLFLF (Délégation générale à la langue française et aux langues de France) and the CNRS (PEPS ECOMESS, HuMaIn) provided the main financial support for the project.
The researchers thank: Nicolas Hvoinsky (CIL), and his director, Stéphanie Delaunay (DAJI, Université Paul-Valéry Montpellier 3), who accompanied and legally advised the team throughout the project; Cédrick Fairon, Louise-Amélie Cougnon, Hubert Naets, Cental, Université Catholique de Louvain, who accompanied the team within the framework of the international SMS4science project; Student work and internships: Anthony Stifani (Master’s student in Information and Communication, Université Paul-Valéry Montpellier 3), who manually analysed many of the text messages, thus allowing evaluation of the anonymisation system; Pierre Accorsi and Namrata Patel (Master’s students in Computer Science at the Université de Montpellier), who developed the ‘Seek&Hide’ software, used to anonymise the corpus; Michel Otell, Camille Lagarde-Belleville, Frédéric André and Yosra Ghliss (Master’s students in Language Sciences, Université Paul-Valéry Montpellier 3) who performed the online manual anonymisation with ‘Seek&Hide’ and verified the automatic anonymisation of the corpus; Aghiles Lounes, Tarik Zaknoun, Zakaria Mokrani, Reda Bestandji, Takfarinas Sider, Ahmed Loudah (Master’s students in Computer Science, Université de Montpellier) who worked on an automatic transcoding system.

The initial corpus was converted to the TEI standard in 2016 in the CoMeRe project (Communication Médiée par les Réseaux). The TEI structure used is an extension of TEI for CMC genres. This extension is being developed by a European project whose participants are: Michael Beißwenger (DE), Thierry Chanier (FR), Isabella Chiari (IT), Maria Ermakova (DE), Maarten van Gompel (NL), Iris Hendrickx (NL), Axel Herold (DE), Henk van den Heuvel (NL), Lothar Lemnitzer (DE), Angelika Storrer (DE).

Editorial procedures

The current corpus cmr-88milsms-tei-v1 has been built from the 2014 version. The following changes have been made:
1) Changes to "88milSMS_88522_emoji-utf8.xml" to make it fully XML compatible
2) POS tagging of emoticons (may not be exhaustive). Note: emojis had already been tagged in the 2014 source file
3) Computations on participants and integration of their descriptions in particDesc
4) Transformation of the new XML file obtained in 1) into TEI-CMC
5) Description of the teiHeader

The contents of messages were anonymised by the corpus compiler in 2014. The encoding of anonymisation has been standardised across all CoMeRe corpora. See fsDecl for more details, and the document cmr-88milsms-guide.pdf.

Each post corresponds to one SMS.
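
Since one post stands for one SMS, counting post elements is a quick sanity check on a downloaded copy of the TEI file. The sketch below is illustrative only: it assumes the element's local name is `post` (as in the TEI-CMC extension) and matches on the local name so that it does not depend on which namespace the file declares. The helper names `local_name` and `count_posts` and the tiny inline sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def count_posts(source) -> int:
    """Count <post> elements while streaming through an XML file.

    `source` may be a file path or a binary file-like object.
    """
    total = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "post":
            total += 1
        elem.clear()  # release text and children once the element is closed
    return total


# Hypothetical two-message sample, not taken from the corpus.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><post>slt</post><post>a +</post></div>
</body></text></TEI>"""

print(count_posts(io.BytesIO(SAMPLE)))
```

Run against the downloaded file, for example `count_posts("cmr-88milsms-tei-v1.xml")`, the result would be expected to equal the nbmessages figure in the coverage line if every SMS is indeed encoded as exactly one post.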


Description of the Interaction Space

CMC Environment

Structure of interactions

Data Collection

Data collected: from 2011-09-15 to 2011-12-15
Location: SMS collected at the Université Paul-Valéry Montpellier 3. Montpellier, France 7009369

Language of the data: French

Types of interaction

Extracts of Participants

Information on participants has been extracted from the file cmr-88milsms-participants_questionnaire_reponses.ods. Data were extracted and recomputed in 2016 during the TEI-CMC process; see cmr-88milsms-participants_questionnaire_reponses-v2.ods. occupation only indicates whether a participant was a student or pupil, or had another occupation, when the SMS were collected. The tag of langknown is coded following ISO 639-3. The n of education corresponds to the codes for educational attainment in ISCED (International Standard Classification of Education, 2011, Unesco). The contents of education refer to the French educational system. To obtain n, we recomputed data from the previous ODS file, which did not explicitly mention whether people had attained an educational level or were still studying at that level. For information on phone usage, see fsdDecl.
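
The participant attributes described above can be pulled out of particDesc with a few lines of code. This is a sketch under stated assumptions, not the corpus's documented layout: it assumes standard TEI spellings (`person`, `occupation`, `education` with an `n` attribute, `langKnown` with a `tag` attribute) in the TEI namespace, and the sample record is invented for illustration. The function name `read_participants` is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Invented sample record; element layout is an assumption, not corpus data.
SAMPLE = """<particDesc xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person>
      <occupation>student</occupation>
      <education n="6">Licence</education>
      <langKnowledge>
        <langKnown tag="fra">French</langKnown>
        <langKnown tag="eng">English</langKnown>
      </langKnowledge>
    </person>
  </listPerson>
</particDesc>"""


def read_participants(root: ET.Element) -> list[dict]:
    """Collect occupation, ISCED code and ISO 639-3 language tags per person."""
    people = []
    for person in root.iter(TEI + "person"):
        education = person.find(TEI + "education")
        occupation = person.find(TEI + "occupation")
        people.append({
            # n holds the ISCED 2011 attainment code, kept as a string
            "isced": education.get("n") if education is not None else None,
            "education": education.text if education is not None else None,
            "occupation": occupation.text if occupation is not None else None,
            # tag holds an ISO 639-3 code such as 'fra'
            "languages": [lk.get("tag") for lk in person.iter(TEI + "langKnown")],
        })
    return people


participants = read_participants(ET.fromstring(SAMPLE))
print(participants)
```

Keeping the ISCED code as a string avoids assuming that every n value in the corpus is a plain integer.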

Extracts of Interactions


Credits, Publication Statement and Rights

Publisher(s)

Date: 2016-09-01

Identifier(s)

uri: cmr-88milsms-tei-v1
url: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1

Licence

http://creativecommons.org/licenses/by/4.0/

This corpus can be freely distributed and shared, subject only to attribution. How to reference and cite the corpus is given in the titleStmt.

Credits