Overview of a CoMeRe corpus

This page: https://hdl.handle.net/11403/comere/cmr-88milsms
Back to Repository main page: https://hdl.handle.net/11403/comere

Corpus CoMeRe cmr-88milsms-tei-v1 :
88milSMS. A corpus of authentic text messages in French.

Open Resources and TOols for LANGuage

How to cite this resource

Panckhurst R., Détrie C., Lopez C., Moïse C., Roche M., Verine B. (2016). 88milSMS. A corpus of authentic text messages in French (nouvelle version du corpus ISLRN : 024-713-187-947-8). In Chanier T. (ed) Banque de corpus CoMeRe. Ortolang : Nancy. [cmr-88milsms-tei-v1 ; https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1]

Description

The first version of the corpus (ISLRN: 024-713-187-947-8) was produced in 2014 as part of the "sud4science LR" project. More than 88,000 authentic SMS messages, sent by hundreds of donors living mainly in the Montpellier area, were collected in 2011 and then anonymised by the researchers, their student interns and a legal adviser-CIL.

The initial corpus was then converted to the TEI standard in the CoMeRe project (Communication Médiée par les Réseaux). This project aims to build a kernel corpus that brings together existing corpora of different CMC (Computer-Mediated Communication) genres and new corpora built from data extracted from the Internet. These heterogeneous corpora will be structured and processed in a uniform way and complemented with metadata. CoMeRe will be released as OpenData through the national infrastructure Ortolang, following constraints which will be reused for the forthcoming “Corpus de Référence du Français”. The project is supported by the national consortium Corpus-écrits, a sub-part of Huma-Num, and by Ortolang (the French counterpart of DARIAH).

Keywords: Short Message Service; Computer Mediated Communication; CMC

Created on: 2016-09-01
Language: fra
Coverage: nbparticipants=422 ; nbmessages=88522; nbemoticons-emojis=29563
Time of data collection: name=88milsms ; start=2011-09-15 ; end=2011-12-15
ConformTo: TEI (Text Encoding Initiative). The TEI structure used is an extension of TEI for CMC genres. This extension is being developed by a European project whose participants are: Michael Beißwenger (DE), Thierry Chanier (FR), Isabella Chiari (IT), Maria Ermakova (DE), Maarten van Gompel (NL), Iris Hendrickx (NL), Axel Herold (DE), Henk van den Heuvel (NL), Lothar Lemnitzer (DE), Angelika Storrer (DE). http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
Scientific references:
- "88milSMS. A corpus of authentic text messages in French" Panckhurst R., Détrie C., Lopez C., Moïse C., Roche M., Verine B. (2014), produced by Université Paul-Valéry Montpellier III and the CNRS, in collaboration with Université catholique de Louvain, funded with the support of the MSH-M and the Ministère de la Culture (Délégation générale à la langue française et aux langues de France), with the participation of Praxiling, Lirmm, Lidilem, Tetis, Viseo.
  ISLRN 024-713-187-947-8
  http://88milsms.huma-num.fr
- Détrie, C. (2016), « Être contre et/ou tout contre en textotant : l’expression du consensus et du dissensus dans les SMS, entre rupture et continuum », 5e Congrès Mondial de Linguistique Française (F. Neveu, G. Bergounioux, M.-H. Côté, J.-M. Fournier, L. Hriba and S. Prévost, eds.)
  http://dx.doi.org/10.1051/shsconf/20162702004
- Lopez, C., Roche, M., Panckhurst, R. (2015). Classification des items inconnus de 88milSMS: aide à l'identification automatique de la créativité scripturale. Travaux neuchâtelois de linguistique (revue TRANEL), 63, 71-86.
  https://www2.unine.ch/files/content/sites/islc/files/Tranel/63/71-86_lopez_al_corr.pdf
- Lopez C., Bestandji R., Roche M., Panckhurst R. (2014) "Towards Electronic SMS Dictionary Construction: An Alignment-based Approach", Proceedings LREC, Reykjavik, Iceland, 26-31 May, 2833-2838.
  http://www.lrec-conf.org/proceedings/lrec2014/pdf/753_Paper.pdf
- Accorsi P., Patel N., Lopez C., Panckhurst R., Roche M. (2014), "Seek&Hide: Anonymising a French SMS corpus using natural language processing techniques", in SMS Communication. A Linguistic Approach, ed. L.-A. Cougnon, C. Fairon, John Benjamins: Amsterdam/Philadelphia, p. 11-28.
  https://benjamins.com/#catalog/books/bct.61.03acc/details
- Panckhurst R., Moïse C. (2014), "French text messages. From SMS data collection to preliminary analysis", in SMS Communication. A Linguistic Approach, ed. L.-A. Cougnon, C. Fairon, John Benjamins: Amsterdam/Philadelphia, p. 141-168.
  https://benjamins.com/#catalog/books/bct.61.09pan/details
- Panckhurst R. (2013), “A large SMS corpus in French: from design and collation to anonymisation, transcoding and analysis”, Colloquium, CILC2013, Alicante, March 14-16 : http://web.ua.es/en/cilc2013/, Proceedings, Procedia — Social and Behavioural Sciences, Elsevier
  http://www.sciencedirect.com/science/article/pii/S1877042813041475
- Panckhurst R., Détrie C., Lopez C., Moïse C., Roche M. and Verine B. (2013). "Sud4science, de l'acquisition d'un grand corpus de SMS en français à l'analyse de l'écriture SMS". Épistémè - revue internationale de sciences sociales appliquées, 9: Des usages numériques aux pratiques scripturales électroniques, 107-138.
  https://hal.archives-ouvertes.fr/hal-00923618
- Patel N., Accorsi P., Inkpen D., Lopez C., Roche M. (2013) "Approaches of anonymisation of an SMS corpus", Proceedings of CICLING (Conference on Intelligent Text Processing and Computational Linguistics), LNCS, Springer Verlag, March 24–30, 2013, University of the Aegean, Samos, Greece, p. 77-88.
  http://www.cicling.org/2013/
- More references here: http://88milsms.huma-num.fr/references.html
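The Coverage and Time of data collection fields above use a `key=value ; key=value` layout, so a few derived figures can be computed directly from them. The following is only a minimal sketch: the helper name `parse_fields` is hypothetical, and the two input strings are copied from this page rather than read from the OLAC file.

```python
from datetime import date


def parse_fields(value: str) -> dict[str, str]:
    """Split a 'key=value ; key=value' metadata string into a dict."""
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing separator
        key, _, val = part.partition("=")
        fields[key.strip()] = val.strip()
    return fields


# Values copied verbatim from the Coverage and Time of data collection fields.
coverage = parse_fields("nbparticipants=422 ; nbmessages=88522; nbemoticons-emojis=29563")
period = parse_fields("name=88milsms ; start=2011-09-15 ; end=2011-12-15")

participants = int(coverage["nbparticipants"])
messages = int(coverage["nbmessages"])
emoticons = int(coverage["nbemoticons-emojis"])

# Length of the collection window in days (end date minus start date).
days = (date.fromisoformat(period["end"]) - date.fromisoformat(period["start"])).days

print(f"messages per participant: {messages / participants:.1f}")
print(f"emoticons/emojis per message: {emoticons / messages:.2f}")
print(f"collection window: {days} days")
```

These are plain averages over the published totals; they say nothing about how messages are actually distributed across participants.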

Access to the full information of one of its components: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1
Download the OLAC file: https://hdl.handle.net/11403/comere/cmr-88milsms/olac-cmr-88milsms-tei-v1.xml
Download the whole corpus: https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1.zip (ZIP file, 5 MB)

This corpus contains :

cmr-88milsms-tei-v1.xml;
cmr-88milsms-guide.pdf;
cmr-88milsms-participants_questionnaire_explications.pdf;
cmr-88milsms-participants_questionnaire_reponses.ods;
cmr-88milsms-participants_questionnaire_reponses-v2.ods;
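Once the archive is unpacked, one way to get a first look at the TEI file is to count its elements by name. The sketch below uses only the Python standard library and streams the document instead of loading it whole. The function name `count_elements` is hypothetical, the sample XML is invented for illustration, and the assumption that messages are encoded as `<post>` elements should be checked against cmr-88milsms-guide.pdf.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source) -> Counter:
    """Count elements by local name (namespace stripped), streaming the input."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop the '{namespace}' prefix
        counts[local_name] += 1
        elem.clear()  # discard the element's text and children to limit memory use
    return counts


# Tiny invented sample, NOT taken from the corpus; the <post> element name
# is an assumption about how messages are encoded.
sample = io.BytesIO(b"""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <post>first message</post>
    <post>second message</post>
  </body></text>
</TEI>""")

counts = count_elements(sample)
print(counts["post"])
```

To run it on the real file, pass the path instead of the sample, for example `count_elements("cmr-88milsms-tei-v1.xml")`, and compare the result with the nbmessages value given in the Coverage field.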

Credits

Creators: PANCKHURST Rachel; CHANIER Thierry
compiler: PANCKHURST Rachel
depositor: PANCKHURST Rachel
editor: CHANIER Thierry
developer: LOTIN Paul
data_inputter: DÉTRIE Catherine
developer: LOPEZ Cédric
data_inputter: MOÏSE Claudine
developer: ROCHE Mathieu
developer: VERINE Bertrand
sponsor: Maison des Sciences de l'Homme de Montpellier ; http://msh-m.fr
sponsor: Délégation générale à la langue française et aux langues de France ; http://www.culture.gouv.fr/culture/dglf/
sponsor: Consortium CORLI (http://www.huma-num.fr/consortiums#CORLI); ILF (Institut de Linguistique Française, www.ilf.cnrs.fr/) ; TGIR (Très Grande Infrastructure de Recherche, http://www.huma-num.fr/) ; France

Licence

This corpus can be freely distributed and shared, subject only to attribution. The citation to use is given in the bibliographicCitation (see "How to cite this resource" above).
http://creativecommons.org/licenses/by/4.0/
