How to cite this resource
Panckhurst R., Détrie C., Lopez C., Moïse C., Roche M., Verine B. (2016).
88milSMS. A corpus of authentic text messages in French (nouvelle version du corpus ISLRN :
024-713-187-947-8). In Chanier T. (ed) Banque de corpus CoMeRe. Ortolang : Nancy.
[cmr-88milsms-tei-v1 ;
https://hdl.handle.net/11403/comere/cmr-88milsms/cmr-88milsms-tei-v1]
Description
The first version of the corpus (ISLRN: 024-713-187-947-8) was produced in 2014 as part of
the "sud4science LR" project. More than 88,000 authentic SMS messages, sent by hundreds of
donors living mainly in the Montpellier area, were collected in 2011 and then anonymised by
the researchers, their student interns and a legal adviser-CIL.
The initial corpus was then converted to the TEI standard in the CoMeRe project
(Communication Médiée par les Réseaux). This project aims to build a kernel corpus
assembling existing corpora of different CMC (Computer-Mediated Communication) genres and
new corpora built on data extracted from the Internet. These heterogeneous corpora will be
structured and processed in a uniform way and complemented with metadata. CoMeRe will be
released as open data through the national infrastructure Ortolang, following constraints
which will be reused for the forthcoming “Corpus de Référence du Français”. The project is
supported by the national consortium Corpus-écrits, a sub-part of Huma-Num, and by Ortolang
(French counterpart to DARIAH).
Keywords: Short Message Service; Computer Mediated Communication; CMC
- Created on: 2016-09-01
- Language: fra
- Coverage: nbparticipants=422 ; nbmessages=88522; nbemoticons-emojis=29563
- Time of data collection: name=88milsms ; start=2011-09-15 ; end=2011-12-15
- ConformTo: TEI (Text Encoding Initiative). The TEI structure used is an extension of TEI
for CMC genres. This extension is developed by a European project whose participants are:
Michael Beißwenger (DE), Thierry Chanier (FR), Isabella Chiari (IT),
Maria Ermakova (DE), Maarten van Gompel (NL), Iris Hendrickx (NL), Axel Herold (DE),
Henk van den Heuvel (NL), Lothar Lemnitzer (DE), Angelika Storrer (DE).
http://wiki.tei-c.org/index.php/SIG:Computer-Mediated_Communication
- Scientific references:
- "88milSMS. A corpus of authentic text messages in French" Panckhurst R., Détrie
C., Lopez C., Moïse C., Roche M., Verine B. (2014), produced by Université
Paul-Valéry Montpellier III and the CNRS, in collaboration with Université
catholique de Louvain, funded with the support of the MSH-M and the Ministère de la
Culture (Délégation générale à la langue française et aux langues de France), with
the participation of Praxiling, Lirmm, Lidilem, Tetis, Viseo.
ISLRN: 024-713-187-947-8
http://88milsms.huma-num.fr
- Détrie, C. (2016), « Être contre et/ou tout contre en textotant : l’expression du
consensus et du dissensus dans les SMS, entre rupture et continuum », 5e Congrès
Mondial de Linguistique Française (F. Neveu, G. Bergounioux, M.-H. Côté, J.-M.
Fournier, L. Hriba et S. Prévost éd.)
http://dx.doi.org/10.1051/shsconf/20162702004
- Lopez, C., Roche, M., Panckhurst, R. (2015). Classification des items inconnus de
88milSMS: aide à l'identification automatique de la créativité scripturale.
Travaux neuchâtelois de linguistique (revue TRANEL), 63, 71-86.
https://www2.unine.ch/files/content/sites/islc/files/Tranel/63/71-86_lopez_al_corr.pdf
- Lopez C., Bestandji R., Roche M., Panckhurst R. (2014) "Towards Electronic SMS
Dictionary Construction: An Alignment-based Approach", Proceedings LREC,
Reykjavik, Iceland, 26-31 May, p. 2833-2838.
http://www.lrec-conf.org/proceedings/lrec2014/pdf/753_Paper.pdf
- Accorsi P., Patel N., Lopez C., Panckhurst R., Roche M. (2014), "Seek&Hide :
Anonymising a French SMS corpus using natural language processing techniques", in
SMS Communication. A Linguistic Approach, ed. L.-A. Cougnon, C. Fairon, John
Benjamins: Amsterdam/Philadelphia, p. 11-28.
https://benjamins.com/#catalog/books/bct.61.03acc/details
- Panckhurst R., Moïse C., (2014), "French text messages. From SMS data collection
to preliminary analysis", in SMS Communication. A Linguistic Approach, ed. L.-A.
Cougnon, C. Fairon, John Benjamins: Amsterdam/Philadelphia, p. 141-168.
https://benjamins.com/#catalog/books/bct.61.09pan/details
- Panckhurst R. (2013), “A large SMS corpus in French: from design and collation to
anonymisation, transcoding and analysis”, Colloquium, CILC2013, Alicante, March
14-16: http://web.ua.es/en/cilc2013/, Proceedings, Procedia - Social and
Behavioral Sciences, Elsevier.
http://www.sciencedirect.com/science/article/pii/S1877042813041475
- Panckhurst R., Détrie C., Lopez C., Moïse C., Roche M., Verine B. (2013).
"Sud4science, de l'acquisition d'un grand corpus de SMS en français à l'analyse de
l'écriture SMS". Épistémè - revue internationale de sciences sociales appliquées,
9: Des usages numériques aux pratiques scripturales électroniques, p. 107-138.
https://hal.archives-ouvertes.fr/hal-00923618
- Patel N., Accorsi P., Inkpen D., Lopez C., Roche M. (2013) "Approaches of
anonymisation of an SMS corpus", Proceedings of CICLING (Conference on Intelligent
Text Processing and Computational Linguistics), LNCS, Springer Verlag, March
24–30, 2013, University of the Aegean, Samos, Greece, p. 77-88.
http://www.cicling.org/2013/
- More references here: http://88milsms.huma-num.fr/references.html
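The Coverage and Time of data collection fields in the metadata list above use a `key=value ; key=value` layout with irregular spacing around the semicolons. A minimal sketch of how such a field could be read programmatically is given here; the function name `parse_fields` and the derived figures are illustrative assumptions and are not part of the corpus distribution.

```python
from datetime import date


def parse_fields(value: str) -> dict[str, str]:
    """Split a 'key=value ; key=value' metadata string into a dict.

    Hypothetical helper: tolerates missing or extra spaces around ';' and '='.
    """
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # skip empty chunks left by a trailing ';'
        key, _, val = part.partition("=")
        fields[key.strip()] = val.strip()
    return fields


# Strings copied from the metadata list above.
coverage = parse_fields("nbparticipants=422 ; nbmessages=88522; nbemoticons-emojis=29563")
period = parse_fields("name=88milsms ; start=2011-09-15 ; end=2011-12-15")

# Derived values (not stated in the record itself).
counts = {name: int(number) for name, number in coverage.items()}
messages_per_participant = counts["nbmessages"] / counts["nbparticipants"]
collection_days = (date.fromisoformat(period["end"]) - date.fromisoformat(period["start"])).days

print(f"{messages_per_participant:.1f} messages per participant over {collection_days} days")
```

The average is only a rough indicator: the record gives totals, not the distribution of messages across participants.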
Contents
This corpus contains:
- cmr-88milsms-tei-v1.xml;
- cmr-88milsms-guide.pdf;
- cmr-88milsms-participants_questionnaire_explications.pdf;
- cmr-88milsms-participants_questionnaire_reponses.ods;
- cmr-88milsms-participants_questionnaire_reponses-v2.ods;
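As a first look at the TEI file listed above, one option is to stream through it and count element names before consulting the guide. The sketch below uses only the Python standard library; it makes no assumption about which elements the corpus actually uses, and the `post` element in the inline sample is a made-up illustration, not a statement about the real file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(source) -> Counter:
    """Count element local names in an XML document.

    `source` is a file path or a binary file object. The document is
    streamed with iterparse so a large file is not held in memory at once.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Tags look like '{namespace}name'; keep only the local name.
        local_name = elem.tag.rsplit("}", 1)[-1]
        counts[local_name] += 1
        elem.clear()  # drop text and children that were already counted
    return counts


# Tiny made-up sample (hypothetical structure) to show the call.
SAMPLE = b"""<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><post>first</post><post>second</post></div>
</body></text></TEI>"""

sample_counts = count_elements(io.BytesIO(SAMPLE))
print(sample_counts.most_common())
```

With a local copy of the corpus, the same function could be called as `count_elements("cmr-88milsms-tei-v1.xml")`; the resulting counts can then be compared with the figures in the Coverage field once the relevant element names have been checked in the guide.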
Credits
- Creators: PANCKHURST Rachel; CHANIER Thierry
- compiler: PANCKHURST Rachel
- depositor: PANCKHURST Rachel
- editor: CHANIER Thierry
- developer: LOTIN Paul
- data_inputter: DÉTRIE Catherine
- developer: LOPEZ Cédric
- data_inputter: MOÏSE Claudine
- developer: ROCHE Mathieu
- developer: VERINE Bertrand
- sponsor: Maison des Sciences de l'Homme de Montpellier ; http://msh-m.fr
- sponsor: Délégation générale à la langue française et aux langues de France ;
http://www.culture.gouv.fr/culture/dglf/
- sponsor: Consortium CORLI (http://www.huma-num.fr/consortiums#CORLI); ILF (Institut de
Linguistique Française, www.ilf.cnrs.fr/) ; TGIR (Très Grande Infrastructure de
Recherche, http://www.huma-num.fr/) ; France
Licence
This corpus can be freely distributed and shared, subject only to attribution.
The citation to use for the corpus is given in the bibliographicCitation field.
http://creativecommons.org/licenses/by/4.0/