TEI-CMC version of Wikipedia discussions associated with the article "Organisme génétiquement modifié (OGM)"

Open Resources and TOols for LANGuage

How to cite this resource

Poudat, C., Grabar, N., Jin, K. & Paloque-Berges, C. (2015). TEI-CMC version of Wikipedia discussions associated to the article "Organisme génétiquement modifié (OGM)". In Corpus Wikiconflits "Conflits dans le Wikipédia francophone" (cmr-wikiconflits), CoMeRe corpora repository. Ortolang.fr : Nancy. [cmr-wikiconflits-ogm_discu-tei-v1 ; https://hdl.handle.net/11403/comere/cmr-wikiconflits/cmr-wikiconflits-ogm_discu-tei-v1]

Overview of the corpus

This file contains the discussions associated with the Wikipedia article "Organisme génétiquement modifié (OGM)" (Genetically modified organism, GMO; cmr-wikiconflits-ogm_p1-tei-v1) from 2006 to 2014, transformed into TEI-CMC format. This sub-corpus also includes the article and discussions related to "Débat sur les organismes génétiquement modifiés" (Genetically modified food controversies). Discussions have been reassembled from the main discussion page and all archives of discussion pages. This set is a subpart of the corpus Wikiconflits "Conflits dans le Wikipédia francophone" (cmr-wikiconflits).

Keywords: Computer Mediated Communication; CMC; Wikipedia; discussion


Poudat, C., Jin, K., & Chanier, T. (2014). Wikiconflits, un corpus extrait de Wikipédia : principe et méthode d'élaboration. In Poudat, C., Grabar, N., Jin, K. & Paloque-Berges, C. (2015). Corpus Wikiconflits, "conflits dans le Wikipédia francophone". Banque de corpus CoMeRe. Ortolang.fr : Nancy. [cmr-wikiconflits-tei-v4.1-manuel.pdf ; https://hdl.handle.net/11403/comere/cmr-wikiconflits]


Coverage: 137 participants ; 3 077 contributions ; 305 141 tokens (only for this file)

Rationale for this corpus

The corpus Wikiconflits "Conflits dans le Wikipédia francophone" (cmr-wikiconflits) gathers conflictual discussions around a set of (pseudo-)scientific topics: "Quotient Intellectuel", "Igor et Grichka Bogdanoff", "Organismes génétiquement modifiés", "Chiropratique", "Histoire de la Logique", "Éolienne", "Psychanalyse" (see cmr-wikiconflits-tei-v4.1-manuel.pdf for selection criteria). For each topic, versions of the article have been transformed into TEI, and talk / discussion pages have been reorganized, alongside pages related to conflicts and neutral points of view, all formatted into TEI-CMC. History pages have also been extracted as-is in Wikipedia HTML format, as well as the pages and talk pages of the most important contributors (left in wikicode format).

This corpus has been created by the CoMeRe project, which aims to gather corpora that represent the forms of communication in French on different networks (Internet, phone, etc.), all structured and documented in the same way and distributed in open access formats for research purposes. The CoMeRe project has received the support of ORTOLANG (the French equivalent of DARIAH) and of the national consortium Written-Corpus ('Corpus-écrits'), a subsection of Huma-Num.

Editorial procedures

The body is divided into divisions (div), one per subject. Every division is segmented into contributions (post), one per author (see tagsDecl for details).
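The div / post layout described above can be explored with a few lines of Python. The following is only a minimal sketch: the sample XML is invented for illustration, and the head element and the who / when attributes on post are assumptions about the encoding (tagsDecl in the actual file is the authority).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# TEI namespace; the sample below is a hypothetical fragment, not corpus data.
TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <head>Sujet A</head>
        <post who="#u1" when="2006-01-24">Premier message</post>
        <post who="#u2" when="2006-01-25">Réponse</post>
      </div>
      <div>
        <head>Sujet B</head>
        <post who="#u1" when="2007-03-02">Autre message</post>
      </div>
    </body>
  </text>
</TEI>"""


def posts_per_division(xml_text):
    """Return (subject title, number of posts) for each div of the body."""
    root = ET.fromstring(xml_text)
    result = []
    for div in root.iterfind(".//tei:body/tei:div", NS):
        head = div.find("tei:head", NS)  # assumed: subject title in <head>
        title = head.text if head is not None else None
        result.append((title, len(div.findall("tei:post", NS))))
    return result


def posts_per_author(xml_text):
    """Count posts per value of the (assumed) who attribute."""
    root = ET.fromstring(xml_text)
    return Counter(post.get("who") for post in root.iter(f"{{{TEI}}}post"))


if __name__ == "__main__":
    print(posts_per_division(SAMPLE))
    print(posts_per_author(SAMPLE))
```

On the real file, the same functions could be applied to the text read from disk, after checking the element and attribute names against tagsDecl.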

Contributors to discussions may not follow the ways Wikipedia recommends for reacting or posting an answer or a question: missing indentation, with insertions appearing inside the previous contribution as if everything had been written by one person; missing signature; etc. Therefore, after the automatic decomposition into separate contributions (post), some manual checks and corrections were made: adding missing information in the attributes of the post (date, contributor id), segmenting a contribution into several parts because they came from different authors, or relating different posts to one another because they were originally part of the same contribution (i.e. before another contributor wrote inside it without respecting the Wikipedia format). In the latter case, a join may have been added in order to establish these links. Note that the correctors, when re-establishing the discussion thread, avoided changing the original contents of the text (words / tokens; they did not introduce signatures, for example). Information about these problems and the manual correction is explained in . Correctors (i.e. the authors of this corpus) may have left XML comments between two posts in order to explain what they did.
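As an illustration of how such join links might be followed when reading the file, here is a hedged Python sketch. It assumes that a join carries a target attribute listing the xml:id values of the posts it relates (the general TEI pointing convention); the sample fragment, the ids, and the position of join next to the posts are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical fragment: p1 and p3 were one contribution, split by p2.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <post xml:id="p1" who="#u1">Début du message</post>
  <post xml:id="p2" who="#u2">Insertion d'un autre auteur</post>
  <post xml:id="p3" who="#u1">suite du message</post>
  <join target="#p1 #p3"/>
</body>"""


def joined_texts(xml_text):
    """Reassemble the text of the posts related by each join element."""
    root = ET.fromstring(xml_text)
    by_id = {post.get(XML_ID): post for post in root.iter(f"{{{TEI}}}post")}
    texts = []
    for join in root.iter(f"{{{TEI}}}join"):
        # Assumed: target holds whitespace-separated "#id" pointers.
        ids = [ref.lstrip("#") for ref in join.get("target", "").split()]
        parts = [(by_id[i].text or "").strip() for i in ids if i in by_id]
        texts.append(" ".join(parts))
    return texts


if __name__ == "__main__":
    print(joined_texts(SAMPLE))
```

The sketch only reads direct text content of each post; posts with nested markup would need itertext() instead.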

Every subject of discussion has been assembled here. For this purpose, we searched the main discussion page and its related archives. All information duplicated between the main discussion page and its archives has been removed. All information missing from the main page but present in the archives has been included here. Each contribution was then segmented into one message (post).
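The merging principle described above (keep one copy of what appears both in the main page and in the archives, add what appears only in the archives) can be sketched as follows. This is not the procedure actually used by the corpus authors: the Contribution type, the comparison key, and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Contribution(NamedTuple):
    """Hypothetical minimal record for one contribution."""
    subject: str
    author: str
    text: str


def _key(c):
    # Assumed notion of "redundant": same subject, author and text,
    # ignoring case in names and differences in whitespace.
    return (c.subject.strip().lower(),
            c.author.strip().lower(),
            " ".join(c.text.split()))


def merge_pages(main_page, archives):
    """Merge the main page with its archives, dropping duplicates.

    Order of first appearance is preserved; items found only in the
    archives are appended after those of the main page.
    """
    seen = set()
    merged = []
    for page in [main_page, *archives]:
        for contribution in page:
            k = _key(contribution)
            if k not in seen:
                seen.add(k)
                merged.append(contribution)
    return merged


if __name__ == "__main__":
    main_page = [Contribution("Sujet A", "u1", "Premier message")]
    archive = [
        Contribution("Sujet A", "u1", "Premier  message"),  # duplicate
        Contribution("Sujet B", "u2", "Message archivé"),   # archive only
    ]
    for c in merge_pages(main_page, [archive]):
        print(c)
```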

Description of the Interaction Space

CMC Environment

Structure of interactions

Data Collection

Data collected: from 2006-01-24 to 2014-02-01
Location: French Wikipedia website, discussion page associated with an article; France

Language of the data: French

Types of interaction


For the list of participants / contributors see listPrefixDef and notesStmt

Extracts of Interactions

Credits, Publication Statement and Rights


Date: 2015-03-20


uri: cmr-wiki-c010
url: https://hdl.handle.net/11403/comere/cmr-wikiconflits/cmr-wikiconflits-ogm_discu-tei-v1



Following the Wikipedia.fr recommendation, this corpus (and all its related contents) can be freely distributed and shared, subject only to attribution and share-alike. How to reference / cite these contents is given in the titleStmt.