Polititweets : corpus de tweets provenant de comptes politiques influents 1

Longhi, J., Marinica, C., Borzic, B., Alkhouli, A. (2014). Polititweets : corpus de tweets provenant de comptes politiques influents 1. In Chanier T. (ed.) Banque de corpus CoMeRe. Ortolang.fr : Nancy. [https://hdl.handle.net/11403/comere/cmr-polititweets/cmr-polititweets-c001-tei-v1]

Overview of the corpus

The Polititweets corpus gathers tweets centred on 7 personalities from 6 different French political groups: Mélenchon, Bayrou, Copé, Fillon, Le Pen, Ayrault, Cohn-Bendit. Starting from the Twitter accounts of these personalities, a selection method retained the messages of 205 users (twittos in French) sent in 2013-14. The total corpus amounts to 34273 messages (tweets) containing 502 085 tokens / written forms, punctuation excluded. A large part of the content of these messages relates to the campaign for the municipal elections of March 2014.

The first version of this corpus was built by the project "Digital humanities and data journalism: the case of political vocabulary" at the Université de Cergy-Pontoise. This initial corpus was then converted to the TEI format within the framework of the CoMeRe project (Communication médiée par les réseaux, network-mediated communication). The complete corpus consists of 7 TEI files. The body of each file gathers the messages coming from about 30 user accounts / twittos. The first series of messages comes from the account of one of the 7 selected personalities.

The CoMeRe project aims to gather different corpora representing the forms of communication in French on networks (Internet, phone, etc.), all structured and documented in the same way and released in open access for research purposes. The CoMeRe project has received the support of ORTOLANG (the French equivalent of DARIAH) and of the national consortium Corpus-écrits (Written Corpora), a subsection of Huma-Num.

Keywords : applied_linguistics ; discourse_analysis ; text_and_corpus_linguistics ; primary_text ; dialogue ; Communication Médiée par les Réseaux ; CoMeRe ; Tweet ; Computer Mediated Communication ; CMC


Longhi J. (2013). "Essai de caractérisation du tweet politique", L’Information grammaticale, n°136, p. 25-32.

Rationale for this corpus

The initial aim of the researchers collecting these data was to have a corpus that would support research centred on political vocabulary, based on the analysis of observables coming from new means of communication. The document details how the accounts of 200 persons were selected. Here is an extract:

1) We started with 7 personalities from 6 different French political groups: JLMelenchon, Bayrou, Copé, Fillon, Le Pen, Ayrault, Cohn-Bendit.
2) We gathered all the lists mentioning them => 7087 lists.
3) Among these lists, we selected those that had at least 6 user accounts / twittos and that contained the character string *politic* in the name or description of the list => 120 lists (11K lines).
4) From these 120 lists, we selected 2934 messages / tweets.
5) To be sure to select only political twittos (and not journalists, etc.), we worked by thresholds. By selecting only the accounts listed on more than 12 lists, we obtained 205 political twittos.

For the 205 accounts, we retrieved the last 200 tweets of each person as of 27 March 2014, that is 34273 tweets. This yields a corpus centred on the period between the two rounds of the 2014 local elections or, for the less active accounts, one that also covers the run-up to these elections or even the previous ones. Because the density of tweet publication varies, the time span covered differs for each account (the oldest tweet is dated 2009-03-04 11:59:49).
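The selection steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's actual tooling: the data layout (dictionaries with `name`, `description`, `members` and `date` keys) and the function names `select_political_accounts` and `last_tweets` are assumptions made for the example. Only the thresholds (at least 6 accounts per list, the string "politic", more than 12 lists, last 200 tweets) come from the description.

```python
from collections import Counter


def select_political_accounts(lists, keyword="politic", min_members=6, more_than=12):
    """Return accounts that appear on more than `more_than` retained lists.

    `lists` is a hypothetical structure: an iterable of dicts with the keys
    'name', 'description' and 'members' (a list of account names).
    """
    # Step 3: keep lists with enough members whose name or description
    # contains the keyword (case-insensitive match, an assumption).
    kept = [
        lst for lst in lists
        if len(lst["members"]) >= min_members
        and keyword in f'{lst["name"]} {lst["description"]}'.lower()
    ]
    # Step 5: count on how many retained lists each account appears.
    counts = Counter(acc for lst in kept for acc in set(lst["members"]))
    return sorted(acc for acc, n in counts.items() if n > more_than)


def last_tweets(tweets, n=200):
    """Return the `n` most recent tweets, oldest first.

    Assumes each tweet is a dict with an ISO-formatted 'date' string,
    so that lexicographic order matches chronological order.
    """
    return sorted(tweets, key=lambda t: t["date"])[-n:]


# Toy data with lowered thresholds, purely for illustration.
toy_lists = [
    {"name": "Politics FR", "description": "", "members": ["a", "b", "c"]},
    {"name": "Elected", "description": "French politicians", "members": ["a", "b", "d"]},
    {"name": "Sport", "description": "football", "members": ["a", "e", "f"]},
    {"name": "politic watchers", "description": "", "members": ["a"]},
]
selected = select_political_accounts(toy_lists, min_members=3, more_than=1)
```

With the toy data, only the first two lists pass the filter, and only the accounts present on both of them are retained.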

This corpus is a subpart of the CoMeRe corpus databank. The CoMeRe (Communication Médiée par les Réseaux) project aims to build a kernel corpus assembling existing corpora of different CMC (Computer-Mediated Communication) genres and new corpora built on data extracted from the Internet. These heterogeneous corpora will be structured and processed in a uniform way and complemented with metadata. CoMeRe will be released as OpenData through the national infrastructure Ortolang, following constraints which will be reused for the forthcoming “Corpus de Référence du Français”. The project is supported by the national consortium Corpus-écrits, a sub-part of Huma-Num, and by Ortolang (the French counterpart of DARIAH).

The TEI structure used is an extension of TEI for CMC genres. This extension is being developed by a European project whose participants are: Michael Beißwenger (DE), Thierry Chanier (FR), Isabella Chiari (IT), Maria Ermakova (DE), Maarten van Gompel (NL), Iris Hendrickx (NL), Axel Herold (DE), Henk van den Heuvel (NL), Lothar Lemnitzer (DE), Angelika Storrer (DE).

Description of the Interaction Space

CMC Environment

  • tweet : Definition of the modality Tweet. Type of messages used in Tweets.
  • Structure of interactions
    text: each text corresponds to the set of tweets coming from the same Twitter account
    post: one post corresponds to one tweet.

    p: This element appears inside the
    distinct: This element appears inside
    addressingTerm: Addressing terms address an utterance to a particular interlocutor / twitto or refer to a twitto. It includes :
    trailer: This element appears inside
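To illustrate the text / post / p hierarchy described above, here is a minimal Python sketch that groups tweets by account. The XML fragment is a simplified, hypothetical stand-in and not an excerpt of the corpus: the `corpus` wrapper element, the absence of attributes and the absence of a namespace are assumptions. Real TEI files normally declare the TEI namespace (`http://www.tei-c.org/ns/1.0`), so element names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: one <text> per account, one <post> per tweet.
SAMPLE = """
<corpus>
  <text>
    <body>
      <post><p>First tweet</p></post>
      <post><p>Second tweet</p></post>
    </body>
  </text>
  <text>
    <body>
      <post><p>Third tweet</p></post>
    </body>
  </text>
</corpus>
"""


def posts_per_text(xml_string):
    """Return one list of tweet strings per <text> element."""
    root = ET.fromstring(xml_string.strip())
    grouped = []
    for text in root.iter("text"):
        # Each <post> is one tweet; its content is held in <p> elements.
        tweets = [
            "".join(p.itertext()).strip()
            for post in text.iter("post")
            for p in post.iter("p")
        ]
        grouped.append(tweets)
    return grouped


grouped_posts = posts_per_text(SAMPLE)
```

Each inner list then corresponds to the tweets of one account, which makes per-account counts straightforward to compute.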

    Data Collection

    Data collected : From 2009-03-04 to 2014-03-27
    location: Twitter website. For each of the 30 Twitter accounts selected from French politicians, the last 200 tweets were extracted. Since most of the messages were sent between the end of 2013 and the end of March 2014, they mainly refer to discussions between the two rounds of voting in the municipal elections of March 2014 in France.

    Language of the data: French

    Types of interaction

    channel: mode: w, message sent through a Twitter account
    constitution: Selected through automatic processing. See projectDesc for more information
    derivation: type: original
    domain: type: public, domain of a message: politics
    factuality: type: fact
    interaction: type: complete, active: many
    preparedness: type: spontaneous
    purpose: political local elections

    Participants (extract)

    The list of participants, i.e. twittos, is given in sourceDesc.

    Extracts of Interactions

    head:Tweets de Jean-Luc Mélenchon

    head:Tweets de Chantal Jouanno

    Composition of the corpus

    Collection cmr-polititweets-tei-v1 : list of files / identification numbers

    30 user accounts / twittos ; 4955 posts ; 80768 tokens


    principal : Longhi Julien, Chanier Thierry.
    compiler : Longhi Julien, Marinica Claudia.
    editor : Chanier Thierry.
    data_inputter : Hriba Linda, Jin Kun, Borzic Boris, Alkhouli Abdulhafiz.
    developer : Lotin Paul.
    participant : Ledegen Gudrun.
    publisher : ORTOLANG (Outils et Ressources pour un Traitement Optimisé de la LANGue), Nancy, France.

    Date: 2014-05-02


    Rights holders of this corpus are: Julien Longhi ; Thierry Chanier

    This corpus can be freely distributed and shared, subject only to attribution. The way to reference / cite the corpus is given in the titleStmt.