Arabic Treebank: Part 1 (ATB1) v 4.1, Linguistic Data Consortium
(LDC) catalog number LDC2010T13 and ISBN 1-58563-553-7, was developed
at LDC. It consists of 734 newswire stories from Agence France Presse
(AFP) with part-of-speech (POS), morphology, gloss and syntactic
treebank annotation in accordance with the Penn Arabic Treebank
(PATB) Guidelines developed in 2008 and 2009. This release
represents a significant revision of LDC's previous ATB1
publications: Arabic Treebank: Part 1 v 2.0 (LDC2003T06) and
Arabic Treebank: Part 1 v 3.0 (POS with full vocalization +
syntactic analysis) (LDC2005T02).
The ongoing PATB project supports research in Arabic-language natural
language processing and human language technology development. The
methodology and work leading to the release of this publication are
described in detail in the documentation accompanying this corpus and
in two research papers: "Enhancing the Arabic Treebank: A
Collaborative Effort toward New Annotation Guidelines" and
"Consistent and Flexible Integration of Morphological Annotation in
the Arabic Treebank."
ATB1 v 4.1 contains a total of 145,386 tokens before clitics are
split, and 167,280 tokens after clitics are separated for the
treebank annotation.
Please see docs/file.tbl for the directory structure of this publication, as well as a complete list of files.
Please see the data directory for the data files.
Other documentation files are:
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus, LDC2010T13.
Portions © 2000 Agence France Presse, © 2003, 2005, 2010 Trustees of the University of Pennsylvania