Cite this corpus using the following:
@software{thibault_clerice_2020_corpus_latin,
author = {Clérice, Thibault},
title = {Corpus Latin antiquité et antiquité tardive lemmatisé},
month = apr,
year = 2021,
publisher = {Zenodo},
version = {0.1.3},
doi = {10.5281/zenodo.4337145},
url = {https://doi.org/10.5281/zenodo.4337145}
}
This corpus contains the whole set of available Capitains-compliant classical and late Latin texts. The latest version of the corpus is based on the following corpora:
Name | Version | Project you need to cite |
---|---|---|
PerseusDL/canonical-latinLit | 0.0.843 | https://www.perseus.org |
OpenGreekAndLatin/csel-dev | 1.0.211 | https://www.perseus.org |
OpenGreekAndLatin/Latin | v1.10.0 | https://www.perseus.org |
lascivaroma/digiliblt | 0.0.64 | https://digiliblt.uniupo.it |
lascivaroma/priapeia | 1.1.18 | Lasciva Roma |
lascivaroma/additional-texts | 1.0.192 | Lasciva Roma |
The texts are distributed under the same licence as the originals; the annotations are CC-BY-SA 4.0.
Number of tokens: 21,327,783 (17,885,059 without punctuation)
They were tagged with the Pie-Extended LASLA model, using the 0.0.6 LASLA+ model (trained with the aligned PROIEL Vulgate as well as the Priapea and a Late Latin corpus to be published soon).
Note: the model is currently being fine-tuned in the context of my PhD. I will fill in this part once that is done.
- Enclitics duplicate the whole token (-que is not separated). They are identifiable because their form starts with `{` and ends with `}`. Example:
<w lemma="breuis" msd="Case=Abl|Numb=Plur|Gend=Com|Deg=Pos" n="1.18" pos="ADJqua" rend="section">breuibusque</w>
<w lemma="que" msd="MORPH=empty" n="1.18" pos="CON" rend="section">{breuibusque}</w>
- Roman numerals are lemmatized as Arabic numbers.
- The model is highly prone to wrong annotations on malformed tokens such as `7AP`.
- Tokens ending with `?` are known to need disambiguation, but disambiguation was not possible (there seems to have been an issue with some automatic ones in this version of the corpus).
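The conventions listed above can be handled with a few small helpers. This is a minimal sketch, not part of the corpus tooling: the function names are hypothetical, the sample reuses the enclitic example wrapped in an `<ab>` element so that it parses on its own, and it assumes the `<w>` elements carry no XML namespace (real files may declare the TEI namespace, in which case the tag names need a namespace prefix).

```python
import xml.etree.ElementTree as ET

# The enclitic example from this README, wrapped in <ab> only so that the
# fragment is a well-formed document for the parser.
SAMPLE = """<ab>
<w lemma="breuis" msd="Case=Abl|Numb=Plur|Gend=Com|Deg=Pos" n="1.18" pos="ADJqua" rend="section">breuibusque</w>
<w lemma="que" msd="MORPH=empty" n="1.18" pos="CON" rend="section">{breuibusque}</w>
</ab>"""


def is_enclitic_duplicate(form: str) -> bool:
    """True when the form is wrapped in curly braces, i.e. it repeats its host token."""
    return len(form) >= 2 and form.startswith("{") and form.endswith("}")


def needs_disambiguation(value: str) -> bool:
    """True when the value carries the trailing '?' marker described above."""
    return value.endswith("?")


def surface_forms(root):
    """Yield each surface form once, skipping brace-wrapped enclitic duplicates."""
    for w in root.iter("w"):
        form = (w.text or "").strip()
        if not is_enclitic_duplicate(form):
            yield form


root = ET.fromstring(SAMPLE)
forms = list(surface_forms(root))
enclitic_lemmas = [
    w.get("lemma")
    for w in root.iter("w")
    if is_enclitic_duplicate((w.text or "").strip())
]
print(forms, enclitic_lemmas)
```

Skipping the duplicates matters when counting words or rebuilding the running text, since each enclitic would otherwise count its host form twice.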
XML files are TEI compliant (normally) and the text is separated into passages, their type being provided at the `ab` level.
<text n="urn:cts:latinLit:phi0119.phi009.perseus-lat2">
<body>
<ab n="urn:cts:latinLit:phi0119.phi009.perseus-lat2:1" type="line">
Tokens use the standard TEI annotation attributes `@pos`, `@msd` and `@lemma`:
<w lemma="gaudeo" msd="Numb=Sing|Mood=Ind|Tense=Pres|Voice=Act|Person=1" n="7" pos="VER" rend="line">gaudeo</w>
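Reading passages and tokens could look like the sketch below. It is an illustration under stated assumptions: the sample document is assembled from the two fragments above (the closing tags and the pairing of this token with this passage are added for the example), the helper names are hypothetical, and no XML namespace is assumed.

```python
import xml.etree.ElementTree as ET

# Assembled from the fragments shown in this README; closing tags are added
# and the token is placed in this passage purely for illustration.
SAMPLE = """<text n="urn:cts:latinLit:phi0119.phi009.perseus-lat2">
<body>
<ab n="urn:cts:latinLit:phi0119.phi009.perseus-lat2:1" type="line">
<w lemma="gaudeo" msd="Numb=Sing|Mood=Ind|Tense=Pres|Voice=Act|Person=1" n="7" pos="VER" rend="line">gaudeo</w>
</ab>
</body>
</text>"""


def parse_msd(msd: str) -> dict:
    """Split a 'Key=Value|Key=Value' string into a dictionary."""
    return dict(pair.split("=", 1) for pair in msd.split("|") if "=" in pair)


def iter_tokens(root):
    """Yield one record per <w>, together with the passage it belongs to."""
    for ab in root.iter("ab"):
        for w in ab.iter("w"):
            yield {
                "passage": ab.get("n"),
                "passage_type": ab.get("type"),
                "form": (w.text or "").strip(),
                "lemma": w.get("lemma"),
                "pos": w.get("pos"),
                "msd": parse_msd(w.get("msd", "")),
            }


tokens = list(iter_tokens(ET.fromstring(SAMPLE)))
for token in tokens:
    print(token["passage_type"], token["form"], token["lemma"], token["pos"], token["msd"])
```

Turning `@msd` into a dictionary makes it straightforward to filter on a single feature, for example every token whose `Mood` is `Ind`.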
Tag | French | English | UD POS | Example |
---|---|---|---|---|
ADJadv.mul | Numéral Adverbial Multiplicatif | Multiplicative numeral adverb | ADV | quadragies |
ADJadv.ord | Numéral Adverbial Ordinal | Ordinal numeral adverb | ADV | secundo |
ADJcar | Numéral Cardinal | Cardinal | NUM | decem, ducenti, duo |
ADJdis | Numéral Distributif | Distributive numeral | ADJ | tricenus, trinus, uicenus, undenus |
ADJmul | Numéral Multiplicatif | Multiplicative numeral | ADJ | septemplex, simplex, triplex |
ADJord | Numéral Ordinal | Ordinal numeral | ADJ | octogesimus, primus, prior |
ADJqua | Adjectif qualificatif | Adjective | ADJ | |
ADV | Adverbe | Adverb | ADV | |
ADVint | Adverbe interrogatif | Interrogative Adverb | ADV | an, anne, cuicuimodi2 |
ADVint.neg | Adverbe interrogatif négatif | Negative Interrogative Adverb | ADV | necne, nonne, quidni |
ADVneg | Adverbe négatif | Negative Adverb | ADV | haud, ne3, nec1 |
ADVrel | Adverbe relatif | Relative Adverb | ADV | proquam, prout |
CONcoo | Conjonction de coordination | Coordinating conjunction | CCONJ | |
CONsub | Conjonction de subordination | Subordinating conjunction | SCONJ | |
INJ | Interjection | Interjection | INTJ | |
NOMcom | Nom commun | Noun | NOUN | |
NOMpro | Nom propre | Proper Noun | PROPN | |
OUT | Non-Géré | Out | X | |
PRE | Préposition | Preposition | ADP | |
PROdem | Pronom démonstratif | Demonstrative Pronoun | PRON | hic, idem, ille |
PROind | Pronom indéfini | Indefinite Pronoun | PRON | aliquantus, aliquis, aliquot, alis, alius, alter |
PROint | Pronom interrogatif | Interrogative Pronoun | PRON | cuias, cuius, ecquis |
PROper | Pronom personnel | Personal Pronoun | PRON | ego, nos, tu, uos |
PROpos | Pronom possessif | Possessive Pronoun | PRON | mei, meus, noster |
PROpos.ref | Pronom possessif réfléchi | Reflexive Possessive Pronoun | PRON | sui, suus |
PROref | Pronom réfléchi | Reflexive Pronoun | PRON | sepse, sui |
PROrel | Pronom relatif | Relative Pronoun | PRON | cuius, qualis, qualiscumque |
PUNC | Ponctuation | Punctuation | PUNCT | |
VER | Verbe | Verb | VERB | |
VERaux | Verbe auxiliaire | Auxiliary Verb | AUX | |
FOR | Termes étrangers | Foreign words | X | |
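For work with UD-based tools, the table above can be turned into a lookup. The mapping below copies the "UD POS" column as given; the function name and the fallback to `X` for tags that are not in the table are assumptions made for this sketch, not a rule of the corpus.

```python
# Tag -> UD POS, copied row by row from the table above.
POS_TO_UD = {
    "ADJadv.mul": "ADV", "ADJadv.ord": "ADV", "ADJcar": "NUM",
    "ADJdis": "ADJ", "ADJmul": "ADJ", "ADJord": "ADJ", "ADJqua": "ADJ",
    "ADV": "ADV", "ADVint": "ADV", "ADVint.neg": "ADV",
    "ADVneg": "ADV", "ADVrel": "ADV",
    "CONcoo": "CCONJ", "CONsub": "SCONJ",
    "INJ": "INTJ",
    "NOMcom": "NOUN", "NOMpro": "PROPN",
    "OUT": "X",
    "PRE": "ADP",
    "PROdem": "PRON", "PROind": "PRON", "PROint": "PRON", "PROper": "PRON",
    "PROpos": "PRON", "PROpos.ref": "PRON", "PROref": "PRON", "PROrel": "PRON",
    "PUNC": "PUNCT",
    "VER": "VERB", "VERaux": "AUX",
    "FOR": "X",
}


def to_ud(tag: str, default: str = "X") -> str:
    """Return the UD POS for a corpus tag.

    Falling back to `default` for tags absent from the table is an
    assumption of this sketch.
    """
    return POS_TO_UD.get(tag, default)


print(to_ud("ADJcar"), to_ud("CONsub"), to_ud("NOMpro"))
```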
POS | Tokens |
---|---|
(empty) | 277 |
ADJadv.mul | 7748 |
ADJadv.ord | 21012 |
ADJcar | 168392 |
ADJdis | 14022 |
ADJmul | 4185 |
ADJord | 61518 |
ADJqua | 1313643 |
ADV | 1003398 |
ADVint | 65600 |
ADVint.neg | 3414 |
ADVneg | 276761 |
ADVrel | 185454 |
CON | 179855 |
CONcoo | 1295347 |
CONsub | 604806 |
FOR | 44232 |
INJ | 30000 |
NOM | 29 |
NOMcom | 4301213 |
NOMpro | 714746 |
PRE | 1208858 |
PROdem | 780450 |
PROind | 379789 |
PROint | 79845 |
PROper | 242527 |
PROpos | 134720 |
PROpos.ref | 83633 |
PROref | 80269 |
PROrel | 516713 |
PUNC | 3442724 |
UNK | 843 |
VER | 4079470 |
_ | 2290 |
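The counts above can be cross-checked against the totals stated earlier in this README. The snippet below only re-enters the table as a dictionary (the empty key stands for the row without a tag) and adds it up; the variable names are illustrative.

```python
# Token counts per POS tag, copied from the table above.
POS_COUNTS = {
    "": 277,  # row whose tag is empty in the table
    "ADJadv.mul": 7748, "ADJadv.ord": 21012, "ADJcar": 168392,
    "ADJdis": 14022, "ADJmul": 4185, "ADJord": 61518, "ADJqua": 1313643,
    "ADV": 1003398, "ADVint": 65600, "ADVint.neg": 3414,
    "ADVneg": 276761, "ADVrel": 185454,
    "CON": 179855, "CONcoo": 1295347, "CONsub": 604806,
    "FOR": 44232, "INJ": 30000,
    "NOM": 29, "NOMcom": 4301213, "NOMpro": 714746,
    "PRE": 1208858,
    "PROdem": 780450, "PROind": 379789, "PROint": 79845, "PROper": 242527,
    "PROpos": 134720, "PROpos.ref": 83633, "PROref": 80269, "PROrel": 516713,
    "PUNC": 3442724, "UNK": 843, "VER": 4079470, "_": 2290,
}

total = sum(POS_COUNTS.values())
without_punctuation = total - POS_COUNTS["PUNC"]

print(f"{total:,} tokens, {without_punctuation:,} without punctuation")
# prints "21,327,783 tokens, 17,885,059 without punctuation"
```

Both figures match the token counts given near the top, so the count "without punctuation" corresponds to dropping the `PUNC` row.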