First release of the model with named entity recognition
0 parents, commit 115b80c
Showing 5 changed files with 137 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
__pycache__/ | ||
ca_* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
include meta.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,38 @@ | ||
# [CA] Natural language processing model for Catalan for spaCy | ||
|
||
Catalan language model for [spaCy](https://spacy.io), built from: | ||
|
||
- Word vectors from [fastText](https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md) | ||
- Grammar, morphology, and syntax from the [AnCora](https://github.com/UniversalDependencies/UD_Catalan-AnCora) corpus | ||
- Named entity annotations derived from Wikipedia ([Cross-lingual Name Tagging and Linking for 282 Languages](http://nlp.cs.rpi.edu/paper/282elisa2017.pdf)) | ||
|
||
Because of the final size of the model (2.5 GB) and of the word vectors (1.1 GB), neither is included in the repository, but you can download the final model from the Releases section. | ||
|
||
## Installation and usage | ||
|
||
You can install the model and use it with spaCy by running the following commands on the command line: | ||
|
||
```sh | ||
pip install https://github.com/ccoreilly/spacy-catala/releases/download/v0.0.2/ca_fasttext_wiki-0.0.2.tar.gz | ||
python -m spacy link ca_fasttext_wiki ca | ||
``` | ||
|
||
# [EN] spaCy NLP Model for the Catalan language | ||
|
||
spaCy NLP model for the Catalan language generated from: | ||
|
||
- [fastText](https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md) word vectors | ||
- The [AnCora](https://github.com/UniversalDependencies/UD_Catalan-AnCora) corpus for parts of speech, morphological features, and syntactic dependencies | ||
- Wikipedia annotations for named entity extraction ([Cross-lingual Name Tagging and Linking for 282 Languages](http://nlp.cs.rpi.edu/paper/282elisa2017.pdf)) | ||
|
||
The final model is around 2.5 GB and the fastText vectors are over 1 GB, which is why neither is included in this repository. You can download the model from the Releases tab. | ||
|
||
## Installing and using the model | ||
|
||
You can install and use the model in spaCy by executing the following commands: | ||
|
||
```sh | ||
pip install https://github.com/ccoreilly/spacy-catala/releases/download/v0.0.2/ca_fasttext_wiki-0.0.2.tar.gz | ||
python -m spacy link ca_fasttext_wiki ca | ||
``` |
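Once the two commands above have run, the model should be loadable under the shortcut `ca`. The following is a minimal sketch rather than code from this repository: it assumes a spaCy 2.x installation (the `link` command and shortcut names belong to that series), and the helper names `entity_pairs` and `tag_text` are hypothetical.

```python
def entity_pairs(doc):
    """Return (text, label) pairs for the entities found in a parsed doc."""
    return [(ent.text, ent.label_) for ent in doc.ents]


def tag_text(text, model="ca"):
    """Load the linked model and run its pipeline over `text`.

    spaCy is imported lazily so that defining this function does not
    require spaCy or the model to be installed.
    """
    import spacy

    nlp = spacy.load(model)  # "ca" is the shortcut created by `spacy link`
    return entity_pairs(nlp(text))
```

Calling `tag_text("Barcelona és la capital de Catalunya.")` would return a list of `(text, label)` pairs. Which labels appear depends on the Wikipedia-derived annotations the model was trained on, so no output is shown here.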
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
{ | ||
"accuracy": { | ||
"uas":24.6151530788, | ||
"las":23.711622807, | ||
"ents_p":44.912142152, | ||
"ents_r":21.7459467727, | ||
"ents_f":29.3034819462, | ||
"tags_acc":97.6588546924, | ||
"token_acc":100.0 | ||
}, | ||
"author": "Ciaran O'Reilly", | ||
"description": "Catalan Model from fastText vectors and annotations from the catalan Wikipedia", | ||
"email": "[email protected]", | ||
"lang": "ca", | ||
"license": "MIT", | ||
"name": "fasttext_wiki", | ||
"parent_package": "spacy", | ||
"pipeline": ["tagger", "parser", "ner"], | ||
"sources": ["fastText"], | ||
"spacy_version": ">=2.1.8", | ||
"speed": { | ||
"nwords":326934, | ||
"cpu":8088.3390474066, | ||
"gpu":7692.6920447333 | ||
}, | ||
"url": "https://nlu.cat", | ||
"vectors":{ | ||
"width":300, | ||
"vectors":2000000, | ||
"keys":2000000, | ||
"name":"ca_model.vectors" | ||
}, | ||
"version": "0.0.2" | ||
} |
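The three entity scores in the `accuracy` block can be sanity-checked against each other. The sketch below assumes that `ents_f` is the usual F1 score, the harmonic mean of `ents_p` (precision) and `ents_r` (recall). The commit itself does not state this, and the helper `f1` is hypothetical.

```python
# Values copied from the "accuracy" block of meta.json.
ents_p = 44.912142152           # NER precision (%)
ents_r = 21.7459467727          # NER recall (%)
ents_f_reported = 29.3034819462  # NER F-score as reported (%)


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ents_f = f1(ents_p, ents_r)
# The two numbers should agree up to rounding if the assumption holds.
print(round(ents_f, 2), round(ents_f_reported, 2))
```

Under that assumption, the low F-score comes mostly from recall: the model misses many entities, while fewer than half of the entities it does predict are correct.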
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
#!/usr/bin/env python | ||
# coding: utf8 | ||
from __future__ import unicode_literals | ||
|
||
import io | ||
import json | ||
from os import path, walk | ||
from shutil import copy | ||
from setuptools import setup | ||
|
||
|
||
def load_meta(fp): | ||
    with io.open(fp, encoding='utf8') as f: | ||
        return json.load(f) | ||
|
||
|
||
def list_files(data_dir): | ||
    output = [] | ||
    for root, _, filenames in walk(data_dir): | ||
        for filename in filenames: | ||
            if not filename.startswith('.'): | ||
                output.append(path.join(root, filename)) | ||
    output = [path.relpath(p, path.dirname(data_dir)) for p in output] | ||
    output.append('meta.json') | ||
    return output | ||
|
||
|
||
def list_requirements(meta): | ||
    parent_package = meta.get('parent_package', 'spacy') | ||
    requirements = [parent_package + meta['spacy_version']] | ||
    if 'setup_requires' in meta: | ||
        requirements += meta['setup_requires'] | ||
    return requirements | ||
|
||
|
||
def setup_package(): | ||
    root = path.abspath(path.dirname(__file__)) | ||
    meta_path = path.join(root, 'meta.json') | ||
    meta = load_meta(meta_path) | ||
    model_name = str(meta['lang'] + '_' + meta['name']) | ||
    model_dir = path.join(model_name, model_name + '-' + meta['version']) | ||
|
||
    copy(meta_path, path.join(model_name)) | ||
    copy(meta_path, model_dir) | ||
|
||
    setup( | ||
        name=model_name, | ||
        description=meta['description'], | ||
        author=meta['author'], | ||
        author_email=meta['email'], | ||
        url=meta['url'], | ||
        version=meta['version'], | ||
        license=meta['license'], | ||
        packages=[model_name], | ||
        package_data={model_name: list_files(model_dir)}, | ||
        install_requires=list_requirements(meta), | ||
        zip_safe=False, | ||
    ) | ||
|
||
|
||
if __name__ == '__main__': | ||
    setup_package() |
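To see what the setup script derives from `meta.json`, the relevant expressions can be evaluated on their own. This is an illustrative sketch, not part of the commit: the `meta` dictionary below is a hand-copied subset of the fields in this commit's `meta.json`, and the logic mirrors `setup_package` and `list_requirements`.

```python
from os import path

# Subset of the fields from meta.json in this commit (copied by hand).
meta = {
    "lang": "ca",
    "name": "fasttext_wiki",
    "version": "0.0.2",
    "parent_package": "spacy",
    "spacy_version": ">=2.1.8",
}

# Same expressions as in setup_package().
model_name = meta["lang"] + "_" + meta["name"]
model_dir = path.join(model_name, model_name + "-" + meta["version"])

# Same logic as list_requirements(): package name plus version specifier.
requirements = [meta.get("parent_package", "spacy") + meta["spacy_version"]]
if "setup_requires" in meta:
    requirements += meta["setup_requires"]

print(model_name)
print(model_dir)
print(requirements)
```

The derived package name is the one passed to `python -m spacy link` in the README. It also matches the `ca_*` pattern in the ignore file, which presumably keeps the large model directory out of the repository.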