This package is a thin wrapper around Apache OpenNLP, which it uses for part-of-speech tagging, and Mate tools, which it uses for parsing.
Tagging and parsing are performed in a pipeline in which the output of the tagger is fed to the parser. The main script is called run_danish_pipeline.py. It takes two parameters: an input text file with one sentence per line, and an output file name for the results. Note that the input text must already be tokenized. The output file is in CoNLL-09 format.
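The input and output formats described above can be sketched in Python as follows. This is a minimal illustration, not part of the package: the helper names `write_input` and `read_conll09` are hypothetical, "tokenized" is assumed to mean tokens separated by single spaces, and the column names follow the standard CoNLL-2009 layout. Which columns the pipeline actually fills in is not stated here, so inspect a real output file before relying on any particular one.

```python
import io

# Names of the 14 fixed columns in the CoNLL-2009 format.
CONLL09_COLUMNS = [
    "id", "form", "lemma", "plemma", "pos", "ppos", "feat", "pfeat",
    "head", "phead", "deprel", "pdeprel", "fillpred", "pred",
]


def write_input(sentences, handle):
    """Write tokenized sentences, one per line, tokens joined by spaces.

    Assumption: the pipeline expects space-separated tokens.
    """
    for tokens in sentences:
        handle.write(" ".join(tokens) + "\n")


def read_conll09(handle):
    """Yield each sentence as a list of dicts keyed by column name.

    Sentences are separated by blank lines; columns are tab-separated.
    """
    sentence = []
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        sentence.append(dict(zip(CONLL09_COLUMNS, fields)))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence
```

With a real file, `read_conll09(open("output.conll", encoding="utf-8"))` would yield one list of token dictionaries per sentence; the file name here is only a placeholder.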
The tagger and the parser are trained using the train_danish_pipeline.py script. For convenience, pre-trained models are included in the model directory, and they are automatically used by the run script.
The pre-trained models are trained on the training part of the Danish Dependency Treebank. We use the version distributed as part of the CoNLL 2006 shared task on parsing. Part-of-speech tags are converted to the Google universal tagset before training.
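The tag conversion step can be pictured with a short Python sketch. It assumes the conversion files hold one mapping per line, with a treebank tag and a universal tag separated by whitespace. The function names and the three example entries are illustrative only and are not copied from the actual conversion files or from the training script.

```python
def load_tag_map(lines):
    """Build a dict from treebank tags to universal tags.

    Assumption: each non-empty line holds 'TREEBANK_TAG UNIVERSAL_TAG'.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fine, universal = line.split()[:2]
        mapping[fine] = universal
    return mapping


def convert_tags(tags, mapping, unknown="X"):
    """Map each tag to its universal tag, falling back to 'X' (other)."""
    return [mapping.get(tag, unknown) for tag in tags]


# Illustrative entries only; real mappings come from the conversion files.
example_map = load_tag_map(["NC\tNOUN", "VA\tVERB", "AN\tADJ"])
print(convert_tags(["NC", "VA", "ZZ"], example_map))  # ['NOUN', 'VERB', 'X']
```

Falling back to `X` for unseen tags is a design choice made for this sketch; a stricter version could raise an error instead, so that gaps in the mapping are noticed before training.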
- Apache OpenNLP 1.5.2
- Mate tools 3.3
- Universal part-of-speech tags conversion files
The POS tagger achieves an accuracy of 96.8% on the test portion of the Danish Dependency Treebank using the universal tagset.