This is an implementation of a conditional random field (CRF) sequence tagger in C++. Development is ongoing, and only a limited number of possible features for training are currently implemented. However, the CRF will train a model and can use it to tag new sentences.
Currently, the underlying CRF training code as well as part-of-speech, chunking, and named-entity tagging layers are implemented.
If you're interested in how CRFs work, there is a brief introduction in `src/include/crf/tagger.h`. There are also good tutorials on the web, such as Sutton and McCallum (2012).
The code has an external dependency on the excellent libLBFGS library by Naoaki Okazaki. Otherwise, it only uses the standard C++ library, and does not require C++11. It should compile with any reasonably modern version of g++ or clang++.
- Clone the repository or download the code
- `mkdir ext`
- Build libLBFGS and install it in the `ext` directory, so it looks like this: `ext/lbfgs/{include,lib,share}`
- Compile with `make`. Binaries will be placed in the `bin` directory
- You may need to set your `DYLD_LIBRARY_PATH` or `LD_LIBRARY_PATH` for the code to run: `export DYLD_LIBRARY_PATH=/path/to/ext/lbfgs/lib`
- Formats are controlled by the "--ifmt" and "--ofmt" command line options.
- The chunker and NER taggers currently read only the CoNLL 2000 and CoNLL 2003 data formats, respectively. The POS tagger can read flexible formats.
- All the taggers can produce output in flexible formats
- There is a mini printf-style language for specifying input and output formats. The format specifies how each word in the sentence should be formatted along with its accompanying tags. Each word in the sentence is printed in the same way.
- Formats look like the following (`+` means "at least one"): `"<sent_pre>(<format><sep>)+<word_sep><sent_pos>"`
  - `<sent_pre>` is a string printed before each sentence
  - `<format>` is one of the format strings
  - `<sep>` is a one character (only) separator between format items (escapes like `\n` are allowed)
  - `<word_sep>` is a one character (only) separator between each word block (escapes like `\n` are allowed)
  - `<sent_pos>` is a string printed at the end of each sentence
- Available format strings are:
  - `%w` for the word
  - `%p` for the part of speech tag
  - `%c` for the chunk tag
  - `%e` for the entity tag
- Note that you should only print out format strings that are actually present in the input or produced by the tagger.
- For example, to produce output from the chunker in the CoNLL 2000 evaluation format: `--ofmt "%w %p %c %e\n\n\n"`
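
To make the format grammar concrete, here is a minimal sketch of how a format string could be split into its parts. It is not the parser used by this code: `FormatSpec`, `read_char`, and `parse_format` are hypothetical names, the set of supported escapes (`\n`, `\t`, `\\`) is an assumption, and the sketch assumes `<sent_pre>` contains no `%`. It only decomposes the string; how the taggers emit each separator is not covered.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical container for the parts of a format string (not a type from this repository).
struct FormatSpec {
  std::string sent_pre;                        // printed before each sentence
  std::vector<std::pair<char, char> > items;   // (format letter, separator), e.g. ('w', ' ')
  char word_sep;                               // separator between word blocks
  std::string sent_pos;                        // printed at the end of each sentence
};

// Read one character starting at s[i], decoding the escapes \n, \t and \\.
// Advances i past whatever was consumed.
static char read_char(const std::string &s, std::size_t &i) {
  if (i >= s.size())
    throw std::invalid_argument("format string ended early");
  char c = s[i++];
  if (c != '\\')
    return c;
  if (i >= s.size())
    throw std::invalid_argument("dangling backslash in format string");
  switch (s[i++]) {
    case 'n': return '\n';
    case 't': return '\t';
    case '\\': return '\\';
    default: throw std::invalid_argument("unsupported escape in format string");
  }
}

// Split fmt according to "<sent_pre>(<format><sep>)+<word_sep><sent_pos>".
FormatSpec parse_format(const std::string &fmt) {
  FormatSpec spec;
  std::size_t i = 0;

  // <sent_pre>: everything before the first '%'.
  while (i < fmt.size() && fmt[i] != '%')
    spec.sent_pre += read_char(fmt, i);

  // (<format><sep>)+ : a '%' letter followed by exactly one separator character.
  while (i < fmt.size() && fmt[i] == '%') {
    if (i + 1 >= fmt.size())
      throw std::invalid_argument("'%' without a format letter");
    char letter = fmt[i + 1];
    if (std::string("wpce").find(letter) == std::string::npos)
      throw std::invalid_argument("unknown format letter");
    i += 2;
    char sep = read_char(fmt, i);
    spec.items.push_back(std::make_pair(letter, sep));
  }
  if (spec.items.empty())
    throw std::invalid_argument("format needs at least one %w, %p, %c or %e");

  // <word_sep>: exactly one character; <sent_pos>: whatever remains.
  spec.word_sep = read_char(fmt, i);
  while (i < fmt.size())
    spec.sent_pos += read_char(fmt, i);
  return spec;
}
```

Under this reading, the example format above splits into an empty `<sent_pre>`, four items (`%w`, `%p`, `%c` each followed by a space, and `%e` followed by a newline), a newline `<word_sep>`, and a one-newline `<sent_pos>`.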
- `bin/train_pos` will train a model for POS tagging.
- `bin/pos` will take a model produced by `train_pos` and use it to tag sentences
- The software reads pipe-formatted input by default as described by command line options. Check the `--help` flag for more info. Custom formats can also be used.
- `bin/train_chunk` will train a model for chunking.
- `bin/chunk` will take a model produced by `train_chunk` and use it to tag sentences
- The software will currently only read data in the CoNLL 2000 chunking shared task format. This data is available at http://www.cnts.ua.ac.be/conll2000/chunking/.
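
For reference, the CoNLL 2000 chunking data has one token per line as three whitespace-separated columns (word, POS tag, chunk tag), with a blank line between sentences. The sketch below shows one way to read that layout. It is an illustration only, not the reader in this repository; `Token`, `Sentence`, and `read_conll2000` are hypothetical names.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for illustration (not taken from this repository).
struct Token {
  std::string word;
  std::string pos;
  std::string chunk;
};
typedef std::vector<Token> Sentence;

// Read sentences laid out as "word POS chunk" lines separated by blank lines.
std::vector<Sentence> read_conll2000(std::istream &in) {
  std::vector<Sentence> sentences;
  Sentence current;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    Token tok;
    if (fields >> tok.word >> tok.pos >> tok.chunk) {
      current.push_back(tok);
    } else if (!current.empty()) {
      // A blank line ends the sentence. This sketch also treats a line
      // with fewer than three fields as a sentence boundary.
      sentences.push_back(current);
      current.clear();
    }
  }
  // Keep the last sentence if the input does not end with a blank line.
  if (!current.empty())
    sentences.push_back(current);
  return sentences;
}
```

Passing an open `std::ifstream` for a training file to `read_conll2000` would return one `Sentence` per blank-line-delimited block.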
- `bin/train_ner` will train a model for NER tagging.
- `bin/ner` will take a model produced by `train_ner` and use it to tag sentences
- Run `bin/ner --help` for a description of program options. The software will currently only read the CoNLL 2003 NER shared task formatted input (see http://www.cnts.ua.ac.be/conll2003/ner/ for more information).
This code is licensed for academic (non-commercial) use. Contact me for licensing terms if you wish to use any or all of this code for any non-academic purpose.