-
Notifications
You must be signed in to change notification settings - Fork 42
Backend: SVC
The SVC that implements Linear Support Vector Classification. It is based on LinearSVC in scikit-learn, which in turn is based on liblinear. This kind of algorithm is well suited for multiclass (but not multilabel) classification, for example classifying documents with the Dewey Decimal Classification or the 20 Newsgroups classification, which is used for the examples below. It requires relatively little training data. It is suitable for classifications of up to around 10,000 classes.
Note that SVC cannot handle more than one class/subject per document during training. If a document in the training corpus has more than one subject, a random one of the subjects is picked and a warning is shown.
[svc-en]
name=SVC English
language=en
backend=svc
analyzer=snowball(english)
limit=100
vocab=20news-en
The parameters are:
Parameter | Description |
---|---|
limit | Maximum number of results to return |
min_df | How many documents a word must appear in to be considered. Default: 1 |
ngram | maximum length of word n-grams (default: 1) |
The min_df
parameter controls the features (words/tokens) used to build the model. With the default setting of 1, all the words in the training set will be used, even ones that appear in only one training document. With a higher value such as 5, only those that appear in at least that many documents are included. Increasing the min_df
value will decrease the size and training time of the model.
The ngram
parameter controls the use of word n-grams. With the default setting (1), only single words are considered. Setting this to 2 will use word bigrams, which may increase accuracy, but the model will be much larger.
Load a vocabulary:
annif loadvoc svc-en /path/to/Annif-corpora/fulltext/20news/20news-vocab.tsv
Train the model:
annif train svc-en /path/to/Annif-corpora/fulltext/20news/20news-train.tsv
Test the model with a single document:
cat document.txt | annif suggest svc-en
Evaluate a files in short text document corpus format:
annif eval svc-en /path/to/Annif-corpora/fulltext/20news/20news-test.tsv
- Home
- Getting started
- System requirements
- Optional features and dependencies
- Usage with Docker
- Architecture
- Commands
- Web user interface
- REST API
- Corpus formats
- Project configuration
- Analyzers
- Transforms
- Language detection
- Hugging Face Hub integration
- Achieving good results
- Reusing preprocessed training data
- Running as a WSGI service
- Backward compatibility between Annif releases
- Backends
- Development flow, branches and tags
- Release process
- Creating a new backend