FASTSUBS Copyright (c) 2012-2014 Deniz Yuret

Usage: fastsubs [-n <n> | -p <p>] model.lm[.gz] < input.txt

Fastsubs is a program that finds the most likely substitutes for words
in input.txt using the language model in model.lm (ARPA format). The
number of substitutes is controlled by two parameters: -n specifies
the number of substitutes for each word, and -p specifies a threshold
on the sum of the substitute probabilities. The program stops
generating substitutes for a word when either limit is reached. Each
output line contains an input word and its substitutes with the
logarithm (base 10) of their unnormalized probabilities in the
following format:

word <tab> sub1 <space> lprob1 <tab> sub2 <space> lprob2 <tab> ...
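
The output format above can be read back with a few lines of code.
The following is a minimal sketch, not part of the fastsubs
distribution: the function name parse_fastsubs_line and the sample
line are made up for illustration, and it assumes that fields are
separated by single tabs and that each substitute is separated from
its log probability by one space, as the format line states.

```python
def parse_fastsubs_line(line):
    """Split one output line into (word, [(substitute, log10_prob), ...])."""
    fields = line.rstrip("\n").split("\t")
    word = fields[0]
    subs = []
    for field in fields[1:]:
        if not field:
            # Tolerate an empty field, e.g. from a trailing tab.
            continue
        # Split on the last space so the number is always the final token.
        sub, lprob = field.rsplit(" ", 1)
        subs.append((sub, float(lprob)))
    return word, subs


# Made-up sample line for illustration, not real fastsubs output.
sample = "cat\tdog -1.25\tkitten -1.5\thorse -2.0\n"
word, subs = parse_fastsubs_line(sample)
```

Here word is the input word and subs keeps the substitutes in the
order in which they appear on the line.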

Please see the file LICENSE for terms of use. A paper with the
description of the algorithm and other related resources can be found
at http://goo.gl/jzKH0. The code is mostly C99 compliant and has been
compiled and tested on a Linux system with gcc. The few GNU
extensions used are optional and can be turned off by editing the
beginning of dlib.h.

Update of Jan 9, 2014: The latest version gets rid of all glib
dependencies (glib is broken: it blows up your code without warning if
your arrays or hashes get too big). It also fixes a bug: the previous
versions of the code and the paper assumed that the log back-off
weights were upper bounded by 0, which they are not. In my standard
test of generating the top 100 substitutes for all the 1,222,974
positions in the PTB, this caused a total of 66 low probability
substitutes to be missed and 31 to be listed out of order. Finally, a
multithreaded version, fastsubs-omp, has been added. The number of
threads can be controlled using the environment variable
OMP_NUM_THREADS.
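
As an illustration of controlling the thread count from a script, the
sketch below builds a fastsubs-omp command line and an environment
with OMP_NUM_THREADS set. It is a hedged example only: the helper
names build_fastsubs_omp_call and run_fastsubs_omp are hypothetical,
the file names are placeholders, and it assumes the fastsubs-omp
executable is on the PATH.

```python
import os
import subprocess


def build_fastsubs_omp_call(model_path, n_subs, num_threads):
    """Return (argv, env) for one fastsubs-omp run with a fixed thread count."""
    argv = ["fastsubs-omp", "-n", str(n_subs), model_path]
    env = dict(os.environ)                     # copy, do not modify the parent
    env["OMP_NUM_THREADS"] = str(num_threads)  # read by the OpenMP runtime
    return argv, env


def run_fastsubs_omp(model_path, input_path, n_subs=100, num_threads=4):
    """Run fastsubs-omp with input_path on stdin and return its stdout."""
    argv, env = build_fastsubs_omp_call(model_path, n_subs, num_threads)
    with open(input_path) as stdin:
        result = subprocess.run(argv, stdin=stdin, env=env,
                                capture_output=True, text=True, check=True)
    return result.stdout


# Placeholder model name; nothing is executed here.
argv, env = build_fastsubs_omp_call("model.lm.gz", 100, 4)
```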

fastsubs-omp can be used with a few additional run-time options:

Usage: fastsubs-omp [-n <n> | -p <p> | -m <max-threads> | -t | -z] model.lm[.gz] < input.txt

-m specifies the maximum number of threads that can be used.
-z specifies that the output values are normalized probabilities that
   sum to 1 (instead of the default unnormalized log probabilities).
-t specifies a one-target-per-text-line input format (instead of the
   default free text format). In this format every input line is:

   target_name <tab> target_id <tab> target_index <tab> text_line

   For every input line fastsubs-omp then outputs two lines:
   1) the input line
   2) the substitutes for text_line[target_index] (i.e. the word at
      position target_index in text_line) in the format:

      sub1 <space> lprob1 <tab> sub2 <space> lprob2 <tab> ...
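
The -t input format can be produced and checked with a short script.
This is a sketch under stated assumptions, not code from the
distribution: the function names and the sample sentence are
hypothetical, text_line is assumed to be tokenized on whitespace, and
since this README does not say whether target_index counts from 0 or
from 1, the base is left as an explicit parameter.

```python
def make_target_line(target_name, target_id, target_index, text_line):
    """Build one -t style input line from its four tab-separated fields."""
    return "\t".join([target_name, str(target_id), str(target_index), text_line])


def parse_target_line(line, index_base=0):
    """Return (target_name, target_id, target_index, target_word).

    index_base is an assumption: pass 0 if target_index counts from 0,
    1 if it counts from 1.
    """
    name, target_id, index, text_line = line.rstrip("\n").split("\t", 3)
    tokens = text_line.split()  # assumed whitespace tokenization
    target_word = tokens[int(index) - index_base]
    return name, target_id, int(index), target_word


# Made-up example: "bank" is token 4 when counting from 0.
line = make_target_line("bank", "bank.1", 4, "we sat on the bank of the river")
parsed = parse_target_line(line)
```

Checking that the extracted word is the intended target before
running fastsubs-omp is a cheap way to catch off-by-one errors in
target_index.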

Other test and utility executables that can be built using the
Makefile:

* fastsubs-omp: Multithreaded version.
* wordsub: Takes the output of fastsubs and samples random substitutes
  for each word.
* normalize-subs.pl: Takes fastsubs output and converts the
  unnormalized log10(p) entries to normalized p entries that add up to
  1.0 (which may not be accurate if fastsubs did not output the whole
  vocabulary).
* fastsubs-test: Takes a model file and interactively runs fastsubs on
  sentences typed by the user.
* lmheap-test: Takes a model file and tests the heaps in the lm data
  structure by interactively reading n-grams from the user and
  printing out the contents of the heap for each position.
* sentence-test: Takes a model file and tests the sentence and lm data
  structures by reading sentences from the user and printing out logp
  for each word.
* lm-test: Takes a model file and tests the lm data structure by
  reading n-grams from the user and printing out their logp and
  back-off weights.
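
The conversion from unnormalized log10(p) entries to probabilities
that add up to 1.0, as described for normalize-subs.pl above, can be
sketched as follows. This is an independent illustration and not a
port of the Perl script; the function name normalize_subs and the
sample values are made up. As noted in the list, the result is only
exact if the substitute list covers the whole vocabulary.

```python
def normalize_subs(subs):
    """Turn [(substitute, log10_prob), ...] into [(substitute, prob), ...].

    The returned probabilities sum to 1 over the listed substitutes only.
    """
    if not subs:
        return []
    # Subtract the largest log probability first to avoid underflow.
    max_lp = max(lp for _, lp in subs)
    weights = [(sub, 10.0 ** (lp - max_lp)) for sub, lp in subs]
    total = sum(w for _, w in weights)
    return [(sub, w / total) for sub, w in weights]


# Made-up log10 values for illustration.
probs = normalize_subs([("dog", -1.0), ("kitten", -2.0)])
```

With these sample values the two weights are in a 10:1 ratio, so
"dog" receives 10/11 of the probability mass and "kitten" 1/11.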