GLOOKUP Copyright (c) 2008-2014, Deniz Yuret

This is the code used in:

Deniz Yuret. 2008. Smoothing a Tera-word Language Model. In the 46th
Annual Meeting of the Association for Computational Linguistics: Human
Language Technologies. See http://goo.gl/rmD87d for details.

The glookup program reads ngram patterns with wildcards (represented
by the '_' character) from stdin and prints their counts from the
Google Web1T ngram data (whose path is given by the -p option).

Please see glookup.1 (man page), or glookup.txt (plain text format)
for documentation.
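
To make the wildcard patterns described above concrete, here is a small
Python sketch. It is not part of glookup and does not read Web1T data:
the names pattern_count and TOY_COUNTS are hypothetical, the counts are
made up, and it assumes that the count of a pattern is the sum of the
counts of all ngrams of the same length that match it. Check glookup.1
for the actual input and output format.

```python
from typing import Dict, Tuple

Ngram = Tuple[str, ...]

# Made-up counts for illustration only; these are NOT Web1T numbers.
TOY_COUNTS: Dict[Ngram, int] = {
    ("the", "black", "cat"): 120,
    ("the", "white", "cat"): 80,
    ("the", "black", "dog"): 50,
    ("a", "black", "cat"): 30,
}


def matches(pattern: Ngram, ngram: Ngram) -> bool:
    """True if ngram has the same length as pattern and agrees with it
    at every position that is not the '_' wildcard."""
    return len(pattern) == len(ngram) and all(
        p == "_" or p == tok for p, tok in zip(pattern, ngram)
    )


def pattern_count(pattern: str, counts: Dict[Ngram, int]) -> int:
    """Assumed semantics: sum the counts of every ngram matching pattern."""
    pat = tuple(pattern.split())
    return sum(c for ngram, c in counts.items() if matches(pat, ngram))


if __name__ == "__main__":
    for line in ["the _ cat", "_ black _", "the black cat"]:
        print(line, pattern_count(line, TOY_COUNTS))
```

With the toy table, "the _ cat" matches two trigrams (120 + 80) and
"_ black _" matches three (120 + 50 + 30). A linear scan like this is
only workable for a toy table, not for data the size of Web1T.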

The model.pl script optimizes and tests various language models. See
'perldoc model.pl', or model.txt for documentation. Typical usage:
    model.pl -patterns < text > patterns
    glookup -p web1t_path < patterns > counts
    model.pl -counts counts < text

The glookup.pl script quickly searches for a given pattern in
uncompressed Google Web1T data. Use the C version for bulk processing
and the Perl version to get a few counts quickly.