
Need a larger bigram corpus #167

Open
tnantais opened this issue Aug 17, 2013 · 0 comments
@tnantais (Contributor)

The current bigram model was built from a corpus of just over 1 million tokens (words and punctuation). We probably need a corpus around five times that size for the production version. The current sparseness is illustrated by the prediction context "The j", where "Jewish" and "Jews" both appear in the prediction list, something that would not be expected after analysis of a larger, more representative English corpus.
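To illustrate the sparseness issue, here is a minimal sketch of how a count-based bigram predictor behaves on a tiny corpus (this is not the project's actual model; the function names and toy corpus are illustrative assumptions). With too little data, whatever continuations happen to occur in the corpus dominate the ranked list:

```python
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Count bigram frequencies: model[prev][next] = count."""
    model = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        model[prev][cur] += 1
    return model

def predict(model, prev_word, prefix, k=5):
    """Rank candidate next words that start with `prefix`,
    ordered by bigram frequency after `prev_word`."""
    candidates = model[prev_word.lower()]
    ranked = [w for w, _ in candidates.most_common()
              if w.startswith(prefix.lower())]
    return ranked[:k]

# Toy corpus: with this little data, the only j-continuations of
# "the" ever seen are "jury" and "judge", so they fill the list --
# the same kind of corpus-specific artifact described above.
tokens = "the jury met and the judge spoke to the jury".split()
model = build_bigram_model(tokens)
print(predict(model, "the", "j"))  # -> ['jury', 'judge']
```

A larger corpus would both smooth these counts toward representative English frequencies and supply candidates for contexts the small corpus never covers.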
