
Need a larger bigram corpus #167

Open
tnantais opened this issue Aug 17, 2013 · 0 comments
@tnantais (Contributor)

The current bigram model was built from a corpus of just over 1 million tokens (words and punctuation). We probably need a corpus around five times that size for the production version. The current sparseness is illustrated by the prediction context "The j", where "Jewish" and "Jews" both appear in the prediction list, something that would not be expected after analysis of a larger, more representative English corpus.
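To illustrate the sparseness issue, here is a minimal sketch of how a count-based bigram predictor behaves on a tiny corpus (this is not the project's actual model; the function names and toy corpus are illustrative assumptions). With too little data, whatever continuations happen to occur in the corpus dominate the ranked list:

```python
from collections import Counter, defaultdict

def build_bigram_model(tokens):
    """Count bigram frequencies: model[prev][next] = count."""
    model = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        model[prev][cur] += 1
    return model

def predict(model, prev_word, prefix, k=5):
    """Rank candidate next words that start with `prefix`,
    ordered by bigram frequency after `prev_word`."""
    candidates = model[prev_word.lower()]
    ranked = [w for w, _ in candidates.most_common()
              if w.startswith(prefix.lower())]
    return ranked[:k]

# Toy corpus: with this little data, the only j-continuations of
# "the" ever seen are "jury" and "judge", so they fill the list --
# the same kind of corpus-specific artifact described above.
tokens = "the jury met and the judge spoke to the jury".split()
model = build_bigram_model(tokens)
print(predict(model, "the", "j"))  # -> ['jury', 'judge']
```

A larger corpus would both smooth these counts toward representative English frequencies and supply candidates for contexts the small corpus never covers.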
