About English ngrams #78
Hi, what's the difference between the different ngram sets for English in the repo? I see there's oxey_english and oxey_english2, probably from o-x-e-y/oxeylyzer/tree/main/static/language_data.
oxey_english vs oxey_english2: unigrams (practically identical)
oxey_english vs oxey_english2: bigrams
oxey_english vs oxey_english2: trigrams
oxey_english vs eng_shai: unigrams
oxey_english vs eng_shai: bigrams
oxey_english vs eng_shai: trigrams
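For anyone who wants to reproduce this kind of comparison numerically, here is a minimal sketch. It assumes each ngram set has already been loaded into a plain dict mapping ngram to count or frequency; the loading step is left out because the file layout is not described in this thread. The function names and the toy tables are hypothetical and not taken from the repo, and total variation distance is just one reasonable choice of measure.

```python
from typing import Dict, List, Tuple


def normalize(counts: Dict[str, float]) -> Dict[str, float]:
    """Scale raw counts (or frequencies) so that the values sum to 1."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("ngram table is empty or sums to zero")
    return {ngram: value / total for ngram, value in counts.items()}


def total_variation(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Total variation distance: 0 means identical distributions, 1 means disjoint."""
    pa, pb = normalize(a), normalize(b)
    keys = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(k, 0.0) - pb.get(k, 0.0)) for k in keys)


def largest_differences(
    a: Dict[str, float], b: Dict[str, float], n: int = 5
) -> List[Tuple[str, float]]:
    """Return the n ngrams whose relative frequency differs most (a minus b)."""
    pa, pb = normalize(a), normalize(b)
    diffs = {k: pa.get(k, 0.0) - pb.get(k, 0.0) for k in set(pa) | set(pb)}
    # Sort by absolute difference (descending), then by ngram for stable output.
    return sorted(diffs.items(), key=lambda kv: (-abs(kv[1]), kv[0]))[:n]


if __name__ == "__main__":
    # Made-up toy bigram tables, only to show the call pattern.
    corpus_a = {"th": 30, "he": 25, "in": 20, "er": 15}
    corpus_b = {"th": 28, "he": 20, "in": 22, "an": 10}
    print(total_variation(corpus_a, corpus_b))
    print(largest_differences(corpus_a, corpus_b, n=3))
```

A distance close to 0 would correspond to the "practically identical" case, and the list of largest differences shows which ngrams drive any remaining gap.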
After comparing the |
Replies: 1 comment 5 replies
Thanks for the detailed comparison :)
You are correct about the "oxey" corpora. They come from the english.json and english2.json files at your linked page.
If I recall correctly, the "eng_shai" corpus is the iweb corpus that was used to develop the "colemak" layout. It was added in 2022, so my memory may be incorrect, though.
I don't know which sources oxey's two English corpora are based on. It may very well be the "shai" corpus. I used the oxey corpora mainly to compare the oxey metrics of this analyzer with the ones from oxey's playground and make sure they are aligned.