About English ngrams #78

fohrloop · 2024-10-07T10:07:44Z

Hi,

What's the difference between the English ngram sets in the repo? I see there's oxey_english and oxey_english2, probably from english.json and english2.json in o-x-e-y/oxeylyzer/tree/main/static/language_data, and then something called eng_shai. I compared them statistically with a tool I made, but it would be nice to know if there's any additional metadata about them. The oxey corpora did not contain whitespace, so I also ignored whitespace-containing ngrams in eng_shai:

oxey_english vs oxey_english2: unigrams

practically identical

───────────────────oxey_english─────────────────── ──────────────────oxey_english2───────────────────
 1: e ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.71               1 (+0): e ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.82
 2: t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.87                      2 (+0): t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.71
 3: a ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.92                        3 (+0): a ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.99
 4: o ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.59                         4 (+0): o ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.46
 5: i ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.07                          5 (+0): i ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.22
 6: n ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.76                           6 (+0): n ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.92
 7: s ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.38                           7 (+0): s ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.46
 8: r ▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.01                            8 (+0): r ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.25
 9: h ▇▇▇▇▇▇▇▇▇▇ 4.61                               9 (+0): h ▇▇▇▇▇▇▇▇▇▇ 4.33
10: l ▇▇▇▇▇▇▇▇▇ 4.12                               10 (+0): l ▇▇▇▇▇▇▇▇▇ 4.11
11: d ▇▇▇▇▇▇▇▇ 3.65                                11 (+0): d ▇▇▇▇▇▇▇▇ 3.64
12: c ▇▇▇▇▇▇▇ 3.05                                 12 (+0): c ▇▇▇▇▇▇▇ 3.29
13: u ▇▇▇▇▇▇ 2.87                                  13 (+0): u ▇▇▇▇▇▇ 2.80
14: m ▇▇▇▇▇ 2.42                                   14 (+0): m ▇▇▇▇▇ 2.43
15: f ▇▇▇▇▇ 2.12                                   15 (+1): p ▇▇▇▇▇ 2.18
16: p ▇▇▇▇ 2.08                                    16 (-1): f ▇▇▇▇▇ 2.13
17: g ▇▇▇▇ 2.04                                    17 (+0): g ▇▇▇▇ 1.95
18: y ▇▇▇▇ 1.92                                    18 (+0): y ▇▇▇▇ 1.74
19: w ▇▇▇▇ 1.81                                    19 (+0): w ▇▇▇▇ 1.71
20: b ▇▇▇ 1.49                                     20 (+0): b ▇▇▇ 1.47
21: . ▇▇ 1.11                                      21 (+0): . ▇▇▇ 1.22
22: v ▇▇ 1.06                                      22 (+0): v ▇▇ 1.11
23: , ▇▇ 1.03                                      23 (+0): , ▇▇ 0.87
24: k ▇▇ 0.80                                      24 (+0): k ▇▇ 0.70
25: ' ▇ 0.48                                       25 (+0): ' ▇ 0.46
26: - ▇ 0.26                                       26 (+0): - ▇ 0.24
27: x  0.21                                        27 (+0): x  0.21
28: j  0.17                                        28 (+0): j  0.17
29: q  0.10                                        29 (+0): q  0.11
30: ;  0.10                                        30 (+0): ;  0.10
31: z  0.10                                        31 (+0): z  0.09
32: /  0.08                                        32 (+0): /  0.08
33: =  0.01                                        33 (+0): =  0.00
34: \  0.00                                        34 (+2): ]  0.00
35: [  0.00                                        35 (+0): [  0.00
36: ]  0.00                                        36 (+1): `  0.00
37: `  0.00                                        37 (-3): \  0.00
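
The rank-change column in parentheses can be reproduced with a few lines of code. The following is only a minimal sketch, not the tool used in this post: the function name `rank_changes`, the alphabetical tie-breaking, and the toy frequencies are assumptions made for illustration. The sign convention (rank in the left set minus rank in the right set) is inferred from the rows for `p` and `f` above.

```python
def rank_changes(reference, other):
    """Return (rank_in_other, delta, ngram, freq) rows ordered by rank in `other`.

    delta = rank in `reference` minus rank in `other`, so a positive value
    means the ngram sits higher in `other`. delta is None when the ngram
    does not occur in `reference` (shown as ??? in the tables).
    """
    def ranks(freqs):
        # Assumption: ties are broken alphabetically to keep the order stable.
        ordered = sorted(freqs, key=lambda ng: (-freqs[ng], ng))
        return {ng: i for i, ng in enumerate(ordered, start=1)}

    ref_ranks = ranks(reference)
    other_ranks = ranks(other)
    rows = []
    for ng, rank in sorted(other_ranks.items(), key=lambda kv: kv[1]):
        ref_rank = ref_ranks.get(ng)
        delta = None if ref_rank is None else ref_rank - rank
        rows.append((rank, delta, ng, other[ng]))
    return rows


# Toy data using the m/f/p percentages from the unigram table above.
left = {"m": 2.42, "f": 2.12, "p": 2.08}
right = {"m": 2.43, "p": 2.18, "f": 2.13}

for rank, delta, ng, freq in rank_changes(left, right):
    shown = "???" if delta is None else f"{delta:+d}"
    print(f"{rank} ({shown}): {ng} {freq:.2f}")
```

With only three ngrams the ranks start at 1 instead of 14, but the deltas for `p` (+1) and `f` (-1) match the table.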

oxey_english vs oxey_english2: bigrams

───────────────────oxey_english─────────────────── ──────────────────oxey_english2───────────────────
 1: th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.15                 1 ( +0): th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.97
 2: he ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.59                     2 ( +0): he ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.45
 3: in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.39                       3 ( +0): in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.38
 4: an ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.92                          4 ( +0): an ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.88
 5: er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.87                           5 ( +0): er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.86
 6: re ▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.76                            6 ( +0): re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.81
 7: on ▇▇▇▇▇▇▇▇▇▇▇▇ 1.52                             7 ( +0): on ▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.60
 8: at ▇▇▇▇▇▇▇▇▇▇ 1.33                               8 ( +0): at ▇▇▇▇▇▇▇▇▇▇▇ 1.34
 9: nd ▇▇▇▇▇▇▇▇▇▇ 1.29                               9 ( +2): en ▇▇▇▇▇▇▇▇▇▇▇ 1.32
10: or ▇▇▇▇▇▇▇▇▇▇ 1.28                              10 ( +0): or ▇▇▇▇▇▇▇▇▇▇ 1.29
11: en ▇▇▇▇▇▇▇▇▇ 1.24                               11 ( +1): es ▇▇▇▇▇▇▇▇▇▇ 1.28
12: es ▇▇▇▇▇▇▇▇▇ 1.20                               12 ( -3): nd ▇▇▇▇▇▇▇▇▇▇ 1.25
13: ou ▇▇▇▇▇▇▇▇▇ 1.18                               13 ( +6): ti ▇▇▇▇▇▇▇▇▇▇ 1.19
14: to ▇▇▇▇▇▇▇▇▇ 1.18                               14 ( +3): te ▇▇▇▇▇▇▇▇▇ 1.13
15: ng ▇▇▇▇▇▇▇▇▇ 1.13                               15 ( +5): ar ▇▇▇▇▇▇▇▇▇ 1.10
16: it ▇▇▇▇▇▇▇▇ 1.10                                16 ( -2): to ▇▇▇▇▇▇▇▇▇ 1.10
17: te ▇▇▇▇▇▇▇▇ 1.08                                17 ( -1): it ▇▇▇▇▇▇▇▇▇ 1.08
18: st ▇▇▇▇▇▇▇▇ 1.06                                18 ( -3): ng ▇▇▇▇▇▇▇▇▇ 1.06
19: ti ▇▇▇▇▇▇▇▇ 1.05                                19 ( -1): st ▇▇▇▇▇▇▇▇▇ 1.06
20: ar ▇▇▇▇▇▇▇▇ 1.05                                20 ( +2): al ▇▇▇▇▇▇▇▇ 1.04
21: ed ▇▇▇▇▇▇▇ 0.98                                 21 ( +2): is ▇▇▇▇▇▇▇▇ 1.03
22: al ▇▇▇▇▇▇▇ 0.97                                 22 ( -1): ed ▇▇▇▇▇▇▇▇ 1.03
23: is ▇▇▇▇▇▇▇ 0.97                                 23 (-10): ou ▇▇▇▇▇▇▇▇ 1.03
24: ha ▇▇▇▇▇▇▇ 0.96                                 24 ( +1): nt ▇▇▇▇▇▇▇▇ 1.00
25: nt ▇▇▇▇▇▇▇ 0.90                                 25 ( +3): se ▇▇▇▇▇▇▇ 0.87
26: ve ▇▇▇▇▇▇▇ 0.87                                 26 ( +0): ve ▇▇▇▇▇▇▇ 0.86
27: le ▇▇▇▇▇▇ 0.85                                  27 ( -3): ha ▇▇▇▇▇▇▇ 0.84
28: se ▇▇▇▇▇▇ 0.84                                  28 ( -1): le ▇▇▇▇▇▇▇ 0.81
29: as ▇▇▇▇▇▇ 0.81                                  29 ( +2): of ▇▇▇▇▇▇▇ 0.81
30: ea ▇▇▇▇▇▇ 0.78                                  30 ( -1): as ▇▇▇▇▇▇ 0.80
31: of ▇▇▇▇▇▇ 0.78                                  31 ( +2): co ▇▇▇▇▇▇ 0.79
32: me ▇▇▇▇▇▇ 0.76                                  32 ( +0): me ▇▇▇▇▇▇ 0.76
33: co ▇▇▇▇▇ 0.72                                   33 ( -3): ea ▇▇▇▇▇▇ 0.75
34: ll ▇▇▇▇▇ 0.71                                   34 ( +1): ro ▇▇▇▇▇▇ 0.75
35: ro ▇▇▇▇▇ 0.70                                   35 ( +3): de ▇▇▇▇▇▇ 0.72
36: ne ▇▇▇▇▇ 0.70                                   36 ( +6): io ▇▇▇▇▇▇ 0.70
37: hi ▇▇▇▇▇ 0.69                                   37 ( -1): ne ▇▇▇▇▇▇ 0.68
38: de ▇▇▇▇▇ 0.68                                   38 ( -4): ll ▇▇▇▇▇ 0.66
39: ri ▇▇▇▇▇ 0.62                                   39 ( +0): ri ▇▇▇▇▇ 0.65
40: li ▇▇▇▇▇ 0.60                                   40 ( +3): ra ▇▇▇▇▇ 0.62
42: io ▇▇▇▇ 0.58                                    41 ( -1): li ▇▇▇▇▇ 0.62
43: ra ▇▇▇▇ 0.57                                    44 ( -7): hi ▇▇▇▇▇ 0.61

oxey_english vs oxey_english2: trigrams

Fewer "out", "ave", "ome", "eve" and "you" in english2, and more "men", "pro", "ers", "ons"

───────────────────oxey_english─────────────────── ──────────────────oxey_english2───────────────────
 1: the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.68                  1 ( +0): the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.57
 2: ing ▇▇▇▇▇▇▇▇▇▇ 1.27                              2 ( +0): ing ▇▇▇▇▇▇▇▇▇▇ 1.16
 3: and ▇▇▇▇▇▇▇▇▇▇ 1.22                              3 ( +0): and ▇▇▇▇▇▇▇▇▇▇ 1.15
 4: ion ▇▇▇▇▇ 0.66                                   4 ( +0): ion ▇▇▇▇▇▇▇ 0.80
 5: ent ▇▇▇▇▇ 0.59                                   5 ( +0): ent ▇▇▇▇▇▇ 0.70
 6: for ▇▇▇▇▇ 0.57                                   6 ( +2): tio ▇▇▇▇▇▇ 0.65
 7: you ▇▇▇▇▇ 0.56                                   7 ( -1): for ▇▇▇▇▇ 0.59
 8: tio ▇▇▇▇ 0.53                                    8 ( +6): ati ▇▇▇▇ 0.48
 9: hat ▇▇▇▇ 0.49                                    9 ( +3): ter ▇▇▇▇ 0.42
10: tha ▇▇▇▇ 0.47                                   10 ( +1): her ▇▇▇▇ 0.41
11: her ▇▇▇▇ 0.47                                   11 ( -2): hat ▇▇▇ 0.38
12: ter ▇▇▇ 0.42                                    12 ( +5): ate ▇▇▇ 0.37
13: all ▇▇▇ 0.40                                    13 ( -3): tha ▇▇▇ 0.37
14: ati ▇▇▇ 0.38                                    14 ( -7): you ▇▇▇ 0.36
15: thi ▇▇▇ 0.37                                    15 ( +5): are ▇▇▇ 0.36
16: ver ▇▇▇ 0.37                                    16 ( +0): ver ▇▇▇ 0.35
17: ate ▇▇▇ 0.36                                    17 ( -4): all ▇▇▇ 0.35
18: our ▇▇▇ 0.36                                    18 ( +0): our ▇▇▇ 0.35
19: ere ▇▇▇ 0.35                                    19 ( +5): ers ▇▇▇ 0.34
20: are ▇▇▇ 0.34                                    20 ( +6): pro ▇▇▇ 0.34
21: ith ▇▇▇ 0.34                                    21 ( -2): ere ▇▇▇ 0.33
22: wit ▇▇▇ 0.33                                    22 ( +6): res ▇▇▇ 0.32
23: his ▇▇▇ 0.33                                    23 ( -2): ith ▇▇▇ 0.31
24: ers ▇▇▇ 0.33                                    24 ( -9): thi ▇▇▇ 0.31
25: rea ▇▇ 0.30                                     25 (+15): men ▇▇▇ 0.31
26: pro ▇▇ 0.30                                     26 ( +4): con ▇▇▇ 0.31
27: eve ▇▇ 0.27                                     27 ( -5): wit ▇▇▇ 0.30
28: res ▇▇ 0.27                                     28 ( +1): com ▇▇▇ 0.30
29: com ▇▇ 0.27                                     29 ( -6): his ▇▇▇ 0.29
30: con ▇▇ 0.27                                     30 ( +8): ons ▇▇ 0.28
31: ill ▇▇ 0.26                                     31 ( +5): ted ▇▇ 0.27
32: out ▇▇ 0.25                                     32 ( -7): rea ▇▇ 0.27
33: ome ▇▇ 0.25                                     33 ( -2): ill ▇▇ 0.26
34: ess ▇▇ 0.25                                     34 ( +1): ive ▇▇ 0.26
35: ive ▇▇ 0.24                                     35 ( +4): nce ▇▇ 0.26
36: ted ▇▇ 0.24                                     36 ( -2): ess ▇▇ 0.25
37: ave ▇▇ 0.24                                     37 (-10): eve ▇▇ 0.25
38: ons ▇▇ 0.24                                     38 ( +8): ect ▇▇ 0.25
39: nce ▇▇ 0.24                                     39 ( +6): est ▇▇ 0.24
40: men ▇▇ 0.24                                     40 ( +1): ear ▇▇ 0.24
41: ear ▇▇ 0.24                                     45 (-13): out ▇▇ 0.21
45: est ▇▇ 0.23                                     47 (-10): ave ▇▇ 0.21
46: ect ▇▇ 0.21                                     48 (-15): ome ▇▇ 0.21

oxey_english vs eng_shai: unigrams

The relative order of the alphabetic characters seems to be identical.
eng_shai also contains numbers and symbols (and spaces).

───────────────────oxey_english─────────────────── ─────────────────────eng_shai─────────────────────
 1: e  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.71                1 ( +0): e ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 11.58
 2: t  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.87                       2 ( +0): t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 8.79
 3: a  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.92                         3 ( +0): a ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.84
 4: o  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.59                         4 ( +0): o ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.51
 5: i  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.07                           5 ( +0): i ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.02
 6: n  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.76                           6 ( +0): n ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.70
 7: s  ▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.38                            7 ( +0): s ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 6.33
 8: r  ▇▇▇▇▇▇▇▇▇▇▇▇ 6.01                             8 ( +0): r ▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.96
 9: h  ▇▇▇▇▇▇▇▇▇ 4.61                                9 ( +0): h ▇▇▇▇▇▇▇▇▇▇ 4.51
10: l  ▇▇▇▇▇▇▇▇ 4.12                                10 ( +0): l ▇▇▇▇▇▇▇▇▇ 4.08
11: d  ▇▇▇▇▇▇▇ 3.65                                 11 ( +0): d ▇▇▇▇▇▇▇▇ 3.58
12: c  ▇▇▇▇▇▇ 3.05                                  12 ( +0): c ▇▇▇▇▇▇▇ 3.05
13: u  ▇▇▇▇▇▇ 2.87                                  13 ( +0): u ▇▇▇▇▇▇ 2.84
14: m  ▇▇▇▇▇ 2.42                                   14 ( +0): m ▇▇▇▇▇ 2.39
15: f  ▇▇▇▇ 2.12                                    15 ( +0): f ▇▇▇▇▇ 2.10
16: p  ▇▇▇▇ 2.08                                    16 ( +0): p ▇▇▇▇ 2.07
17: g  ▇▇▇▇ 2.04                                    17 ( +0): g ▇▇▇▇ 2.02
18: y  ▇▇▇▇ 1.92                                    18 ( +0): y ▇▇▇▇ 1.90
19: w  ▇▇▇▇ 1.81                                    19 ( +0): w ▇▇▇▇ 1.77
20: b  ▇▇▇ 1.49                                     20 ( +0): b ▇▇▇ 1.47
21: .  ▇▇ 1.11                                      21 ( +0): . ▇▇ 1.08
22: v  ▇▇ 1.06                                      22 ( +0): v ▇▇ 1.06
23: ,  ▇▇ 1.03                                      23 ( +0): , ▇▇ 0.99
24: k  ▇▇ 0.80                                      24 ( +0): k ▇▇ 0.79
25: '  ▇ 0.48                                       25 ( +1): - ▇ 0.26
26: -  ▇ 0.26                                       26 ( -1): ' ▇ 0.26
27: x   0.21                                        27 ( +0): x  0.21
28: j   0.17                                        28 (???): "  0.19
29: q   0.10                                        29 (???): 0  0.18
30: ;   0.10                                        30 ( -2): j  0.17
31: z   0.10                                        31 (???): 1  0.16
32: /   0.08                                        32 (???): 2  0.12
33: =   0.01                                        33 ( -4): q  0.10
34: \   0.00                                        34 ( -3): z  0.10
35: [   0.00                                        35 (???): )  0.09
36: ]   0.00                                        36 (???): (  0.09
37: `   0.00                                        37 (???): :  0.07
???: )  0.00                                        38 (???): 5  0.07
???: 0  0.00                                        39 (???): 3  0.06
???: 1  0.00                                        40 (???): 9  0.05
???: "  0.00                                        47 (-15): /  0.03
???: 2  0.00                                        48 (-18): ;  0.02
???: 9  0.00                                        55 (-22): =  0.00
???: :  0.00                                       ??? (???): \  0.00
???: 5  0.00                                       ??? (???): `  0.00
???: 3  0.00                                       ??? (???): ]  0.00
???: (  0.00                                       ??? (???): [  0.00

oxey_english vs eng_shai: bigrams

───────────────────oxey_english─────────────────── ─────────────────────eng_shai─────────────────────
 1: th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.15                1 (+0): th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.10
 2: he ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.59                    2 (+0): he ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.53
 3: in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.39                      3 (+0): in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.36
 4: an ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.92                         4 (+0): an ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.90
 5: er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.87                          5 (+0): er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.85
 6: re ▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.76                           6 (+0): re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.74
 7: on ▇▇▇▇▇▇▇▇▇▇▇▇ 1.52                            7 (+0): on ▇▇▇▇▇▇▇▇▇▇▇▇ 1.51
 8: at ▇▇▇▇▇▇▇▇▇▇ 1.33                              8 (+0): at ▇▇▇▇▇▇▇▇▇▇▇ 1.31
 9: nd ▇▇▇▇▇▇▇▇▇▇ 1.29                              9 (+1): or ▇▇▇▇▇▇▇▇▇▇ 1.27
10: or ▇▇▇▇▇▇▇▇▇▇ 1.28                             10 (-1): nd ▇▇▇▇▇▇▇▇▇▇ 1.27
11: en ▇▇▇▇▇▇▇▇▇ 1.24                              11 (+0): en ▇▇▇▇▇▇▇▇▇▇ 1.22
12: es ▇▇▇▇▇▇▇▇▇ 1.20                              12 (+0): es ▇▇▇▇▇▇▇▇▇▇ 1.20
13: ou ▇▇▇▇▇▇▇▇▇ 1.18                              13 (+1): to ▇▇▇▇▇▇▇▇▇ 1.16
14: to ▇▇▇▇▇▇▇▇▇ 1.18                              14 (-1): ou ▇▇▇▇▇▇▇▇▇ 1.16
15: ng ▇▇▇▇▇▇▇▇▇ 1.13                              15 (+0): ng ▇▇▇▇▇▇▇▇▇ 1.11
16: it ▇▇▇▇▇▇▇▇ 1.10                               16 (+0): it ▇▇▇▇▇▇▇▇▇ 1.09
17: te ▇▇▇▇▇▇▇▇ 1.08                               17 (+0): te ▇▇▇▇▇▇▇▇▇ 1.07
18: st ▇▇▇▇▇▇▇▇ 1.06                               18 (+1): ti ▇▇▇▇▇▇▇▇▇ 1.06
19: ti ▇▇▇▇▇▇▇▇ 1.05                               19 (-1): st ▇▇▇▇▇▇▇▇ 1.05
20: ar ▇▇▇▇▇▇▇▇ 1.05                               20 (+0): ar ▇▇▇▇▇▇▇▇ 1.03
21: ed ▇▇▇▇▇▇▇ 0.98                                21 (+1): al ▇▇▇▇▇▇▇▇ 0.97
22: al ▇▇▇▇▇▇▇ 0.97                                22 (+1): is ▇▇▇▇▇▇▇▇ 0.96
23: is ▇▇▇▇▇▇▇ 0.97                                23 (-2): ed ▇▇▇▇▇▇▇▇ 0.96
24: ha ▇▇▇▇▇▇▇ 0.96                                24 (+0): ha ▇▇▇▇▇▇▇▇ 0.93
25: nt ▇▇▇▇▇▇▇ 0.90                                25 (+0): nt ▇▇▇▇▇▇▇ 0.90
26: ve ▇▇▇▇▇▇▇ 0.87                                26 (+0): ve ▇▇▇▇▇▇▇ 0.86
27: le ▇▇▇▇▇▇ 0.85                                 27 (+0): le ▇▇▇▇▇▇▇ 0.84
28: se ▇▇▇▇▇▇ 0.84                                 28 (+0): se ▇▇▇▇▇▇▇ 0.84
29: as ▇▇▇▇▇▇ 0.81                                 29 (+0): as ▇▇▇▇▇▇ 0.79
30: ea ▇▇▇▇▇▇ 0.78                                 30 (+0): ea ▇▇▇▇▇▇ 0.77
31: of ▇▇▇▇▇▇ 0.78                                 31 (+0): of ▇▇▇▇▇▇ 0.76
32: me ▇▇▇▇▇▇ 0.76                                 32 (+0): me ▇▇▇▇▇▇ 0.76
33: co ▇▇▇▇▇ 0.72                                  33 (+0): co ▇▇▇▇▇▇ 0.71
34: ll ▇▇▇▇▇ 0.71                                  34 (+0): ll ▇▇▇▇▇▇ 0.70
35: ro ▇▇▇▇▇ 0.70                                  35 (+0): ro ▇▇▇▇▇▇ 0.69
36: ne ▇▇▇▇▇ 0.70                                  36 (+0): ne ▇▇▇▇▇▇ 0.69
37: hi ▇▇▇▇▇ 0.69                                  37 (+1): de ▇▇▇▇▇ 0.67
38: de ▇▇▇▇▇ 0.68                                  38 (-1): hi ▇▇▇▇▇ 0.66
39: ri ▇▇▇▇▇ 0.62                                  39 (+0): ri ▇▇▇▇▇ 0.62
40: li ▇▇▇▇▇ 0.60                                  40 (+0): li ▇▇▇▇▇ 0.60

oxey_english vs eng_shai: trigrams

───────────────────oxey_english─────────────────── ─────────────────────eng_shai─────────────────────
 1: the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.68                 1 (+0): the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.63
 2: ing ▇▇▇▇▇▇▇▇▇▇ 1.27                             2 (+0): ing ▇▇▇▇▇▇▇▇▇▇▇ 1.25
 3: and ▇▇▇▇▇▇▇▇▇▇ 1.22                             3 (+0): and ▇▇▇▇▇▇▇▇▇▇ 1.19
 4: ion ▇▇▇▇▇ 0.66                                  4 (+0): ion ▇▇▇▇▇▇ 0.67
 5: ent ▇▇▇▇▇ 0.59                                  5 (+0): ent ▇▇▇▇▇ 0.59
 6: for ▇▇▇▇▇ 0.57                                  6 (+0): for ▇▇▇▇▇ 0.57
 7: you ▇▇▇▇▇ 0.56                                  7 (+0): you ▇▇▇▇▇ 0.55
 8: tio ▇▇▇▇ 0.53                                   8 (+0): tio ▇▇▇▇▇ 0.53
 9: hat ▇▇▇▇ 0.49                                   9 (+0): hat ▇▇▇▇ 0.48
10: tha ▇▇▇▇ 0.47                                  10 (+0): tha ▇▇▇▇ 0.46
11: her ▇▇▇▇ 0.47                                  11 (+0): her ▇▇▇▇ 0.45
12: ter ▇▇▇ 0.42                                   12 (+0): ter ▇▇▇▇ 0.41
13: all ▇▇▇ 0.40                                   13 (+0): all ▇▇▇ 0.39
14: ati ▇▇▇ 0.38                                   14 (+0): ati ▇▇▇ 0.38
15: thi ▇▇▇ 0.37                                   15 (+0): thi ▇▇▇ 0.36
16: ver ▇▇▇ 0.37                                   16 (+0): ver ▇▇▇ 0.36
17: ate ▇▇▇ 0.36                                   17 (+0): ate ▇▇▇ 0.36
18: our ▇▇▇ 0.36                                   18 (+0): our ▇▇▇ 0.36
19: ere ▇▇▇ 0.35                                   19 (+1): are ▇▇▇ 0.34
20: are ▇▇▇ 0.34                                   20 (-1): ere ▇▇▇ 0.34
21: ith ▇▇▇ 0.34                                   21 (+0): ith ▇▇▇ 0.34
22: wit ▇▇▇ 0.33                                   22 (+0): wit ▇▇▇ 0.33
23: his ▇▇▇ 0.33                                   23 (+1): ers ▇▇▇ 0.33
24: ers ▇▇▇ 0.33                                   24 (-1): his ▇▇▇ 0.32
25: rea ▇▇ 0.30                                    25 (+1): pro ▇▇▇ 0.30
26: pro ▇▇ 0.30                                    26 (-1): rea ▇▇▇ 0.29
27: eve ▇▇ 0.27                                    27 (+1): res ▇▇ 0.27
28: res ▇▇ 0.27                                    28 (-1): eve ▇▇ 0.27
29: com ▇▇ 0.27                                    29 (+1): con ▇▇ 0.27
30: con ▇▇ 0.27                                    30 (-1): com ▇▇ 0.27
31: ill ▇▇ 0.26                                    31 (+0): ill ▇▇ 0.26
32: out ▇▇ 0.25                                    32 (+3): ive ▇▇ 0.24
33: ome ▇▇ 0.25                                    33 (-1): out ▇▇ 0.24
34: ess ▇▇ 0.25                                    34 (+0): ess ▇▇ 0.24
35: ive ▇▇ 0.24                                    35 (-2): ome ▇▇ 0.24
36: ted ▇▇ 0.24                                    36 (+2): ons ▇▇ 0.24
37: ave ▇▇ 0.24                                    37 (-1): ted ▇▇ 0.24
38: ons ▇▇ 0.24                                    38 (-1): ave ▇▇ 0.24
39: nce ▇▇ 0.24                                    39 (+0): nce ▇▇ 0.24
40: men ▇▇ 0.24                                    40 (+0): men ▇▇ 0.24

After comparing eng_shai and oxey_english, I have a feeling that these two sets of ngrams were extracted from pretty much the same original corpus. Can anyone confirm that?
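
One way to put a number on "pretty much the same" is a distance between the two frequency distributions. The sketch below uses the total variation distance (half the sum of absolute differences). This is a technique swapped in for illustration, not something the comparison tool above is known to do, and the helper names and toy counts are hypothetical.

```python
def normalize(counts):
    """Scale raw counts (or percentages) so that the values sum to 1."""
    total = sum(counts.values())
    return {ng: c / total for ng, c in counts.items()}


def total_variation(p, q):
    """Total variation distance between two normalized ngram distributions.

    Ngrams missing from one side are treated as having frequency 0.
    """
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy counts shaped like the top three bigram rows (th, he, in) above.
a = normalize({"th": 315, "he": 259, "in": 239})
b = normalize({"th": 310, "he": 253, "in": 236})

print(f"TV distance: {total_variation(a, b):.4f}")
```

A value of 0 means the two distributions are identical and 1 means they share no ngrams at all, so a value close to 0 over the full ngram sets would support the same-corpus guess without proving it.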

Answered by dariogoetz

dariogoetz · 2024-10-07T11:25:48Z

Maintainer

Thanks for the detailed comparison :)
You are correct about the "oxey" corpora. They come from the english.json and english2.json files on the page you linked.

If I recall correctly, the "eng_shai" corpus is the iweb corpus that was used to develop the "colemak" layout. It was added in 2022, though, so my memory may be incorrect.

I don't know which sources oxey's two English corpora are based on. It may very well be the "shai" corpus. I mainly used the oxey corpora to compare the oxey metrics of this analyzer with the ones from oxey's playground and make sure they are aligned.

5 replies

fohrloop Oct 7, 2024
Author

Thank you for the prompt response! I compared the iweb corpus against eng_shai and indeed, they are identical. For example, the top 100 trigrams:

───────────────────────iweb─────────────────────── ─────────────────────eng_shai─────────────────────
 1: ␣th  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.57                 1 (+0): ␣th  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.57
 2: the  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.29                     2 (+0): the  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.29
 3: he␣  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.03                        3 (+0): he␣  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.03
 4: ␣an  ▇▇▇▇▇▇▇▇ 0.63                              4 (+0): ␣an  ▇▇▇▇▇▇▇▇▇ 0.63
 5: ing  ▇▇▇▇▇▇▇▇ 0.61                              5 (+0): ing  ▇▇▇▇▇▇▇▇▇ 0.61
 6: nd␣  ▇▇▇▇▇▇▇▇ 0.61                              6 (+0): nd␣  ▇▇▇▇▇▇▇▇▇ 0.61
 7: and  ▇▇▇▇▇▇▇▇ 0.59                              7 (+0): and  ▇▇▇▇▇▇▇▇ 0.59
 8: ␣to  ▇▇▇▇▇▇▇▇ 0.58                              8 (+0): ␣to  ▇▇▇▇▇▇▇▇ 0.58
 9: ng␣  ▇▇▇▇▇▇▇ 0.54                               9 (+0): ng␣  ▇▇▇▇▇▇▇▇ 0.54
10: to␣  ▇▇▇▇▇▇▇ 0.53                              10 (+0): to␣  ▇▇▇▇▇▇▇ 0.53
11: ␣in  ▇▇▇▇▇▇▇ 0.50                              11 (+0): ␣in  ▇▇▇▇▇▇▇ 0.50
12: ed␣  ▇▇▇▇▇▇ 0.48                               12 (+0): ed␣  ▇▇▇▇▇▇▇ 0.48
13: ␣of  ▇▇▇▇▇▇ 0.47                               13 (+0): ␣of  ▇▇▇▇▇▇▇ 0.47
14: of␣  ▇▇▇▇▇▇ 0.43                               14 (+0): of␣  ▇▇▇▇▇▇ 0.43
15: ␣a␣  ▇▇▇▇▇ 0.40                                15 (+0): ␣a␣  ▇▇▇▇▇▇ 0.40
16: er␣  ▇▇▇▇▇ 0.40                                16 (+0): er␣  ▇▇▇▇▇▇ 0.40
17: is␣  ▇▇▇▇▇ 0.36                                17 (+0): is␣  ▇▇▇▇▇ 0.36
18: in␣  ▇▇▇▇▇ 0.35                                18 (+0): in␣  ▇▇▇▇▇ 0.35
19: ␣co  ▇▇▇▇▇ 0.35                                19 (+0): ␣co  ▇▇▇▇▇ 0.35
20: re␣  ▇▇▇▇▇ 0.35                                20 (+0): re␣  ▇▇▇▇▇ 0.35
21: on␣  ▇▇▇▇▇ 0.35                                21 (+0): on␣  ▇▇▇▇▇ 0.35
22: e␣t  ▇▇▇▇▇ 0.34                                22 (+0): e␣t  ▇▇▇▇▇ 0.34
23: s␣a  ▇▇▇▇ 0.33                                 23 (+0): s␣a  ▇▇▇▇▇ 0.33
24: ion  ▇▇▇▇ 0.33                                 24 (+0): ion  ▇▇▇▇▇ 0.33
25: at␣  ▇▇▇▇ 0.32                                 25 (+0): at␣  ▇▇▇▇▇ 0.32
26: or␣  ▇▇▇▇ 0.32                                 26 (+0): or␣  ▇▇▇▇ 0.32
27: es␣  ▇▇▇▇ 0.30                                 27 (+0): es␣  ▇▇▇▇ 0.30
28: e␣a  ▇▇▇▇ 0.30                                 28 (+0): e␣a  ▇▇▇▇ 0.30
29: ent  ▇▇▇▇ 0.29                                 29 (+0): ent  ▇▇▇▇ 0.29
30: ␣re  ▇▇▇▇ 0.29                                 30 (+0): ␣re  ▇▇▇▇ 0.29
31: ␣be  ▇▇▇▇ 0.29                                 31 (+0): ␣be  ▇▇▇▇ 0.29
32: for  ▇▇▇▇ 0.28                                 32 (+0): for  ▇▇▇▇ 0.28
33: you  ▇▇▇▇ 0.27                                 33 (+0): you  ▇▇▇▇ 0.27
34: ␣fo  ▇▇▇▇ 0.27                                 34 (+0): ␣fo  ▇▇▇▇ 0.27
35: ␣yo  ▇▇▇▇ 0.27                                 35 (+0): ␣yo  ▇▇▇▇ 0.27
36: tio  ▇▇▇▇ 0.26                                 36 (+0): tio  ▇▇▇▇ 0.26
37: as␣  ▇▇▇ 0.26                                  37 (+0): as␣  ▇▇▇▇ 0.26
38: ␣wi  ▇▇▇ 0.26                                  38 (+0): ␣wi  ▇▇▇▇ 0.26
39: n␣t  ▇▇▇ 0.25                                  39 (+0): n␣t  ▇▇▇ 0.25
40: s␣t  ▇▇▇ 0.25                                  40 (+0): s␣t  ▇▇▇ 0.25
41: d␣t  ▇▇▇ 0.25                                  41 (+0): d␣t  ▇▇▇ 0.25
42: t␣t  ▇▇▇ 0.24                                  42 (+0): t␣t  ▇▇▇ 0.24
43: hat  ▇▇▇ 0.23                                  43 (+0): hat  ▇▇▇ 0.23
44: ␣ha  ▇▇▇ 0.23                                  44 (+0): ␣ha  ▇▇▇ 0.23
45: e␣s  ▇▇▇ 0.23                                  45 (+0): e␣s  ▇▇▇ 0.23
46: tha  ▇▇▇ 0.23                                  46 (+0): tha  ▇▇▇ 0.23
47: ␣is  ▇▇▇ 0.22                                  47 (+0): ␣is  ▇▇▇ 0.22
48: ␣on  ▇▇▇ 0.22                                  48 (+0): ␣on  ▇▇▇ 0.22
49: an␣  ▇▇▇ 0.22                                  49 (+0): an␣  ▇▇▇ 0.22
50: her  ▇▇▇ 0.22                                  50 (+0): her  ▇▇▇ 0.22
51: ly␣  ▇▇▇ 0.22                                  51 (+0): ly␣  ▇▇▇ 0.22
52: ␣ma  ▇▇▇ 0.21                                  52 (+0): ␣ma  ▇▇▇ 0.21
53: ␣pr  ▇▇▇ 0.20                                  53 (+0): ␣pr  ▇▇▇ 0.20
54: st␣  ▇▇▇ 0.20                                  54 (+0): st␣  ▇▇▇ 0.20
55: ␣wh  ▇▇▇ 0.20                                  55 (+0): ␣wh  ▇▇▇ 0.20
56: ter  ▇▇▇ 0.20                                  56 (+0): ter  ▇▇▇ 0.20
57: ll␣  ▇▇▇ 0.20                                  57 (+0): ll␣  ▇▇▇ 0.20
58: ve␣  ▇▇▇ 0.20                                  58 (+0): ve␣  ▇▇▇ 0.20
59: ␣ca  ▇▇▇ 0.20                                  59 (+0): ␣ca  ▇▇▇ 0.20
60: e␣o  ▇▇▇ 0.20                                  60 (+0): e␣o  ▇▇▇ 0.20
61: ␣it  ▇▇▇ 0.20                                  61 (+0): ␣it  ▇▇▇ 0.20
62: th␣  ▇▇▇ 0.19                                  62 (+0): th␣  ▇▇▇ 0.19
63: all  ▇▇▇ 0.19                                  63 (+0): all  ▇▇▇ 0.19
64: en␣  ▇▇▇ 0.19                                  64 (+0): en␣  ▇▇▇ 0.19
65: ␣st  ▇▇▇ 0.19                                  65 (+0): ␣st  ▇▇▇ 0.19
66: ati  ▇▇▇ 0.19                                  66 (+0): ati  ▇▇▇ 0.19
67: al␣  ▇▇▇ 0.19                                  67 (+0): al␣  ▇▇▇ 0.19
68: e␣i  ▇▇ 0.19                                   68 (+0): e␣i  ▇▇▇ 0.19
69: nt␣  ▇▇ 0.19                                   69 (+0): nt␣  ▇▇▇ 0.19
70: ␣wa  ▇▇ 0.18                                   70 (+0): ␣wa  ▇▇▇ 0.18
71: e␣c  ▇▇ 0.18                                   71 (+0): e␣c  ▇▇▇ 0.18
72: thi  ▇▇ 0.18                                   72 (+0): thi  ▇▇ 0.18
73: ver  ▇▇ 0.18                                   73 (+0): ver  ▇▇ 0.18
74: ate  ▇▇ 0.18                                   74 (+0): ate  ▇▇ 0.18
75: our  ▇▇ 0.18                                   75 (+0): our  ▇▇ 0.18
76: le␣  ▇▇ 0.17                                   76 (+0): le␣  ▇▇ 0.17
77: ␣se  ▇▇ 0.17                                   77 (+0): ␣se  ▇▇ 0.17
78: ut␣  ▇▇ 0.17                                   78 (+0): ut␣  ▇▇ 0.17
79: ts␣  ▇▇ 0.17                                   79 (+0): ts␣  ▇▇ 0.17
80: ␣he  ▇▇ 0.17                                   80 (+0): ␣he  ▇▇ 0.17
81: are  ▇▇ 0.17                                   81 (+0): are  ▇▇ 0.17
82: ere  ▇▇ 0.17                                   82 (+0): ere  ▇▇ 0.17
83: ith  ▇▇ 0.16                                   83 (+0): ith  ▇▇ 0.16
84: it␣  ▇▇ 0.16                                   84 (+0): it␣  ▇▇ 0.16
85: e␣w  ▇▇ 0.16                                   85 (+0): e␣w  ▇▇ 0.16
86: ␣we  ▇▇ 0.16                                   86 (+0): ␣we  ▇▇ 0.16
87: s␣o  ▇▇ 0.16                                   87 (+0): s␣o  ▇▇ 0.16
88: wit  ▇▇ 0.16                                   88 (+0): wit  ▇▇ 0.16
89: ers  ▇▇ 0.16                                   89 (+0): ers  ▇▇ 0.16
90: n␣a  ▇▇ 0.16                                   90 (+0): n␣a  ▇▇ 0.16
91: ␣ar  ▇▇ 0.16                                   91 (+0): ␣ar  ▇▇ 0.16
92: s,␣  ▇▇ 0.16                                   92 (+0): s,␣  ▇▇ 0.16
93: se␣  ▇▇ 0.16                                   93 (+0): se␣  ▇▇ 0.16
94: f␣t  ▇▇ 0.16                                   94 (+0): f␣t  ▇▇ 0.16
95: his  ▇▇ 0.16                                   95 (+0): his  ▇▇ 0.16
96: ␣so  ▇▇ 0.15                                   96 (+0): ␣so  ▇▇ 0.15
97: t␣a  ▇▇ 0.15                                   97 (+0): t␣a  ▇▇ 0.15
98: ␣no  ▇▇ 0.15                                   98 (+0): ␣no  ▇▇ 0.15
99: ␣mo  ▇▇ 0.15                                   99 (+0): ␣mo  ▇▇ 0.15
100: ou␣ ▇▇ 0.15                                   100 (+0): ou␣ ▇▇ 0.15

fohrloop Oct 7, 2024
Author

fwiw, I found this comment explaining a bit about the english and english2 datasets.

dariogoetz Oct 8, 2024
Maintainer

Very interesting, thank you. I am always happy about a merge request for some documentation of the datasets, if you can spare the time :)

fohrloop Oct 8, 2024
Author

Can't promise that I'll write a PR, but you're free to use my comments and shared data as you wish! I'm currently trying to use the keyboard_layout_optimizer to create a layout for myself, and I'm documenting the process, which will hopefully make it easier for others to start using the tool too. I'll be sharing the progress at least in r/KeyboardLayouts.

fohrloop Oct 8, 2024
Author

I also compared the iweb corpus to the shai.json in the oxeylyzer repo. They were also almost identical. The difference was that shai.json did not contain ngrams with whitespace, uppercase characters, or numbers. See comment here.
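
For anyone who wants to repeat this kind of comparison, the filtering step can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used above: the function names are hypothetical, the input is assumed to be a plain mapping from ngram to frequency, and renormalizing to percentages after filtering is my assumption about how the two sets were made comparable.

```python
def is_excluded(ngram):
    """True if the ngram contains whitespace, an uppercase letter, or a digit."""
    return any(ch.isspace() or ch.isupper() or ch.isdigit() for ch in ngram)


def filter_and_renormalize(freqs):
    """Drop excluded ngrams and rescale the remaining ones to sum to 100 %."""
    kept = {ng: f for ng, f in freqs.items() if not is_excluded(ng)}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {ng: 100.0 * f / total for ng, f in kept.items()}


# Toy input: a few iweb-style trigrams. The "The" and "201" entries and
# their values are made up to exercise the uppercase and digit rules.
iweb_like = {
    " th": 1.57,
    "the": 1.29,
    "he ": 1.03,
    "ing": 0.61,
    "The": 0.10,
    "201": 0.05,
}

filtered = filter_and_renormalize(iweb_like)
print(sorted(filtered))
```

Only `the` and `ing` survive the filter here. Their shares are then relative to the filtered total, which is why percentages in a whitespace-free set come out higher than in the full iweb set.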
