Repetitions in frequency-alpha-alldicts.txt #5

Open
bszollosinagy opened this issue May 18, 2023 · 1 comment

@bszollosinagy

The word "ascetic" appears more than once in the file: at rank 18614, again at rank 25054, and also at ranks 63318 and 104505.

The words "copious" and "verdant" also seem to be duplicated for some reason.

Can the counts be simply summed across all occurrences?

@hackerb9
Owner

$ grep ascetic frequency-alpha-alldicts.txt 
18614      ascetic                      2,875,469    0.000199%   97.305329%
25054      asceticism                   1,605,339    0.000111%   98.265396%
63318      ascetical                      153,464    0.000011%   99.760632%
104505     ascetically                     24,997    0.000002%   99.955170%
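
The grep output above matches "ascetic" as a substring, so the four rows are four different words rather than repeats. To check for true duplicates, compare the word column exactly (`grep -w ascetic` does this for a single word). Below is a minimal Python sketch for the whole file. It assumes the whitespace-separated layout shown above (rank, word, count, percent, cumulative percent) and skips any line that does not start with a numeric rank; the function name `exact_duplicates` is made up for illustration.

```python
from collections import Counter

def exact_duplicates(lines):
    """Return the words that occur more than once as a whole word in column 2."""
    seen = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: rank, word, count, percent, cumulative percent.
        # Lines without a numeric rank (headers, blanks) are skipped.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        seen[fields[1]] += 1
    return sorted(word for word, n in seen.items() if n > 1)

# The four rows from the grep output above.
sample = [
    "18614      ascetic                      2,875,469    0.000199%   97.305329%",
    "25054      asceticism                   1,605,339    0.000111%   98.265396%",
    "63318      ascetical                      153,464    0.000011%   99.760632%",
    "104505     ascetically                     24,997    0.000002%   99.955170%",
]

print(exact_duplicates(sample))  # → []
```

To run it on the real file, pass an open file handle instead of `sample`, for example `exact_duplicates(open("frequency-alpha-alldicts.txt"))`.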

It would be nice to be able to merge different forms of the same root together, as a dictionary does, but that information is not included in the Google corpus.

Do you know of any database I could use for such merging? I'm not going to write an automatic algorithm for it as it'd end up merging "cop" with "copy" and "copious".
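
To illustrate the concern, here is a rough sketch, not something in this repository. `naive_prefix_merge` is a deliberately simplistic stand-in for an "automatic algorithm" and shows the over-merging problem. `merge_by_lemma` shows how counts could be summed if a word-to-root table were available. The `LEMMA_OF` table is hand-written and hypothetical; the counts are the ones from the grep output above.

```python
from collections import defaultdict

def naive_prefix_merge(words, prefix_len=3):
    """Group words by their first prefix_len letters (intentionally too crude)."""
    groups = defaultdict(list)
    for word in words:
        groups[word[:prefix_len]].append(word)
    return dict(groups)

def merge_by_lemma(counts, lemma_of):
    """Sum counts under each word's root; words missing from lemma_of stay as-is."""
    totals = defaultdict(int)
    for word, count in counts.items():
        totals[lemma_of.get(word, word)] += count
    return dict(totals)

# Unrelated words end up in one group.
print(naive_prefix_merge(["cop", "copy", "copious"]))
# → {'cop': ['cop', 'copy', 'copious']}

# Counts taken from the grep output above.
counts = {
    "ascetic": 2875469,
    "asceticism": 1605339,
    "ascetical": 153464,
    "ascetically": 24997,
}

# Hypothetical hand-written mapping; a real one would come from a lemma database.
LEMMA_OF = {
    "asceticism": "ascetic",
    "ascetical": "ascetic",
    "ascetically": "ascetic",
}

print(merge_by_lemma(counts, LEMMA_OF))
# → {'ascetic': 4659269}
```

With an explicit mapping, the summing itself is trivial; the hard part is obtaining a trustworthy mapping.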
