Sensitivity to capitalization, punctuation, and places sharing a name. #7

Open
khof312 opened this issue Mar 30, 2017 · 3 comments

@khof312

khof312 commented Mar 30, 2017

Hi @elyase, this is great work, thanks. It is very fast. However, I am encountering a few reliability issues. Specifically, the library is very sensitive to capitalization and punctuation (it ignores lowercase names, and it ignores countries that are followed by other capitalized words), and it has trouble disambiguating between multiple places with the same name. For example:

GeoText("France Is A Country").country_mentions
>>OrderedDict()

GeoText("paris France").country_mentions
>>OrderedDict([('FR', 1)])

GeoText("Paris France").country_mentions
>>OrderedDict()

GeoText("Paris, France").country_mentions
>>OrderedDict([('FR', 1), ('US', 1)])

(Presumably because there are also American cities named Paris?)

Just wanted to flag this for future updates...thanks!

@elyase
Owner

elyase commented Mar 30, 2017

Thanks for bringing up those issues. You are right that there are a lot of corner cases that give wrong results. Some can be traced back to the data, and some come from limitations of the regex approach.
On my wish list is an optional machine learning approach that can hopefully do better disambiguation, at the cost of being somewhat slower and adding some dependencies.
For now I will manually patch the cases you found so they are fixed in the next release.
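
Until something like that exists, a simple rule-based heuristic can cover cases such as "Paris, France": when a city name maps to several countries, prefer a country that is explicitly named in the same text. The sketch below is only an illustration under stated assumptions. It is not how geotext works internally, and `CITY_TO_COUNTRIES`, `COUNTRY_NAMES`, and `resolve_countries` are hypothetical names with toy data.

```python
from collections import Counter

# Hypothetical toy lookup tables; a real gazetteer would be far larger.
CITY_TO_COUNTRIES = {
    "paris": ["FR", "US"],
    "london": ["GB", "CA"],
}
COUNTRY_NAMES = {"france": "FR", "united states": "US", "canada": "CA"}


def resolve_countries(cities, countries):
    """Count country codes, letting explicit country mentions break city ties."""
    # Country codes that the text names directly.
    mentioned = [
        COUNTRY_NAMES[name.casefold()]
        for name in countries
        if name.casefold() in COUNTRY_NAMES
    ]
    counts = Counter(mentioned)

    for city in cities:
        candidates = CITY_TO_COUNTRIES.get(city.casefold(), [])
        # Keep only candidates the text also names; if none match,
        # the ambiguity is unresolved and every candidate is counted.
        narrowed = [code for code in candidates if code in mentioned]
        counts.update(narrowed or candidates)
    return counts


print(resolve_countries(["Paris"], ["France"]))  # only FR is counted
print(resolve_countries(["Paris"], []))          # FR and US both remain
```

With an explicit "France" in the text, "Paris" contributes only to FR. Without it, both FR and US are kept, which matches the ambiguity reported above.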

@khof312
Author

khof312 commented Mar 31, 2017

Thanks! I didn't mean to make demands; this is already a great service that you are providing for free :) I am using the library regardless, thank you!!! If I have the time, I will also try to propose some fixes.

@apoorv-agarwal

Change the regex to [A-Za-z]+[a-zà-ú](?:[ '-][A-Z]+[a-zà-ú])*

This should address the sensitivity to capitalization. However, there are issues beyond the regex as well. For example, even when the regex detects "LONDON" as a candidate, it is not captured in the results.
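
A quick way to sanity-check a candidate pattern before patching the library is to run it through Python's `re` module on the strings from this thread. The sketch below uses the pattern exactly as typed above. `candidates` is a hypothetical helper, not part of geotext, and it only shows what the regex extracts, not what the library would finally report.

```python
import re

# Pattern copied verbatim from the comment above.
PROPOSED = re.compile(r"[A-Za-z]+[a-zà-ú](?:[ '-][A-Z]+[a-zà-ú])*")


def candidates(text):
    """Return every non-overlapping substring the proposed pattern matches."""
    return PROPOSED.findall(text)


for sample in ["France Is A Country", "paris France", "Paris, France", "LONDON"]:
    print(repr(sample), "->", candidates(sample))
```

As typed, the pattern requires each token to end in a lowercase letter, so an all-caps string such as "LONDON" yields no candidates, and "paris France" is split into "paris Fr" and "ance". It may be worth checking whether `+` quantifiers after the lowercase classes were intended.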
