USA not recognized #23

Open
lisiq opened this issue Dec 28, 2019 · 5 comments

@lisiq
lisiq commented Dec 28, 2019

"USA" is not being detected. I have to replace "USA" with "United States" for the country to be detected.

@dhimmel
dhimmel commented Feb 26, 2020

I'm also not able to extract a country for "USA" using geotext. It looks like "UK" also does not produce a match:

import geotext
text = "UK"
geo_text = geotext.GeoText(text)
dict(geo_text.country_mentions)

returns {} (an empty dict). @elyase would these be easy fixes?

@iwpnd
iwpnd commented Mar 2, 2020

Check the demo data carefully. GeoText does not use synonyms in its lookup.

@dhimmel
dhimmel commented Mar 2, 2020

> Check the demo data carefully

Are you talking about the data in geotext/data? I do see data that would seem to allow for mapping USA and UK to countries:

US USA 840 US United States Washington 9629091 310232863 NA .us USD Dollar 1 #####-#### ^\d{5}(-\d{4})?$ en-US,es-US,haw,fr 6252001 CA,MX,CU

# The official ISO country code for the United Kingdom is 'GB'. The code 'UK' is reserved.

GB GBR 826 UK United Kingdom London 244820 62348447 EU .uk GBP Pound 44 @# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA ^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2}[A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$ en-GB,cy-GB,gd 2635167 IE

> GeoText does not use synonyms in its lookup

First, isn't USA the official ISO 3166 3-letter code for the United States? If so, it is not a synonym. Also, if this issue is caused by excluding synonyms, perhaps that's the wrong design decision?
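To make the point about the data concrete, here is a minimal sketch of how rows like the two quoted above could be turned into a code-to-country table. It is not GeoText's actual loader. It assumes the rows are tab-separated and follow the GeoNames `countryInfo.txt` column order (ISO, ISO3, ISO-Numeric, FIPS, Country, ...); `build_alias_map` and the shortened sample rows are hypothetical.

```python
def build_alias_map(rows):
    """Map the ISO2, ISO3 and FIPS codes of each row to its country name."""
    aliases = {}
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        # Assumed column order: ISO, ISO3, ISO-Numeric, FIPS, Country, ...
        iso2, iso3, _numeric, fips, name = fields[:5]
        for code in (iso2, iso3, fips):
            if code:
                aliases[code] = name
    return aliases


# Shortened, hypothetical versions of the rows quoted in this comment.
SAMPLE_ROWS = [
    "US\tUSA\t840\tUS\tUnited States\tWashington",
    "# The official ISO country code for the United Kingdom is 'GB'. The code 'UK' is reserved.",
    "GB\tGBR\t826\tUK\tUnited Kingdom\tLondon",
]

ALIASES = build_alias_map(SAMPLE_ROWS)
print(ALIASES["USA"], "|", ALIASES["UK"])
```

With these sample rows, both "USA" (ISO3 column) and "UK" (FIPS column) resolve to a country name, which is the mapping the data appears to allow.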

@iwpnd
iwpnd commented Mar 2, 2020

The data in geotext/data does not reflect what GeoText is looking for in a text. GeoText only uses a small part of it for the lookup. So yes, the data allows for more, but GeoText does not make use of it.
I talked about it briefly in issue-22. If you want synonyms, I tried another approach over at flashgeotext. I'm not sure it covers all the synonyms for country names, but it covers some, and I leave it to you to bring your own data or add data if something is missing.

@elyase
Owner

elyase commented Mar 2, 2020

Hi guys, we don't include ISO codes because the approach used in Geotext (rule-based regex) relies on high-precision rules, so you can be almost certain that a match is correct when there is one. The drawback is that we lose some recall.
While there are several ways this can be improved, there is always a fine line in the precision/recall tradeoff. For example, if you take a look at the ISO list you will see that many of the codes are tokens that are found everywhere, even when they don't represent a country. Some of them, like USA, even have meanings in other languages ("usa" is a form of the Spanish verb "usar", "to use"). I have long wanted to improve the regex using a data-driven approach, but I am missing data with representative negative examples (cases where USA appears in the text but should not be extracted as a country).
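The tradeoff described above can be illustrated with a small, self-contained sketch. It does not use GeoText's own rules; the two matcher functions and the sample sentences are hypothetical, and the code list is cut down to two entries.

```python
import re


def _pattern(codes, flags=0):
    # Whole-word alternation over the given codes.
    return re.compile(r"\b(" + "|".join(map(re.escape, codes)) + r")\b", flags)


def naive_code_mentions(text, codes):
    """Case-insensitive whole-word match: more recall, less precision."""
    return [m.group(0) for m in _pattern(codes, re.IGNORECASE).finditer(text)]


def strict_code_mentions(text, codes):
    """Case-sensitive whole-word match: fewer false positives."""
    return [m.group(0) for m in _pattern(codes).finditer(text)]


CODES = ["USA", "GBR"]
spanish = "Ella usa el coche."      # "She uses the car."
english = "She moved to the USA."

print(naive_code_mentions(spanish, CODES))   # false positive on "usa"
print(strict_code_mentions(spanish, CODES))  # nothing matched
print(strict_code_mentions(english, CODES))  # the real mention is kept
```

Even the stricter matcher would still misfire on all-caps Spanish text, which is the kind of negative example a data-driven rule would need to learn from.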

So I prefer the approach of providing basic functionality with high precision and leaving the responsibility of extending recall to users, e.g. what @lisiq did (preprocessing the data). What we could do is improve the API to make it easier to add your own exceptions.
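As a rough sketch of that preprocessing route, the helper below expands a user-supplied alias table before the text is handed to any extractor. The `ALIASES` table and `expand_aliases` are hypothetical names, not part of the geotext API, and matching is deliberately exact-case and whole-word to keep precision high.

```python
import re

# Hypothetical alias table; extend it with whatever your data needs.
ALIASES = {"USA": "United States", "UK": "United Kingdom"}


def expand_aliases(text, aliases=ALIASES):
    """Replace standalone, exact-case aliases with full country names."""
    if not aliases:
        return text
    # Longest aliases first so a longer code is never shadowed by a shorter one.
    keys = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b")
    return pattern.sub(lambda m: aliases[m.group(1)], text)


print(expand_aliases("Flights from the UK to the USA"))
```

The expanded string can then be passed to `geotext.GeoText(...)` as in the snippet earlier in this thread. Per the original report, "United States" is detected after such a replacement; other aliases would need to be checked case by case.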

That said, flashgeotext from @iwpnd looks great. Please try it out and let me know if we should join efforts there.
