Cities not identified #22
Comments
"Ventallo" == "Ventalló"
>> False
"Sant Cugat del Vallés" == "Sant Cugat del Vallès"
>> False
Check the accents. Hope that clarifies it for you. |
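For anyone who wants the two spellings to be treated as the same city, here is a minimal sketch of an accent-insensitive comparison using only the standard library. The helper names `strip_accents` and `same_city_name` are made up for this example and are not part of geotext.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining diacritical marks removed."""
    # NFD splits "ó" into "o" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_city_name(a: str, b: str) -> bool:
    """Compare two names while ignoring accents and letter case."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


print("Ventallo" == "Ventalló")                 # False
print(same_city_name("Ventallo", "Ventalló"))   # True
print(same_city_name("Sant Cugat del Vallés", "Sant Cugat del Vallès"))  # True
```

Whether folding accents is desirable depends on the use case, since it can also merge names that really are different places.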
Check it yourselves by using:
import geotext
import re
text = 'I loved Rio de Janeiro and Havana'
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
candidates = re.findall(city_regex, text)
print(candidates)
>> ['Rio de Janeiro', 'Havana']
So yes, it catches a three-word city if there are two capitalized words separated by di/da/de/du. Now let's take this experiment further and show why regex does the job here, but not reliably, for the specific task of finding city names in a text.
import geotext
import re
text = 'In Rio de Janeiro and Havana people love to drink rum.'
city_regex = r"[A-ZÀ-Ú]+[a-zà-ú]+[ \-]?(?:d[a-u].)?(?:[A-ZÀ-Ú]+[a-zà-ú]+)*"
candidates = re.findall(city_regex, text)
print(candidates)
>> ['In Rio', 'Janeiro ', 'Havana ']
Now the regex no longer catches Rio de Janeiro, because "Rio" has already been consumed by the match that starts at "In". Hope I made myself a little more clear. |
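One alternative to the capitalization regex is to tokenize the text and look up word sequences in a set of known names, trying the longest sequence first. The sketch below is only an illustration of that idea and is not how geotext works; `KNOWN_CITIES` is a tiny hypothetical stand-in for a real gazetteer, and `find_cities` is a made-up name.

```python
import re

# Hypothetical, tiny stand-in for a real gazetteer (stored casefolded).
KNOWN_CITIES = {"rio de janeiro", "havana"}
MAX_WORDS = max(len(name.split()) for name in KNOWN_CITIES)


def find_cities(text: str) -> list[str]:
    """Return known city names found in text, preferring the longest match."""
    tokens = re.findall(r"\w+", text)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest window first so "Rio de Janeiro" beats shorter names.
        for size in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate.casefold() in KNOWN_CITIES:
                found.append(candidate)
                i += size
                break
        else:
            i += 1  # no known name starts at this token
    return found


print(find_cities("In Rio de Janeiro and Havana people love to drink rum."))
# ['Rio de Janeiro', 'Havana']
```

Because the lookup does not depend on capitalization, the leading "In" no longer swallows "Rio". Note that `\w+` splits hyphenated names, so those would need extra handling.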
Ok, now I understand what you mean by more than two words; I would rephrase it as "2 words excluding the nexus (da-du)". I found a couple of cases that the regular expression will not be able to match. A "nexus" that is not one of "da-du" will not be detected, for example:
When the "nexus" and another word appear together. This might be a specific case:
I have also found something odd. I should have got a match for "Sant Cugat del Vallés"; if you look at the file cities15000.txt we have the following:
So, ideally I should have a match only in the first 2 words, actually:
I also did some testing. It looks like the algorithm does not use all the entries in cities15000.txt. For example, for Mexico City I have only been able to match "Mexico City" and not the other names I tried (I did not try all of them).
Concerning the version of the geonames data (as Sant Cugat del Vallés is not included): I downloaded the file cities15000.txt from geonames and found that, although "Sant Cugat del Vallés" can be found on the website, it is not in the raw extracted file. So it looks like the website is not providing all the data in that file. Quick question: I have seen that there is also a file including cities down to 500 people, and its size is "just" 4 times that of the 15000 file. Why not use the bigger file for higher coverage? I mean, performance should not be a problem. In the end, this piece of code is much better than what I have; I am just brainstorming some ideas that might improve it. |
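On the alternate names: the GeoNames dump files are tab-separated, with the main name, the ASCII name, and a comma-separated list of alternate names in the second to fourth columns. Whether geotext reads that last column is not confirmed in this thread; the sketch below only shows how all three could be indexed. The function `index_row` and the sample row are made up for illustration, and the row is not copied from cities15000.txt.

```python
# 0-based column positions in the tab-separated GeoNames dump format.
NAME, ASCII_NAME, ALTERNATE_NAMES, COUNTRY_CODE = 1, 2, 3, 8


def index_row(line: str, index: dict[str, str]) -> None:
    """Add the main, ASCII, and alternate names of one row to the index."""
    fields = line.rstrip("\n").split("\t")
    names = {fields[NAME], fields[ASCII_NAME], *fields[ALTERNATE_NAMES].split(",")}
    for name in names:
        if name:  # the alternate-names column may be empty
            index.setdefault(name.casefold(), fields[COUNTRY_CODE])


# Made-up row for illustration only (id, names, coordinates, codes are fake).
sample = "\t".join([
    "0", "Exampleville", "Exampleville", "Example City,Villa Ejemplo",
    "0.0", "0.0", "P", "PPL", "XX",
])

index: dict[str, str] = {}
index_row(sample, index)
print(sorted(index))
# ['example city', 'exampleville', 'villa ejemplo']
```

With an index like this, a lookup for any alternate spelling would resolve to the same country code as the main name.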
Ideally you would have the option to pass a country and/or language, and geotext would pull only what it needs. I also don't think that building the index on every instantiation of Geotext is a good design decision. I would create it once and share it with every instance. Also, as we have both now pointed out, regex is not the way to go. |
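The "create it once" idea could look roughly like the sketch below, which caches the parsed index per file path so that every instance reuses the same object. `CityMatcher` and `load_city_index` are hypothetical names, not geotext's API, and the one-name-per-line file format is an assumption made to keep the example short.

```python
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=None)
def load_city_index(path: str) -> frozenset[str]:
    """Read one city name per line; cached so each path is parsed only once."""
    with open(path, encoding="utf-8") as handle:
        return frozenset(line.strip().casefold() for line in handle if line.strip())


class CityMatcher:
    """Hypothetical class: all instances share the cached index."""

    def __init__(self, text: str, path: str) -> None:
        self.text = text
        self.index = load_city_index(path)  # cheap after the first call


# Demo with a throwaway file standing in for the real data file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Havana\nRio de Janeiro\n")

first = CityMatcher("some text", tmp.name)
second = CityMatcher("other text", tmp.name)
print(first.index is second.index)  # True
os.remove(tmp.name)
```

Keying the cache on the path would also leave room for per-country or per-language files, each loaded only when first requested.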
Hi guys, thanks for the great feedback. The library is unfortunately somewhat abandoned, but the good news is that, after seeing that people still use it despite its many flaws, I plan to give it some attention during the holidays. These are some areas that I see can be improved based on your feedback:
What are your thoughts here? |
Hi @elyase!
I would be happy to help. :) |
The code is quite good. ML can be heavy, but it might help some other users. For my specific use it is not a priority. I do agree regex might be a little bit fragile, but on the other hand it is quite fast. You can always combine one regex with other regexes and/or other approaches. Maybe mix in ML in a "high accuracy" mode? In the end you will need a bunch of examples and will have to develop algorithms to handle them; I can provide some in the languages I master. I can help with development as well. I am OK with API improvements, but the API is not a pain point for me today. |
I have found 2 cities which are not identified by geotext.
The cities "Ventalló" and "Sant Cugat del Vallès" exist in http://www.geonames.org but geotext is not able to find them.