fix(doc): Do not allow duplicate names to be created #511

Merged
orangejulius merged 1 commit into master from do-not-create-duplicate-names
Aug 13, 2020

@orangejulius orangejulius commented Jun 9, 2020

While diagnosing an issue related to scoring, I discovered that WOF records are sometimes created with duplicate name values. While the pelias/model code can detect some of them (and more will be fixed with pelias/model#132), we could also fix this issue at the source.

Here's an example of what a document might look like today, before this PR:

{
  name: { default: [ 'Kansas City' ] },
  phrase: { default: [ 'Kansas City', 'Kansas City' ] },
  ...
}

This can be fixed by checking each potential alternate name against the "primary" name value.
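
A minimal sketch of that check, assuming alternate names arrive as a plain array of strings. `dedupeAlternateNames` is a hypothetical helper for illustration, not the code changed in this PR, and it only catches exact string matches:

```javascript
// Hypothetical helper (an assumption, not the importer's actual code):
// drop any alternate name that repeats the primary name or an earlier alternate.
function dedupeAlternateNames(primaryName, alternateNames) {
  // Seed the set with the primary name so alternates are compared against it.
  const seen = new Set([primaryName]);
  const unique = [];

  for (const name of alternateNames) {
    // Skip non-strings and anything already recorded (exact match only).
    if (typeof name !== 'string' || seen.has(name)) {
      continue;
    }
    seen.add(name);
    unique.push(name);
  }

  return unique;
}

// The duplicate 'Kansas City' and the repeated 'KC' are both dropped.
console.log(dedupeAlternateNames('Kansas City', ['Kansas City', 'KC', 'KC']));
// → [ 'KC' ]
```

Names that differ only in case or whitespace would still pass through this sketch; whether those should count as duplicates is a separate decision.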

We might not want to merge this PR, since it only fixes the issue in this repo, but it might also be nice to test this change in a single repository first.

orangejulius commented Jun 9, 2020

Just some commentary: this issue was super hard to track down!

I was looking at an issue where the much more populous and better-known Kansas City, MO was being ranked below Kansas City, KS.

The documents were identical, except the score showed that matches on the phrase field were being adjusted based on a field length of 4 for the Kansas City, MO record (from WOF), but only 2 for the Kansas City, KS record (from geonames).
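
To make the 4-versus-2 difference concrete, here is a rough sketch. It assumes the field length is simply the number of whitespace-separated tokens summed across every value in the array, which is a simplification of what the real analyzer does; `fieldLength` is a hypothetical name:

```javascript
// Hypothetical illustration: count whitespace-separated tokens across all
// values of a multi-valued field. The real analyzer may tokenize differently.
function fieldLength(values) {
  return values
    .flatMap((value) => value.split(/\s+/).filter(Boolean))
    .length;
}

// Duplicate value present, as in the WOF record.
console.log(fieldLength(['Kansas City', 'Kansas City'])); // → 4

// Single value, as in the geonames record.
console.log(fieldLength(['Kansas City'])); // → 2
```

A longer field generally earns a lower score for the same match, which would be consistent with the record carrying the duplicate ranking lower.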

I had to dig into the documents generated by both importers to learn that the difference was in duplicate values in the phrase field. pelias/schema#285, which would allow us to stop using a hidden phrase field, can't come soon enough!

Looking back, we've often been confused as to why Geonames records for a given admin area seem to be preferred, and this might be the reason! So hopefully results will be much better with this PR and/or pelias/model#132.

@orangejulius orangejulius requested a review from missinglink June 9, 2020 23:04
@orangejulius orangejulius force-pushed the do-not-create-duplicate-names branch from 231c736 to 876a4f7 Compare June 9, 2020 23:36
orangejulius added a commit that referenced this pull request Jun 10, 2020
This should help fix some scoring issues identified in
#511

While not a complete fix, it should mitigate the effects of
pelias/openstreetmap#507 somewhat.
While diagnosing an issue related to scoring, I discovered that WOF
records are sometimes created with duplicate name values. While the
pelias/model code can detect some of them (and more will be fixed with
pelias/model#132), we should fix this issue at
the source.

Here's an example of what a document might look like today, before this
PR:

```
{
  name: { default: [ 'Kansas City' ] },
  phrase: { default: [ 'Kansas City', 'Kansas City' ] },
  ...
}
```

This can be fixed by checking each potential alternate name against the
"primary" name value.
@orangejulius orangejulius force-pushed the do-not-create-duplicate-names branch from 876a4f7 to a6c80a2 Compare August 13, 2020 13:24
@orangejulius orangejulius merged commit 8a35d6d into master Aug 13, 2020
@orangejulius orangejulius deleted the do-not-create-duplicate-names branch August 13, 2020 13:34