fix(deduplication): Deduplicate values in phrase field #132

Merged: orangejulius merged 1 commit into master from deduplicate-phrase on Jun 10, 2020

orangejulius

#118 added support for removing duplicate values from the `name` field, but the same logic was never applied to the `phrase` field.

Duplicate values do not affect whether or not a particular document will match for a given query, but they do affect the scoring.

In some cases, the scoring boost for having tokens match twice from duplicates will over-rank a particular result. In other cases, the scoring penalty for having longer fields will under-rank a particular result.

To make sure our scoring is as fair as possible (pending other issues such as pelias/openstreetmap#507), we should apply our current deduplication to both the `name` and `phrase` fields.
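
To make the idea concrete, here is a minimal sketch of applying one deduplication step to both fields. It is not the actual pelias/model code: the helper names `dedupeFieldValues` and `deduplicateDocument` are hypothetical, the document shape is assumed from the example elsewhere in this thread, and it only compares exact strings, whereas the real logic may normalize values first.

```javascript
// Hypothetical helper (not the pelias/model API): remove repeated values
// from every language key of a field, keeping the first occurrence.
function dedupeFieldValues(field) {
  const result = {};
  for (const [lang, value] of Object.entries(field)) {
    // A Set preserves insertion order, so the first copy of each value wins.
    result[lang] = Array.isArray(value) ? [...new Set(value)] : value;
  }
  return result;
}

// Apply the same deduplication to both `name` and `phrase`,
// returning a shallow copy so the input document is not mutated.
function deduplicateDocument(doc) {
  const out = { ...doc };
  for (const key of ['name', 'phrase']) {
    if (doc[key]) {
      out[key] = dedupeFieldValues(doc[key]);
    }
  }
  return out;
}

const cleaned = deduplicateDocument({
  name: { default: ['Kansas City'] },
  phrase: { default: ['Kansas City', 'Kansas City'] }
});
console.log(JSON.stringify(cleaned));
// prints {"name":{"default":["Kansas City"]},"phrase":{"default":["Kansas City"]}}
```

Running the same function over both fields keeps them consistent, so neither the duplicate-token boost nor the field-length penalty is applied to one field but not the other.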

orangejulius added a commit to pelias/whosonfirst that referenced this pull request Jun 9, 2020

While diagnosing an issue related to scoring, I discovered that WOF
records are sometimes created with duplicate name values. While the
pelias/model code can detect some of them (and more will be fixed with
pelias/model#132), we should fix this issue at
the source.

Here's an example of what a document might look like today, before this
PR:

```
{
  name: { default: [ 'Kansas City' ] },
  phrase: { default: [ 'Kansas City', 'Kansas City' ] },
  ...
}
```

This can be fixed by checking each potential alternate name against the
"primary" name value.
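
As a rough illustration of that check (a sketch only, not the pelias/whosonfirst importer code; `buildNames` and its parameters are hypothetical names), the alternate names can be filtered against the primary name before the document is built:

```javascript
// Hypothetical helper: keep the primary name first, then only those
// alternate names that differ from it (exact string comparison).
function buildNames(primaryName, alternateNames) {
  const aliases = alternateNames.filter((candidate) => candidate !== primaryName);
  return [primaryName, ...aliases];
}

console.log(buildNames('Kansas City', ['Kansas City', 'KC']));
// prints [ 'Kansas City', 'KC' ]
```

This version only compares each alternate against the primary value, as described above. It does not remove duplicates among the alternates themselves.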
@orangejulius orangejulius requested a review from missinglink June 9, 2020 23:05
@orangejulius

While we definitely want to merge this PR, also be sure to read the discussion over at pelias/whosonfirst#511 before doing so. I'd like to do some testing before rolling it out everywhere as well.

@missinglink missinglink left a comment

👍

@orangejulius

I realized this could help slightly mitigate the problems from pelias/openstreetmap#507 with OSM venues, so I'm just going to merge it and roll it out everywhere. Hopefully we see some improvements!

@orangejulius orangejulius merged commit 9c8379d into master Jun 10, 2020
@orangejulius orangejulius deleted the deduplicate-phrase branch June 10, 2020 14:35
orangejulius added a commit to pelias/whosonfirst that referenced this pull request Aug 13, 2020