fix(deduplication): Deduplicate values in phrase field #132

orangejulius · 2020-06-09T22:52:19Z

#118 added support for removing duplicate values from the `name` field, but that logic was not applied to the `phrase` field.

Duplicate values do not affect whether or not a particular document will match for a given query, but they do affect the scoring.

In some cases, the scoring boost for having tokens match twice from duplicates will over-rank a particular result. In other cases, the scoring penalty for having longer fields will under-rank a particular result.
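To make the two effects concrete, here is a toy calculation. It assumes a BM25-style similarity with the commonly used defaults `k1 = 1.2` and `b = 0.75`; the `termScore` helper, the token counts, and the average field length are illustrative assumptions, not measurements from a real index.

```javascript
// Hypothetical helper: the term-frequency part of a BM25-style score.
// tf     = how many times the query token occurs in the field
// len    = number of tokens in the field
// avgLen = average field length across the index (assumed value)
function termScore(tf, len, avgLen, k1 = 1.2, b = 0.75) {
  const lengthNorm = 1 - b + b * (len / avgLen);
  return (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

const avgLen = 2; // assumed average field length, in tokens

// Baseline: [ 'Kansas City' ], query token "kansas" matches once.
const baseline = termScore(1, 2, avgLen);

// Duplicated value: [ 'Kansas City', 'Kansas City' ], "kansas" matches twice.
const duplicatedMatch = termScore(2, 4, avgLen);

// Field lengthened by a duplicate that does not contain the query token:
// the token still matches once, but the field is twice as long.
const duplicatedElsewhere = termScore(1, 4, avgLen);

console.log({ baseline, duplicatedMatch, duplicatedElsewhere });
```

Under these assumptions the duplicated match scores above the baseline (over-ranking), while the longer field with a single match scores below it (under-ranking).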

To make sure our scoring is as fair as possible (pending other issues such as pelias/openstreetmap#507), we should apply our current deduplication on both the name and phrase fields.
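As a rough illustration of the idea, the sketch below runs one deduplication helper over both fields. The `dedupeValues` and `dedupeFields` names, the plain-object document shape, and the exact-string comparison are assumptions for this example, not the actual pelias/model API.

```javascript
// Hypothetical helper: remove exact duplicate strings, keeping first-seen order.
function dedupeValues(values) {
  // Treat a single string as a one-element list.
  const list = Array.isArray(values) ? values : [values];
  return [...new Set(list)];
}

// Hypothetical helper: apply the same deduplication to `name` and `phrase`.
// Returns a new object and leaves the input document untouched.
function dedupeFields(doc) {
  const result = { ...doc };
  for (const field of ['name', 'phrase']) {
    if (!doc[field]) continue;
    result[field] = {};
    for (const [lang, values] of Object.entries(doc[field])) {
      result[field][lang] = dedupeValues(values);
    }
  }
  return result;
}

const doc = {
  name: { default: ['Kansas City'] },
  phrase: { default: ['Kansas City', 'Kansas City'] }
};

const cleaned = dedupeFields(doc);
console.log(cleaned.phrase.default);
```

Sharing one helper between the two fields keeps their contents consistent, so a duplicate cannot survive in `phrase` after being removed from `name`.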

While diagnosing an issue related to scoring, I discovered that WOF records are sometimes created with duplicate name values. While the pelias/model code can detect some of them (and more will be fixed with pelias/model#132), we should fix this issue at the source.

Here's an example of what a document might look like today, before this PR:

```
{
  name: { default: [ 'Kansas City' ] },
  phrase: { default: [ 'Kansas City', 'Kansas City' ] },
  ...
}
```

This can be fixed by checking each potential alternate name against the "primary" name value.
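The check against the "primary" name could be sketched roughly as follows. The `uniqueAlternates` function and the exact-string comparison are assumptions made for illustration; the real importer may normalize or compare names differently.

```javascript
// Hypothetical helper: keep only alternate names that differ from the
// primary name and from each other (exact string comparison is assumed).
function uniqueAlternates(primary, alternates) {
  const seen = new Set([primary]);
  const unique = [];
  for (const candidate of alternates) {
    if (seen.has(candidate)) continue; // skip the primary name and repeats
    seen.add(candidate);
    unique.push(candidate);
  }
  return unique;
}

const alternates = uniqueAlternates('Kansas City', ['Kansas City', 'KC', 'KC']);
console.log(alternates);
```

Seeding the set with the primary name means a single pass handles both cases: alternates equal to the primary name and alternates that repeat each other.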

orangejulius · 2020-06-09T23:05:27Z

While we definitely want to merge this PR, also be sure to read the discussion over at pelias/whosonfirst#511 before doing so. I'd like to do some testing before rolling it out everywhere as well.

missinglink

👍

orangejulius · 2020-06-10T14:35:49Z

I realized this could help slightly mitigate the problems from pelias/openstreetmap#507 with OSM venues, so I'm just going to merge it and roll it out everywhere. Hopefully we see some improvements!

orangejulius mentioned this pull request Jun 9, 2020

fix(doc): Do not allow duplicate names to be created pelias/whosonfirst#511

Merged

orangejulius requested a review from missinglink June 9, 2020 23:05

missinglink approved these changes Jun 10, 2020

orangejulius merged commit 9c8379d into master Jun 10, 2020

orangejulius deleted the deduplicate-phrase branch June 10, 2020 14:35

orangejulius mentioned this pull request Apr 19, 2022

remove phrase field in Document model #148

Merged
