fix(deduplication): Deduplicate values in phrase field #132
#118 added support for removing duplicate values from the `name` field. That logic was not applied to the `phrase` field, however.

Duplicate values do not affect whether a particular document matches a given query, but they _do_ affect the scoring.

In some cases, the scoring boost from tokens matching twice because of duplicates will over-rank a particular result. In other cases, the scoring penalty for longer fields will under-rank a particular result.

To make sure our scoring is as fair as possible (pending other issues such as pelias/openstreetmap#507), we should apply our current deduplication to both the `name` and `phrase` fields.
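As a rough illustration of applying the same deduplication to both fields, here is a minimal sketch. The helper names `dedupe` and `dedupeFields` are hypothetical and are not taken from pelias/model; the sketch assumes each field is an object keyed by language whose values may be arrays of strings, and it only removes exact string duplicates.

```javascript
// Hypothetical helper: drop exact duplicate strings, keeping the first occurrence.
function dedupe(values) {
  return Array.from(new Set(values));
}

// Hypothetical helper: apply dedupe to every language key of the given fields.
// Assumes doc[field] looks like { default: [...], en: [...] }.
function dedupeFields(doc, fields = ['name', 'phrase']) {
  for (const field of fields) {
    const byLang = doc[field] || {};
    for (const lang of Object.keys(byLang)) {
      // Only arrays can hold duplicates; plain strings are left untouched.
      if (Array.isArray(byLang[lang])) {
        byLang[lang] = dedupe(byLang[lang]);
      }
    }
  }
  return doc;
}

const exampleDoc = dedupeFields({
  name: { default: ['Kansas City'] },
  phrase: { default: ['Kansas City', 'Kansas City'] }
});

console.log(exampleDoc.phrase.default); // [ 'Kansas City' ]
```

Running the same helper over both fields keeps `name` and `phrase` consistent, which is the point of this change; the real implementation may normalize values before comparing them.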
While diagnosing an issue related to scoring, I discovered that WOF records are sometimes created with duplicate name values. While the pelias/model code can detect some of them (and more will be fixed with pelias/model#132), we should fix this issue at the source.

Here's an example of what a document might look like today, before this PR:

```
{
  name: { default: [ 'Kansas City' ] },
  phrase: { default: [ 'Kansas City', 'Kansas City' ] },
  ...
}
```

This can be fixed by checking each potential alternate name against the "primary" name value.
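The check described above could look something like the following sketch. `filterAlternateNames` is a hypothetical name, not the importer's actual function, and the comparison is assumed to be exact string equality (the real code may compare differently, for example after trimming or case folding).

```javascript
// Hypothetical helper: keep only alternate names that differ from the primary
// name and from each other. Comparison is exact string equality (an assumption).
function filterAlternateNames(primary, alternates) {
  const seen = new Set([primary]);
  const result = [];
  for (const alt of alternates) {
    if (!seen.has(alt)) {
      seen.add(alt);
      result.push(alt);
    }
  }
  return result;
}

const alternates = filterAlternateNames('Kansas City', ['Kansas City', 'KC', 'KC']);
console.log(alternates); // [ 'KC' ]
```

Filtering at import time means the duplicate never reaches the document, so the downstream deduplication in pelias/model only has to act as a safety net.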
While we definitely want to merge this PR, also be sure to read the discussion over at pelias/whosonfirst#511 before doing so. I'd like to do some testing before rolling it out everywhere as well.
👍
I realized this could help slightly mitigate the problems from pelias/openstreetmap#507 with OSM venues, so I'm just going to merge it and roll it out everywhere. Hopefully we see some improvements!