[New Check] SimilarTagValueCheck (osmlab#500)
* SimilarTagCheck

* Create instructions

Create instructions.
Added Similar class.
Moved the chain of similar-value filters into its own function to clean up the flag function body.

* Duplicates fix suggestion

* Documentation

* Tests

* Spotless

* Live examples

* Single character variable names

* Fix suggestion doc

* Configurables description

* More tests

* More configurables

* Code smell

* SpotlessApply

* Function ordering

* HTML characters in javadoc

* Configurables in docs
brianjor authored Feb 10, 2021
1 parent 69f79b3 commit ffe3d21
Showing 6 changed files with 572 additions and 0 deletions.
29 changes: 29 additions & 0 deletions config/configuration.json
@@ -959,6 +959,35 @@
      "tags":"highway"
    }
  },
  "SimilarTagValueCheck": {
    "filter": {
      "commonSimilars": [
        ["american", "mexican"], ["cafe", "cake"], ["male", "female"], ["woman", "man"],
        ["women", "men"], ["male_toilet", "female_toilet"], ["radiology", "cardiology"],
        ["baseball", "basketball"], ["bowls", "boules"], ["padel", "paddel"], ["formal", "informal"],
        ["hotel", "hostel"], ["hump", "bump"], ["seed", "feed"]
      ],
      "tags": [
        "asset_ref", "collection_times", "except", "is_in", "junction:ref", "maxspeed:conditional",
        "old_name", "old_ref", "opening_hours", "ref", "restriction_hours", "route_ref", "supervised",
        "source_ref", "target", "telescope"
      ],
      "tagsWithSubCategories": [
        "addr", "alt_name", "destination", "name", "seamark", "turn"
      ]
    },
    "similarity.threshold": {
      "min": 0.0,
      "max": 1.0
    },
    "value.length.min": 4.0,
    "challenge": {
      "description": "Tasks identify duplicate/similar values in tags.",
      "blurb": "Duplicate/Similar tag values.",
      "instruction": "Determine if the duplicate/similar value is necessary or can be removed.",
      "difficulty": "EASY"
    }
  },
  "SingleSegmentMotorwayCheck": {
    "challenge": {
      "description": "Tasks that identify ways tagged with highway=motorway that are not connected to any ways tagged the same.",
1 change: 1 addition & 0 deletions docs/available_checks.md
@@ -86,6 +86,7 @@ This document is a list of tables with a description and link to documentation f
| [MixedCaseNameCheck](checks/mixedCaseNameCheck.md) | The purpose of this check is to identify names that contain invalid mixed cases so that they can be edited to be the standard format. |
| [RoadNameGapCheck](checks/RoadNameGapCheck.md) | The purpose of this check is to identify edge connected between two edges whose name tag is same. Flag the edge if the edge has a name tag different to name tag of edges connected to it or if there is no name tag itself.
| [RoadNameSpellingConsistencyCheck](checks/RoadNameSpellingConsistencyCheck.md) | The purpose of this check is to identify road segments that have a name Tag with a different spelling from that of other segments of the same road. This check is primarily meant to catch small errors in spelling, such as a missing letter, letter accent mixups, or capitalization errors. |
| [SimilarTagValueCheck](checks/similarTagValueCheck.md) | The purpose of this check is to identify tags whose values are either duplicates or similar enough to warrant manual review. |
| ShortNameCheck | The short name check will validate that any and all names contain at least 2 letters in the name. |
| [StreetNameIntegersOnlyCheck](checks/streetNameIntegersOnlyCheck.md) | The purpose of this check is to identify streets whose names contain integers only. |
| [TollValidationCheck](checks/tollValidationCheck) | The purpose of this check is to identify ways that need to have their toll tags investigated/added/removed.
52 changes: 52 additions & 0 deletions docs/checks/similarTagValueCheck.md
@@ -0,0 +1,52 @@
# SimilarTagValueCheck

#### Description

The purpose of this check is to identify tags whose values are either duplicates or similar
enough to warrant manual review.

Configurables:
* "value.length.min": Minimum length a value must have to be inspected (value.length >= min).
* "similarity.threshold.min": Minimum edit distance between two values for the pair to be added to the flag
(distance >= min). A value of 0 includes exact duplicates.
* "similarity.threshold.max": Maximum edit distance between two values for the pair to be added to the flag
(distance <= max).
* "filter.commonSimilars": Pairs of similar values that validly occur together on a tag and need no action.
* "filter.tags": Tags whose values commonly contain valid duplicates or similars.
* "filter.tagsWithSubCategories": Tags with one or more sub-categories that commonly contain valid
duplicate/similar values.

#### Live Examples
Similar tag values
1. The node [5142510561](https://www.openstreetmap.org/node/5142510561) has the similar values: "crayfish" and "Crayfish"

Duplicate tag values
1. The way [173171120](https://www.openstreetmap.org/way/173171120) has multiple duplicate values in the "source" tag

#### Code Review

This check evaluates all atlas objects that can hold OSM tags.
Duplicate values are removed through a fix suggestion (feature change), while similar values are flagged for user review.

#### Validating the object
The incoming object must:
* have at least one tag with multiple values (the tag value contains a ";")
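
The validation step can be pictured with a short Python sketch. It is not the check's actual Java implementation; the function name `has_multi_value_tag` and the plain `dict` of tags are assumptions made for illustration.

```python
from typing import Dict


def has_multi_value_tag(tags: Dict[str, str]) -> bool:
    """Return True if at least one tag value contains the ';' separator."""
    return any(";" in value for value in tags.values())


# Hypothetical tag sets: only the first one would be passed on to flagging.
print(has_multi_value_tag({"cuisine": "crayfish;Crayfish"}))
print(has_multi_value_tag({"name": "Main Street"}))
```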

#### Flagging the object
We filter out:
* tags that commonly contain valid duplicate/similar values
* values that are similar to other values that commonly occur on the same tag
* values that are shorter than the defined minimum length, contain a number, or contain non-Latin characters

As a last filtering step, we remove any tags that no longer contain multiple values.
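
A rough Python sketch of these filters follows. It is an illustration under assumptions, not the check's Java code: the constants are shortened stand-ins for the configured lists, sub-category tags are assumed to be matched by the key prefix before the first ":", and "Latin" is approximated as code points below U+0250.

```python
from typing import Dict, List

# Hypothetical, shortened stand-ins for the configured values.
IGNORED_TAGS = {"ref", "opening_hours", "old_name"}
TAGS_WITH_SUB_CATEGORIES = {"addr", "name", "turn"}
COMMON_SIMILARS = [("hotel", "hostel"), ("hump", "bump")]
MIN_VALUE_LENGTH = 4


def is_ignored_tag(key: str) -> bool:
    """True for ignored keys and for keys under an ignored prefix (assumed rule)."""
    prefix = key.split(":", 1)[0]
    return key in IGNORED_TAGS or prefix in TAGS_WITH_SUB_CATEGORIES


def drop_common_similars(values: List[str]) -> List[str]:
    """Drop both members of a common pair when they occur together."""
    present = set(values)
    excluded = set()
    for left, right in COMMON_SIMILARS:
        if left in present and right in present:
            excluded.update((left, right))
    return [value for value in values if value not in excluded]


def is_inspectable(value: str) -> bool:
    """True if the value is long enough, has no digit, and looks Latin."""
    if len(value) < MIN_VALUE_LENGTH:
        return False
    if any(char.isdigit() for char in value):
        return False
    # Assumption: Basic Latin through Latin Extended-B counts as "Latin".
    return all(ord(char) < 0x250 for char in value)


def candidate_values(tags: Dict[str, str]) -> Dict[str, List[str]]:
    """Apply the filters in order and keep tags that still have 2+ values."""
    result = {}
    for key, raw in tags.items():
        if is_ignored_tag(key):
            continue
        values = drop_common_similars(raw.split(";"))
        values = [value for value in values if is_inspectable(value)]
        if len(values) > 1:
            result[key] = values
    return result
```

With these stand-in settings, a tag such as `cuisine=crayfish;Crayfish;bbq` would survive with the values "crayfish" and "Crayfish", while `ref=A100;A101` and `tourism=hotel;hostel` would be dropped.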

We then take the remaining tags and compare every pair of values on each tag, computing their similarity
with the Levenshtein edit distance algorithm. We keep the value pairs whose distance falls within our
similarity threshold.
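
As a minimal sketch of this comparison, the Python below computes the Levenshtein distance with the standard dynamic-programming recurrence and keeps the pairs inside a threshold. The function names are hypothetical, and the defaults of 0 and 1 only mirror the "similarity.threshold" values in the configuration above; the check itself is written in Java and may compute this differently.

```python
from itertools import combinations
from typing import List, Tuple


def levenshtein(left: str, right: str) -> int:
    """Edit distance counting single-character inserts, deletes, substitutions."""
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]
        for j, right_char in enumerate(right, start=1):
            cost = 0 if left_char == right_char else 1
            current.append(min(
                previous[j] + 1,         # delete left_char
                current[j - 1] + 1,      # insert right_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def pairs_within_threshold(
    values: List[str], minimum: int = 0, maximum: int = 1
) -> List[Tuple[str, str, int]]:
    """Return (left, right, distance) for every pair inside the threshold."""
    kept = []
    for left, right in combinations(values, 2):
        distance = levenshtein(left, right)
        if minimum <= distance <= maximum:
            kept.append((left, right, distance))
    return kept
```

The comparison is case-sensitive, so "crayfish" and "Crayfish" are at distance 1 and count as similar rather than duplicate.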

From there we split the gathered pairs into duplicates and similars.
The duplicates are added to the instructions and used to create a fix suggestion.
The similars are only added to the instructions.
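
The split can be sketched in a few lines of Python, assuming each pair is a hypothetical `(left, right, distance)` tuple and that a distance of 0 marks a duplicate.

```python
from typing import List, Tuple

Pair = Tuple[str, str, int]


def split_pairs(pairs: List[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Separate exact duplicates (distance 0) from similars (distance > 0)."""
    duplicates = [pair for pair in pairs if pair[2] == 0]
    similars = [pair for pair in pairs if pair[2] > 0]
    return duplicates, similars
```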

#### Fix Suggestion
We create fix suggestions only for duplicate values, because for similar values it is difficult to determine which one
(if not both) should be kept. The fix for duplicates is to remove all but one occurrence of the duplicate value from the tag.
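
The effect of the duplicate fix on a tag value can be illustrated as follows. This sketch assumes that the first occurrence is the one kept, that the original order is preserved, and that values are compared exactly; the function name is hypothetical and the real fix is produced as a feature change in Java.

```python
def remove_duplicate_values(tag_value: str) -> str:
    """Keep the first occurrence of each ';'-separated value, in order."""
    seen = set()
    kept = []
    for value in tag_value.split(";"):
        if value not in seen:
            seen.add(value)
            kept.append(value)
    return ";".join(kept)


# Hypothetical "source" tag value with repeated entries.
print(remove_duplicate_values("survey;Bing;survey;Bing;GPS"))
```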