Mixed Case Name Check (osmlab#66)
* create MixedCaseNameCheck

* added MixedCaseNameCheck config

* Added unit test

* bug fixes

* Inverted function of iso configurable and added comments

* change default config value

* improved instruction

* Added comments and UID

* created mixedCaseNameCheck.md

* update test rule

* Added support for leading numbers, added test for articles, changed to split on config of chars, changed list of prepositions

* added character to config, updated docs

* added/fixed config values

* added lower case, apostrophe, end of word combination handling.

* corrected config error

* added 'in' to prepositions

* Changed unit test and rule names

* removed comment

* updated docs

* updated docs

* Change default config value; check for all lower case name

* update config

* update config

* added mixed case units configurable

* fix typo

* change config names

* code clean up

* remove space

* change code review structure

* refactor long regex to methods

* add comments and code cleanup

* move comment

* update docs

* condense conditionals; clean up docs

* add to comment and docs

* fix comments

* move regex compile to global

* update docs
Bentleysb authored and jklamer committed Sep 14, 2018
1 parent 3cdd6d5 commit 6ba271d
Showing 5 changed files with 1,018 additions and 0 deletions.
22 changes: 22 additions & 0 deletions config/configuration.json
@@ -171,6 +171,28 @@
"difficulty": "MEDIUM"
}
},
"MixedCaseNameCheck": {
"check_name.countries":["AIA", "ATG", "AUS", "BHS", "BRB", "BLZ", "BMU", "BWA", "VGB",
"CMR", "CAN", "CYM", "DMA", "FJI", "GMB", "GHA", "GIB", "GRD", "GUY", "IRL", "JAM",
"KEN", "LSO", "MWI", "MLT", "MUS", "MSR", "NAM", "NZL", "NGA", "PNG", "SYC", "SLE",
"SGP", "SLB", "ZAF", "SWZ", "TZA", "TON", "TTO", "TCA", "UGA", "GBR", "USA", "VUT",
"ZMB", "ZWE"],
"name":{
"language.keys":["name:en"],
"affixes":["Mc", "Mac", "Mck","Mhic", "Mic"],
"articles": ["a", "an", "the"],
"prepositions": ["and", "from", "to", "of", "by", "upon", "on", "off", "at", "as",
"into", "like", "near", "onto", "per", "till", "up", "via", "with", "for", "in"],
"units":["kv"]
},
"regex.split":" -/&@–",
"challenge": {
"description": "Tasks containing objects with mixed case names.",
"blurb": "Mixed Case Name",
"instruction": "Correct the listed names tags so they conform to capitalization standards",
"difficulty": "MEDIUM"
}
},
"OneMemberRelationCheck": {
"challenge": {
"description": "Tasks containing relations with only one member.",
79 changes: 79 additions & 0 deletions docs/checks/mixedCaseNameCheck.md
@@ -0,0 +1,79 @@
# Mixed Case Name Check

This check flags objects with name tags that improperly use mixed cases.

Proper case use is defined by set standards and configurable exceptions.

The standards are as follows:

* Words must start with a capital unless:
    * The first letter is preceded by a number (ex. 20th)
    * All the words in the name are lower case (ex. ferry dock)
* All other letters must be lower case unless:
    * They follow an apostrophe (ex. O'Flin) and are not the last letter of the word (ex. Smith's, not Smith'S)
    * The entire word is upper case, except the last letter if it follows or is followed by an apostrophe (ex. MAX'S or MAX's)

The standards are broken by the following configurable exceptions (with default values):

* Articles that are capitalised only if they are the first word:
    * a, an, the
* Prepositions that do not need to start with a capital:
    * and, from, to, of, by, upon, on, off, at, as, into, like, near, onto, per, till, up, via, with, for, in
* Name affixes that may be followed by a capital:
    * Mc, Mac, Mck, Mhic, Mic
* Mixed case units of measurement that are valid after a number:
    * kV
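
The article and preposition exceptions can be read as a single predicate. The following is a minimal sketch, not the check's actual code: the class `LowerCaseExceptionSketch` and the method `mayStartLowerCase` are hypothetical names, and the lists are the default values above. Matching is case-sensitive.

```java
import java.util.Arrays;
import java.util.List;

public class LowerCaseExceptionSketch
{
    // Default values copied from the lists above
    static final List<String> ARTICLES = Arrays.asList("a", "an", "the");
    static final List<String> PREPOSITIONS = Arrays.asList("and", "from", "to", "of", "by",
            "upon", "on", "off", "at", "as", "into", "like", "near", "onto", "per", "till",
            "up", "via", "with", "for", "in");

    // Hypothetical helper: true when a word is allowed to start with a lower case letter.
    // Prepositions are always exempt; articles are exempt unless they are the first word.
    static boolean mayStartLowerCase(final String word, final boolean firstWord)
    {
        return PREPOSITIONS.contains(word) || (!firstWord && ARTICLES.contains(word));
    }

    public static void main(final String[] args)
    {
        System.out.println(mayStartLowerCase("the", true)); // false
        System.out.println(mayStartLowerCase("the", false)); // true
        System.out.println(mayStartLowerCase("of", true)); // true
    }
}
```

A word for which this returns `false` still has to pass the first-letter rules in the standards list.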

The above configurables allow this check to be adapted to test different languages.
The check should only test names in languages it is configured to handle.
OSM uses the `name` tag for the name in a location's primary language, and `name:[ISOcode]` for other languages.
This check uses two configurables to control which languages are checked.

The first is a list of ISO codes for countries that should have their `name` tag checked.
The official language(s) of the countries in this list should be (a) language(s) that the check is configured to handle.
It has default values of:

* AIA, ATG, AUS, BHS, BRB, BLZ, BMU, BWA, VGB, CMR, CAN, CYM, DMA, FJI, GMB, GHA, GIB, GRD, GUY, IRL, JAM, KEN, LSO, MWI, MLT, MUS, MSR, NAM, NZL, NGA, PNG, SYC, SLE, SGP, SLB, ZAF, SWZ, TZA, TON, TTO, TCA, UGA, GBR, USA, VUT, ZMB, ZWE

The second is a list of `name:[ISOcode]` tags to check. These should be for the languages the check is configured to handle.
Default values are:

* name:en

A final configurable is a list of characters that names are split by, to form words. Its default values are:

* SPACE, \-, /, &, @, –
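
To illustrate how the split characters turn a name into words, here is a small sketch that wraps the default characters in a quoted regex character class, the same approach the check uses in `isMixedCase`. The class name `NameSplitSketch` and the method `words` are hypothetical and exist only for this example.

```java
import java.util.Arrays;
import java.util.List;

public class NameSplitSketch
{
    // Default split characters: space, hyphen, slash, ampersand, at sign, en dash
    static final String SPLIT_CHARACTERS = " -/&@–";

    // Hypothetical helper: split a name on any single split character.
    // \Q...\E quotes the characters so that none of them is read as regex syntax.
    static List<String> words(final String name)
    {
        return Arrays.asList(name.split("[\\Q" + SPLIT_CHARACTERS + "\\E]"));
    }

    public static void main(final String[] args)
    {
        System.out.println(words("Stratford-upon-Avon")); // [Stratford, upon, Avon]
        System.out.println(words("Fish & Chips")); // [Fish, , , Chips]
    }
}
```

Adjacent split characters produce empty words, as in the second call. These contain no letters, so none of the case rules apply to them.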

#### Live Examples

1. Node [id:4780932622](https://www.openstreetmap.org/node/4780932622) has the name `NZ Convenience store`. It is flagged because the `s` in `store` should be capitalized.

#### Code Review

In [Atlas](https://github.com/osmlab/atlas), OSM elements are represented as Edges, Points, Lines, Nodes, Areas & Relations; in our case, we are looking at
[Edges](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Edge.java),
[Lines](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Line.java),
[Nodes](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Node.java),
[Points](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Point.java), and
[Areas](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Area.java).

Our first goal is to validate the incoming Atlas object. Valid features for this check will satisfy the following conditions (see `validCheckForObject` method):

* It is an Edge, Line, Node, Point, or Area
* It is in a country where the `name` tag should be checked and it has a `name` tag, or it has one of the `name:[ISOcode]` tags.
* It has not already been flagged

Next, each name tag that is being checked on the object is tested for proper use of case.
If the object's ISO code is in `checkNameCountries`, its `name` tag is checked; the tags in `languageNameTags` are checked regardless of country.
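
The tag selection can be sketched without the Atlas classes. In the example below, the class `TagSelectionSketch`, the method `keysToCheck`, and the shortened country list are assumptions made for illustration; the real check reads the ISO code and tags from the Atlas object.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class TagSelectionSketch
{
    // Shortened, illustrative versions of the configurable lists
    static final List<String> CHECK_NAME_COUNTRIES = Arrays.asList("GBR", "USA");
    static final List<String> LANGUAGE_NAME_TAGS = Arrays.asList("name:en");

    // Hypothetical helper: returns the tag keys whose values would be tested for case.
    static List<String> keysToCheck(final String isoCode, final Map<String, String> tags)
    {
        final List<String> keys = new ArrayList<>();
        // The plain name tag is only tested in configured countries
        if (isoCode != null && CHECK_NAME_COUNTRIES.contains(isoCode.toUpperCase())
                && tags.containsKey("name"))
        {
            keys.add("name");
        }
        // Language specific name tags are tested in every country
        for (final String key : LANGUAGE_NAME_TAGS)
        {
            if (tags.containsKey(key))
            {
                keys.add(key);
            }
        }
        return keys;
    }
}
```

For an object in `USA` with both `name` and `name:en`, this returns both keys; for the same tags in a country outside the list, only `name:en` is returned.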

The test for proper use of case uses multiple regular expressions to check both the entire name and each word.
The most complex expression looks for capital letters in the wrong place: a capital after the first letter that does not follow an apostrophe or a name affix, or a capital that follows an apostrophe as the last letter of the word.

```java
return Pattern.compile(String.format(
"(\\p{L}.*(?<!'|%1$s)(\\p{Lu}))|(\\p{L}.*(?<=')\\p{Lu}(?!.))", this.nameAffixes))
.matcher(word).find();
```
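
To see what this expression accepts and rejects, the sketch below compiles it with the default affixes and applies it to a few single words. The class name `NonFirstCapitalSketch` and the method `hasImproperCapital` are hypothetical; only the pattern string is taken from the check.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class NonFirstCapitalSketch
{
    // Default name affixes, joined into a regex alternation as the check does
    static final List<String> AFFIXES = Arrays.asList("Mc", "Mac", "Mck", "Mhic", "Mic");
    static final Pattern NON_FIRST_CAPITAL = Pattern.compile(String.format(
            "(\\p{L}.*(?<!'|%1$s)(\\p{Lu}))|(\\p{L}.*(?<=')\\p{Lu}(?!.))",
            String.join("|", AFFIXES)));

    // Hypothetical helper: true when a capital letter after the first letter is misplaced
    static boolean hasImproperCapital(final String word)
    {
        return NON_FIRST_CAPITAL.matcher(word).find();
    }

    public static void main(final String[] args)
    {
        System.out.println(hasImproperCapital("StoRe")); // true: R follows a plain letter
        System.out.println(hasImproperCapital("McDonald")); // false: D follows the affix Mc
        System.out.println(hasImproperCapital("O'Neil")); // false: N follows an apostrophe
        System.out.println(hasImproperCapital("Smith'S")); // true: capital after apostrophe ends the word
    }
}
```

The pattern on its own also matches fully upper case words such as `USA`, which is why the check only applies it to words that contain at least one lower case letter.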

To learn more about the code, please look at the comments in the source code for the check.
[MixedCaseNameCheck](../../src/main/java/org/openstreetmap/atlas/checks/validation/tag/MixedCaseNameCheck.java)
@@ -0,0 +1,277 @@
package org.openstreetmap.atlas.checks.validation.tag;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.openstreetmap.atlas.checks.base.BaseCheck;
import org.openstreetmap.atlas.checks.flag.CheckFlag;
import org.openstreetmap.atlas.geography.atlas.items.AtlasObject;
import org.openstreetmap.atlas.geography.atlas.items.LocationItem;
import org.openstreetmap.atlas.geography.atlas.items.Relation;
import org.openstreetmap.atlas.tags.ISOCountryTag;
import org.openstreetmap.atlas.tags.annotations.validation.Validators;
import org.openstreetmap.atlas.tags.names.NameTag;
import org.openstreetmap.atlas.utilities.configuration.Configuration;

/**
* This check flags objects with name tags that improperly use mixed cases.
*
* @author bbreithaupt
*/
public class MixedCaseNameCheck extends BaseCheck
{

private static final long serialVersionUID = 7109483897229499466L;

private static final List<String> FALLBACK_INSTRUCTIONS = Arrays.asList(
"{0} {1,number,#} has (an) invalid mixed-case value(s) for the following tag(s): {2}.");
private static final List<String> CHECK_NAME_COUNTRIES_DEFAULT = Arrays.asList("AIA", "ATG",
"AUS", "BHS", "BRB", "BLZ", "BMU", "BWA", "VGB", "CMR", "CAN", "CYM", "DMA", "FJI",
"GMB", "GHA", "GIB", "GRD", "GUY", "IRL", "JAM", "KEN", "LSO", "MWI", "MLT", "MUS",
"MSR", "NAM", "NZL", "NGA", "PNG", "SYC", "SLE", "SGP", "SLB", "ZAF", "SWZ", "TZA",
"TON", "TTO", "TCA", "UGA", "GBR", "USA", "VUT", "ZMB", "ZWE");
private static final List<String> LANGUAGE_NAME_TAGS_DEFAULT = Arrays.asList("name:en");
private static final List<String> LOWER_CASE_PREPOSITIONS_DEFAULT = Arrays.asList("and", "from",
"to", "of", "by", "upon", "on", "off", "at", "as", "into", "like", "near", "onto",
"per", "till", "up", "via", "with", "for", "in");
private static final List<String> LOWER_CASE_ARTICLES_DEFAULT = Arrays.asList("a", "an", "the");
private static final String SPLIT_CHARACTERS_DEFAULT = " -/&@–";
private static final List<String> NAME_AFFIXES_DEFAULT = Arrays.asList("Mc", "Mac", "Mck",
"Mhic", "Mic");
private static final List<String> MIXED_CASE_UNITS_DEFAULT = Arrays.asList("kV");

// A list of countries where the name tag should be checked
private final List<String> checkNameCountries;
// A list of language specific name tags to check
private final List<String> languageNameTags;
// A list of prepositions that are normally lower case in names
private final List<String> lowerCasePrepositions;
// A list of articles that are normally lower case in names, unless at the start
private final List<String> lowerCaseArticles;
// A string of characters that can precede a capital letter
private final String splitCharacters;
// Name affixes that can precede a capital letter, joined into a regex alternation
private final String nameAffixes;
// Known intentionally mixed case words, joined into a regex alternation
private final String mixedCaseUnits;

// Regex Patterns
private final Pattern upperCasePattern;
private final Pattern anyLetterPattern;
private final Pattern lowerCasePattern;
private final Pattern mixedCaseUnitsPattern;
private final Pattern mixedCaseApostrophePattern;
private final Pattern nonFirstCapitalPattern;

/**
* The default constructor that must be supplied. The Atlas Checks framework will generate the
* checks with this constructor, supplying a configuration that can be used to adjust any
* parameters that the check uses during operation.
*
* @param configuration
* the JSON configuration for this check
*/
public MixedCaseNameCheck(final Configuration configuration)
{
super(configuration);
this.checkNameCountries = (List<String>) configurationValue(configuration,
"check_name.countries", CHECK_NAME_COUNTRIES_DEFAULT);
this.languageNameTags = (List<String>) configurationValue(configuration,
"name.language.keys", LANGUAGE_NAME_TAGS_DEFAULT);
this.lowerCasePrepositions = (List<String>) configurationValue(configuration,
"name.prepositions", LOWER_CASE_PREPOSITIONS_DEFAULT);
this.lowerCaseArticles = (List<String>) configurationValue(configuration, "name.articles",
LOWER_CASE_ARTICLES_DEFAULT);
this.splitCharacters = (String) configurationValue(configuration, "regex.split",
SPLIT_CHARACTERS_DEFAULT);
this.nameAffixes = (String) configurationValue(configuration, "name.affixes",
NAME_AFFIXES_DEFAULT, value -> String.join("|", (List<String>) value));
this.mixedCaseUnits = (String) configurationValue(configuration, "name.units",
MIXED_CASE_UNITS_DEFAULT, value -> String.join("|", (List<String>) value));

// Compile regex
this.upperCasePattern = Pattern.compile("\\p{Lu}");
this.anyLetterPattern = Pattern.compile("\\p{L}");
this.lowerCasePattern = Pattern.compile("\\p{Ll}");
this.mixedCaseUnitsPattern = Pattern.compile(
String.format("[^\\p{L}]*\\p{Digit}[%1$s][^\\p{L}]*", this.mixedCaseUnits));
this.mixedCaseApostrophePattern = Pattern
.compile("([^\\p{Ll}]+'\\p{Ll})|([^\\p{Ll}]+\\p{Ll}')");
this.nonFirstCapitalPattern = Pattern.compile(String.format(
"(\\p{L}.*(?<!'|%1$s)(\\p{Lu}))|(\\p{L}.*(?<=')\\p{Lu}(?!.))", this.nameAffixes));
}

/**
* This function will validate if the supplied atlas object is valid for the check.
*
* @param object
* the atlas object supplied by the Atlas-Checks framework for evaluation
* @return {@code true} if this object should be checked
*/
@Override
public boolean validCheckForObject(final AtlasObject object)
{
// Valid objects are items that were OSM nodes or ways (Equivalent to Atlas nodes, points,
// edges, lines and areas)
return !(object instanceof Relation) && !this.isFlagged(object.getOsmIdentifier())
&& ((object.getTags().containsKey(ISOCountryTag.KEY)
// Must have an ISO code that is in checkNameCountries...
&& this.checkNameCountries
.contains(object.tag(ISOCountryTag.KEY).toUpperCase())
// And have a name tag
&& Validators.hasValuesFor(object, NameTag.class))
// Or it must have a specific language name tag from languageNameTags
|| this.languageNameTags.stream()
.anyMatch(key -> object.getOsmTags().containsKey(key)));
}

/**
* This is the actual function that will check to see whether the object needs to be flagged.
*
* @param object
* the atlas object supplied by the Atlas-Checks framework for evaluation
* @return an optional {@link CheckFlag} object that contains the flag if any checked name tag
* has an improperly mixed case value
*/
@Override
protected Optional<CheckFlag> flag(final AtlasObject object)
{
final List<String> mixedCaseNameTags = new ArrayList<>();
final Map<String, String> osmTags = object.getOsmTags();

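// Check ISO against list of countries for testing name tag. An object that is only valid
// because of a language name tag may have no ISO code, so guard against a missing tag.
if (object.getTags().containsKey(ISOCountryTag.KEY)
&& this.checkNameCountries.contains(object.tag(ISOCountryTag.KEY).toUpperCase())
&& Validators.hasValuesFor(object, NameTag.class)
&& isMixedCase(osmTags.get(NameTag.KEY)))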
{
mixedCaseNameTags.add(NameTag.KEY);
}
// Check all language name tags
for (final String key : this.languageNameTags)
{
if (osmTags.containsKey(key) && isMixedCase(osmTags.get(key)))
{
mixedCaseNameTags.add(key);
}
}

// If mixed case is detected, flag
if (!mixedCaseNameTags.isEmpty())
{
this.markAsFlagged(object.getOsmIdentifier());
// Instruction includes type of OSM object and list of flagged tags
return Optional.of(this.createFlag(object,
this.getLocalizedInstruction(0, object instanceof LocationItem ? "Node" : "Way",
object.getOsmIdentifier(), String.join(", ", mixedCaseNameTags))));
}
return Optional.empty();
}

@Override
protected List<String> getFallbackInstructions()
{
return FALLBACK_INSTRUCTIONS;
}

/**
* Tests each word in a string for proper use of case in a name.
*
* @param value
* String to check
* @return true when there is improper case in any of the words
*/
private boolean isMixedCase(final String value)
{
// Only test names that contain an upper case letter; all lower case names are allowed
if (this.upperCasePattern.matcher(value).find())
{
// Split into words based on configurable characters
final String[] wordArray = value.split("[\\Q" + this.splitCharacters + "\\E]");
boolean firstWord = true;
// Check each word
for (final String word : wordArray)
{
// Check if the word is intentionally mixed case
if (!isMixedCaseUnit(word))
{
final Matcher firstLetterMatcher = this.anyLetterPattern.matcher(word);
// If the word is not in the list of prepositions, and the
// word is not both in the article list and not the first word: check that
// the first letter is a capital
if ((!this.lowerCasePrepositions.contains(word)
&& !(!firstWord && this.lowerCaseArticles.contains(word))
// If the first letter is lower case: return true if it is not preceded
// by a number
&& firstLetterMatcher.find()
&& Character.isLowerCase(firstLetterMatcher.group().charAt(0))
&& !(firstLetterMatcher.start() != 0 && Character
.isDigit(word.charAt(firstLetterMatcher.start() - 1))))
// If the word is not all upper case: check if all the letters not
// following apostrophes, unless at the end of the word, are lower case
|| (this.lowerCasePattern.matcher(word).find()
&& !isMixedCaseApostrophe(word)
&& isProperNonFirstCapital(word)))
{
return true;
}
}
firstWord = false;
}
}
return false;
}

/**
* Tests a {@link String} against a configurable list of unit abbreviations.
*
* @param word
* {@link String} to test
* @return true if {@code word} contains a mixed case unit abbreviation preceded by a number,
* and it does not contain any other alphabetic characters.
*/
private boolean isMixedCaseUnit(final String word)
{
// This returns true if one of the items in this.mixedCaseUnits is preceded by a number -
// `\p{Digit}`
// There may be 0 or more non-alphabetic characters preceding or following the
// digit+mixedCaseUnits - `[^\p{L}]*`
return this.mixedCaseUnitsPattern.matcher(word).find();
}

/**
* Tests a {@link String} for being all upper case, except the last letter which is adjacent to
* an apostrophe (ex. MAX's).
*
* @param word
* {@link String} to test
* @return true if a lower case letter is found preceding or following an apostrophe that is the
* last or second to last character in the string, and all other letters are upper case
*/
private boolean isMixedCaseApostrophe(final String word)
{
// This returns true if the last 2 characters are an apostrophe and a lower case letter, and
// all other letters are upper case.
return this.mixedCaseApostrophePattern.matcher(word).matches();
}

/**
* Tests a {@link String} for incorrect capitalization, excluding the first letter.
*
* @param word
* {@link String} to test
* @return true if a capital letter is incorrectly used
*/
private boolean isProperNonFirstCapital(final String word)
{
// This checks each capital letter for incorrect usage
// It does not check the first letter - `(\p{L}.*`
// To be incorrect usage a capital letter:
// Must not be preceded by an apostrophe or name affix - `(?<!'|%1$s)(\p{Lu})`
// Must not be the last character if it follows an apostrophe - `(?<=')\p{Lu}(?!.)`
return this.nonFirstCapitalPattern.matcher(word).find();
}
}