Mixed Case Name Check (osmlab#66)

* create MixedCaseNameCheck
* added MixedCaseNameCheck config
* Added unit test
* bug fixes
* Inverted function of iso configurable and added comments
* change default config value
* improved instruction
* Added comments and UID
* created mixedCaseNameCheck.md
* update test rule
* Added support for leading numbers, added test for articles, changed to split on config of chars, changed list of prepositions
* added character to config, updated docs
* added/fixed config values
* added lower case, apostrophe, end of word combination handling.
* corrected config error
* added 'in' to prepositions
* Changed unit test and rule names
* removed comment
* updated docs
* updated docs
* Change default config value; check for all lower case name
* update config
* update config
* added mixed case units configurable
* fix typo
* change config names
* code clean up
* remove space
* change code review structure
* refactor long regex to methods
* add comments and code cleanup
* move comment
* update docs
* condense conditionals; clean up docs
* add to comment and docs
* fix comments
* move regex compile to global
* update docs
sayas01 · Sep 14, 2018 · 6ba271d
1 parent 3cdd6d5
commit 6ba271d

Showing 5 changed files with 1,018 additions and 0 deletions.
diff --git a/config/configuration.json b/config/configuration.json
@@ -171,6 +171,28 @@
       "difficulty": "MEDIUM"
     }
   },
+  "MixedCaseNameCheck": {
+    "check_name.countries":["AIA", "ATG", "AUS", "BHS", "BRB", "BLZ", "BMU", "BWA", "VGB",
+      "CMR", "CAN", "CYM", "DMA", "FJI", "GMB", "GHA", "GIB", "GRD", "GUY", "IRL", "JAM",
+      "KEN", "LSO", "MWI", "MLT", "MUS", "MSR", "NAM", "NZL", "NGA", "PNG", "SYC", "SLE",
+      "SGP", "SLB", "ZAF", "SWZ", "TZA", "TON", "TTO", "TCA", "UGA", "GBR", "USA", "VUT",
+      "ZMB", "ZWE"],
+    "name":{
+      "language.keys":["name:en"],
+      "affixes":["Mc", "Mac", "Mck","Mhic", "Mic"],
+      "articles": ["a", "an", "the"],
+      "prepositions": ["and", "from", "to", "of", "by", "upon", "on", "off", "at", "as",
+        "into", "like", "near", "onto", "per", "till", "up", "via", "with", "for", "in"],
+      "units":["kV"]
+    },
+    "regex.split":" -/&@–",
+    "challenge": {
+      "description": "Tasks containing objects with mixed case names.",
+      "blurb": "Mixed Case Name",
+      "instruction": "Correct the listed names tags so they conform to capitalization standards",
+      "difficulty": "MEDIUM"
+    }
+  },
   "OneMemberRelationCheck": {
     "challenge": {
       "description": "Tasks containing relations with only one member.",

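The two language-selection settings above (`check_name.countries` and `name.language.keys`) decide which tags the check examines for a given object. The following is a minimal standalone sketch of that selection, based on how the check's Java source later in this commit reads them. It uses plain maps rather than Atlas objects, shortened country lists, and hypothetical names (`TagSelectionSketch`, `tagsToCheck`), so treat it as an illustration rather than the check's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TagSelectionSketch
{
    // Shortened, illustrative stand-ins for the configured lists
    private static final List<String> CHECK_NAME_COUNTRIES = Arrays.asList("GBR", "USA", "NZL");
    private static final List<String> LANGUAGE_NAME_TAGS = Arrays.asList("name:en");

    /**
     * Returns the tag keys that would be tested for an object with the given ISO country code
     * (may be null) and OSM tags.
     */
    public static List<String> tagsToCheck(final String isoCountry,
            final Map<String, String> tags)
    {
        final List<String> keys = new ArrayList<>();
        // The plain name tag is only tested in countries from check_name.countries
        if (isoCountry != null && CHECK_NAME_COUNTRIES.contains(isoCountry.toUpperCase())
                && tags.containsKey("name"))
        {
            keys.add("name");
        }
        // Language specific name tags are tested in every country
        for (final String key : LANGUAGE_NAME_TAGS)
        {
            if (tags.containsKey(key))
            {
                keys.add(key);
            }
        }
        return keys;
    }

    public static void main(final String[] args)
    {
        final Map<String, String> tags = new HashMap<>();
        tags.put("name", "Rue de la Paix");
        tags.put("name:en", "Peace Street");
        System.out.println(tagsToCheck("FRA", tags)); // [name:en]
        System.out.println(tagsToCheck("GBR", tags)); // [name, name:en]
    }
}
```

Keeping the country list separate from the language tag list means a French object with an English `name:en` tag is still tested, while its French `name` tag is left alone.
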
diff --git a/docs/checks/mixedCaseNameCheck.md b/docs/checks/mixedCaseNameCheck.md
@@ -0,0 +1,79 @@
+# Mixed Case Name Check
+
+This check flags objects with name tags that improperly use mixed cases.
+
+Proper case use is defined by set standards and configurable exceptions. 
+
+The standards are as follows:
+
+* Words must start with a capital unless:
+    * The first letter is preceded by a number (ex. 20th)
+    * All the words in the name are lower case (ex. ferry dock)
+* All other letters must be lower case unless: 
+    * They follow an apostrophe (ex. O'Flin) and are not the last letter of the word (ex. Smith's not Smith'S)
+    * The entire word is uppercase, except the last letter if it follows or is followed by an apostrophe (ex. MAX'S or MAX's)
+
+The standards are broken by the following configurable exceptions (with default values):
+
+* Articles that are capitalised only if they are the first word:
+    * a, an, the
+* Prepositions that do not need to start with a capital:
+    * and, from, to, of, by, upon, on, off, at, as, into, like, near, onto, per, till, up, via, with, for, in
+* Name affixes that may be followed by a capital:
+    * Mc, Mac, Mck, Mhic, Mic
+* Mixed case units of measurement that are valid after a number:
+    * kV
+
+The above configurables allow this check to be adapted to test different languages.    
+The check should only test names in languages it is configured to handle.   
+OSM uses the `name` tag for the name in a location's primary language, and `name:[ISOcode]` for other languages.
+This check uses two configurables to control which languages are checked.
+
+The first is a list of ISO codes for countries that should have their `name` tag checked. 
+The official language(s) of the countries in this list should be among the languages that the check is configured to handle. 
+It has default values of:
+
+* AIA, ATG, AUS, BHS, BRB, BLZ, BMU, BWA, VGB, CMR, CAN, CYM, DMA, FJI, GMB, GHA, GIB, GRD, GUY, IRL, JAM, KEN, LSO, MWI, MLT, MUS, MSR, NAM, NZL, NGA, PNG, SYC, SLE, SGP, SLB, ZAF, SWZ, TZA, TON, TTO, TCA, UGA, GBR, USA, VUT, ZMB, ZWE
+
+The second is a list of `name:[ISOcode]` tags to check. These should be for the languages the check is configured to handle.
+Default values are:
+
+* name:en
+
+A final configurable is a list of characters that names are split by, to form words. Its default values are: 
+
+* SPACE, \-, /, &, @, –
+
+#### Live Examples
+
+1. Node [id:4780932622](https://www.openstreetmap.org/node/4780932622) has the name `NZ Convenience store`. It is flagged because the S in store should be capitalized. 
+
+#### Code Review
+
+In [Atlas](https://github.com/osmlab/atlas), OSM elements are represented as Edges, Points, Lines, Nodes, Areas & Relations; in our case, we are looking at
+[Edges](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Edge.java),
+[Lines](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Line.java),
+[Nodes](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Node.java),
+[Points](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Point.java), and
+[Areas](https://github.com/osmlab/atlas/blob/dev/src/main/java/org/openstreetmap/atlas/geography/atlas/items/Area.java).
+
+Our first goal is to validate the incoming Atlas object. Valid features for this check will satisfy the following conditions (see `validCheckForObject` method):
+
+* It is an Edge, Line, Node, Point, or Area
+* It is in a country where the `name` tag should be checked and it has a `name` tag, or it has one of the `name:[ISOcode]` tags.
+* It has not already been flagged
+
+Next, each name tag selected for checking is tested for proper use of case.  
+If the object's ISO code is in `checkNameCountries`, its `name` tag is checked along with the tags in `languageNameTags`; otherwise only the tags in `languageNameTags` are checked.
+
+The test for proper use of case uses multiple regular expressions to check both the entire name and each word.  
+The most complex expression checks that all letters are lower case, with the exception of the first letter, letters that follow a configured name affix, and letters that follow an apostrophe without being the last letter of the word.
+
+```java
+return Pattern.compile(String.format(
+    "(\\p{L}.*(?<!'|%1$s)(\\p{Lu}))|(\\p{L}.*(?<=')\\p{Lu}(?!.))", this.nameAffixes))
+    .matcher(word).find();
+```
+
+To learn more about the code, please look at the comments in the source code for the check.  
+[MixedCaseNameCheck](../../src/main/java/org/openstreetmap/atlas/checks/validation/tag/MixedCaseNameCheck.java)
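
To see how the expression quoted in the documentation above behaves on individual words, here is a minimal standalone sketch. It assumes the default affix list joined with `|`, as the check's constructor does; the class name `NonFirstCapitalSketch` and the method `hasImproperCapital` are hypothetical and not part of the check.

```java
import java.util.regex.Pattern;

public class NonFirstCapitalSketch
{
    // Default affixes from the configuration, joined into a regex alternation
    private static final String NAME_AFFIXES = String.join("|", "Mc", "Mac", "Mck", "Mhic",
            "Mic");

    // Same expression as in the documentation, compiled once
    private static final Pattern NON_FIRST_CAPITAL = Pattern.compile(String.format(
            "(\\p{L}.*(?<!'|%1$s)(\\p{Lu}))|(\\p{L}.*(?<=')\\p{Lu}(?!.))", NAME_AFFIXES));

    /** Returns true when a capital after the first letter is used incorrectly. */
    public static boolean hasImproperCapital(final String word)
    {
        return NON_FIRST_CAPITAL.matcher(word).find();
    }

    public static void main(final String[] args)
    {
        final String[] words = { "Store", "StoRe", "McDonald", "O'Flin", "Smith'S" };
        for (final String word : words)
        {
            System.out.println(word + " -> " + hasImproperCapital(word));
        }
        // Store -> false
        // StoRe -> true
        // McDonald -> false
        // O'Flin -> false
        // Smith'S -> true
    }
}
```

The first alternative catches a capital that follows neither an apostrophe nor an affix (`StoRe`); the second catches a capital that follows an apostrophe and ends the word (`Smith'S`).
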
diff --git a/src/main/java/org/openstreetmap/atlas/checks/validation/tag/MixedCaseNameCheck.java b/src/main/java/org/openstreetmap/atlas/checks/validation/tag/MixedCaseNameCheck.java
@@ -0,0 +1,277 @@
+package org.openstreetmap.atlas.checks.validation.tag;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+import java.util.Optional;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import org.openstreetmap.atlas.checks.base.BaseCheck;
+import org.openstreetmap.atlas.checks.flag.CheckFlag;
+import org.openstreetmap.atlas.geography.atlas.items.AtlasObject;
+import org.openstreetmap.atlas.geography.atlas.items.LocationItem;
+import org.openstreetmap.atlas.geography.atlas.items.Relation;
+import org.openstreetmap.atlas.tags.ISOCountryTag;
+import org.openstreetmap.atlas.tags.annotations.validation.Validators;
+import org.openstreetmap.atlas.tags.names.NameTag;
+import org.openstreetmap.atlas.utilities.configuration.Configuration;
+
+/**
+ * This check flags objects with name tags that improperly use mixed cases.
+ *
+ * @author bbreithaupt
+ */
+public class MixedCaseNameCheck extends BaseCheck
+{
+
+    private static final long serialVersionUID = 7109483897229499466L;
+
+    private static final List<String> FALLBACK_INSTRUCTIONS = Arrays.asList(
+            "{0} {1,number,#} has (an) invalid mixed-case value(s) for the following tag(s): {2}.");
+    private static final List<String> CHECK_NAME_COUNTRIES_DEFAULT = Arrays.asList("AIA", "ATG",
+            "AUS", "BHS", "BRB", "BLZ", "BMU", "BWA", "VGB", "CMR", "CAN", "CYM", "DMA", "FJI",
+            "GMB", "GHA", "GIB", "GRD", "GUY", "IRL", "JAM", "KEN", "LSO", "MWI", "MLT", "MUS",
+            "MSR", "NAM", "NZL", "NGA", "PNG", "SYC", "SLE", "SGP", "SLB", "ZAF", "SWZ", "TZA",
+            "TON", "TTO", "TCA", "UGA", "GBR", "USA", "VUT", "ZMB", "ZWE");
+    private static final List<String> LANGUAGE_NAME_TAGS_DEFAULT = Arrays.asList("name:en");
+    private static final List<String> LOWER_CASE_PREPOSITIONS_DEFAULT = Arrays.asList("and", "from",
+            "to", "of", "by", "upon", "on", "off", "at", "as", "into", "like", "near", "onto",
+            "per", "till", "up", "via", "with", "for", "in");
+    private static final List<String> LOWER_CASE_ARTICLES_DEFAULT = Arrays.asList("a", "an", "the");
+    private static final String SPLIT_CHARACTERS_DEFAULT = " -/&@–";
+    private static final List<String> NAME_AFFIXES_DEFAULT = Arrays.asList("Mc", "Mac", "Mck",
+            "Mhic", "Mic");
+    private static final List<String> MIXED_CASE_UNITS_DEFAULT = Arrays.asList("kV");
+
+    // A list of countries where the name tag should be checked
+    private final List<String> checkNameCountries;
+    // A list of language specific name tags to check
+    private final List<String> languageNameTags;
+    // A list of prepositions that are normally lower case in names
+    private final List<String> lowerCasePrepositions;
+    // A list of articles that are normally lower case in names, unless at the start
+    private final List<String> lowerCaseArticles;
+    // A string of characters that names are split on; each may precede a capital letter
+    private final String splitCharacters;
+    // Name affixes that can precede a capital letter, joined into a regex alternation
+    private final String nameAffixes;
+    // Known intentionally mixed case words
+    private final String mixedCaseUnits;
+
+    // Regex Patterns
+    private final Pattern upperCasePattern;
+    private final Pattern anyLetterPattern;
+    private final Pattern lowerCasePattern;
+    private final Pattern mixedCaseUnitsPattern;
+    private final Pattern mixedCaseApostrophePattern;
+    private final Pattern nonFirstCapitalPattern;
+
+    /**
+     * The default constructor that must be supplied. The Atlas Checks framework will generate the
+     * checks with this constructor, supplying a configuration that can be used to adjust any
+     * parameters that the check uses during operation.
+     *
+     * @param configuration
+     *            the JSON configuration for this check
+     */
+    public MixedCaseNameCheck(final Configuration configuration)
+    {
+        super(configuration);
+        this.checkNameCountries = (List<String>) configurationValue(configuration,
+                "check_name.countries", CHECK_NAME_COUNTRIES_DEFAULT);
+        this.languageNameTags = (List<String>) configurationValue(configuration,
+                "name.language.keys", LANGUAGE_NAME_TAGS_DEFAULT);
+        this.lowerCasePrepositions = (List<String>) configurationValue(configuration,
+                "name.prepositions", LOWER_CASE_PREPOSITIONS_DEFAULT);
+        this.lowerCaseArticles = (List<String>) configurationValue(configuration, "name.articles",
+                LOWER_CASE_ARTICLES_DEFAULT);
+        this.splitCharacters = (String) configurationValue(configuration, "regex.split",
+                SPLIT_CHARACTERS_DEFAULT);
+        this.nameAffixes = (String) configurationValue(configuration, "name.affixes",
+                NAME_AFFIXES_DEFAULT, value -> String.join("|", (List<String>) value));
+        this.mixedCaseUnits = (String) configurationValue(configuration, "name.units",
+                MIXED_CASE_UNITS_DEFAULT, value -> String.join("|", (List<String>) value));
+
+        // Compile regex
+        this.upperCasePattern = Pattern.compile("\\p{Lu}");
+        this.anyLetterPattern = Pattern.compile("\\p{L}");
+        this.lowerCasePattern = Pattern.compile("\\p{Ll}");
+        this.mixedCaseUnitsPattern = Pattern.compile(
+                String.format("[^\\p{L}]*\\p{Digit}[%1$s][^\\p{L}]*", this.mixedCaseUnits));
+        this.mixedCaseApostrophePattern = Pattern
+                .compile("([^\\p{Ll}]+'\\p{Ll})|([^\\p{Ll}]+\\p{Ll}')");
+        this.nonFirstCapitalPattern = Pattern.compile(String.format(
+                "(\\p{L}.*(?<!'|%1$s)(\\p{Lu}))|(\\p{L}.*(?<=')\\p{Lu}(?!.))", this.nameAffixes));
+    }
+
+    /**
+     * This function will validate if the supplied atlas object is valid for the check.
+     *
+     * @param object
+     *            the atlas object supplied by the Atlas-Checks framework for evaluation
+     * @return {@code true} if this object should be checked
+     */
+    @Override
+    public boolean validCheckForObject(final AtlasObject object)
+    {
+        // Valid objects are items that were OSM nodes or ways (Equivalent to Atlas nodes, points,
+        // edges, lines and areas)
+        return !(object instanceof Relation) && !this.isFlagged(object.getOsmIdentifier())
+                && ((object.getTags().containsKey(ISOCountryTag.KEY)
+                        // Must have an ISO code that is in checkNameCountries...
+                        && this.checkNameCountries
+                                .contains(object.tag(ISOCountryTag.KEY).toUpperCase())
+                        // And have a name tag
+                        && Validators.hasValuesFor(object, NameTag.class))
+                        // Or it must have a specific language name tag from languageNameTags
+                        || this.languageNameTags.stream()
+                                .anyMatch(key -> object.getOsmTags().containsKey(key)));
+    }
+
+    /**
+     * This is the actual function that will check to see whether the object needs to be flagged.
+     *
+     * @param object
+     *            the atlas object supplied by the Atlas-Checks framework for evaluation
+     * @return an optional {@link CheckFlag} object that is present when at least one checked name
+     *         tag uses mixed case improperly
+     */
+    @Override
+    protected Optional<CheckFlag> flag(final AtlasObject object)
+    {
+        final List<String> mixedCaseNameTags = new ArrayList<>();
+        final Map<String, String> osmTags = object.getOsmTags();
+
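+        // Check ISO against list of countries for testing name tag. The ISO tag may be absent
+        // when the object only qualified through a language name tag.
+        if (object.getTags().containsKey(ISOCountryTag.KEY)
+                && this.checkNameCountries.contains(object.tag(ISOCountryTag.KEY).toUpperCase())
+                && Validators.hasValuesFor(object, NameTag.class)
+                && isMixedCase(osmTags.get(NameTag.KEY)))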
+        {
+            mixedCaseNameTags.add(NameTag.KEY);
+        }
+        // Check all language name tags
+        for (final String key : this.languageNameTags)
+        {
+            if (osmTags.containsKey(key) && isMixedCase(osmTags.get(key)))
+            {
+                mixedCaseNameTags.add(key);
+            }
+        }
+
+        // If mixed case is detected, flag
+        if (!mixedCaseNameTags.isEmpty())
+        {
+            this.markAsFlagged(object.getOsmIdentifier());
+            // Instruction includes type of OSM object and list of flagged tags
+            return Optional.of(this.createFlag(object,
+                    this.getLocalizedInstruction(0, object instanceof LocationItem ? "Node" : "Way",
+                            object.getOsmIdentifier(), String.join(", ", mixedCaseNameTags))));
+        }
+        return Optional.empty();
+    }
+
+    @Override
+    protected List<String> getFallbackInstructions()
+    {
+        return FALLBACK_INSTRUCTIONS;
+    }
+
+    /**
+     * Tests each word in a string for proper use of case in a name.
+     *
+     * @param value
+     *            String to check
+     * @return true when there is improper case in any of the words
+     */
+    private boolean isMixedCase(final String value)
+    {
+        // Names that are entirely lower case are accepted, so only test names with a capital
+        if (this.upperCasePattern.matcher(value).find())
+        {
+            // Split into words based on configurable characters
+            final String[] wordArray = value.split("[\\Q" + this.splitCharacters + "\\E]");
+            boolean firstWord = true;
+            // Check each word
+            for (final String word : wordArray)
+            {
+                // Check if the word is intentionally mixed case
+                if (!isMixedCaseUnit(word))
+                {
+                    final Matcher firstLetterMatcher = this.anyLetterPattern.matcher(word);
+                    // If the word is not in the list of prepositions, and the
+                    // word is not both in the article list and not the first word: check that
+                    // the first letter is a capital
+                    if ((!this.lowerCasePrepositions.contains(word)
+                            && !(!firstWord && this.lowerCaseArticles.contains(word))
+                            // If the first letter is lower case: return true if it is not preceded
+                            // by a number
+                            && firstLetterMatcher.find()
+                            && Character.isLowerCase(firstLetterMatcher.group().charAt(0))
+                            && !(firstLetterMatcher.start() != 0 && Character
+                                    .isDigit(word.charAt(firstLetterMatcher.start() - 1))))
+                            // If the word is not all upper case: check if all the letters not
+                            // following apostrophes, unless at the end of the word, are lower case
+                            || (this.lowerCasePattern.matcher(word).find()
+                                    && !isMixedCaseApostrophe(word)
+                                    && isProperNonFirstCapital(word)))
+                    {
+                        return true;
+                    }
+                }
+                firstWord = false;
+            }
+        }
+        return false;
+    }
+
+    /**
+     * Tests a {@link String} against a configurable list of unit abbreviations.
+     *
+     * @param word
+     *            {@link String} to test
+     * @return true if {@code word} contains a mixed case unit abbreviation preceded by a number,
+     *         and it does not contain any other alphabetic characters.
+     */
+    private boolean isMixedCaseUnit(final String word)
+    {
+        // This returns true if one of the items in this.mixedCaseUnits is preceded by a number -
+        // `\p{Digit}`
+        // There may be 0 or more non-alphabetic characters preceding or following the
+        // digit+mixedCaseUnits - `[^\p{L}]*`
+        return this.mixedCaseUnitsPattern.matcher(word).find();
+    }
+
+    /**
+     * Tests a {@link String} for being all upper case, except the last letter which is adjacent to
+     * an apostrophe (ex. MAX's).
+     *
+     * @param word
+     *            {@link String} to test
+     * @return true if a lower case letter is found preceding or following an apostrophe that is the
+     *         last or second to last character in the string, and all other letters are upper case
+     */
+    private boolean isMixedCaseApostrophe(final String word)
+    {
+        // This returns true if the last 2 characters are an apostrophe and a lower case letter, and
+        // all other letters are upper case.
+        return this.mixedCaseApostrophePattern.matcher(word).matches();
+    }
+
+    /**
+     * Tests a {@link String} for incorrect capitalization, excluding the first letter.
+     *
+     * @param word
+     *            {@link String} to test
+     * @return true if a capital letter is incorrectly used
+     */
+    private boolean isProperNonFirstCapital(final String word)
+    {
+        // This checks each capital letter for incorrect usage
+        // It does not check the first letter - `(\p{L}.*`
+        // To be incorrect usage a capital letter:
+        // Must not be preceded by an apostrophe or name affix - `(?<!'|%1$s)(\p{Lu})`
+        // Must not be the last character if it follows an apostrophe - `(?<=')\p{Lu}(?!.)`
+        return this.nonFirstCapitalPattern.matcher(word).find();
+    }
+}
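
The word-splitting step and the unit exemption used by `isMixedCase` can be tried in isolation. The sketch below is a standalone approximation that assumes the default split characters and the default `kV` unit; `SplitSketch`, `splitWords` and `looksLikeUnit` are hypothetical names. It also surfaces two side effects worth knowing about: adjacent split characters produce empty tokens, and because the joined units are placed inside a character class and tested with `find()`, a digit followed by any single character of the unit string is enough to match.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitSketch
{
    // Default split characters: space, hyphen, slash, ampersand, at sign, en dash
    private static final String SPLIT_CHARACTERS = " -/&@\u2013";

    // Built the same way as mixedCaseUnitsPattern, using the default unit list
    private static final Pattern UNIT_PATTERN = Pattern.compile(String
            .format("[^\\p{L}]*\\p{Digit}[%1$s][^\\p{L}]*", String.join("|", "kV")));

    /** Splits a name into words on any of the split characters. */
    public static List<String> splitWords(final String name)
    {
        // Quoting keeps the hyphen literal inside the character class
        return Arrays.asList(name.split("[\\Q" + SPLIT_CHARACTERS + "\\E]"));
    }

    /** Returns true when the unit pattern is found anywhere in the word. */
    public static boolean looksLikeUnit(final String word)
    {
        return UNIT_PATTERN.matcher(word).find();
    }

    public static void main(final String[] args)
    {
        System.out.println(splitWords("Stratford-upon-Avon")); // [Stratford, upon, Avon]
        System.out.println(splitWords("Fish & Chips")); // [Fish, , , Chips]
        System.out.println(looksLikeUnit("20kV")); // true
        System.out.println(looksLikeUnit("5km")); // true
        System.out.println(looksLikeUnit("kV")); // false
    }
}
```

Empty tokens are harmless in the check because they contain no letters, but the `5km` result suggests that a grouped alternation with a full match would be a stricter way to express the unit exemption.
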