ICU-22979 Support inverse rule for [] span in RBNF #3326

grhoten · 2025-01-06T22:28:54Z

This feature enhancement only affects the documentation and RBNF syntax. Some changes were made to keep both the ICU4J and ICU4C implementations in sync. Some compiler warnings were also fixed.

I’d like to extend the RBNF syntax to support more complex grammar by changing the omission rule with square brackets. By default, everything between the square brackets is omitted when the remainder is 0. My proposal does not change this behavior unless a “|” (pipe symbol) is present between the square brackets. You can think of the pipe as an else statement: everything between the opening square bracket and the pipe acts as it currently does, and everything between the pipe and the closing square bracket is used instead of omitting the text.
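To make the else-like behavior concrete, here is a minimal sketch in plain Java. It is not the ICU implementation: `BracketSpanSketch` and `resolve` are hypothetical names, the helper handles only a single bracket span, and it leaves substitutions such as `>>` untouched. The rule text `twenty[->>]` is used only as an illustration of the existing syntax.

```java
// Hypothetical illustration only; this is not how ICU's NFRule is written.
final class BracketSpanSketch {
    /**
     * Resolves a single [...] span in a rule body.
     *
     * @param ruleText        rule body with at most one [...] span
     * @param remainderIsZero true when the number leaves no remainder for this rule
     */
    static String resolve(String ruleText, boolean remainderIsZero) {
        int open = ruleText.indexOf('[');
        int close = ruleText.indexOf(']', open + 1);
        if (open < 0 || close < 0) {
            return ruleText; // no bracket span to resolve
        }
        String inside = ruleText.substring(open + 1, close);
        int pipe = inside.indexOf('|');
        // Without a pipe, the whole span is the "non-zero" text and the "zero" text is empty.
        String whenNonZero = pipe < 0 ? inside : inside.substring(0, pipe);
        String whenZero = pipe < 0 ? "" : inside.substring(pipe + 1);
        return ruleText.substring(0, open)
                + (remainderIsZero ? whenZero : whenNonZero)
                + ruleText.substring(close + 1);
    }

    public static void main(String[] args) {
        System.out.println(resolve("twenty[->>]", true));       // twenty
        System.out.println(resolve("twenty[->>]", false));      // twenty->>
        System.out.println(resolve("twent[y->>|ieth]", true));  // twentieth
        System.out.println(resolve("twent[y->>|ieth]", false)); // twenty->>
    }
}
```

The first two calls show that a span without a pipe behaves as before; the last two show the text after the pipe replacing the omission.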

This behavior is important for supporting large ordinals in Slavic languages. It’s also convenient for other languages, like English.

The test case in the prototype and the ticket provide more examples of the change. Below is a simplified example of the new syntax. Right now, we have the following ordinals in English.

%%tieth:
0: tieth;
1: ty-=%spellout-ordinal=;
%spellout-ordinal:
...
20: twen>%%tieth>;
30: thir>%%tieth>;
40: for>%%tieth>;
50: fif>%%tieth>;

That could be simplified to the following rules instead.

%spellout-ordinal:
...
20: twent[y->>|ieth];
30: thirt[y->>|ieth];
40: fort[y->>|ieth];
50: fift[y->>|ieth];

The cardinal and ordinal rules will work on either side of the pipe symbol.
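As a rough check of what the simplified rules are expected to produce, the following plain Java sketch hard-codes the four stems shown above and a small table of unit ordinals. `TensOrdinalSketch` and its tables are assumptions made for illustration; a real formatter would get the unit ordinal from the `>>` substitution rather than from a lookup table.

```java
// Hypothetical illustration of the output of rules like "20: twent[y->>|ieth];".
final class TensOrdinalSketch {
    private static final String[] STEMS = {"twent", "thirt", "fort", "fift"};
    private static final String[] UNITS = {
        "", "first", "second", "third", "fourth",
        "fifth", "sixth", "seventh", "eighth", "ninth"
    };

    static String format(int n) {
        if (n < 20 || n > 59) {
            throw new IllegalArgumentException("sketch only covers 20..59");
        }
        String stem = STEMS[n / 10 - 2];
        int remainder = n % 10;
        // Text after the pipe when the remainder is 0, text before it otherwise.
        return remainder == 0 ? stem + "ieth" : stem + "y-" + UNITS[remainder];
    }

    public static void main(String[] args) {
        System.out.println(format(20)); // twentieth
        System.out.println(format(21)); // twenty-first
        System.out.println(format(59)); // fifty-ninth
    }
}
```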

Checklist

Required: Issue filed: ICU-22979
Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
Issue accepted (done by Technical Committee after discussion)
Tests included, if applicable
API docs and/or User Guide docs changed or added, if applicable

grhoten · 2025-01-06T22:55:54Z

icu4j/main/common_tests/src/test/java/com/ibm/icu/dev/test/format/RbnfTest.java

+                "20: twent[y->>|ieth];\n" +
+                "30: thirt[y->>|ieth];\n" +
+                "40: fort[y->>|ieth];\n" +
+                "50: fift[y->>|ieth];\n" +
+                "60: sixt[y->>|ieth];\n" +
+                "70: sevent[y->>|ieth];\n" +
+                "80: eight[y->>|ieth];\n" +
+                "90: ninet[y->>|ieth];\n" +


Languages that define a rule for 31 or private rules for 30 are candidates to use this rule syntax.

grhoten · 2025-01-06T23:07:14Z

icu4j/main/common_tests/src/test/java/com/ibm/icu/dev/test/format/RbnfTest.java

+                "100: <%cardinal< [$(cardinal,one{hundred}other{hundreds})$ >>|$(cardinal,one{hundredth}other{hundredths})$];\n" +
+                "1000: <%cardinal< [$(cardinal,one{thousand}other{thousands})$ >>|$(cardinal,one{thousandth}other{thousandths})$];\n" +
+                "1000000: <%cardinal< [$(cardinal,one{million}other{millions})$ >>|$(cardinal,one{millionth}other{millionths})$];\n" +
+                "1000000000: <%cardinal< [$(cardinal,one{billion}other{billions})$ >>|$(cardinal,one{billionth}other{billionths})$];\n" +
+                "1000000000000: <%cardinal< [$(cardinal,one{trillion}other{trillions})$ >>|$(cardinal,one{trillionth}other{trillionths})$];\n" +


This style of large ordinals is currently impossible to support when you have more than 2 cardinal states to consider. It's excessively tedious to repeatedly split and copy the rules. It's worse for languages that are highly inflectional with many grammatical cases.

Languages that use private rules like English's %%tieth and %%th ordinal rules are candidates to use this syntax. Those private rules only define 0 and 1.
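The large-ordinal rules in this test combine two choices: the side of the pipe (is the remainder 0?) and the plural category of the leading count. A minimal plain Java sketch of that selection for the 100 rule follows. `LargeOrdinalSketch` and `hundredWord` are hypothetical names, and the sketch assumes the plural form is chosen from the count in front of the word, using the English cardinal categories for integers (`one` for 1, `other` otherwise).

```java
// Hypothetical illustration of the word choice in:
// 100: <%cardinal< [$(cardinal,one{hundred}other{hundreds})$ >>|$(cardinal,one{hundredth}other{hundredths})$];
final class LargeOrdinalSketch {
    /**
     * @param count           the multiple of 100 (for 250 this is 2)
     * @param remainderIsZero true for 100, 200, ...; false for 101, 250, ...
     */
    static String hundredWord(long count, boolean remainderIsZero) {
        boolean one = count == 1; // English cardinal plural category "one" for integers
        if (remainderIsZero) {
            // Text after the pipe: the number ends here, so the ordinal form is used.
            return one ? "hundredth" : "hundredths";
        }
        // Text before the pipe: more text follows, so the cardinal form is used.
        return one ? "hundred" : "hundreds";
    }

    public static void main(String[] args) {
        System.out.println(hundredWord(1, true));  // hundredth
        System.out.println(hundredWord(2, false)); // hundreds
    }
}
```

With three or more plural categories, writing each combination as a separate rule multiplies quickly, which is the tedium the pipe syntax is meant to avoid.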

grhoten · 2025-01-06T23:10:53Z

icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedNumberFormat.java

+        char ch;
        while (start < descriptionLength) {
-            // seek to the first non-whitespace character...
+            // Seek to the first non-whitespace character...
+            // If the first non-whitespace character is semicolon, skip it and continue
            while (start < descriptionLength
-                   && PatternProps.isWhiteSpace(description.charAt(start)))
+                   && (PatternProps.isWhiteSpace(ch = description.charAt(start)) || ch == ';'))
            {
                ++start;
            }

-            //if the first non-whitespace character is semicolon, skip it and continue
-            if (start < descriptionLength && description.charAt(start) == ';') {
-                start += 1;
-                continue;
-            }
-


Some of this logic in stripWhitespace was out of sync between C++ and Java. Both sides have been brought in sync in this pull request.
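The merged loop from the diff can be tried in isolation with the sketch below. It is a simplified stand-in, not the ICU method: `SkipSketch` and `skipWhitespaceAndSemicolons` are hypothetical names, and `Character.isWhitespace` is used here only because `PatternProps.isWhiteSpace` needs the ICU4J library; the two do not recognize exactly the same set of characters.

```java
// Hypothetical stand-alone version of the leading-skip loop in stripWhitespace.
final class SkipSketch {
    /** Returns the index of the first character at or after start that is neither whitespace nor ';'. */
    static int skipWhitespaceAndSemicolons(String description, int start) {
        int length = description.length();
        char ch;
        // One loop handles both cases that were previously split into a loop plus an if/continue.
        while (start < length
                && (Character.isWhitespace(ch = description.charAt(start)) || ch == ';')) {
            ++start;
        }
        return start;
    }

    public static void main(String[] args) {
        String rules = "  ;; \n 20: twent[y->>|ieth];";
        int first = skipWhitespaceAndSemicolons(rules, 0);
        System.out.println(rules.substring(first)); // 20: twent[y->>|ieth];
    }
}
```

Folding the semicolon check into the loop condition means runs of empty rules and whitespace are skipped in a single pass.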

richgillam

I think this looks great. A couple things you included as comments on this PR probably ought to be comments in the actual tests or code, and I had a couple small suggestions on your documentation, but the code looks great.

richgillam · 2025-01-07T01:01:58Z

icu4j/main/common_tests/src/test/java/com/ibm/icu/dev/test/format/RbnfTest.java

+                "1000000: <%cardinal< [$(cardinal,one{million}other{millions})$ >>|$(cardinal,one{millionth}other{millionths})$];\n" +
+                "1000000000: <%cardinal< [$(cardinal,one{billion}other{billions})$ >>|$(cardinal,one{billionth}other{billionths})$];\n" +
+                "1000000000000: <%cardinal< [$(cardinal,one{trillion}other{trillions})$ >>|$(cardinal,one{trillionth}other{trillionths})$];\n" +
+                "1000000000000000: =#,##0=$(ordinal,one{st}two{nd}few{rd}other{th})$;";


I'm a little unclear on the $ syntax. It looks like you're just using normal plural rules. Do the English plural rules define few as just 3, or are you doing that somewhere else in here just for the purposes of the test?

The other languages need to use the cardinal form for large ordinals. English doesn’t define few. I just wanted to verify that the syntax is usable, and not add additional plural rule tests. I think that these new tests demonstrate that.

I could try to pick a difficult language to test, like Lithuanian. I didn’t have the patience to write such rules, and I wanted something readable, reviewable, and quick to write.

Do you really want the masculine singular nominative Lithuanian ordinals written in a test? I was hoping to defer such work and use the Number Format Tester instead of hard-coding them in tests. As an alternative, I can give you just the thousands line for documentation purposes. That’s simpler.

I think a small sample of Lithuanian thousands of the positive degree nominative masculine non-pronominal ordinal would be the following:

1000: [tūkstantis >>|tūkstantas];
2000: <%spellout-cardinal-masculine< [$(cardinal,one{tūkstantis}few{tūkstančiai}other{tūkstančių})$ >>|$(cardinal,one{tūkstantas}few{tūkstanti}other{tūkstantų})$];

Of course, this sample is unvetted, but the structure is what is needed. I got the spellings from Wiktionary. See tūkstantis, tūkstantas, and pirmas for the inflection tables. This pull request is needed to support ordinals larger than 9,999 in Lithuanian. To fully support Lithuanian ordinals, I'd likely need to copy the structure of these 2 rules fewer than 162 times for only the ordinals. The current Lithuanian ordinal rules are a little clunky.

The structure is pretty close to the English example in the test.

What you had is okay with me; I was just trying to understand it.

I have added a couple of sentences before these 2 tests to describe what's going on. Please check it out.

jira-pull-request-webhook · 2025-01-07T23:35:36Z

Notice: the branch changed across the force-push!

icu4c/source/i18n/nfrule.cpp is different
icu4c/source/i18n/unicode/rbnf.h is different
icu4c/source/test/intltest/itrbnf.cpp is different
icu4j/main/common_tests/src/test/java/com/ibm/icu/dev/test/format/RbnfTest.java is different
icu4j/main/core/src/main/java/com/ibm/icu/text/NFRule.java is different
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedNumberFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

richgillam

I took another look. I like your updates to the documentation. No notes.

grhoten requested a review from richgillam January 6, 2025 22:29

richgillam previously approved these changes Jan 7, 2025

ICU-22979 Support inverse rule for [] span in RBNF

a8e7728

grhoten dismissed richgillam’s stale review via a8e7728 January 7, 2025 23:35

grhoten force-pushed the main branch from 9a03a8b to a8e7728 Compare January 7, 2025 23:35

richgillam approved these changes Jan 8, 2025
