ICU-22984 Generate old Java monkeys #3301

Open
wants to merge 41 commits into
base: main
Changes from 4 commits
23d9a3e
ICU-22986 GL takes CM
eggrobin Dec 10, 2024
ea8748c
Regenerate line.brk (@markusicu to flip the bytes)
eggrobin Dec 10, 2024
b644118
Fix the old monkeys from Java who are still hardcoded
eggrobin Dec 10, 2024
46c29d1
meow
eggrobin Dec 10, 2024
6e992c3
Something that compiles at last
eggrobin Dec 12, 2024
e3a8136
Update the tailorings too
eggrobin Dec 12, 2024
6f6fdde
Update .brk files (to be flipped by @markusicu)
eggrobin Dec 12, 2024
ed3d6d5
Somehow it compiles
eggrobin Dec 12, 2024
71eb398
It seems to work
eggrobin Dec 12, 2024
9befb32
Dumber escaping
eggrobin Dec 12, 2024
d428c8d
🍎.xml
eggrobin Dec 12, 2024
938ef97
(?!.) ftw
eggrobin Dec 12, 2024
c165ab8
Merge branch '22986' into surili
eggrobin Dec 13, 2024
fe75c00
Greedier regices, prevent remap rules from creating surrogate pairs
eggrobin Dec 13, 2024
24ec66f
I’ll be back
eggrobin Dec 13, 2024
a954b16
Merge branch '22986' into surili
eggrobin Dec 13, 2024
29351ce
monkeys
eggrobin Dec 13, 2024
3decf2c
🐪
eggrobin Dec 13, 2024
698633d
Merge branch '22986' into surili
eggrobin Dec 13, 2024
1301f94
Port the surrogate assembly preventer
eggrobin Dec 13, 2024
eb6c9b1
sot last
eggrobin Dec 13, 2024
b102ed7
Joys of 4-space indent
eggrobin Dec 13, 2024
8f78834
Merge branch '22986' into surili
eggrobin Dec 13, 2024
71098c6
Apparently I hallucinated that bug
eggrobin Dec 13, 2024
0ac10b1
Merge branch '22986' into surili
eggrobin Dec 13, 2024
3b85a6c
d not zu
eggrobin Dec 18, 2024
6c3b350
ICU-22986 big-endian line brk
markusicu Dec 18, 2024
a7c3bbe
ICU-22986 GL takes CM
eggrobin Dec 10, 2024
2b48687
Merge commit '6c3b350aa76' into surili
eggrobin Dec 20, 2024
08b5545
Merge branch '22986' into HEAD
eggrobin Dec 20, 2024
db891f6
Merge commit '08b5545' into surili
eggrobin Dec 20, 2024
4c9e977
Merge remote-tracking branch 'la-vache/main' into HEAD
eggrobin Dec 20, 2024
79251e6
Merge commit '4c9e977' into surili
eggrobin Dec 20, 2024
65bf72a
Il-Milione
eggrobin Dec 23, 2024
cbb4376
Cats can have normal a regex replacement
eggrobin Dec 23, 2024
6c99853
Revert .project files
eggrobin Dec 23, 2024
550372a
Try something
eggrobin Jan 21, 2025
ed70b82
Trace
eggrobin Jan 21, 2025
164c327
Try another thing?
eggrobin Jan 21, 2025
3718cf9
comment
eggrobin Jan 21, 2025
6b1fcc3
After Markus’s review
eggrobin Jan 22, 2025
@@ -1433,6 +1433,12 @@ void RunMonkey(BreakIterator bi, RBBIMonkeyKind mk, String name, int seed, int
if (c < 0) { // TODO: deal with sets containing strings.
errln("c < 0");
}
+ // Do not emit surrogates on Java 8, as the behaviour of regular expressions that
+ // match surrogates differs there.
+ if (System.getProperty("java.version").startsWith("1.") &&
+ Character.isSurrogate((char)c)) {
+ continue;
+ }
// Do not assemble a supplementary character from randomly generated separate surrogates.
// (It could be a dictionary character)
if (c < 0x10000 && Character.isLowSurrogate((char)c) && testText.length() > 0 &&
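The hunk above is cut off partway through the condition that guards against assembling a supplementary character. As a rough illustration only, the following Java sketch shows one way such a guard could work. The check on the preceding high surrogate and the names `SurrogateGuard`, `wouldAssembleSupplementary`, and `appendAll` are assumptions made for this example, not code from the PR.

```java
final class SurrogateGuard {
    /**
     * Returns true if appending c would place a low surrogate directly after
     * a trailing high surrogate, which would form a supplementary character.
     * ASSUMPTION: the real test's condition is truncated in the diff; the
     * high-surrogate check below is a guess at its intent.
     */
    static boolean wouldAssembleSupplementary(CharSequence text, int c) {
        return c < 0x10000
                && Character.isLowSurrogate((char) c)
                && text.length() > 0
                && Character.isHighSurrogate(text.charAt(text.length() - 1));
    }

    /** Appends each code point in order, skipping any that would complete a surrogate pair. */
    static String appendAll(int[] codePoints) {
        StringBuilder text = new StringBuilder();
        for (int c : codePoints) {
            if (wouldAssembleSupplementary(text, c)) {
                continue; // Drop the low surrogate rather than form a pair.
            }
            text.appendCodePoint(c);
        }
        return text.toString();
    }
}
```

With this guard, appending U+D83D followed by U+DE00 keeps only the lone high surrogate, while a supplementary code point passed in whole (for example U+1F600) is appended unchanged.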
@@ -74,7 +74,9 @@ void apply(StringBuilder remapped, BreakContext[] resolved) {
.findFirst();
if (!position.isPresent()) {
throw new IllegalArgumentException(("Rule " + name() +
- " found a break at a position which does not correspond to an index in " +
+ " matched at position " + afterSearch.start() +
+ " in " + remapped +
+ " which does not correspond to an index in " +
"the original string"));
}
if (position.get().appliedRule == null &&
@@ -57,6 +57,11 @@ protected String expandUnicodeSets(String regex) {
if (regex.charAt(i) == '[' || regex.charAt(i) == '\\') {
ParsePosition pp = new ParsePosition(i);
final UnicodeSet set = new UnicodeSet(regex, pp, null);
+ // Regular expressions that match unpaired surrogates apparently behave
+ // differently in Java 8. Let’s not go there.
+ if (System.getProperty("java.version").startsWith("1.")) {
+ set.removeAll(new UnicodeSet("[\\uD800-\\uDFFF]"));
+ }
// Escape everything. We could use _generatePattern, but then we would have to
// convert \U escapes to sequences of \‌u escapes, and to escape # ourselves.
result.append('[');
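Both new checks in this PR key off `System.getProperty("java.version").startsWith("1.")`. The sketch below illustrates why that prefix test separates Java 8 and earlier from later releases: the older runtimes report versions such as `1.8.0_292`, whereas Java 9 and later report `9`, `11.0.2`, and so on. The class and method names are hypothetical, and the sample version strings are only illustrative.

```java
final class LegacyJavaCheck {
    /**
     * Returns true for version strings in the pre-Java 9 "1.x" scheme.
     * The trailing dot matters: "10.0.1" starts with "1" but not with "1.".
     */
    static boolean isLegacyVersionString(String version) {
        return version.startsWith("1.");
    }

    /** Applies the same test to the running JVM's java.version property. */
    static boolean runningOnLegacyJava() {
        return isLegacyVersionString(System.getProperty("java.version"));
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.8.0_292", "9", "10.0.1", "11.0.2", "17"}) {
            System.out.println(v + " -> legacy: " + isLegacyVersionString(v));
        }
        System.out.println("this JVM -> legacy: " + runningOnLegacyJava());
    }
}
```

Factoring the test into one helper would keep the two call sites in agreement, though the PR itself inlines the check in both places.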